# Pooled Cross Sections

* Policy evaluation
* Data are collected across time, but the individuals sampled need not be the same in each period

# Difference-in-Difference

* Example: the opening of an MTR station
* HKU station opened in December 2014
* Control group: HKUST
 

1. Analysis of the grouped average
2. Regression analysis

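$$y = \beta_1 + \beta_2 d_{hku} + \beta_3 d_{post} + \beta_4 d_{hku} \times d_{post} + e$$
* $d_{hku}=1$ for the treated group (HKU) and $0$ for the control group (HKUST); $d_{post}=1$ for observations after the station opened
* Variables indexed by $i$ and $t$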
* Advantage of the regression approach: it allows statistical inference and control for covariates
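
The two approaches can be compared on simulated data. The sketch below is illustrative only: the dummy names `treat` and `post`, the sample size, and the effect sizes are assumptions, not values from the MTR example. With a group dummy, a period dummy, and their product as the only regressors, the interaction coefficient should coincide with the difference-in-differences of the four cell averages.

```r
set.seed(1)
n <- 400                                   # observations per group-period cell (assumed)
d <- expand.grid(id = seq_len(n), treat = 0:1, post = 0:1)
# hypothetical outcome: group gap 1, common trend 2, treatment effect 3
d$y <- 10 + 1 * d$treat + 2 * d$post + 3 * d$treat * d$post + rnorm(nrow(d))

# 1. grouped averages:
#    (treated after - treated before) - (control after - control before)
m <- tapply(d$y, list(treat = d$treat, post = d$post), mean)
did_means <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

# 2. regression with an interaction term
did_reg <- coef(lm(y ~ treat * post, data = d))["treat:post"]

c(grouped_means = did_means, regression = unname(did_reg))
```

The regression additionally reports a standard error for the interaction term through `summary()`, and further covariates can be appended to the formula.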

# Panel Data


* Economists mostly work with observational data. 
* The data generation process is out of the researchers' control.
* Difficult to control heterogeneity among the individuals in cross-sectional data.
* Panel data offer a chance to control for this heterogeneity.

* Panel data track the same individuals across time $t=1,\ldots,T$.
* Assume the observations are independent across $i=1,\ldots,n$.
* Allow for dependence within each $i$ across $t=1,\ldots,T$.

# Linear Equation
$$y_{it}=\beta_{1}+x_{it}\beta_{2}+u_{it},\ i=1,\ldots,n;\ t=1,\ldots,T$$

* $u_{it}=\alpha_{i}+\epsilon_{it}$ is called the **composite error**.
* $\alpha_{i}$ is the time-invariant unobserved heterogeneity.
* $\epsilon_{it}$ varies across individuals and time periods.
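
Because $\alpha_i$ appears in every period, the composite error is correlated over time for the same individual. A minimal simulation sketch, assuming (beyond what is stated above) that $\alpha_i$ and $\epsilon_{it}$ are independent normal draws with variances chosen here for illustration:

```r
set.seed(2)
n <- 10000
sigma_alpha <- 1; sigma_eps <- 1            # assumed standard deviations
alpha <- rnorm(n, sd = sigma_alpha)         # one draw per individual, fixed over time
u1 <- alpha + rnorm(n, sd = sigma_eps)      # composite error in period 1
u2 <- alpha + rnorm(n, sd = sigma_eps)      # composite error in period 2

rho_hat <- cor(u1, u2)
rho_theory <- sigma_alpha^2 / (sigma_alpha^2 + sigma_eps^2)
c(sample = rho_hat, theory = rho_theory)
```

Under these assumptions the correlation between $u_{it}$ and $u_{is}$, $t\neq s$, is $\sigma_{\alpha}^{2}/(\sigma_{\alpha}^{2}+\sigma_{\epsilon}^{2})$, which is the dependence across $t$ that the panel framework allows.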

# Motivating Example

* Air pollution at city level

$$\mathrm{PM2.5}_{it}= \alpha_i + \beta_1 \mathrm{GdpGrowth}_{it} + \beta_2' \mathrm{OtherControls}_{it} + e_{it}$$



* $\alpha_i$ controls for time-invariant city characteristics such as geography

# Real Data Example

* A dataset from the [NBER-CES Manufacturing Industry Database](http://www.nber.org/nberces/).
* It contains annual information on 473 US industries from 1958 to 2009.

In [1]:
g0 <- read.csv("naics5809.csv")
g0[c(1:10, 50:60), 1:10]

Unnamed: 0,naics,year,emp,pay,prode,prodh,prodw,vship,matcost,vadd
1,311111,1958,18.0,81.3,12.0,25.7,49.8,1042.4,752.4,266.9
2,311111,1959,17.9,82.5,11.8,25.5,49.4,1051.0,758.9,268.7
3,311111,1960,17.7,84.8,11.7,25.4,50.0,1050.2,752.8,269.9
4,311111,1961,17.5,87.4,11.5,25.4,51.4,1119.7,803.6,287.8
5,311111,1962,17.6,90.2,11.5,25.2,52.1,1175.7,853.3,294.5
6,311111,1963,17.1,89.8,11.0,23.9,52.1,1249.1,893.6,328.7
7,311111,1964,16.6,90.8,10.6,23.5,52.2,1245.6,890.2,326.8
8,311111,1965,16.0,90.8,10.2,22.7,51.8,1283.5,928.1,324.7
9,311111,1966,16.1,96.1,10.2,22.6,53.9,1428.8,1049.9,344.8
10,311111,1967,16.7,105.0,11.0,23.9,61.3,1544.1,1101.6,410.0


# R package

`plm`: an R package for panel data models

In [2]:
library(plm)
g <- pdata.frame( g0, index = c("naics", "year") )

Loading required package: Formula


In [3]:
# the regression equation
equation <- emp~invest+cap

# Nothing prevents us from running OLS on the panel data.
g.ols <- lm(equation, data=g)
summary(g.ols)


Call:
lm(formula = equation, data = g)

Residuals:
    Min      1Q  Median      3Q     Max 
-364.61  -17.88   -9.57    6.42  416.23 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.430e+01  2.655e-01  91.509  < 2e-16 ***
invest      -5.393e-03  8.766e-04  -6.152 7.77e-10 ***
cap          4.120e-03  6.341e-05  64.971  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 37.89 on 24164 degrees of freedom
  (429 observations deleted due to missingness)
Multiple R-squared:  0.2927,	Adjusted R-squared:  0.2926 
F-statistic:  5000 on 2 and 24164 DF,  p-value: < 2.2e-16


In [4]:
# The coefficient estimates from lm() are exactly the same as the pooled OLS
# estimates from plm(). The only difference in the summary is that the latter
# shows the panel structure of the data.

g.pool <- plm(equation,data=g,model="pooling")
summary(g.pool)

Pooling Model

Call:
plm(formula = equation, data = g, model = "pooling")

Unbalanced Panel: n = 473, T = 13-52, N = 24167

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-364.6116  -17.8760   -9.5675    6.4165  416.2347 

Coefficients:
               Estimate  Std. Error t-value  Pr(>|t|)    
(Intercept)  2.4296e+01  2.6550e-01  91.509 < 2.2e-16 ***
invest      -5.3929e-03  8.7660e-04  -6.152 7.771e-10 ***
cap          4.1200e-03  6.3413e-05  64.971 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    49051000
Residual Sum of Squares: 34694000
R-Squared:      0.2927
Adj. R-Squared: 0.29264
F-statistic: 4999.87 on 2 and 24164 DF, p-value: < 2.22e-16

# Panel Data Models

* Fixed effect
* Random effect

* Estimation: OLS 

# Fixed Effect

* FE model allows $\alpha_{i}$ and $x_{it}$ to be arbitrarily
correlated. 
* Need to eliminate $\alpha_{i},i=1,\ldots,n$ to restore consistency. 


Averaging the $T$ equations for the same $i$,
$$\overline{y}_{i}=\beta_{1}+\overline{x}_{i}\beta_{2}+\bar{u}_{i}=\beta_{1}+\overline{x}_{i}\beta_{2}+\alpha_{i}+\bar{\epsilon}_{i},$$
where $\overline{y}_{i}=\frac{1}{T}\sum_{t=1}^{T}y_{it}$. 

Subtracting the average from the original equation removes both $\beta_{1}$ and $\alpha_{i}$:
$$\tilde{y}_{it}=\tilde{x}_{it}\beta_{2}+\tilde{\epsilon}_{it}$$
where $\tilde{y}_{it}=y_{it}-\overline{y}_{i}$. 

Run OLS with the demeaned data, and obtain the within estimator
$$\widehat{\beta}_{2}^{FE}=\left(\tilde{X}'\tilde{X}\right)^{-1}\tilde{X}'\tilde{y},$$
where $\tilde{y}=\left(\tilde{y}_{it}\right)_{i,t}$ stacks all the $nT$
demeaned observations into a vector, and $\tilde{X}$ is the similarly defined
$nT\times K$ matrix, where $K$ is the dimension of $\beta_{2}$.
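
The within transformation can be carried out by hand. The following is a sketch on simulated data, not the NBER-CES panel: the sample sizes, the true coefficient of 2, and the way $x_{it}$ is made to depend on $\alpha_i$ are all assumptions chosen for illustration. It compares the within estimator with a regression that includes one dummy per individual, and with pooled OLS that ignores $\alpha_i$.

```r
set.seed(3)
n <- 500; TT <- 4                           # TT plays the role of T in the text
id <- rep(seq_len(n), each = TT)
alpha <- rnorm(n)[id]                       # individual effect, repeated over t
x <- alpha + rnorm(n * TT)                  # regressor correlated with alpha_i
y <- 1 + 2 * x + alpha + rnorm(n * TT)      # true beta_2 = 2 (assumed)

# within transformation: subtract each individual's time average
x_tilde <- x - ave(x, id)
y_tilde <- y - ave(y, id)
beta_fe <- sum(x_tilde * y_tilde) / sum(x_tilde^2)

# regression with a dummy for every individual, for comparison
beta_lsdv <- coef(lm(y ~ x + factor(id)))["x"]
# pooled OLS leaves alpha_i in the error term
beta_pool <- coef(lm(y ~ x))["x"]

c(within = beta_fe, dummies = unname(beta_lsdv), pooled = unname(beta_pool))
```

The first two numbers agree up to rounding error, while the pooled estimate is pulled away from 2 because $\alpha_i$ is correlated with $x_{it}$ in this design.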

# Assumptions


**Assumption FE.1**
$E\left[\epsilon_{it}|\alpha_{i},\mathbf{x}_{i}\right]=0$ where
$\mathbf{x}_{i}=\left(x_{i1},\ldots,x_{iT}\right)$. (*strict exogeneity*)
* The error $\epsilon_{it}$ is mean
independent of the past, present and future explanatory variables.

# Consistency

* Asymptotic framework: $n\to\infty$ while $T$ stays fixed. 
* appropriate for panel datasets with many individuals but only a few time periods.

**Proposition** If FE.1 is satisfied, then $\widehat{\beta}_{2}^{FE}$ is consistent.
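
A small Monte Carlo sketch can illustrate the large-$n$, fixed-$T$ framework. The helper `fe_estimate()` and its data generating process are hypothetical, written only for this illustration: $T=3$ is held fixed while $n$ grows, and the true coefficient is assumed to be 2.

```r
set.seed(4)
# hypothetical helper: simulate one panel and return the within estimate
fe_estimate <- function(n, TT = 3, beta2 = 2) {
  id <- rep(seq_len(n), each = TT)
  alpha <- rnorm(n)[id]
  x <- alpha + rnorm(n * TT)                # x_it correlated with alpha_i
  y <- beta2 * x + alpha + rnorm(n * TT)    # strict exogeneity holds by construction
  x_tilde <- x - ave(x, id)
  y_tilde <- y - ave(y, id)
  sum(x_tilde * y_tilde) / sum(x_tilde^2)
}

n_grid <- c(50, 500, 5000, 20000)
est <- sapply(n_grid, fe_estimate)
data.frame(n = n_grid, beta_fe = est, abs_error = abs(est - 2))
```

The estimation error should tend to shrink as $n$ increases even though each individual is observed for only three periods.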

# Asymptotic Normality

**Assumption FE.2**
$\mathrm{var}\left(\epsilon_{i}|\alpha_{i},\mathbf{x}_{i}\right)=\sigma_{\epsilon}^{2}I_{T}$.

* Under FE.1 and FE.2,
$\widehat{\sigma}_{\epsilon}^{2}=\frac{1}{n\left(T-1\right)}\sum_{i=1}^{n}\sum_{t=1}^{T}\widehat{\tilde{\epsilon}}_{it}^{2}$
is a consistent estimator of $\sigma_{\epsilon}^{2}$.



If FE.1 and FE.2 are satisfied, then, with $\beta_{2}^{0}$ denoting the true parameter value,
$$\left(\widehat{\sigma}_{\epsilon}^{2}\left(\tilde{X}'\tilde{X}\right)^{-1}\right)^{-1/2}\left(\widehat{\beta}_{2}^{FE}-\beta_{2}^{0}\right)\stackrel{d}{\to} N\left(0,I_{K}\right).$$
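
The variance estimator and the standardized statistic can be computed directly. This sketch uses simulated data with $K=1$ and $\sigma_{\epsilon}^{2}=1$, both assumptions made here. It also shows why the divisor $n(T-1)$ matters: a plain `lm()` on the demeaned data divides the residual sum of squares by $nT-K$ and therefore reports a standard error that is too small.

```r
set.seed(5)
n <- 300; TT <- 4; K <- 1
id <- rep(seq_len(n), each = TT)
alpha <- rnorm(n)[id]
x <- alpha + rnorm(n * TT)
y <- 2 * x + alpha + rnorm(n * TT)          # sigma_eps^2 = 1, beta_2 = 2 (assumed)

x_tilde <- x - ave(x, id)
y_tilde <- y - ave(y, id)
beta_fe <- sum(x_tilde * y_tilde) / sum(x_tilde^2)
res <- y_tilde - x_tilde * beta_fe          # residuals of the demeaned regression

sigma2_hat <- sum(res^2) / (n * (TT - 1))   # divisor n(T-1), as in the text
se_fe <- sqrt(sigma2_hat / sum(x_tilde^2))
z <- (beta_fe - 2) / se_fe                  # approximately N(0, 1) under FE.1 and FE.2

# lm() on demeaned data is unaware that n group means were estimated
se_naive <- summary(lm(y_tilde ~ x_tilde - 1))$coefficients[1, "Std. Error"]
c(sigma2_hat = sigma2_hat, se_fe = se_fe, se_naive = se_naive, z = z)
```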

# Limitation

* FE eliminates all time-invariant explanatory variables, including the intercept.
* From FE we cannot obtain the coefficient estimates of these time-invariant variables.
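
A toy panel makes the limitation concrete. The numbers below are made up for illustration: `z` is a time-invariant characteristic and `x` varies over time.

```r
id <- rep(1:3, each = 4)                    # 3 individuals, 4 periods (toy panel)
z  <- c(5, 8, 2)[id]                        # time-invariant characteristic
x  <- c(1, 2, 4, 3,  0, 1, 1, 2,  7, 5, 6, 6)

z_tilde <- z - ave(z, id)                   # within transformation
x_tilde <- x - ave(x, id)

z_tilde                                     # all zeros: nothing left to identify a coefficient
x_tilde                                     # still varies within each individual
```

After demeaning, `z_tilde` is a column of zeros, so it is perfectly collinear with any other regressor and its coefficient cannot be estimated.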

**Data Example** In reality we do not need to compute the estimator or the variance by hand. `R` handles them automatically.

In [5]:
g.fe <- plm(equation, data=g, model="within") 
# statisticians call the FE estimator 'within' estimator as it carries out
# a within-group transformation
summary(g.fe)

Oneway (individual) effect Within Model

Call:
plm(formula = equation, data = g, model = "within")

Unbalanced Panel: n = 473, T = 13-52, N = 24167

Residuals:
       Min.     1st Qu.      Median     3rd Qu.        Max. 
-212.735344   -3.948681   -0.020028    3.965494  233.204238 

Coefficients:
          Estimate  Std. Error t-value  Pr(>|t|)    
invest -3.8758e-03  5.5301e-04 -7.0086 2.471e-12 ***
cap     2.0277e-03  6.9677e-05 29.1009 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    8420700
Residual Sum of Squares: 7953600
R-Squared:      0.055468
Adj. R-Squared: 0.036571
F-statistic: 695.667 on 2 and 23692 DF, p-value: < 2.22e-16

# Publication Example

* Lin, Justin Yifu (1992): [Rural Reforms and Agricultural
Growth in China](http://www.jstor.org/stable/2117601), *The American
Economic Review*, Vol.82, No.1, pp.34-51.



* The 40th anniversary of China's reform and opening-up
* The Chinese agricultural sector witnessed dramatic growth
during 1978-1984. 
* Can the growth be attributed to the household-responsibility system (HRS) reform? 

* Lin (1992): panel data of 28 mainland provinces from 1970 to 1987. 
* He estimates the following FE model by OLS.

$$
\begin{aligned}
\ln Y_{it} & = \alpha_1 + \alpha_2 \ln(\mathrm{Land}_{it}) + \alpha_3 \ln (\mathrm{Labor}_{it}) \\
    & + \alpha_4 \ln (\mathrm{Capital}_{it}) + \alpha_5 \ln (\mathrm{Fert}_{it} ) + \alpha_6 \mathrm{HRS}_{it} \\
    & + \alpha_7 \mathrm{MP}_{t-1} + \alpha_8 \mathrm{GP}_t + \alpha_9 \mathrm{NGCA}_{it} + \alpha_{10} \mathrm{MCI}_{it} + \alpha_{11} T_t + \sum_{j=12}^{39} \alpha_{j} D_j + \epsilon_{it}.
\end{aligned}$$

* The empirical findings are robust
* The importance of HRS is supported across specifications

# Random Effect


* RE allows time-invariant explanatory variables. 
* Knife-edge special case: $\mathrm{cov}\left(\alpha_{i},x_{it}\right)=0$. 
* FE remains consistent when $\alpha_{i}$ and $x_{it}$ are uncorrelated.
* Pooled OLS is also consistent.
* But neither is efficient.

# Assumptions

**Assumption RE.1**
$E\left[\epsilon_{it}|\alpha_{i},\mathbf{x}_{i}\right]=0$ and
$E\left[\alpha_{i}|\mathbf{x}_{i}\right]=0$.

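RE.1 implies $\mathrm{cov}\left(\alpha_{i},x_{it}\right)=0$.
If, in addition, $\alpha_{i}$ and $\epsilon_{it}$ are homoskedastic with variances $\sigma_{\alpha}^{2}$ and $\sigma_{\epsilon}^{2}$, and $\epsilon_{it}$ is serially uncorrelated, then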
$$S=\mathrm{var}\left(u_{i}|\mathbf{x}_{i}\right)=\sigma_{\alpha}^{2}\mathbf{1}_{T}\mathbf{1}_{T}'+\sigma_{\epsilon}^{2}I_{T},\ \mbox{for all }i=1,\ldots,n.$$

* The covariance matrix is not a scalar multiple of the
identity matrix.
* OLS is inefficient.

# Estimation

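* Rewrite the model as $$y_{it}=w_{it}\boldsymbol{\beta}+u_{it},$$ where $w_{it}=\left(1,x_{it}\right)$ and $\boldsymbol{\beta}=\left(\beta_{1},\beta_{2}'\right)'$.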

* Had we known $S$, the GLS estimator would be
$$\widehat{\boldsymbol{\beta}}^{RE}=\left(\sum_{i=1}^{n}\mathbf{w}_{i}'S^{-1}\mathbf{w}_{i}\right)^{-1}\sum_{i=1}^{n}\mathbf{w}_{i}'S^{-1}\mathbf{y}_{i}=\left(W'\mathbf{S}^{-1}W\right)^{-1}W'\mathbf{S}^{-1}y$$
* In practice, software computes FGLS
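
The GLS formula can be evaluated directly when $S$ is known. The sketch below is an illustration under stated assumptions: a balanced simulated panel, variance components `sig2_a` and `sig2_e` treated as known, and a scalar regressor drawn independently of $\alpha_i$. It also checks a standard equivalence for balanced panels: GLS equals OLS after subtracting a fraction $\theta=1-\sqrt{\sigma_{\epsilon}^{2}/(\sigma_{\epsilon}^{2}+T\sigma_{\alpha}^{2})}$ of each individual's time average.

```r
set.seed(7)
n <- 200; TT <- 5
sig2_a <- 2; sig2_e <- 1                    # variance components, assumed known here
id <- rep(seq_len(n), each = TT)
x <- rnorm(n * TT)                          # independent of alpha_i, in line with RE.1
y <- 1 + 2 * x + rnorm(n, sd = sqrt(sig2_a))[id] + rnorm(n * TT, sd = sqrt(sig2_e))

# GLS from the formula: accumulate w_i' S^{-1} w_i and w_i' S^{-1} y_i over i
S_inv <- solve(sig2_a * matrix(1, TT, TT) + sig2_e * diag(TT))
W <- cbind(1, x)
A <- matrix(0, 2, 2); b <- matrix(0, 2, 1)
for (i in seq_len(n)) {
  rows <- which(id == i)
  Wi <- W[rows, , drop = FALSE]
  A <- A + t(Wi) %*% S_inv %*% Wi
  b <- b + t(Wi) %*% S_inv %*% y[rows]
}
beta_gls <- drop(solve(A, b))

# equivalent OLS on quasi-demeaned data (the constant is transformed as well)
theta <- 1 - sqrt(sig2_e / (sig2_e + TT * sig2_a))
y_star <- y - theta * ave(y, id)
x_star <- x - theta * ave(x, id)
const_star <- rep(1 - theta, n * TT)
beta_qd <- coef(lm(y_star ~ 0 + const_star + x_star))

rbind(gls = unname(beta_gls), quasi_demeaned = unname(beta_qd))
```

FGLS replaces the unknown variance components with estimates. The `theta` values printed in a `plm` random-effects summary appear to be this same weight, evaluated with the estimated variances and each individual's number of periods.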

In [6]:
g.re <- plm(equation, data=g, model="random")
summary(g.re)

Oneway (individual) effect Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = equation, data = g, model = "random")

Unbalanced Panel: n = 473, T = 13-52, N = 24167

Effects:
                  var std.dev share
idiosyncratic  335.71   18.32  0.24
individual    1061.41   32.58  0.76
theta:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.8459  0.9222  0.9222  0.9218  0.9222  0.9222 

Residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-191.390   -4.824   -1.260   -0.007    3.535  242.581 

Coefficients:
               Estimate  Std. Error t-value  Pr(>|t|)    
(Intercept)  2.9718e+01  1.5116e+00 19.6597 < 2.2e-16 ***
invest      -4.2847e-03  5.5075e-04 -7.7798 7.553e-15 ***
cap          2.1374e-03  6.8567e-05 31.1729 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    8671600
Residual Sum of Squares: 8141400
R-Squared:      0.061142
Adj. R-Squared: 0.061064
F-statistic: 786.825 on 2 and 2416

In [7]:
# Which model is preferred? 
# The Hausman test favors the fixed-effect model.
phtest(g.re, g.fe)


	Hausman Test

data:  equation
chisq = 65.835, df = 2, p-value = 5.059e-15
alternative hypothesis: one model is inconsistent


# Dynamic Panel Model

* Example: A stock price is influenced by the fundamental indicators in the quarterly financial report, but also by yesterday's price.

$$y_{it}=\beta_{1}+\beta_{2}y_{it-1}+\beta_{3}x_{it}+\alpha_{i}+\epsilon_{it}$$



First-difference (FD): for periods $t$ and
$t-1$,  
$$
\left(y_{it}-y_{it-1}\right)=\beta_{2}\left(y_{it-1}-y_{it-2}\right)+\beta_{3}\left(x_{it}-x_{it-1}\right)+\left(\epsilon_{it}-\epsilon_{it-1}\right).
$$

For simplicity, assume  $E\left[\left(x_{it}-x_{it-1}\right)\left(\epsilon_{it}-\epsilon_{it-1}\right)\right]=0$.
The lagged difference of the dependent variable is nevertheless correlated with the differenced error:
$$
E\left[\left(y_{it-1}-y_{it-2}\right)\left(\epsilon_{it}-\epsilon_{it-1}\right)\right]
=-E\left[y_{it-1}\epsilon_{it-1}\right]=-E\left[\epsilon_{it-1}^{2}\right]\neq0.
$$ 
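
The sign and size of this correlation can be checked by simulation. The sketch below makes several simplifying assumptions that are not in the text: no $x_{it}$ (that is, $\beta_{3}=0$), $\beta_{2}=0.5$, $\sigma_{\epsilon}^{2}=1$, and a long run-in so that the process is close to stationary. It then runs OLS on the first-differenced equation for a single period.

```r
set.seed(8)
n <- 10000; n_periods <- 60
beta2 <- 0.5                                # assumed autoregressive coefficient
alpha <- rnorm(n)
eps <- matrix(rnorm(n * n_periods), n, n_periods)   # sigma_eps^2 = 1
y <- matrix(0, n, n_periods)
for (s in 2:n_periods) {
  y[, s] <- beta2 * y[, s - 1] + alpha + eps[, s]
}

s0 <- n_periods                             # use the last period as t
dy     <- y[, s0] - y[, s0 - 1]             # y_{it} - y_{it-1}
dy_lag <- y[, s0 - 1] - y[, s0 - 2]         # y_{it-1} - y_{it-2}
d_eps  <- eps[, s0] - eps[, s0 - 1]         # eps_{it} - eps_{it-1}

cov_hat <- mean(dy_lag * d_eps)             # compare with -sigma_eps^2 = -1
beta_fd <- coef(lm(dy ~ dy_lag))["dy_lag"]  # OLS on the differenced equation
c(cov_hat = cov_hat, beta_fd = unname(beta_fd), beta_true = beta2)
```

In this design the sample moment is close to $-1$, and the first-difference OLS estimate lies well below the true value of 0.5, which illustrates that differencing removes $\alpha_{i}$ but leaves the lagged dependent variable endogenous.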
