# Chapter 14. Advanced Panel Data Methods 

In this chapter, we focus on two methods for estimating unobserved effects panel data models that are at least as common as first differencing. Although these methods are somewhat harder to describe and implement, several econometrics packages support them.

In Section 14-1, we discuss the fixed effects estimator, which, like first differencing, uses a transformation to remove the unobserved effect $a_i$ prior to estimation. Any time-constant explanatory variables are removed along with $a_i$.

The random effects estimator in Section 14-2 is attractive when we think the unobserved effect is uncorrelated with all the explanatory variables. If we have good controls in our equation, we might believe that any leftover neglected heterogeneity only induces serial correlation in the composite error term, but it does not cause correlation between the composite errors and the explanatory variables. Estimation of random effects models by generalized least squares is fairly easy and is routinely done by many econometrics packages.

Section 14-3 introduces the relatively new correlated random effects approach, which provides a synthesis of fixed effects and random effects methods, and has been shown to be practically very useful. In Section 14-4, we show how panel data methods can be applied to other data structures, including matched pairs and cluster samples.

## 14-1 Fixed Effects Estimation

First differencing is just one of the many ways to eliminate the fixed effect, $a_i$. An alternative method, which works better under certain assumptions, is called the fixed effects transformation. To see what this method involves, consider a model with a single explanatory variable: for each $i$,

\begin{equation}
y_{it}=\beta_1x_{it}+a_i+u_{it},    t=1,2,\ldots,T \tag{14.1}
\end{equation}

Now for each i, average this equation over time. We get

\begin{equation}
\bar y_{i}=\beta_1\bar x_{i}+a_i+\bar u_{i} \tag{14.2}
\end{equation}

where $\bar y_i=T^{-1}\sum_{t=1}^{T} y_{it}$, and so on. Because $a_i$ is fixed over time, it appears in both (14.1) and (14.2). If we subtract (14.2) from (14.1) for each t, we wind up with

\begin{equation}
y_{it}-\bar y_{i}=\beta_1(x_{it}-\bar x_i)+u_{it}-\bar u_i , t=1,2,\ldots,T
\end{equation}

or

\begin{equation}
\ddot{y}_{it}=\beta_1 \ddot x_{it}+\ddot u_{it}, t=1,2,\ldots,T \tag{14.3}
\end{equation}

where $\ddot{y}_{it}= y_{it}-\bar y_{i}$ is the time-demeaned data on y, and similarly for $\ddot{x}_{it}$ and $\ddot{u}_{it}$. The fixed effects transformation is also called the within transformation. The important thing about equation (14.3) is that the unobserved effect, $a_i$, has disappeared. This suggests we should estimate (14.3) by pooled OLS. A pooled OLS estimator that is based on the time-demeaned variables is called the fixed effects estimator or the within estimator. The latter name comes from the fact that OLS on (14.3) uses the time variation in y and x within each cross-sectional observation.
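The within transformation can be sketched in a few lines of base R. The example below is only an illustration on simulated data: the sample sizes, the true slope of 2, and all variable names are assumptions made for the sketch, not part of the WAGEPAN application. It time-demeans y and x with `ave()`, runs pooled OLS on the demeaned data as in (14.3), and checks the result against a regression that includes one dummy variable per cross-sectional unit, which is known to give the same slope.

```r
# Simulated balanced panel (all values hypothetical): N units, T_periods periods
set.seed(1)
N <- 50; T_periods <- 4
id <- rep(1:N, each = T_periods)
a  <- rep(rnorm(N), each = T_periods)      # unobserved effect a_i
x  <- 0.5 * a + rnorm(N * T_periods)       # x is correlated with a_i
y  <- 2 * x + a + rnorm(N * T_periods)     # true beta_1 = 2

# Time-demean y and x within each unit; ave() returns the unit mean
y_dd <- y - ave(y, id)
x_dd <- x - ave(x, id)

# Pooled OLS on the demeaned data, no intercept, as in (14.3)
b_within <- coef(lm(y_dd ~ 0 + x_dd))[["x_dd"]]

# Same slope from a regression with one dummy per unit
b_lsdv <- coef(lm(y ~ x + factor(id)))[["x"]]

# Pooled OLS on the raw data ignores a_i
b_pooled <- coef(lm(y ~ x))[["x"]]

c(within = b_within, lsdv = b_lsdv, pooled = b_pooled)
```

Note that the standard errors printed by `lm()` for the demeaned regression use the wrong degrees of freedom (they do not account for the N estimated means), which is one reason to prefer a dedicated routine such as `plm()` in applied work.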

The between estimator is obtained as the OLS estimator on the cross-sectional equation (14.2) (where we include an intercept, $\beta_0$): we use the time averages for both y and x and then run a cross-sectional regression. We will not study the between estimator in detail because it is biased when $a_i$ is correlated with $\bar x_i$ (see Problem 2). If we think $a_i$ is uncorrelated with $x_{it}$, it is better to use the random effects estimator, which we cover in Section 14-2. The between estimator ignores important information on how the variables change over time.

Under a strict exogeneity assumption on the explanatory variables, the fixed effects estimator is unbiased: roughly, the idiosyncratic error $u_{it}$ should be uncorrelated with each explanatory variable across all time periods. (See the chapter appendix for precise statements of the assumptions.) The fixed effects estimator allows for arbitrary correlation between $a_i$ and the explanatory variables in any time period, just as with first differencing. Because of this, any explanatory variable that is constant over time for all i gets swept away by the fixed effects transformation. Therefore we cannot include variables such as gender or a city's distance from a river.

### Wooldridge Example 14.2

The data in WAGEPAN are from Vella and Verbeek (1998). Each of the 545 men in the sample worked in every year from 1980 through 1987. Some variables in the data set change over time: experience, marital status, and union status are the three important ones. Other variables do not change: race and education are the key examples. If we use fixed effects (or first differencing), we cannot include race, education, or experience in the equation. However, we can include interactions of educ with year dummies for 1981 through 1987 to test whether the return to education was constant over this time period. We use log(wage) as the dependent variable, dummy variables for marital and union status, a full set of year dummies, and the interaction terms d81\*educ, d82\*educ, ..., d87\*educ.

In [3]:
install.packages('plm')

In [4]:
library(foreign);library(plm)
wagepan<-read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/wagepan.dta?raw=true")
# Generate pdata.frame:
wagepan.p <- pdata.frame(wagepan, index=c("nr","year") )

pdim(wagepan.p)

# Estimate FE model
summary( plm(lwage~married+union+factor(year)*educ, 
                                        data=wagepan.p, model="within") )



Loading required package: Formula


Balanced Panel: n = 545, T = 8, N = 4360

Oneway (individual) effect Within Model

Call:
plm(formula = lwage ~ married + union + factor(year) * educ, 
    data = wagepan.p, model = "within")

Balanced Panel: n = 545, T = 8, N = 4360

Residuals:
   Min. 1st Qu.  Median 3rd Qu.    Max. 
-4.1500 -0.1260  0.0109  0.1610  1.4800 

Coefficients:
                        Estimate Std. Error t-value  Pr(>|t|)    
married                0.0548205  0.0184126  2.9773  0.002926 ** 
union                  0.0829785  0.0194461  4.2671 2.029e-05 ***
factor(year)1981      -0.0224159  0.1458885 -0.1537  0.877893    
factor(year)1982      -0.0057613  0.1458558 -0.0395  0.968494    
factor(year)1983       0.0104296  0.1458579  0.0715  0.942999    
factor(year)1984       0.0843743  0.1458518  0.5785  0.562966    
factor(year)1985       0.0497251  0.1458602  0.3409  0.733191    
factor(year)1986       0.0656064  0.1458917  0.4497  0.652958    
factor(year)1987       0.0904447  0.1458505  0.6201  0.535217    
factor(year)1981:educ  0.0115854  0.0122

The estimates on these interaction terms are all positive, and they generally get larger for more recent years. The largest coefficient of .030 is on d87*educ, with t=2.48. In other words, the return to education is estimated to be about 3 percentage points larger in 1987 than in the base year, 1980. (We do not have an estimate of the return to education in the base year for the reasons given earlier.) The other significant interaction term is d86*educ (coefficient=.027, t=2.23). The estimates on the earlier years are smaller and insignificant at the 5% level against a two-sided alternative. If we do a joint F test for significance of all seven interaction terms, we get p-value=.28: this gives an example where a set of variables is jointly insignificant even though some variables are individually significant. Generally, the results are consistent with an increase in the return to education over this period.

### 14-1b Fixed Effects or First Differencing?

So far, setting aside pooled OLS, we have seen two competing methods for estimating unobserved effects models. One involves differencing the data, and the other involves time-demeaning. How do we know which one to use?

We can eliminate one case immediately: when T=2, the FE and FD estimates, as well as all test statistics, are identical, and so it does not matter which we use. Of course, the equivalence between the FE and FD estimates requires that we estimate the same model in each case. In particular, as we discussed in Chapter 13, it is natural to include an intercept in the FD equation; this intercept is actually the intercept for the second time period in the original model written for the two time periods. Therefore, FE estimation must include a dummy variable for the second time period in order to be identical to the FD estimates that include an intercept.
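The T=2 equivalence is easy to check numerically. The following base R sketch uses simulated data (the sample size, coefficients, and variable names are all assumptions made for illustration): it estimates the FD equation with an intercept and the FE equation with a second-period dummy, and the slope on x and the period effect coincide.

```r
# Simulated two-period panel (all values hypothetical)
set.seed(2)
N  <- 100
a  <- rnorm(N)                              # unobserved effect a_i
x1 <- 0.5 * a + rnorm(N)
x2 <- 0.5 * a + rnorm(N)
y1 <- 1.0 + 2 * x1 + a + rnorm(N)
y2 <- 1.3 + 2 * x2 + a + rnorm(N)           # intercept shifts by 0.3 in period 2

# First differencing: a cross-sectional regression with an intercept
fd <- lm(I(y2 - y1) ~ I(x2 - x1))

# Fixed effects: stack the periods, demean within unit, keep a period-2 dummy
id <- rep(1:N, times = 2)
y  <- c(y1, y2); x <- c(x1, x2); d2 <- rep(c(0, 1), each = N)
dm <- function(v) v - ave(v, id)            # time-demeaning helper
fe <- lm(dm(y) ~ 0 + dm(x) + dm(d2))

b_fd <- unname(coef(fd)[2]); b_fe <- unname(coef(fe)[1])   # slopes on x
d_fd <- unname(coef(fd)[1]); d_fe <- unname(coef(fe)[2])   # period-2 effect
c(b_fd = b_fd, b_fe = b_fe, d_fd = d_fd, d_fe = d_fe)
```

Dropping `dm(d2)` from the FE regression while keeping the intercept in the FD regression would break the equivalence, because the two regressions would then estimate different models.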

With T=2, FD has the advantage of being straightforward to implement in any econometrics or statistical package that supports basic data manipulation, and it is easy to compute heteroskedasticity-robust statistics after FD estimation (because when T=2, FD estimation is just a cross-sectional regression).

When $T>2$, the FE and FD estimators are not the same. Since both are unbiased under Assumptions FE.1 through FE.4, we cannot use unbiasedness as a criterion. Further, both are consistent (with T fixed as N becomes large) under FE.1 through FE.4. For large N and small T, the choice between FE and FD hinges on the relative efficiency of the estimators, and this is determined by the serial correlation in the idiosyncratic errors, $u_{it}$. (We will assume homoskedasticity of the $u_{it}$, since efficiency comparisons require homoskedastic errors.)

When the $u_{it}$ are serially uncorrelated, fixed effects is more efficient than first differencing (and the standard errors reported from fixed effects are valid). In many applications, however, we can expect the unobserved factors that change over time to be serially correlated. If $u_{it}$ follows a random walk (which means that there is very substantial, positive serial correlation), then the difference $\Delta u_{it}$ is serially uncorrelated, and first differencing is better.
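The two polar cases can be made concrete with a short calculation, sketched here under the homoskedasticity assumption above and writing $\sigma^2_u=Var(u_{it})$; the innovation $e_{it}$ is notation introduced only for this illustration. If the $u_{it}$ are serially uncorrelated, differencing creates negative serial correlation in adjacent differences:

\begin{equation}
Cov(\Delta u_{it},\Delta u_{i,t+1})=Cov(u_{it}-u_{i,t-1},\,u_{i,t+1}-u_{it})=-\sigma^2_u, \quad Var(\Delta u_{it})=2\sigma^2_u, \quad Corr(\Delta u_{it},\Delta u_{i,t+1})=-\tfrac{1}{2}
\end{equation}

If instead $u_{it}$ follows a random walk with serially uncorrelated innovations $e_{it}$, differencing removes the serial correlation entirely:

\begin{equation}
u_{it}=u_{i,t-1}+e_{it} \quad \Rightarrow \quad \Delta u_{it}=e_{it}
\end{equation}

In the first case the usual FD standard errors are invalid while the FE ones are valid, and in the second case the reverse holds.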

When T is large, and especially when N is not very large (for example, N=20 and T=30), we must exercise caution in using the fixed effects estimator. Although exact distributional results hold for any N and T under the classical fixed effects assumptions, inference can be very sensitive to violations of the assumptions when N is small and T is large.

First differencing has the advantage of turning an integrated time series process into a weakly dependent process. Therefore, if we apply first differencing, we can appeal to the central limit theorem even in cases where T is larger than N . Normality in the idiosyncratic errors is not needed, and heteroskedasticity and serial correlation can be dealt with as we touched on in Chapter 13. Inference with the fixed effects estimator is potentially more sensitive to nonnormality, heteroskedasticity, and serial correlation in the idiosyncratic errors.

Generally, it is difficult to choose between FE and FD when they give substantively different results. It makes sense to report both sets of results and to try to determine why they differ.

## 14-2 Random Effects Models

We begin with the same unobserved effects model as before,

\begin{equation}
y_{it}=\beta_0+\beta_1x_{it1}+\ldots+\beta_kx_{itk}+a_i+u_{it},   \tag{14.7}
\end{equation}

where we explicitly include an intercept so that we can make the assumption that the unobserved effect, $a_i$, has zero mean (without loss of generality). We would usually allow for time dummies among the explanatory variables as well. In using fixed effects or first differencing, the goal is to eliminate $a_i$ because it is thought to be correlated with one or more of the $x_{itj}$. But suppose we think $a_i$ is uncorrelated with each explanatory variable in all time periods. Then, using a transformation to eliminate $a_i$ results in inefficient estimators.

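Equation (14.7) becomes a random effects model when we assume that the unobserved effect $a_i$ is uncorrelated with each explanatory variable:

\begin{equation}
Cov(x_{itj},a_i)=0, \quad t=1,2,\ldots,T; \; j=1,2,\ldots,k \tag{14.8}
\end{equation}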

In fact, the ideal random effects assumptions include all of the fixed effects assumptions plus the additional requirement that $a_i$ is independent of all explanatory variables in all time periods. If we think the unobserved effect $a_i$ is correlated with any explanatory variables, we should use first differencing or fixed effects.

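Write the composite error as $v_{it}=a_i+u_{it}$. Because $a_i$ is in the composite error in every time period, the $v_{it}$ are serially correlated across time, and GLS is used to account for this. Letting $\sigma^2_a=Var(a_i)$ and $\sigma^2_u=Var(u_{it})$, define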

\begin{equation}
\theta=1-[\sigma^2_u/(\sigma^2_u+T\sigma^2_a)]^{1/2}  \tag{14.10}
\end{equation}

which is between zero and one. Then, the transformed equation turns out to be

\begin{equation}
y_{it}-\theta \bar y_i = \beta_0(1-\theta)+\beta_1(x_{it1}-\theta \bar x_{i1})+\ldots+\beta_k(x_{itk}-\theta \bar x_{ik})+(v_{it}-\theta \bar v_i)   \tag{14.11}
\end{equation}

where the overbar again denotes the time averages. This is a very interesting equation, as it involves quasi-demeaned data on each variable. The fixed effects estimator subtracts the time averages from the corresponding variable. The random effects transformation subtracts a fraction of that time average, where the fraction depends on $\sigma^2_u$, $\sigma^2_a$, and the number of time periods, T. The GLS estimator is simply the pooled OLS estimator of equation (14.11). It is hardly obvious that the errors in (14.11) are serially uncorrelated, but they are.

The transformation in (14.11) allows for explanatory variables that are constant over time, and this is one advantage of random effects (RE) over either fixed effects or first differencing. This is possible because RE assumes that the unobserved effect is uncorrelated with all explanatory variables, whether the explanatory variables are fixed over time or not. Thus, in a wage equation, we can include a variable such as education even if it does not change over time. But we are assuming that education is uncorrelated with $a_i$, which contains ability and family background. In many applications, the whole reason for using panel data is to allow the unobserved effect to be correlated with the explanatory variables.

The parameter $\theta$ is never known in practice, but it can always be estimated. There are different ways to do this, which may be based on pooled OLS or fixed effects, for example.
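Formula (14.10) is simple enough to evaluate by hand once the variance components are available. As a rough illustration, the base R sketch below plugs in the rounded variance estimates that plm prints for the CRE wage equation later in this chapter (idiosyncratic 0.1426, individual 0.1044, with T = 8); the helper name `theta_re` is made up for this example.

```r
# Hypothetical helper implementing (14.10)
theta_re <- function(sigma2_u, sigma2_a, T_periods) {
  1 - sqrt(sigma2_u / (sigma2_u + T_periods * sigma2_a))
}

# Rounded variance components taken from the plm output shown later
theta_hat <- theta_re(sigma2_u = 0.1426, sigma2_a = 0.1044, T_periods = 8)
round(theta_hat, 3)

# Limiting cases: no unobserved effect gives theta = 0 (pooled OLS);
# a very large T pushes theta toward 1 (fixed effects)
theta_zero  <- theta_re(0.1426, 0, 8)
theta_large <- theta_re(0.1426, 0.1044, 1e6)
c(theta_zero = theta_zero, theta_large = theta_large)
```

The value computed from the rounded inputs agrees with the `theta: 0.6182` reported by plm up to rounding. The limiting cases show why RE estimates move from the pooled OLS estimates toward the FE estimates as $\hat \theta$ grows.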

Many econometrics packages support estimation of random effects models and automatically compute some version of $\hat \theta$. The feasible GLS estimator that uses $\hat \theta$ in place of $\theta$ is called the random effects estimator. Under the random effects assumptions (refer to the chapter appendix in Wooldridge 2016), the estimator is consistent (not unbiased) and asymptotically normally distributed as N gets large with fixed T. The properties of the random effects (RE) estimator with small N and large T are largely unknown, although it has certainly been used in such situations.

### Wooldridge Example 14.4. A Wage Equation Using Panel Data

We again use the data in WAGEPAN to estimate a wage equation for men. We use three methods: pooled OLS, random effects, and fixed effects. In the first two methods, we can include educ and race dummies (black and hisp), but these drop out of the fixed effects analysis. The time-varying variables are exper, exper\*exper, union, and married. As we discussed in Section 14-1, exper is dropped in the FE analysis (although exper\*exper remains). Each regression also contains a full set of year dummies. The estimation results are as follows.

In [8]:
install.packages('stargazer')
library(foreign);library(plm);library(stargazer)

In [10]:

wagepan<-read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/wagepan.dta?raw=true")

# Generate pdata.frame:
wagepan.p <- pdata.frame(wagepan, index=c("nr","year") )

pdim(wagepan.p)

# Check variation of variables within individuals
pvar(wagepan.p)

Balanced Panel: n = 545, T = 8, N = 4360

no time variation:       nr black hisp educ 
no individual variation: year d81 d82 d83 d84 d85 d86 d87 

In [13]:
# Estimate different models
wagepan.p$yr<-factor(wagepan.p$year)

reg.ols<- (plm(lwage~educ+black+hisp+exper+I(exper^2)+married+union+yr, 
                                      data=wagepan.p, model="pooling") )
reg.re <- (plm(lwage~educ+black+hisp+exper+I(exper^2)+married+union+yr, 
                                      data=wagepan.p, model="random") )
reg.fe <- (plm(lwage~                      I(exper^2)+married+union+yr, 
                                      data=wagepan.p, model="within") )

# Pretty table of selected results (not reporting year dummies)
stargazer(reg.ols,reg.re,reg.fe, type="text", 
          column.labels=c("OLS","RE","FE"),keep.stat=c("n","rsq"),
          keep=c("ed","bl","hi","exp","mar","un"))

# Note that "reg.fe" and "reg.re" are the estimates computed above
# in this cell, so the model lines have to be run first.

# Hausman test of RE vs. FE:
phtest(reg.fe, reg.re)


                  Dependent variable:     
             -----------------------------
                         lwage            
                OLS       RE        FE    
                (1)       (2)       (3)   
------------------------------------------
educ         0.091***  0.092***           
              (0.005)   (0.011)           
                                          
black        -0.139*** -0.139***          
              (0.024)   (0.048)           
                                          
hisp           0.016     0.022            
              (0.021)   (0.043)           
                                          
exper        0.067***  0.106***           
              (0.014)   (0.015)           
                                          
I(exper2)    -0.002*** -0.005*** -0.005***
              (0.001)   (0.001)   (0.001) 
                                          
married      0.108***  0.064***   0.047** 
              (0.016)   (0.017)   (0.018) 
          


	Hausman Test

data:  lwage ~ I(exper^2) + married + union + yr
chisq = 26.361, df = 10, p-value = 0.003284
alternative hypothesis: one model is inconsistent


The coefficients on educ, black, and hisp are similar for the pooled OLS and random effects estimations. The pooled OLS standard errors are the usual OLS standard errors, and these underestimate the true standard errors because they ignore the positive serial correlation; we report them here for comparison only.

The experience profile is somewhat different, and both the marriage and union premiums fall notably in the random effects estimation. When we eliminate the unobserved effect entirely by using fixed effects, the marriage premium falls to about 4.7%, although it is still statistically significant. The drop in the marriage premium is consistent with the idea that men who are more able, as captured by a higher unobserved effect $a_i$, are more likely to be married. Therefore, in the pooled OLS estimation, a large part of the marriage premium reflects the fact that men who are married would earn more even if they were not married. The remaining 4.7% has at least two possible explanations: (1) marriage really makes men more productive or (2) employers pay married men a premium because marriage is a signal of stability. We cannot distinguish between these two hypotheses.

The estimate of $\theta$ for the random effects estimation is $\hat \theta=.643$, which helps explain why, on the time-varying variables, the RE estimates lie closer to the FE estimates than to the pooled OLS estimates.

### 14-2a Random Effects or Fixed Effects?

Because fixed effects allows arbitrary correlation between $a_i$ and the $x_{itj}$, while random effects does not, FE is widely thought to be a more convincing tool for estimating ceteris paribus effects. Still, random effects is applied in certain situations. Most obviously, if the key explanatory variable is constant over time, we cannot use FE to estimate its effect on y. For example, in Table 14.2, we must rely on the RE (or pooled OLS) estimate of the return to education. RE is preferred to pooled OLS because RE is generally more efficient.

If our interest is in a time-varying explanatory variable, is there ever a case to use RE rather than FE? Yes, but situations in which $Cov(x_{itj},a_i)=0$ should be considered the exception rather than the rule. If the key policy variable is set experimentally (say, each year, children are randomly assigned to classes of different sizes), then random effects would be appropriate for estimating the effect of class size on performance. Unfortunately, in most cases the regressors are themselves outcomes of choice processes and likely to be correlated with individual preferences and abilities as captured by $a_i$.

It is still fairly common to see researchers apply both random effects and fixed effects, and then formally test for statistically significant differences in the coefficients on the time-varying explanatory variables. (So, in Table 14.2, these would be the coefficients on exper\*exper, married, and union.) Hausman (1978) first proposed such a test, and some econometrics packages routinely compute the Hausman test under the full set of random effects assumptions. The idea is that one uses the random effects estimates unless the Hausman test rejects (14.8). In practice, a failure to reject means either that the RE and FE estimates are sufficiently close so that it does not matter which is used, or the sampling variation is so large in the FE estimates that one cannot conclude practically significant differences are statistically significant. In the latter case, one is left to wonder whether there is enough information in the data to provide precise estimates of the coefficients. A rejection using the Hausman test is taken to mean that the key RE assumption, (14.8), is false, and then the FE estimates are used. As an example, the phtest() call at the end of the code for Example 14.4 computes the Hausman test. The p-value of 0.0033 supports rejecting the RE assumptions in favor of FE.
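For reference, the textbook form of the Hausman statistic can be written as follows. This is a sketch of the standard formula rather than a description of any particular package's internals; $M$ and the $\widehat{Avar}$ notation are introduced here only for illustration. Let $\hat\beta_{FE}$ and $\hat\beta_{RE}$ collect the $M$ coefficients that both methods estimate (those on the time-varying variables). Under the full set of RE assumptions,

\begin{equation}
H=(\hat\beta_{FE}-\hat\beta_{RE})'\,[\widehat{Avar}(\hat\beta_{FE})-\widehat{Avar}(\hat\beta_{RE})]^{-1}\,(\hat\beta_{FE}-\hat\beta_{RE}) \overset{a}{\sim} \chi^2_M
\end{equation}

In Example 14.4 the compared coefficients are those on exper\*exper, married, union, and the seven year dummies, which matches the df = 10 reported in the test output.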

In the next section we discuss an alternative, computationally simpler approach to choosing between the RE and FE approaches.

Whether or not we engage in the philosophical debate about the nature of $a_i$, FE is almost always much more convincing than RE for policy analysis using aggregated data.

## 14-3 The Correlated Random Effects Approach

In applications where it makes sense to view the $a_i$ (unobserved effects) as being random variables, along with the observed variables we draw, there is an alternative to fixed effects that still allows $a_i$ to be correlated with the observed explanatory variables. To describe the approach, consider again the simple model in equation (14.1), with a single, time-varying explanatory variable $x_{it}$.

Rather than assume $a_i$ is uncorrelated with $\{x_{it}:t=1,2,\ldots,T\}$, which is the random effects approach, or take away time averages to remove $a_i$, the fixed effects approach, we might instead model correlation between $a_i$ and $\{x_{it}:t=1,2,\ldots,T\}$. Because $a_i$ is, by definition, constant over time, allowing it to be correlated with the average level of the $x_{it}$ has a certain appeal. More specifically, let $\bar x_i=T^{-1} \sum_{t=1}^{T} x_{it}$ be the time average, as before. Suppose we assume the simple linear relationship

\begin{equation}
a_i=\alpha+\gamma \bar x_i +r_i   \tag{14.12}
\end{equation}

where we assume $r_i$ is uncorrelated with each $x_{it}$. Because $\bar x_i$ is a linear function of the $x_{it}$,

\begin{equation}
Cov(\bar x_i,r_i)=0   \tag{14.13}
\end{equation}

Equations (14.12) and (14.13) imply that $a_i$ and $\bar x_i$ are correlated whenever $\gamma \neq 0$.

The correlated random effects (CRE) approach uses (14.12) in conjunction with (14.1): substituting the former in the latter gives

\begin{equation}
y_{it}=\beta x_{it}+\alpha+\gamma \bar x_i +r_i+u_{it}=\alpha+\beta x_{it}+\gamma \bar x_i +r_i+u_{it}  \tag{14.14}
\end{equation}

which is like the usual equation underlying RE estimation with the important addition of the time-average variable, $\bar x_i$. It is the addition of $\bar x_i$ that controls for the correlation between $a_i$ and the sequence $\{x_{it}:t=1,2,\ldots,T\}$. What is left over, $r_i$, is uncorrelated with the $x_{it}$.

The CRE approach provides a simple, formal way of choosing between the FE and RE approaches. As we just discussed, the RE approach sets $\gamma=0$ while FE estimates $\gamma$ . Because we have $\hat \gamma_{CRE}$ and its standard error [obtained from RE estimation of (14.14)], we can construct a t test of $H_0:\gamma=0$ against $H_1:\gamma \neq 0$. If we reject $H_0$ at a sufficiently small significance level, we reject RE in favor of FE. As usual, especially with a large cross section, it is important to distinguish between a statistical rejection and economically important differences.

A second reason to study the CRE approach is that it provides a way to include time-constant explanatory variables in what is effectively a fixed effects analysis. For example, let $z_i$ be a variable that does not change over time; it could be gender, say, or an IQ test score determined in childhood. We can easily augment (14.14) to include $z_i$:

\begin{equation}
y_{it}=\alpha+\beta x_{it}+\gamma \bar x_i +\delta z_i+r_i+u_{it}  \tag{14.17}
\end{equation}

where we do not change the notation for the error term (which no longer includes $z_i$). If we estimate this expanded equation by RE, it can still be shown that the estimate of $\beta$ is the FE estimate from (14.1). In fact, once we include $\bar x_i$, we can include any other time-constant variables in the equation, estimate it by RE, and obtain $\hat \beta_{FE}$ as the coefficient on $x_{it}$. In addition, we obtain an estimate of $\delta$, although the estimate should be interpreted with caution because it does not necessarily estimate a causal effect of $z_i$ on $y_{it}$.

The same CRE strategy can be applied to models with many time-varying explanatory variables (and many time-constant variables). When the equation augmented with the time averages is estimated by RE, the coefficients on the time-varying variables are identical to the FE estimates.
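The equality between the CRE and FE coefficients can be checked on simulated data. The base R sketch below is a simplification: it estimates the augmented equation by pooled OLS instead of RE, which is a swapped-in technique chosen to avoid extra packages; with the time average $\bar x_i$ included, the coefficient on $x_{it}$ still equals the within estimate exactly. All names and parameter values are assumptions made for the illustration.

```r
# Simulated balanced panel (all values hypothetical)
set.seed(3)
N <- 200; T_periods <- 5
id <- rep(1:N, each = T_periods)
a  <- rep(rnorm(N), each = T_periods)             # unobserved effect a_i
x  <- 0.8 * a + rnorm(N * T_periods)              # x correlated with a_i
z  <- rep(rbinom(N, 1, 0.5), each = T_periods)    # time-constant variable z_i
y  <- 1 + 2 * x + 0.5 * z + a + rnorm(N * T_periods)

x_bar <- ave(x, id)                               # time average of x per unit

# Within (FE) estimate of beta from time-demeaned data
b_fe <- coef(lm(I(y - ave(y, id)) ~ 0 + I(x - x_bar)))[[1]]

# Augmented equation as in (14.17), here by pooled OLS rather than RE
cre       <- lm(y ~ x + x_bar + z)
b_cre     <- coef(cre)[["x"]]
gamma_hat <- coef(cre)[["x_bar"]]                 # nonzero when a_i and x_bar are correlated

c(b_fe = b_fe, b_cre = b_cre, gamma_hat = gamma_hat)
```

The standard errors that `lm()` reports for the augmented regression are not reliable here, because the leftover $r_i$ still induces serial correlation in the error; RE estimation or cluster-robust standard errors would be needed for inference on $\gamma$.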

### Wooldridge Example 14.4 (REVISITED). A Wage Equation Using Panel Data

In this example we use WAGEPAN.dta again. We estimate the FE parameters using the within transformation (reg.fe) and the CRE approach (reg.cre). We also estimate the RE version of this model (reg.re).

In [16]:
wagepan<-read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/wagepan.dta?raw=true")

# Generate pdata.frame:
wagepan.p <- pdata.frame(wagepan, index=c("nr","year") )

# Estimate FE parameter in several different ways:
wagepan.p$yr<-factor(wagepan.p$year)
reg.fe <-(plm(lwage~married+union+yr*educ,data=wagepan.p, model="within"))


reg.cre<-(plm(lwage~married+union+yr*educ+Between(married)+Between(union)
                                         ,data=wagepan.p, model="random"))

# Estimate RE
reg.re <-(plm(lwage~married+union+yr*educ,data=wagepan.p, model="random"))

In [18]:
stargazer(reg.fe,reg.cre,reg.re,type="text",model.names=FALSE,
          keep=c("married","union",":educ"),keep.stat=c("n","rsq"),
          column.labels=c("Within","CRE","RE"))



                      Dependent variable:     
                 -----------------------------
                             lwage            
                  Within      CRE       RE    
                    (1)       (2)       (3)   
----------------------------------------------
married          0.055***  0.055***  0.078*** 
                  (0.018)   (0.018)   (0.017) 
                                              
union            0.083***  0.083***  0.108*** 
                  (0.019)   (0.019)   (0.018) 
                                              
Between(married)           0.127***           
                            (0.044)           
                                              
Between(union)             0.160***           
                            (0.050)           
                                              
yr1981:educ        0.012     0.012     0.011  
                  (0.012)   (0.012)   (0.012) 
                                              
yr1982:educ 

Having estimated the CRE model, it is easy to test the null hypothesis that the RE estimator is consistent. The additional assumption needed is $\gamma_1=\ldots=\gamma_k=0$, which can be tested with a joint test on the coefficients of the time averages (reported below in its chi-square form), as the following code demonstrates. As with the Hausman test, we clearly reject the null hypothesis that the RE model is appropriate, with a tiny p-value of about 0.00005.

In [19]:
# Note that the estimates "reg.cre" are calculated in the
# previous code cell, which has to be run first.

# RE test as an F test on the "Between" coefficients 
library(car)
linearHypothesis(reg.cre, matchCoefs(reg.cre,"Between"))


  Res.Df  Df    Chisq  Pr(>Chisq)
1   4342
2   4340   2 19.81394 4.98261e-05


As previously suggested, another advantage of the CRE approach is that we can add time-constant regressors to the model. The following code estimates another version of the wage equation using the CRE approach. The variables married and union vary over time, so we can control for their between effects. The variables educ, black, and hisp do not vary. For a causal interpretation of their coefficients, we have to rely on uncorrelatedness with $a_i$. Given that $a_i$ includes intelligence and other labor market success factors, this uncorrelatedness is more plausible for some variables (like gender or race) than for other variables (like education).

In [21]:
library(foreign);library(plm)
wagepan<-read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/wagepan.dta?raw=true")

# Generate pdata.frame:
wagepan.p <- pdata.frame(wagepan, index=c("nr","year") )

# Estimate CRE parameters
wagepan.p$yr<-factor(wagepan.p$year)
summary(plm(lwage~married+union+educ+black+hisp+Between(married)+
                         Between(union), data=wagepan.p, model="random"))


Oneway (individual) effect Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = lwage ~ married + union + educ + black + hisp + 
    Between(married) + Between(union), data = wagepan.p, model = "random")

Balanced Panel: n = 545, T = 8, N = 4360

Effects:
                 var std.dev share
idiosyncratic 0.1426  0.3776 0.577
individual    0.1044  0.3231 0.423
theta: 0.6182

Residuals:
   Min. 1st Qu.  Median 3rd Qu.    Max. 
-4.5300 -0.1620  0.0266  0.2030  1.6500 

Coefficients:
                   Estimate Std. Error t-value  Pr(>|t|)    
(Intercept)       0.6325629  0.1081545  5.8487 5.317e-09 ***
married           0.2416845  0.0176735 13.6750 < 2.2e-16 ***
union             0.0700438  0.0207240  3.3798 0.0007316 ***
educ              0.0760374  0.0087787  8.6616 < 2.2e-16 ***
black            -0.1295163  0.0488981 -2.6487 0.0081094 ** 
hisp              0.0116700  0.0428188  0.2725 0.7852172    
Between(married) -0.0797385  0.0442674 -1.8013 0.0717258 .  
Betwe

## 14-4 Applying Panel Data Methods to Other Data Structures

The various panel data methods can be applied to certain data structures that do not involve time. For example, it is common in demography to use siblings (sometimes twins) to account for unobserved family and background characteristics. Usually we want to allow the unobserved "family effect," which is common to all siblings within a family, to be correlated with observed explanatory variables. If those explanatory variables vary across siblings within a family, differencing across sibling pairs (or, more generally, using the within transformation within a family) is preferred as an estimation method. By removing the unobserved effect, we eliminate potential bias caused by confounding family background characteristics. Implementing fixed effects on such data structures is rather straightforward in regression packages that support FE estimation.

Ashenfelter and Krueger (1994) used the differencing methodology to estimate the return to education. They obtained a sample of 149 identical twins and collected information on earnings, education, and other variables. Identical twins were used because they should have the same underlying ability. This can be differenced away by using twin differences, rather than OLS on the pooled data. Because identical twins are the same in age, gender, and race, these factors all drop out of the differenced equation. Therefore, Ashenfelter and Krueger regressed the difference in log(earnings) on the difference in education and estimated the return to education to be about 9.2% (t=3.83).
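The twin-differencing argument mirrors the T=2 first-difference derivation, with the twin index playing the role of time. As a sketch, using notation introduced only for this illustration ($f$ indexes families, $j=1,2$ indexes twins, and $a_f$ is the shared family and ability effect), suppose

\begin{equation}
\log(earnings_{fj})=\beta_0+\beta_1 educ_{fj}+a_f+u_{fj}, \quad j=1,2
\end{equation}

Subtracting the equation for twin 2 from the equation for twin 1 removes $\beta_0$, $a_f$, and anything else the twins share:

\begin{equation}
\log(earnings_{f1})-\log(earnings_{f2})=\beta_1(educ_{f1}-educ_{f2})+(u_{f1}-u_{f2})
\end{equation}

so $\beta_1$ is identified only from families in which the twins differ in education.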

The samples used by Ashenfelter and Krueger (1994) are examples of matched pairs samples. More generally, fixed and random effects methods can be applied to a cluster sample. A cluster sample has the same appearance as a cross-sectional data set, but there is an important difference: clusters of units are sampled from a population of clusters rather than sampling individuals from the population of individuals. In the previous examples, each family is sampled from the population of families, and then we obtain data on at least two family members. Therefore, each family is a cluster.

As another example, suppose we are interested in modeling individual pension plan participation decisions. One might obtain a random sample of working individuals (say, from the United States), but it is also common to sample firms from a population of firms. Once the firms are sampled, one might collect information on all workers or a subset of workers within each firm. In either case, the resulting data set is a cluster sample because sampling was first at the firm level. Unobserved firm-level characteristics (along with observed firm characteristics) are likely to be present in participation decisions, and this within-firm correlation must be accounted for. Fixed effects estimation is preferred when we think the unobserved cluster effect (an example of which is $a_i$ in (14.12)) is correlated with one or more of the explanatory variables. Then, we can only include explanatory variables that vary, at least somewhat, within clusters. The cluster sizes are rarely the same, so we are effectively using fixed effects methods for unbalanced panels.
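
To make the unbalanced-cluster case concrete, here is a minimal base R sketch of the within transformation applied to workers nested in firms. The data are simulated and every name and number (`firm`, `x`, `y`, the true slope of 0.5, a continuous outcome rather than a binary participation decision) is a hypothetical choice for illustration, not part of the chapter's data.

```r
# Simulated cluster sample: workers (rows) nested in firms (clusters).
# All names and parameter values are hypothetical.
set.seed(123)
n_firms <- 200
size    <- sample(2:8, n_firms, replace = TRUE)   # unequal cluster sizes
firm    <- rep(seq_len(n_firms), times = size)
a       <- rnorm(n_firms)                         # unobserved firm effect
x       <- 0.8 * a[firm] + rnorm(length(firm))    # regressor correlated with a
y       <- 1 + 0.5 * x + a[firm] + rnorm(length(firm))  # true slope is 0.5

# Within transformation: subtract the firm mean from each variable.
# ave() returns the group mean repeated for every member of the group.
y_dm <- y - ave(y, firm)
x_dm <- x - ave(x, firm)

b_pooled <- unname(coef(lm(y ~ x))["x"])               # ignores the firm effect
b_fe     <- unname(coef(lm(y_dm ~ 0 + x_dm))["x_dm"])  # firm fixed effects

c(pooled = b_pooled, fe = b_fe)
```

Because `x` is built to be correlated with the firm effect, the pooled slope should be pushed away from 0.5 while the within estimate should stay close to it. Only the point estimate is meant to be used here: the standard errors that `lm()` reports for the demeaned regression do not account for the degrees of freedom used up by the firm means.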

Educational data on student outcomes can also come in the form of a cluster sample, where a sample of schools is obtained from the population of schools, and then information on students within each school is obtained. Each school acts as a cluster, and allowing a school effect to be correlated with key explanatory variables (say, whether a student participates in a state-sponsored tutoring program) is likely to be important. Because the rate at which students are tutored likely varies by school, it is probably a good idea to use fixed effects estimation. One often sees authors use, as a shorthand, "I included school fixed effects in the analysis."
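
The shorthand "school fixed effects" usually means one dummy variable per school. The following base R sketch, on simulated data with hypothetical names (`school`, `tutor`, `score`) and made-up parameter values, suggests that the dummy-variable regression and within-school demeaning give the same coefficient on the tutoring indicator.

```r
# Simulated students nested in schools; all names and values are hypothetical.
set.seed(1)
n_schools <- 40
size   <- sample(5:15, n_schools, replace = TRUE)
school <- rep(seq_len(n_schools), times = size)
a      <- rnorm(n_schools)                               # unobserved school effect
tutor  <- rbinom(length(school), 1, plogis(a[school]))   # tutoring rate varies by school
score  <- 50 + 2 * tutor + 3 * a[school] + rnorm(length(school))

# "School fixed effects" as one dummy per school
b_dummies <- unname(coef(lm(score ~ tutor + factor(school)))["tutor"])

# The same coefficient from within-school demeaning
score_dm <- score - ave(score, school)
tutor_dm <- tutor - ave(tutor, school)
b_within <- unname(coef(lm(score_dm ~ 0 + tutor_dm))["tutor_dm"])

all.equal(b_dummies, b_within)
```

With many schools the dummy-variable regression estimates a large number of nuisance coefficients, which is one reason packages typically implement the demeaning version instead.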

The correlated random effects approach can be applied immediately to cluster samples because, for the purposes of estimation, a cluster sample acts like an unbalanced panel. Now, the averages that are added to the equation are within-cluster averages (for example, averages within schools). The only difference with panel data is that the notion of serial correlation in idiosyncratic errors is not relevant.
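
As a rough illustration of adding within-cluster averages, the base R sketch below augments a student-level regression with the school average of the regressor. The data are simulated, all names and values are hypothetical, and pooled OLS is used for simplicity instead of random effects GLS; that substitution is an assumption of this sketch, not the chapter's procedure.

```r
# Simulated students nested in schools; all names and values are hypothetical.
set.seed(42)
n_schools <- 150
size   <- sample(5:20, n_schools, replace = TRUE)        # unbalanced clusters
school <- rep(seq_len(n_schools), times = size)
a      <- rnorm(n_schools)                               # unobserved school effect
tutor  <- rbinom(length(school), 1, plogis(a[school]))
score  <- 50 + 2 * tutor + 3 * a[school] + rnorm(length(school))

# Within-school average of the regressor (the school's tutoring rate)
tutor_bar <- ave(tutor, school)

# Augmented regression: original regressor plus its cluster average
cre  <- lm(score ~ tutor + tutor_bar)
b_cre <- unname(coef(cre)["tutor"])

# Fixed effects estimate for comparison (within-school demeaning)
score_dm <- score - ave(score, school)
tutor_dm <- tutor - tutor_bar
b_fe <- unname(coef(lm(score_dm ~ 0 + tutor_dm))["tutor_dm"])

all.equal(b_cre, b_fe)
```

The coefficient on `tutor` in the augmented regression matches the fixed effects estimate, while the coefficient on `tutor_bar` picks up the association between the school effect and the school's tutoring rate.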

In some cases, the key explanatory variables (often policy variables) change only at the level of the cluster, not within the cluster. In such cases the fixed effects approach is not applicable. For example, we may be interested in the effects of measured teacher quality on student performance, where each cluster is an elementary school classroom. Because all students within a cluster have the same teacher, eliminating a "class effect" also eliminates any observed measures of teacher quality. If we have good controls in the equation, we may be justified in applying random effects on the unbalanced cluster sample. As with panel data, the key requirement for RE to produce convincing estimates is that the explanatory variables are uncorrelated with the unobserved cluster effect. Most econometrics packages allow random effects estimation on unbalanced clusters without much effort.
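
A short base R check shows why fixed effects cannot be used for a cluster-level variable. The classroom data are simulated and the names (`class`, `tq`) are hypothetical; the sketch only demonstrates the elimination problem and does not carry out random effects estimation.

```r
# Simulated classrooms; teacher quality (tq) is constant within a classroom.
# All names and values are hypothetical.
set.seed(7)
n_class <- 30
size  <- sample(15:25, n_class, replace = TRUE)
class <- rep(seq_len(n_class), times = size)
tq    <- rnorm(n_class)[class]          # one value per classroom, repeated per student

# The within transformation removes all variation in tq ...
tq_dm <- tq - ave(tq, class)
max_within_dev <- max(abs(tq_dm))       # zero up to floating-point rounding

# ... even though tq varies across classrooms.
between_var <- var(tapply(tq, class, mean))

c(max_within_dev = max_within_dev, between_var = between_var)
```

Since the demeaned variable carries no information, any estimate of its effect has to come from variation across classrooms, which is what random effects (or pooled OLS) uses.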

### Wooldridge Example 14.4 (REVISITED). County Crime Rates in North Carolina

The following code repeats the FD regression from Example 13.9, this time reporting the regression table with clustered standard errors and respective t statistics.

In [None]:
install.packages("plm")  # only needed once per R installation
library(foreign)  # read.dta()
library(plm)      # pdata.frame(), plm(), vcovHC()
library(lmtest)   # coeftest()

In [23]:
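# Load the county crime data (Stata format)
crime4 <- read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/crime4.dta?raw=true")

# Declare the panel structure: county is the unit, year is the time index
crime4.p <- pdata.frame(crime4, index = c("county", "year"))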

In [27]:
# Estimate FD model:
reg <- plm(log(crmrte) ~ d83 + d84 + d85 + d86 + d87 + lprbarr + lprbconv +
             lprbpris + lavgsen + lpolpc, data = crime4.p, model = "fd")

# Regression table with "clustered" SE (vcovHC clusters by county by default):
summary(reg, vcov = vcovHC)


Oneway (individual) effect First-Difference Model

Note: Coefficient variance-covariance matrix supplied: vcovHC

Call:
plm(formula = log(crmrte) ~ d83 + d84 + d85 + d86 + d87 + lprbarr + 
    lprbconv + lprbpris + lavgsen + lpolpc, data = crime4.p, 
    model = "fd")

Balanced Panel: n = 90, T = 7, N = 630
Observations used in estimation: 540

Residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.65900 -0.07490  0.00521  0.00117  0.07550  0.68200 

Coefficients:
          Estimate Std. Error t-value  Pr(>|t|)    
d83      -0.092014   0.014432 -6.3758 3.964e-10 ***
d84      -0.132355   0.017726 -7.4668 3.401e-13 ***
d85      -0.129328   0.022948 -5.6357 2.836e-08 ***
d86      -0.093869   0.020665 -4.5425 6.890e-06 ***
d87      -0.044943   0.023324 -1.9269 0.0545289 .  
lprbarr  -0.326327   0.055400 -5.8904 6.849e-09 ***
lprbconv -0.237521   0.038712 -6.1355 1.663e-09 ***
lprbpris -0.164511   0.045073 -3.6498 0.0002884 ***
lavgsen  -0.024691   0.024924 -0.9906 0.3223138  