# Chapter 8: Panel Data

https://mixtape.scunning.com/08-panel_data

## Intro

Panel data estimators (models) are among the most important tools in the causal inference toolkit. The estimators are designed explicitly for longitudinal data, that is, repeated observations of the same unit over time. Under certain conditions, repeatedly observing the same unit over time can overcome a particular kind of omitted variable bias, though not all kinds. Observing the same unit over time will not always resolve the bias, but there are many applications where it can, and that’s why this method is so important.

## DAG Example

Before I dig into the technical assumptions and estimation methodology for panel data techniques, I want to review a simple DAG illustrating those assumptions. This DAG comes from [Imai and Kim (2017)](https://mixtape.scunning.com/references#ref-Imai2017). 

Let’s say that we have data on a column of outcomes $Y_i$, which appear in three time periods (here notated as $Y_{i1}$, $Y_{i2}$ and $Y_{i3}$). In our notation $t = 1,2,3$ indexes the time period where each $i$ unit is observed. Likewise, we have a matrix of covariates $D_i$. These also vary over time (noted as $D_{i1}$, $D_{i2}$ and $D_{i3}$). 

Also, there exists a single unit-specific unobserved variable $u_i$, which varies across units but does not vary over time for a given unit. This is why our $u_i$ variable carries no $t = 1,2,3$ subscript. Three features of this variable are key: (a) it is unobserved in the data set, (b) it is unit-specific, and (c) it does not change over time for a given unit.

Finally, there exists some unit-specific time-invariant variable, $X_i$. Notice that it doesn’t change over time, just like $u_i$, but unlike $u_i$ it **is** observed.

<img src="img/panel_dag.png" alt="DAG" style="width: 350px;"/>

- First, note that $D_{i1}$ causes both $Y_{i1}$ and the next period’s treatment value, $D_{i2}$.

- Second, note that an unobserved confounder, $u_i$, determines all $Y$ and all $D$ variables. Consequently, $D$ is **endogenous** since $u_i$ is unobserved and absorbed into the structural error term of the regression model. 

- Third, there is no time-varying unobserved confounder correlated with $D_{it}$; the only confounder is $u_i$, which we call the **unobserved heterogeneity**.

- Fourth, past outcomes do not directly affect current outcomes (i.e., no direct edge between the $Y_{it}$ variables). 

- Fifth, past outcomes do not directly affect current treatments (i.e., no direct edge from $Y_{i,t-1}$ to $D_{it}$). 

- And finally, past treatments, $D_{i,t-1}$, do not directly affect current outcomes, $Y_{it}$ (i.e., no direct edge from $D_{i,t-1}$ to $Y_{it}$).

It is under these assumptions that we can use a particular panel method called fixed effects to isolate the causal effect of $D$ on $Y$.

What might an example of this be? Let’s return to our story about the returns to education. Let’s say that we are interested in the effect of schooling on earnings, and schooling is partly determined by unchanging genetic factors which themselves determine unobserved ability, like intelligence, conscientiousness, and motivation (Conley and Fletcher 2017). If we observe the same people’s earnings and schooling over time, and if the above DAG correctly describes both the directed edges and the missing edges, then we can use panel fixed effects models to identify the causal effect of schooling on earnings.

## Estimation
When we use the term “panel data,” what do we mean? We mean a data set where we observe the same units (e.g., individuals, firms, countries, schools) over more than one time period. Often our outcome variable depends on several factors, some of which are observed and some of which are unobserved in our data. Insofar as the unobserved variables are correlated with the treatment variable, the treatment variable is endogenous and correlations are not estimates of a causal effect. This chapter focuses on the conditions under which a correlation between $D$ and $Y$ reflects a causal effect even with unobserved variables that are correlated with the treatment variable. Specifically, if these omitted variables are constant over time, then even if they are heterogeneous across units, we can use panel data estimators to consistently estimate the effect of our treatment variable on outcomes.

There are several different kinds of estimators for panel data, but this chapter covers only two: pooled ordinary least squares (POLS) and fixed effects (FE).

## Pooled OLS

The first estimator we will discuss is the pooled ordinary least squares, or POLS estimator. When we ignore the panel structure and regress $Y_{it}$ on $D_{it}$ we get:

$$
Y_{it} = \delta D_{it} + \eta_{it}; t = 1,2,...,T
$$

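with composite error:

$$
\eta_{it} = u_i + \epsilon_{it}
$$

For pooled OLS to consistently estimate $\delta$, the treatment $D_{it}$ must be uncorrelated with the composite error $\eta_{it}$. While our DAG did not include $\epsilon_{it}$, this amounts to assuming that the unobserved heterogeneity, $u_i$, is uncorrelated with $D_{it}$ in every time period.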

**But this is not an appropriate assumption in our case because our DAG explicitly links the unobserved heterogeneity to both the outcome and the treatment in each period.** Or, using our schooling-earnings example, schooling is likely based on unobserved background factors, $u_i$, so without controlling for them we have omitted variable bias and $\hat{\delta}$ is biased. No correlation between $D_{it}$ and $\eta_{it}$ necessarily means no correlation between the unobserved $u_i$ and $D_{it}$ for all $t$, and that is probably not a credible assumption. An additional problem is that $\eta_{it}$ is serially correlated within unit $i$ because $u_i$ is present in every period $t$. Thus heteroskedasticity-robust standard errors are also likely too small.
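The omitted variable bias described above can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the panel size, the true effect `delta`, and the 0.8 loading of treatment on $u_i$ are hypothetical, and the DAG’s edge from $D_{i,t-1}$ to $D_{it}$ is left out for simplicity. A constant is included in the regression, which is harmless here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, delta = 2000, 3, 1.0  # hypothetical panel size and true treatment effect

u = rng.normal(size=N)                                 # time-invariant unobserved heterogeneity u_i
D = 0.8 * u[:, None] + rng.normal(size=(N, T))         # treatment depends on u_i in every period
Y = delta * D + u[:, None] + rng.normal(size=(N, T))   # outcome depends on D_it and on u_i

# Pooled OLS: ignore the panel structure, stack all (i, t) rows,
# and regress Y on a constant and D.
X = np.column_stack([np.ones(N * T), D.ravel()])
pooled_delta = np.linalg.lstsq(X, Y.ravel(), rcond=None)[0][1]

print(f"true delta = {delta:.2f}, pooled OLS estimate = {pooled_delta:.2f}")
```

Because $u_i$ raises both $D_{it}$ and $Y_{it}$ in this setup, the pooled estimate lands well above the true effect.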

## Fixed Effects (Within Estimator)

Let’s rewrite our unobserved effects model so that this is still firmly in our minds:

$$
Y_{it} = \delta D_{it} + u_i + \epsilon_{it}; t = 1,2,...,T
$$

If we have data on multiple time periods, we can think of $u_i$ as fixed effects to be estimated. OLS estimation with fixed effects yields:

$$
\big(\widehat{\delta}, \widehat{u}_1, \dots, \widehat{u}_N\big) = \underset{b,m_1,\dots,m_N}{\arg\min} \sum_{i=1}^N\sum_{t=1}^T (Y_{it}-D_{it}b- m_i)^2
$$
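To see where time-demeaned variables come from, here is a sketch of the first-order conditions of the minimization above, treating $D_{it}$ as a single scalar regressor (an assumption made here to keep the algebra short). Setting the derivative with respect to each $m_i$ to zero gives

$$
\sum_{t=1}^T \big(Y_{it} - D_{it}\widehat{\delta} - \widehat{u}_i\big) = 0
\quad\Longrightarrow\quad
\widehat{u}_i = \overline{Y}_i - \widehat{\delta}\,\overline{D}_i
$$

where $\overline{Y}_i \equiv \frac{1}{T}\sum_{t=1}^T Y_{it}$ and $\overline{D}_i \equiv \frac{1}{T}\sum_{t=1}^T D_{it}$. Substituting $\widehat{u}_i$ into the first-order condition for $b$, and writing $\ddot{Y}_{it} \equiv Y_{it} - \overline{Y}_i$ and $\ddot{D}_{it} \equiv D_{it} - \overline{D}_i$, gives

$$
\sum_{i=1}^N\sum_{t=1}^T D_{it}\big(\ddot{Y}_{it} - \widehat{\delta}\,\ddot{D}_{it}\big) = 0
$$

Demeaned variables sum to zero within each unit, so $D_{it}$ can be replaced by $\ddot{D}_{it}$ in this expression, which yields

$$
\widehat{\delta} = \frac{\sum_{i=1}^N\sum_{t=1}^T \ddot{D}_{it}\,\ddot{Y}_{it}}{\sum_{i=1}^N\sum_{t=1}^T \ddot{D}_{it}^2}
$$

This is the OLS slope from regressing $\ddot{Y}_{it}$ on $\ddot{D}_{it}$.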

Running a regression with the time-demeaned variables $\ddot{Y}_{it}\equiv Y_{it} - \overline{Y}_i$ and $\ddot{D}_{it} \equiv D_{it} - \overline{D}_i$ is *numerically equivalent* to a regression of $Y_{it}$ on $D_{it}$ and unit-specific dummy variables. This is why the estimator is sometimes called the “within” estimator and sometimes the “fixed effects” estimator; they are the same thing. When year fixed effects are also included, it is called the “twoway fixed effects” estimator.

Where’d the unobserved heterogeneity go?! It was deleted when we time-demeaned the data. And as we said, including individual fixed effects does this time demeaning automatically, so you don’t have to go to the trouble of doing it manually.
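The numerical equivalence between the two approaches can be checked on simulated data. The sketch below uses hypothetical values throughout (`N`, `T`, `delta`, and the 0.8 loading of treatment on $u_i$ are illustrative assumptions, not taken from the survey data) and relies only on NumPy.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, delta = 50, 4, 1.0  # hypothetical panel size and true effect

u = rng.normal(size=N)                                 # unobserved heterogeneity u_i
D = 0.8 * u[:, None] + rng.normal(size=(N, T))         # treatment correlated with u_i
Y = delta * D + u[:, None] + rng.normal(size=(N, T))

# (1) Within estimator: subtract each unit's time mean, then run OLS
#     of demeaned Y on demeaned D (no intercept needed after demeaning).
D_dd = D - D.mean(axis=1, keepdims=True)
Y_dd = Y - Y.mean(axis=1, keepdims=True)
within_delta = (D_dd * Y_dd).sum() / (D_dd ** 2).sum()

# (2) Dummy-variable regression: D plus one indicator column per unit.
#     Rows are ordered unit by unit, matching D.ravel().
dummies = np.kron(np.eye(N), np.ones((T, 1)))          # shape (N*T, N)
X = np.column_stack([D.ravel(), dummies])
lsdv_delta = np.linalg.lstsq(X, Y.ravel(), rcond=None)[0][0]

print(f"within: {within_delta:.6f}   dummy variables: {lsdv_delta:.6f}")
```

The two estimates agree up to floating-point error, even though the first regression never estimates the unit effects explicitly.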

## Data Exercise: Survey of Adult Service Providers

Next I’d like to introduce an exercise based on data collection for my own research: a survey of sex workers. You may or may not know this, but the Internet has had a profound effect on sex markets. It has moved sex work indoors while simultaneously breaking the traditional link between sex workers and pimps. It has increased safety and anonymity, too, which has drawn new entrants into the market. The marginal sex worker has more education and better outside options than traditional US sex workers (Cunningham and Kendall 2011, 2014, 2016). The Internet, in sum, caused the marginal sex worker to shift towards women more sensitive to detection, harm, and arrest.

In 2008 and 2009, I surveyed (with Todd Kendall) approximately 700 US Internet-mediated sex workers. The survey was a basic labor-market survey; I asked them about their illicit and legal labor-market experiences, and about demographics. The survey had two parts: a “static” provider-specific section and a “panel” section. The panel section asked respondents to share information about each of the previous four sessions with clients.

I have created a shortened version of the data set and uploaded it to Github. It includes a few time-invariant provider characteristics, such as race, age, marital status, years of schooling, and body mass index, as well as several time-variant session-specific characteristics including the log of the hourly price, the log of the session length (in hours), characteristics of the client himself, whether a condom was used in any capacity during the session, whether the client was a “regular,” etc.

In this exercise, you will estimate three types of models: a pooled OLS model, a fixed effects (FE) model, and a demeaned OLS model. The models take the following form:

$$
   Y_{is} = \beta X_i + \gamma Z_{is} + u_i + \varepsilon_{is}
$$
$$
   \ddot{Y}_{is} = \gamma \ddot{Z}_{is} + \ddot{\eta}_{is}
$$
where $u_i$ is both unobserved and correlated with $Z_{is}$.


The first regression model will be estimated with pooled OLS, and the second will be estimated in two ways: with fixed effects and with OLS on demeaned data. In other words, I’m going to have you estimate the model with individual fixed effects using canned routines (here, Python’s `statsmodels`), as well as demean the data manually and estimate the demeaned regression using OLS.

First notice that the second regression has a different notation on the dependent and independent variable; it represents the fact that the variables are columns of demeaned variables. Thus $\ddot{Y}_{is} = Y_{is} - \overline{Y}_i$.
 
Second, notice that the time-invariant $X_i$ variables are missing from the second equation. Do you understand why that is the case? These variables have also been demeaned, but since the demeaning is across time, and since these time-invariant variables do not change over time, the demeaning deletes them from the expression. Notice also that the unobserved individual-specific heterogeneity, $u_i$, has disappeared. It has disappeared for the same reason that the $X_i$ terms are gone: the mean of $u_i$ over time is itself, and thus the demeaning deletes it.
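A toy example makes the point concrete. The data frame below is entirely made up for illustration (two hypothetical providers with three sessions each); `schooling` plays the role of a time-invariant $X_i$ and `lnw` the role of $Y_{is}$.

```python
import pandas as pd

# Hypothetical mini-panel: two providers, three sessions each
toy = pd.DataFrame({
    "id":        [1, 1, 1, 2, 2, 2],
    "schooling": [14, 14, 14, 16, 16, 16],          # constant within provider, like X_i
    "lnw":       [5.0, 5.5, 6.0, 4.0, 4.5, 5.0],    # varies across sessions, like Y_is
})

cols = ["schooling", "lnw"]
# Subtract each provider's own mean from every one of that provider's rows
demeaned = toy[cols] - toy.groupby("id")[cols].transform("mean")
print(demeaned)
```

The demeaned `schooling` column is zero in every row, so it carries no variation left to estimate a coefficient from, while the demeaned `lnw` column keeps only the within-provider variation.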


Let’s examine these models:

In [1]:
import numpy as np 
import pandas as pd 
import statsmodels.api as sm 
import statsmodels.formula.api as smf 
from itertools import combinations 
import plotnine as p

In [2]:
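# read data
import ssl
# Workaround kept from the original notebook: disables HTTPS certificate verification
ssl._create_default_https_context = ssl._create_unverified_context
def read_data(file): 
    """Load a Stata file from the mixtape GitHub repository into a DataFrame."""
    return pd.read_stata("https://github.com/scunning1975/mixtape/raw/master/" + file)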



sasp = read_data("sasp_panel.dta")

In [3]:
#-- Delete all NA
sasp = sasp.dropna().copy()

#-- order by id
sasp.sort_values('id', inplace=True)


In [4]:
sasp

| | id | session | age | age_cl | appearance_cl | bmi | schooling | asq_cl | provider_second | asian_cl | ... | hispanic | other | white | asq | cohab | married | divorced | separated | nevermarried | widowed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 462 | 1.0 | 4.0 | 23.0 | 61.0 | 4.0 | 20.482830 | 14.0 | 3721.00 | 1. No | 0.0 | ... | 0.0 | 1.0 | 0.0 | 529.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 460 | 1.0 | 2.0 | 23.0 | 33.0 | 6.0 | 20.482830 | 14.0 | 1089.00 | 1. No | 1.0 | ... | 0.0 | 1.0 | 0.0 | 529.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 459 | 1.0 | 1.0 | 23.0 | 46.0 | 5.0 | 20.482830 | 14.0 | 2116.00 | 1. No | 0.0 | ... | 0.0 | 1.0 | 0.0 | 529.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1659 | 6.0 | 3.0 | 29.0 | 45.0 | 4.0 | 30.893555 | 16.0 | 2025.00 | 1. No | 0.0 | ... | 0.0 | 0.0 | 1.0 | 841.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1660 | 6.0 | 1.0 | 29.0 | 32.5 | 6.0 | 30.893555 | 16.0 | 1056.25 | 1. No | 0.0 | ... | 0.0 | 0.0 | 1.0 | 841.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 312 | 688.0 | 3.0 | 21.0 | 35.0 | 5.0 | 18.559458 | 14.0 | 1225.00 | 1. No | 0.0 | ... | 0.0 | 0.0 | 1.0 | 441.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1626 | 690.0 | 3.0 | 37.0 | 35.0 | 5.0 | 19.366392 | 14.0 | 1225.00 | 1. No | 0.0 | ... | 0.0 | 0.0 | 1.0 | 1369.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1627 | 690.0 | 2.0 | 37.0 | 30.0 | 6.0 | 19.366392 | 14.0 | 900.00 | 1. No | 0.0 | ... | 0.0 | 0.0 | 1.0 | 1369.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1628 | 690.0 | 4.0 | 37.0 | 45.0 | 8.0 | 19.366392 | 14.0 | 2025.00 | 1. No | 0.0 | ... | 0.0 | 0.0 | 1.0 | 1369.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


In [5]:
#-- Balance Data: keep only providers observed in every session
times = len(sasp.session.unique())
in_all_times = sasp.groupby('id')['session'].apply(lambda x : len(x)==times).reset_index()
in_all_times.rename(columns={'session':'in_all_times'}, inplace=True)
balanced_sasp = pd.merge(in_all_times, sasp, how='left', on='id')
# .copy() avoids a SettingWithCopyWarning when columns are assigned below
balanced_sasp = balanced_sasp[balanced_sasp.in_all_times].copy()
balanced_sasp.shape

# Recode provider_second from its Stata labels ("1. No" / "2. Yes") to a 0/1 indicator
provider_second = np.zeros(balanced_sasp.shape[0])
provider_second[balanced_sasp.provider_second == "2. Yes"] = 1
balanced_sasp.provider_second = provider_second

In [6]:
balanced_sasp

| | id | in_all_times | session | age | age_cl | appearance_cl | bmi | schooling | asq_cl | provider_second | ... | hispanic | other | white | asq | cohab | married | divorced | separated | nevermarried | widowed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 6.0 | True | 3.0 | 29.0 | 45.0 | 4.0 | 30.893555 | 16.0 | 2025.00 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 841.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 6.0 | True | 1.0 | 29.0 | 32.5 | 6.0 | 30.893555 | 16.0 | 1056.25 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 841.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 6.0 | True | 2.0 | 29.0 | 30.0 | 8.0 | 30.893555 | 16.0 | 900.00 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 841.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 6.0 | True | 4.0 | 29.0 | 21.0 | 6.0 | 30.893555 | 16.0 | 441.00 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 841.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 8.0 | True | 3.0 | 25.0 | 37.0 | 5.0 | 22.886999 | 14.0 | 1369.00 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 625.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1485 | 684.0 | True | 2.0 | 28.0 | 30.0 | 6.0 | 27.435720 | 12.0 | 900.00 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 784.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1495 | 690.0 | True | 3.0 | 37.0 | 35.0 | 5.0 | 19.366392 | 14.0 | 1225.00 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 1369.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1496 | 690.0 | True | 2.0 | 37.0 | 30.0 | 6.0 | 19.366392 | 14.0 | 900.00 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 1369.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1497 | 690.0 | True | 4.0 | 37.0 | 45.0 | 8.0 | 19.366392 | 14.0 | 2025.00 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 1369.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


In [7]:
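#-- Demean Data: subtract each provider's mean across sessions from every variable
features = balanced_sasp.columns.to_list()
features = [x for x in features if x not in ['session', 'id', 'in_all_times']]
demean_features = [f"demean_{x}" for x in features]

# x.mean() gives column-wise means within each id; np.mean(x) on a DataFrame
# may instead collapse to a single scalar in recent pandas versions
balanced_sasp[demean_features] = balanced_sasp.groupby('id', group_keys=False)[features].apply(lambda x : x - x.mean())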

In [9]:
#-- Pooled OLS

formula = """lnw ~ age + asq + bmi + hispanic + black + other + asian + schooling + cohab + 
            married + divorced + separated + age_cl + unsafe + llength + reg + asq_cl + 
            appearance_cl + provider_second + asian_cl + black_cl + hispanic_cl + 
           othrace_cl + hot + massage_cl"""
ols = sm.OLS.from_formula(formula, data=balanced_sasp).fit()
ols.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | lnw | R-squared: | 0.303 |
| Model: | OLS | Adj. R-squared: | 0.285 |
| Method: | Least Squares | F-statistic: | 17.39 |
| Date: | Thu, 08 Feb 2024 | Prob (F-statistic): | 3.97e-62 |
| Time: | 15:08:18 | Log-Likelihood: | -570.0 |
| No. Observations: | 1028 | AIC: | 1192.0 |
| Df Residuals: | 1002 | BIC: | 1320.0 |
| Df Model: | 25 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 7.0627 | 0.316 | 22.385 | 0.000 | 6.444 | 7.682 |
| age | 0.0028 | 0.012 | 0.235 | 0.814 | -0.020 | 0.026 |
| asq | -0.0001 | 0.000 | -0.828 | 0.408 | -0.000 | 0.000 |
| bmi | -0.0217 | 0.002 | -9.296 | 0.000 | -0.026 | -0.017 |
| hispanic | -0.2259 | 0.091 | -2.472 | 0.014 | -0.405 | -0.047 |
| black | 0.0284 | 0.075 | 0.379 | 0.705 | -0.119 | 0.175 |
| other | -0.1116 | 0.061 | -1.838 | 0.066 | -0.231 | 0.008 |
| asian | 0.0862 | 0.154 | 0.559 | 0.576 | -0.216 | 0.389 |
| schooling | 0.0198 | 0.010 | 1.997 | 0.046 | 0.000 | 0.039 |

| | | | |
|---|---|---|---|
| Omnibus: | 62.662 | Durbin-Watson: | 1.095 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 115.62 |
| Skew: | 0.425 | Prob(JB): | 7.82e-26 |
| Kurtosis: | 4.405 | Cond. No. | 64900.0 |


In [10]:
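#-- Fixed Effects
# C(id) adds one dummy per provider. The time-invariant regressors (age, bmi, schooling, ...)
# are collinear with these dummies, so their coefficients are not separately identified.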

balanced_sasp['y'] = balanced_sasp.lnw

formula = """lnw ~ -1 + C(id) + age + asq + bmi + hispanic + black + other + asian + schooling + 
                      cohab + married + divorced + separated + 
                      age_cl + unsafe + llength + reg + asq_cl + appearance_cl + 
                      provider_second + asian_cl + black_cl + hispanic_cl + 
                      othrace_cl + hot + massage_cl"""

ols = sm.OLS.from_formula(formula, data=balanced_sasp).fit(cov_type='cluster', 
                                                           cov_kwds={'groups': balanced_sasp['id']})
ols.summary()  

| | | | |
|---|---|---|---|
| Dep. Variable: | lnw | R-squared: | 0.832 |
| Model: | OLS | Adj. R-squared: | 0.773 |
| Method: | Least Squares | F-statistic: | |
| Date: | Thu, 08 Feb 2024 | Prob (F-statistic): | |
| Time: | 15:08:38 | Log-Likelihood: | 162.25 |
| No. Observations: | 1028 | AIC: | 215.5 |
| Df Residuals: | 758 | BIC: | 1548.0 |
| Df Model: | 269 | | |
| Covariance Type: | cluster | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| C(id)[6.0] | -0.4465 | 0.035 | -12.683 | 0.000 | -0.515 | -0.377 |
| C(id)[8.0] | 0.3310 | 0.027 | 12.240 | 0.000 | 0.278 | 0.384 |
| C(id)[10.0] | 1.0513 | 0.040 | 26.539 | 0.000 | 0.974 | 1.129 |
| C(id)[11.0] | -0.5627 | 0.026 | -21.608 | 0.000 | -0.614 | -0.512 |
| C(id)[18.0] | 0.5518 | 0.034 | 16.025 | 0.000 | 0.484 | 0.619 |
| C(id)[23.0] | -0.0312 | 0.038 | -0.827 | 0.408 | -0.105 | 0.043 |
| C(id)[25.0] | -0.1525 | 0.035 | -4.403 | 0.000 | -0.220 | -0.085 |
| C(id)[29.0] | 1.6517 | 0.068 | 24.285 | 0.000 | 1.518 | 1.785 |
| C(id)[31.0] | -0.0586 | 0.020 | -2.953 | 0.003 | -0.098 | -0.020 |

| | | | |
|---|---|---|---|
| Omnibus: | 79.779 | Durbin-Watson: | 2.528 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 361.895 |
| Skew: | 0.172 | Prob(JB): | 2.60e-79 |
| Kurtosis: | 5.886 | Cond. No. | 2.61e+21 |
