# ECON 490: Difference in Differences (18)

## Prerequisites

1. Run OLS regressions.
2. Run panel data regressions.

## Learning Outcomes

1. Understand the parallel trends (PT) assumption.
2. Run the corresponding OLS regression that retrieves the causal estimand.
3. Implement these regressions in the two-period case and in multiple time periods (a.k.a. event studies).
4. Conduct a test on the plausibility of the PT assumption whenever there is more than one pre-treatment period.

## 18.1 What is the Relationship with Panel Data? 

Difference-in-differences is a **research design** that relies on the use of multiple (at least two) time periods. In typical cross-sectional settings (i.e. where the variables are all measured at a single point in time) it is hard to defend a `selection on observables` assumption. Panel data, however, allows us to control for unobserved time-invariant heterogeneity.

Consider the following example. Earnings $y_{it}$ of worker $i$ at time $t$ can be split into two components:

$$
y_{it} = e_{it} + \alpha_{i}
$$

where $\alpha_i$ is a measure of worker quality and $e_{it}$ is the part of earnings not explained by $\alpha_i$. This says that a low-quality worker (low $\alpha_i$) will receive lower earnings *at any time period*. Notice that worker quality is typically unobserved and is usually part of our error term (the one that we don't want to be correlated with treatment!). In many cases, this time-invariant heterogeneity is the cause of endogeneity bias. In this example, it can be that workers who attend a training program also tend to be the ones that perform poorly at their job and `select` into this program.

However, notice that if we take time differences, we get rid of this heterogeneity. Suppose we subtract earnings at time $0$ from earnings at time $1$:

$$
y_{i1} - y_{i0} =   e_{i1} - e_{i0}
$$

where our new equation no longer depends on $\alpha_i$! However, our model now has *changes* rather than levels. This is going to be the trick used implicitly throughout this module.
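The cancellation can be checked numerically. The following is a minimal sketch on simulated data, not on the fake data set: the variable names (`alpha`, `e`, `y`), the seed, and the sample size are all assumptions made for illustration.

```stata
* Sketch: simulated two-period panel (all names and values are hypothetical)
clear
set seed 490
set obs 500
gen workerid = _n
gen double alpha = rnormal()              // unobserved, time-invariant worker quality
expand 2                                  // two rows per worker
bysort workerid: gen t = _n - 1           // t = 0, 1
gen double e = rnormal()                  // time-varying part of earnings
gen double y = e + alpha                  // y_it = e_it + alpha_i

xtset workerid t
gen double dy = D.y                       // y_i1 - y_i0 (missing at t = 0)
gen double de = D.e                       // e_i1 - e_i0 (missing at t = 0)

* The two differences coincide, so alpha_i has dropped out
assert abs(dy - de) < 1e-10 if t == 1
```

The `assert` passes silently when the differenced outcome equals the differenced residual for every worker, which is exactly the statement that $\alpha_i$ no longer appears.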

## 18.2 Potential Outcomes Framework and Causality 

Suppose there is a binary treatment denoted $D_i$ (whether worker $i$ enrolled in training at some point) and an outcome $y_{it}$ (earnings at time $t$). For simplicity, let's assume there are only 2 time periods: $t=0$ and $t=1$.  We will denote $y_{it}(d)$ as the earnings of a worker at time $t$ *had they received treatment equal to $D_i=d$*. For instance, for a worker who enrolled in the training program, we observe $y_{i1}(1)$ because she indeed has $D=1$; however, we will never observe the case where she has $y_{i1}(0)$. Notice that this already provides a notion of *treatment effects*: 

$$
\text{Individual Treatment Effects:   } y_{i1}(1) - y_{i1}(0)
$$

These individual treatment effects are by definition unobservable! To see this, notice that $y_{i1}(0)$ for those who are treated, $D_i=1$, is unobserved. At the same time, $y_{i1}(1)$ for those who were not treated, $D_i=0$, is also unobservable.

Despite this, there might be a way to know something about some notion of mean effects:

$$
\text{Average Treatment Effects (ATE):   } E[y_{i1}(1) - y_{i1}(0)]
$$

$$
\text{Average Treatment Effects on the Treated (ATT):   } E[y_{i1}(1) - y_{i1}(0) \mid D_i=1]
$$

$$
\text{Conditional Average Treatment Effects on the Treated (CATT):   } E[y_{i1}(1) - y_{i1}(0) \mid D_i=1, X_i]
$$


Differences-in-differences is a research design that will tackle the notion of average effects on treated units. It will focus on the population that did receive treatment $D_i$ and will attempt to *impute* what would be the average outcomes had they not been treated $E[y_{i1}(0) \mid D_i=1]$ or $E[y_{i1}(0) \mid D_i=1, X_i]$.

In the first year $t=0$ no one was treated, so that 
$$
y_{i0} = y_{i0}(0) =  y_{i0}(1)
$$

whereas in period $t=1$ some workers get treatment
$$
y_{i1} = y_{i1}(1) D_i +  y_{i1}(0) (1-D_i) \tag{1}
$$

The last equation is equivalent to saying that, at $t=1$, if worker $i$ had $D_i=1$ we observe $y_{i1}(1)$, and otherwise we observe $y_{i1}(0)$.
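To make these objects concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration (the variable names, the constant effect of 0.2, and the selection rule); in real data we never see both potential outcomes for the same worker.

```stata
* Sketch: potential outcomes at t = 1 on simulated data (hypothetical values)
clear
set seed 182
set obs 10000
gen double alpha = rnormal()                     // unobserved worker quality
gen D = alpha + rnormal() < 0                    // low-quality workers select into training
gen double y1_0 = 2 + alpha + rnormal(0, 0.5)    // y_i1(0): earnings without training
gen double y1_1 = y1_0 + 0.2                     // y_i1(1): assumed constant effect of 0.2
gen double y1 = y1_1*D + y1_0*(1 - D)            // observed outcome, as in equation (1)

gen double ite = y1_1 - y1_0                     // individual effect (unobservable in practice)
summarize ite                                    // ATE
summarize ite if D == 1                          // ATT

* Naive comparison of observed means mixes the effect with selection on alpha
regress y1 D
```

Because treated workers in this simulation have lower $\alpha_i$ on average, the coefficient on `D` in the last regression is contaminated by selection and should not be read as the ATT.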

#### 18.2.1 Parallel trends

The **parallel trends assumption (PT)** states that the expected change in untreated potential outcomes between period 0 and period 1 is the same for treated and untreated people. If we remember basic algebra, $y_{i1}(0) - y_{i0}(0)$ is the slope of a line between the two periods. Therefore, the PT assumption states that, in the absence of treatment, the trends of the treated and the untreated between time periods 0 and 1 would have been the same.


$$
E[  y_{i1}(0) - y_{i0}(0) \mid D_i=1 ] = E[  y_{i1}(0) - y_{i0}(0) \mid D_i=0 ] \tag{2}
$$


The **conditional *parallel trends assumption* (PT)** states the same, after conditioning on observables $X_i$. We use the conditional PT when we are unable to state that regardless of the controls, the trends of both the treated and untreated between time period 0 and 1 are the same. This assumption gives us flexibility to say that, conditional on a characteristic of the people (for example, if they are male or female), the trends of both the treated and untreated between time period 0 and 1 are the same.

$$
E[  y_{i1}(0) - y_{i0}(0) \mid D_i=1 , X_i] = E[  y_{i1}(0) - y_{i0}(0) \mid D_i=0, X_i ] \tag{3}
$$

Now, we can see that using the parallel trends assumption allows us to solve for the unknown counterfactual quantity. For instance, we can rearrange equation 2 into

$$
E[  y_{i1}(0) \mid D_i=1 ] = E[y_{i0}(0) \mid D_i=1 ] + E[  y_{i1}(0) - y_{i0}(0) \mid D_i=0 ]
$$

which can be substituted into the ATT definition to *impute* the missing counterfactual, giving

$$
\begin{align}
E[y_{i1}(1) - y_{i1}(0) \mid D_i=1] &= E[y_{i1}(1) \mid D_i=1] - \left( \underbrace{ E[y_{i0}(0) \mid D_i=1 ] + E[  y_{i1}(0) - y_{i0}(0) \mid D_i=0 ] }_{\text{from previous equation}}  \right) \\
&= \left(E[y_{i1}(1) \mid D_i=1] - E[y_{i0}(0) \mid D_i=1 ] \right) - \left( E[  y_{i1}(0) - y_{i0}(0) \mid D_i=0 ]
 \right),
\end{align}
$$ 

where we ended up with a difference in trends between treated units and control units. That's where this design gets its name from! More importantly, notice that we never introduced a linear model in any part of this argument. That is why we say that this is a non-parametric result.
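As a purely hypothetical numerical illustration (the numbers are made up and do not come from the fake data set), suppose average log earnings among treated workers are 10.0 at $t=0$ and 10.5 at $t=1$, while among untreated workers they are 10.2 at $t=0$ and 10.5 at $t=1$. Under parallel trends, the imputed counterfactual for the treated at $t=1$ is $10.0 + (10.5 - 10.2) = 10.3$, so that

$$
\begin{align}
\text{ATT} &= (10.5 - 10.0) - (10.5 - 10.2) \\
&= 0.5 - 0.3 \\
&= 0.2
\end{align}
$$

Note that a simple comparison of treated and untreated workers at $t=1$ would give $10.5 - 10.5 = 0$, because it ignores that the treated started from a lower level.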

However, we know from [Module 12](econometrics/econ490-stata/12_Linear_Reg.ipynb) that OLS regression retrieves a weighted mean of the outcome, which means we can carefully construct an OLS regression that gives us this desired quantity.

## 18.3 Difference-in-Differences and Regression

Whenever we talk about difference-in-differences, we refer to a **research design** that relies on some version of the parallel trend assumption. To connect this design to regression, we need to first build a model. To begin, we will assume a case where no control variables are involved. We typically rely on a linear model of the form: 

$$
y_{it}(0) =  \lambda_t + \alpha_i + e_{it},
$$

which combined with equation (1) yields the following model: 

$$
\begin{align}
y_{i1} &= y_{i1}(1) D_i +  y_{i1}(0) (1-D_i) \\
&= \underbrace{ \left(y_{i1}(1) -y_{i1}(0)\right)}_\text{assume constant}  D_i  +  y_{i1}(0) \\
&= \beta D_i  +  \lambda_1 + \alpha_i + e_{i1}
\end{align}
$$

In period $t=0$ we know nobody was treated, so we can combine both time periods into a single model: 

$$
y_{it} = \beta D_i \mathbf{1}\{t=1\}  +  \lambda_t + \alpha_i + e_{it} \tag{4}
$$ 


where $\beta$ provides the average treatment effect (on the treated) at period $t=1$ (i.e. the effect activates for those with $D_i=1$ and at $t=1$). The $\alpha_i$ and $\lambda_t$ are also known as fixed effects of unit and time, respectively. 

You may notice that Equation 4 does not have an intercept. The reason for this is that we are including a set of dummy variables for every unit $i$, which would be perfectly collinear with the intercept because they all sum to 1. How does this approach relate to the previous section? Let's answer this algebraically:

$$
\begin{align}
E[  y_{i0}(0) \mid D_i=0 ] &= \lambda_0 + \alpha_i +  E[ e_{i0} \mid D_i=0] \\
E[  y_{i1}(0) \mid D_i=0 ] &= \lambda_1 + \alpha_i +  E[ e_{i1} \mid D_i=0] \\
\implies E[  y_{i1}(0) \mid D_i=0 ] &- E[  y_{i0}(0) \mid D_i=0 ]   = \lambda_1 - \lambda_0 +  E[ e_{i1} - e_{i0} \mid D_i=0] 
\end{align}
$$

Similarly, 

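$$
\begin{align}
E[  y_{i0} \mid D_i=1 ] &= \lambda_0 + \alpha_i +  E[ e_{i0} \mid D_i=1] \\
E[  y_{i1} \mid D_i=1 ] &= \beta + \lambda_1 + \alpha_i +  E[ e_{i1} \mid D_i=1] \\
\implies E[  y_{i1} \mid D_i=1 ] &- E[  y_{i0} \mid D_i=1 ]   = \beta + \lambda_1 - \lambda_0 +  E[ e_{i1} - e_{i0} \mid D_i=1] 
\end{align}
$$

Here $y_{it}$ is the observed outcome from Equation 4: for treated workers $y_{i0} = y_{i0}(0)$ and $y_{i1} = y_{i1}(1)$, which is why $\beta$ appears at $t=1$, while for untreated workers $y_{it} = y_{it}(0)$ in both periods.

Then by subtracting both results (i.e. doing a double difference), and using parallel trends so that $E[ e_{i1} - e_{i0} \mid D_i=1] = E[ e_{i1} - e_{i0} \mid D_i=0]$, we obtain the desired parameter of interest:

$$
\left( E[  y_{i1} \mid D_i=1 ] - E[  y_{i0} \mid D_i=1 ]  \right) - \left( E[  y_{i1} \mid D_i=0 ] - E[  y_{i0} \mid D_i=0 ] \right) = \beta
$$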

In [None]:
clear* 

use fake_data, clear 

For this module we will be using the fake data data set. Recall that this data is simulating information of workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings. We'll keep one year prior and one year after the program, to keep things consistent with the previous section.

In [None]:
keep if year==2002 | year==2003

In [None]:
gen logearn = log(earnings)

Recall that $\alpha_i$ and $\lambda_t$ are also known as fixed effects, and we can use either the command `areg` or `reghdfe` to run this type of regression. In either case, the command requires an extra argument indicating what fixed effects are being used.

In [None]:
* Interaction of the treated group with the 2003 dummy; worker fixed effects are absorbed
areg logearn 1.treated#2003.year i.year, absorb(workerid)

This says that *on average* workers who entered the program received approximately 20 percent higher earnings relative to a counterfactual scenario where they never entered the program (which in this case is captured by the control units). How did we get this interpretation? Recall that OLS estimates are interpreted as the effect of a 1 unit increase in the independent variable: a 1 unit increase of $D_i \mathbf{1}\{t=1\}$ corresponds to those who started receiving treatment at $t=1$. Furthermore, the dependent variable is in log scale, so a 0.2 increase corresponds to an increase in earnings of roughly 20 percent (the exact change is $e^{0.2} - 1 \approx 22$ percent).
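To see that the regression really is a double difference, the sketch below computes the four group means by hand and compares the result with the `areg` coefficient. It runs on simulated data rather than the fake data set; the variable `post`, the true effect of 0.2, and all other numbers are assumptions chosen for illustration.

```stata
* Sketch: 2x2 difference-in-differences by hand (simulated data, hypothetical values)
clear
set seed 18
set obs 1000
gen workerid = _n
gen treated = mod(workerid, 2)
gen double alpha = rnormal() - 0.5*treated       // treated workers have lower quality
expand 2
bysort workerid: gen year = 2001 + _n            // 2002 and 2003
gen post = year == 2003
gen double logearn = 0.2*treated*post + 0.05*post + alpha + rnormal(0, 0.1)

* Four group means: m`d'`p' is the mean for treated == d and post == p
foreach d in 0 1 {
    foreach p in 0 1 {
        quietly summarize logearn if treated == `d' & post == `p'
        scalar m`d'`p' = r(mean)
    }
}
scalar did = (m11 - m10) - (m01 - m00)

* Same number from the fixed effects regression
quietly areg logearn 1.treated#1.post i.year, absorb(workerid)
display "DiD of means:     " scalar(did)
display "areg coefficient: " _b[1.treated#1.post]
```

In a balanced two-period panel the two numbers agree up to rounding, even though the treated group has a lower average $\alpha_i$ by construction.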

#### 18.3.1  Adding covariates 

The first thing to notice about covariates is that our regression specification in Equation 4 involves $\alpha_i$. This will absorb every characteristic that is fixed over time. For instance, adding characteristics such as sex and race will end up in those variables being omitted from the regression due to perfect collinearity. 

This means that we can add covariates to the extent that they are time varying by nature (e.g. tenure, experience) or are trends based on fixed characteristics (e.g. time dummies interacted with sex). We refer to the latter as covariate-specific trends. 

Algebraically, the idea is very similar. We will model the untreated potential outcomes as:

$$
y_{it}(0) = \gamma_t X_{i} + \Gamma X_{it} + \lambda_t + \alpha_i + e_{it},
$$

where $\gamma_t X_{i}$ captures things such as earnings trends over time that differ between males and females, while $\Gamma X_{it}$ captures characteristics that evolve over time such as labour market experience. You may wonder why we split these two different ways to capture covariates. The reason for this is that we can have a version of the parallel trends assumption like Equation 3 that depends on time-invariant characteristics $X_i$ or includes time-varying characteristics $X_{it}$. If it includes both time-invariant and time-varying characteristics, the whole algebraic procedure works the same way as in the previous section.

If our parallel trends assumption depends only on $X_i$ that do not change over time, notice how the algebra changes:

$$
\begin{align}
E[  y_{i0}(0) \mid D_i=0, X_i ] &= \gamma_0 X_{i} + \Gamma E[X_{i0} \mid D_i=0, X_i] +  \lambda_0 + \alpha_i +  E[ e_{i0} \mid D_i=0] \\
E[  y_{i1}(0) \mid D_i=0, X_i ] &= \gamma_1 X_{i} + \Gamma E[X_{i1} \mid D_i=0, X_i]+ \lambda_1 + \alpha_i +  E[ e_{i1} \mid D_i=0] \\
\implies E[  y_{i1}(0) \mid D_i=0, X_i ] &- E[  y_{i0}(0) \mid D_i=0, X_i]   = (\gamma_1 - \gamma_0)X_i  + \Gamma E[X_{i1} - X_{i0} \mid D_i=0, X_i] + \lambda_1 - \lambda_0 +  E[ e_{i1} - e_{i0} \mid D_i=0] 
\end{align}
$$

Similarly, 

$$
\begin{align}
E[  y_{i0} \mid D_i=1, X_i ] &= \gamma_0 X_{i} + \Gamma E[X_{i0} \mid D_i=1, X_i] +  \lambda_0 + \alpha_i +  E[ e_{i0} \mid D_i=1] \\
E[  y_{i1} \mid D_i=1, X_i ] &= \beta + \gamma_1 X_{i} + \Gamma E[X_{i1} \mid D_i=1, X_i]+ \lambda_1 + \alpha_i +  E[ e_{i1} \mid D_i=1] \\
\implies E[  y_{i1} \mid D_i=1, X_i ] &- E[  y_{i0} \mid D_i=1, X_i]   = \beta + (\gamma_1 - \gamma_0)X_i  + \Gamma E[X_{i1} - X_{i0} \mid D_i=1, X_i] + \lambda_1 - \lambda_0 +  E[ e_{i1} - e_{i0} \mid D_i=1] 
\end{align}
$$

Here $y_{it}$ denotes the observed outcome of treated workers, which equals $y_{i0}(0)$ at $t=0$ and $y_{i1}(1)$ at $t=1$.

If we subtract both of these results from each other, we get $\beta$ only if there is a parallel trends condition on the time-varying covariates. That is,

$$
E[X_{i1} - X_{i0} \mid D_i=1, X_i] = E[X_{i1} - X_{i0} \mid D_i=0, X_i]
$$

In practice, applied econometricians tend to include many time-varying characteristics as control variables. Hopefully this section has shed some light on the implicit assumptions that must hold when we interpret our parameter of interest.
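A regression with both kinds of covariates could look like the sketch below. It uses simulated data, and the variables `female` (a time-invariant $X_i$) and `exper` (a time-varying $X_{it}$), as well as every coefficient value, are hypothetical choices made for illustration. By construction, `exper` evolves in the same way in both groups, so the parallel trends condition on the covariate holds.

```stata
* Sketch: covariate-specific trend plus a time-varying control (simulated, hypothetical)
clear
set seed 1831
set obs 1000
gen workerid = _n
gen treated = mod(workerid, 2)
gen female = runiform() < 0.5                    // time-invariant characteristic X_i
gen double alpha = rnormal()
expand 2
bysort workerid: gen year = 2001 + _n            // 2002 and 2003
gen post = year == 2003
gen double exper = 5 + post*runiform()           // time-varying X_it, same process in both groups
gen double logearn = 0.2*treated*post + 0.05*post + 0.03*female*post
replace logearn = logearn + 0.1*exper + alpha + rnormal(0, 0.1)

* female alone would be absorbed by the worker fixed effect;
* its interaction with the 2003 dummy is a covariate-specific trend
areg logearn 1.treated#1.post 2003.year#1.female c.exper i.year, absorb(workerid)
```

Adding `i.female` on its own would only produce an omitted-variable note, since it does not vary within worker.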

## 18.4 Multiple Time Periods

#### 18.4.1 Two-Way fixed effects 

A very natural approach to extending this to multiple time periods is to attempt to get the average effect across all post-treatment time periods (i.e. maybe the effects of the training program decay over time, but we are interested in the average over time). We may think of maintaining the parallel trends assumption in a model like this: 

$$
y_{it} = \beta D_i \mathbf{1}\{t\geq 1\}  + \lambda_t + \alpha_i + e_{it}
$$ 

where $\beta$ now corresponds to all time periods following treatment, $t\geq 1$. Some people rename $D_i \mathbf{1}\{t\geq 1\}$ to $D_{it}$, where $D_{it}$ is simply a variable that takes the value 0 before any treatment and 1 for those who are being treated at that particular time $t$. This is known as the *Two-way Fixed Effects Model*. It receives this name because we are including unit fixed effects and time fixed effects alongside our treatment status.

In [None]:
clear* 

use fake_data, clear 
gen logearn = log(earnings)

gen post2003 = year>=2003

In [None]:
areg logearn 1.treated#1.post2003 i.year, absorb(workerid)

In this fake data set, everyone either starts treatment at year 2003 or does not enter the program at all. However, when there is variation in the timing of the program (i.e. people entering the training program earlier than others), regression using this model may fail to capture the true parameter of interest. For a reference, see this [paper](https://www.sciencedirect.com/science/article/abs/pii/S0304407621001445).

The results say that a 1 unit increase in $D_i \mathbf{1}\{t\geq 1\}$ corresponds to a 0.07 increase in log-earnings *on average*. That 1 unit increase only occurs for treated workers from 2003 onwards. Given that the outcome is in a log scale, we interpret these results approximately in percent terms. Therefore, the coefficient of interest says that those who start treatment receive roughly a 7 percent increase in earnings, averaged over the post-treatment years.
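The same two-way fixed effects estimate can also be obtained with `xtreg, fe`. The sketch below illustrates this on simulated data; the variable `D_it`, the true effect of 0.07, and the choice to cluster standard errors by worker are assumptions made here, not something taken from the regressions above.

```stata
* Sketch: two-way fixed effects with several periods (simulated, hypothetical values)
clear
set seed 1841
set obs 400
gen workerid = _n
gen treated = mod(workerid, 2)
gen double alpha = rnormal()
expand 6
bysort workerid: gen year = 1999 + _n            // 2000 to 2005
gen D_it = treated == 1 & year >= 2003           // D_i * 1{t >= 2003}
gen double logearn = 0.07*D_it + 0.02*(year - 2000) + alpha + rnormal(0, 0.1)

* Unit fixed effects via absorb(), time fixed effects via i.year
areg logearn D_it i.year, absorb(workerid) vce(cluster workerid)
scalar b_areg = _b[D_it]

* The within estimator gives the same point estimate
xtset workerid year
xtreg logearn D_it i.year, fe vce(cluster workerid)
display "areg: " scalar(b_areg) "   xtreg: " _b[D_it]
```

The point estimates coincide; the reported standard errors can differ slightly because the two commands apply different degrees-of-freedom adjustments.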

#### 18.4.2 Event studies

The natural extension of the previous section, which is the standard approach today, is to estimate different treatment effects depending on the time period. For example, we compute treatment effects 1 year after entering the program or 10 years after entering the program. This allows us to capture the evolution of treatment effects over time. This is a very powerful tool because we can also compute whether there exists any "treatment effects" prior to the program (i.e. we can test whether the parallel trend assumption holds or not). This is often known as a pre-trends test.

We begin by constructing a variable that identifies the time relative to the event. For instance, if a person enters the training program in 2003, the observation corresponding to 2002 is time -1 relative to the event, the observation corresponding to 2003 is time 0 relative to the event, and so on.

In this fake data set, everyone enters the program in year 2003, so it is very easy to construct the event time. Otherwise, we must have in our data set a variable which states the year in which every person enters treatment.

In [None]:
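* Year in which each treated worker enters the program (missing for untreated workers)
cap drop time_entering_treatment 
gen time_entering_treatment = 2003 if treated==1 
replace time_entering_treatment = . if treated==0

* Event time: years relative to entering treatment (missing for untreated workers)
cap drop event_time
gen event_time = year - time_entering_treatment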

In [None]:
tab event_time , m

We then decide which *window* of time around the treatment we want to focus on. For instance, we may want to focus on 2 years prior to the treatment and 2 years after the treatment and estimate those treatment effects. 

To begin, we code untreated units as if they always belonged to event time -1. Later on, when we see the regression specification, we will see that we do not include a dummy for this event time. In other words, we recode: 

In [None]:
replace event_time = -1 if treated==0

We then pretend that those who are observed before event time -2 were actually observed in event time -2. That is:

In [None]:
replace event_time = -2 if event_time<-2 & treated==1

We also pretend that those who are observed after event time 2 were actually observed in event time 2. That is:

In [None]:
replace event_time = 2 if event_time>2 & treated==1

This is called *binning* the window around treatment. To understand why we do this, we need to introduce the extended model that we will use. 

Consider the following equation

$$
y_{it} = \sum_{k=-2,k\neq -1}^2 \beta_k \mathbf{1}\{K_{it} = k\}  + \lambda_t + \alpha_i + e_{it},
$$ 

where $K_{it}$ is the event time of person $i$ at time $t$, so that $\mathbf{1}\{K_{it} = k\}$ are event time dummies (i.e. whether person $i$ is observed at event time $k$ in time $t$). Notice that, for workers who never enter treatment, it is as if the event time is $\infty$. Due to multicollinearity, we need to omit one category of event time dummies $k$. The typical choice is $k=-1$ (one year prior to treatment), which will serve as our reference group. This means that we are comparing changes relative to event time -1.

To see how it compares algebraically to our simpler models with 2 time periods, consider the treatment effect in the year the policy starts, i.e. $k=0$, for workers who enter treatment at time $t$.

$$
E[y_{it} \mid K_{it}=0] -  E[y_{it-1} \mid K_{it}=0] = \beta_0 + \lambda_{t} - \lambda_{t-1} +   E[e_{it} \mid K_{it}=0] -  E[e_{it-1} \mid K_{it}=0]
$$

and 

$$
E[y_{it} \mid K_{it}=\infty] -  E[y_{it-1} \mid K_{it}=\infty] =  \lambda_t - \lambda_{t-1} +   E[e_{it} \mid K_{it}=\infty] -  E[e_{it-1} \mid K_{it}=\infty]
$$


Therefore, the difference will capture $\beta_0$ provided the parallel trends condition holds for treated units $K_{it}=0$ and control units $K_{it}=\infty$ (or, equivalently, $D_i=0$). By omitting the event time dummy at $k=-1$, we are implicitly constructing these differences relative to the year prior to the treatment so as to create our trends among treated and control units. Notice that in this case we are using never-treated units as the comparison group $K_{it}=\infty$; however, there are cases where we don't have such units in our data set. If this is the case, we can use not-yet-treated units as the comparison group $K_{it}=-1$ to capture $\beta_0$.

In [None]:
tab event_time, gen(event_time_dummy)

We have new dummies generated for each event time under the names *event_time_dummy1*, *event_time_dummy2*, and so on.

In [None]:
d

In [None]:
areg logearn event_time_dummy1 event_time_dummy3 event_time_dummy4 event_time_dummy5 i.year , absorb(workerid)

Notice that *event_time_dummy2* is the one that corresponds to event time -1, our omitted category, which is why we exclude that dummy from the regression.

Again, the interpretation is the same as before, only now we have dynamic effects. The coefficient on the *event_time_dummy1* dummy says that at event time -2 (which, because of the binning, covers 2 or more years prior to entering treatment), treated units had earnings about 0.8 percent higher relative to control units, measured against event time -1.

Should we worry that we are finding a difference between treated and control units prior to the policy? Notice that we cannot reject the null hypothesis that the "effect" of the policy at event time -2 (*event_time_dummy1*, when there was no training program) is equal to zero at the 5% significance level.

This is consistent with our parallel trends assumption. In other words, we find no statistically significant differences in trends prior to the enactment of the training program. *Checking the p-value of those coefficients prior to the treatment is called the pre-trend test* and does not require any fancy work. A mere look at the regression results suffices!

Furthermore, we can observe how the policy effect evolves over time. In the year of entering the training program, earnings are boosted by about 20 percent. The next year the effect decreases to about 11 percent, and 2 or more years after the policy the effect falls to about 4 percent.
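When the window contains more than one pre-treatment period, the pre-period coefficients can be tested jointly. The sketch below is one possible way to do this on simulated data, using factor-variable notation instead of hand-made dummies. The window of -3 to 3, the shifted variable `et`, and all effect sizes are assumptions made for illustration. Factor variables require non-negative integers, which is why event time is shifted by 3.

```stata
* Sketch: event study with a joint pre-trend test (simulated, hypothetical values)
clear
set seed 184
set obs 400
gen workerid = _n
gen treated = mod(workerid, 2)
gen double alpha = rnormal()
expand 11
bysort workerid: gen year = 1997 + _n            // 1998 to 2008

* True dynamic effects: 0.2, 0.1, then 0.05; zero before 2003
gen double effect = 0
replace effect = 0.20 if treated == 1 & year == 2003
replace effect = 0.10 if treated == 1 & year == 2004
replace effect = 0.05 if treated == 1 & year >= 2005
gen double logearn = effect + 0.02*(year - 1998) + alpha + rnormal(0, 0.1)

* Event time, with untreated units in the reference category and a binned window
gen event_time = year - 2003 if treated == 1
replace event_time = -1 if treated == 0
replace event_time = -3 if event_time < -3 & treated == 1
replace event_time =  3 if event_time >  3 & treated == 1
gen et = event_time + 3                          // 0,...,6; event time -1 becomes 2

* ib2.et omits event time -1 as the base category
areg logearn ib2.et i.year, absorb(workerid) vce(cluster workerid)

* Joint test that the pre-period coefficients (event times -3 and -2) are zero
test 0.et 1.et
```

A large p-value from `test` means we cannot reject that both pre-period coefficients are zero, which is the joint version of the pre-trend check described above.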

## 18.5 Wrap Up 

In this module we've seen how the difference-in-differences design relies on two components: 

1. Panel data, in which units are observed over time.
2. Time and unit fixed effects, which are included in the regression.

These two components make regressions mathematically equivalent to taking time-differences that eliminate any time-invariant components of the error term creating endogeneity. Furthermore, when we have access to more than 2 time periods, we are able to construct dynamic treatment effects and test whether the parallel trends condition holds.

In the final module, we will look at a final research design: instrumental variables.

## References

[Difference in differences using Stata](https://www.youtube.com/watch?v=OQCKafoCb9Q)