# ECON 490: Causal Inference and Matching (16)

## Prerequisites

1. Run OLS Regressions.

## Learning Outcomes

1. Understand the potential outcomes notation.
2. Recognize the Conditional Independence Assumption and when it can hold.
3. Construct propensity scores and do nearest neighbor matching on observables to estimate average treatment effects.

## 16.1 Potential Outcomes Framework and Causality 

So far, we have learned that linear regression is a powerful tool, but one that requires the error term to be uncorrelated with the independent variables. Whenever an independent variable is correlated with the error term, we say that the variable is endogenous.

In this section we will study the well-known Potential Outcomes Framework, also known as the Neyman-Rubin causal model. With the new notation, we will be able to establish the sufficient conditions that must hold so that we can infer average causal effects from our data. The focus in this module will be on a binary treatment, which takes the value 1 when the person/unit receives treatment and 0 otherwise. Examples include a work training program, a conditional cash-transfer program for disadvantaged households, or other interventions.

We will denote the binary treatment of individual/unit $i$ by $D_i$, the observed outcome by $Y_i$, and some observable characteristics by $X_i$. This notation implies that we observe units at a single point in time (cross-sectional data). The fundamental problem of causal inference is that we cannot *simultaneously* observe a person being treated and not treated. Once a unit takes one of these two paths, that is all we observe from that unit. The outcomes a unit would have under each of the two paths, only one of which is ever observed, are called potential outcomes:

- $Y_i(1)$ is the outcome had unit $i$ received treatment.
- $Y_i(0)$ is the outcome had unit $i$ not received treatment.

Formally, we write

$$
Y_i = Y_i(1) D_i + Y_i(0) \left(1 - D_i \right). \tag{1}
$$ 

Equation 1 states that for those who receive treatment ($D_i = 1$), we observe the treated potential outcome, while for those who do not receive treatment ($D_i=0$), we observe the untreated potential outcome. We cannot observe $Y_i(0)$ for those who ended up receiving $D_i=1$, or $Y_i(1)$ for those who did not end up receiving treatment. Writing the observed outcome this way relies on SUTVA (the Stable Unit Treatment Value Assumption), which states that there is no interference between units: the $D_j$ of other units $j \neq i$ plays no role in determining the outcome $Y_i$.
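To make Equation 1 concrete, here is a minimal simulation sketch. Everything in it is hypothetical: the variable names (`y0`, `y1`, `d`, `y`), the seed, and the parameter values are made up for illustration and do not come from the data used in this module.

```stata
* Hypothetical simulation: names and parameter values are illustrative only
clear
set seed 490
set obs 1000
gen y0 = rnormal(50, 10)        // untreated potential outcome Y_i(0)
gen y1 = y0 + rnormal(5, 2)     // treated potential outcome Y_i(1)
gen d  = runiform() < 0.4       // binary treatment D_i
gen y  = y1*d + y0*(1 - d)      // Equation 1: the observed outcome
* In real data we would only have y and d, never both y0 and y1 for the same unit
list y0 y1 d y in 1/5
```

In the listing, `y` equals `y1` for treated rows and `y0` for untreated rows, which is exactly what Equation 1 says.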

Notice that this notation already provides a notion of *treatment effects*: what would have been the increase in $Y$ if a person had gone from not being treated to being treated.  

$$
\text{Individual Treatment Effects:   } Y_{i}(1) - Y_{i}(0)
$$

But as we discussed earlier, these are *by definition* unobservable! Despite this, there might be a way to learn something about average effects:

$$
\text{Average Treatment Effects (ATE):   } E[Y_{i}(1) - Y_{i}(0)]
$$

$$
\text{Average Treatment Effects on the Treated (ATT):   } E[Y_{i}(1) - Y_{i}(0) \mid D_i=1]
$$

$$
\text{Average Treatment Effects on the Untreated (ATU):   } E[Y_{i}(1) - Y_{i}(0) \mid D_i=0]
$$

where the following equation always holds by the law of total expectation:

$$
\underbrace{E[Y_{i}(1) - Y_{i}(0)]}_\text{ATE} = \underbrace{E[Y_{i}(1) - Y_{i}(0) \mid D_i=1]}_\text{ATT} P(D_i=1) + \underbrace{E[Y_{i}(1) - Y_{i}(0) \mid D_i=0]}_\text{ATU} P(D_i=0).
$$

The previous equation shows that the ATE is harder to learn than the other two, because it requires knowing both the ATT and the ATU. In some cases it will only be feasible to infer the ATT. This will be further explored in the Difference-in-Differences module.
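As a purely illustrative calculation with made-up numbers (not taken from any data set), suppose the ATT is 4, the ATU is 1, and $P(D_i = 1) = 0.25$. The decomposition then gives:

$$
\text{ATE} = 4 \times 0.25 + 1 \times 0.75 = 1.75.
$$

In this example, knowing the ATT alone would not pin down the ATE: we would also need the ATU, which involves the never-observed treated outcomes of the untreated.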


We can also define treatment effects after we condition on some observables $X_i$. For instance, we can define average effects such as:

$$
\text{Conditional Average Treatment Effects (CATE):   } E[Y_{i}(1) - Y_{i}(0) \mid X_i]
$$

$$
\text{Conditional Average Treatment Effects on the Treated (CATT):   } E[Y_{i}(1) - Y_{i}(0) \mid D_i=1, X_i]
$$

and so on.

## 16.2 What does this have to do with regression?

Consider Equation 1 once again. We'll rewrite it as:

$$ 
\begin{align}
Y_i &= Y_i(1) D_i + Y_i(0) \left(1 - D_i \right) \\
    &= Y_i(0) +  \left( Y_i(1) - Y_i(0) \right) D_i \\
    &= \underbrace{E[Y_i(0)]}_{\beta_0} + \underbrace{E[ Y_i(1) - Y_i(0) ]}_{\beta_1} D_i  +  \underbrace{ \{ \left( Y_i(1) - Y_i(0) \right) - E[ Y_i(1) - Y_i(0) ] \} D_i + Y_i(0) - E[Y_i(0)] }_\text{$\epsilon_i$}
\end{align}
$$ 

Notice that if $D_i$ were independent of $(Y_i(1) , Y_i(0))$, then $E[\epsilon_i \mid D_i ] = 0$ (you can double check this result!). Let's think a little more about what the independence requirement means. It says that receiving treatment must have nothing to do with the determinants of either potential outcome. One extreme example would be a lottery, where based on a number you draw (which has nothing to do with your outcomes and their determinants) you either get treated or don't. We refer to this as *random assignment*. Under these circumstances, we can run an OLS regression of $Y_i$ on $D_i$ and obtain a sample analogue of $\beta_1$ (the ATE).
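One way to carry out this check is to take the expectation of $\epsilon_i$ conditional on $D_i$, treating $D_i$ as known inside the conditional expectation:

$$
\begin{align}
E[\epsilon_i \mid D_i] &= \left\{ E[Y_i(1) - Y_i(0) \mid D_i] - E[Y_i(1) - Y_i(0)] \right\} D_i + E[Y_i(0) \mid D_i] - E[Y_i(0)] \\
&= \left\{ E[Y_i(1) - Y_i(0)] - E[Y_i(1) - Y_i(0)] \right\} D_i + E[Y_i(0)] - E[Y_i(0)] \\
&= 0.
\end{align}
$$

The second line is where independence is used: when $D_i$ is independent of $(Y_i(1), Y_i(0))$, conditioning on $D_i$ does not change the expectation of either potential outcome.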

The other option, which imposes a perhaps stronger assumption, is that $Y_i(1) - Y_i(0)$ is constant across units, hence the name *constant treatment effects*. If that is the case, the first term of $\epsilon_i$ vanishes and $\epsilon_i = Y_i(0) - E[Y_i(0)]$, which has expectation zero by construction; OLS then only requires that $Y_i(0)$ be mean-independent of $D_i$. For the rest of this module, we will assume this is not the case.

At this point, you might think that independence between treatment and potential outcomes is a strong assumption, and you would be correct. A more credible approach is to say that once we condition on (control for) individual characteristics $X_i$, treatment assignment is independent of potential outcomes. This condition is known as the *Conditional Independence Assumption*.

## 16.3 Conditional Independence Assumption


Recall the regression model we had in the previous section:

$$ 
\begin{align}
Y_i  &= \underbrace{E[Y_i(0)]}_{\beta_0} + \underbrace{E[ Y_i(1) - Y_i(0) ]}_{\beta_1} D_i  +  \underbrace{ \{ \left( Y_i(1) - Y_i(0) \right) - E[ Y_i(1) - Y_i(0) ] \} D_i + Y_i(0) - E[Y_i(0)] }_\text{$\epsilon_i$}
\end{align}
$$ 

We now impose the following condition:


$$
\text{Conditional Independence Assumption (CIA):   } D_i \perp \left( Y_i(1), Y_i(0) \right) \mid X_i.
$$

Consider the case where workers may or may not be affected by a mass layoff at their firm (i.e. when the firm undergoes significant downsizing). A mass-layoff event could be thought of as essentially random when comparing two *similar* (this is the key!) workers (i.e. workers of similar age, in the same industry, etc.).
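To see why the CIA is useful, consider values of $X_i$ at which both treated and untreated units exist (an overlap condition that we assume here for the sake of the sketch). At those values, conditional means of the observed outcome identify conditional average effects:

$$
E[Y_i \mid D_i = 1, X_i] - E[Y_i \mid D_i = 0, X_i] = E[Y_i(1) \mid X_i] - E[Y_i(0) \mid X_i] = E[Y_i(1) - Y_i(0) \mid X_i].
$$

The first equality combines Equation 1 with the CIA. Averaging the right-hand side over the distribution of $X_i$ gives the ATE.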

One approach relies on running an OLS regression that includes $X_i$ as additional covariates. This is completely valid, but it is good to know what this approach really does. When we estimate the model with additional covariates, OLS "subtracts" the linear effect of the covariates from both the outcome and the treatment (Frisch-Waugh-Lovell Theorem, as seen in [Module 12](econometrics/econ490-stata/12_Linear_Reg.ipynb)). For instance, the regression will still use very old workers in our data as controls, subtracting a linear age term to adjust for age differences. How well this works depends on how well the potential outcomes $Y_i(1)$ and $Y_i(0)$ can be approximated by linear functions of $X_i$.
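The following simulation is a minimal sketch of this idea. It is entirely hypothetical: the variable names, the seed, the selection rule, and the true effect of 2 are assumptions chosen for illustration, and the untreated outcome is linear in `x` by construction, which is the favourable case for OLS.

```stata
* Hypothetical simulation; names and parameter values are illustrative only
clear
set seed 16
set obs 5000
gen x  = rnormal()                 // observable characteristic X_i
gen d  = runiform() < normal(x)    // treatment is more likely when x is high
gen y0 = 10 + 3*x + rnormal()      // untreated outcome, linear in x
gen y1 = y0 + 2                    // true treatment effect set to 2
gen y  = y1*d + y0*(1 - d)         // observed outcome (Equation 1)

regress y d                        // ignores x: d also picks up the effect of x
regress y d x                      // controls for x

* Frisch-Waugh-Lovell: partial x out of both y and d, then regress the residuals
regress y x
predict y_res, residuals
regress d x
predict d_res, residuals
regress y_res d_res                // same slope as the coefficient on d when controlling for x
```

Since treatment depends only on `x` and random noise, the CIA holds here by design. Comparing the first two regressions shows how much of the raw difference is due to `x` rather than to the treatment.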

## 16.4 Matching as a way to make groups comparable

The goal when we assume CIA is to find a comparable unit for every treated unit. A naive approach would be to find an untreated unit with exactly the same characteristics as a treated unit. This is called *exact matching*. The problem with this approach is that when we have characteristics that take many possible values, it may be hard to find two people who share every $X_i$ exactly.
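As a rough sketch of how quickly exact matching can run out of comparable units, we could count the treated workers who have no untreated worker in the same cell of characteristics. The code assumes the `fake_data` file and the variables `region`, `age`, `sex`, and `treated` that are used later in this module. The names `cell`, `n_treated`, and `n_control` are introduced here only for illustration.

```stata
* Sketch only: assumes the fake_data file and variables used later in this module
use fake_data, clear
keep if year == 2003
gen female = sex == "F"

* One cell per combination of region, age and sex
egen cell = group(region age female)
bysort cell: egen n_treated = total(treated)
bysort cell: egen n_control = total(1 - treated)

* Treated workers with no untreated worker sharing exactly the same characteristics
count if treated == 1 & n_control == 0
```

Adding more characteristics, or continuous ones such as earnings histories, would make the cells smaller and the count of unmatched treated workers larger.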

The most common approach is to construct the probability of being treated given our observable characteristics, then, for every treated unit, find an untreated unit with a similar probability. We refer to this probability as a propensity score, denoted $p(X)$. Although there are many ways to match units based on a propensity score, we will only describe the case in which we match the closest propensity score neighbor without replacement. This procedure is called Propensity Score Nearest-neighbor Matching.

The theoretical justification to do so is given by the following theorem:

**Propensity Score Theorem.** Suppose the CIA holds, that is, $D_i \perp \left( Y_i(1), Y_i(0) \right) \mid X_i$. Then
$$
D_i \perp \left( Y_i(1), Y_i(0) \right) \mid p(X_i).
$$
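In symbols, the propensity score is $p(X_i) = P(D_i = 1 \mid X_i)$. As a sketch, and assuming that every treated unit has a comparable untreated unit (common support), a one-to-one nearest-neighbor estimator of the ATT can be written as:

$$
\widehat{ATT} = \frac{1}{N_1} \sum_{i : D_i = 1} \left( Y_i - Y_{j(i)} \right), \qquad j(i) = \arg\min_{j : D_j = 0} \left| \hat{p}(X_i) - \hat{p}(X_j) \right|.
$$

The symbols $N_1$ (the number of treated units), $j(i)$ (the untreated unit matched to treated unit $i$), and $\hat{p}$ (the estimated propensity score) are introduced here for illustration. When matching is done without replacement, the minimum is taken only over untreated units that have not already been used as a match.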

To estimate a propensity score, we run the `probit` or `logit` command where the dependent variable is the treatment indicator and the independent variables are the individual characteristics $X_i$. Without going into the details of the procedure, this will help us predict a propensity score for both treated and untreated units.

In [6]:
use fake_data, clear
keep if year==2003
gen female = sex=="F"



(129,659 observations deleted)



In [4]:
d


Contains data from fake_data.dta
  obs:         8,479                          
 vars:             9                          22 Aug 2022 15:02
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
workerid        long    %12.0g                Worker Identifier
year            int     %8.0g                 Calendar Year
sex             str1    %9s                   Sex
age             byte    %9.0g                 Age (years)
start_year      int     %9.0g                 Initial year worker is observed
region          byte    %9.0g                 group(prov)
treated         byte    %8.0g                 Treatment Dummy
earnings        float   %9.0g                 Earnings
sample_weight   float   %9.0g                 
-----------------------------------------------------------

In [8]:
probit treated i.region age female 


Iteration 0:   log likelihood = -5070.3698  
Iteration 1:   log likelihood = -4834.8433  
Iteration 2:   log likelihood = -4834.6624  
Iteration 3:   log likelihood = -4834.6624  

Probit regression                               Number of obs     =      8,479
                                                LR chi2(6)        =     471.41
                                                Prob > chi2       =     0.0000
Log likelihood = -4834.6624                     Pseudo R2         =     0.0465

------------------------------------------------------------------------------
     treated |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      region |
          2  |   .0953624   .0379457     2.51   0.012     .0209901    .1697347
          3  |  -.1211564   .0755966    -1.60   0.109    -.2693229    .0270102
          4  |   .0756149   .0469747     1.61   0.107    -.0164537    .1676836
          5 

In [9]:
predict pscore

(option pr assumed; Pr(treated))
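As a possible robustness check, the same score could be estimated with `logit` instead of `probit`, and the distribution of the score could be compared across treated and untreated workers to get a first impression of overlap. The sketch below assumes the data and the variables `female` and `pscore` created in the cells above. The name `pscore_logit` is made up for this example.

```stata
* Assumes fake_data (2003 cross-section), female and pscore from the cells above
logit treated i.region age female
predict pscore_logit, pr

* The two sets of predicted probabilities are usually very similar
correlate pscore pscore_logit

* Distribution of the probit score by treatment status (a rough overlap check)
tabstat pscore, by(treated) statistics(n mean sd min max)
```

If the range of scores among treated workers lay far outside the range among untreated workers, it would be hard to find close matches for some treated units.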


In [10]:
%browse 10

|  | workerid | year | sex | age | start_year | region | treated | earnings | sample_weight | female | pscore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 2003 | M | 56 | 2001 | 4 | 0 | 111797.26 | 0.70438099 | 0 | 0.26855189 |
| 2 | 6 | 2003 | M | 49 | 1995 | 5 | 0 | 132451.77 | 0.56865126 | 0 | 0.23547712 |
| 3 | 14 | 2003 | M | 52 | 1995 | 4 | 0 | 29224.723 | 0.13044044 | 0 | 0.25645399 |
| 4 | 17 | 2003 | M | 48 | 1995 | 3 | 0 | 599702.44 | 0.41867602 | 0 | 0.1872151 |
| 5 | 19 | 2003 | M | 45 | 1998 | 2 | 1 | 136680.91 | 0.76891863 | 0 | 0.24211197 |
| 6 | 20 | 2003 | M | 44 | 1995 | 1 | 0 | 20351.242 | 0.88601494 | 0 | 0.21065027 |
| 7 | 39 | 2003 | M | 48 | 1995 | 2 | 0 | 76223.641 | 0.56896472 | 0 | 0.25089157 |
| 8 | 40 | 2003 | M | 45 | 1995 | 2 | 0 | 595965.69 | 0.92006618 | 0 | 0.24211197 |
| 9 | 41 | 2003 | M | 44 | 1995 | 2 | 0 | 41871.207 | 0.46651092 | 0 | 0.23922288 |
| 10 | 45 | 2003 | M | 52 | 1995 | 4 | 0 | 3202.1738 | 0.84977078 | 0 | 0.25645399 |


We can see that there is a new variable called _pscore_ which stores the propensity score. It is generated for all observations, treated and untreated. Now we need a command that does the matching using those propensity scores. A convenient choice is the user-written command `psmatch2`, which needs to be installed with the following line:

In [None]:
ssc install psmatch2

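To use the command, we specify the treatment variable, the outcome of interest (_earnings_, in this example), and the variable that stores the propensity score, since we have already estimated it. Note that, by default, `psmatch2` matches each treated unit to its single nearest neighbor with replacement; the `noreplacement` option requests one-to-one matching without replacement.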

In [11]:
psmatch2 treated, out(earnings) pscore(pscore)

----------------------------------------------------------------------------------------
        Variable     Sample |    Treated     Controls   Difference         S.E.   T-stat
----------------------------+-----------------------------------------------------------
        earnings  Unmatched |  52426.854   102788.845  -50361.9911   3651.52013   -13.79
                        ATT |  52426.854   112196.327  -59769.4733   21611.5303    -2.77
----------------------------+-----------------------------------------------------------
Note: S.E. does not take into account that the propensity score is estimated.

           | psmatch2:
 psmatch2: |   Common
 Treatment |  support
assignment | On suppor |     Total
-----------+-----------+----------
 Untreated |     6,059 |     6,059 
   Treated |     2,420 |     2,420 
-----------+-----------+----------
     Total |     8,479 |     8,479 


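The first row of the output shows the raw difference in mean earnings between treated and untreated workers before any matching: treated workers earn roughly 50,362 dollars less. The second row gives the difference after matching on the propensity score, which is the estimate of the ATT: roughly 59,769 dollars less, a gap that is larger in magnitude (with a t-statistic of -2.77). The second table shows that all 8,479 observations are on the common support.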
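Before trusting the matched estimate, it is usually worth checking whether matching actually made the two groups more similar in their observables. The sketch below uses `pstest`, which is distributed with the `psmatch2` package, and then re-runs the matching without replacement on the common support. It assumes that `psmatch2` has just been run as above. The choice of covariates and options is illustrative, not a recommendation.

```stata
* Assumes psmatch2 has just been run as above
* Compare covariate means across groups before and after matching
pstest age female, both

* Hypothetical variant: one-to-one matching without replacement,
* dropping treated units outside the common support
psmatch2 treated, outcome(earnings) pscore(pscore) noreplacement common
```

Large remaining differences in covariate means after matching would suggest that the propensity score model needs to be revisited.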

## 16.5 Wrap Up

Propensity score matching can be a convincing research design when we don't have access to any other method such as IV or Difference-in-Differences (which will be explored in upcoming modules). If we choose to pursue this approach, we must remember the following:

- This method is based on binary treatments (i.e. those that take value 1 or 0).
- We must defend the Conditional Independence Assumption. That is, we must discuss why we think that, for units with similar characteristics $X_i$, the treatment assignment is as-good-as-random.