# ECON 490: Instrumental Variables (19)

## Prerequisites:
---
1. Run OLS regressions.

## Learning objectives:
---

1. Understand what an instrumental variable is and the conditions it must satisfy to address the endogeneity problem.
2. Implement a Two-Stage Least Squares (2SLS) regression-based approach using an instrument.
3. Describe what the weak instrument problem is.
4. Interpret the first stage test of whether the instrument is weak or not.



## 19.1 What problem are we trying to fix to begin with?

Consider a case where we want to know the Average Treatment Effect (ATE), defined as
$$E[ Y_i(1) - Y_i(0) ] $$

where $Y_i(1)$ is the outcome individual $i$ would have had if treated and $Y_i(0)$ is the outcome she would have had if not treated. The fundamental problem of causal inference is that we can never observe both potential outcomes for the same unit: each unit is either treated or not treated. When treatment $D_i$ is as good as random, the potential outcomes do not depend on $D_i$ (otherwise $D_i$ would be correlated with them). Formally,

$$\begin{align} 
E[Y_i(0) \mid D_i=1] &= E[Y_i(0) \mid D_i=0] = E[Y_i(0)], \quad \text{    and } \\
E[Y_i(1) \mid D_i=1] &= E[Y_i(1) \mid D_i=0] = E[Y_i(1)].
\end{align}
$$

Notice that the previous set of equations says that conditioning on any value of $D_i$ gives the same answer as not conditioning at all. When this holds, we can infer the average treatment effect

$$\begin{align} 
E[ Y_i(1) - Y_i(0) ] &= E[ Y_i(1)] - E[ Y_i(0)] \\
&= E[ Y_i(1) \mid D_i=1] - E[ Y_i(0) \mid D_i=0], \quad \text{by independence of $D_i$.}\\
&= E[ Y_i \mid D_i=1] - E[ Y_i \mid D_i=0], \quad \text{because those are the outcomes that are observed for those groups.}
\end{align}
$$
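
A small simulation can make this identification argument concrete. The sketch below is written in Python rather than Stata only so that it is self-contained. The data-generating process, the seed, the sample size, and the treatment effect of 2 are all made up for illustration. It compares the difference in observed means under coin-flip assignment with the same difference when units with a high $Y_i(0)$ select into treatment.

```python
import random

N = 20_000        # sample size (assumed)
TRUE_ATE = 2.0    # treatment effect built into the simulation (assumed)


def difference_in_means(assign, seed=490):
    """Simulate N units and return mean(Y | D=1) - mean(Y | D=0).

    `assign(y0, rng)` decides whether a unit is treated.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(N):
        y0 = rng.gauss(10.0, 3.0)   # potential outcome without treatment
        y1 = y0 + TRUE_ATE          # potential outcome with treatment
        if assign(y0, rng):
            treated.append(y1)      # only Y(1) is observed for treated units
        else:
            control.append(y0)      # only Y(0) is observed for untreated units
    return sum(treated) / len(treated) - sum(control) / len(control)


# Coin-flip assignment: D is independent of the potential outcomes.
randomized = difference_in_means(lambda y0, rng: rng.random() < 0.5)

# Self-selection: units with a high Y(0) choose treatment, so D depends on Y(0).
self_selected = difference_in_means(lambda y0, rng: y0 > 10.0)

print(f"true ATE:                 {TRUE_ATE:.2f}")
print(f"randomized difference:    {randomized:.2f}")
print(f"self-selected difference: {self_selected:.2f}")
```

Under random assignment the difference in means should land close to the true effect, while under self-selection it also picks up the gap in $Y_i(0)$ between the two groups.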


A regression-based approach to that same issue can be formulated with the following model, 

$$Y_{i} = \alpha + \beta D_i + \epsilon_i$$

where we assume $E[\epsilon_i \mid D_i] =0$. This condition is the regression counterpart of the independence assumption in the potential outcomes model: it states that there are no unobserved differences, on average, between treated and untreated units. In other words, treatment is (mean) independent of whatever else explains the outcome.


The punchline is that randomization is a great way to tackle the fundamental problem of causal inference, because it lets us recover averages of counterfactual outcomes from observed data. However, in *most* economic applications treatment is not randomly assigned. The instrumental variables approach relies on finding something that is as good as random and that affects the treatment, and thus indirectly affects the outcome. The trick is to split the treatment into two pieces: one that is as good as random and one that is not. We then use the former to estimate causal effects.

## 19.2 The Linear IV Model

Consider the following model 

$$
\begin{align}
Y_i &= \alpha_1 + \beta D_i + \gamma_1 X_i + u_i  \quad \text{(Structural Equation)}\\
D_i &= \alpha_2 + \gamma_2 Z_i + \gamma_3 X_i + e_i  \quad \text{(First Stage Equation)}
\end{align}
$$

where $Z_i$ is called an instrumental variable. An instrumental variable must satisfy two conditions:
- It must affect treatment assignment ($\gamma_2 \neq 0$).
- It must be uncorrelated with $u_i$: it must not belong in the Structural Equation itself, and it must not be related to the unobserved determinants of $Y_i$.

These two conditions imply that the instrument must affect the outcome $Y$ *only through* its effect on treatment $D$. A well-known example of this model is studied by Angrist and Krueger (1991), where $Y$ is earnings, $D$ is years of schooling, and $Z$ is quarter of birth. The idea is that students are required to enter school in the year in which they turn 6, while compulsory schooling laws allow them to leave once they reach a fixed age, so children born in different quarters are required to stay in school for different lengths of time. This creates a relationship between quarter of birth and schooling. At the same time, the time of year in which you are born shouldn't really affect your earnings aside from its effect on schooling.
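
Ignoring the covariates $X_i$ for a moment, these two conditions are exactly what is needed to write $\beta$ in terms of quantities we can estimate. As a sketch under that simplification, take the covariance of both sides of the Structural Equation with $Z_i$:

$$
\begin{align}
\operatorname{Cov}(Z_i, Y_i) &= \beta \operatorname{Cov}(Z_i, D_i) + \operatorname{Cov}(Z_i, u_i) \\
&= \beta \operatorname{Cov}(Z_i, D_i), \quad \text{because $Z_i$ is uncorrelated with $u_i$,} \\
\beta &= \frac{\operatorname{Cov}(Z_i, Y_i)}{\operatorname{Cov}(Z_i, D_i)}, \quad \text{which is well defined because $\gamma_2 \neq 0$.}
\end{align}
$$

The second condition removes the term we cannot observe, and the first condition guarantees that we are not dividing by zero.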


### 19.2.1 Two-Stage Least Squares (2SLS) 

The 2SLS approach is simple at heart. The two steps are the following.

1. Estimate the First Stage Equation by OLS, and obtain the predicted value of $D_i$. We have effectively split $D_i$ into
    $$ D_i = \underbrace{\hat{D}_i}_\text{exogenous part} + \underbrace{\hat{e}_i}_\text{endogenous part}  $$

    where $\hat{D}_i \equiv \hat{\alpha}_2 + \hat{\gamma}_2 Z_i + \hat{\gamma}_3 X_i$.

2. Replace $D_i$ with $\hat{D}_i$ in the Structural Equation and estimate it by OLS. We are then using only the "exogenous" part of $D_i$ to estimate $\beta$.
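
The two steps can be sketched on simulated data. The example below uses Python rather than Stata only so that it is self-contained, and it drops the covariates $X_i$ to keep the algebra to simple regressions. The data-generating process, the seed, and all coefficient values are assumptions made for illustration: an unobserved `ability` term raises both $D_i$ and $Y_i$, so OLS is biased, while $Z_i$ is drawn independently of it.

```python
import random


def ols(x, y):
    """Simple OLS of y on x with an intercept; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


# Simulated data (all numbers are assumptions made for this illustration).
rng = random.Random(19)
TRUE_BETA = 2.0
z, d, y = [], [], []
for _ in range(20_000):
    z_i = rng.gauss(0, 1)        # instrument: independent of ability
    ability = rng.gauss(0, 1)    # unobserved, so it ends up in u_i
    d_i = 0.5 + 0.8 * z_i + ability + rng.gauss(0, 1)
    y_i = 1.0 + TRUE_BETA * d_i + 3.0 * ability + rng.gauss(0, 1)
    z.append(z_i)
    d.append(d_i)
    y.append(y_i)

# Naive OLS of Y on D is biased because D is correlated with ability.
_, beta_ols = ols(d, y)

# Step 1: estimate the First Stage Equation and build the predicted treatment.
alpha_2, gamma_2 = ols(z, d)
d_hat = [alpha_2 + gamma_2 * z_i for z_i in z]

# Step 2: estimate the Structural Equation with D-hat in place of D.
_, beta_2sls = ols(d_hat, y)

print(f"true beta: {TRUE_BETA:.2f}")
print(f"OLS:       {beta_ols:.2f}")
print(f"2SLS:      {beta_2sls:.2f}")
```

This sketch only reproduces the point estimate. It does not compute standard errors, which need more care than a plain second-step OLS provides.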

<div class="alert alert-warning">

**Caution**: We can run 2SLS by hand following the steps above, but for inference the standard errors must be based on the correct Structural Equation residuals $\hat{u}_i$, which are computed with the actual $D_i$. In the manual approach, Stata computes the second-step standard errors from residuals built with $\hat{D}_i$, that is, from $\hat{u}_i + \hat{\beta}\hat{e}_i$, which is wrong. The solution is to use the built-in command `ivregress` or the user-written command `ivreg2`!
</div>
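
To see where the incorrect standard errors come from, substitute $D_i = \hat{D}_i + \hat{e}_i$ into the estimated Structural Equation, where the hats denote the 2SLS estimates:

$$
\begin{align}
Y_i &= \hat{\alpha}_1 + \hat{\beta} D_i + \hat{\gamma}_1 X_i + \hat{u}_i \\
&= \hat{\alpha}_1 + \hat{\beta} \hat{D}_i + \hat{\gamma}_1 X_i + \left( \hat{u}_i + \hat{\beta} \hat{e}_i \right).
\end{align}
$$

The manual second step treats the term in parentheses as its residual, so its variance estimate mixes the first-stage residual $\hat{e}_i$ into the Structural Equation error. The coefficients are the same either way. Only the standard errors differ.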


### 19.2.2 Weak Instrument Test

Recall that the instrument must have an effect on the treatment variable $D$ after controlling for the covariates $X$. If it had no effect at all, $\hat{D}_i$ would carry no exogenous variation in $D_i$ and we would not solve the underlying problem of an endogenous regressor. When this effect is non-zero but very close to zero, we have the *weak instrument* problem. In practice, a weak instrument leads to severe finite-sample bias (toward the OLS estimate) and large variance in our estimates.

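Since we estimate the First Stage Equation anyway, we can test for this. To do so, we run the command `estat firststage` right after `ivregress`. It reports the F-statistic for the null hypothesis that the instruments have no effect in the first stage; a common rule of thumb treats an F-statistic below 10 as a sign of a weak instrument.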
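
To build intuition for what a first-stage F-statistic measures, here is a minimal sketch in Python (used instead of Stata only so that it is self-contained). It assumes a single instrument and no covariates, in which case the F-statistic equals the squared t-statistic on $\gamma_2$. The simulated data, the seed, and the two values of $\gamma_2$ are made up for illustration, and the function is not meant to reproduce the full output of `estat firststage`.

```python
import random


def first_stage_f(z, d):
    """F-statistic for H0: gamma_2 = 0 in D = alpha_2 + gamma_2 * Z + e.

    Assumes one instrument and no covariates, so F is the squared t-statistic.
    """
    n = len(z)
    mean_z, mean_d = sum(z) / n, sum(d) / n
    s_zz = sum((a - mean_z) ** 2 for a in z)
    s_zd = sum((a - mean_z) * (b - mean_d) for a, b in zip(z, d))
    gamma_2 = s_zd / s_zz
    alpha_2 = mean_d - gamma_2 * mean_z
    rss = sum((b - alpha_2 - gamma_2 * a) ** 2 for a, b in zip(z, d))
    sigma2 = rss / (n - 2)              # residual variance estimate
    se_gamma_2 = (sigma2 / s_zz) ** 0.5
    return (gamma_2 / se_gamma_2) ** 2


def simulate_first_stage(gamma_2, n=1000, seed=19):
    """Draw (Z, D) with a first-stage coefficient of gamma_2 (assumed DGP)."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    d = [gamma_2 * z_i + rng.gauss(0, 1) for z_i in z]
    return z, d


f_strong = first_stage_f(*simulate_first_stage(gamma_2=0.5))
f_weak = first_stage_f(*simulate_first_stage(gamma_2=0.02))

# A common rule of thumb flags instruments with a first-stage F below 10.
print(f"strong instrument: F = {f_strong:.1f}")
print(f"weak instrument:   F = {f_weak:.1f}")
```

With the same sample size and noise, the only thing that changes between the two cases is the size of $\gamma_2$, which is what drives the F-statistic.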

### 19.2.3 Overidentification Test

Recall that in OLS there is no way to test for exogeneity, $E[X_i u_i]=0$, because the OLS residuals are uncorrelated with the regressors in the sample by construction. Similarly, when there is *only one instrument* for one endogenous regressor, we cannot test the exogeneity of the instrument, $E[Z_i u_i]=0$, because the 2SLS residuals are uncorrelated with the instrument in the sample by construction.

However, when we have more instruments than endogenous regressors, we can include them all in the First Stage Equation and test for instrument exogeneity. Intuitively, if we have more than one instrument we can use one of them to construct $\hat{u}_i$ and then use the others to test whether $E[Z_i u_i]=0$ holds in the sample.
 
The command to run this in Stata is `estat overid`.

If we reject the null hypothesis, the test says that at least one of our instruments is not exogenous, which violates the required conditions for IVs. If we don't reject, it *does not necessarily* mean that our model is good; it only says that there is not enough evidence in the data to conclude that the instruments are endogenous. The punchline is that overidentification tests can be very useful for ruling out some models (in this case, instruments), but they won't tell you whether your model is good.
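
One standard form of this test is the Sargan statistic: regress the 2SLS residuals on the instruments and compare $n R^2$ with a $\chi^2$ distribution whose degrees of freedom equal the number of extra instruments. The Python sketch below (used instead of Stata only so that it is self-contained) implements it for two instruments, one endogenous regressor, and no covariates, assuming homoskedastic errors. The simulated data, the seed, and all coefficients are assumptions made for illustration, and it is not meant to reproduce the exact output of `estat overid`.

```python
import math
import random


def demean(v):
    mean_v = sum(v) / len(v)
    return [a - mean_v for a in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fitted_on_two(x1, x2, y):
    """Fitted values from OLS of y on x1 and x2 (all demeaned, no intercept)."""
    s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
    s1y, s2y = dot(x1, y), dot(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return [b1 * a + b2 * b for a, b in zip(x1, x2)]


def sargan(z1, z2, d, y):
    """Return (2SLS beta, Sargan statistic, p-value) with one extra instrument."""
    z1, z2, d, y = map(demean, (z1, z2, d, y))
    d_hat = fitted_on_two(z1, z2, d)               # first stage
    beta = dot(d_hat, y) / dot(d_hat, d)           # 2SLS estimate
    u_hat = [y_i - beta * d_i for y_i, d_i in zip(y, d)]  # uses actual D
    u_fit = fitted_on_two(z1, z2, u_hat)           # residuals on instruments
    stat = len(y) * dot(u_fit, u_fit) / dot(u_hat, u_hat)  # n * R^2
    p_value = math.erfc(math.sqrt(stat / 2))       # chi-squared(1) upper tail
    return beta, stat, p_value


def simulate(direct_effect_z2, n=5000, seed=19):
    """Assumed DGP: z2 is invalid whenever direct_effect_z2 is non-zero."""
    rng = random.Random(seed)
    z1, z2, d, y = [], [], [], []
    for _ in range(n):
        a, b, ability = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        d_i = 0.8 * a + 0.8 * b + ability + rng.gauss(0, 1)
        y_i = 2.0 * d_i + 3.0 * ability + direct_effect_z2 * b + rng.gauss(0, 1)
        z1.append(a)
        z2.append(b)
        d.append(d_i)
        y.append(y_i)
    return z1, z2, d, y


beta_valid, stat_valid, p_valid = sargan(*simulate(direct_effect_z2=0.0))
beta_bad, stat_bad, p_bad = sargan(*simulate(direct_effect_z2=1.0))

print(f"valid instruments:  beta = {beta_valid:.2f}, Sargan = {stat_valid:.2f}, p = {p_valid:.4f}")
print(f"z2 affects Y too:   beta = {beta_bad:.2f}, Sargan = {stat_bad:.2f}, p = {p_bad:.4f}")
```

When the second instrument has its own direct effect on $Y_i$, the statistic should be large and the test rejects. When both instruments are valid, the statistic is a draw from a $\chi^2(1)$ distribution, so any single run can still produce a small p-value by chance.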