# L9 Instrumental Variables

* L1 introduces the basic task of Econometrics: to conduct **causal inference**, i.e. to quantify causal effects in order to examine statements from economic theories.

* L2 introduces the _sensible_ **linear projection model**, which is the _best linear predictor_ of $y$ and the best linear approximation to the **CEF** ($\mathbb{E}[y \mid \boldsymbol{x}]=m(\boldsymbol{x})$). Most importantly, it points out that if the **CEF** is linear, then the **CEF** coincides with the **linear projection model**, now under the stronger assumption of **conditional mean independence** (i.e. $\mathbb{E}[e \mid \boldsymbol{x}]=0$).
  * This is valuable because it gives $\beta$ a stronger interpretation: the **effect of some $x$ on expected $y$**.
  * However, in practical work (including your projects) such a stronger assumption requires justification from theory, data availability, and so on:
    * The discussion of **omitted variables bias** suggests that researchers should include variables that are both _potentially confounding factors for $y$_ and _correlated with $\boldsymbol{x}$_. (see L2 \& L5)
    * The discussion of **Coefficient Decomposition** shows that a regression coefficient is the linear effect of some $x$ on $y$ after _stripping out_ the effects of the other variables. Therefore, if we would like to explore the **pure** effect of some $x$, we need to add the confounding variables as controls in the model. (see L3 \& L4)
    
  
* L3 illustrates the estimation algorithm called **ordinary least squares** (**OLS**), which estimates the _unknown but fixed_ $\beta$ in the **linear projection model**.

* L4 covers the basics of R, which facilitates our computations with real data.

* L5 illustrates the Gauss-Markov Theorem, which says that under the assumptions of **Random Sampling**, **Linear CEF**, and **Homoskedasticity**, the **OLS** estimators have the **GOOD** property of being **BLUE**.

* L6 discusses **statistical inference** for the _unknown but fixed_ $\beta$ in the **linear projection model** given only **ONE** specific sample, for which we need to add one more assumption of **normal errors**.

* L7 discusses **further issues** in regression analysis:
  * Standardized regression
  * Model specification ($\log()$ transformation, interaction, quadratics)
  * Model fit measurement
  * Confidence intervals for **linear CEF** and $y$

  
* L8 discusses the application of **qualitative information** and **dummy variables** in regression analysis:
  * Interpretations of the parameters on dummies or on interactions including dummies
  * Testing for Differences in Regression Functions across Groups
  * Experimental control (random assignment) and analysis
  * Quasi-experiment: DID

* One interesting short book, Angrist, J. D., & Pischke, J. S. (2014). Mastering'metrics: The path from cause to effect. Princeton University Press, just summarizes the popular applications of _Econometrics techniques_ to examine or test the **causal effects**. 

* Nowadays, testing and examining **causal effects** has become an increasingly important role of _Econometrics_, and it is the key point differentiating _Econometrics_ from _machine learning_, which mainly focuses on **prediction** or **forecasting**.

* This book is aimed at an audience **without technical backgrounds** and it is a **case-oriented introductory book**. I hope the examples in this book help you understand the **ideas and thoughts** in our course.

* The book mainly talks about the following techniques:
  * Ch 1: Randomized Trials (Brief introduction of _experimental control (random assignment)_ and analysis in L8)
  * Ch 2: Regression (We discuss this throughout our semester, especially L2, L3, L5, L6. In applications, it is vital to handle **Omitted variables bias** and implement **Sensitivity Analysis**)
  * Ch 3: Instrumental Variables (L9)
  * Ch 4: Regression Discontinuity Designs (To be determined)
  * Ch 5: Differences-in-Differences (Brief introduction in L8)

* A brief review of our logic for applying _Econometrics_ to real projects:
  * First we have a research question in mind (e.g. can some medicine cure covid-19?)
  * Put the whole statement into a **scientific framework**: we need to examine the **causal inference or causal effect** under the underlying assumption of **Ceteris paribus**, which means "all other things being equal".
  * In order to tackle this underlying assumption or challenge, we need to specify an appropriate model to **capture** such **causal inference or causal effect**, or at least something closely related to it.
  * One ideal way is to consider specifying **linear CEF**, which is $$y = \mathbb{E}[y \mid \boldsymbol{x}] + e =\boldsymbol{x}^{\prime} \boldsymbol{\beta}+ e$$ since the interpretation of some $\beta$ indicates the effects of some $x$ on expected $y$ holding other control variables included in $\mathbb{E}[y \mid \boldsymbol{x}]$ constant.
    * Such interpretation is NOT **causal inference or causal effect** in the strict sense (why?), but it is acceptable.
    * Please recall that **CEF error** has the property of **conditional mean independence**. This property is easily verified:
    $$\begin{aligned}
E[e \mid \boldsymbol{x}] &=\mathbb{E}[(y-m(\boldsymbol{x})) \mid \boldsymbol{x}] \\
&=\mathbb{E}[y \mid \boldsymbol{x}]-\mathbb{E}[m(\boldsymbol{x}) \mid \boldsymbol{x}] \\
&=m(\boldsymbol{x})-m(\boldsymbol{x}) \\
&=0
\end{aligned}$$ 
In practice, this requires $\mathbb{E}\left[e_{i} \mid \boldsymbol{x}_{i}\right]$ to **equal the same constant** for **any** possible value of $\boldsymbol{x}$, a requirement that reviewers often challenge. (e.g. if we only regress the death rate on a dummy for taking the medicine, then one may ask whether the mean age in the control group is the same as that in the experimental group)
    * From the discussion of **omitted variable bias**, we know that one good strategy is to include the **confounding factors** in the regression, which are variables _directly_ influencing $y$ **and** _directly_ correlated with the focal $x$.
    * Since **linear CEF** is also **linear projection model**, then we can just use **OLS** to get the **BLUE** for the _unknown but fixed_ $\boldsymbol{\beta}$.

* However, in practice the world is seldom that ideal, so even using **statistical control** we cannot get the _sensible_ **linear CEF**.

* In the end we have a model that involves the **causal inference or causal effect** we are interested in, but it is **neither a linear CEF nor a linear projection model**.

* So, why and what can we do?

## Introduction to endogeneity

* We say that there is **endogeneity** in the linear model 
$$
y_{i}=\boldsymbol{x}_{i}^{\prime} \boldsymbol{\beta}+e_{i} \label{eq1} \tag{1}
$$
if $\boldsymbol{\beta}$ is the _parameter of interest_ and $$
\mathbb{E}\left[\boldsymbol{x}_{i} e_{i}\right] \neq \mathbf{0} \label{eq2} \tag{2}
$$

* To distinguish $\eqref{eq1}$ from the **regression** and **projection models**, we will call it a **structural equation** and $\boldsymbol{\beta}$ a **structural parameter**. When $\eqref{eq2}$ holds, it is typical to say that $\boldsymbol{x}_i$ is **endogenous** for $\boldsymbol{\beta}$.

* Please note that **endogeneity** precludes $\eqref{eq1}$ from being a **linear CEF** or a **linear projection model** (Why?)

* Given $y_i$ and $\boldsymbol{x}_i$ in $\eqref{eq1}$, we can nearly always construct the **linear projection model** $$\begin{aligned}
y_{i} &=\boldsymbol{x}_{i}^{\prime} \boldsymbol{\beta}^{*}+e_{i}^{*} \\
\mathbb{E}\left[\boldsymbol{x}_{i} e_{i}^{*}\right] &=\boldsymbol{0}
\end{aligned}$$

* However, under **endogeneity** $\eqref{eq2}$ the **projection coefficient** $\boldsymbol{\beta}^{*}$ does not equal the **structural parameter** $\boldsymbol{\beta}$: $$\begin{aligned}
\boldsymbol{\beta}^{*} &=\left(\mathbb{E}\left[\boldsymbol{x}_{i} \boldsymbol{x}_{i}^{\prime}\right]\right)^{-1} \mathbb{E}\left[\boldsymbol{x}_{i} y_{i}\right] \\
&=\left(\mathbb{E}\left[\boldsymbol{x}_{i} \boldsymbol{x}_{i}^{\prime}\right]\right)^{-1} \mathbb{E}\left[\boldsymbol{x}_{i}\left(\boldsymbol{x}_{i}^{\prime} \boldsymbol{\beta}+e_{i}\right)\right] \\
&=\boldsymbol{\beta}+\left(\mathbb{E}\left[\boldsymbol{x}_{i} \boldsymbol{x}_{i}^{\prime}\right]\right)^{-1} \mathbb{E}\left[\boldsymbol{x}_{i} e_{i}\right] \\
& \neq \boldsymbol{\beta}
\end{aligned}$$ where the final relation holds since $\mathbb{E}\left[\boldsymbol{x}_{i} e_{i}\right] \neq \mathbf{0}$.

* Furthermore, **OLS** is the algorithm built on the **linear projection model**, so it is **consistent** for $\boldsymbol{\beta}^{*}$ but **NOT** for $\boldsymbol{\beta}$:
$$\widehat{\boldsymbol{\beta}} \underset{p}{\longrightarrow}\left(\mathbb{E}\left[\boldsymbol{x}_{i} \boldsymbol{x}_{i}^{\prime}\right]\right)^{-1} \mathbb{E}\left[\boldsymbol{x}_{i} y_{i}\right]=\boldsymbol{\beta}^{*} \neq \boldsymbol{\beta}$$ The inconsistency of least-squares is typically referred to as **endogeneity bias** or **estimation bias** due to endogeneity. (This is an imperfect label as the actual issue is inconsistency, not bias.)
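
* The inconsistency can be illustrated with a small simulation. This is only a sketch under assumed values: a scalar regressor, a structural $\beta = 1$, and standard normal draws are illustrative choices, not part of the lecture material.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta = 1.0  # structural parameter (illustrative value)

# Structural error e and a regressor x that is correlated with it.
e = rng.normal(size=n)
x = rng.normal(size=n) + 0.5 * e   # E[x e] = 0.5, E[x^2] = 1.25
y = beta * x + e                   # structural equation (1)

# OLS slope (no intercept): sample analogue of E[x x]^{-1} E[x y].
beta_ols = (x @ y) / (x @ x)

# Projection coefficient implied by the derivation: beta + E[x e] / E[x^2].
beta_star = beta + 0.5 / 1.25

print(f"OLS estimate: {beta_ols:.3f}")
print(f"beta* = {beta_star:.3f}, structural beta = {beta:.3f}")
```

With a large sample the OLS estimate should settle near $\beta^{*}$ rather than near the structural $\beta$, no matter how many observations are added.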

## Examples for structural equation

* From the previous concept of the **structural equation**, we say $\boldsymbol{\beta}$ is the _parameter of interest_.

* By this I mean that such $\boldsymbol{\beta}$ are actually the **causal inference or causal effect**, or at least closely related to it.

* However, the tricky point is that we end up with a **structural equation** instead of a **linear CEF** or a **linear projection model**.

* Here are three typical examples to illustrate why.

### Example: Measurement error in the regressor

* Suppose that $(y_i, \boldsymbol{z}_i)$ are joint random variables with a **linear CEF** (and hence a valid **linear projection model**) $$\mathbb{E}\left[y_{i} \mid \boldsymbol{z}_{i}\right]=\boldsymbol{z}_{i}^{\prime} \boldsymbol{\beta}$$ with CEF error $e_i = y_i - \boldsymbol{z}_{i}^{\prime} \boldsymbol{\beta}$.

* Following what we have learned, we can easily estimate $\boldsymbol{\beta}$ _if we can **correctly** measure $y_i$ and $\boldsymbol{z}_i$_.

* Unfortunately, we encounter the case that _$\boldsymbol{z}_i$ is not observed_. Instead we observe $$\boldsymbol{x}_i = \boldsymbol{z}_i + \boldsymbol{u}_i$$ where $\boldsymbol{u}_i$ is a $k \times 1$ vector of **measurement errors**, independent of $e_i$ and $\boldsymbol{z}_i$.

* This is an example of a **latent variable model**, where "latent" refers to a structural variable which is unobserved.

* With **classical measurement error**, i.e. when $\boldsymbol{z}_i$ and $\boldsymbol{u}_i$ are independent and $\mathbb{E}[\boldsymbol{u}_i] = \boldsymbol{0}$, $\boldsymbol{x}_i$ is a noisy but **unbiased** measure of $\boldsymbol{z}_i$.

* Finally we have the model: $$\begin{aligned}
y_{i} &=\boldsymbol{z}_{i}^{\prime} \boldsymbol{\beta}+e_{i} \\
&=\left(\boldsymbol{x}_{i}-\boldsymbol{u}_{i}\right)^{\prime} \boldsymbol{\beta}+e_{i} \\
&=\boldsymbol{x}_{i}^{\prime} \boldsymbol{\beta}+v_{i}
\end{aligned}$$
where $v_i = e_i - \boldsymbol{u}_{i}^{\prime} \boldsymbol{\beta}$ i.e.
$$y_i = \boldsymbol{x}_{i}^{\prime} \boldsymbol{\beta}+v_{i}$$

* Again, please note that $v_i$ is **NOT** a projection error: $$\mathbb{E}\left[\boldsymbol{x}_{i} v_{i}\right]=\mathbb{E}\left[\left(\boldsymbol{z}_{i}+\boldsymbol{u}_{i}\right)\left(e_{i}-\boldsymbol{u}_{i}^{\prime} \boldsymbol{\beta}\right)\right]=-\mathbb{E}\left[\boldsymbol{u}_{i} \boldsymbol{u}_{i}^{\prime}\right] \boldsymbol{\beta} \neq \mathbf{0}$$ if $\boldsymbol{\beta} \neq \mathbf{0}$ and $\mathbb{E}[\boldsymbol{u}_{i} \boldsymbol{u}_{i}^{\prime}] \neq \mathbf{0}$, so that $\boldsymbol{\beta}$ is **NOT** the **projection coefficient** and the **OLS estimators** are **biased** and **inconsistent** for it.

* **Measurement errors** are especially common in _self-reported data_, such as _survey data_.
  * E.g. psychological constructs, like happiness
  * E.g. deliberate mistakes for privacy, like monthly wage
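
* A minimal simulation sketch of this example in the scalar case. The values $\beta = 2$ and unit variances for $z_i$, $u_i$, and $e_i$ are assumptions chosen for illustration; with them, the derivation above implies $\beta^{*} = \beta - \mathbb{E}[u_i^2]\beta / \mathbb{E}[x_i^2] = \beta/2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta = 2.0                 # illustrative structural slope

z = rng.normal(size=n)     # latent regressor, variance 1
u = rng.normal(size=n)     # classical measurement error, variance 1
e = rng.normal(size=n)     # CEF error
y = beta * z + e
x = z + u                  # the mis-measured regressor we actually observe


def ols_slope(regressor, outcome):
    """No-intercept OLS slope: (sum r^2)^{-1} * sum r * outcome."""
    return (regressor @ outcome) / (regressor @ regressor)


slope_latent = ols_slope(z, y)   # feasible only if z were observed
slope_noisy = ols_slope(x, y)    # regression on the noisy measure

# Implied limit: beta - E[u^2] * beta / E[x^2] = beta * (1 - 1/2).
beta_star = beta * (1.0 - 1.0 / 2.0)

print(f"slope using latent z: {slope_latent:.3f}")
print(f"slope using noisy x:  {slope_noisy:.3f} (implied limit {beta_star:.3f})")
```

In this setup the slope on the noisy regressor is pulled toward zero, which is why the scalar case is often described as attenuation.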

### Example: Supply and Demand (simultaneous equations model (SEM))

* In Economics, we usually have **multiple conditions** to determine the relations among variables, such as _supply and demand_ system.

* The variables $q_i$ and $p_i$ (quantity and price) are determined jointly by the demand equation $$q_{i}=-\beta_{1} p_{i}+e_{1 i}$$ and the supply equation $$q_{i}=\beta_{2} p_{i}+e_{2 i}$$

* Even though economic theory describes their relation through two equations, we observe only **one set of data pairs $(p_i, q_i)$**. By this I mean we do not have separate $(p_i, q_i)$ for the _demand equation_ and separate $(p_i, q_i)$ for the _supply equation_. The question is: if we regress $q_i$ on $p_i$, what happens? Do we get the _demand function_ or the _supply function_?

* It is helpful to make a simplifying assumption to facilitate the discussion: $\boldsymbol{e}_{i}=\left(\begin{array}{c}
e_{1 i} \\
e_{2 i}
\end{array}\right)$ is **i.i.d.** with $\mathbb{E}[\boldsymbol{e}_{i}] = \mathbf{0}$ and $\mathbb{E}[\boldsymbol{e}_{i} \boldsymbol{e}_{i}^\prime] = \boldsymbol{I}_2$ (the latter for simplicity).


* It is also helpful to solve for $(q_i, p_i)$ in terms of the errors and check: $$\begin{aligned}
\left[\begin{array}{cc}
1 & \beta_{1} \\
1 & -\beta_{2}
\end{array}\right]\left(\begin{array}{c}
q_{i} \\
p_{i}
\end{array}\right)=\left(\begin{array}{c}
e_{1 i} \\
e_{2 i}
\end{array}\right)
\end{aligned}$$ so 

$$
\begin{aligned}
\left(\begin{array}{c}
q_{i} \\
p_{i}
\end{array}\right) &=\left[\begin{array}{cc}
1 & \beta_{1} \\
1 & -\beta_{2}
\end{array}\right]^{-1}\left(\begin{array}{c}
e_{1 i} \\
e_{2 i}
\end{array}\right) \\
&=\left[\begin{array}{cc}
\beta_{2} & \beta_{1} \\
1 & -1
\end{array}\right]\left(\begin{array}{c}
e_{1 i} \\
e_{2 i}
\end{array}\right)\left(\frac{1}{\beta_{1}+\beta_{2}}\right) \\
&=\left(\begin{array}{c}
\left(\beta_{2} e_{1 i}+\beta_{1} e_{2 i}\right) /\left(\beta_{1}+\beta_{2}\right) \\
\left(e_{1 i}-e_{2 i}\right) /\left(\beta_{1}+\beta_{2}\right)
\end{array}\right)
\end{aligned}$$

* For comparison, we construct the **linear projection model**: $$\begin{aligned}
q_{i} &=\beta^{*} p_{i}+e_{i}^{*} \\
\mathbb{E}\left[p_{i} e_{i}^{*}\right] &=0
\end{aligned}$$ where $$\beta^{*}=\frac{\mathbb{E}\left[p_{i} q_{i}\right]}{\mathbb{E}\left[p_{i}^{2}\right]}=\frac{\beta_{2}-\beta_{1}}{2}$$
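
* As a sketch of the intermediate step, using only the solved system for $(q_i, p_i)$ together with $\mathbb{E}[e_{1i}^2]=\mathbb{E}[e_{2i}^2]=1$ and $\mathbb{E}[e_{1i}e_{2i}]=0$ from the simplifying assumption, the two moments are $$\begin{aligned}
\mathbb{E}\left[p_{i} q_{i}\right] &=\frac{\mathbb{E}\left[\left(e_{1 i}-e_{2 i}\right)\left(\beta_{2} e_{1 i}+\beta_{1} e_{2 i}\right)\right]}{\left(\beta_{1}+\beta_{2}\right)^{2}}=\frac{\beta_{2}-\beta_{1}}{\left(\beta_{1}+\beta_{2}\right)^{2}} \\
\mathbb{E}\left[p_{i}^{2}\right] &=\frac{\mathbb{E}\left[\left(e_{1 i}-e_{2 i}\right)^{2}\right]}{\left(\beta_{1}+\beta_{2}\right)^{2}}=\frac{2}{\left(\beta_{1}+\beta_{2}\right)^{2}}
\end{aligned}$$ and their ratio is $(\beta_2 - \beta_1)/2$.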

* Thus the projection coefficient $\beta^*$ equals **neither the demand slope $-\beta_1$ nor the supply slope $\beta_2$**, but equals an average of the two. (The fact that it is a simple average is an artifact of the simple covariance structure.)

* Hence the OLS estimate satisfies $\widehat{\beta} \underset{p}{\longrightarrow} \beta^{*}$, and the limit equals neither the demand slope nor the supply slope.

* The fact that the limit is neither the supply nor the demand slope is called **simultaneous equations bias**. This occurs generally when $y_i$ and $x_i$ are **jointly determined**, as in a market equilibrium.
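
* The following sketch simulates the equilibrium $(q_i, p_i)$ from the solved system and regresses quantity on price. The values $\beta_1 = 1$ and $\beta_2 = 2$ and the normal errors are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta1, beta2 = 1.0, 2.0    # illustrative values; demand slope is -beta1

# Independent unit-variance errors, so E[e e'] = I_2.
e1 = rng.normal(size=n)
e2 = rng.normal(size=n)

# Equilibrium quantity and price from the solved system.
q = (beta2 * e1 + beta1 * e2) / (beta1 + beta2)
p = (e1 - e2) / (beta1 + beta2)

# The simulated pairs satisfy both structural equations.
demand_ok = np.allclose(q, -beta1 * p + e1)
supply_ok = np.allclose(q, beta2 * p + e2)

# OLS of q on p (no intercept) and the projection coefficient it targets.
slope = (p @ q) / (p @ p)
beta_star = (beta2 - beta1) / 2

print(f"OLS slope: {slope:.3f}, beta* = {beta_star:.3f}")
print(f"demand slope: {-beta1}, supply slope: {beta2}")
```

The fitted slope should land near $(\beta_2-\beta_1)/2$, so a single regression of quantity on price recovers neither curve.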


* In practice, if we have concerns that _the dependent variable and a regressor are simultaneously determined_ or _that they can theoretically affect each other in different scenarios_, then the variables should be treated as **endogenous**.
  * Another example is the simple _consumption function_: $$c_i = \beta_0 + \beta_1 \text{income}_i + e_i$$
  * Income, i.e. the available disposable income, determines the consumption level.
  * However, consumption can determine saving and investment, and thus affect income.

### Example: Choice Variables as Regressors

* Take the classic wage equation: $$\log (\text {wage})= \beta \text{education }+e$$ with $\beta$ the **average causal effect** of education on wages (such a brave assumption!)

* If wages are affected by _**unobserved** ability_, and individuals with high ability self-select into higher education, then $e$ contains unobserved ability, so education and $e$ will be _positively correlated_.

* Hence education is **endogenous**.

* Wait! This is a special case of **omitted variable bias**, and we can slightly change the notation in a familiar way:
  * Suppose the ideal **linear CEF model** is $$\begin{aligned}
y &=\boldsymbol{x}^{\prime} \boldsymbol{\beta}+e \\
&=\boldsymbol{x}_{1}^{\prime} \boldsymbol{\beta}_{1}+\boldsymbol{x}_{2}^{\prime} \boldsymbol{\beta}_{2}+e \\
\mathbb{E}[e \mid \boldsymbol{x}] &=0
\end{aligned}$$
  * Now the researcher collects data ONLY on $y$ and $\boldsymbol{x}_1$ but misses $\boldsymbol{x}_2$, perhaps because of overlooking the related literature or because of data availability. Therefore, he can only construct the **linear projection model**: $$\begin{aligned}
y &=\boldsymbol{x}_{1}^{\prime} \boldsymbol{\gamma}_{1}+u \\
\mathbb{E}\left[\boldsymbol{x}_{1} u\right] &=\mathbf{0}
\end{aligned}$$
  * With tedious derivation we can see: $$\begin{aligned}
\boldsymbol{\gamma}_{1} &=\left(\mathbb{E}\left[\boldsymbol{x}_{1} \boldsymbol{x}_{1}^{\prime}\right]\right)^{-1} \mathbb{E}\left[\boldsymbol{x}_{1} y\right] \\
&=\left(\mathbb{E}\left[\boldsymbol{x}_{1} \boldsymbol{x}_{1}^{\prime}\right]\right)^{-1} \mathbb{E}\left[\boldsymbol{x}_{1}\left(\boldsymbol{x}_{1}^{\prime} \boldsymbol{\beta}_{1}+\boldsymbol{x}_{2}^{\prime} \boldsymbol{\beta}_{2}+e\right)\right] \\
&=\boldsymbol{\beta}_{1}+\left(\mathbb{E}\left[\boldsymbol{x}_{1} \boldsymbol{x}_{1}^{\prime}\right]\right)^{-1} \mathbb{E}\left[\boldsymbol{x}_{1} \boldsymbol{x}_{2}^{\prime}\right] \boldsymbol{\beta}_{2} \\
&=\boldsymbol{\beta}_{1}+\boldsymbol{\Gamma}_{12} \boldsymbol{\beta}_{2}
\end{aligned}$$ where the third equality uses $\mathbb{E}[\boldsymbol{x}_1 e]=\mathbf{0}$ (implied by $\mathbb{E}[e \mid \boldsymbol{x}]=0$), and $$\boldsymbol{\Gamma}_{12} = \left(\mathbb{E}\left[\boldsymbol{x}_1 \boldsymbol{x}_1^{\prime}\right]\right)^{-1} \mathbb{E}[\boldsymbol{x}_1 \boldsymbol{x}_2^{\prime}]$$ is the coefficient matrix from a projection of $\boldsymbol{x}_2$ on $\boldsymbol{x}_1$, capturing the correlation between $\boldsymbol{x}_2$ and $\boldsymbol{x}_1$.
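
* The decomposition $\boldsymbol{\gamma}_1 = \boldsymbol{\beta}_1 + \boldsymbol{\Gamma}_{12}\boldsymbol{\beta}_2$ can be checked numerically. The sketch below assumes scalar $x_1$ (think of education) and $x_2$ (think of ability) with made-up coefficients; none of these numbers come from real data.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
beta1, beta2 = 0.1, 0.5   # illustrative long-regression coefficients

x2 = rng.normal(size=n)                 # omitted variable (e.g. ability)
x1 = 0.8 * x2 + rng.normal(size=n)      # included regressor, correlated with x2
e = rng.normal(size=n)
y = beta1 * x1 + beta2 * x2 + e

# Long regression (infeasible when x2 is unobserved).
X = np.column_stack([x1, x2])
b_long, *_ = np.linalg.lstsq(X, y, rcond=None)

# Short regression of y on x1 only, and the projection of x2 on x1.
gamma1_hat = (x1 @ y) / (x1 @ x1)
gamma12_hat = (x1 @ x2) / (x1 @ x1)

# Sample version of gamma_1 = beta_1 + Gamma_12 * beta_2.
implied = b_long[0] + gamma12_hat * b_long[1]

print(f"short-regression slope: {gamma1_hat:.4f}")
print(f"beta1_hat + Gamma12_hat * beta2_hat: {implied:.4f}")
```

In sample the two printed numbers agree up to floating-point error, because the long-regression residual is orthogonal to $x_1$ by construction.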

* In the case of _wage_ and _education_, we **omit** _ability_, which is _positively correlated_ with _education_ and raises wages. Therefore, the **linear projection coefficient** $\beta^*$ will be **upward biased** relative to the **structural coefficient** $\beta$.

* From the previous discussion, if we can add the **omitted variables** as controls, then we are safer from **omitted variable bias**. However, in this example the **omitted variable** is _ability_, which is _unobservable_ and _difficult to measure_.

* This type of **endogeneity** occurs generally when $y$ and $x$ are both choices made by an economic agent, even if they are made at different points in time.

* Generally, when both the dependent variable and a regressor are choice variables made by the same agent, the variables should be treated as **endogenous**.

## Instrumental Variables (IVs) and Identification

* **Linear CEF** is the ideal way to examine the **causal effect**.

* We also know that adding _confounding variables_ as control can help alleviate **omitted variables bias**.

* However, **statistical control** cannot guarantee a **linear CEF** (it is only a necessary condition), and the examples above show the **usual** reasons why the $\boldsymbol{\beta}$ of interest (at least closely related to the **causal effect**) ends up embedded in a **structural equation** instead.

* Please recall the previous concept of the **structural equation**: $$y_{i}=\boldsymbol{x}_{i}^{\prime} \boldsymbol{\beta}+e_{i}$$
$$\mathbb{E}\left[\boldsymbol{x}_{i} e_{i}\right] \neq \mathbf{0}$$ which is neither a **linear CEF** nor a **linear projection model**.

* The first problem facing us is how to **identify** $\boldsymbol{\beta}$ in the **structural equation**.

* **Identification** of _unknown_ parameters in the model means that we can get the **unique** solution from the conditions or model setups theoretically.

* Recall our learning path for the **linear projection model**: I always emphasize that $\boldsymbol{\beta}$ is not arbitrary and that we impose requirements on it (what requirements?). In order to **identify**, i.e. to have a **unique** solution for, the _unknown_ parameters in the **linear projection model**, the model needs to satisfy

  1. $\mathbb{E}\left[y^{2}\right]<\infty$
  2. $\mathbb{E}\left[\|\boldsymbol{x}\|^{2}\right]<\infty$
  3. $\boldsymbol{Q}_{x x}=\mathbb{E}\left[\boldsymbol{x} \boldsymbol{x}^{\prime}\right]$ is positive definite
  
  Given these conditions, the optimization problem can have **unique** solution: $$\boldsymbol{\beta}=\left(\mathbb{E}\left[\boldsymbol{x} \boldsymbol{x}^{\prime}\right]\right)^{-1} \mathbb{E}[\boldsymbol{x} y]$$
  
* **Identification** should be discussed and related conditions should be satisfied before the estimation algorithm on the _unknown but fixed_ parameters.

### Instrumental Variables (IVs)

* In order to **identify** the _unknown_ $\boldsymbol{\beta}$ in the **structural model**, we need the help of **IVs** to convert the **structural model** into two **linear projection models**.

* It is also helpful to divide the **regressors** in the **structural model** into two groups: **exogenous variables**, which are determined outside the system and _uncorrelated_ with the **structural errors**, and **endogenous variables**, which are _correlated_ with the **structural errors**: $$\boldsymbol{x}_{i}=\left(\begin{array}{c}
\boldsymbol{x}_{1 i} \\
\boldsymbol{x}_{2 i}
\end{array}\right) \begin{array}{l}
k_{1} \\
k_{2}
\end{array}$$ and similarly $$\boldsymbol{\beta}=\left(\begin{array}{c}
\boldsymbol{\beta}_{1} \\
\boldsymbol{\beta}_{2}
\end{array}\right) \begin{array}{l}
k_{1} \\
k_{2}
\end{array}$$ so that the **structural equation** is $$\begin{aligned}
y_{i} &=\boldsymbol{x}_{i}^{\prime} \boldsymbol{\beta}+e_{i} \\
&=\boldsymbol{x}_{1 i}^{\prime} \boldsymbol{\beta}_{1}+\boldsymbol{x}_{2 i}^{\prime} \boldsymbol{\beta}_{2}+e_{i}
\end{aligned}$$ The regressors are assumed to satisfy $$\begin{array}{l}
\mathbb{E}\left[\boldsymbol{x}_{1 i} e_{i}\right]=\boldsymbol{0} \\
\mathbb{E}\left[\boldsymbol{x}_{2 i} e_{i}\right] \neq \mathbf{0}
\end{array}$$

* We call $\boldsymbol{x}_{1 i}$ **exogenous** and $\boldsymbol{x}_{2 i}$ **endogenous** for the **structural parameter** $\boldsymbol{\beta}$. As the dependent variable $y_i$ is also **endogenous**, we sometimes differentiate $\boldsymbol{x}_{2 i}$ by calling $\boldsymbol{x}_{2 i}$ the **endogenous right-hand-side variables**.

* In most applications we only treat a small subset of the regressors (usually the focal $x$ and some other) as **endogenous**; _most_ of the regressors will be treated as **exogenous**.

* The $l \times 1$ random vector $\boldsymbol{z}_i$ is a set of **instrumental variables** for the above **structural equation** if
$$  
\mathbb{E}\left[\boldsymbol{z}_{i} e_{i}\right] =\mathbf{0} \label{eq3} \tag{3}
$$
$$
\mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{z}_{i}^{\prime}\right] \text{ is positive definite} \label{eq4} \tag{4}
$$
$$
\operatorname{rank}\left(\mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{x}_{i}^{\prime}\right]\right) =k \label{eq5} \tag{5}
$$

* Let's check each of the conditions:
  * $\eqref{eq3}$ states that the **instruments** are uncorrelated with the **structural error**; this is often expressed by saying that they are **exogenous**, in the sense that they are determined outside the model for $y_i$.
    * Notice that the regressors $\boldsymbol{x}_{1i}$ satisfy condition $\eqref{eq3}$ and thus **should** be included as **instrumental variables**. It is thus a subset of the variables $\boldsymbol{z}_{i}$ . Notationally we make the partition: $$\boldsymbol{z}_{i}=\left(\begin{array}{c}
\boldsymbol{z}_{1 i} \\
\boldsymbol{z}_{2 i}
\end{array}\right)=\left(\begin{array}{c}
\boldsymbol{x}_{1 i} \\
\boldsymbol{z}_{2 i}
\end{array}\right) \begin{array}{l}
k_{1} \\
l_{2}
\end{array}$$
    * Here, $\boldsymbol{x}_{1 i} = \boldsymbol{z}_{1 i}$ are the **included exogenous variables**, and $\boldsymbol{z}_{2 i}$ are the **excluded exogenous variables**. 
    * Many authors simply label $\boldsymbol{x}_{1 i}$ as the "exogenous variables", $\boldsymbol{x}_{2 i}$ as the "endogenous variables", and $\boldsymbol{z}_{2 i}$ as the "instrumental variables".
    * We say that the model is **just-identified** if $l = k$ ($l_2 = k_2$) and **over-identified** if $l > k$ ($l_2 > k_2$).
  * $\eqref{eq4}$ is a normalization which excludes linearly redundant instruments.
  * $\eqref{eq5}$ is often called the **relevance condition** and is essential for the **identification** of the model, as we discuss later. A necessary condition for $\eqref{eq5}$ is that $l \ge k$. This condition requires that every **endogenous** variable in $\boldsymbol{x}_{2 i}$ be correlated with at least one of the **instrumental variables** in $\boldsymbol{z}_{2 i}$.

* Difference between _finding **control variables**_ and _finding **instruments**_?
  * In order to alleviate **omitted variables bias**, we usually think about _finding **control variables**_, which are directly correlated with the focal $x$ and directly influence $y$.
  * In order to handle **endogeneity**, we usually think about _finding **instruments**_, which are determined outside the system for $(y_i ,\boldsymbol{x}_{2i} )$, causally determine $\boldsymbol{x}_{2i}$, but do not causally determine $y_i$ except through $\boldsymbol{x}_{2i}$.

* Examples for **IVs**
  * Measurement error in the regressor: When $\boldsymbol{x}_{i}$ is a mis-measured version of the latent regressor $\boldsymbol{z}_{i}$, a common choice for an **instrument** $\boldsymbol{z}_{2i}$ is an alternative measurement of that latent regressor. For this $\boldsymbol{z}_{2i}$ to satisfy the property of an instrumental variable, the measurement error in $\boldsymbol{z}_{2i}$ must be independent of that in $\boldsymbol{x}_{i}$.
  * Supply and Demand: An appropriate instrument for price $p_i$ in a _demand equation_ is a variable $\boldsymbol{z}_{2i}$ which influences _supply_ but not _demand_.
  * Choice Variable as Regressor: An ideal instrument affects the choice of the regressor (education) but does not directly influence the dependent variable (wages) except through the indirect effect on the regressor. _College Proximity_ is a potential choice:
    * If a potential student lives close to a college, this reduces the cost of attendance and thereby raises the likelihood that the student will attend college.
    * However, college proximity does not directly affect a student’s skills or abilities, so should not have a direct effect on his or her market wage.

### Reduced Form

* In this subsection, we will discuss how the **IVs** satisfying the above conditions can help **identify** $\boldsymbol{\beta}$ of our interest in **structural equation**.
 
* We can construct _two_ **linear projection models** by expressing the **endogenous variables** in terms of the **IVs** (and **exogenous variables**).

* A linear reduced form model for $\boldsymbol{x}_{i}$ is $$\boldsymbol{x}_{i}=\boldsymbol{\Gamma}^{\prime} \boldsymbol{z}_{i}+\boldsymbol{u}_{i}$$ which is a **multivariate regression**.
  * The $l \times k$ coefficient matrix $\boldsymbol{\Gamma}$ can be defined by **linear projection**. Thus $$\boldsymbol{\Gamma}=\mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{z}_{i}^{\prime}\right]^{-1} \mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{x}_{i}^{\prime}\right]$$ so that $$\mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{u}_{i}^{\prime}\right]=\mathbf{0}$$

* We can also construct a reduced form equation for $y_i$ by substituting $\boldsymbol{\Gamma}^{\prime} \boldsymbol{z}_{i}+\boldsymbol{u}_{i}$ for $\boldsymbol{x}_{i}$ in the **structural equation**: $$\begin{aligned}
y_{i} &=\left(\boldsymbol{\Gamma}^{\prime} \boldsymbol{z}_{i}+\boldsymbol{u}_{i}\right)^{\prime} \boldsymbol{\beta}+e_{i} \\
&=\boldsymbol{z}_{i}^{\prime} \boldsymbol{\lambda}+v_{i}
\end{aligned}$$ where $$\lambda=\Gamma \boldsymbol{\beta}$$ and $$v_{i}=\boldsymbol{u}_{i}^{\prime} \boldsymbol{\beta}+e_{i}$$
  * Observe that $$\mathbb{E}\left[\boldsymbol{z}_{i} v_{i}\right]=\mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{u}_{i}^{\prime}\right] \boldsymbol{\beta}+\mathbb{E}\left[\boldsymbol{z}_{i} e_{i}\right]=\mathbf{0}$$
  * Since it is a projection equation we can write the reduced form coefficient as $$\boldsymbol{\lambda}=\mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{z}_{i}^{\prime}\right]^{-1} \mathbb{E}\left[\boldsymbol{z}_{i} y_{i}\right]$$
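
* A numerical sketch of the two reduced forms. The data-generating process is an assumption made up for illustration: one included exogenous regressor, one endogenous regressor, two excluded instruments ($k = 2$, $l = 3$), and $\boldsymbol{\beta} = (1, 0.5)^{\prime}$. Under it, the population values are $\boldsymbol{\Gamma} = \begin{pmatrix}1 & 0.5\\ 0 & 1\\ 0 & -1\end{pmatrix}$ and $\boldsymbol{\lambda} = \boldsymbol{\Gamma}\boldsymbol{\beta} = (1.25, 0.5, -0.5)^{\prime}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
beta = np.array([1.0, 0.5])   # illustrative structural parameter

x1 = rng.normal(size=n)               # included exogenous regressor
z2 = rng.normal(size=(n, 2))          # excluded instruments
e = rng.normal(size=n)                # structural error
# x2 depends on the instruments and on e, so it is endogenous.
x2 = 0.5 * x1 + z2 @ np.array([1.0, -1.0]) + 0.8 * e + rng.normal(size=n)

X = np.column_stack([x1, x2])         # n x k
Z = np.column_stack([x1, z2])         # n x l
y = X @ beta + e

# Sample analogues of Gamma = E[z z']^{-1} E[z x'] and lambda = E[z z']^{-1} E[z y].
Gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)    # l x k
lambda_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)   # length l

print("Gamma_hat:\n", Gamma_hat.round(3))
print("lambda_hat:     ", lambda_hat.round(3))
print("Gamma_hat @ beta:", (Gamma_hat @ beta).round(3))
```

Both reduced forms are ordinary projections on $\boldsymbol{z}_i$, so plain least squares is enough to estimate them even though $x_2$ is endogenous in the structural equation.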

### Identification

* Please recall that a parameter is **identified** if it is a **unique** function of the probability distribution of the observables.

* For example, the reduced form coefficient matrices $\boldsymbol{\Gamma}$ and $\boldsymbol{\lambda}$ are **identified** since they can be written as explicit functions of the moments of the observables $(y_i ,\boldsymbol{x}_i , \boldsymbol{z}_i )$. That is, $$\boldsymbol{\Gamma}=\mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{z}_{i}^{\prime}\right]^{-1} \mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{x}_{i}^{\prime}\right]$$ and $$\boldsymbol{\lambda}=\mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{z}_{i}^{\prime}\right]^{-1} \mathbb{E}\left[\boldsymbol{z}_{i} y_{i}\right]$$ provided that condition (4) holds.

* We are interested in the **structural parameter** $\boldsymbol{\beta}$, which relates to $(\boldsymbol{\Gamma}, \boldsymbol{\lambda})$ through $$\boldsymbol{\lambda}=\boldsymbol{\Gamma} \boldsymbol{\beta}$$ which, after multiplying both sides by $\mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{z}_{i}^{\prime}\right]$, is the same as $$\mathbb{E}\left[\boldsymbol{z}_{i} y_{i}\right]=\mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{x}_{i}^{\prime}\right] \boldsymbol{\beta}$$

* From standard linear algebra we know that there is a **unique** solution if and only if $\boldsymbol{\Gamma}$ has full rank $k$: $$\text{rank}(\boldsymbol{\Gamma}) = \text{rank}(\mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{x}_{i}^{\prime}\right])=k$$ which is the **relevance condition** (5).

* To translate this **relevance condition** into plain language:
  1) The number of **IVs** (excluding **exogenous variables**) should be at least equal to the number of **endogenous variables**.
  2) **IVs** (excluding **exogenous variables**) should be correlated to **endogenous variables**.

* It is useful to have explicit expressions for the solution $\boldsymbol{\beta}$
  * When $l = k$ (**just-identified**), $\boldsymbol{\Gamma}$ is a square matrix and the **relevance condition** implies that it is invertible, so the **structural parameter** is $$\boldsymbol{\beta} = \boldsymbol{\Gamma}^{-1} \boldsymbol{\lambda}$$
  * When $l>k$ (**over-identified**), please note that $\boldsymbol{\Gamma}$ is NOT a square matrix and can NOT be directly inverted. However, under the **relevance condition** we can **uniquely** solve for $\boldsymbol{\beta}$ by applying least-squares to the system of equations $$\boldsymbol{\lambda}=\boldsymbol{\Gamma} \boldsymbol{\beta}$$ and the solution is $$\boldsymbol{\beta}=\left(\boldsymbol{\Gamma}^{\prime} \boldsymbol{\Gamma}\right)^{-1} \boldsymbol{\Gamma}^{\prime} \boldsymbol{\lambda}$$
  * Note that the expression $$\boldsymbol{\beta}=\left(\boldsymbol{\Gamma}^{\prime} \boldsymbol{\Gamma}\right)^{-1} \boldsymbol{\Gamma}^{\prime} \boldsymbol{\lambda}$$ reduces to $\boldsymbol{\beta} = \boldsymbol{\Gamma}^{-1} \boldsymbol{\lambda}$ in the case of $l = k$ (**just-identified**), since then $\left(\boldsymbol{\Gamma}^{\prime} \boldsymbol{\Gamma}\right)^{-1} \boldsymbol{\Gamma}^{\prime}=\boldsymbol{\Gamma}^{-1}\left(\boldsymbol{\Gamma}^{\prime}\right)^{-1} \boldsymbol{\Gamma}^{\prime}=\boldsymbol{\Gamma}^{-1}$. Therefore this expression is the general expression for $\boldsymbol{\beta}$.
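
* These formulas can be checked on small matrices. The sketch below works with population quantities only (no data); the particular $\boldsymbol{\Gamma}$ matrices and $\boldsymbol{\beta}$ are hypothetical numbers chosen so that the rank condition holds in the first two cases and fails in the third.

```python
import numpy as np

beta = np.array([1.0, 0.5])   # illustrative structural parameter, k = 2

# Over-identified case: l = 3, Gamma is 3 x 2 with full column rank.
Gamma_over = np.array([[1.0, 0.5],
                       [0.0, 1.0],
                       [0.0, -1.0]])
lam_over = Gamma_over @ beta
beta_over = np.linalg.solve(Gamma_over.T @ Gamma_over, Gamma_over.T @ lam_over)

# Just-identified case: l = 2, Gamma is square and invertible.
Gamma_just = Gamma_over[:2, :]
lam_just = Gamma_just @ beta
beta_just = np.linalg.solve(Gamma_just, lam_just)   # Gamma^{-1} lambda
beta_just_general = np.linalg.solve(Gamma_just.T @ Gamma_just,
                                    Gamma_just.T @ lam_just)

# Relevance fails: the excluded instruments do not move x2, so rank < k
# and lambda = Gamma beta no longer pins down a unique beta.
Gamma_weak = np.array([[1.0, 0.5],
                       [0.0, 0.0],
                       [0.0, 0.0]])
rank_over = np.linalg.matrix_rank(Gamma_over)
rank_weak = np.linalg.matrix_rank(Gamma_weak)

print(beta_over, beta_just, beta_just_general)
print(rank_over, rank_weak)
```

The first line of output shows the same $\boldsymbol{\beta}$ recovered three ways; the ranks on the second line are 2 and 1, so only the first $\boldsymbol{\Gamma}$ satisfies the relevance condition.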

## Two-Stage Least Squares

* Up until now we have discussed the reasons
  * why we end up with parameters $\boldsymbol{\beta}$ embedded in a **structural equation** even if we try to control for **omitted variables bias**.
  * why the **linear projection model** fails for the **structural equation**, and correspondingly why **OLS estimators** are biased and inconsistent for the $\boldsymbol{\beta}$ embedded in the **structural equation**.
  * why **IVs** satisfying conditions (3) (4) (5) can help **identify** the $\boldsymbol{\beta}$ embedded in the **structural equation**.

* The discussion of the reduced form hints that we can use two-step **linear projection models** to **identify** $\boldsymbol{\beta}$. Similarly, we can follow **Two-Stage Least Squares** (**2SLS**) to get a **consistent estimator** for $\boldsymbol{\beta}$.

* In order to obtain the expression of the **2SLS estimator** (now we allow the general case of $l \ge k$), we begin with the second stage **linear projection model**: $$\begin{aligned}
y_{i} &=\boldsymbol{z}_{i}^{\prime} \boldsymbol{\Gamma} \boldsymbol{\beta}+v_{i} \\
\mathbb{E}\left[\boldsymbol{z}_{i} v_{i}\right] &=\mathbf{0}
\end{aligned}$$ Defining $\boldsymbol{w}_i = \boldsymbol{\Gamma}^{\prime} \boldsymbol{z}_i$ we can write this as $$\begin{aligned}
y_{i} &=\boldsymbol{w}_{i}^{\prime} \boldsymbol{\beta}+v_{i} \\
\mathbb{E}\left[\boldsymbol{w}_{i} v_{i}\right] &=\mathbf{0}
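\end{aligned}$$ Therefore, stacking the observations so that $\boldsymbol{W}=\boldsymbol{Z} \boldsymbol{\Gamma}$, the **OLS estimator** of $\boldsymbol{\beta}$ (if $\boldsymbol{\Gamma}$ were known) is $$\begin{aligned}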
\widehat{\boldsymbol{\beta}} &=\left(\boldsymbol{W}^{\prime} \boldsymbol{W}\right)^{-1}\left(\boldsymbol{W}^{\prime} \boldsymbol{y}\right) \\
&=\left(\boldsymbol{\Gamma}^{\prime} \boldsymbol{Z}^{\prime} \boldsymbol{Z} \boldsymbol{\Gamma}\right)^{-1}\left(\boldsymbol{\Gamma}^{\prime} \boldsymbol{Z}^{\prime} \boldsymbol{y}\right)
\end{aligned}$$

* In practice $\boldsymbol{\Gamma}$ is unknown, so we estimate it from the first-stage reduced form regression: $\widehat{\Gamma}=\left(Z^{\prime} Z\right)^{-1}\left(Z^{\prime} X\right)$. Plugging $\widehat{\Gamma}$ in, we finally obtain $$\begin{aligned}
\widehat{\boldsymbol{\beta}}_{2 \mathrm{sls}} &=\left(\widehat{\mathbf{\Gamma}}^{\prime} \boldsymbol{Z}^{\prime} \boldsymbol{Z} \widehat{\mathbf{\Gamma}}\right)^{-1}\left(\widehat{\mathbf{\Gamma}}^{\prime} \boldsymbol{Z}^{\prime} \boldsymbol{y}\right) \\
&=\left(\boldsymbol{X}^{\prime} \boldsymbol{Z}\left(\boldsymbol{Z}^{\prime} \boldsymbol{Z}\right)^{-1} \boldsymbol{Z}^{\prime} \boldsymbol{Z}\left(\boldsymbol{Z}^{\prime} \boldsymbol{Z}\right)^{-\mathbf{1}} \boldsymbol{Z}^{\prime} \boldsymbol{X}\right)^{-1} \boldsymbol{X}^{\prime} \boldsymbol{Z}\left(\boldsymbol{Z}^{\prime} \boldsymbol{Z}\right)^{-1} \boldsymbol{Z}^{\prime} \boldsymbol{y} \\
&=\left(\boldsymbol{X}^{\prime} \boldsymbol{Z}\left(\boldsymbol{Z}^{\prime} \boldsymbol{Z}\right)^{-1} \boldsymbol{Z}^{\prime} \boldsymbol{X}\right)^{-1} \boldsymbol{X}^{\prime} \boldsymbol{Z}\left(\boldsymbol{Z}^{\prime} \boldsymbol{Z}\right)^{-1} \boldsymbol{Z}^{\prime} \boldsymbol{y}
\end{aligned}$$
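
The closed-form expression above can be checked against the literal two-step procedure. The following is a minimal sketch on simulated data; the data-generating process, the coefficient values and all variable names are assumptions made purely for illustration.

```r
set.seed(1)
n  <- 500
z1 <- rnorm(n); z2 <- rnorm(n)                   # two instruments
u  <- rnorm(n)                                   # structural error
x  <- 0.8 * z1 + 0.5 * z2 + 0.7 * u + rnorm(n)   # endogenous regressor
y  <- 1 + 2 * x + u                              # structural equation, true slope = 2

X <- cbind(1, x)        # n x k with k = 2
Z <- cbind(1, z1, z2)   # n x l with l = 3 (constant plus two instruments)

# Closed form: (X'Z (Z'Z)^{-1} Z'X)^{-1} X'Z (Z'Z)^{-1} Z'y
ZZ_inv_ZX <- solve(crossprod(Z), crossprod(Z, X))   # (Z'Z)^{-1} Z'X, i.e. Gamma-hat
ZZ_inv_Zy <- solve(crossprod(Z), crossprod(Z, y))   # (Z'Z)^{-1} Z'y
beta_2sls <- solve(crossprod(X, Z) %*% ZZ_inv_ZX,
                   crossprod(X, Z) %*% ZZ_inv_Zy)

# Two-step version: regress x on the instruments, then y on the fitted x
x_hat         <- fitted(lm(x ~ z1 + z2))
beta_two_step <- coef(lm(y ~ x_hat))

print(cbind(closed_form = as.numeric(beta_2sls), two_step = beta_two_step))
```

The two columns should agree up to floating-point error, since both compute the same projection.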

* Once again, when $k = l$ then $$\begin{aligned}
\widehat{\boldsymbol{\beta}}_{2 \mathrm{sls}} &=\left(\boldsymbol{X}^{\prime} \boldsymbol{Z}\left(\boldsymbol{Z}^{\prime} \boldsymbol{Z}\right)^{-1} \boldsymbol{Z}^{\prime} \boldsymbol{X}\right)^{-1} \boldsymbol{X}^{\prime} \boldsymbol{Z}\left(\boldsymbol{Z}^{\prime} \boldsymbol{Z}\right)^{-1} \boldsymbol{Z}^{\prime} \boldsymbol{y} \\
&=\left(\boldsymbol{Z}^{\prime} \boldsymbol{X}\right)^{-1}\left(\boldsymbol{Z}^{\prime} \boldsymbol{Z}\right)\left(\boldsymbol{X}^{\prime} \boldsymbol{Z}\right)^{-1} \boldsymbol{X}^{\prime} \boldsymbol{Z}\left(\boldsymbol{Z}^{\prime} \boldsymbol{Z}\right)^{-1} \boldsymbol{Z}^{\prime} \boldsymbol{y} \\
&=\left(\boldsymbol{Z}^{\prime} \boldsymbol{X}\right)^{-1}\left(\boldsymbol{Z}^{\prime} \boldsymbol{Z}\right)\left(\boldsymbol{Z}^{\prime} \boldsymbol{Z}\right)^{-1} \boldsymbol{Z}^{\prime} \boldsymbol{y} \\
&=\left(\boldsymbol{Z}^{\prime} \boldsymbol{X}\right)^{-1} \boldsymbol{Z}^{\prime} \boldsymbol{y} \\
&=\widehat{\boldsymbol{\beta}}_{\mathrm{iv}}
\end{aligned}$$
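
The algebra for the just-identified case can also be verified numerically. This sketch again relies on simulated data with assumed coefficients; `beta_iv` and `beta_2sls` are hypothetical names introduced here.

```r
set.seed(2)
n <- 400
z <- rnorm(n)
u <- rnorm(n)
x <- 1 + 0.9 * z + 0.6 * u + rnorm(n)   # endogenous regressor
y <- 0.5 + 1.5 * x + u

X <- cbind(1, x)   # k = 2
Z <- cbind(1, z)   # l = 2, so the model is just-identified

# IV estimator: (Z'X)^{-1} Z'y
beta_iv <- solve(crossprod(Z, X), crossprod(Z, y))

# 2SLS estimator: (X'P_Z X)^{-1} X'P_Z y with P_Z = Z (Z'Z)^{-1} Z'
PZX       <- Z %*% solve(crossprod(Z), crossprod(Z, X))   # P_Z X
beta_2sls <- solve(crossprod(PZX, X), crossprod(PZX, y))

print(cbind(iv = as.numeric(beta_iv), tsls = as.numeric(beta_2sls)))
```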

## Consistency and Asymptotic Distribution of 2SLS

* As before, we care about the quality of the **2sls estimator**. It can be proved that **IVs** satisfying (3) (4) (5) ensure the **consistency** of $\widehat{\boldsymbol{\beta}}_{2 \mathrm{sls}}$. That is, $$\widehat{\boldsymbol{\beta}}_{2 \mathrm{sls}} \xrightarrow{p} \boldsymbol{\beta} \text { as } n \rightarrow \infty$$
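
Consistency can be illustrated with a small simulation. The sketch below assumes a simple design with one endogenous regressor and one instrument (all numbers are illustrative assumptions); in this design the OLS slope converges to roughly $2 + 1/3$ rather than the true value $2$, while the 2SLS slope approaches $2$ as $n$ grows.

```r
set.seed(3)

# Returns the OLS and 2SLS slope estimates for one simulated sample of size n
iv_vs_ols <- function(n) {
  z <- rnorm(n)
  u <- rnorm(n)
  x <- z + u + rnorm(n)        # x is correlated with u, so it is endogenous
  y <- 1 + 2 * x + u           # true slope is 2
  c(n    = n,
    ols  = unname(coef(lm(y ~ x))[2]),
    tsls = cov(z, y) / cov(z, x))   # just-identified 2SLS equals the IV slope
}

results <- t(sapply(c(100, 1000, 10000, 100000), iv_vs_ols))
print(results)
```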

* For the sake of **statistical inference** we also need to know the distribution of $\widehat{\boldsymbol{\beta}}_{2 \mathrm{sls}}$.

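* Skipping the standard proofs here, $\widehat{\boldsymbol{\beta}}_{2 \mathrm{sls}}$ is asymptotically normal with approximate variance $n^{-1} \boldsymbol{V}_{\boldsymbol{\beta}}$, where $$\begin{equation}
\boldsymbol{V}_{\boldsymbol{\beta}}=\left(\boldsymbol{Q}_{x z} \boldsymbol{Q}_{z z}^{-1} \boldsymbol{Q}_{z x}\right)^{-1}\left(\boldsymbol{Q}_{x z} \boldsymbol{Q}_{z z}^{-1} \boldsymbol{\Omega} \boldsymbol{Q}_{z z}^{-1} \boldsymbol{Q}_{z x}\right)\left(\boldsymbol{Q}_{x z} \boldsymbol{Q}_{z z}^{-1} \boldsymbol{Q}_{z x}\right)^{-1}
\end{equation}$$ Here $\boldsymbol{Q}_{x z}=\mathbb{E}\left[\boldsymbol{x}_{i} \boldsymbol{z}_{i}^{\prime}\right]$, $\boldsymbol{Q}_{z x}=\boldsymbol{Q}_{x z}^{\prime}$, $\boldsymbol{Q}_{z z}=\mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{z}_{i}^{\prime}\right]$, and $\boldsymbol{\Omega}=\mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{z}_{i}^{\prime} e_{i}^{2}\right]$.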

* In practice we need the **sample estimate** for $V_{\boldsymbol{\beta}}$: $$\begin{equation}
\widehat{V}_{\beta}=\left(\widehat{Q}_{x z} \widehat{Q}_{z z}^{-1} \widehat{Q}_{z x}\right)^{-1}\left(\widehat{Q}_{x z} \widehat{Q}_{z z}^{-1} \widehat{\Omega} \hat{Q}_{z z}^{-1} \widehat{Q}_{z x}\right)\left(\widehat{Q}_{x z} \widehat{Q}_{z z}^{-1} \widehat{Q}_{z x}\right)^{-1}
\end{equation}$$ where $$\begin{equation}
\begin{aligned}
\widehat{\boldsymbol{Q}}_{z z} &=\frac{1}{n} \sum_{i=1}^{n} \boldsymbol{z}_{i} \boldsymbol{z}_{i}^{\prime}=\frac{1}{n} \boldsymbol{Z}^{\prime} \boldsymbol{Z} \\
\widehat{\boldsymbol{Q}}_{\boldsymbol{x} \boldsymbol{z}} &=\frac{1}{n} \sum_{i=1}^{n} \boldsymbol{x}_{i} \boldsymbol{z}_{i}^{\prime}=\frac{1}{n} \boldsymbol{X}^{\prime} \boldsymbol{Z} \\
\widehat{\boldsymbol{\Omega}} &=\frac{1}{n} \sum_{i=1}^{n} \boldsymbol{z}_{i} \boldsymbol{z}_{i}^{\prime} \hat{e}_{i}^{2} \\
\widehat{e}_{i} &=y_{i}-\boldsymbol{x}_{i}^{\prime} \widehat{\boldsymbol{\beta}}_{2 \mathrm{sls}}
\end{aligned}
\end{equation}$$

* Therefore, the estimated **standard errors** for **statistical inference** can be obtained from the square roots of the diagonal elements of $n^{-1} \widehat{\boldsymbol{V}}_{\boldsymbol{\beta}}$.
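
The sample formulas above translate almost line by line into matrix code. The following is a sketch on simulated data with a heteroskedastic error; the design, the coefficient values and the object names (`Qzz`, `Qxz`, `Omega`, `V_hat`) are assumptions chosen to mirror the notation, not code from any package.

```r
set.seed(4)
n  <- 1000
z1 <- rnorm(n); z2 <- rnorm(n)
u  <- rnorm(n) * (1 + 0.5 * abs(z1))             # heteroskedastic structural error
x  <- 0.8 * z1 + 0.5 * z2 + 0.5 * u + rnorm(n)   # endogenous regressor
y  <- 1 + 2 * x + u

X <- cbind(const = 1, x = x)
Z <- cbind(1, z1, z2)

Qzz <- crossprod(Z) / n        # (1/n) Z'Z
Qxz <- crossprod(X, Z) / n     # (1/n) X'Z
Qzx <- t(Qxz)

# 2SLS point estimate written with the Q matrices
A         <- Qxz %*% solve(Qzz)                  # Q_xz Q_zz^{-1}
beta_2sls <- solve(A %*% Qzx, A %*% crossprod(Z, y) / n)

# Residuals use the original X, not the first-stage fitted values
e_hat <- as.numeric(y - X %*% beta_2sls)
Omega <- crossprod(Z * e_hat) / n                # (1/n) sum of z_i z_i' e_i^2

bread <- solve(A %*% Qzx)
V_hat <- bread %*% (A %*% Omega %*% t(A)) %*% bread
se    <- sqrt(diag(V_hat / n))                   # standard errors

print(cbind(estimate = as.numeric(beta_2sls), std_error = se))
```

`Z * e_hat` scales row $i$ of `Z` by $\hat{e}_i$, so its cross product is the sum of $\boldsymbol{z}_i \boldsymbol{z}_i^{\prime} \hat{e}_i^2$.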

## Examples using R

In [1]:
# load packages

# for data cleaning and pipelines
library(tidyverse)

# for data sets from wooldridge textbook
library(wooldridge)

# for 2sls estimation
library(AER)


In [2]:
# Example 15.4 in [WO]

data(card)
?card



* Card (1995) used _wage_ and _education_ data for a sample of men in 1976 to estimate the return to education.

* As discussed above, this is a typical case of _Choice Variables as Regressors_: _unobserved ability_ is left in the error term. Therefore we are left with a **structural equation** in which _educ_ is an **endogenous variable** (why?)

* He used a dummy variable for whether someone grew up near a four-year college (_nearc4_) as an **instrumental variable** for education (why is it appropriate?)

* We can get the **2sls estimate** using this sample.

In [3]:
# 1st stage of endogenous (educ) on IV and all exogenous variables

lm_educ_1 <- lm(educ ~ nearc4 + exper + expersq + black + smsa + south +
  smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 +
  reg667 + reg668 + reg669,
data = card
)

summary(lm_educ_1)


Call:
lm(formula = educ ~ nearc4 + exper + expersq + black + smsa + 
    south + smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 + 
    reg667 + reg668 + reg669, data = card)

Residuals:
   Min     1Q Median     3Q    Max 
-7.545 -1.370 -0.091  1.278  6.239 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 16.6382529  0.2406297  69.145  < 2e-16 ***
nearc4       0.3198989  0.0878638   3.641 0.000276 ***
exper       -0.4125334  0.0336996 -12.241  < 2e-16 ***
expersq      0.0008686  0.0016504   0.526 0.598728    
black       -0.9355287  0.0937348  -9.981  < 2e-16 ***
smsa         0.4021825  0.1048112   3.837 0.000127 ***
south       -0.0516126  0.1354284  -0.381 0.703152    
smsa66       0.0254805  0.1057692   0.241 0.809644    
reg662      -0.0786363  0.1871154  -0.420 0.674329    
reg663      -0.0279390  0.1833745  -0.152 0.878913    
reg664       0.1171820  0.2172531   0.539 0.589665    
reg665      -0.2726165  0.2184204  -1.248 0.212082    
reg666

In [4]:
# 2nd stage

# save the predicted value from 1st stage regression

educ_pred <- predict(lm_educ_1)

educ_pred[1:5]

In [5]:
lm_educ_2 <- lm(log(wage) ~ educ_pred + exper + expersq + black + smsa +
  south + smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 +
  reg667 + reg668 + reg669,
data = card
)

summary(lm_educ_2)


Call:
lm(formula = log(wage) ~ educ_pred + exper + expersq + black + 
    smsa + south + smsa66 + reg662 + reg663 + reg664 + reg665 + 
    reg666 + reg667 + reg668 + reg669, data = card)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.57387 -0.25161  0.01483  0.27229  1.38522 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.6661511  0.9508542   3.856 0.000118 ***
educ_pred    0.1315038  0.0565104   2.327 0.020028 *  
exper        0.1082711  0.0243243   4.451 8.85e-06 ***
expersq     -0.0023349  0.0003429  -6.810 1.18e-11 ***
black       -0.1467758  0.0554166  -2.649 0.008125 ** 
smsa         0.1118083  0.0325530   3.435 0.000601 ***
south       -0.1446715  0.0280524  -5.157 2.67e-07 ***
smsa66       0.0185311  0.0222167   0.834 0.404286    
reg662       0.1007678  0.0387462   2.601 0.009349 ** 
reg663       0.1482588  0.0378501   3.917 9.17e-05 ***
reg664       0.0498971  0.0449707   1.110 0.267283    
reg665       0.1462719  0.0483883

* We have obtained the **2sls estimate** by following the procedure above.
* However, the **standard errors** from the 2nd stage regression are **misleading**!
* The basic reason is that `lm()` in R computes the estimated variance-covariance matrix as $s^{2}\left(\boldsymbol{X}^{\prime} \boldsymbol{X}\right)^{-1}$, treating `educ_pred` as an ordinary regressor. Its $s^2$ is built from the second-stage residuals, which use the fitted values of _educ_ rather than _educ_ itself.

* In contrast, the previous discussion gives the **sample estimate** of $V_{\boldsymbol{\beta}}$: $$\begin{equation}
\widehat{V}_{\beta}=\left(\widehat{Q}_{x z} \widehat{Q}_{z z}^{-1} \widehat{Q}_{z x}\right)^{-1}\left(\widehat{Q}_{x z} \widehat{Q}_{z z}^{-1} \widehat{\Omega} \hat{Q}_{z z}^{-1} \widehat{Q}_{z x}\right)\left(\widehat{Q}_{x z} \widehat{Q}_{z z}^{-1} \widehat{Q}_{z x}\right)^{-1}
\end{equation}$$ where $$\begin{equation}
\begin{aligned}
\widehat{\boldsymbol{Q}}_{z z} &=\frac{1}{n} \sum_{i=1}^{n} \boldsymbol{z}_{i} \boldsymbol{z}_{i}^{\prime}=\frac{1}{n} \boldsymbol{Z}^{\prime} \boldsymbol{Z} \\
\widehat{\boldsymbol{Q}}_{\boldsymbol{x} \boldsymbol{z}} &=\frac{1}{n} \sum_{i=1}^{n} \boldsymbol{x}_{i} \boldsymbol{z}_{i}^{\prime}=\frac{1}{n} \boldsymbol{X}^{\prime} \boldsymbol{Z} \\
\widehat{\boldsymbol{\Omega}} &=\frac{1}{n} \sum_{i=1}^{n} \boldsymbol{z}_{i} \boldsymbol{z}_{i}^{\prime} \hat{e}_{i}^{2} \\
\widehat{e}_{i} &=y_{i}-\boldsymbol{x}_{i}^{\prime} \widehat{\boldsymbol{\beta}}_{2 \mathrm{sls}}
\end{aligned}
\end{equation}$$ 
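
To see where the second-stage `lm()` standard errors go wrong, the sketch below compares the two residual vectors on simulated data. It uses the simpler homoskedastic formula $s^2 (\widehat{\boldsymbol{X}}^{\prime} \widehat{\boldsymbol{X}})^{-1}$ rather than the robust $\widehat{V}_{\beta}$, and the data-generating process and variable names are assumptions for illustration only.

```r
set.seed(5)
n <- 1000
z <- rnorm(n)
u <- rnorm(n)
x <- 0.7 * z + 0.8 * u + rnorm(n)   # endogenous regressor
y <- 1 + 2 * x + u

x_hat  <- fitted(lm(x ~ z))         # first stage
stage2 <- lm(y ~ x_hat)             # second stage by plain OLS
b      <- coef(stage2)

res_naive   <- y - (b[1] + b[2] * x_hat)   # residuals lm() uses for s^2
res_correct <- y - (b[1] + b[2] * x)       # 2SLS residuals: original x

k <- 2
XhXh_inv   <- solve(crossprod(cbind(1, x_hat)))
se_naive   <- sqrt(sum(res_naive^2)   / (n - k) * diag(XhXh_inv))
se_correct <- sqrt(sum(res_correct^2) / (n - k) * diag(XhXh_inv))

print(rbind(se_naive, se_correct))
```

`se_naive` reproduces what `summary(stage2)` reports, while `se_correct` only swaps in the residuals computed with the original regressor. The point estimates are identical in both cases.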

* We could compute each term in `R` by hand; fortunately, the function `ivreg()` in the package `AER` carries out **2sls estimation** and reports standard errors based on the correct 2SLS residuals.

In [6]:
# 2sls using ivreg() in package AER

lm_educ_2sls <- ivreg(lwage ~ educ + exper + expersq + black + smsa + south +
  smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 +
  reg667 + reg668 + reg669 | nearc4 + exper + expersq +
  black + smsa + south + smsa66 + reg662 + reg663 +
  reg664 + reg665 + reg666 + reg667 +
  reg668 + reg669,
data = card
)

summary(lm_educ_2sls)


Call:
ivreg(formula = lwage ~ educ + exper + expersq + black + smsa + 
    south + smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 + 
    reg667 + reg668 + reg669 | nearc4 + exper + expersq + black + 
    smsa + south + smsa66 + reg662 + reg663 + reg664 + reg665 + 
    reg666 + reg667 + reg668 + reg669, data = card)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.83164 -0.24075  0.02428  0.25208  1.42760 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.6661509  0.9248295   3.964 7.54e-05 ***
educ         0.1315038  0.0549637   2.393  0.01679 *  
exper        0.1082711  0.0236586   4.576 4.92e-06 ***
expersq     -0.0023349  0.0003335  -7.001 3.12e-12 ***
black       -0.1467757  0.0538999  -2.723  0.00650 ** 
smsa         0.1118083  0.0316620   3.531  0.00042 ***
south       -0.1446715  0.0272846  -5.302 1.23e-07 ***
smsa66       0.0185311  0.0216086   0.858  0.39119    
reg662       0.1007678  0.0376857   2.674  0.00754 ** 
reg663  

In [7]:
# compare with the manual second stage below: the coefficient estimates are identical,
# but the standard errors differ; we prefer the ivreg() results above
summary(lm_educ_2)


Call:
lm(formula = log(wage) ~ educ_pred + exper + expersq + black + 
    smsa + south + smsa66 + reg662 + reg663 + reg664 + reg665 + 
    reg666 + reg667 + reg668 + reg669, data = card)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.57387 -0.25161  0.01483  0.27229  1.38522 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.6661511  0.9508542   3.856 0.000118 ***
educ_pred    0.1315038  0.0565104   2.327 0.020028 *  
exper        0.1082711  0.0243243   4.451 8.85e-06 ***
expersq     -0.0023349  0.0003429  -6.810 1.18e-11 ***
black       -0.1467758  0.0554166  -2.649 0.008125 ** 
smsa         0.1118083  0.0325530   3.435 0.000601 ***
south       -0.1446715  0.0280524  -5.157 2.67e-07 ***
smsa66       0.0185311  0.0222167   0.834 0.404286    
reg662       0.1007678  0.0387462   2.601 0.009349 ** 
reg663       0.1482588  0.0378501   3.917 9.17e-05 ***
reg664       0.0498971  0.0449707   1.110 0.267283    
reg665       0.1462719  0.0483883

In [8]:
# Load another data set for the over-identified case
data(mroz)
?mroz



In [9]:
# model without considering endogeneity of educ

lm_mroz_ols <- lm(lwage ~ educ + exper + expersq, data = mroz)

summary(lm_mroz_ols)


Call:
lm(formula = lwage ~ educ + exper + expersq, data = mroz)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.08404 -0.30627  0.04952  0.37498  2.37115 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.5220406  0.1986321  -2.628  0.00890 ** 
educ         0.1074896  0.0141465   7.598 1.94e-13 ***
exper        0.0415665  0.0131752   3.155  0.00172 ** 
expersq     -0.0008112  0.0003932  -2.063  0.03974 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6664 on 424 degrees of freedom
  (325 observations deleted due to missingness)
Multiple R-squared:  0.1568,	Adjusted R-squared:  0.1509 
F-statistic: 26.29 on 3 and 424 DF,  p-value: 1.302e-15


In [10]:
# model considering endogeneity of educ
# here we use motheduc and fatheduc as IVs
# are they appropriate?

lm_mroz_2sls <- ivreg(lwage ~ educ + exper + expersq | exper +
  expersq + motheduc + fatheduc, data = mroz)

summary(lm_mroz_2sls)


Call:
ivreg(formula = lwage ~ educ + exper + expersq | exper + expersq + 
    motheduc + fatheduc, data = mroz)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0986 -0.3196  0.0551  0.3689  2.3493 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.0481003  0.4003281   0.120  0.90442   
educ         0.0613966  0.0314367   1.953  0.05147 . 
exper        0.0441704  0.0134325   3.288  0.00109 **
expersq     -0.0008990  0.0004017  -2.238  0.02574 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6747 on 424 degrees of freedom
Multiple R-Squared: 0.1357,	Adjusted R-squared: 0.1296 
Wald test: 8.141 on 3 and 424 DF,  p-value: 2.787e-05 


## IV related Testing procedures

### Test of conditions of IVs

* The $l \times 1$ random vector $\boldsymbol{z}_i$ is a set of **instrumental variables** for the above **structural equation** if
$$  
\mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{e}_{i}\right] =\mathbf{0} \label{eq3} \tag{3}
$$
$$
\mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{z}_{i}^{\prime}\right] \text{is positive definite} \label{eq4} \tag{4}
$$
$$
\operatorname{rank}\left(\mathbb{E}\left[\boldsymbol{z}_{i} \boldsymbol{x}_{i}^{\prime}\right]\right) =k \label{eq5} \tag{5}
$$

* Statistical test of $\eqref{eq3}$ is **feasible** only for **over-identified** case, and we discuss it later.

* A statistical test of $\eqref{eq4}$ is **feasible** (how?)

* A statistical test of $\eqref{eq5}$ (**relevance condition**) is **feasible** and important. Just conduct a _t-test_ (one IV) or an _F-test_ (several IVs) on the **IVs** (excluding **exogenous variables**) in the first stage regression.
  * Please keep in mind that the number of **IVs** (excluding **exogenous variables**) must be at least the number of **endogenous variables**.
  * After controlling for the **exogenous variables**, we require that at least one coefficient on the **IVs** in the first stage regression differs from zero.

In [11]:
# t test for just-identified

summary(lm_educ_1)


Call:
lm(formula = educ ~ nearc4 + exper + expersq + black + smsa + 
    south + smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 + 
    reg667 + reg668 + reg669, data = card)

Residuals:
   Min     1Q Median     3Q    Max 
-7.545 -1.370 -0.091  1.278  6.239 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 16.6382529  0.2406297  69.145  < 2e-16 ***
nearc4       0.3198989  0.0878638   3.641 0.000276 ***
exper       -0.4125334  0.0336996 -12.241  < 2e-16 ***
expersq      0.0008686  0.0016504   0.526 0.598728    
black       -0.9355287  0.0937348  -9.981  < 2e-16 ***
smsa         0.4021825  0.1048112   3.837 0.000127 ***
south       -0.0516126  0.1354284  -0.381 0.703152    
smsa66       0.0254805  0.1057692   0.241 0.809644    
reg662      -0.0786363  0.1871154  -0.420 0.674329    
reg663      -0.0279390  0.1833745  -0.152 0.878913    
reg664       0.1171820  0.2172531   0.539 0.589665    
reg665      -0.2726165  0.2184204  -1.248 0.212082    
reg666

In [12]:
# F test for over-identified

lm_mroz_1 <- mroz %>%
  filter(!is.na(wage)) %>%
  lm(educ ~ exper + expersq + fatheduc + motheduc,
    data = .
  )

summary(lm_mroz_1)


Call:
lm(formula = educ ~ exper + expersq + fatheduc + motheduc, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.8057 -1.0520 -0.0371  1.0258  6.3787 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.102640   0.426561  21.340  < 2e-16 ***
exper        0.045225   0.040251   1.124    0.262    
expersq     -0.001009   0.001203  -0.839    0.402    
fatheduc     0.189548   0.033756   5.615 3.56e-08 ***
motheduc     0.157597   0.035894   4.391 1.43e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.039 on 423 degrees of freedom
Multiple R-squared:  0.2115,	Adjusted R-squared:  0.204 
F-statistic: 28.36 on 4 and 423 DF,  p-value: < 2.2e-16


In [13]:
lm_mroz_1_res <- mroz %>%
  filter(!is.na(wage)) %>%
  lm(educ ~ exper + expersq,
    data = .
  )

summary(lm_mroz_1_res)


Call:
lm(formula = educ ~ exper + expersq, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7235 -0.7829 -0.6398  1.2330  4.8402 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 12.369356   0.322313  38.377   <2e-16 ***
exper        0.056492   0.045093   1.253    0.211    
expersq     -0.001904   0.001345  -1.416    0.158    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.285 on 425 degrees of freedom
Multiple R-squared:  0.004923,	Adjusted R-squared:  0.0002406 
F-statistic: 1.051 on 2 and 425 DF,  p-value: 0.3504


In [14]:
anova(lm_mroz_1_res, lm_mroz_1)

| Model | Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|---|
| 1 | 425 | 2219.216 | | | | |
| 2 | 423 | 1758.575 | 2 | 460.6411 | 55.4003 | 4.268909e-22 |
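
The F statistic reported by `anova()` can be reproduced by hand from the two residual sums of squares. The sketch below only reuses the numbers printed above; the object names are introduced here for illustration.

```r
rss_restricted   <- 2219.216   # educ ~ exper + expersq
rss_unrestricted <- 1758.575   # adds fatheduc and motheduc
q   <- 2                       # number of excluded instruments being tested
df2 <- 423                     # residual degrees of freedom, unrestricted model

# F = ((RSS_r - RSS_u) / q) / (RSS_u / df2)
f_stat  <- ((rss_restricted - rss_unrestricted) / q) / (rss_unrestricted / df2)
p_value <- pf(f_stat, q, df2, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

The result matches the F value in the table, so the joint null that both instruments have zero coefficients in the first stage is rejected.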


### Testing for Endogeneity

* The **2SLS estimator** is _less efficient_ than the **OLS estimator** when the explanatory variables are **exogenous**.

* Therefore, if **no endogeneity** problem occurs, then we prefer the **OLS estimator**.

* We can use the _Hausman-Wu test_ to check whether endogeneity is present. Suppose the **structural model** is $$\begin{equation}
y_{1}=\beta_{0}+\beta_{1} y_{2}+\beta_{2} z_{1}+\beta_{3} z_{2}+u_{1}
\end{equation}$$ where $y_2$ is suspected to be **endogenous**.

* We also have available **IVs** $z_3$ and $z_4$ excluded from the above model. From the first stage **linear projection model** $$\begin{equation}
y_{2}=\pi_{0}+\pi_{1} z_{1}+\pi_{2} z_{2}+\pi_{3} z_{3}+\pi_{4} z_{4}+v_{2}
\end{equation}$$ we know that $y_{2}$ is **not endogenous** if and only if $v_{2}$ is _uncorrelated_ with $u_{1}$ in the **structural model**, since all the $z$'s are uncorrelated with $u_1$ by assumption. Ideally, we could just test the statistical significance of $\delta_{1}$ in the **simple projection model** $$\begin{equation}
u_{1}=\delta_{1} v_{2}+e_{1}
\end{equation}$$

* In practice $u_1$ and $v_2$ are unobserved, so we collect the _residuals_ $\hat{v}_2$ from the first stage **linear projection model** and run the following auxiliary regression: $$\begin{equation}
y_{1}=\beta_{0}+\beta_{1} y_{2}+\beta_{2} z_{1}+\beta_{3} z_{2}+\delta_{1} \hat{v}_{2}+\text { error }
\end{equation}$$

* Failing to reject $\mathbf{H}_0: \delta_1 = 0$ indicates that there is no clear evidence of **endogeneity** of $y_2$.

In [16]:
# Hausman-Wu test

lm_mroz_aux <- mroz %>%
  filter(!is.na(wage)) %>%
  lm(lwage ~ educ + exper + expersq + lm_mroz_1$residuals,
    data = .
  )

summary(lm_mroz_aux)


Call:
lm(formula = lwage ~ educ + exper + expersq + lm_mroz_1$residuals, 
    data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.03743 -0.30775  0.04191  0.40361  2.33303 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          0.0481003  0.3945753   0.122 0.903033    
educ                 0.0613966  0.0309849   1.981 0.048182 *  
exper                0.0441704  0.0132394   3.336 0.000924 ***
expersq             -0.0008990  0.0003959  -2.271 0.023672 *  
lm_mroz_1$residuals  0.0581666  0.0348073   1.671 0.095441 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.665 on 423 degrees of freedom
Multiple R-squared:  0.1624,	Adjusted R-squared:  0.1544 
F-statistic:  20.5 on 4 and 423 DF,  p-value: 1.888e-15
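
The same auxiliary-regression logic can be tried on simulated data where endogeneity is present by construction. This is a sketch under assumed coefficients; the variable names follow the notation of the test ($y_1$, $y_2$, $z_1$, $z_3$, $z_4$), with a single included exogenous variable to keep it short.

```r
set.seed(6)
n  <- 2000
z1 <- rnorm(n); z3 <- rnorm(n); z4 <- rnorm(n)
u1 <- rnorm(n)
y2 <- 0.5 * z1 + 0.8 * z3 + 0.6 * z4 + 0.7 * u1 + rnorm(n)  # endogenous by construction
y1 <- 1 + 2 * y2 + 0.5 * z1 + u1

first_stage <- lm(y2 ~ z1 + z3 + z4)
v2_hat      <- resid(first_stage)            # first-stage residuals
aux         <- lm(y1 ~ y2 + z1 + v2_hat)     # auxiliary regression

# t-test on the coefficient of v2_hat is the endogeneity test
print(summary(aux)$coefficients["v2_hat", ])

# The coefficient on y2 in the auxiliary regression equals the 2SLS estimate
y2_hat <- fitted(first_stage)
tsls   <- coef(lm(y1 ~ y2_hat + z1))
print(c(aux = unname(coef(aux)["y2"]), tsls = unname(tsls["y2_hat"])))
```

This equality is also visible in the `mroz` output above, where the coefficient on `educ` in the auxiliary regression coincides with the 2SLS estimate.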


### Sargan test for instrument validity (only feasible for over-identified case)

* In the context of the simple IV estimator, we noted that the _exogeneity requirement cannot be tested_. 

* However, if we have more instruments than we need, we can effectively test whether some of them are _uncorrelated_ with the **structural error**
  * Estimate the structural equation by 2SLS and obtain the 2SLS residuals, $\hat{u}_1$
  * Regress $\hat{u}_1$ on all **IVs** and **exogenous variables**. Obtain the R-squared, say, $R^2_1$
  * Under the null hypothesis that all **IVs** are _uncorrelated_ with $u_1$, $n R_{1}^{2} \overset{a}{\sim} \chi_{q}^{2}$, where $q$ is the number of **IVs** from _outside_ the model minus the total number of **endogenous explanatory variables**. With $q = 0$ there is nothing left to test, which is why this test is only feasible for the over-identified case.
  * Rejecting the null leads to the conclusion that at least some of the **IVs** are not exogenous.

In [17]:
sargan_reg <- mroz %>%
  filter(!is.na(wage)) %>%
  lm(lm_mroz_2sls$residuals ~ exper + expersq + fatheduc +
    motheduc, data = .)

summary(sargan_reg)

sargan_reg_sm <- summary(sargan_reg)


Call:
lm(formula = lm_mroz_2sls$residuals ~ exper + expersq + fatheduc + 
    motheduc, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1012 -0.3124  0.0478  0.3602  2.3441 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.096e-02  1.413e-01   0.078    0.938
exper       -1.833e-05  1.333e-02  -0.001    0.999
expersq      7.341e-07  3.985e-04   0.002    0.999
fatheduc     5.782e-03  1.118e-02   0.517    0.605
motheduc    -6.607e-03  1.189e-02  -0.556    0.579

Residual standard error: 0.6752 on 423 degrees of freedom
Multiple R-squared:  0.0008833,	Adjusted R-squared:  -0.008565 
F-statistic: 0.0935 on 4 and 423 DF,  p-value: 0.9845


In [18]:
sargan_test <- sargan_reg_sm$r.squared * nrow(subset(
  mroz,
  is.na(wage) == FALSE
))
print(sargan_test)
print(1 - pchisq(sargan_test, 1)) # p-value; df = 2 outside IVs - 1 endogenous variable

# hence we do not reject the null hypothesis of instrument validity

[1] 0.3780713
[1] 0.5386372
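
For contrast with the `mroz` result, the sketch below runs the same $n R^2$ procedure on simulated data in which one instrument is deliberately invalid. The helper `sargan_stat()` and the data-generating process are hypothetical and written only for this illustration, with one endogenous regressor and two outside instruments (so $q = 1$).

```r
# n * R^2 test for one endogenous regressor x and two instruments z1, z2
sargan_stat <- function(y, x, z1, z2) {
  x_hat <- fitted(lm(x ~ z1 + z2))       # first stage
  b     <- coef(lm(y ~ x_hat))           # 2SLS coefficients
  u_hat <- y - b[1] - b[2] * x           # 2SLS residuals use the original x
  r2    <- summary(lm(u_hat ~ z1 + z2))$r.squared
  stat  <- length(y) * r2
  q     <- 2 - 1                         # outside IVs minus endogenous regressors
  c(statistic = stat, p_value = pchisq(stat, df = q, lower.tail = FALSE))
}

set.seed(7)
n  <- 2000
z1 <- rnorm(n); z2 <- rnorm(n)
e  <- rnorm(n)
x  <- z1 + z2 + 0.5 * e + rnorm(n)

y_valid   <- 1 + 2 * x + e               # both instruments are exogenous
y_invalid <- 1 + 2 * x + e + 0.5 * z2    # z2 enters the error term: invalid

res_valid   <- sargan_stat(y_valid, x, z1, z2)
res_invalid <- sargan_stat(y_invalid, x, z1, z2)
print(rbind(res_valid, res_invalid))
```

In the invalid design the statistic should be large and the null of instrument validity rejected, whereas the valid design gives no systematic reason to reject.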


## Insights for Projects

* In order to examine the **causal effects** of some focal $x$ on $y$, we need to consider:
  * To alleviate **omitted variables bias**, we usually think about _finding **control variables**_, which may be directly correlated with the focal $x$ and directly influence $y$.
  * To handle **endogeneity**, we usually think about _finding **instruments**_, which are determined outside the system for $(y_i ,\boldsymbol{x}_{2i} )$, causally determine $\boldsymbol{x}_{2i}$, but do not causally determine $y_i$ except through $\boldsymbol{x}_{2i}$.

* A reliable analysis should be a series of tests that give consistent results when adding different sets of **control variables** and when handling **endogeneity** using **instruments**. E.g. one can test the significance of _educ_ on _income_ across models using different sets of **control variables** and with **2SLS estimates** using **IVs**.

* Even if the data quality cannot fully support your analysis, you need to discuss these two issues in the limitations part of the report.

## References

* Ch12 [HAE]
* Ch15 [WO]