# Strict Exogeneity

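Strict exogeneity is one of the most important assumptions of OLS: together with the other Gauss-Markov assumptions, it makes the OLS estimator **BLUE**. It requires each error term to have zero conditional mean given the entire design matrix **X**, which in turn implies that each error term is uncorrelated with every regressor.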

$$ \mathbb{E}[\epsilon_n | \text{X}] = 0, \quad n \in \{1,\dots,N\} $$

or

$$ \mathbb{E}[\mathbb{\epsilon} | \text{X}] = 0 $$

If $ \mathbb{E}[\mathbb{\epsilon} | \text{X}] = 0 $, then

- $ \mathbb{E}[\mathbb{\epsilon}] = 0 $ : The error has no internal structure
- $ \mathbb{E}[\mathbb{\epsilon}\text{X}] = 0 $ : This rules out the possibility that **X** and error are correlated
- $ \mathbb{E}[\mathbb{\epsilon} | \mathcal{f}(\text{X})] = \mathbb{E}[\mathbb{\epsilon}\mathcal{f}(\text{X})] = 0 $ : For any finite-valued function $\mathcal{f}$, $\mathcal{f}(\text{X})$ and the error are uncorrelated. This is of particular interest, since under this assumption you may include $\mathcal{f}(\text{X})$ as a regressor and OLS still provides unbiased estimates, i.e. the OLS estimate will be unbiased even for models like $y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1x_2$
- $ \mathbb{E}[y | \text{X}] = \text{X}\beta $ : $ \mathbb{E}[y | \text{X}] = \mathbb{E}[\text{X}\beta + \epsilon | \text{X}] = \text{X}\beta + \mathbb{E}[\epsilon | \text{X}] = \text{X}\beta + 0 = \text{X}\beta $

Because of these, strict exogeneity implies:

- There is no internal structure to the error.
- There is no external structure to the error tied to the predictors.
- The OLS estimate will be unbiased even when $\mathcal{f}(\text{X})$ is included as a regressor.
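
The last point can be checked numerically. The following is a minimal Monte Carlo sketch, not a definitive implementation: the coefficient values, sample sizes, and the helper `design` are hypothetical choices made for illustration. The errors are drawn independently of the regressors, so strict exogeneity holds by construction, and the average of the OLS estimates should land close to the true coefficients of the quadratic model above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true coefficients for
# y = b0 + b1*x1 + b2*x2 + b3*x1^2 + b4*x2^2 + b5*x1*x2 + eps
beta_true = np.array([1.0, 2.0, -1.0, 0.5, 0.3, -0.7])
n_obs, n_sims = 200, 2000


def design(x1, x2):
    """Stack the intercept, the raw regressors, and their transformations f(X)."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])


estimates = np.empty((n_sims, beta_true.size))
for s in range(n_sims):
    x1 = rng.normal(size=n_obs)
    x2 = rng.normal(size=n_obs)
    # eps is drawn independently of x1 and x2, so E[eps | X] = 0 holds.
    eps = rng.normal(size=n_obs)
    X = design(x1, x2)
    y = X @ beta_true + eps
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    estimates[s] = beta_hat

# Averaging over simulations approximates E[beta_hat].
mean_estimate = estimates.mean(axis=0)
max_abs_bias = np.abs(mean_estimate - beta_true).max()
print("mean estimate:", mean_estimate.round(3))
print("largest absolute bias:", max_abs_bias)
```

The largest absolute bias shrinks toward zero as `n_sims` grows, which is what unbiasedness predicts.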

## Why is this assumption broken?

There are many situations in which the assumption of strict exogeneity is broken. Common cases include:

### Omitted variable bias

Suppose the true model is

$$ y = \beta_1 X + \beta_2 W + \epsilon $$

But we omit **W** and estimate

$$ y = \beta_1 X + u $$

Now the error term, $u=\beta_2 W + \epsilon$, contains **W**, and in most practical situations **W** will be somewhat correlated with **X**. That means

$$ \mathbb{E}[u | \text{X}] \neq 0 $$
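
A small simulation makes the effect visible. This is a sketch under stated assumptions: the coefficients `beta1 = 2`, `beta2 = 3` and the dependence `W = 0.5 X + noise` are hypothetical values chosen for illustration. Under these choices the population slope of the short regression is $\beta_1 + \beta_2 \cdot 0.5 = 3.5$ rather than $\beta_1 = 2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

beta1, beta2 = 2.0, 3.0  # hypothetical true coefficients
delta = 0.5              # hypothetical strength of the X -> W dependence

x = rng.normal(size=n)
w = delta * x + rng.normal(size=n)  # W is correlated with X
eps = rng.normal(size=n)            # independent of both X and W
y = beta1 * x + beta2 * w + eps

# Short regression y = beta1 * X + u, which omits W (no intercept, as in the text).
slope_short = (x @ y) / (x @ x)

# Long regression that includes W recovers beta1.
XW = np.column_stack([x, w])
coef_long, *_ = np.linalg.lstsq(XW, y, rcond=None)

# Population value of the short-regression slope for this design.
slope_short_theory = beta1 + beta2 * delta

print("short regression slope:", slope_short)
print("theoretical short slope:", slope_short_theory)
print("long regression coefficients:", coef_long)
```

The gap between the short-regression slope and `beta1` is the omitted variable bias, and it disappears when `delta` is set to zero.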

### Simultaneity Bias

This occurs when the independent variable and the target variable depend on each other simultaneously. The dependence runs in both directions, but we model it in only one.

**Example:**

Let's say we are trying to model a student's exam score based on the number of hours the student spent preparing. The model is

$$ \text{score} = \beta_0 + \beta_1 \text{study_hours} + \epsilon $$

We expect a monotonic relationship between exam score and study hours: a student who studies more hours is more likely to get a high score. However, a student who expects the exam to be easy and expects a high score may study fewer hours. Therefore, the relationship between exam score and study hours runs both ways.

$$ \text{score} \leftrightarrow \text{study_hours} $$

Because of the bi-directional dependence between the independent variable and the target variable, the error becomes correlated with the design matrix **X**.
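
The mechanism can be sketched with a two-equation system. Everything in the block below is an illustrative assumption, not part of the original example: a second structural equation `hours = a0 + a1 * score + v` with a negative `a1` (a student expecting a high score studies less), and all numeric values. Solving the two equations jointly gives the reduced form used to generate the data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical structural parameters.
b0, b1 = 40.0, 5.0    # score = b0 + b1 * hours + eps
a0, a1 = 10.0, -0.05  # hours = a0 + a1 * score + v  (feedback from score to hours)
sigma_eps, sigma_v = 5.0, 1.0

eps = rng.normal(scale=sigma_eps, size=n)
v = rng.normal(scale=sigma_v, size=n)

# Reduced form obtained by substituting the score equation into the hours equation.
denom = 1.0 - a1 * b1
hours = (a0 + a1 * b0 + a1 * eps + v) / denom
score = b0 + b1 * hours + eps

# OLS slope of score on hours (with intercept) is cov(hours, score) / var(hours).
cov = np.cov(hours, score)
slope_ols = cov[0, 1] / cov[0, 0]

# The regressor is correlated with the structural error.
corr_hours_eps = np.corrcoef(hours, eps)[0, 1]

# Population value of the OLS slope for this system: b1 + cov(hours, eps) / var(hours).
slope_theory = b1 + a1 * sigma_eps**2 * denom / (a1**2 * sigma_eps**2 + sigma_v**2)

print("OLS slope:", slope_ols, "true b1:", b1, "theory:", slope_theory)
print("corr(hours, eps):", corr_hours_eps)
```

Because `hours` itself depends on `eps` through the feedback equation, the OLS slope settles away from `b1` no matter how large the sample is.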

### Reverse Causality

This occurs when we model the relationship in the reverse direction, i.e. we fit the model

$$ y = \beta_0 + \beta_1 x + \epsilon $$

but in reality $x$ depends on $y$ ($ x \sim y $) rather than $y$ on $x$ ($ y \sim x $) as modeled. Because of this, the error becomes correlated with the design matrix **X**.
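
As a rough illustration, assume a hypothetical data-generating process in which $y$ is drawn on its own and $x$ responds to it, `x = gamma * y + v`, so that $x$ has no causal effect on $y$. The value of `gamma` and the noise scales are assumptions made for this sketch. Regressing $y$ on $x$ still yields a clearly nonzero slope.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

gamma = 2.0  # hypothetical effect of y on x

y = rng.normal(size=n)      # y is generated first, independently of x
v = rng.normal(size=n)
x = gamma * y + v           # x responds to y, not the other way around

# OLS slope of the mis-specified regression y = b0 + b1 * x + eps.
cov = np.cov(x, y)
slope_ols = cov[0, 1] / cov[0, 0]

# Population value for this design: gamma * var(y) / (gamma^2 * var(y) + var(v)).
slope_theory = gamma / (gamma**2 + 1.0)

# With a causal effect of zero, the error of the modeled equation is y minus its
# mean, and it is correlated with x.
cov_x_error = np.cov(x, y - y.mean())[0, 1]

print("OLS slope:", slope_ols, "theory:", slope_theory)
print("cov(x, error):", cov_x_error)
```

The fitted slope reflects how $x$ tracks $y$, not an effect of $x$ on $y$, so it should not be read causally.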

## Consequence of Endogeneity

Let us take a look at what happens when the assumption of strict exogeneity is broken.

### OLS Estimates Become Biased and Inconsistent

We have seen in previous sections that $\hat\beta$ can be expressed in terms of $\beta$ as

$$ \hat\beta = \beta + (\text{X}^T\text{X})^{-1}\text{X}^T\epsilon $$

Taking the expectation of both sides, conditional on the design matrix, we get

\begin{align*}
\mathbb{E}[\hat\beta | \text{X}] &= \mathbb{E}[\beta + (\text{X}^T\text{X})^{-1}\text{X}^T\epsilon | \text{X}] \\
                      &= \beta +  \mathbb{E}[(\text{X}^T\text{X})^{-1}\text{X}^T\epsilon | \text{X}] \\
                      &= \beta +  (\text{X}^T\text{X})^{-1}\text{X}^T\mathbb{E}[\epsilon | \text{X}] \\
\end{align*}

So if strict exogeneity is broken, i.e. $\mathbb{E}[\epsilon | \text{X}] \neq 0$, then in general $\mathbb{E}[\hat\beta | \text{X}] \neq \beta$ and the estimate is biased.
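
The inconsistency part follows from a similar argument. The sketch below assumes that a law of large numbers applies to the sample averages and that $Q = \text{plim}\, \frac{1}{N}\text{X}^T\text{X}$ exists and is invertible; $Q$ and $x_n$ (the $n$-th row of **X** written as a column vector) are notation introduced here for illustration.

\begin{align*}
\hat\beta &= \beta + \left(\tfrac{1}{N}\text{X}^T\text{X}\right)^{-1}\left(\tfrac{1}{N}\text{X}^T\epsilon\right) \\
\underset{N \to \infty}{\text{plim}}\ \hat\beta &= \beta + Q^{-1}\,\mathbb{E}[x_n\epsilon_n]
\end{align*}

If the regressors and the error are correlated, $\mathbb{E}[x_n\epsilon_n] \neq 0$ and $\hat\beta$ does not converge to $\beta$, so collecting more data does not remove the bias.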

### Inference Is Invalid

Because the OLS estimator is biased:

- Confidence intervals no longer cover the true $\beta$ with the advertised probability
- p-values are meaningless
- Hypothesis testing becomes unreliable
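
The first point can be illustrated by counting how often a nominal 95% confidence interval contains the true coefficient. This is a sketch under assumptions: it reuses a hypothetical omitted-variable design (`W = delta * X + noise`, with `W` left out of the regression), uses the normal critical value 1.96, and the helper `coverage` is introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(4)

beta1, beta2 = 2.0, 3.0     # hypothetical true coefficients
n_obs, n_sims = 500, 1000
z = 1.96                    # normal critical value for a nominal 95% interval


def coverage(delta):
    """Share of simulated 95% intervals for beta1 that contain the true beta1."""
    hits = 0
    for _ in range(n_sims):
        x = rng.normal(size=n_obs)
        w = delta * x + rng.normal(size=n_obs)  # omitted variable
        y = beta1 * x + beta2 * w + rng.normal(size=n_obs)

        # Short regression of y on x only (no intercept).
        b = (x @ y) / (x @ x)
        resid = y - b * x
        sigma2_hat = (resid @ resid) / (n_obs - 1)
        se = np.sqrt(sigma2_hat / (x @ x))

        hits += int(abs(b - beta1) <= z * se)
    return hits / n_sims


coverage_exogenous = coverage(0.0)   # W unrelated to X: exogeneity holds
coverage_endogenous = coverage(0.5)  # W correlated with X: exogeneity fails

print("coverage when exogenous:", coverage_exogenous)
print("coverage when endogenous:", coverage_endogenous)
```

When `delta` is zero the interval covers the true value at roughly the advertised rate, while with a correlated omitted variable the interval is centered on the wrong value and coverage collapses.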

### Variance Estimates Are Wrong

We have seen in a previous section that the variance formula for OLS, which also assumes exogeneity, is given by

$$ \mathbb{V}[\hat\beta] = \sigma^2(\text{X}^T\text{X})^{-1} $$

If errors correlate with predictors:

- The errors are “contaminated” with information about **X**, which the OLS residuals cannot reveal because they are orthogonal to **X** by construction
- Standard error formulas become invalid
- This leads to overconfident or misleading conclusions

## How to 