In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import statsmodels.api as smf  # regression library; note that `smf` here aliases statsmodels.api (conventionally imported as `sm`)

**Overview:** 

This notebook gives an example of how omitted variable bias (OVB) arises and how it can be measured. OVB occurs when a statistical model (such as a regression) leaves out one or more relevant variables that are correlated with the included regressors. This results in biased estimates of the parameters in the model, and we can quantify that bias using an auxiliary regression.

**Analytical Derivation:**

$$\text{Assume the true causal model is:}$$

$$Y_{i} = \beta_{0} + \beta_{1}X_{i} + \beta_{2}Z_{i} + \epsilon_{i}$$


$$\text{Where,} $$

$$X_{i} \text{ is a vector of all observable covariates that can be measured.}$$

$$Z_{i} \text{ is an omitted variable in the estimating model, and } Cov(X_{i}, Z_{i}) \neq 0.$$

$$\text{Also, assume that the coefficient on } Z_{i} \text{ is not 0, meaning that it is a determinant of } Y_{i}.$$

$$\text{Then, we can consider an auxiliary regression between the omitted variable and all other observed regressors:}$$


$$Z_{i} = \pi_{0} + \pi_{1}X_{i} + \nu_{i}$$

$$\text{Substituting this into the true causal model uncovers the bias in estimates that is introduced via omitted variable bias:}$$

$$Y_{i} = \beta_{0} + \beta_{1}X_{i} + \beta_{2}(\pi_{0} + \pi_{1}X_{i} + \nu_{i}) + \epsilon_{i}$$

$$ \Rightarrow Y_{i} = (\beta_{0} + \beta_{2}\pi_{0}) + (\beta_{1} + \beta_{2}\pi_{1})X_{i} + (\beta_{2}\nu_{i} + \epsilon_{i})$$

$$ \Rightarrow Y_{i} = \beta_{0}^{OVB} + \beta_{1}^{OVB}X_{i} + \eta_{i} $$

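$$ \text{As we can see above, omitting a variable that helps determine } Y_{i} \text{ changes the coefficient on } X_{i} \text{ from } \beta_{1} \text{ to } \beta_{1} + \beta_{2}\pi_{1}. $$

$$ \text{The extra term } \beta_{2}\pi_{1} \text{ is the omitted variable bias; it vanishes only if } \beta_{2} = 0 \text{ or } \pi_{1} = 0. $$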
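The derivation above can be checked numerically before turning to real data. The following is a minimal sketch on simulated data, not part of the original analysis: the parameter values, the seed, and the helper `ols` are all hypothetical choices made for illustration. It fits the long, auxiliary, and short regressions with plain least squares and compares the short-model slope to the value implied by the long and auxiliary coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 5_000

# Hypothetical "true" parameters, chosen only for illustration
beta0, beta1, beta2 = 1.0, 0.5, 2.0
pi0, pi1 = 0.3, -0.8

X = rng.normal(size=n)
Z = pi0 + pi1 * X + rng.normal(size=n)                   # auxiliary relationship
Y = beta0 + beta1 * X + beta2 * Z + rng.normal(size=n)   # true causal model


def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


b_long = ols(Y, X, Z)   # [intercept, coefficient on X, coefficient on Z]
b_aux = ols(Z, X)       # [intercept, coefficient on X]
b_short = ols(Y, X)     # [intercept, coefficient on X], with Z omitted

# Coefficients implied by the derivation: beta + beta2 * pi
implied_intercept = b_long[0] + b_long[2] * b_aux[0]
implied_slope = b_long[1] + b_long[2] * b_aux[1]

print("short-model slope:", b_short[1])
print("implied slope:    ", implied_slope)
print("true beta1:       ", beta1)
```

With estimated coefficients the two slopes should agree to floating-point precision, because the relationship is an algebraic property of least squares rather than an approximation, while both differ from the true coefficient on X by roughly the product of `beta2` and `pi1`.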

We will demonstrate how OVB can be measured by using data from the 2012 Current Population Survey. We will attempt to measure the effect of immigration and education on wage. For the purposes of this exercise, we will limit the sample to just women.

In [2]:
# read in data and restrict to females
ovb_df = pd.read_csv("ovb.csv") 
ovb_df_female = ovb_df[ovb_df['female'] == 1]
ovb_df_female

Unnamed: 0,state,age,wagesal,imm,hispanic,black,asian,educ,wage,logwage,female,fedwkr,statewkr,localwkr
0,11,44,18000,0,0,0,0,14,9.109312,2.209297,1,1,0,0
3,11,39,8000,0,0,0,0,14,5.128205,1.634756,1,0,0,0
6,11,38,25000,0,0,0,0,16,27.173913,3.302257,1,0,0,0
7,11,39,26000,0,0,0,0,13,16.666667,2.813411,1,0,0,0
9,11,37,4500,0,0,0,0,13,4.000000,1.386294,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21897,95,42,65000,0,0,0,0,18,31.250000,3.442019,1,1,0,0
21900,95,35,30000,0,0,0,0,12,12.396694,2.517430,1,0,0,0
21903,95,38,70000,0,0,0,1,18,26.923077,3.292984,1,0,0,0
21904,95,43,48208,0,0,0,0,14,20.601709,3.025374,1,0,0,0


In order to understand OVB, we can break the analysis down into 3 models: long, short, and auxiliary. 

1. In the long model, we include all relevant covariates: log(wage) = constant + education + immigrant_status.

2. In the auxiliary regression, we regress the omitted variable on the included one: education = constant + immigrant_status.

3. And in the short model, we omit 1 relevant covariate (education): log(wage) = constant + immigrant_status. 

In [3]:
# estimating model 1 (short model)
# logwage = constant, immigration status
X = ovb_df_female['imm']
X = smf.add_constant(X)
Y = ovb_df_female['logwage']
model1 = smf.OLS(Y, X) 
res1 = model1.fit()
print(res1.summary())

                            OLS Regression Results                            
Dep. Variable:                logwage   R-squared:                       0.011
Model:                            OLS   Adj. R-squared:                  0.011
Method:                 Least Squares   F-statistic:                     118.5
Date:                Fri, 22 Oct 2021   Prob (F-statistic):           1.85e-27
Time:                        15:34:42   Log-Likelihood:                -10701.
No. Observations:               10601   AIC:                         2.141e+04
Df Residuals:                   10599   BIC:                         2.142e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.8864      0.007    403.480      0.0

In [6]:
# estimating model 2 (auxiliary regression)
# education = constant, immigration status 

X = smf.add_constant(ovb_df_female['imm'])
Y = ovb_df_female['educ']
model2 = smf.OLS(Y, X)
res2 = model2.fit()
print(res2.summary())

                            OLS Regression Results                            
Dep. Variable:                   educ   R-squared:                       0.044
Model:                            OLS   Adj. R-squared:                  0.044
Method:                 Least Squares   F-statistic:                     490.6
Date:                Fri, 22 Oct 2021   Prob (F-statistic):          2.65e-106
Time:                        15:35:30   Log-Likelihood:                -25593.
No. Observations:               10601   AIC:                         5.119e+04
Df Residuals:                   10599   BIC:                         5.121e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         14.4518      0.029    495.764      0.0

In [7]:
# estimating model 3 (long model) 
# logwage = constant, education, immigrant status 
X = ovb_df_female[['educ', 'imm']]
X = smf.add_constant(X)
Y = ovb_df_female['logwage'] 
model3 = smf.OLS(Y,X)
res3 = model3.fit() 
print(res3.summary())

                            OLS Regression Results                            
Dep. Variable:                logwage   R-squared:                       0.224
Model:                            OLS   Adj. R-squared:                  0.224
Method:                 Least Squares   F-statistic:                     1529.
Date:                Fri, 22 Oct 2021   Prob (F-statistic):               0.00
Time:                        15:35:33   Log-Likelihood:                -9416.0
No. Observations:               10601   AIC:                         1.884e+04
Df Residuals:                   10598   BIC:                         1.886e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.2410      0.031     39.814      0.0

We can see that in the short model, the coefficient on immigration (imm) is -0.18, with a t-statistic of -10.887, suggesting that it is significant at the 1% level. However, in the long model, after including education (educ) as a covariate, the coefficient on immigration rises toward zero, to -0.0101, and it is no longer significant. Instead, we see that education has a positive and significant effect on log(wage), with a coefficient of 0.1139. 

Thus, we see that the short model has a bias of -0.1699: the short-model coefficient (-0.18) minus the long-model coefficient (-0.0101). 

If we take a look at the auxiliary model, where we regress education on immigration, we can see that the coefficient on immigration is -1.4921 and highly statistically significant. In fact, if we use the coefficients from the long and auxiliary models, we can see how the bias affects the estimates of the short model. Here immigrant status plays the role of the observed covariate and education the role of the omitted variable. We refer to the expanded equation in the analytical derivation above, and the coefficient on the observable covariate: 

$$ \text{short model estimate: }\beta_{1}^{OVB} $$

$$ \beta_{1}^{OVB} = \beta_{1} + \beta_{2}\pi_{1}$$

$$ = -0.0101 + 0.1139 \times (-1.4921) $$

$$ \approx -0.180 $$

This matches the estimate for immigration status in the short model, up to rounding of the reported coefficients. Thus, by not including education as a covariate in the short model, we introduced omitted variable bias into the model. One important note is that we ran the auxiliary regression to confirm that the covariance between immigration and education is not 0. It is hard to prove that there is no covariance between two regressors, but here the significant negative relationship between the two shows that they are correlated, and the positive, significant coefficient on education in the long model shows that education is also a determinant of wage. Both conditions for OVB are therefore met.
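The arithmetic above can be re-checked in a few lines. This is a small sketch that only re-uses the coefficients quoted in the discussion (rounded to four decimals); the variable names are hypothetical and it does not re-estimate anything from the data.

```python
# Coefficients as quoted in the discussion above (rounded values)
beta1_long = -0.0101   # imm, long model
beta2_long = 0.1139    # educ, long model
pi1_aux = -1.4921      # imm, auxiliary regression (educ on imm)
beta1_short = -0.18    # imm, short model

# Bias term and the short-model coefficient it implies
bias = beta2_long * pi1_aux
implied_short = beta1_long + bias

print(f"bias term (beta2 * pi1):       {bias:.4f}")
print(f"implied short coefficient:     {implied_short:.4f}")
print(f"gap vs. quoted short estimate: {implied_short - beta1_short:.4f}")
```

Any remaining gap comes from using rounded inputs; with the full-precision estimates from `res1`, `res2`, and `res3`, the implied and estimated short-model coefficients would be expected to coincide.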