
# Assignment: IV simulation

In [1]:
!pip install linearmodels


In [2]:
# Import the necessary packages:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from linearmodels.iv import IV2SLS

In [3]:
#Data Generating Process:

np.random.seed(42) #seed for reproducibility.
n = 10000 #sample size

X= np.random.normal(0, 1, n) #observed covariate
A = np.random.normal(0, 1, n) #unobserved covariate: ability
epsilon = np.random.normal(0, 1, n) #error

#Parameters:
Tau = 1
Beta_0 = 1
Beta_1 = 1
Gamma = 1

#Instruments:
Z_1 = np.random.binomial(1, 0.5, n) #Z1 and Z2 are binomial with probability 0.5.
Z_2 = np.random.binomial(1, 0.5, n)

epsilon_3 = np.random.normal(0, 3, n) #e3 is normal with mean 0, stdev 3. This is created because it determines Z3.

Z_3 = (epsilon_3 + A > 0).astype(int) #depends on A and some randomness.

#Equation for variable T:
epsilon_T = np.random.normal(0, 2, n)
T = (5*Z_1 + 0.01*Z_2 + Z_3 + X + 10*A + epsilon_T > 0).astype(int)

#Log equation where Y is wages:
y = np.exp(Tau*T + Beta_0 + X*Beta_1 + A*Gamma + epsilon) #to write the equation in regular terms, we use exp.

logY= np.log(y) #now we express it as the log.


In [4]:
# prompt: Generate a table of the parameter estimates. It should have five columns: 1: OLS
# 2: IV instrumenting with Z1
# 3: IV instrumenting with Z2
# 4: IV instrumenting with Z3
# 5: IV instrumenting with Z1 and Z2. You should always be using robust standard errors;

# Create a DataFrame for the data
data = pd.DataFrame({'y': y, 'T': T, 'X': X, 'Z1': Z_1, 'Z2': Z_2, 'Z3': Z_3})

# OLS Regression
ols_model = smf.ols('np.log(y) ~ T + X', data=data).fit(cov_type='HC1')

# IV Regression with Z1
iv_z1_model = IV2SLS.from_formula('np.log(y) ~ 1 + X + [T ~ Z1]', data=data).fit(cov_type='robust')

# IV Regression with Z2
iv_z2_model = IV2SLS.from_formula('np.log(y) ~ 1 + X + [T ~ Z2]', data=data).fit(cov_type='robust')

# IV Regression with Z3
iv_z3_model = IV2SLS.from_formula('np.log(y) ~ 1 + X + [T ~ Z3]', data=data).fit(cov_type='robust')

# IV Regression with Z1 and Z2
iv_z1z2_model = IV2SLS.from_formula('np.log(y) ~ 1 + X + [T ~ Z1 + Z2]', data=data).fit(cov_type='robust')


# Create a table of parameter estimates
results_table = pd.DataFrame({
    'OLS': ols_model.params,
    'IV (Z1)': iv_z1_model.params,
    'IV (Z2)': iv_z2_model.params,
    'IV (Z3)': iv_z3_model.params,
    'IV (Z1 & Z2)': iv_z1z2_model.params
})

results_table


| | OLS | IV (Z1) | IV (Z2) | IV (Z3) | IV (Z1 & Z2) |
|---|---|---|---|---|---|
| Intercept | 0.033531 | 0.813372 | 0.052726 | -0.301963 | 0.811976 |
| T | 2.570309 | 1.304684 | 2.539157 | 3.114793 | 1.30695 |
| X | 0.966078 | 1.007822 | 0.967105 | 0.948119 | 1.007747 |


In [5]:
# prompt: Test statistically whether Z1 is correlated with X, and report the results of that test.

# Perform correlation test between Z1 and X
correlation_model = smf.ols('X ~ Z1', data=data).fit()
print(correlation_model.summary())

# Extract the p-value for the Z1 coefficient
p_value = correlation_model.pvalues['Z1']

# Print the results
print(f"\nCorrelation Test between Z1 and X:\np-value: {p_value}")

# Interpret the results
if p_value < 0.05:
  print("\nConclusion: There is a statistically significant correlation between Z1 and X.")
else:
  print("\nConclusion: There is no statistically significant correlation between Z1 and X.")


                            OLS Regression Results                            
Dep. Variable:                      X   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     2.926
Date:                Tue, 08 Oct 2024   Prob (F-statistic):             0.0872
Time:                        16:22:42   Log-Likelihood:                -14222.
No. Observations:               10000   AIC:                         2.845e+04
Df Residuals:                    9998   BIC:                         2.846e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0149      0.014      1.055      0.2

**Discussion:** The correlation test above fails to reject, at the 5% level, the null hypothesis that the instrument $Z_1$ is uncorrelated with $X$ (the observed covariate, which is parental education); the p-value is 0.087, so the null would be rejected only at the 10% level. This is preliminary evidence that $Z_1$ is as good as randomly assigned with respect to observed characteristics. It is not a test of the exclusion restriction ("$Z$ affects $Y$ only through $T$"), which involves unobserved channels and cannot be verified directly from the data.

The result is therefore consistent with $Z_1$ being a valid instrument, since $Z_1$ shows no clear relationship with $X$, but it does not establish validity on its own.

It remains essential to consider the broader context and any omitted variable bias that the correlation test cannot capture.

In many developing countries, the education system may be heavily influenced by external factors such as socioeconomic status, community resources, and parental education. If $Z_1$ is correlated with $X$, it raises concerns about whether $Z_1$ can be seen as a purely random assignment affecting only educational attainment (i.e., high school completion) and not influenced by other factors related to the outcome of interest (like wages).

Imbens and Wooldridge (2009) note that in developing countries, finding valid instruments can be particularly challenging due to potential violations of the exclusion restriction, as many instruments (like scholarships) might directly affect outcomes (e.g., wages) in ways unrelated to schooling.

In contexts where educational opportunities are limited or vary significantly based on family background and other socioeconomic factors, finding a valid instrument becomes more challenging. A correlation between $Z_1$ and $X$ suggests that $Z_1$ may capture unobserved factors that also affect wages, thereby making it less convincing as a valid instrument.
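The balance check discussed above can also be run as a difference in means with a heteroskedasticity-robust (Welch) standard error. The sketch below uses a small hypothetical sample (the names `z` and `x` and the seed are illustrative, not the assignment data) and confirms that the OLS slope from regressing a covariate on a binary instrument is exactly the difference in group means.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.binomial(1, 0.5, n)   # hypothetical randomly assigned binary instrument
x = rng.normal(0, 1, n)       # hypothetical covariate, drawn independently of z

x1, x0 = x[z == 1], x[z == 0]
diff = x1.mean() - x0.mean()  # difference in covariate means across instrument groups

# Welch standard error: allows different variances in the two groups
se = np.sqrt(x1.var(ddof=1) / x1.size + x0.var(ddof=1) / x0.size)
t_stat = diff / se

# OLS slope of x on z (with intercept); equals the difference in means
slope = np.polyfit(z, x, 1)[0]

print(f"difference in means: {diff:.4f}")
print(f"OLS slope:           {slope:.4f}")
print(f"robust t-statistic:  {t_stat:.3f}")
```

A small t-statistic here only speaks to balance on the observed covariate; it says nothing about unobserved variables such as ability.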

In [6]:
# prompt: Now, re-generate the data with 100,000 observations. Report the summary statistics.

# Re-generate the data with 100,000 observations
np.random.seed(42)  # seed for reproducibility
n = 100000  # sample size

X = np.random.normal(0, 1, n)
A = np.random.normal(0, 1, n)
epsilon = np.random.normal(0, 1, n)

Tau = 1
Beta_0 = 1
Beta_1 = 1
Gamma = 1

Z_1 = np.random.binomial(1, 0.5, n)
Z_2 = np.random.binomial(1, 0.5, n)

epsilon_3 = np.random.normal(0, 3, n)
Z_3 = (epsilon_3 + A > 0).astype(int)

epsilon_T = np.random.normal(0, 2, n)
T = (5 * Z_1 + 0.01 * Z_2 + Z_3 + X + 10 * A + epsilon_T > 0).astype(int)

y = np.exp(Tau * T + Beta_0 + X * Beta_1 + A * Gamma + epsilon)

logY = np.log(y)

data = pd.DataFrame({'y': y, 'T': T, 'X': X, 'Z1': Z_1, 'Z2': Z_2, 'Z3': Z_3})

# Report summary statistics
data.describe()


| | y | T | X | Z1 | Z2 | Z3 |
|---|---|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 31.339305 | 0.61162 | 0.000967 | 0.50261 | 0.49831 | 0.49815 |
| std | 140.615428 | 0.487384 | 1.000906 | 0.499996 | 0.5 | 0.499999 |
| min | 0.000641 | 0.0 | -4.465604 | 0.0 | 0.0 | 0.0 |
| 25% | 1.236865 | 0.0 | -0.674494 | 0.0 | 0.0 | 0.0 |
| 50% | 5.328871 | 1.0 | 0.00265 | 1.0 | 0.0 | 0.0 |
| 75% | 20.775248 | 1.0 | 0.676915 | 1.0 | 1.0 | 1.0 |
| max | 15193.609008 | 1.0 | 4.479084 | 1.0 | 1.0 | 1.0 |


When we construct the Wald estimator, we must choose whether or not to condition on $X$. This choice can be broken down into two main aspects: consistency and efficiency.

Conditioning on $X$ means including this relevant covariate in the model to control for potential confounding factors.
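For reference, the standard Wald estimator for a binary instrument without covariates can be written as below. This is the textbook form, stated here as an assumption about what the "four expectations" in the assignment refer to; $\hat{E}$ denotes a sample average and $Y$ is log wages.

```latex
\hat{\tau}_{\text{Wald}}
  = \frac{\hat{E}[Y \mid Z_1 = 1] - \hat{E}[Y \mid Z_1 = 0]}
         {\hat{E}[T \mid Z_1 = 1] - \hat{E}[T \mid Z_1 = 0]}
  = \frac{\hat{E}[Z_1 Y] - \hat{E}[Z_1]\,\hat{E}[Y]}
         {\hat{E}[Z_1 T] - \hat{E}[Z_1]\,\hat{E}[T]}
```

The second equality holds because $Z_1$ is binary: both covariances equal $\hat{E}[Z_1](1 - \hat{E}[Z_1])$ times the corresponding difference in group means, and that factor cancels in the ratio.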

**1. Consistency:** refers to the property of an estimator that, as the sample size increases, it converges in probability to the true parameter value.

Conditioning on X: When we condition on $X$, we control for the influence of covariates, reducing omitted variable bias. This allows us to obtain a consistent estimator of the causal effect, especially in observational data where $X$ may influence both the treatment and the outcome. This means that conditioning on $X$ generally helps to ensure that the estimator accurately captures the causal effect we are interested in.

Not Conditioning on $X$: If we do not condition on $X$ and the instrument is correlated with $X$, part of the effect of $X$ on the outcome is attributed to the treatment. This leads to biased estimates and undermines consistency: as the sample size grows, the estimator does not converge to the true causal effect and may capture spurious relationships instead. If the instrument is independent of $X$, as with a randomly assigned instrument, omitting $X$ does not by itself cause inconsistency, and the main cost is precision.

**2. Efficiency:** refers to the variance of the estimator. An efficient estimator has the smallest possible variance among all consistent estimators.

Conditioning on $X$: By conditioning on $X$, we often achieve a more efficient estimator. This is because the additional information provided by $X$ allows us to reduce the variability of our estimator. When  $X$ accounts for significant variation in the outcome, conditioning on it helps to clarify the relationship between the treatment and the outcome, leading to a tighter confidence interval and more reliable inferences.

Not Conditioning on $X$: Without conditioning, the estimator may have larger variance because it does not utilize the information that $X$ provides. This results in less precise estimates, and thus, the estimator can be considered less efficient. The larger variance could be due to not adequately accounting for the sources of variability in the outcome.

**Conclusion:** In summary, whether to condition on $X$ in constructing the Wald estimator is crucial for achieving both consistency and efficiency. Conditioning on $X$ enhances the reliability of the estimator by mitigating bias and reducing variance, while failing to condition risks obtaining an inconsistent and inefficient estimator. Therefore, to make robust inferences about causal relationships, it is generally advisable to condition on relevant covariates like $X$.
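The efficiency point can be illustrated with a small Monte Carlo sketch. The data generating process below is hypothetical (it is not the assignment's DGP), and the instrument is independent of `x` by construction, so the comparison isolates the effect of conditioning on precision. The helper names `iv_ratio` and `residualize` are illustrative.

```python
import numpy as np

def iv_ratio(z, t, y):
    """Just-identified IV estimate: sample Cov(z, y) / Cov(z, t)."""
    zc = z - z.mean()
    return (zc @ y) / (zc @ t)

def residualize(v, x):
    """Residual from an OLS regression of v on a constant and x."""
    xc = x - x.mean()
    slope = (xc @ v) / (xc @ xc)
    return v - v.mean() - slope * xc

rng = np.random.default_rng(0)
tau, n, reps = 1.0, 2000, 200   # true effect, sample size, replications
without_x, with_x = [], []

for _ in range(reps):
    z = rng.binomial(1, 0.5, n).astype(float)   # randomly assigned instrument
    x = rng.normal(size=n)                      # observed covariate
    a = rng.normal(size=n)                      # unobserved confounder
    t = (2 * z + x + a + rng.normal(size=n) > 1).astype(float)
    y = tau * t + 2 * x + a + rng.normal(size=n)

    without_x.append(iv_ratio(z, t, y))
    # Partial x out of every variable, then apply the same ratio
    with_x.append(iv_ratio(residualize(z, x), residualize(t, x), residualize(y, x)))

print(f"without X: mean {np.mean(without_x):.3f}, sd {np.std(without_x):.3f}")
print(f"with X:    mean {np.mean(with_x):.3f}, sd {np.std(with_x):.3f}")
```

In this setup both versions center near the true effect, and the version that conditions on `x` has the smaller spread across replications because `x` explains part of the outcome variance.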

In [7]:
# prompt: There are four expectations in the Wald estimator. Construct and report the sample averages of each of those four expectations.
# My estimator is intended to control for X, in case it significantly impacts the treatment and outcome.

# Calculate the sample averages used in the Wald estimator.
# log(y) is the outcome throughout.

# E[Z1 * Y]
E_Z1Y = np.mean(data['Z1'] * np.log(data['y']))

# E[Y]: overall mean of log(y)
E_Y = np.mean(np.log(data['y']))

# E[T | X] and E[Z1 | X] via OLS regressions on X.
# Note: averaging OLS fitted values over the sample reproduces the unconditional
# sample mean, so these two quantities equal E[T] and E[Z1] (see the output below).
T_model = smf.ols('T ~ X', data=data).fit()
E_T_given_X = T_model.predict(data[['X']]).mean()

Z1_model = smf.ols('Z1 ~ X', data=data).fit()
E_Z1_given_X = Z1_model.predict(data[['X']]).mean()

# Remaining sample averages
E_Z1T = np.mean(data['Z1'] * data['T'])
E_T = np.mean(data['T'])
E_X = np.mean(data['X'])

# Report the sample averages
print(f"E[Z1 * Y]: {E_Z1Y}")
print(f"E[Z1]: {E_Z1_given_X}")
print(f"E[T | X]: {E_T_given_X}")
print(f"E[T]: {E_T}")
print(f"E[X]: {E_X}")


E[Z1 * Y]: 0.8516393476869983
E[Z1]: 0.5026100000000002
E[T | X]: 0.61162
E[T]: 0.61162
E[X]: 0.0009668681409495974


In [8]:
wald_estimator = (E_Z1Y - E_Z1_given_X * E_Y) / (E_Z1T - E_Z1_given_X * E_T_given_X)
print(f"Wald Estimator: {wald_estimator}")

Wald Estimator: 0.9370449771816036
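The expression in cell 8 is the covariance form of the Wald estimator, $\widehat{Cov}(Z_1, Y) / \widehat{Cov}(Z_1, T)$. For a binary instrument this is algebraically identical to the ratio of differences in group means. The sketch below checks the identity on hypothetical simulated data (not the assignment data).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
z = rng.binomial(1, 0.5, n)                          # hypothetical binary instrument
a = rng.normal(size=n)                               # unobserved confounder
t = (2 * z + a + rng.normal(size=n) > 1).astype(int) # binary treatment
y = 1.0 * t + a + rng.normal(size=n)                 # outcome, true effect 1.0

# Covariance form: (E[ZY] - E[Z]E[Y]) / (E[ZT] - E[Z]E[T])
cov_form = (np.mean(z * y) - z.mean() * y.mean()) / (np.mean(z * t) - z.mean() * t.mean())

# Difference-in-means form: four conditional sample averages
means_form = (y[z == 1].mean() - y[z == 0].mean()) / (t[z == 1].mean() - t[z == 0].mean())

print(f"covariance form:          {cov_form:.6f}")
print(f"difference-in-means form: {means_form:.6f}")
```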


In [9]:
##Estimate the analogous Wald estimator directly using the package of your choice. By "analogous," I mean that the coefficient estimate should be identical.

# First Stage: Regress T on Z1 and X
first_stage_model = smf.ols('T ~ Z1 + X', data=data).fit()
data['T_hat'] = first_stage_model.fittedvalues

In [10]:
# Second Stage: Regress Y (or logY) on T_hat
second_stage_model = smf.ols('np.log(y) ~ T_hat + X', data=data).fit()

# Extract the coefficient for T_hat
wald_estimator_smf = second_stage_model.params['T_hat']

print(f"Wald Estimator (smf): {wald_estimator_smf}")


Wald Estimator (smf): 0.9545511576675308
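This value differs from the 0.937 obtained in cell 8. One likely reason is that the cell 8 formula uses unconditional sample averages (the printed `E[T | X]` equals `E[T]`), whereas the two-stage regression includes $X$ in both stages. The sketch below, on hypothetical simulated data with NumPy only, shows that two-stage least squares with a control matches the ratio form once the control has been partialled out of the instrument, and generally differs from the ratio form without covariates.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
z = rng.binomial(1, 0.5, n).astype(float)   # hypothetical binary instrument
x = rng.normal(size=n)                      # observed control
a = rng.normal(size=n)                      # unobserved confounder
t = (2 * z + x + a + rng.normal(size=n) > 1).astype(float)
y = 1.0 * t + x + a + rng.normal(size=n)

const = np.ones(n)

# Two-step least squares with x as a control in both stages
first = np.column_stack([const, z, x])
t_hat = first @ np.linalg.lstsq(first, t, rcond=None)[0]
second = np.column_stack([const, t_hat, x])
beta_2sls = np.linalg.lstsq(second, y, rcond=None)[0][1]

# Ratio form after partialling x (and the constant) out of the instrument
controls = np.column_stack([const, x])
z_res = z - controls @ np.linalg.lstsq(controls, z, rcond=None)[0]
beta_ratio = (z_res @ y) / (z_res @ t)

# Ratio form without covariates (what an unconditional Wald estimator computes)
zc = z - z.mean()
beta_no_x = (zc @ y) / (zc @ t)

print(f"2SLS with x:           {beta_2sls:.6f}")
print(f"ratio, x partialled:   {beta_ratio:.6f}")
print(f"ratio, no covariates:  {beta_no_x:.6f}")
```

To reproduce an unconditional Wald estimate with a regression package, the analogous specification would omit the control from both stages.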


In [11]:
# Estimate the analogous Wald estimator directly, this time with IV2SLS as a cross-check:

from linearmodels.iv import IV2SLS

# Define the model using IV2SLS
iv_model = IV2SLS.from_formula('np.log(y) ~ 1 + X + [T ~ Z1]', data=data)

# Fit the model
iv_results = iv_model.fit(cov_type='robust')

# Extract the coefficient for T (the Wald estimator)
wald_estimator_iv2sls = iv_results.params['T']

# Print the Wald estimator from IV2SLS
print(f"Wald Estimator (IV2SLS): {wald_estimator_iv2sls}")


Wald Estimator (IV2SLS): 0.9545511576675327


In [12]:
# Check the statistical significance of the estimator

# Get the standard error of the Wald estimator from IV2SLS
standard_error = iv_results.std_errors['T']

# Calculate the t-statistic
t_statistic = wald_estimator_iv2sls / standard_error

# Calculate the p-value
p_value = iv_results.pvalues['T']

# Print the results
print(f"Wald Estimator (IV2SLS): {wald_estimator_iv2sls}")
print(f"Standard Error: {standard_error}")
print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value}")

# Interpret the results
if p_value < 0.05:
    print("\nConclusion: The Wald estimator is statistically significant at the 5% level.")
else:
    print("\nConclusion: The Wald estimator is not statistically significant at the 5% level.")


Wald Estimator (IV2SLS): 0.9545511576675327
Standard Error: 0.05086547309439087
t-statistic: 18.76619049421158
p-value: 0.0

Conclusion: The Wald estimator is statistically significant at the 5% level.
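Because the data are simulated with `Tau = 1`, it is also informative to test the estimate against the true value rather than only against zero. The sketch below reuses the estimate and standard error printed above and assumes a standard normal reference distribution; the helper `two_sided_p` is illustrative.

```python
import math

estimate = 0.9545511576675327      # coefficient on T reported above
std_error = 0.05086547309439087    # robust standard error reported above

def two_sided_p(z):
    """Two-sided p-value under a standard normal reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

z_zero = estimate / std_error            # H0: effect is 0
z_true = (estimate - 1.0) / std_error    # H0: effect equals the simulated Tau = 1

# Approximate 95% confidence interval
ci = (estimate - 1.96 * std_error, estimate + 1.96 * std_error)

print(f"z (H0: 0): {z_zero:.3f}, p = {two_sided_p(z_zero):.3g}")
print(f"z (H0: 1): {z_true:.3f}, p = {two_sided_p(z_true):.3g}")
print(f"95% CI: [{ci[0]:.3f}, {ci[1]:.3f}]")
```

The interval contains 1, so the estimate is statistically indistinguishable from the true parameter at the 5% level.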


This indicates that, on average, completing high school (the treatment $T$) is estimated to raise the log of wages by approximately 0.95 compared to not completing high school, after controlling for $X$ (the covariate). The estimate is within one standard error of the true value `Tau = 1` used in the data generating process.
In practical terms, this suggests that completing high school leads to a large increase in wages. Because the coefficient is large, reading it directly as a percentage is inaccurate: an increase of 0.95 in log wages corresponds to wages that are higher by a factor of $e^{0.95} \approx 2.6$, which is an increase of roughly 160%, not 95%.
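The conversion from a log-wage coefficient to a percentage change is $100 \times (e^{\beta} - 1)$. A minimal sketch (the function name `log_points_to_percent` is illustrative) shows that reading the coefficient directly as a percentage works for small values but not for one near 0.95.

```python
import math

def log_points_to_percent(beta):
    """Exact percentage change in the level implied by a change of beta in the log."""
    return 100.0 * (math.exp(beta) - 1.0)

small = log_points_to_percent(0.05)                 # small coefficient: close to 100 * beta
large = log_points_to_percent(0.9545511576675327)   # the IV2SLS estimate reported above

print(f"beta = 0.05   -> {small:.1f}% (naive reading: 5.0%)")
print(f"beta = 0.9546 -> {large:.1f}% (naive reading: 95.5%)")
```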