How does linear regression allow for the estimation of the average treatment effect (ATE) in causal inference studies, and what role does the dummy variable play in this estimation?

$ Y_i = \beta_0 + k T_i + u_i $
<br>
$E[Y|T=0] = \beta_0$ and $E[Y|T=1] = \beta_0 + k$, where $k$ is our ATE when treatment is as good as randomly assigned. The dummy variable $T_i$ indicates treatment as 1 or 0, so its coefficient is the difference between the two group means. That coefficient (the covariance between treatment and outcome divided by the variance of the treatment) is the ATE estimate reported in the model summary.
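The three ways of reading the dummy coefficient can be checked numerically. The sketch below uses made-up data (the sample size, the true effect of 2.0 and the noise level are assumptions for illustration) and compares the difference in group means, the covariance-over-variance formula, and the OLS coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical data: random 0/1 treatment with a true effect of 2.0
T = rng.integers(0, 2, size=n)
Y = 1.0 + 2.0 * T + rng.normal(0.0, 1.0, size=n)

# 1) Difference in group means: E[Y|T=1] - E[Y|T=0]
diff_in_means = Y[T == 1].mean() - Y[T == 0].mean()

# 2) Covariance of treatment and outcome over the variance of the treatment
cov_over_var = np.cov(T, Y, ddof=1)[0, 1] / np.var(T, ddof=1)

# 3) OLS coefficient on the dummy, with an intercept column
X = np.column_stack([np.ones(n), T])
beta0, k = np.linalg.lstsq(X, Y, rcond=None)[0]

print(diff_in_means, cov_over_var, k)
```

All three numbers agree up to floating point error, and they land near the assumed true effect because the treatment here is assigned at random.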

Why is it necessary to control for potential confounders when using linear regression to estimate causal relationships, and how does including these confounders in the regression model help mitigate bias?
- We control for confounding variables so that there is no omitted variable bias. Leaving them out can result in biased estimates.
- An RCT makes the treated and untreated groups comparable, so the bias vanishes.

What is the significance of the "partialling out" interpretation in multivariate linear regression, especially in the context of causal inference, and how does it relate to the ceteris paribus condition?

- With only a constant (1) and the treatment in the model, the coefficient shows how the treatment is associated with the target variable. If that coefficient changes once the other covariates are partialled out, the difference points to bias from confounding variables (especially if the adjustment lowers the estimated ATE).

How does randomized control trial (RCT) differ from regression analysis in addressing confounding bias, and why might RCT not always be feasible in real-world research scenarios?

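- An RCT makes the treated and untreated groups equal on average, so there is no confounding bias. With regression, we can never be sure that we have accounted for every confounding variable.
- An RCT is expensive and can be unethical (for example, giving patients potentially harmful pills).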
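To see what "equal on average" means in practice, here is a small simulation sketch. The covariate, the logistic assignment rule and the sample size are all assumptions made up for illustration; the point is only to compare covariate balance under self-selected versus randomized treatment.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Hypothetical pre-treatment covariate, e.g. age
age = rng.normal(40.0, 10.0, size=n)

# Observational assignment: older people are more likely to be treated
p_treat = 1.0 / (1.0 + np.exp(-(age - 40.0) / 10.0))
t_obs = rng.random(n) < p_treat

# Randomized assignment: a coin flip that ignores age
t_rct = rng.random(n) < 0.5

# Gap in mean age between treated and untreated under each scheme
gap_obs = age[t_obs].mean() - age[~t_obs].mean()
gap_rct = age[t_rct].mean() - age[~t_rct].mean()

print(f"age gap, observational: {gap_obs:.2f}")
print(f"age gap, randomized:    {gap_rct:.2f}")
```

Under the coin flip the age gap is close to zero, while under self-selection the treated group is clearly older, which is exactly the kind of imbalance that creates confounding.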

Why is the concept of omitted variable bias (OVB) crucial in the context of linear regression for causal inference, and how can causal graphs help in understanding the potential sources of bias in estimating causal effects?

- There is no omitted variable bias if the omitted variables have no impact on the treatment (or no impact on the outcome).
- A causal graph can show graphically whether the given variables are only proxies and whether we are missing important variables that confound the relationship.


In [None]:
How does linear regression allow for the estimation of the average treatment effect (ATE) in causal inference studies, and what role does the dummy variable play in this estimation?

Linear regression facilitates the estimation of the ATE by modeling the outcome variable as a function of the treatment and potentially other covariates. The dummy variable, typically coded as 0 for the control group and 1 for the treatment group, plays a crucial role in this process. It allows for the direct estimation of the treatment effect as the difference in predicted outcomes between the treatment and control groups. The coefficient of the dummy variable in the regression model represents the ATE, indicating the average difference in outcomes attributable to the treatment after controlling for other factors in the model.

Why is it necessary to control for potential confounders when using linear regression to estimate causal relationships, and how does including these confounders in the regression model help mitigate bias?

Controlling for potential confounders is necessary because these variables can influence both the treatment assignment and the outcome, potentially introducing bias into the estimation of the treatment effect. Including confounders in the regression model as covariates allows researchers to isolate the effect of the treatment from the effects of these other variables. This helps mitigate bias by ensuring that the estimated treatment effect is not confounded by differences in the covariates between the treatment and control groups. Essentially, it adjusts for the fact that the groups might have been different on these variables even before the treatment was applied.
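A minimal simulation sketch of this adjustment follows. The confounder `W`, the helper `ols`, the effect sizes and the sample size are hypothetical choices for illustration, not part of the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical confounder W raises both the chance of treatment and the outcome
W = rng.normal(0.0, 1.0, size=n)
T = (W + rng.normal(0.0, 1.0, size=n) > 0).astype(float)
Y = 1.0 + 2.0 * T + 3.0 * W + rng.normal(0.0, 1.0, size=n)  # true effect = 2.0

def ols(y, *columns):
    """OLS coefficients with an intercept prepended (least squares via lstsq)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    return np.linalg.lstsq(X, y, rcond=None)[0]

naive = ols(Y, T)[1]        # W omitted: the estimate absorbs part of W's effect
adjusted = ols(Y, T, W)[1]  # W included as a covariate

print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")
```

The naive estimate overshoots the assumed effect of 2.0 because treated units also tend to have higher `W`; adding `W` as a covariate brings the estimate back close to 2.0.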

What is the significance of the "partialling out" interpretation in multivariate linear regression, especially in the context of causal inference, and how does it relate to the ceteris paribus condition?

The "partialling out" interpretation in multivariate linear regression is significant because it allows for the estimation of the unique effect of a single independent variable (e.g., treatment) on the dependent variable (e.g., outcome), controlling for the influence of all other variables in the model. This is particularly important in causal inference, as it aligns with the ceteris paribus condition—holding all else constant. It means that the estimated coefficient of the treatment variable reflects the change in the outcome variable for a one-unit change in the treatment, assuming all other covariates remain constant. This provides a clearer and more accurate understanding of the causal relationship between the treatment and the outcome.
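As a sketch of what partialling out computes (the Frisch-Waugh-Lovell result), write $X_i$ for the covariates, including a constant, and $\tilde{T}_i$ for the residual left after regressing the treatment on them. The symbols $X_i$, $\hat{\delta}$ and $\tilde{T}_i$ are notation assumed here for illustration:

$$
\tilde{T}_i = T_i - X_i'\hat{\delta}, \qquad \hat{\kappa} = \frac{\mathrm{Cov}(\tilde{T}_i, Y_i)}{\mathrm{Var}(\tilde{T}_i)}
$$

The resulting $\hat{\kappa}$ equals the coefficient on $T_i$ in the full regression of $Y_i$ on $T_i$ and $X_i$. Only the part of the treatment that the covariates cannot explain is used to estimate its effect.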

How does randomized control trial (RCT) differ from regression analysis in addressing confounding bias, and why might RCT not always be feasible in real-world research scenarios?

An RCT addresses confounding bias by randomly assigning participants to the treatment or control group, ensuring that, on average, all other variables are equally distributed across these groups. This randomization helps isolate the effect of the treatment from the effects of confounding variables. In contrast, regression analysis statistically controls for confounders by including them as covariates in the model. While RCT is considered the gold standard for establishing causality, it may not always be feasible due to ethical, logistical, or financial constraints. For example, it might be unethical or impractical to randomly assign individuals to receive a potentially harmful treatment or to withhold a potentially beneficial one. In such cases, observational studies with regression analysis can provide valuable insights, although with greater caution regarding causality.

Why is the concept of omitted variable bias (OVB) crucial in the context of linear regression for causal inference, and how can causal graphs help in understanding the potential sources of bias in estimating causal effects?

Omitted variable bias (OVB) occurs when a relevant variable that influences both the treatment and the outcome is not included in the regression model. This can lead to biased estimates of the treatment effect because the omitted variable can confound the relationship between the treatment and the outcome. Causal graphs, or directed acyclic graphs (DAGs), can help identify potential sources of OVB by visually representing the causal relationships among variables. By mapping out these relationships, researchers can identify variables that are likely to confound the treatment-outcome relationship if omitted from the analysis. Including these variables in the regression model can then reduce the risk of OVB and lead to more accurate estimates of causal effects.
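The size of the bias can be written down for the simplest case. Suppose, as an illustrative assumption, that the true model has a single omitted variable $W_i$ with coefficient $\gamma$, and that $u_i$ is uncorrelated with $T_i$. Regressing $Y_i$ on $T_i$ alone then gives:

$$
Y_i = \beta_0 + \kappa T_i + \gamma W_i + u_i
\quad\Longrightarrow\quad
\frac{\mathrm{Cov}(T_i, Y_i)}{\mathrm{Var}(T_i)} = \kappa + \gamma\,\frac{\mathrm{Cov}(T_i, W_i)}{\mathrm{Var}(T_i)}
$$

The second term is the bias. It vanishes when the omitted variable does not affect the outcome ($\gamma = 0$) or is uncorrelated with the treatment ($\mathrm{Cov}(T_i, W_i) = 0$).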

In [14]:
import pandas as pd
import numpy as np
df = pd.read_csv('wage_data.csv').dropna()

In [15]:
def regress(y, X):
    # OLS coefficients from the normal equations: (X'X)^{-1} X'y
    return np.linalg.inv(X.T.dot(X)).dot(X.T.dot(y))


In [31]:
controls = ['IQ', 'exper', 'tenure', 'age', 'married', 'black',
            'south', 'urban', 'sibs', 'brthord', 'meduc', 'feduc']
X = df[controls].assign(intercep=1)
X1 = df[controls+['educ','lhwage']].assign(intercep=1)
t = df["educ"]
y = df["lhwage"]

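# Partialling out: regress the treatment (educ) on the controls
beta_aux = regress(t, X)
# Residual: the part of educ that the controls do not explain
t_tilde = t - X.dot(beta_aux)

# Slope of the outcome on the residualized treatment
kappa = t_tilde.cov(y) / t_tilde.var()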
kappa

0.04114719101005263

In [33]:
import statsmodels.formula.api as smf
smf.ols('lhwage~educ + '+'+'.join(controls),data=X1).fit().summary().tables[1]

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,1.1156,0.232,4.802,0.000,0.659,1.572
educ,0.0411,0.010,4.075,0.000,0.021,0.061
IQ,0.0038,0.001,2.794,0.005,0.001,0.006
exper,0.0153,0.005,3.032,0.003,0.005,0.025
tenure,0.0094,0.003,2.836,0.005,0.003,0.016
age,0.0086,0.006,1.364,0.173,-0.004,0.021
married,0.1795,0.053,3.415,0.001,0.076,0.283
black,-0.0801,0.063,-1.263,0.207,-0.205,0.044
south,-0.0397,0.035,-1.129,0.259,-0.109,0.029


In [38]:
smf.ols('lhwage~educ ',data=df).fit().summary().tables[1]

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2.3071,0.104,22.089,0.000,2.102,2.512
educ,0.0536,0.008,7.114,0.000,0.039,0.068


In [41]:

#-------------------------------


import pandas as pd
import numpy as np

np.random.seed(42)


data = pd.DataFrame({
    'age': np.random.randint(18, 65, size=100),  # Age of individuals
    'income': np.random.normal(50000, 15000, size=100),  # Annual income with some variation
    'education_years': np.random.randint(12, 21, size=100),  # Years of education
    'health_index': np.random.uniform(0, 1, size=100) * 100,  # Health index from 0 to 100
    'employed': np.random.choice([0, 1], size=100)  # Employment status (0 = unemployed, 1 = employed)
})

data['income_post_training'] = data['income'] + np.where(data['employed'] == 1, np.random.normal(5000, 2000, size=100), 0)



In [42]:
data

Unnamed: 0,age,income,education_years,health_index,employed,income_post_training
0,56,59544.576625,20,41.978086,1,63859.398889
1,46,36399.189971,15,25.620694,1,41897.819055
2,32,57140.638811,12,61.151371,0,57140.638811
3,60,69554.919026,13,8.159418,0,69554.919026
4,25,53173.805185,12,0.518486,1,59951.773773
...,...,...,...,...,...,...
95,24,63099.754536,12,82.042684,0,63099.754536
96,26,53084.752010,20,65.148477,0,53084.752010
97,41,28347.749725,17,20.668436,1,32495.685047
98,18,34584.337039,18,27.396113,0,34584.337039



Practice Questions

Calculate the Average Treatment Effect (ATE) of the training program on income.
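- About 2503: the difference in mean post-training income between the employed (treated) and unemployed groups, which is also the coefficient on employed in the simple regression. With p = 0.414 it is not statistically significant.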


What is the average income for those who were employed vs. those who were not?

- Employed: 54460.945837; not employed: 51957.362504 (average post-training income).

Does higher education correlate with higher income in this dataset?
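- Not clearly. The no-intercept regression of post-training income on education_years gives a slope of about 3313, but without an intercept that slope mostly reflects the average income level rather than a correlation. In the regression with an intercept and the other predictors, the education_years coefficient is about 431 with p = 0.460, so it is not statistically distinguishable from zero.

Is there a significant difference in the health index between employed and unemployed individuals?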
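One way to approach this question is a two-sample t-test on the health index. The sketch below rebuilds the simulated columns with the same seed and call order as the notebook, then runs Welch's t-test from SciPy. Using SciPy and the Welch variant is an assumption of this sketch, not something the notebook itself does.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Rebuild the simulated columns with the same seed and call order as the notebook
np.random.seed(42)
data = pd.DataFrame({
    'age': np.random.randint(18, 65, size=100),
    'income': np.random.normal(50000, 15000, size=100),
    'education_years': np.random.randint(12, 21, size=100),
    'health_index': np.random.uniform(0, 1, size=100) * 100,
    'employed': np.random.choice([0, 1], size=100),
})

health_employed = data.loc[data['employed'] == 1, 'health_index']
health_unemployed = data.loc[data['employed'] == 0, 'health_index']

# Welch's t-test: does not assume equal variances in the two groups
result = stats.ttest_ind(health_employed, health_unemployed, equal_var=False)

print(data.groupby('employed')['health_index'].mean())
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

A p-value below the chosen threshold (commonly 0.05) would indicate a significant difference. Since health_index and employed are generated independently here, a large p-value would be the expected outcome.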

Predict post-training income using a linear regression model with 'age', 'education_years', 'health_index', and 'employed' as predictors. What is the coefficient for 'employed', and how does it relate to the ATE calculated in question 1?
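- About 1905. This is lower than the 2503 from question 1 because the bigger model also includes the other variables. The drop suggests that the simple difference in means was partly picking up those variables, so there may be omitted or confounding variables at work. Neither estimate is statistically significant (p = 0.414 and p = 0.541), so we cannot conclude that employed (training?) is causal.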

In [57]:
smf.ols('income_post_training~employed',data=data).fit().summary().tables[1]

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,5.196e+04,2024.300,25.667,0.000,4.79e+04,5.6e+04
employed,2503.5833,3051.747,0.820,0.414,-3552.510,8559.676


In [58]:
data.groupby('employed')['income_post_training'].mean()

employed
0    51957.362504
1    54460.945837
Name: income_post_training, dtype: float64

In [59]:
beta = regress(data['income_post_training'], data[['education_years']])
beta

array([3312.90923662])

In [60]:
e = data['income_post_training'] - data[['education_years']].dot(beta)
print("Orthogonality implies that the dot product is zero:", np.dot(e, data[['education_years']]))
data[['education_years']].assign(e=e).corr()

Orthogonality implies that the dot product is zero: [-1.35041773e-08]


Unnamed: 0,education_years,e
education_years,1.0,-0.449365
e,-0.449365,1.0
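The dot product above is zero, yet the correlation between the residual and education_years is -0.449. That happens because the regression has no intercept: the residuals are orthogonal to the regressor but do not have mean zero. The sketch below illustrates this with made-up stand-ins for the two columns (the arrays `x` and `y` are assumptions, not the notebook's data).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100

# Hypothetical stand-ins for education_years and income_post_training
x = rng.integers(12, 21, size=n).astype(float)
y = rng.normal(50_000.0, 15_000.0, size=n)

# No intercept: residuals are orthogonal to x, but need not have mean zero
b_no_const = np.linalg.lstsq(x[:, None], y, rcond=None)[0]
e_no_const = y - x[:, None] @ b_no_const

# With an intercept: residuals are orthogonal to both the constant and x
X = np.column_stack([np.ones(n), x])
b_const = np.linalg.lstsq(X, y, rcond=None)[0]
e_const = y - X @ b_const

corr_no_const = np.corrcoef(e_no_const, x)[0, 1]
corr_const = np.corrcoef(e_const, x)[0, 1]

print("dot(e, x), no intercept:   ", e_no_const @ x)
print("corr(e, x), no intercept:  ", corr_no_const)
print("corr(e, x), with intercept:", corr_const)
```

Once a constant is included, the residuals have mean zero, so a zero dot product also means zero correlation with the regressor.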


In [61]:
smf.ols('income_post_training~age + education_years + health_index + employed',data=data).fit().summary().tables[1]

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,4.775e+04,1.01e+04,4.723,0.000,2.77e+04,6.78e+04
age,17.4976,110.682,0.158,0.875,-202.235,237.230
education_years,431.3614,581.286,0.742,0.460,-722.637,1585.359
health_index,-63.4558,50.829,-1.248,0.215,-164.363,37.452
employed,1905.3979,3103.469,0.614,0.541,-4255.768,8066.564
