
<div align="center">

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/xiangtgao/DS-UA_201-Causal-Inference-Spring-2025/blob/main/labs/10-Matching_Instruments.ipynb)

</div>

$$
\begin{array}{c}
\textbf{CAUSAL INFERENCE}\\\\
\textbf{Xiang Gao} \\
\textit{Center for Data Science, New York University} \\\\
\textit{Apr 11, 2025}\\\\\\
\text{Materials prepared by: Daniela Pinto Veizaga, Xiang Pan, and Xiang Gao}
\end{array}
$$

---

## Goals For Today

1. Review matching and see how regression weights subgroups differently.
2. Understand what instrumental variables (IVs) are and when they are used.
3. Explore the assumptions needed for an IV to yield valid causal estimates.


## Matching

The goal of matching is to obtain treated and control groups with similar covariate distributions.

When the control variables $C$ split the sample into subgroups and $S \perp U \mid C$ holds, we can use the matching estimator:

$$
\begin{aligned}
\operatorname{Matching}
& =  \sum_{c \in C} \left( \mathbb{E}[Y \mid S=1, C = c] - \mathbb{E}[Y \mid S=0, C = c] \right) \cdot \mathbb{P}(C = c)
\end{aligned}
$$
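
As a minimal sketch of the formula above, the function below computes the difference in means within each subgroup and weights it by the subgroup's sample share. The function name `matching_estimate` and the toy arrays are illustrative assumptions, not part of the lab's dataset.

```python
import numpy as np

def matching_estimate(y, s, c):
    """Sum over strata of [mean(Y | S=1, c) - mean(Y | S=0, c)] * P(C = c)."""
    y, s, c = np.asarray(y, dtype=float), np.asarray(s), np.asarray(c)
    estimate = 0.0
    for value in np.unique(c):
        in_stratum = c == value
        treated = y[in_stratum & (s == 1)]
        control = y[in_stratum & (s == 0)]
        if treated.size == 0 or control.size == 0:
            # Without both groups in a stratum there is nothing to compare
            raise ValueError(f"stratum C={value} lacks treated or control units")
        # Weight the within-stratum difference by the stratum's sample share
        estimate += (treated.mean() - control.mean()) * in_stratum.mean()
    return estimate

# Toy data (hypothetical): stratum 0 has 4 units, stratum 1 has 2 units
y = [3.0, 1.0, 1.0, 1.0, 10.0, 6.0]
s = [1, 0, 0, 0, 1, 0]
c = [0, 0, 0, 0, 1, 1]
toy_estimate = matching_estimate(y, s, c)
print(f"Matching estimate on toy data: {toy_estimate:.4f}")  # 2 * 4/6 + 4 * 2/6 = 8/3
```

Raising an error for a stratum without overlap is one possible design choice; dropping or merging such strata is another, with the trade-offs discussed later in this section.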

### Regression (weight is not the probability)


\begin{align}
\alpha_1=\sum_{k=1}^K E\left[Y(S=1, U)-Y(S=0, U) \mid \mathbf{C}=\mathbf{c}_k\right] W\left(\mathbf{C}=\mathbf{c}_k\right)
\end{align}

$W$ is the weight of the subgroup, which might not be the same as the probability of the subgroup.

$$
\begin{aligned}
W(C = c_k) \neq \mathbb{P}(C = c_k)
\end{aligned}
$$



We can look at the example from last time:

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm

In [3]:
# Step 1: Simulate family status (rich=1, poor=0)
np.random.seed(42)
n = 1000
family_status = np.random.binomial(1, 0.4, n)  # 40% rich, 60% poor

# Step 2: Simulate college attendance based on family status (rich more likely to attend college)
college = np.random.binomial(1, 0.7 * family_status + 0.3 * (1 - family_status), n)

# Step 3: Simulate income based on family status and college attendance

noise = np.random.normal(0, 5000, n)  # Noise to add to income
income_observe = (50000 + 30000 * college + 20000 * family_status + noise)
income_college = (50000 + 30000 * 1 + 20000 * family_status + noise)

# Counterfactual income without college. People who went to college (treated group)
# would still earn 10,000 more than the untreated even if they had not gone to college.
income_no_college = (50000 + 30000 * 0 + 10000 * college + 20000 * family_status + noise)

# Create a DataFrame for clarity
df = pd.DataFrame({
    'Family_Status': family_status,  # 1 = rich, 0 = poor
    'College': college,  # 1 = went to college, 0 = did not go to college
    'Income_Observe': income_observe,
    'Income_College': income_college,
    'Income_No_College': income_no_college
})

In [4]:
# Step 4: Compute ATE, ATT, ATU
# TRUE ATE:
ATE = df['Income_College'].mean() - df['Income_No_College'].mean()

# ATT:
ATT = df[(df['College'] == 1)]['Income_College'].mean() - df[(df['College'] == 1)]['Income_No_College'].mean()

# ATU:
ATU = df[(df['College'] == 0)]['Income_College'].mean() - df[(df['College'] == 0)]['Income_No_College'].mean()

# Print the results
print(f"TRUE Average Treatment Effect (ATE): {ATE:.2f}")
print(f"TRUE Average Treatment Effect on the Treated (ATT): {ATT:.2f}")
print(f"TRUE Average Treatment Effect on the Untreated (ATU): {ATU:.2f}")

TRUE Average Treatment Effect (ATE): 25560.00
TRUE Average Treatment Effect on the Treated (ATT): 20000.00
TRUE Average Treatment Effect on the Untreated (ATU): 30000.00


In [5]:
# Step 5: Condition on the confounder (family status)

ATE_family_1 = df[(df['College'] == 1) & (df['Family_Status'] == 1)]['Income_Observe'].mean() - df[(df['College'] == 0) & (df['Family_Status'] == 1)]['Income_Observe'].mean()

# Print the results
print(f"ATE for those with family_status = 1: {ATE_family_1:.2f}")

ATE_family_0 = df[(df['College'] == 1) & (df['Family_Status'] == 0)]['Income_Observe'].mean() - df[(df['College'] == 0) & (df['Family_Status'] == 0)]['Income_Observe'].mean()

# Print the results
print(f"ATE for those with family_status = 0: {ATE_family_0:.2f}")

ATE for those with family_status = 1: 30553.75
ATE for those with family_status = 0: 30308.08


In [11]:
proportions = df['Family_Status'].value_counts()/len(df)
proportions

| Family_Status | count |
|---|---|
| 0 | 0.613 |
| 1 | 0.387 |


In [20]:
var_family_0 = df[df['Family_Status'] == 0]["College"].var()
var_family_1 = df[df['Family_Status'] == 1]["College"].var()
print(f"Variance of treatment for Family Status = 0: {var_family_0:.4f}")
print(f"Variance of treatment for Family Status = 1: {var_family_1:.4f}")

Variance of treatment for Family Status = 0: 0.2091
Variance of treatment for Family Status = 1: 0.2192


In [15]:
# The matching estimator for ATE is
ATE_matching = ATE_family_1 * proportions[1] + ATE_family_0 * proportions[0]
print(f"Matching estimate for ATE: {ATE_matching:.2f}")

Matching estimate for ATE: 30403.15


Now, let us compare to the regression estimate:

In [16]:
ols=sm.ols(formula='Income_Observe~College+Family_Status', data=df).fit()
ols.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Income_Observe | R-squared: | 0.947 |
| Model: | OLS | Adj. R-squared: | 0.947 |
| Method: | Least Squares | F-statistic: | 8981.0 |
| Date: | Fri, 11 Apr 2025 | Prob (F-statistic): | 0.0 |
| Time: | 09:09:48 | Log-Likelihood: | -9912.8 |
| No. Observations: | 1000 | AIC: | 19830.0 |
| Df Residuals: | 997 | BIC: | 19850.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 4.996e+04 | 221.278 | 225.780 | 0.000 | 4.95e+04 | 5.04e+04 |
| College | 3.041e+04 | 335.519 | 90.623 | 0.000 | 2.97e+04 | 3.11e+04 |
| Family_Status | 1.98e+04 | 342.263 | 57.864 | 0.000 | 1.91e+04 | 2.05e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.17 | Durbin-Watson: | 2.05 |
| Prob(Omnibus): | 0.557 | Jarque-Bera (JB): | 1.054 |
| Skew: | 0.025 | Prob(JB): | 0.59 |
| Kurtosis: | 3.151 | Cond. No. | 3.09 |


Let's see why the weight in regression relates to the treatment variance!

Say we have two groups, 1 and 2. We want to estimate $\beta$ in the traditional linear regression notation:

$$
Y = X \beta + \epsilon
$$

Stack the observations of the two groups:

\begin{align}
X=\left[\begin{array}{l}
X_1 \\
X_2
\end{array}\right] \quad \text { and } \quad Y=\left[\begin{array}{l}
Y_1 \\
Y_2
\end{array}\right]
\end{align}

The cross-products then split by group:

\begin{align}
X^T X=X_1^T X_1+X_2^T X_2 \quad \text { and } \quad X^T Y=X_1^T Y_1+X_2^T Y_2
\end{align}

so the pooled OLS estimate is

\begin{align}
\hat{\beta}_{\text {pooled }}=\left(X_1^T X_1+X_2^T X_2\right)^{-1}\left(X_1^T Y_1+X_2^T Y_2\right)
\end{align}

The group-specific OLS estimates are

\begin{align}
\hat{\beta}_1=\left(X_1^T X_1\right)^{-1} X_1^T Y_1 \quad \text { and } \quad \hat{\beta}_2=\left(X_2^T X_2\right)^{-1} X_2^T Y_2
\end{align}

which implies $X_k^T Y_k = X_k^T X_k \hat{\beta}_k$ for $k = 1, 2$. Substituting this into the pooled estimate gives

\begin{align}
\hat{\beta}_{\text {pooled }}=\left(X_1^T X_1+X_2^T X_2\right)^{-1}\left(X_1^T X_1 \hat{\beta}_1+X_2^T X_2 \hat{\beta}_2\right)
\end{align}

That is, the pooled estimate is a weighted average of the group-specific estimates:

\begin{align}
\hat{\beta}_{\text {pooled }}=W_1 \hat{\beta}_1+W_2 \hat{\beta}_2
\end{align}

with weights that sum to the identity:

\begin{align}
W_1=\left(X_1^T X_1+X_2^T X_2\right)^{-1} X_1^T X_1
\end{align}

\begin{align}
W_2=\left(X_1^T X_1+X_2^T X_2\right)^{-1} X_2^T X_2
\end{align}
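Mapping back to our problem, $X$ is the treatment variable and $Y$ is the outcome variable. Once the control variables are partialled out (the treatment is demeaned within each subgroup), $X_k^T X_k = n_k \widehat{\operatorname{Var}}(S \mid C = c_k)$, so each subgroup's weight is proportional to its size times its treatment variance.

Therefore, for a given subgroup size, a subgroup with larger treatment variance receives a larger weight. In the example above, the regression weights are roughly 0.602 and 0.398 for family status 0 and 1, close to but not equal to the subgroup shares 0.613 and 0.387.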
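
The weighting result can be checked numerically. The sketch below uses a separate, hypothetical simulation (not the lab's `df`) in which the treatment effect and the treatment variance both differ across two subgroups, so that the variance-weighted and probability-weighted averages visibly diverge. All parameter values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
c = rng.binomial(1, 0.4, n)                      # binary control variable
s = rng.binomial(1, np.where(c == 1, 0.9, 0.5))  # treatment variance differs by subgroup
effect = np.where(c == 1, 4.0, 2.0)              # assumed heterogeneous effects
y = 1.0 + effect * s + 3.0 * c + rng.normal(0, 1, n)

# Pooled OLS of y on an intercept, the treatment, and the control
X = np.column_stack([np.ones(n), s, c])
ols_coef = np.linalg.lstsq(X, y, rcond=None)[0][1]

diffs, var_weights, shares = [], [], []
for k in (0, 1):
    g = c == k
    diffs.append(y[g & (s == 1)].mean() - y[g & (s == 0)].mean())
    var_weights.append(g.sum() * s[g].var())  # n_k * Var(S | C = c_k)
    shares.append(g.mean())                   # P(C = c_k)

diffs = np.array(diffs)
var_weights = np.array(var_weights) / np.sum(var_weights)
shares = np.array(shares)

variance_weighted = float(diffs @ var_weights)  # regression-style weighting
matching = float(diffs @ shares)                # matching-style weighting

print(f"OLS coefficient on treatment: {ols_coef:.4f}")
print(f"Variance-weighted average:    {variance_weighted:.4f}")
print(f"Matching estimate:            {matching:.4f}")
```

The first two numbers should agree up to floating-point error, while the matching estimate differs because the subgroup with the larger effect has the smaller treatment variance.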

### Curse of Dimensionality (matching)

If you have a large number of control variables, say $k$, and they are all binary, then the number of subgroups you need to match on is $2^k$.

If you have continuous control variables, you can discretize them into bins, and the number of subgroups you need to match on is the number of bins (multiplied across variables when there are several).
1. What if a bin contains only one observation?
   1. Increase the size of the bin (less accurate, loss of information)
   2. Remove the data point (fewer observations)
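
To make the $2^k$ growth concrete, the sketch below draws a fixed sample of 1,000 units with $k$ binary controls and counts how many subgroups are possible, how many actually contain data, and how many contain a single observation. The helper `stratum_counts` and the assumption of independent, evenly split binary controls are illustrative choices only.

```python
import numpy as np

def stratum_counts(n, k, rng):
    """Draw n units with k binary controls; count possible, occupied, and singleton strata."""
    covariates = rng.integers(0, 2, size=(n, k))
    # Encode each covariate profile as a single integer stratum id
    stratum_id = covariates @ (2 ** np.arange(k))
    _, counts = np.unique(stratum_id, return_counts=True)
    return 2 ** k, counts.size, int((counts == 1).sum())

rng = np.random.default_rng(0)
n = 1000
for k in (2, 5, 10, 15):
    possible, occupied, singletons = stratum_counts(n, k, rng)
    print(f"k={k:2d}: possible={possible:6d}, occupied={occupied:4d}, singletons={singletons:4d}")
```

With the sample size held fixed, the number of occupied subgroups can never exceed $n$, so for large $k$ most subgroups are empty and many of the rest hold a single unit with no counterpart to match.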


## Instrumental Variables

Imagine we want to estimate the **causal effect** of a treatment variable, $S$, on some outcome, $Y$. But there’s a problem: we suspect that $S$ is **correlated with unobserved factors** that also affect $Y$. These unobserved factors are captured by the variable $U$. This means:

$$S \not\!\perp\!\!\!\perp U.$$

In other words, $S$ and $U$ are related, which can lead to **biased and inconsistent estimates** when we try to estimate the causal effect of $S$ on $Y$.

> **But is there a way to estimate the causal effect despite this problem?**  
**Yes!** We can use an **instrumental variable (IV)** to help us out.

### What is an Instrumental Variable?

An **instrumental variable (IV)**, denoted by $Z$, is a variable that:
- **Affects** the treatment variable $S$.
- Has **no direct effect** on the outcome variable $Y$.
- Is **independent** of the unobserved confounders $U$ that are correlated with both $S$ and $Y$.

The goal is that the instrument $Z$ can help us isolate the variation in $S$ that is **uncorrelated** with the unobserved confounders $U$, allowing us to estimate a more accurate causal effect.

<img src="https://raw.githubusercontent.com/dapivei/causal-infere/main/images/instrument.png" width="500"/>

### When Should We Use IVs?

Instrumental Variables are especially useful when:
> - The treatment variable $S$ suffers from **measurement error**, **omitted variable bias**, or **reverse causality** (where $Y$ affects $S$ instead of the other way around).

### Key Assumptions for a Valid Instrument

For $Z$ to be a valid instrument, it must satisfy three important assumptions:

1. **Exogeneity**:  
   The instrument $Z$ must be **uncorrelated** with the unobserved factors $U$ that affect $Y$ and with the unobserved factors $V$ that affect $S$, i.e., $Z$ must not be influenced by any variables that also affect $Y$. Together with the requirement that $Z$ has no direct effect on $Y$, this ensures that $Z$ influences $Y$ only through $S$.  
   $$ U \perp Z \quad \text{and} \quad V \perp Z $$

2. **Relevance**:  
   The instrument $Z$ must have a **non-zero** effect on the treatment $S$ (ideally a strong one). In other words, $Z$ should change the likelihood that someone receives the treatment $S$, at least for some sub-group of the population.  
   $$ S(Z=1, V) \neq S(Z=0, V) $$

3. **Monotonicity**:  
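   The instrument $Z$ must affect $S$ in the **same direction** for all individuals. That is, $Z$ should either **increase** or **decrease** the likelihood of receiving treatment, but never push some individuals one way and others the opposite way. Individuals whose treatment does not respond to $Z$ are allowed, so the inequality is weak. When $Z$ encourages treatment:  
   $$ S(Z=1, V) \geq S(Z=0, V) \quad \text{for all individuals} $$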

### Local Average Treatment Effect (LATE)

If these assumptions hold, we can estimate the **Local Average Treatment Effect (LATE)**. This is the causal effect of $S$ on $Y$ for a specific group of people called the **compliers**—those whose treatment status is affected by the instrument $Z$.

> **Compliers** are individuals whose treatment $S$ is influenced by the instrument $Z$. They are the population we can estimate the causal effect for using IVs.
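
One standard way to estimate the LATE with a binary instrument is the Wald ratio, which the text above does not spell out: the difference in mean outcomes between $Z=1$ and $Z=0$ divided by the difference in treatment rates between $Z=1$ and $Z=0$. The sketch below illustrates it on a hypothetical simulation in which an unobserved confounder `u` drives both compliance type and income. All thresholds and effect sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

u = rng.normal(0, 1, n)       # unobserved confounder
z = rng.binomial(1, 0.5, n)   # randomly assigned binary instrument

# Assumed compliance types, driven by u (no defiers, so monotonicity holds)
always = u > 0.8
never = u < -0.5
complier = ~always & ~never

# Realized treatment: only compliers follow the instrument
s = np.where(always, 1, np.where(never, 0, z))

# Assumed effects: 2.0 for compliers, 5.0 for always-takers, 0.0 for never-takers
effect = np.where(complier, 2.0, np.where(always, 5.0, 0.0))
y = 1.0 + effect * s + 3.0 * u + rng.normal(0, 1, n)

naive = y[s == 1].mean() - y[s == 0].mean()          # confounded comparison
reduced_form = y[z == 1].mean() - y[z == 0].mean()   # effect of Z on Y
first_stage = s[z == 1].mean() - s[z == 0].mean()    # effect of Z on S
wald = reduced_form / first_stage

print(f"Naive difference in means: {naive:.3f}")
print(f"First stage:               {first_stage:.3f}")
print(f"Wald (IV) estimate:        {wald:.3f}")
print("Assumed effect for compliers: 2.000")
```

Under these assumptions the Wald estimate should land near the complier effect rather than the always-taker effect or a population average, and the first stage should be close to the share of compliers.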


### Types of Individuals in IV Analysis

The population can be divided into four groups based on how they respond to the instrument $Z$:

1. **Always-takers**: These individuals always take the treatment, regardless of the instrument $Z$.  
   $$ S(Z=1, V) = S(Z=0, V) = 1 $$

2. **Never-takers**: These individuals never take the treatment, regardless of the instrument $Z$.  
   $$ S(Z=1, V) = S(Z=0, V) = 0 $$

3. **Compliers**: These individuals comply with the instrument. They take the treatment when $Z = 1$, but not when $Z = 0$.  
   $$ S(Z=1, V) = 1 \quad \text{and} \quad S(Z=0, V) = 0 $$

4. **Defiers**: These individuals do the opposite of what the instrument suggests. They take the treatment when $Z = 0$, but not when $Z = 1$.  
   $$ S(Z=1, V) = 0 \quad \text{and} \quad S(Z=0, V) = 1 $$
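
The four definitions above amount to a lookup on the pair of potential treatments $\big(S(Z=1, V),\, S(Z=0, V)\big)$. The sketch below encodes that lookup; the function name `compliance_type` and the example pairs are hypothetical. In real data only one of the two potential treatments is observed for each person, so individuals cannot be labeled this way directly.

```python
def compliance_type(s_if_z1, s_if_z0):
    """Label a unit from its potential treatments S(Z=1) and S(Z=0)."""
    labels = {
        (1, 1): "always-taker",
        (0, 0): "never-taker",
        (1, 0): "complier",
        (0, 1): "defier",
    }
    return labels[(s_if_z1, s_if_z0)]

# Hypothetical potential-treatment pairs for five individuals
potential_treatments = [(1, 1), (0, 0), (1, 0), (1, 0), (0, 1)]
types = [compliance_type(a, b) for a, b in potential_treatments]
print(types)

# Monotonicity (with Z encouraging treatment) holds only if there are no defiers
monotonicity_holds = "defier" not in types
print(f"Monotonicity holds: {monotonicity_holds}")
```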