# Lecture 10: Instrumental Variables (IV)

[!["Open In Colab"](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/<ORG>/<REPO>/blob/main/lectures/L10_IV/L10_IV_student.ipynb)

## Learning Objectives
1. Define the core assumptions of **Instrumental Variables**: Relevance, Exclusion, and Independence.
2. Calculate the **Wald Estimator** for a binary instrument and treatment.
3. Implement **Two-Stage Least Squares (2SLS)** using Python.
4. Interpret the IV estimate as a **Local Average Treatment Effect (LATE)**, the effect among compliers.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from phs564_ci.datasets import load_data
from phs564_ci.estimators.iv import wald_estimator, iv_2sls

# Load data with an instrument 'Z' and unmeasured confounding
df = load_data("l10_iv.csv")
df.head()

--- 
### 1. The Problem: Unmeasured Confounding
In this dataset, the crude association between treatment $A$ and outcome $Y$ is biased because an unmeasured variable $U$ affects both.

In [None]:
crude_rd = df[df['A'] == 1]['Y'].mean() - df[df['A'] == 0]['Y'].mean()
print(f"Crude RD (Biased): {crude_rd:.3f}")

--- 
### 2. The Solution: The Wald Estimator
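If $Z$ is a valid instrument and there are no defiers (monotonicity), the ratio of the effect of $Z$ on $Y$ to the effect of $Z$ on $A$ identifies the average effect of $A$ on $Y$ among compliers.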

In [None]:
# Wald Estimator = [E(Y|Z=1) - E(Y|Z=0)] / [E(A|Z=1) - E(A|Z=0)]
iv_effect = wald_estimator(df, 'Y', 'A', 'Z')
print(f"Wald IV Estimate (LATE): {iv_effect:.3f}")
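
The formula in the comment above can be checked by hand. The following is a minimal sketch on simulated data, not the course's `wald_estimator` implementation: the helper name `wald_by_hand`, the simulated coefficients, and the seed are all assumptions made for illustration. The data-generating process uses a constant treatment effect of 0.5, so the crude difference should overshoot it while the Wald ratio should land near it.

```python
import numpy as np
import pandas as pd

def wald_by_hand(data, outcome, treatment, instrument):
    """Hypothetical helper: reduced form divided by first stage."""
    means = data.groupby(instrument)[[outcome, treatment]].mean()
    effect_on_y = means.loc[1, outcome] - means.loc[0, outcome]      # E(Y|Z=1) - E(Y|Z=0)
    effect_on_a = means.loc[1, treatment] - means.loc[0, treatment]  # E(A|Z=1) - E(A|Z=0)
    return effect_on_y / effect_on_a

# Simulated data (assumed values, not the course dataset)
rng = np.random.default_rng(0)
n = 50_000
U = rng.normal(size=n)                         # unmeasured confounder
Z = rng.binomial(1, 0.5, size=n)               # randomized binary instrument
p_treat = np.clip(0.3 + 0.4 * Z + 0.1 * U, 0, 1)
A = rng.binomial(1, p_treat)                   # treatment depends on Z and U
Y = 0.5 * A + 0.5 * U + rng.normal(size=n)     # true effect of A is 0.5

sim = pd.DataFrame({"Z": Z, "A": A, "Y": Y})
crude = sim.loc[sim["A"] == 1, "Y"].mean() - sim.loc[sim["A"] == 0, "Y"].mean()
wald = wald_by_hand(sim, "Y", "A", "Z")

print(f"Crude RD: {crude:.3f}")
print(f"Wald estimate: {wald:.3f}")
```

Because $U$ raises both the chance of treatment and the outcome, the crude contrast mixes the treatment effect with the effect of $U$; the Wald ratio uses only the variation in $A$ induced by $Z$.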

--- 
## 🛑 Activity 1: Critique IV candidates (Slide 11)

**Candidate:** Use **Distance to Surgery Center** as an instrument for whether a patient receives a specific surgery ($A$).

1. **Relevance:** Is distance likely to affect whether someone gets surgery? 
2. **Exclusion:** Does distance affect the outcome *only* through its effect on surgery? (What if distance also affects travel stress or follow-up quality?)
3. **Independence:** Is distance unrelated to unmeasured patient health ($U$)? (Do healthier people live closer to hospitals?)

--- 
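### 🖼️ Figure Generation: Weak Instruments (Slide 12)
A weak instrument makes the Wald denominator $E(A|Z=1) - E(A|Z=0)$ close to zero, so small sampling errors produce large swings in the estimate. The simulation below compares the sampling distribution of the Wald estimator under a strong and a weak instrument.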

In [None]:
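def simulate_iv(relevance, n=1000):
    """Simulate one dataset and return its Wald estimate (true effect = 0.5)."""
    U = np.random.normal(0, 1, n)      # unmeasured confounder
    Z = np.random.binomial(1, 0.5, n)  # binary instrument
    # Clip to [0, 1]: the linear index can fall outside the valid probability
    # range, in which case np.random.binomial raises a ValueError
    p_treat = np.clip(0.2 + relevance * Z + 0.3 * U, 0, 1)
    A = np.random.binomial(1, p_treat)
    Y = 0.5 * A + 0.5 * U + np.random.normal(0, 1, n)
    temp_df = pd.DataFrame({'Z': Z, 'A': A, 'Y': Y})
    return wald_estimator(temp_df, 'Y', 'A', 'Z')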

strong_results = [simulate_iv(relevance=0.4) for _ in range(200)]
weak_results = [simulate_iv(relevance=0.05) for _ in range(200)]

plt.figure(figsize=(10, 6))
sns.kdeplot(strong_results, label='Strong Instrument (Rel=0.4)', fill=True)
sns.kdeplot(weak_results, label='Weak Instrument (Rel=0.05)', fill=True)
plt.axvline(0.5, color='red', linestyle='--', label='True Effect')
plt.title("IV Performance: Strong vs. Weak Instruments")
plt.legend()
import os
os.makedirs("figures/L10", exist_ok=True)  # create the output directory if it is missing
plt.savefig("figures/L10/weak_iv.png")
plt.show()
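
One common diagnostic for instrument strength is the first-stage F-statistic; a widely used rule of thumb treats values below about 10 as a warning sign. The sketch below computes it with plain NumPy for a single instrument, where F equals the squared t-statistic of the slope in the regression of $A$ on $Z$. The function name `first_stage_f` and the seed are assumptions for illustration, and the standard error assumes homoskedastic errors.

```python
import numpy as np

def first_stage_f(A, Z):
    """Hypothetical helper: F-statistic for the slope in A ~ 1 + Z."""
    X = np.column_stack([np.ones(len(Z)), Z])
    beta, *_ = np.linalg.lstsq(X, A, rcond=None)
    resid = A - X @ beta
    n, k = X.shape
    sigma2 = resid @ resid / (n - k)           # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # homoskedastic covariance of beta
    t_stat = beta[1] / np.sqrt(cov[1, 1])
    return t_stat ** 2                         # with one instrument, F = t^2

rng = np.random.default_rng(1)
n = 1000
U = rng.normal(size=n)
Z = rng.binomial(1, 0.5, size=n)

f_stats = {}
for relevance in (0.4, 0.05):
    # Same treatment model as the simulation above, with probabilities clipped
    p_treat = np.clip(0.2 + relevance * Z + 0.3 * U, 0, 1)
    A = rng.binomial(1, p_treat)
    f_stats[relevance] = first_stage_f(A, Z)
    print(f"relevance={relevance}: first-stage F = {f_stats[relevance]:.1f}")
```

A large F only supports relevance; it says nothing about exclusion or independence.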

--- 
### 3. Two-Stage Least Squares (2SLS)
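2SLS generalizes the Wald estimator to settings with continuous variables or measured covariates $L$. With a single binary instrument and no covariates, the 2SLS point estimate equals the Wald estimate.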

In [None]:
from linearmodels.iv import IV2SLS

# 2SLS using the linearmodels library (wrapped in our helper)
iv_res = iv_2sls(df, 'Y', 'A', 'Z', [])
print(iv_res.summary.tables[1])
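
To make the two stages concrete, here is a minimal NumPy sketch on simulated data. It is not the internals of `iv_2sls` or `linearmodels`: the function name `two_stage_ls`, the simulated coefficients, and the seed are assumptions for illustration. Stage 1 regresses the treatment on the instrument and covariates; stage 2 regresses the outcome on the fitted treatment and the same covariates.

```python
import numpy as np

def two_stage_ls(y, a, z, covariates=None):
    """Hypothetical helper: 2SLS point estimate for the effect of a on y."""
    const = np.ones((len(y), 1))
    W = const if covariates is None else np.column_stack([const, covariates])
    # Stage 1: predict treatment from instrument and covariates
    X1 = np.column_stack([z, W])
    gamma, *_ = np.linalg.lstsq(X1, a, rcond=None)
    a_hat = X1 @ gamma
    # Stage 2: regress outcome on predicted treatment and covariates
    X2 = np.column_stack([a_hat, W])
    beta, *_ = np.linalg.lstsq(X2, y, rcond=None)
    return beta[0]  # coefficient on the fitted treatment

# Simulated data (assumed values, not the course dataset)
rng = np.random.default_rng(2)
n = 50_000
U = rng.normal(size=n)                      # unmeasured confounder
L = rng.normal(size=n)                      # measured covariate
Z = rng.binomial(1, 0.5, size=n)            # binary instrument
p_treat = np.clip(0.3 + 0.4 * Z + 0.1 * U + 0.1 * L, 0, 1)
A = rng.binomial(1, p_treat).astype(float)
Y = 0.5 * A + 0.5 * U + 0.3 * L + rng.normal(size=n)  # true effect is 0.5

est_plain = two_stage_ls(Y, A, Z)           # no covariates
est_adj = two_stage_ls(Y, A, Z, L)          # adjusting for L
wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (A[Z == 1].mean() - A[Z == 0].mean())

print(f"2SLS without covariates: {est_plain:.3f}")
print(f"2SLS adjusting for L:    {est_adj:.3f}")
print(f"Wald ratio:              {wald:.3f}")
```

The sketch returns point estimates only. Standard errors taken from an ordinary second-stage regression are not valid, because they ignore that the fitted treatment was itself estimated; a dedicated IV routine handles this.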

### 4. Summary
- IV can identify a causal effect even when $A$ and $Y$ share an unmeasured confounder $U$.
- It requires strict assumptions that cannot be fully verified from the data.
- Weak instruments lead to unstable and potentially biased estimates.