# Lecture 06: Confounding and Adjustment Strategies

[!["Open In Colab"](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/<ORG>/<REPO>/blob/main/lectures/L06_Confounding/L06_Confounding_student.ipynb)

## Learning Objectives
1. Define confounding using **potential outcomes**.
2. Compare different adjustment strategies: **Standardization, Regression, and Matching**.
3. Understand **overadjustment bias** and how to avoid it using DAGs.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from phs564_ci.datasets import load_data
from phs564_ci.diagnostics.balance import calculate_smd

# Load data with multiple confounders
df = load_data("l06_confounding.csv")
df.head()

--- 
## 🛑 Activity 1: Choose adjustment set (Slide 11)

**Scenario:** You have measured `L1`, `L2`, `L3`, and `M`.
- Your DAG shows: `L1 -> A`, `L1 -> Y`, `A -> M -> Y`, `A -> L2 <- Y`.

1. Which variable is a **confounder**?
2. Which is a **mediator**?
3. Which is a **collider**?
4. What is your **minimal sufficient adjustment set**?

--- 
### 1. Strategy A: Regression Adjustment (Outcome Modeling)
We fit an outcome model for $E[Y \mid A, L]$ and read the coefficient on $A$ as the adjusted risk difference (RD).
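
Written out for the model fit in this section, and assuming (as the formula `Y ~ A + L1 + L2` does) a linear mean with no product terms between $A$ and the covariates, the regression is as follows. The $\beta$ symbols are notation introduced here for illustration.

$$
E[Y \mid A, L_1, L_2] = \beta_0 + \beta_1 A + \beta_2 L_1 + \beta_3 L_2
$$

Under this assumed form, the treated-versus-untreated contrast is the same at every covariate value:

$$
E[Y \mid A=1, L_1, L_2] - E[Y \mid A=0, L_1, L_2] = \beta_1
$$

so `model.params['A']` is read as the adjusted RD. If the effect of $A$ varies with $L_1$ or $L_2$, a single coefficient no longer summarizes it.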

In [None]:
# Fit linear model with confounders L1 and L2
model = smf.ols("Y ~ A + L1 + L2", data=df).fit()
print(f"Regression Adjusted RD: {model.params['A']:.3f}")

# Compare with the crude (unadjusted) difference in mean outcomes
crude_rd = df.loc[df['A'] == 1, 'Y'].mean() - df.loc[df['A'] == 0, 'Y'].mean()
print(f"Crude RD: {crude_rd:.3f}")
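
Standardization, the third strategy named in the learning objectives, can be sketched on simulated data. This is a minimal illustration, assuming a single binary confounder; the names `L_sim`, `A_sim`, `Y_sim` and all numeric values are hypothetical and are not taken from `l06_confounding.csv`. The standardized mean under treatment level $a$ is $\sum_l E[Y \mid A=a, L=l]\,P(L=l)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical simulated data: binary confounder, binary treatment, continuous outcome
L_sim = rng.binomial(1, 0.4, size=n)
A_sim = rng.binomial(1, np.where(L_sim == 1, 0.7, 0.2))      # L raises the chance of treatment
Y_sim = 1.0 * A_sim + 2.0 * L_sim + rng.normal(0, 1, size=n)  # true effect of A is 1.0

def standardized_mean(a):
    """Average the stratum-specific means E[Y | A=a, L=l] over the marginal P(L=l)."""
    total = 0.0
    for l in (0, 1):
        stratum_mean = Y_sim[(A_sim == a) & (L_sim == l)].mean()
        total += stratum_mean * np.mean(L_sim == l)
    return total

crude_rd_sim = Y_sim[A_sim == 1].mean() - Y_sim[A_sim == 0].mean()
standardized_rd = standardized_mean(1) - standardized_mean(0)

print(f"Simulated crude RD:        {crude_rd_sim:.3f}")
print(f"Simulated standardized RD: {standardized_rd:.3f}")
```

In this simulation the crude contrast mixes the effect of `A_sim` with that of `L_sim`, while the standardized contrast should land near the simulated effect of 1.0. With several or continuous confounders, the stratum means are usually replaced by predictions from an outcome model.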

--- 
### 2. Strategy B: Matching
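For this exercise, we use simple 1:1 nearest-neighbor matching: each treated unit is paired with its closest control on `L1` and `L2`, so the resulting contrast describes the effect among the treated. In practice, you might use libraries like `CausalModel` or `matching`.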

In [None]:
from sklearn.neighbors import NearestNeighbors

# Simple 1:1 matching on L1 and L2
treated = df[df['A'] == 1]
control = df[df['A'] == 0]

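# For each treated unit, find the closest control (Euclidean distance on raw L1, L2)
nn = NearestNeighbors(n_neighbors=1)
nn.fit(control[['L1', 'L2']])
distances, indices = nn.kneighbors(treated[['L1', 'L2']])

# The same control can be selected more than once (matching with replacement)
matched_control = control.iloc[indices.flatten()]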

matched_rd = treated['Y'].mean() - matched_control['Y'].mean()
print(f"Matched RD: {matched_rd:.3f}")

--- 
### 3. Diagnostic: Balance Check (Slide 16)
We use **Standardized Mean Differences (SMD)** to check whether matching balanced `L1` and `L2` across the treated and control groups.

In [None]:
smd_before = calculate_smd(df, 'A', ['L1', 'L2'])
matched_df = pd.concat([treated, matched_control])
smd_after = calculate_smd(matched_df, 'A', ['L1', 'L2'])

print("SMD Before Matching:")
print(smd_before)
print("\nSMD After Matching:")
print(smd_after)
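
To make the diagnostic concrete, here is a minimal sketch of one common SMD definition: the difference in group means divided by a pooled standard deviation. The function `smd_sketch` is hypothetical and written for illustration only; the course helper `calculate_smd` may use a different pooling or sign convention.

```python
import numpy as np

def smd_sketch(x_treated, x_control):
    """Difference in means divided by the square root of the average group variance."""
    x_treated = np.asarray(x_treated, dtype=float)
    x_control = np.asarray(x_control, dtype=float)
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

# Toy values: means 4 and 3, both sample variances equal 4, so pooled SD = 2
print(smd_sketch([2.0, 4.0, 6.0], [1.0, 3.0, 5.0]))  # 0.5
```

A commonly used rule of thumb treats an absolute SMD below about 0.1 as adequate balance, though the threshold is a convention rather than a guarantee.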

--- 
## 🛑 Activity 2: Exit ticket (Slide 21)

1. Give an example of a variable that is a **confounder** for your research question.
2. Give an example of a variable that would be a **mediator** (and thus should not be adjusted for if you want the total effect).

### 4. Summary
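- Regression, matching, and standardization all give valid estimates when their assumptions hold: conditional exchangeability given the adjustment set, positivity, consistency, and (for model-based methods) correct model specification.
- Use SMDs to verify that your adjustment method actually balanced the groups.
- Beware of overadjustment: conditioning on a mediator or a collider can bias the estimate of the total effect.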
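
The overadjustment warning can be illustrated with a small simulation. This is a sketch under stated assumptions: the data-generating process, the variable names (`conf`, `a`, `med`, `coll`, `y`), and every coefficient are hypothetical, chosen so that the total effect of the treatment is 1.5 (a direct effect of 1.0 plus 0.5 through the mediator).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical data-generating process (illustrative names and coefficients)
conf = rng.normal(size=n)                                  # confounder: causes a and y
a = rng.binomial(1, 1 / (1 + np.exp(-conf)))               # binary treatment
med = 0.5 * a + rng.normal(size=n)                         # mediator: a -> med -> y
y = 1.0 * a + 1.0 * med + 1.0 * conf + rng.normal(size=n)  # total effect of a = 1.5
coll = a + y + rng.normal(size=n)                          # collider: a -> coll <- y

def coef_on_a(*covariates):
    """OLS coefficient on `a` from regressing y on a plus the given covariates."""
    X = np.column_stack([np.ones(n), a, *covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

est_confounder_only = coef_on_a(conf)      # targets the total effect
est_plus_mediator = coef_on_a(conf, med)   # blocks the mediated path
est_plus_collider = coef_on_a(conf, coll)  # opens a non-causal path

print(f"Adjust for confounder only:     {est_confounder_only:.3f}")
print(f"Also adjust for the mediator:   {est_plus_mediator:.3f}")
print(f"Also adjust for the collider:   {est_plus_collider:.3f}")
```

In this setup, adjusting for the confounder alone should recover a value near 1.5, adding the mediator should shrink the estimate toward the direct effect of 1.0, and adding the collider should pull it far from either value.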