$$
\begin{array}{c}
\textbf{CAUSAL INFERENCE - Fall 2025}\\\\
\textit{Center for Data Science, New York University} \\\\
\textit{October 31, 2025}\\\\
\text{ Prepared by: Vivek Kumar Agarwal}\\\\
\textbf{Recitation 9: Control Variables and Matching}\\\\
\end{array}
$$


![Causal Inference](../figures/CI_lab8_image1.png)

---

## Today's Recitation 

- Quick Recap
- Matching
- Let's Code!

---

## Part 1 - Quick Recap

## 🎃 The Halloween Brew: Quick Revision of Causal Inference Concepts

### The Halloween Candy Mystery

You notice that kids wearing **expensive, elaborate costumes** come home with **more candy** than kids in simple costumes. 

**Does this mean buying an expensive costume *causes* kids to get more candy?**

**Let's analyze this using what we've learned so far.**

---

### 🧙‍♀️ Concept 1: Association vs Causation

**What we observe:**
- Kids in elaborate costumes (store-bought, $50+) average 150 pieces of candy
- Kids in simple costumes (homemade, $10) average 90 pieces of candy
- Difference: 60 pieces

**The question:** Does spending $40 more on a costume *cause* you to get 60 more pieces of candy?

**Key insight:** Association does not imply causation!

---

### 👻 Concept 2: Confounding Variables

**The hidden factor:** Neighborhood wealth

- Wealthy neighborhoods: Rich parents buy elaborate costumes AND these neighborhoods give out more candy
- Less wealthy neighborhoods: Parents make simple costumes AND these neighborhoods give out less candy

**The confounding structure:**
- Neighborhood wealth → Costume quality (parents can afford more)
- Neighborhood wealth → Candy haul (wealthier neighbors give more candy)

**Result:** Costume quality and candy haul are *associated* because they share a common cause, not because one causes the other!

---

### 🦇 Concept 3: Control Variables

**The solution:** Control for neighborhood wealth

When we compare kids *within the same neighborhood*:
- Kid with elaborate costume: 120 pieces
- Kid with simple costume: 115 pieces  
- True costume effect: Only 5 pieces!

**What happened?**
- **Naive comparison (60 pieces)** = True costume effect (5 pieces) + Selection bias (55 pieces from neighborhood differences)
- By controlling for neighborhood, we remove the confounding and isolate the true causal effect
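
The decomposition above is plain arithmetic, which a few lines of Python make explicit. This is only a sketch using the illustrative numbers from the story (150, 90, 120, 115); the variable names are made up for this example.

```python
# Illustrative numbers from the Halloween story above (not real data).
naive_difference = 150 - 90              # elaborate minus simple, all kids pooled
within_neighborhood_effect = 120 - 115   # same comparison inside one neighborhood

# Whatever the naive gap contains beyond the within-neighborhood effect
# is attributed to neighborhood differences (selection bias).
selection_bias = naive_difference - within_neighborhood_effect

print(f"Naive difference:           {naive_difference} pieces")
print(f"Within-neighborhood effect: {within_neighborhood_effect} pieces")
print(f"Selection bias:             {selection_bias} pieces")
```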

**Key terms:**
- **ATE (Average Treatment Effect):** Average difference in candy, across all kids, between wearing an elaborate costume and wearing a simple one
- **ATT (Average Treatment Effect on Treated):** Effect specifically for kids who actually wore elaborate costumes
- **ATU (Average Treatment Effect on Untreated):** Effect for kids who wore simple costumes, if they had switched

---

### 🕷️ Concept 4: Natural Experiments

**The problem recap:** We can't randomly assign costumes to kids (parents choose), so costume quality is confounded with neighborhood wealth.

**A lucky break - The Costume Store Raffle:**

Imagine the costume store in every neighborhood held a raffle in which it randomly gave away 50 elaborate-costume vouchers:
- Winners got $50 vouchers → bought elaborate costumes
- Non-winners → stuck with simple costumes
- **Crucially:** Winning the raffle was *random* - unrelated to neighborhood wealth, parent income, or anything else

**Why this helps:**
- The raffle creates random variation in costume quality
- Winners and non-winners are similar on average (same neighborhoods, same wealth distribution)
- Any difference in candy between raffle winners and losers must be due to the costume itself!
- This is "as good as random assignment" - a **natural experiment**

**Key insight:** When we find situations where treatment is assigned "as if random," we can estimate causal effects without worrying about confounding!
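
To see why random assignment helps, here is a minimal simulation sketch of the raffle. Everything in it is an assumption chosen for illustration: a fair coin-flip raffle, 40% wealthy neighborhoods, a +5 costume effect, and a +35 neighborhood effect. None of it is data from a real raffle.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Confounder: 1 = wealthy neighborhood, 0 = less wealthy (assumed 40% wealthy).
wealth = rng.binomial(1, 0.4, n)

# Raffle: a fair coin flip decides who gets the elaborate costume,
# so treatment is independent of wealth by construction.
won_raffle = rng.binomial(1, 0.5, n)

# Assumed outcome model: +5 pieces from the costume, +35 from a wealthy neighborhood.
candy = 80 + 5 * won_raffle + 35 * wealth + rng.normal(0, 10, n)

# Simple difference in means between winners and non-winners.
raffle_estimate = candy[won_raffle == 1].mean() - candy[won_raffle == 0].mean()

# Balance check: share of wealthy kids among winners vs non-winners.
wealth_gap = wealth[won_raffle == 1].mean() - wealth[won_raffle == 0].mean()

print(f"Difference in means (winners - non-winners): {raffle_estimate:.2f}")
print(f"Difference in share of wealthy kids:         {wealth_gap:.3f}")
```

Because the raffle ignores wealth, the wealthy share should be nearly identical in both groups, and the raw difference in means should land close to the assumed effect of 5 without any adjustment.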

---

### 🎃 Concept 5: Our Statistical Toolkit for Confounding

Now we know *why* confounding is a problem. But how do we actually deal with it statistically?

**Method 1: Regression with Control Variables**
- Run regression: Candy = β₀ + β₁(Elaborate Costume) + β₂(Neighborhood Wealth) + ε
- β₁ estimates the costume effect *holding neighborhood wealth constant*
- This gives us the causal effect if we've controlled for all confounders

**Method 2: Conditional ATE (Stratification)**
- Calculate average candy for elaborate vs simple costumes *within each neighborhood*
- Take weighted average across neighborhoods
- Compares "apples to apples" - kids from same neighborhood

**Method 3: Matching**
- For each kid with elaborate costume, find a "similar" kid with simple costume (same neighborhood, same age, etc.)
- Compare matched pairs
- Coming up next!

**Method 4: Natural Experiments**
- Find situations with "as-if random" treatment assignment
- Exploit this random variation to estimate causal effects
- No need to control for confounders (they're balanced by randomization)

**The common goal:** All these methods try to achieve the same thing - compare treated and control units that are similar in terms of confounders, so we can isolate the causal effect!

---

<span style="font-size: 3em;">🎃</span>  **To Sum It All Up: The Big Picture**

**Statistical reasoning steps:**

1. **Observe a pattern** (elaborate costumes → more candy)
2. **Ask: Is this causal?** (or just correlation?)
3. **Identify potential confounders** (neighborhood wealth)
4. **Use statistical strategies** (control variables, natural experiments) to isolate causation
5. **Estimate the true effect** (5 pieces, not 60!)

Let us jump into the details of **Matching**, another tool for dealing with confounding!

---

## Part 2 - Matching

---

## 🎃 Matching: Finding Your Candy-Collecting Twin

### The Matching Idea

**The problem we're solving:** Kids with elaborate costumes tend to come from wealthier neighborhoods. When we compare their candy hauls, we're comparing kids from different neighborhoods - not a fair comparison!

**The matching solution:** For each kid with an elaborate costume, find a "twin" with a simple costume from the *same neighborhood* (and maybe same age, same start time, etc.). Compare these matched pairs.

**Key insight:** By matching on confounders, we create treated and control groups that are similar in terms of the confounding variables. Any remaining difference must be due to the treatment itself!

---

### The Formal Setup

**Notation:**
- $Y$ = Candy collected
- $S$ = Treatment (1 = elaborate costume, 0 = simple costume)
- $C$ = Confounding variables (neighborhood wealth, age, etc.)
- $U$ = Unobserved factors
- $Y(1), Y(0)$ = Potential outcomes under treatment and control

**Key assumption:** $S \perp U \mid C$

This means: *Conditional on the observed confounders $C$, treatment is independent of unobserved factors.*

In other words, once we account for neighborhood, age, etc., there's no remaining confounding!

---

### Why Matching Works: The Math

**What we want to estimate:** The Average Treatment Effect (ATE)

$$
\text{ATE} = \mathbb{E}[Y(1) - Y(0)] = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)]
$$

**The key step:** Use the Law of Iterated Expectations to condition on $C$:

$$
\begin{aligned}
\mathbb{E}[Y(1)] &= \mathbb{E}[\mathbb{E}[Y(1) \mid C]] = \sum_{c \in C} \mathbb{E}[Y(1) \mid C=c] \cdot \mathbb{P}(C=c) \\
\mathbb{E}[Y(0)] &= \mathbb{E}[\mathbb{E}[Y(0) \mid C]] = \sum_{c \in C} \mathbb{E}[Y(0) \mid C=c] \cdot \mathbb{P}(C=c)
\end{aligned}
$$

**Now use our assumption $S \perp U \mid C$:** This implies we can identify potential outcomes from observed data:

$$
\begin{aligned}
\mathbb{E}[Y(1) \mid C=c] &= \mathbb{E}[Y \mid S=1, C=c] \\
\mathbb{E}[Y(0) \mid C=c] &= \mathbb{E}[Y \mid S=0, C=c]
\end{aligned}
$$

**Putting it together:**

$$
\begin{aligned}
\text{ATE} &= \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] \\
&= \sum_{c \in C} \mathbb{E}[Y(1) \mid C=c] \cdot \mathbb{P}(C=c) - \sum_{c \in C} \mathbb{E}[Y(0) \mid C=c] \cdot \mathbb{P}(C=c) \\
&= \sum_{c \in C} \left( \mathbb{E}[Y \mid S=1, C=c] - \mathbb{E}[Y \mid S=0, C=c] \right) \cdot \mathbb{P}(C=c)
\end{aligned}
$$

This is exactly our **matching estimator**!
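
As a worked special case, assume a single binary confounder, with $C=1$ for a wealthy neighborhood and $C=0$ otherwise (this coding is an assumption made here to keep the sum short). The last line of the derivation then has just two terms:

$$
\begin{aligned}
\text{ATE} &= \left( \mathbb{E}[Y \mid S=1, C=1] - \mathbb{E}[Y \mid S=0, C=1] \right) \cdot \mathbb{P}(C=1) \\
&\quad + \left( \mathbb{E}[Y \mid S=1, C=0] - \mathbb{E}[Y \mid S=0, C=0] \right) \cdot \mathbb{P}(C=0)
\end{aligned}
$$

Each bracket is a within-neighborhood difference in average candy, and the two probabilities sum to one.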

---

### The Matching Estimator

To obtain treated and control groups with similar covariate distributions, we use the matching estimator:

$$
\begin{aligned}
\text{Matching ATE}
& =  \sum_{c \in C} \left( \mathbb{E}[Y \mid S=1, C = c] - \mathbb{E}[Y \mid S=0, C = c] \right) \cdot \mathbb{P}(C = c)
\end{aligned}
$$

**What this means:**
- For each value of confounders $c$ (e.g., each neighborhood), calculate the treatment effect within that group
- $\mathbb{E}[Y \mid S=1, C = c]$ = Average candy for elaborate costume kids in group $c$
- $\mathbb{E}[Y \mid S=0, C = c]$ = Average candy for simple costume kids in group $c$
- The difference gives us the treatment effect *within* that group
- Weight each group's effect by how common that group is: $\mathbb{P}(C = c)$
- Sum across all groups to get the overall ATE

**Why this works:** Within each matched group (same $c$), treated and control kids are comparable. We're doing "apples-to-apples" comparisons, then averaging across all groups!
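
The recipe above can be sketched with a pandas `groupby`. The ten-row table below is hand-made toy data (not from the simulation later in these notes), and the column names `C`, `S`, `Y` simply mirror the notation.

```python
import pandas as pd

# Hand-made toy data: C = stratum (0/1), S = treatment (0/1), Y = candy.
toy = pd.DataFrame({
    "C": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "S": [1, 0, 0, 1, 0, 0, 1, 1, 0, 1],
    "Y": [86, 80, 82, 84, 78, 80, 121, 119, 115, 120],
})

# E[Y | S=s, C=c] for every cell: strata as rows, treatment status as columns.
cell_means = toy.groupby(["C", "S"])["Y"].mean().unstack("S")

# Within-stratum effects: E[Y | S=1, C=c] - E[Y | S=0, C=c].
stratum_effect = cell_means[1] - cell_means[0]

# Stratum weights P(C=c), estimated by sample shares.
p_c = toy["C"].value_counts(normalize=True).sort_index()

# Weighted sum across strata.
ate_hat = (stratum_effect * p_c).sum()

print(stratum_effect)
print(f"Matching ATE estimate: {ate_hat:.2f}")
```

In this toy table both strata have a within-stratum effect of 5, so the weighted average is 5 regardless of the weights. With heterogeneous effects, the choice of weights starts to matter.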

---

### Regression vs Matching: Different Weighting Schemes

**Important distinction:** Regression also estimates treatment effects by controlling for confounders, but it uses *different weights* than matching!

**Regression estimator:**
$$
\begin{aligned}
\alpha_1=\sum_{k=1}^K \mathbb{E}\left[Y(S=1, U)-Y(S=0, U) \mid \mathbf{C}=\mathbf{c}_k\right] W\left(\mathbf{C}=\mathbf{c}_k\right)
\end{aligned}
$$

where $W(\mathbf{C}=\mathbf{c}_k)$ is the weight of subgroup $k$.

**Key difference:**
$$
\begin{aligned}
W(C = c_k) \neq \mathbb{P}(C = c_k)
\end{aligned}
$$

**What this means:**
- **Matching:** Weights each subgroup by its population probability $\mathbb{P}(C = c_k)$
- **Regression:** Uses weights $W(C = c_k)$ determined by the variance-covariance structure of the data
- The regression weights depend on how the control variables vary in the sample, not just their frequencies

**The implication:** Matching and regression can give different estimates of the ATE, even with the same data, because they weight subgroups differently!
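
To make the two weighting schemes concrete, here is a small sketch. It assumes a binary treatment and a regression that includes an indicator for every stratum, in which case the regression weight of stratum $k$ is proportional to $\mathbb{P}(C=c_k) \cdot \text{Var}(S \mid C=c_k)$, in line with the derivation at the end of these notes. The helper `compare_weights` and the second set of numbers are hypothetical.

```python
import numpy as np

def compare_weights(p_c, e_c):
    """Return (matching weights, regression weights) for strata with
    population shares p_c and treatment probabilities e_c = P(S=1 | C=c).

    Assumes binary treatment and a regression saturated in the strata."""
    p_c = np.asarray(p_c, dtype=float)
    e_c = np.asarray(e_c, dtype=float)
    treat_var = e_c * (1 - e_c)        # Var(S | C=c) for a binary S
    w_reg = p_c * treat_var            # unnormalized regression weights
    return p_c, w_reg / w_reg.sum()

# Treatment probabilities 0.3 and 0.7 give the same variance (0.21) in both
# strata, so the regression weights collapse to the population shares.
w_match_a, w_reg_a = compare_weights([0.6, 0.4], [0.3, 0.7])

# Hypothetical case with very different treatment variance across strata.
w_match_b, w_reg_b = compare_weights([0.6, 0.4], [0.5, 0.9])

print("Equal variance:   matching", w_match_a, "regression", w_reg_a.round(3))
print("Unequal variance: matching", w_match_b, "regression", w_reg_b.round(3))
```

The first case uses the same treatment probabilities as the simulation below (30% and 70%), which suggests why matching and regression end up so close there. In the second case the stratum where treatment is nearly universal is down-weighted by regression.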

---

### Matching in Practice: Exact vs Inexact Matching

**The challenge:** Our mathematical derivation assumed we could find exact matches - kids with *exactly* the same neighborhood, age, start time, etc. But what if:
- We have continuous confounders (e.g., neighborhood income is $45,231 vs $45,287)?
- We have many confounders (neighborhood, age, height, parent education, etc.)?
- No exact matches exist in our data?

---

### Exact Matching

**What it is:** Find control units with *exactly* the same covariate values as treated units.

**Example:** Match each kid with elaborate costume to a kid with simple costume from the exact same neighborhood.

**Pros:**
- Clean interpretation: Perfect "apples-to-apples" comparison within strata
- The math we derived above applies directly

**Cons:**
- Often impossible with continuous variables or many covariates
- Wastes data (many treated units can't be matched)
- "Curse of dimensionality" - with 10 binary covariates, there are 1,024 possible combinations!

---

### The Dimension Problem

**Example:** Suppose we want to match on:
- Neighborhood (10 categories)
- Age (5 categories)  
- Start time (3 categories)
- Parent education (4 categories)

That's $10 \times 5 \times 3 \times 4 = 600$ possible combinations! Most will have zero or very few observations.
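
A quick sketch of how sparse those 600 cells get. It assumes 1,000 kids whose covariates are drawn uniformly and independently, which is a simplification made only for illustration; real covariates are usually correlated and unevenly distributed.

```python
import numpy as np

# Number of categories per matching variable (from the example above).
levels = {"neighborhood": 10, "age": 5, "start_time": 3, "parent_education": 4}
n_cells = int(np.prod(list(levels.values())))

rng = np.random.default_rng(0)
n_kids = 1000  # assumed sample size

# Assumption: each covariate is uniform and independent of the others.
draws = np.column_stack([rng.integers(0, k, n_kids) for k in levels.values()])

# Count how many kids fall into each occupied covariate combination.
_, counts = np.unique(draws, axis=0, return_counts=True)
occupied = len(counts)
empty_cells = n_cells - occupied
sparse_cells = int((counts < 5).sum())

print(f"Possible combinations:          {n_cells}")
print(f"Combinations with no kids:      {empty_cells}")
print(f"Occupied cells with < 5 kids:   {sparse_cells}")
```

Exact matching needs both a treated and a control kid in a cell, so the number of usable cells is smaller still than the number of occupied ones.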

**The insight:** Maybe we don't need to match on *all* dimensions separately. What if we could summarize all confounders into a single score?

---

### Propensity Scores: Reducing Dimensions

**What is a propensity score?**

The propensity score is the **probability of receiving treatment given the confounders**:

$$
e(C) = P(S=1 \mid C)
$$

**In our example:**
$$
e(C) = P(\text{Elaborate Costume} = 1 \mid \text{Neighborhood, Age, Start Time, etc.})
$$

This is the probability that a kid wears an elaborate costume, given their characteristics.

**Key property:** If $S \perp U \mid C$ (conditional independence), then $S \perp U \mid e(C)$ as well!

**What this means:** Instead of matching on all confounders $C$, we can match on just the propensity score $e(C)$! This reduces many dimensions to just one number.
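
Here is a minimal sketch of estimating a propensity score with logistic regression via `statsmodels`. The covariates (`income`, `age`), their distributions, and the coefficients of the treatment model are all made up for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000

# Hypothetical confounders.
kids = pd.DataFrame({
    "income": rng.normal(50, 15, n),   # neighborhood income in $1,000s
    "age": rng.integers(5, 13, n),     # ages 5 to 12
})

# Assumed treatment model: higher income and age raise the chance
# of an elaborate costume.
logit_p = -3 + 0.05 * kids["income"] + 0.05 * kids["age"]
kids["elaborate"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

# Fit P(S=1 | C) with a logit model and store the fitted probabilities.
ps_model = smf.logit("elaborate ~ income + age", data=kids).fit(disp=0)
kids["pscore"] = ps_model.predict(kids)

print(kids[["income", "age", "elaborate", "pscore"]].head())
```

Two confounders have been collapsed into the single column `pscore`, which is the quantity one would then match on.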

---

### Inexact Matching (Approximate Matching)

**What it is:** When exact matches don't exist, find the "closest" control unit for each treated unit.

**Common approaches:**

**1. Nearest Neighbor Matching**
   - For each treated unit, find the control unit with most similar covariates
   - Measure similarity using distance metrics (Euclidean distance, Mahalanobis distance)
   - Example: Match kids from neighborhoods with similar income levels (within $5,000)

**2. Propensity Score Matching**
   - Step 1: Estimate the propensity score $e(C) = P(S=1 \mid C)$ using logistic regression
   - Step 2: For each treated unit, find control unit(s) with similar propensity score
   - Example: A kid with 70% probability of elaborate costume is matched with another kid who also has ~70% probability
   - **Advantage:** Reduces all confounders to a single dimension - much easier to find matches!

**3. Caliper Matching**
   - Only match if control unit is within a specified distance ("caliper") of treated unit
   - Can use calipers on propensity scores or individual covariates
   - Discards treated units without close matches
   - Example: Only match if propensity scores differ by less than 0.05

**Trade-offs:**
- **Bias vs Variance:** Close matches reduce bias but may increase variance (fewer matches)
- **Match quality vs Sample size:** Strict matching criteria → better matches but smaller sample
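
The sketch below combines nearest-neighbor matching with a caliper on a one-dimensional score (for example, a propensity score). It matches each treated unit to its closest control with replacement, so it estimates the effect for the treated units that were kept. The function `caliper_nn_match` and the six-row example are hypothetical.

```python
import numpy as np

def caliper_nn_match(score, treated, outcome, caliper=0.05):
    """1-nearest-neighbor matching on a score, with replacement.

    Treated units whose closest control is farther than `caliper` are dropped.
    Returns (mean treated-minus-control difference, number of treated kept)."""
    score = np.asarray(score, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    outcome = np.asarray(outcome, dtype=float)

    s_t, y_t = score[treated], outcome[treated]
    s_c, y_c = score[~treated], outcome[~treated]

    # Distance from every treated unit (rows) to every control unit (columns).
    dist = np.abs(s_t[:, None] - s_c[None, :])
    nearest = dist.argmin(axis=1)
    nearest_dist = dist[np.arange(len(s_t)), nearest]

    # Enforce the caliper, then average the matched-pair differences.
    keep = nearest_dist <= caliper
    diffs = y_t[keep] - y_c[nearest[keep]]
    return diffs.mean(), int(keep.sum())

# Tiny hand-made example: the treated kid with score 0.95 has no close control.
score   = [0.70, 0.72, 0.30, 0.31, 0.95, 0.69]
treated = [1,    0,    1,    0,    1,    0]
candy   = [125,  119,  88,   84,   130,  121]

att_hat, n_matched = caliper_nn_match(score, treated, candy)
print(f"Matched treated units: {n_matched}, estimated effect: {att_hat:.1f}")
```

Widening the caliper would keep the third treated kid but pair them with a much less similar control, which is the bias-versus-sample-size trade-off in miniature.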

---

### The Bottom Line

**Exact matching** gives us the cleanest causal estimates but is often infeasible.

**Inexact matching** is practical but introduces approximation - we're not perfectly removing confounding, just reducing it.

**Propensity scores** solve the dimensionality problem by summarizing all confounders into one number.

**Regression** doesn't require finding individual matches but makes stronger functional form assumptions (linearity).

All methods try to achieve the same goal: **compare similar units to isolate causal effects!**

---

## Let us Simulate our Halloween Candy Mystery!

In [1]:
"""
🎃 HALLOWEEN CANDY MATCHING DEMONSTRATION 🎃

Research Question: Does wearing an elaborate costume CAUSE kids to collect more candy?
Or is the relationship confounded by neighborhood wealth?

We'll use MATCHING to answer this question!
"""

import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# ============================================================================
# STEP 1: SIMULATE THE DATA GENERATING PROCESS
# ============================================================================
# We know the TRUE causal structure because we're simulating it!
# This lets us check if our methods recover the true effect.

np.random.seed(42)
n = 1000

# Simulate neighborhood wealth (the CONFOUNDER)
# 40% of kids live in rich neighborhoods, 60% in poor neighborhoods
neighborhood_wealth = np.random.binomial(1, 0.4, n)  # 1 = rich, 0 = poor

print("Step 1: Simulating neighborhood wealth (the confounder)")
print(f"  - {(neighborhood_wealth == 1).sum()} kids in rich neighborhoods")
print(f"  - {(neighborhood_wealth == 0).sum()} kids in poor neighborhoods")
print()

Step 1: Simulating neighborhood wealth (the confounder)
  - 387 kids in rich neighborhoods
  - 613 kids in poor neighborhoods



In [2]:
# ============================================================================
# STEP 2: SIMULATE COSTUME CHOICE (THE TREATMENT)
# ============================================================================
# Costume choice depends on neighborhood wealth:
#   - Rich neighborhood: 70% probability of elaborate costume
#   - Poor neighborhood: 30% probability of elaborate costume
# This creates CONFOUNDING!

elaborate_costume = np.random.binomial(
    1, 
    0.7 * neighborhood_wealth + 0.3 * (1 - neighborhood_wealth), 
    n
)

print("Step 2: Simulating costume choice (confounded by neighborhood)")
print(f"  Rich neighborhoods: {(elaborate_costume[neighborhood_wealth==1] == 1).sum()}/{(neighborhood_wealth==1).sum()} elaborate ({(elaborate_costume[neighborhood_wealth==1] == 1).sum()/(neighborhood_wealth==1).sum()*100:.1f}%)")
print(f"  Poor neighborhoods: {(elaborate_costume[neighborhood_wealth==0] == 1).sum()}/{(neighborhood_wealth==0).sum()} elaborate ({(elaborate_costume[neighborhood_wealth==0] == 1).sum()/(neighborhood_wealth==0).sum()*100:.1f}%)")
print()

Step 2: Simulating costume choice (confounded by neighborhood)
  Rich neighborhoods: 262/387 elaborate (67.7%)
  Poor neighborhoods: 182/613 elaborate (29.7%)



In [3]:
# ============================================================================
# STEP 3: SIMULATE CANDY COLLECTION (THE OUTCOME)
# ============================================================================
# The TRUE data generating process:
#   Base candy = 80 pieces
#   + 5 pieces if elaborate costume (TRUE CAUSAL EFFECT)
#   + 35 pieces if rich neighborhood (confounding effect)
#   + random noise

noise = np.random.normal(0, 10, n)

# What we OBSERVE in the real world:
candy_observe = 80 + 5 * elaborate_costume + 35 * neighborhood_wealth + noise

# POTENTIAL OUTCOMES (counterfactuals we don't observe):
# What if everyone wore elaborate costumes?
candy_elaborate = 80 + 5 * 1 + 35 * neighborhood_wealth + noise

# What if everyone wore simple costumes?
# Note: Kids who chose elaborate are slightly more strategic (+3 bonus)
candy_simple = 80 + 5 * 0 + 3 * elaborate_costume + 35 * neighborhood_wealth + noise

# Put everything in a DataFrame
df = pd.DataFrame({
    'Neighborhood_Wealth': neighborhood_wealth,
    'Elaborate_Costume': elaborate_costume,
    'Candy_Observe': candy_observe,
    'Candy_Elaborate': candy_elaborate,
    'Candy_Simple': candy_simple
})

print("Step 3: Candy collection summary")
print(df[['Neighborhood_Wealth', 'Elaborate_Costume', 'Candy_Observe']].describe())

Step 3: Candy collection summary
       Neighborhood_Wealth  Elaborate_Costume  Candy_Observe
count          1000.000000        1000.000000    1000.000000
mean              0.387000           0.444000      95.894278
std               0.487307           0.497103      20.606588
min               0.000000           0.000000      50.786495
25%               0.000000           0.000000      79.718882
50%               0.000000           0.000000      90.711038
75%               1.000000           1.000000     114.538413
max               1.000000           1.000000     151.377485


In [4]:
# ============================================================================
# STEP 4: CALCULATE TRUE TREATMENT EFFECTS
# ============================================================================
# Since we simulated the data, we know the TRUE potential outcomes!
# In reality, we never observe both Y(1) and Y(0) for the same person.

# ATE: Average effect if we randomly assigned costumes to everyone
ATE = df['Candy_Elaborate'].mean() - df['Candy_Simple'].mean()

# ATT: Average effect for kids who actually wore elaborate costumes
ATT = (df[df['Elaborate_Costume'] == 1]['Candy_Elaborate'].mean() - 
       df[df['Elaborate_Costume'] == 1]['Candy_Simple'].mean())

# ATU: Average effect for kids who wore simple costumes
ATU = (df[df['Elaborate_Costume'] == 0]['Candy_Elaborate'].mean() - 
       df[df['Elaborate_Costume'] == 0]['Candy_Simple'].mean())

print("="*60)
print("TRUE TREATMENT EFFECTS (we know these because we simulated!)")
print("="*60)
print(f"ATE (Average Treatment Effect):          {ATE:.2f} pieces")
print(f"ATT (Effect on Treated):                 {ATT:.2f} pieces")
print(f"ATU (Effect on Untreated):               {ATU:.2f} pieces")
print()
print("Notice: ATT < ATE < ATU")
print("Why? Kids who chose elaborate costumes are more strategic: they would collect")
print("+3 pieces even in a simple costume, so the costume itself adds only 5 - 3 = 2.")
print()

TRUE TREATMENT EFFECTS (we know these because we simulated!)
ATE (Average Treatment Effect):          3.67 pieces
ATT (Effect on Treated):                 2.00 pieces
ATU (Effect on Untreated):               5.00 pieces

Notice: ATT < ATE < ATU
Why? Kids who chose elaborate costumes are more strategic: they would collect
+3 pieces even in a simple costume, so the costume itself adds only 5 - 3 = 2.
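
As a consistency check on the printed values, the ATE is the average of ATT and ATU weighted by the treated and untreated shares. Taking the treated share 0.444 from the Step 3 summary (the mean of `Elaborate_Costume`):

$$
\begin{aligned}
\text{ATE} &= \mathbb{P}(S=1) \cdot \text{ATT} + \mathbb{P}(S=0) \cdot \text{ATU} \\
&= 0.444 \times 2.00 + 0.556 \times 5.00 \\
&= 0.888 + 2.780 = 3.668 \approx 3.67
\end{aligned}
$$

Here ATT is $5 - 3 = 2$ because treated kids would have collected the $+3$ bonus even in a simple costume, while ATU is the full $5$.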



In [5]:
# ============================================================================
# STEP 5: THE NAIVE APPROACH (IGNORING CONFOUNDING)
# ============================================================================
# What if we just compare kids with elaborate vs simple costumes?
# This is what we'd do if we ignored confounding!

naive_effect = (df[df['Elaborate_Costume'] == 1]['Candy_Observe'].mean() - 
                df[df['Elaborate_Costume'] == 0]['Candy_Observe'].mean())

print("="*60)
print("NAIVE COMPARISON (without controlling for neighborhood)")
print("="*60)
print(f"Average candy (elaborate costume):  {df[df['Elaborate_Costume']==1]['Candy_Observe'].mean():.2f} pieces")
print(f"Average candy (simple costume):     {df[df['Elaborate_Costume']==0]['Candy_Observe'].mean():.2f} pieces")
print(f"Naive difference:                   {naive_effect:.2f} pieces")
print()
print(f"TRUE ATE:                           {ATE:.2f} pieces")
print(f"BIAS:                               {naive_effect - ATE:.2f} pieces")
print()
print("The naive approach MASSIVELY overstates the costume effect!")
print("Why? It conflates the costume effect with neighborhood wealth.")
print()

NAIVE COMPARISON (without controlling for neighborhood)
Average candy (elaborate costume):  106.15 pieces
Average candy (simple costume):     87.70 pieces
Naive difference:                   18.45 pieces

TRUE ATE:                           3.67 pieces
BIAS:                               14.79 pieces

The naive approach MASSIVELY overstates the costume effect!
Why? It conflates the costume effect with neighborhood wealth.



In [6]:
# ============================================================================
# STEP 6: MATCHING ESTIMATOR
# ============================================================================
# Key idea: Compare kids WITHIN the same neighborhood (exact matching on C)
# This creates "apples-to-apples" comparisons

print("="*60)
print("MATCHING ESTIMATOR: Comparing within neighborhoods")
print("="*60)

# Calculate treatment effect in RICH neighborhoods
ate_rich = (df[(df['Elaborate_Costume'] == 1) & (df['Neighborhood_Wealth'] == 1)]['Candy_Observe'].mean() - 
            df[(df['Elaborate_Costume'] == 0) & (df['Neighborhood_Wealth'] == 1)]['Candy_Observe'].mean())

# Calculate treatment effect in POOR neighborhoods  
ate_poor = (df[(df['Elaborate_Costume'] == 1) & (df['Neighborhood_Wealth'] == 0)]['Candy_Observe'].mean() - 
            df[(df['Elaborate_Costume'] == 0) & (df['Neighborhood_Wealth'] == 0)]['Candy_Observe'].mean())

print(f"Effect in RICH neighborhoods:  {ate_rich:.2f} pieces")
print(f"Effect in POOR neighborhoods:  {ate_poor:.2f} pieces")
print()

# Calculate population proportions P(C = c)
proportions = df['Neighborhood_Wealth'].value_counts() / len(df)
print(f"Proportion in poor neighborhoods: {proportions[0]:.3f}")
print(f"Proportion in rich neighborhoods: {proportions[1]:.3f}")
print()

# Matching estimator: weighted average by population proportions
# ATE_matching = Σ [E[Y|S=1,C=c] - E[Y|S=0,C=c]] * P(C=c)
ate_matching = ate_rich * proportions[1] + ate_poor * proportions[0]

print(f"Matching ATE estimate:  {ate_matching:.2f} pieces")
print(f"TRUE ATE:               {ATE:.2f} pieces")
print(f"Estimation error:       {abs(ate_matching - ATE):.2f} pieces")
print()
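print("Matching removes the confounding from neighborhood wealth.")
print("The remaining gap to the true ATE reflects sampling noise and the +3")
print("'strategic' bonus, which is unobserved and therefore not part of C.")
print()

MATCHING ESTIMATOR: Comparing within neighborhoods
Effect in RICH neighborhoods:  6.11 pieces
Effect in POOR neighborhoods:  5.62 pieces

Proportion in poor neighborhoods: 0.613
Proportion in rich neighborhoods: 0.387

Matching ATE estimate:  5.81 pieces
TRUE ATE:               3.67 pieces
Estimation error:       2.14 pieces

Matching removes the confounding from neighborhood wealth.
The remaining gap to the true ATE reflects sampling noise and the +3
'strategic' bonus, which is unobserved and therefore not part of C.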



In [7]:
# ============================================================================
# STEP 7: UNDERSTANDING REGRESSION WEIGHTS
# ============================================================================
# Regression uses DIFFERENT weights than matching!
# Let's see the variance of treatment in each neighborhood

var_poor = df[df['Neighborhood_Wealth'] == 0]['Elaborate_Costume'].var()
var_rich = df[df['Neighborhood_Wealth'] == 1]['Elaborate_Costume'].var()

print("="*60)
print("VARIANCE OF TREATMENT (matters for regression weights)")
print("="*60)
print(f"Var(Elaborate_Costume | Poor neighborhood):  {var_poor:.4f}")
print(f"Var(Elaborate_Costume | Rich neighborhood):  {var_rich:.4f}")
print()
print("Regression will weight neighborhoods based on:")
print("  1. Sample size in each group")
print("  2. Variance of treatment in each group")
print("This is NOT the same as the population proportion P(C=c)!")
print()

VARIANCE OF TREATMENT (matters for regression weights)
Var(Elaborate_Costume | Poor neighborhood):  0.2091
Var(Elaborate_Costume | Rich neighborhood):  0.2192

Regression will weight neighborhoods based on:
  1. Sample size in each group
  2. Variance of treatment in each group
This is NOT the same as the population proportion P(C=c)!



In [8]:
# ============================================================================
# STEP 8: REGRESSION ESTIMATOR
# ============================================================================
# Run OLS regression controlling for neighborhood wealth

print("="*60)
print("REGRESSION ESTIMATOR")
print("="*60)

ols = sm.ols(formula='Candy_Observe ~ Elaborate_Costume + Neighborhood_Wealth', data=df).fit()
print(ols.summary())

REGRESSION ESTIMATOR
                            OLS Regression Results                            
Dep. Variable:          Candy_Observe   R-squared:                       0.775
Model:                            OLS   Adj. R-squared:                  0.775
Method:                 Least Squares   F-statistic:                     1717.
Date:                Fri, 31 Oct 2025   Prob (F-statistic):          9.88e-324
Time:                        11:24:41   Log-Likelihood:                -3698.2
No. Observations:                1000   AIC:                             7402.
Df Residuals:                     997   BIC:                             7417.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept    

In [9]:
# ============================================================================
# STEP 9: COMPARING MATCHING VS REGRESSION
# ============================================================================

print("="*60)
print("FINAL COMPARISON: Matching vs Regression")
print("="*60)
print(f"TRUE ATE:                    {ATE:.2f} pieces")
print(f"Matching estimate:           {ate_matching:.2f} pieces")
print(f"Regression estimate:         {ols.params['Elaborate_Costume']:.2f} pieces")
print()
print(f"Matching error:              {abs(ate_matching - ATE):.2f} pieces")
print(f"Regression error:            {abs(ols.params['Elaborate_Costume'] - ATE):.2f} pieces")
print()
print("Key insight: Both methods control for confounding, but they")
print("weight subgroups differently:")
print(f"  - Matching weights by P(C=c): {proportions[0]:.3f} and {proportions[1]:.3f}")
print("  - Regression weights by variance-covariance structure")
print()
print("Both are valid! Which to use depends on your target estimand.")

FINAL COMPARISON: Matching vs Regression
TRUE ATE:                    3.67 pieces
Matching estimate:           5.81 pieces
Regression estimate:         5.81 pieces

Matching error:              2.14 pieces
Regression error:            2.14 pieces

Key insight: Both methods control for confounding, but they
weight subgroups differently:
  - Matching weights by P(C=c): 0.613 and 0.387
  - Regression weights by variance-covariance structure

Both are valid! Which to use depends on your target estimand.


---

### When to Use Matching vs Regression?

**The key difference:** Both methods control for confounding, but they weight subgroups differently.

- **Matching:** Weights by population proportions $\mathbb{P}(C=c)$
- **Regression:** Weights by variance-covariance structure (where you have more statistical information)

---

### When to Use Matching

**Use matching when you want the population average effect or care about representativeness:**

- **Population policy questions:** "What's the effect for a randomly selected person?" Matching weights subgroups by their actual prevalence in the population
- **Few discrete confounders:** With 2-3 categorical variables (like neighborhood type), exact matching is feasible and intuitive
- **Distrust linearity:** When effects might vary drastically across subgroups and you don't want to assume a linear relationship
- **Subgroup analysis:** Want to see effects separately for each subgroup before combining

**Halloween example:** If 60% of kids are from poor neighborhoods and 40% from rich neighborhoods, matching gives you an effect that represents this actual distribution. This answers: "What happens to a typical trick-or-treater?"

---

### When to Use Regression

**Use regression when you have many confounders or need efficiency:**

- **Many continuous confounders:** With 5+ variables (age, height, income, distance, time), exact matching becomes impossible due to curse of dimensionality
- **Multiple controls:** Easy to include many control variables without combinatorial explosion
- **Interaction effects:** Natural framework for testing if effects vary across subgroups (e.g., costume effect differs by age)

**Halloween example:** If you need to control for neighborhood wealth + age + start time + group size + walking speed, regression handles all these simultaneously without needing to find exact matches.

---

### In Our Halloween Case?

With only one binary confounder (neighborhood wealth), both methods work well!

- **For "typical trick-or-treater" effect:** Use matching (weights 60% poor, 40% rich)
- **To examine where costume matters most:** Use matching to compare effects within each neighborhood separately
- **If we had 10+ confounders:** Switch to regression or propensity score matching

**Bottom line:** Choose based on research question and data structure, not because one is "better"!

---

## Why Does Regression Weight by Treatment Variance?

---



Let's see the mathematical reason why regression gives different weights than matching. We'll work through the linear algebra to understand what's happening under the hood.

---

### The Setup: Two Subgroups

Suppose we have two subgroups (like our rich and poor neighborhoods). We're estimating the regression:

$$
Y = X \beta + \epsilon
$$

where $X$ is our treatment variable (elaborate costume) and $Y$ is the outcome (candy collected).

We can partition our data by subgroup:

$$
X=\left[\begin{array}{l}
X_1 \\
X_2
\end{array}\right] \quad \text{and} \quad Y=\left[\begin{array}{l}
Y_1 \\
Y_2
\end{array}\right]
$$

---

### The OLS Estimator for Pooled Data

The standard OLS formula gives us:

$$
\hat{\beta}_{\text{pooled}} = (X^T X)^{-1} X^T Y
$$

Breaking this into subgroups:

$$
X^T X = X_1^T X_1 + X_2^T X_2 \quad \text{and} \quad X^T Y = X_1^T Y_1 + X_2^T Y_2
$$

Therefore:

$$
\hat{\beta}_{\text{pooled}} = \left(X_1^T X_1 + X_2^T X_2\right)^{-1}\left(X_1^T Y_1 + X_2^T Y_2\right)
$$

---

### Rewriting as a Weighted Average

Now, let's define the **within-subgroup estimates**:

$$
\hat{\beta}_1 = \left(X_1^T X_1\right)^{-1} X_1^T Y_1 \quad \text{and} \quad \hat{\beta}_2 = \left(X_2^T X_2\right)^{-1} X_2^T Y_2
$$

These are what we'd get if we ran regression separately in each subgroup.

We can rewrite the pooled estimator by substituting back:

$$
\hat{\beta}_{\text{pooled}} = \left(X_1^T X_1 + X_2^T X_2\right)^{-1}\left(X_1^T X_1 \hat{\beta}_1 + X_2^T X_2 \hat{\beta}_2\right)
$$

This shows that the pooled estimate is a **weighted average** of subgroup estimates:

$$
\hat{\beta}_{\text{pooled}} = W_1 \hat{\beta}_1 + W_2 \hat{\beta}_2
$$

where the weights are:

$$
W_1 = \left(X_1^T X_1 + X_2^T X_2\right)^{-1} X_1^T X_1
$$

$$
W_2 = \left(X_1^T X_1 + X_2^T X_2\right)^{-1} X_2^T X_2
$$

---

### The Key Insight: $X^T X$ Relates to Treatment Variance

**What is $X^T X$?** For our treatment variable (after demeaning), $X^T X \propto n \cdot \text{Var}(X)$ where $n$ is sample size.

Therefore:
- $X_1^T X_1 \propto n_1 \cdot \text{Var}(X_1)$ (treatment variance in subgroup 1)
- $X_2^T X_2 \propto n_2 \cdot \text{Var}(X_2)$ (treatment variance in subgroup 2)

**The implication:** Subgroups with a larger $n_k \cdot \text{Var}(X_k)$, that is, more observations and **larger treatment variance**, get **larger weight** in the regression estimate!

**Why this makes sense statistically:** Subgroups with more variation in treatment provide more information for estimating the treatment effect, so regression naturally gives them more weight.
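
The weighted-average identity can be checked numerically. The sketch below uses two made-up subgroups with different sizes, treatment rates, and effects, and demeans $X$ and $Y$ within each subgroup before applying the formulas above. All numbers are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical subgroups: different sizes, treatment rates, and effects.
x1 = rng.binomial(1, 0.5, 300).astype(float)
x2 = rng.binomial(1, 0.9, 700).astype(float)
y1 = 2.0 * x1 + rng.normal(0, 1, 300)
y2 = 6.0 * x2 + rng.normal(0, 1, 700)

def demean(v):
    """Subtract the mean so that v @ v equals n * Var(v)."""
    return v - v.mean()

# Demean within each subgroup.
x1d, y1d = demean(x1), demean(y1)
x2d, y2d = demean(x2), demean(y2)

# Within-subgroup OLS slopes: (X_k^T X_k)^{-1} X_k^T Y_k.
b1 = (x1d @ y1d) / (x1d @ x1d)
b2 = (x2d @ y2d) / (x2d @ x2d)

# Pooled OLS slope on the stacked, demeaned data.
xd = np.concatenate([x1d, x2d])
yd = np.concatenate([y1d, y2d])
b_pooled = (xd @ yd) / (xd @ xd)

# Weights W_k = (X_1^T X_1 + X_2^T X_2)^{-1} X_k^T X_k.
w1 = (x1d @ x1d) / (x1d @ x1d + x2d @ x2d)
w2 = 1 - w1

print(f"b1 = {b1:.3f}, b2 = {b2:.3f}, pooled = {b_pooled:.3f}")
print(f"W1 = {w1:.3f}, W2 = {w2:.3f}, W1*b1 + W2*b2 = {w1 * b1 + w2 * b2:.3f}")
print(f"Sample share of subgroup 1: {300 / 1000:.3f}")
```

Subgroup 1 holds only 30% of the observations, yet its weight `W1` is larger than that because its treatment is split roughly evenly, while subgroup 2 has little treatment variation.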

---


## Next Recitation 

+ Explore more about Matching Instruments 
+ Instrumental variables


---