[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zjelveh/zjelveh.github.io/blob/master/files/cfc/10_causal_inference.ipynb)

# Anatomy of a Comparison: From Description to Causation

**Two ways to ask the same question:**

- **Version A:** "Are there more noise complaints on days with Yankees games?"
- **Version B:** "Do Yankees games *cause* more noise complaints?"

Same data. Same comparison. Different questions.

Version A is **descriptive** - it asks what happened.  
Version B is **causal** - it asks why it happened.

This notebook shows how to move from Version A toward Version B by using multiple comparison strategies.

## 1. Load the Data

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load datasets from GitHub
base_url = 'https://raw.githubusercontent.com/zjelveh/zjelveh.github.io/refs/heads/master/files/cfc/'

# 311 complaints (Bronx and Brooklyn, 2023) - includes noise AND heating
complaints = pd.read_csv(base_url + 'nyc_311_sample_2023.csv')

# Yankees home games with results
yankees = pd.read_csv(base_url + 'yankees_home_games_2023.csv')

print(f"Total 311 complaints: {len(complaints):,}")
print(f"Yankees home games: {len(yankees)}")
print(f"\nComplaint types:")
print(complaints['complaint_type'].value_counts())


In [None]:
complaints

In [None]:
# Prepare dates
# Use format='mixed' to handle different date formats in the combined data (requires pandas 2.0+)
complaints['created_date'] = pd.to_datetime(complaints['created_date'], format='mixed')
complaints


In [None]:

yankees['game_date'] = pd.to_datetime(yankees['game_date'])

# Create date strings for matching
complaints['year'] = complaints['created_date'].dt.year
complaints['month'] = complaints['created_date'].dt.month
complaints['day'] = complaints['created_date'].dt.day
complaints['date_str'] = complaints['year'].astype(str) + '-' + complaints['month'].astype(str) + '-' + complaints['day'].astype(str)

yankees['year'] = yankees['game_date'].dt.year
yankees['month'] = yankees['game_date'].dt.month
yankees['day'] = yankees['game_date'].dt.day
yankees['date_str'] = yankees['year'].astype(str) + '-' + yankees['month'].astype(str) + '-' + yankees['day'].astype(str)

# Create game day indicator
game_dates = yankees['date_str'].unique()
complaints['is_game_day'] = complaints['date_str'].isin(game_dates)

print("Data prepared.")
print(f"Unique game dates: {len(game_dates)}")

In [None]:
# Create separate dataframes for noise and heating
# Noise = complaints that contain "Noise" in the complaint_type
noise = complaints[complaints['complaint_type'].str.contains('Noise')].copy()

# Heating = HEAT/HOT WATER complaints
heating = complaints[complaints['complaint_type'] == 'HEAT/HOT WATER'].copy()

print(f"Noise complaints: {len(noise):,}")
print(f"Heating complaints: {len(heating):,}")

In [None]:
# Count unique days for game days vs non-game days
# (only dates with at least one noise complaint are counted here)
# We'll use these counts throughout the notebook
all_dates = noise.groupby(['date_str', 'is_game_day']).size().reset_index()
days_count = all_dates.groupby('is_game_day')['date_str'].size().reset_index()
days_count.columns = ['is_game_day', 'num_days']

game_days = days_count[days_count['is_game_day'] == True]['num_days'].values[0]
non_game_days = days_count[days_count['is_game_day'] == False]['num_days'].values[0]

print(f"Game days in dataset: {game_days}")
print(f"Non-game days in dataset: {non_game_days}")
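
The day counts above come from dates that appear in the noise data, so a day with no noise complaints at all would not be counted. A minimal sketch of counting from the calendar instead, assuming the sample covers every day of 2023. The names `example_game_dates` and `all_days` are hypothetical, and the game dates are made up for illustration:

```python
import pandas as pd

# Hypothetical game dates, for illustration only (not the real 2023 schedule)
example_game_dates = pd.to_datetime(['2023-04-01', '2023-04-02', '2023-07-04'])

# Assumption: the sample covers every calendar day of 2023
all_days = pd.DataFrame({'date': pd.date_range('2023-01-01', '2023-12-31', freq='D')})
all_days['is_game_day'] = all_days['date'].isin(example_game_dates)

# Count calendar days in each group, including days with zero complaints
n_game_days = int(all_days['is_game_day'].sum())
n_non_game_days = int((~all_days['is_game_day']).sum())

print(f"Game days: {n_game_days}, non-game days: {n_non_game_days}")
```

If every day in the real data has at least one noise complaint, both approaches give the same counts.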

## 2. The Basic Comparison

**Question:** Are there more noise complaints on game days?

This is a Version A question - purely descriptive.

We'll look at ALL noise complaints in our sample (Bronx and Brooklyn combined). The sample does not cover the rest of NYC.

In [None]:
# Count total noise complaints by game day status (both boroughs in the sample)
totals = noise.groupby('is_game_day').size().reset_index()
totals.columns = ['is_game_day', 'total_complaints']

# Get values
game_total = totals[totals['is_game_day'] == True]['total_complaints'].values[0]
non_game_total = totals[totals['is_game_day'] == False]['total_complaints'].values[0]

# Calculate averages
avg_game = game_total / game_days
avg_non_game = non_game_total / non_game_days
pct_increase = ((avg_game - avg_non_game) / avg_non_game) * 100

print("THE BASIC FINDING (Bronx + Brooklyn combined):")
print(f"Average noise complaints on game days: {avg_game:.1f}")
print(f"Average noise complaints on non-game days: {avg_non_game:.1f}")
print(f"Percent increase: {pct_increase:.1f}%")

In [None]:
# Visualize
plot_data = pd.DataFrame({
    'Day Type': ['Non-Game Days', 'Game Days'],
    'Average Complaints': [avg_non_game, avg_game]
})

sns.barplot(data=plot_data, x='Day Type', y='Average Complaints')
plt.title('Noise Complaints: Game Days vs Non-Game Days (Bronx + Brooklyn)')
plt.ylabel('Average Daily Complaints')

### This answers Version A.

Yes, there are about 33% more noise complaints on game days.

### But does this answer Version B?

Do Yankees games *cause* more noise complaints?

**What else could explain this pattern?**
- **Day of week:** Games cluster on weekends. Weekends are already noisier.
- **Season:** Baseball = summer = windows open, outdoor parties.
- **Citywide trend:** Maybe NYC is just getting noisier over time, and game days happen to coincide with noisier periods.
- **Location:** Maybe areas near Yankee Stadium (like the Bronx) are just noisier places in general — more people, more bars, more activity.
- **Reporting behavior:** Maybe people are simply more likely to file complaints on game days, even if the actual noise level is unchanged.

To move toward Version B, we need to address these alternatives.
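
Some of these alternatives can be checked directly. As a sketch of the day-of-week concern, the snippet below compares the share of weekend days among game days and non-game days. The `days` table is hypothetical (made-up dates and flags), not the notebook's data:

```python
import pandas as pd

# Hypothetical per-day table, for illustration only (made-up dates and flags)
days = pd.DataFrame({
    'date': pd.to_datetime(['2023-06-02', '2023-06-03', '2023-06-04',
                            '2023-06-05', '2023-06-06', '2023-06-07']),
    'is_game_day': [True, True, True, False, False, True],
})

# In pandas' dayofweek numbering, Monday = 0, so Saturday = 5 and Sunday = 6
days['is_weekend'] = days['date'].dt.dayofweek >= 5

# Share of weekend days within each group
weekend_share = days.groupby('is_game_day')['is_weekend'].mean()
print(weekend_share)
```

A large gap between the two shares would mean that day of week is tangled up with game-day status in the basic comparison.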

## 3. Strategy 1: Compare to Yourself

**Logic:** Same place, different times.

Compare the Bronx on game days to the Bronx on non-game days.

**What this rules out:** "Maybe the Bronx is just a noisy place."

We're comparing the Bronx to *itself*, so location factors cancel out.

**What this doesn't rule out:** Things that change over time (day of week, season, citywide trends).

In [None]:
# Filter to Bronx only
bronx_noise = noise[noise['borough'] == 'BRONX']

# Count by game day status
bronx_totals = bronx_noise.groupby('is_game_day').size().reset_index()
bronx_totals.columns = ['is_game_day', 'total_complaints']

# Calculate averages
bronx_game_total = bronx_totals[bronx_totals['is_game_day'] == True]['total_complaints'].values[0]
bronx_non_game_total = bronx_totals[bronx_totals['is_game_day'] == False]['total_complaints'].values[0]

bronx_avg_game = bronx_game_total / game_days
bronx_avg_non_game = bronx_non_game_total / non_game_days
bronx_pct_increase = ((bronx_avg_game - bronx_avg_non_game) / bronx_avg_non_game) * 100

print("BRONX: Compare to Yourself")
print(f"Average on game days: {bronx_avg_game:.1f}")
print(f"Average on non-game days: {bronx_avg_non_game:.1f}")
print(f"Percent increase: {bronx_pct_increase:.1f}%")

### What does this tell us?

The Bronx shows a ~39% increase on game days.

**This rules out:** "The Bronx is just a noisy place in general."

**This doesn't rule out:** Day of week, season, or citywide trends — maybe NYC is just noisier on those particular days for reasons unrelated to the game.

## 4. Strategy 2: Compare to Someone Else

**Logic:** Same time, different places.

On game days, compare the Bronx to Brooklyn. Does the Bronx have a higher *share* of complaints than usual?

**What this rules out:** "Maybe it's just weekends" or "Maybe it's just summer."

If game days are noisier everywhere for time-based reasons, the Bronx's share of complaints should be about the same on game days as on non-game days. If the Bronx has a higher share on game days, something specific to the Bronx is happening.

**What this doesn't rule out:** Differences between places (population, housing, who complains more).
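
One compact way to compute shares is `pd.crosstab` with `normalize='index'`. This is a sketch on made-up rows; the `toy` table and its `day_type` column are hypothetical, not the notebook's data:

```python
import pandas as pd

# Hypothetical complaint rows, for illustration only (made-up counts)
toy = pd.DataFrame({
    'borough': ['BRONX'] * 4 + ['BROOKLYN'] * 6 + ['BRONX'] * 6 + ['BROOKLYN'] * 4,
    'day_type': ['non_game'] * 10 + ['game'] * 10,
})

# normalize='index' converts each row of counts into shares that sum to 1
shares = pd.crosstab(toy['day_type'], toy['borough'], normalize='index') * 100
print(shares.round(1))

# Change in the Bronx share, in percentage points
bronx_shift = shares.loc['game', 'BRONX'] - shares.loc['non_game', 'BRONX']
print(f"Change in Bronx share: {bronx_shift:+.1f} percentage points")
```

Because shares are computed within each day type, a change that lifts both boroughs proportionally leaves them unchanged.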

In [None]:
# Strategy 2: Compare shares on game days vs non-game days

# On game days: what percent of complaints come from Bronx vs Brooklyn?
game_day_noise = noise[noise['is_game_day'] == True]
game_day_bronx = len(game_day_noise[game_day_noise['borough'] == 'BRONX'])
game_day_brooklyn = len(game_day_noise[game_day_noise['borough'] == 'BROOKLYN'])
game_day_total = game_day_bronx + game_day_brooklyn

# On non-game days: what percent of complaints come from Bronx vs Brooklyn?
non_game_day_noise = noise[noise['is_game_day'] == False]
non_game_day_bronx = len(non_game_day_noise[non_game_day_noise['borough'] == 'BRONX'])
non_game_day_brooklyn = len(non_game_day_noise[non_game_day_noise['borough'] == 'BROOKLYN'])
non_game_day_total = non_game_day_bronx + non_game_day_brooklyn

# First, let's look at raw counts (this will be misleading!)
print("RAW COUNTS (misleading!):")
print(f"  Bronx on game days: {game_day_bronx:,} total complaints")
print(f"  Brooklyn on game days: {game_day_brooklyn:,} total complaints")
print(f"\nBrooklyn has MORE complaints... but Brooklyn is bigger!\n")

# Better approach: what SHARE of complaints comes from each borough?
bronx_share_game = (game_day_bronx / game_day_total) * 100
brooklyn_share_game = (game_day_brooklyn / game_day_total) * 100

bronx_share_non_game = (non_game_day_bronx / non_game_day_total) * 100
brooklyn_share_non_game = (non_game_day_brooklyn / non_game_day_total) * 100

print("SHARE OF COMPLAINTS (better!):")
print("="*50)
print(f"\nOn NON-GAME days:")
print(f"  Bronx share: {bronx_share_non_game:.1f}%")
print(f"  Brooklyn share: {brooklyn_share_non_game:.1f}%")

print(f"\nOn GAME days:")
print(f"  Bronx share: {bronx_share_game:.1f}%")
print(f"  Brooklyn share: {brooklyn_share_game:.1f}%")

print(f"\nChange in Bronx share: {bronx_share_game - bronx_share_non_game:+.1f} percentage points")

In [None]:
# Visualize the shares
share_data = pd.DataFrame({
    'Day Type': ['Non-Game Days', 'Non-Game Days', 'Game Days', 'Game Days'],
    'Borough': ['Bronx', 'Brooklyn', 'Bronx', 'Brooklyn'],
    'Share of Complaints (%)': [bronx_share_non_game, brooklyn_share_non_game, 
                                 bronx_share_game, brooklyn_share_game]
})

sns.barplot(data=share_data, x='Borough', y='Share of Complaints (%)', hue='Day Type')
plt.title('Share of Complaints by Borough: Game Days vs Non-Game Days')
plt.ylabel('Share of Total Complaints (%)')

### What does this tell us?

On non-game days, the Bronx accounts for about 46% of noise complaints (Brooklyn has 54%).

On game days, the Bronx share *increases* to about 48% — a 2 percentage point shift.

**This argues against pure time-based explanations.** If game days were noisier only because of weekends or summer, and those factors affected both boroughs proportionally, the shares would stay the same.

The fact that the Bronx's share goes UP on game days suggests something specific to the Bronx is happening.

**But we still can't rule out:** Maybe the Bronx just has more game-day activity unrelated to the stadium.

## 5. Strategy 3: Difference-in-Differences

**Logic:** Combine both strategies.

Compare the *change* in the Bronx to the *change* in Brooklyn.

**The insight:** 
- If there's a citywide trend, both boroughs change equally → difference = 0
- If the stadium has an extra effect, Bronx changes MORE → difference > 0

This is called **difference-in-differences** because we're taking the difference of differences.
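
The arithmetic can be sketched in a few lines. The function name `diff_in_diff` and all numbers below are made up for illustration:

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Return the treated group's change minus the control group's change."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change

# Made-up daily averages, for illustration only
# Case 1: a shared trend lifts both boroughs by 10 complaints per day
shared_trend = diff_in_diff(100, 110, 200, 210)

# Case 2: the treated borough rises by 25, the control borough by 10
extra_effect = diff_in_diff(100, 125, 200, 210)

print(shared_trend)   # 0
print(extra_effect)   # 15
```

Note that the two boroughs start at different levels (100 vs 200), and that difference drops out because only the changes are compared.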

In [None]:
# For diff-in-diff, we need to calculate Brooklyn averages (not just shares)
brooklyn_noise = noise[noise['borough'] == 'BROOKLYN']

brooklyn_totals = brooklyn_noise.groupby('is_game_day').size().reset_index()
brooklyn_totals.columns = ['is_game_day', 'total_complaints']

brooklyn_game_total = brooklyn_totals[brooklyn_totals['is_game_day'] == True]['total_complaints'].values[0]
brooklyn_non_game_total = brooklyn_totals[brooklyn_totals['is_game_day'] == False]['total_complaints'].values[0]

brooklyn_avg_game = brooklyn_game_total / game_days
brooklyn_avg_non_game = brooklyn_non_game_total / non_game_days
brooklyn_pct_increase = ((brooklyn_avg_game - brooklyn_avg_non_game) / brooklyn_avg_non_game) * 100

# Create the diff-in-diff table
print("DIFFERENCE-IN-DIFFERENCES TABLE")
print("="*60)
print(f"{'Borough':<12} {'Non-Game':<12} {'Game Day':<12} {'Change':<15}")
print("-"*60)
print(f"{'Bronx':<12} {bronx_avg_non_game:<12.1f} {bronx_avg_game:<12.1f} {bronx_avg_game - bronx_avg_non_game:+.1f} ({bronx_pct_increase:+.1f}%)")
print(f"{'Brooklyn':<12} {brooklyn_avg_non_game:<12.1f} {brooklyn_avg_game:<12.1f} {brooklyn_avg_game - brooklyn_avg_non_game:+.1f} ({brooklyn_pct_increase:+.1f}%)")
print("-"*60)

# Calculate diff-in-diff
bronx_change = bronx_avg_game - bronx_avg_non_game
brooklyn_change = brooklyn_avg_game - brooklyn_avg_non_game
diff_in_diff = bronx_change - brooklyn_change
diff_in_diff_pct = bronx_pct_increase - brooklyn_pct_increase

print(f"\nDiff-in-Diff (raw): {diff_in_diff:+.1f} extra complaints per day in the Bronx")
print(f"Diff-in-Diff (percent): {diff_in_diff_pct:+.1f} extra percentage points in the Bronx")

In [None]:
# Visualize the diff-in-diff with a line plot
did_data = pd.DataFrame({
    'Borough': ['Bronx', 'Bronx', 'Brooklyn', 'Brooklyn'],
    'Day Type': ['Non-Game', 'Game Day', 'Non-Game', 'Game Day'],
    'Avg Complaints': [bronx_avg_non_game, bronx_avg_game, brooklyn_avg_non_game, brooklyn_avg_game]
})

# Line plot shows the change more clearly
plt.figure(figsize=(8, 5))
sns.lineplot(data=did_data, x='Day Type', y='Avg Complaints', hue='Borough', marker='o', markersize=10, linewidth=2)
plt.title('Difference-in-Differences: Bronx vs Brooklyn')
plt.ylabel('Average Daily Complaints')
plt.xlabel('')

# Add annotations for the increases
# Use the '+' format flag so a negative change is not printed as "+-"
plt.annotate(f'{bronx_pct_increase:+.0f}%', xy=(1, bronx_avg_game),
             xytext=(1.1, bronx_avg_game), fontsize=11, color='#1f77b4')
plt.annotate(f'{brooklyn_pct_increase:+.0f}%', xy=(1, brooklyn_avg_game),
             xytext=(1.1, brooklyn_avg_game), fontsize=11, color='#ff7f0e')

### What Diff-in-Diff Rules Out

1. **"Maybe it's just weekends/summer"** - If game days are noisier everywhere, both boroughs increase equally and it cancels out.

2. **"Maybe the Bronx is just noisier"** - We're comparing *changes*, not levels. If Bronx is always noisier, that doesn't matter.

3. **"Maybe NYC is getting noisier over time"** - Brooklyn captures any general trend. We subtract it out.

**The ~11 percentage point difference** is our best estimate of the localized stadium effect - the extra noise specifically from being near Yankee Stadium.

## 6. The Placebo Test

We've addressed time and place explanations. But what if our OUTCOME is the problem?

**The idea:** Test an outcome that SHOULDN'T be affected by Yankees games.

If Yankees games cause NOISE complaints, they shouldn't cause HEATING complaints. There's no reason a baseball game would affect whether someone's boiler breaks.

**The logic:**
- If heating complaints ALSO spike on game days → Something else is going on (maybe people just call 311 more on game days?)
- If heating complaints show NO game-day effect → Strengthens our noise finding
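
Since the placebo repeats the earlier calculation on a different outcome, the calculation can be wrapped in a helper. This is a sketch: `game_day_effect` and the toy tables are hypothetical names and data, not part of the notebook:

```python
import pandas as pd

def game_day_effect(df, n_game_days, n_non_game_days):
    """Percent change in average daily complaints, game days vs non-game days."""
    avg_game = df['is_game_day'].sum() / n_game_days
    avg_non_game = (~df['is_game_day']).sum() / n_non_game_days
    return (avg_game - avg_non_game) / avg_non_game * 100

# Hypothetical rows, for illustration only: one row per complaint
toy_noise = pd.DataFrame({'is_game_day': [True] * 30 + [False] * 40})
toy_placebo = pd.DataFrame({'is_game_day': [True] * 10 + [False] * 20})

# Assume 10 game days and 20 non-game days in this toy calendar
noise_effect = game_day_effect(toy_noise, 10, 20)
placebo_effect = game_day_effect(toy_placebo, 10, 20)

print(f"Noise:   {noise_effect:+.1f}%")
print(f"Placebo: {placebo_effect:+.1f}%")
```

Running the identical function on both outcomes keeps the placebo comparison like-for-like.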

In [None]:
# We already have heating complaints loaded!
print(f"Heating complaints: {len(heating):,}")
print(f"\nBy borough:")
print(heating['borough'].value_counts())

In [None]:
# Run the same analysis on heating complaints - Overall
heating_totals = heating.groupby('is_game_day').size().reset_index()
heating_totals.columns = ['is_game_day', 'total_complaints']

heating_game_total = heating_totals[heating_totals['is_game_day'] == True]['total_complaints'].values[0]
heating_non_game_total = heating_totals[heating_totals['is_game_day'] == False]['total_complaints'].values[0]

heating_avg_game = heating_game_total / game_days
heating_avg_non_game = heating_non_game_total / non_game_days
heating_pct_change = ((heating_avg_game - heating_avg_non_game) / heating_avg_non_game) * 100

print("PLACEBO TEST: Heating Complaints")
print("="*50)
print(f"Average on game days: {heating_avg_game:.1f}")
print(f"Average on non-game days: {heating_avg_non_game:.1f}")
print(f"Percent change: {heating_pct_change:+.1f}%")

In [None]:
# Placebo by borough (diff-in-diff for heating)

# Bronx heating
bronx_heating = heating[heating['borough'] == 'BRONX']
bronx_heating_totals = bronx_heating.groupby('is_game_day').size().reset_index()
bronx_heating_totals.columns = ['is_game_day', 'total']

bronx_heating_game = bronx_heating_totals[bronx_heating_totals['is_game_day'] == True]['total'].values[0] / game_days
bronx_heating_non_game = bronx_heating_totals[bronx_heating_totals['is_game_day'] == False]['total'].values[0] / non_game_days
bronx_heating_pct = ((bronx_heating_game - bronx_heating_non_game) / bronx_heating_non_game) * 100

# Brooklyn heating
brooklyn_heating = heating[heating['borough'] == 'BROOKLYN']
brooklyn_heating_totals = brooklyn_heating.groupby('is_game_day').size().reset_index()
brooklyn_heating_totals.columns = ['is_game_day', 'total']

brooklyn_heating_game = brooklyn_heating_totals[brooklyn_heating_totals['is_game_day'] == True]['total'].values[0] / game_days
brooklyn_heating_non_game = brooklyn_heating_totals[brooklyn_heating_totals['is_game_day'] == False]['total'].values[0] / non_game_days
brooklyn_heating_pct = ((brooklyn_heating_game - brooklyn_heating_non_game) / brooklyn_heating_non_game) * 100

# Diff-in-diff for heating
heating_diff_in_diff = bronx_heating_pct - brooklyn_heating_pct

print("\nPLACEBO BY BOROUGH:")
print(f"Bronx heating change on game days: {bronx_heating_pct:+.1f}%")
print(f"Brooklyn heating change on game days: {brooklyn_heating_pct:+.1f}%")
print(f"\nDiff-in-diff (heating): {heating_diff_in_diff:+.1f} percentage points")

In [None]:
# Compare noise vs heating results
print("COMPARISON: Noise vs Heating (Placebo)")
print("="*60)
print(f"{'Metric':<35} {'Noise':<12} {'Heating':<12}")
print("-"*60)
print(f"{'Overall game day effect':<35} {pct_increase:+.1f}%{'':<6} {heating_pct_change:+.1f}%")
print(f"{'Bronx game day effect':<35} {bronx_pct_increase:+.1f}%{'':<6} {bronx_heating_pct:+.1f}%")
print(f"{'Brooklyn game day effect':<35} {brooklyn_pct_increase:+.1f}%{'':<6} {brooklyn_heating_pct:+.1f}%")
print(f"{'Diff-in-diff (Bronx - Brooklyn)':<35} {diff_in_diff_pct:+.1f} pp{'':<4} {heating_diff_in_diff:+.1f} pp")

In [None]:
# Visualize the comparison
comparison_data = pd.DataFrame({
    'Complaint Type': ['Noise', 'Noise', 'Heating', 'Heating'],
    'Borough': ['Bronx', 'Brooklyn', 'Bronx', 'Brooklyn'],
    'Game Day Effect (%)': [bronx_pct_increase, brooklyn_pct_increase, bronx_heating_pct, brooklyn_heating_pct]
})

sns.barplot(data=comparison_data, x='Complaint Type', y='Game Day Effect (%)', hue='Borough')
plt.title('Game Day Effect: Noise vs Heating (Placebo)')
plt.ylabel('Percent Change on Game Days')
plt.axhline(y=0, color='black', linestyle='--', alpha=0.3)

### Interpreting the Placebo Test

**What we found:**
- **Noise complaints:** Large positive effect on game days (+33%), especially in Bronx (+39%)
- **Heating complaints:** Large NEGATIVE effect on game days (-75%)

**Why is heating negative?**
Baseball season = April to October = warm weather = no one needs heating!

**The key insight:**
- The diff-in-diff for noise is +11 percentage points (Bronx has a localized effect)
- The diff-in-diff for heating is -1 percentage point (essentially zero - no Bronx-specific effect)

**This is good news for our noise finding.** If heating also spiked in the Bronx on game days, we'd be worried. But it doesn't.

The noise pattern is SPECIFIC to noise, not a general "people complain more on game days" effect.

## 7. Summary: From Version A to Version B

| Step | What We Did | What It Rules Out |
|------|-------------|-------------------|
| Basic comparison | Game vs non-game days (Bronx + Brooklyn) | Nothing yet; it just shows a pattern |
| Compare to yourself | Bronx over time | "Maybe the Bronx is just noisy" |
| Compare to someone else | Bronx vs Brooklyn | "Maybe it's just weekends/summer" |
| Diff-in-diff | Compare the changes | Fixed place differences AND shared time effects |
| Placebo test | Heating complaints | "Maybe people just complain more on game days" |

**Each step rules out different alternative explanations.**

Together, they build a much stronger case that Yankees games cause more noise complaints — not proof, but much closer to Version B than a single comparison alone.

## 8. For Your Final Project

Think about your own analysis:

1. **What is your main comparison?**
   - What two groups are you comparing?

2. **Is your question Version A or Version B?**
   - Are you describing a pattern or claiming causation?

3. **What are 2-3 alternative explanations?**
   - What else could explain the pattern you find?

4. **Can you compare to yourself AND someone else?**
   - Is there a "treatment" group and "control" group?
   - Can you compare before/after within the same group?

5. **What would be your placebo outcome?**
   - What outcome SHOULDN'T be affected by your treatment?
   - If you find an effect there too, what would that mean?