# Robustness Checks: Realistic Dataset

**Author:** Tomasz Solis  
**Date:** December 2025  
**Goal:** Test whether the ITS results on the realistic dataset hold up under stress tests

## Context

Notebook 04 found mixed results:
- Downtown: +53 riders (significant)
- Suburban: +23 riders (not significant)
- Cross-town: +7 riders (not significant)

**Baseline robustness (notebook 03):**
- Placebo: PASSED (16% of real effects)
- Window: EXCELLENT (<5% variation)
- Specs: EXCELLENT (<6% variation)

**Expected for realistic:**
- MORE sensitivity (high noise + small effects)
- Some placebo false positives possible
- More variation across specifications
- Still directionally robust

**What I'm testing:**
The same four checks as the baseline notebook, with messier results expected. The aim is to show how to interpret robustness checks on noisy, confounded data.
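
For reference, every function below fits the same segmented regression. The equation is a sketch read off the design matrices in the code. Only `beta_2` is named in the notebook, so the symbols $\beta_0$, $\beta_1$, $\beta_3$ and the shorthand $\text{since}_t$ (for `time_since_intervention`) are labels assumed here:

$$
y_t = \beta_0 + \beta_1\,\text{time}_t + \beta_2\,\text{post}_t + \beta_3\,\text{since}_t + \varepsilon_t
$$

Here $y_t$ is `avg_ridership`, $\text{time}_t$ is weeks since the start of the estimation window, $\text{post}_t$ is 1 on or after the (real or fake) intervention date and 0 before, and $\text{since}_t$ is weeks since the intervention (0 before it). The level change $\beta_2$ is the estimate compared across all four checks, with Newey-West (HAC) standard errors.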

In [1]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import statsmodels.api as sm
from statsmodels.regression.linear_model import OLS
from datetime import datetime
from pathlib import Path

fig_output_dir = Path("../outputs/figures")
pio.templates.default = "plotly_white"

ROUTE_COLORS = {
    'Downtown': '#1f77b4',
    'Suburban': '#ff7f0e',
    'Cross-town': '#2ca02c'
}

REAL_INTERVENTION = datetime(2024, 1, 1)
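# Level-change estimates from notebook 04 (rounded to one decimal), used as the reference in every check below
baseline_results = {'Downtown': 53.0, 'Suburban': 23.4, 'Cross-town': 7.2}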

print("✓ Setup complete")

✓ Setup complete


In [2]:
df = pd.read_csv('../data/hard_mode/transit_ridership_realistic.csv')
df['date'] = pd.to_datetime(df['date'])
df_pre = df[df['date'] < REAL_INTERVENTION].copy()

print(f"Loaded {len(df):,} observations")
print(f"Pre-intervention: {len(df_pre):,} observations")

Loaded 783 observations
Pre-intervention: 624 observations


---
## 1. Placebo Tests

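Fit the same ITS model at four fake intervention dates before Jan 2024, using pre-intervention data only. With high noise, some false positives are possible. Note that the Jul 2023 placebo date coincides with the competitor confounder, so a significant estimate there may reflect a real shock rather than noise.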

In [3]:
def run_placebo_test(data, route_type, fake_intervention_date, maxlags=4):
    """Run ITS with fake intervention date."""
    route_data = data[data['route_type'] == route_type].copy()
    route_data = route_data.sort_values('date').reset_index(drop=True)
    
    route_data['fake_post'] = (route_data['date'] >= fake_intervention_date).astype(float)
    # Weeks elapsed since the fake intervention (0 for all rows before it)
    route_data['fake_time_since'] = (
        (route_data['date'] - fake_intervention_date).dt.days / 7
    ).clip(lower=0)
    route_data['fake_time'] = (route_data['date'] - route_data['date'].min()).dt.days / 7
    
    y = route_data['avg_ridership'].values
    X = pd.DataFrame({
        'time': route_data['fake_time'].values.astype(float),
        'post_intervention': route_data['fake_post'].values.astype(float),
        'time_since_intervention': route_data['fake_time_since'].values.astype(float)
    })
    X = sm.add_constant(X, has_constant='add')
    
    model = OLS(y, X.values)
    results = model.fit(cov_type='HAC', cov_kwds={'maxlags': maxlags})
    
    params = dict(zip(X.columns, results.params))
    se = dict(zip(X.columns, results.bse))
    pvalues = dict(zip(X.columns, results.pvalues))
    
    beta_2 = params['post_intervention']
    se_2 = se['post_intervention']
    
    return {
        'route': route_type,
        'fake_date': fake_intervention_date,
        'beta_2': beta_2,
        'se': se_2,
        'p_value': pvalues['post_intervention'],
        'significant': pvalues['post_intervention'] < 0.05
    }

# Run placebo tests
fake_dates = [
    datetime(2022, 1, 1),
    datetime(2022, 7, 1),
    datetime(2023, 1, 1),
    datetime(2023, 7, 1),
]

print("Running placebo tests...")
placebo_results = []
for fake_date in fake_dates:
    for route in ['Downtown', 'Suburban', 'Cross-town']:
        result = run_placebo_test(df_pre, route, fake_date, maxlags=4)
        placebo_results.append(result)

placebo_df = pd.DataFrame(placebo_results)

# Summary
n_significant = placebo_df['significant'].sum()
mean_placebo = placebo_df['beta_2'].mean()
max_placebo = placebo_df['beta_2'].abs().max()
min_real_effect = min(baseline_results.values())
ratio = max_placebo / min_real_effect

print(f"\n{'='*60}")
print("PLACEBO TEST SUMMARY")
print(f"{'='*60}")
print(f"Tests run: {len(placebo_df)}")
print(f"Significant (p<0.05): {n_significant} ({n_significant/len(placebo_df)*100:.0f}%)")
print(f"Estimate range: [{placebo_df['beta_2'].min():+.1f}, {placebo_df['beta_2'].max():+.1f}]")
print(f"Mean: {mean_placebo:+.1f} riders")
print(f"\nLargest placebo: {max_placebo:.1f} riders")
print(f"Smallest real effect: {min_real_effect:.1f} riders")
print(f"Ratio: {ratio:.1%}")

# More lenient criteria for realistic data
expected_false_pos = len(placebo_df) * 0.05
print(f"\nExpected false positives: {expected_false_pos:.1f}")
print(f"Observed significant: {n_significant}")

if ratio < 0.5 and n_significant <= expected_false_pos + 2:
    print(f"\n✓ ACCEPTABLE: Placebo estimates reasonable given noise")
    print(f"  Ratio {ratio:.0%} shows placebos smaller than real effects")
elif ratio < 1.0:
    print(f"\n⚠️  BORDERLINE: Some placebo sensitivity")
    print(f"  Ratio {ratio:.0%} - placebos approaching real effect magnitudes")
else:
    print(f"\n❌ CONCERNING: High placebo sensitivity")

print(f"{'='*60}")

Running placebo tests...

PLACEBO TEST SUMMARY
Tests run: 12
Significant (p<0.05): 3 (25%)
Estimate range: [-68.8, +14.4]
Mean: -18.2 riders

Largest placebo: 68.8 riders
Smallest real effect: 7.2 riders
Ratio: 955.0%

Expected false positives: 0.6
Observed significant: 3

❌ CONCERNING: High placebo sensitivity


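**Takeaway:** Some placebo false positives are expected with high noise, but these results go well beyond that. 3 of 12 placebo tests are significant (vs about 0.6 expected), and the largest placebo estimate (68.8 riders in absolute value) is far larger than the smallest real effect (7.2 riders). The ratio compares the largest placebo across all routes with the smallest real effect, so it is a worst-case measure, but the check still fails.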
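
To put "3 of 12 significant" in context, here is a minimal sketch of how likely that count would be by chance alone. It assumes the 12 placebo tests are independent and each has a 5% false-positive rate. That is an approximation, because the tests share routes and overlapping data. The helper `prob_at_least` is a hypothetical name, not part of the notebook.

```python
from math import comb


def prob_at_least(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_tests = 12     # placebo tests run above
alpha = 0.05     # significance threshold used above
n_observed = 3   # significant placebo tests observed above

# Assumes independent tests, which is only approximately true here
p_tail = prob_at_least(n_observed, n_tests, alpha)
print(f"P(at least {n_observed} of {n_tests} significant by chance) = {p_tail:.3f}")
```

Under these assumptions the probability is roughly 2%, so three significant placebos are unlikely to be ordinary false positives.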

---
## 2. Window Sensitivity

Refit the model using 1 to 4 years of pre-intervention data. Confounders (especially the competitor) may cause more variation than in the baseline.

In [4]:
def fit_its_window(data, route_type, start_date, maxlags=4):
    """Fit ITS using specific time window."""
    route_data = data[
        (data['route_type'] == route_type) & 
        (data['date'] >= start_date)
    ].copy()
    
    route_data = route_data.sort_values('date').reset_index(drop=True)
    route_data['window_time'] = (route_data['date'] - route_data['date'].min()).dt.days / 7
    
    y = route_data['avg_ridership'].values
    X = pd.DataFrame({
        'time': route_data['window_time'].values.astype(float),
        'post_intervention': route_data['post_intervention'].values.astype(float),
        'time_since_intervention': route_data['time_since_intervention'].values.astype(float)
    })
    X = sm.add_constant(X, has_constant='add')
    
    model = OLS(y, X.values)
    results = model.fit(cov_type='HAC', cov_kwds={'maxlags': maxlags})
    
    params = dict(zip(X.columns, results.params))
    beta_2 = params['post_intervention']
    
    return {'route': route_type, 'level_change': beta_2}

windows = [
    datetime(2020, 1, 1),  # 4 years
    datetime(2021, 1, 1),  # 3 years
    datetime(2022, 1, 1),  # 2 years
    datetime(2023, 1, 1),  # 1 year
]

print("Testing window sensitivity...")
window_results = []
for start_date in windows:
    for route in ['Downtown', 'Suburban', 'Cross-town']:
        result = fit_its_window(df, route, start_date)
        window_results.append(result)

window_df = pd.DataFrame(window_results)

# Summary
max_variation = 0
print(f"\n{'='*60}")
print("WINDOW SENSITIVITY")
print(f"{'='*60}")

for route in ['Downtown', 'Suburban', 'Cross-town']:
    route_windows = window_df[window_df['route'] == route]
    original = baseline_results[route]
    estimates = route_windows['level_change'].values
    variation = ((estimates.max() - estimates.min()) / abs(original)) * 100
    max_variation = max(max_variation, variation)
    
    print(f"\n{route}:")
    print(f"  Baseline (4yr): {original:+.1f}")
    print(f"  Range: [{estimates.min():+.1f}, {estimates.max():+.1f}]")
    print(f"  Variation: {variation:.1f}%")

print(f"\nMax variation: {max_variation:.1f}%")

if max_variation < 25:
    print("✓ ACCEPTABLE: Reasonably stable across windows")
elif max_variation < 50:
    print("⚠️  MODERATE: Some window sensitivity (expected with confounders)")
else:
    print("❌ HIGH: Very sensitive to window choice")

print(f"{'='*60}")

Testing window sensitivity...

WINDOW SENSITIVITY

Downtown:
  Baseline (4yr): +53.0
  Range: [+53.0, +67.8]
  Variation: 27.9%

Suburban:
  Baseline (4yr): +23.4
  Range: [+23.4, +36.2]
  Variation: 54.7%

Cross-town:
  Baseline (4yr): +7.2
  Range: [+7.2, +13.2]
  Variation: 83.4%

Max variation: 83.4%
❌ HIGH: Very sensitive to window choice


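**Takeaway:** Variation is far higher than in the baseline (28-83% vs <5%) and exceeds the 50% threshold for two of three routes. The 4-year estimate is the smallest in every route's range, so every shorter window pushes the estimate upward, consistent with confounders affecting different time windows differently.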
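
The variation metric divides the spread of estimates by the reference estimate, so routes with small effects score worse even when their absolute spread is smaller. A minimal sketch using the rounded ranges from the output above (with rounded inputs, the last digit can differ slightly from the notebook's percentages):

```python
# Ranges copied from the window-sensitivity output above (rounded to one decimal)
window_ranges = {
    "Downtown": (53.0, 67.8),
    "Suburban": (23.4, 36.2),
    "Cross-town": (7.2, 13.2),
}
reference = {"Downtown": 53.0, "Suburban": 23.4, "Cross-town": 7.2}

spread = {}
for route, (low, high) in window_ranges.items():
    absolute = round(high - low, 1)                     # spread in riders
    relative = absolute / abs(reference[route]) * 100   # spread as % of the reference estimate
    spread[route] = (absolute, relative)
    print(f"{route:<11} spread: {absolute:4.1f} riders = {relative:5.1f}% of {reference[route]:+.1f}")
```

Cross-town has the smallest absolute spread (6.0 riders) but the largest percentage, because its reference estimate is only 7.2 riders. Reading the absolute and relative spread together gives a fairer picture than the percentage alone.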

---
## 3. Leave-One-Out

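Same as baseline. Each route is fit with its own model, so excluding another route cannot change its estimate, and the deviation should be approximately 0%. Any small non-zero value most likely comes from comparing against the rounded numbers in `baseline_results`.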

In [5]:
def fit_its_leave_one_out(data, excluded_route, maxlags=4):
    """Fit ITS excluding one route."""
    results = {}
    
    for route in ['Downtown', 'Suburban', 'Cross-town']:
        if route == excluded_route:
            continue
            
        route_data = data[data['route_type'] == route].copy()
        route_data = route_data.sort_values('date').reset_index(drop=True)
        route_data['model_time'] = (route_data['date'] - route_data['date'].min()).dt.days / 7
        
        y = route_data['avg_ridership'].values
        X = pd.DataFrame({
            'time': route_data['model_time'].values.astype(float),
            'post_intervention': route_data['post_intervention'].values.astype(float),
            'time_since_intervention': route_data['time_since_intervention'].values.astype(float)
        })
        X = sm.add_constant(X, has_constant='add')
        
        model = OLS(y, X.values)
        fitted = model.fit(cov_type='HAC', cov_kwds={'maxlags': maxlags})
        params = dict(zip(X.columns, fitted.params))
        
        results[route] = {
            'excluded': excluded_route,
            'route': route,
            'level_change': params['post_intervention']
        }
    
    return results

print("Running leave-one-out...")
loo_results = []
for excluded in ['Downtown', 'Suburban', 'Cross-town']:
    results = fit_its_leave_one_out(df, excluded)
    for res in results.values():
        loo_results.append(res)

loo_df = pd.DataFrame(loo_results)

# Summary
print(f"\n{'='*60}")
print("LEAVE-ONE-OUT")
print(f"{'='*60}")

max_deviation = 0
for route in ['Downtown', 'Suburban', 'Cross-town']:
    original = baseline_results[route]
    route_loo = loo_df[loo_df['route'] == route]
    
    deviations = [abs((row['level_change'] - original) / original) * 100 
                 for _, row in route_loo.iterrows()]
    max_dev = max(deviations) if deviations else 0
    max_deviation = max(max_deviation, max_dev)
    
    print(f"\n{route}: {max_dev:.1f}% max deviation")

print(f"\nOverall: {max_deviation:.1f}% max deviation")
print("\nNote: 0% expected with segment-specific models (same as baseline)")
print("Confirms routes are independent segments.")
print(f"{'='*60}")

Running leave-one-out...

LEAVE-ONE-OUT

Downtown: 0.0% max deviation

Suburban: 0.1% max deviation

Cross-town: 0.5% max deviation

Overall: 0.5% max deviation

Note: 0% expected with segment-specific models (same as baseline)
Confirms routes are independent segments.


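**Takeaway:** Deviation is effectively zero (max 0.5%), as in the baseline. This holds by construction with segment-specific models, so it confirms that the pipeline reproduces the notebook 04 estimates rather than providing independent evidence of robustness.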
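
The small non-zero deviations can be checked against the error introduced by rounding the reference values. This is a sketch under one assumption: the notebook 04 estimates came from the same model and were rounded to one decimal before being hard-coded in `baseline_results`.

```python
reference = {"Downtown": 53.0, "Suburban": 23.4, "Cross-town": 7.2}    # rounded to 0.1
observed_dev = {"Downtown": 0.0, "Suburban": 0.1, "Cross-town": 0.5}   # % from the output above

half_step = 0.05    # largest absolute error from rounding to one decimal place
print_slack = 0.05  # the printed deviations are themselves rounded to 0.1%

rounding_bound = {}
consistent = {}
for route, value in reference.items():
    # Largest % deviation that rounding of the reference value alone could produce
    bound = half_step / abs(value) * 100
    rounding_bound[route] = bound
    consistent[route] = observed_dev[route] <= bound + print_slack
    print(f"{route:<11} observed {observed_dev[route]:.1f}%  rounding bound {bound:.2f}%  "
          f"consistent: {consistent[route]}")
```

All three observed deviations fall within the rounding bound, which supports reading them as rounding artifacts rather than real differences between fits.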

---
## 4. Alternative Specifications

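Test lag variations and boundary exclusions. Expect more sensitivity than baseline. Note that the Newey-West lag choice changes only the standard errors, not the OLS point estimates, so any variation in `beta_2` below comes from the boundary exclusions.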

In [6]:
def test_newey_west_lags(data, route_type, lags):
    """Test different Newey-West lag specifications."""
    route_data = data[data['route_type'] == route_type].copy()
    route_data = route_data.sort_values('date').reset_index(drop=True)
    route_data['model_time'] = (route_data['date'] - route_data['date'].min()).dt.days / 7
    
    y = route_data['avg_ridership'].values
    X = pd.DataFrame({
        'time': route_data['model_time'].values.astype(float),
        'post_intervention': route_data['post_intervention'].values.astype(float),
        'time_since_intervention': route_data['time_since_intervention'].values.astype(float)
    })
    X = sm.add_constant(X, has_constant='add')
    
    model = OLS(y, X.values)
    results = model.fit(cov_type='HAC', cov_kwds={'maxlags': lags})
    params = dict(zip(X.columns, results.params))
    
    return {'route': route_type, 'spec': f'NW lag {lags}', 'beta_2': params['post_intervention']}

def test_boundary_exclusion(data, route_type, exclude_first=0, exclude_last=0):
    """Test excluding boundary periods."""
    route_data = data[data['route_type'] == route_type].copy()
    route_data = route_data.sort_values('date').reset_index(drop=True)
    
    route_data['weeks_post'] = route_data.apply(
        lambda row: (row['date'] - REAL_INTERVENTION).days / 7 if row['post_intervention'] == 1 else -999,
        axis=1
    )
    
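    if exclude_first > 0 or exclude_last > 0:
        # Keep all pre-period rows, plus post-period rows inside the retained window.
        # Caveats: the strict `>` also drops any row with weeks_post == exclude_first,
        # and 52 hard-codes the assumed length of the post-period in weeks.
        mask = (
            (route_data['post_intervention'] == 0) |
            ((route_data['weeks_post'] > exclude_first) & (route_data['weeks_post'] <= (52 - exclude_last)))
        )
        route_data = route_data[mask].copy()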
    
    route_data['model_time'] = (route_data['date'] - route_data['date'].min()).dt.days / 7
    y = route_data['avg_ridership'].values
    X = pd.DataFrame({
        'time': route_data['model_time'].values.astype(float),
        'post_intervention': route_data['post_intervention'].values.astype(float),
        'time_since_intervention': route_data['time_since_intervention'].values.astype(float)
    })
    X = sm.add_constant(X, has_constant='add')
    
    model = OLS(y, X.values)
    results = model.fit(cov_type='HAC', cov_kwds={'maxlags': 4})
    params = dict(zip(X.columns, results.params))
    
    return {
        'route': route_type,
        'spec': f'Excl F{exclude_first}L{exclude_last}',
        'beta_2': params['post_intervention']
    }

print("Testing alternative specifications...")

alt_results = []

# Lag tests
for lags in [2, 4, 6, 8]:
    for route in ['Downtown', 'Suburban', 'Cross-town']:
        alt_results.append(test_newey_west_lags(df, route, lags))

# Boundary tests
for exclude_first, exclude_last in [(4, 0), (0, 4), (4, 4)]:
    for route in ['Downtown', 'Suburban', 'Cross-town']:
        alt_results.append(test_boundary_exclusion(df, route, exclude_first, exclude_last))

alt_df = pd.DataFrame(alt_results)

# Summary
print(f"\n{'='*60}")
print("ALTERNATIVE SPECIFICATIONS")
print(f"{'='*60}")

worst_dev = 0
for route in ['Downtown', 'Suburban', 'Cross-town']:
    route_alts = alt_df[alt_df['route'] == route]
    baseline = baseline_results[route]
    estimates = route_alts['beta_2'].values
    
    deviations = [abs((est - baseline) / baseline) * 100 for est in estimates]
    max_dev = max(deviations)
    worst_dev = max(worst_dev, max_dev)
    
    print(f"\n{route}:")
    print(f"  Baseline: {baseline:+.1f}")
    print(f"  Range: [{estimates.min():+.1f}, {estimates.max():+.1f}]")
    print(f"  Max deviation: {max_dev:.1f}%")

print(f"\nWorst deviation: {worst_dev:.1f}%")

if worst_dev < 25:
    print("✓ GOOD: Stable across specs")
elif worst_dev < 50:
    print("⚠️  MODERATE: Some spec sensitivity (expected with noise)")
else:
    print("❌ HIGH: Very sensitive to specification")

print(f"{'='*60}")

Testing alternative specifications...

ALTERNATIVE SPECIFICATIONS

Downtown:
  Baseline: +53.0
  Range: [+20.3, +53.0]
  Max deviation: 61.7%

Suburban:
  Baseline: +23.4
  Range: [+23.4, +46.3]
  Max deviation: 97.7%

Cross-town:
  Baseline: +7.2
  Range: [+2.8, +13.6]
  Max deviation: 89.5%

Worst deviation: 97.7%
❌ HIGH: Very sensitive to specification


**Takeaway:** Deviations of 62-98% far exceed the baseline's <6%. With small effects and high noise, modeling choices dominate the estimates.

---
## Summary

**Robustness checks completed:**
1. Placebo tests
2. Window sensitivity
3. Leave-one-out
4. Alternative specifications

**Comparison to baseline:**

| Test | Baseline (Clean) | Realistic (Messy) |
|------|------------------|-------------------|
| Placebo | 16% ratio, 0/12 sig | 955% ratio, 3/12 sig ⚠️ |
| Window | <5% variation | 28-83% variation ⚠️ |
| Leave-one-out | 0% deviation | ≤0.5% deviation ✓ |
| Specifications | <6% deviation | 62-98% deviation ⚠️ |

**Pattern observed:**
- Much MORE sensitivity than baseline (as expected)
- But sensitivity is SEVERE, not moderate
- Leave-one-out still ≈0% (expected by construction with segment-specific models)
- Results do NOT survive variations well

**What this teaches:**

**Baseline robustness:** "Results are extremely stable - strong evidence"

**Realistic robustness:** "Results are fragile - weak evidence"

The key difference:
- **Baseline:** Robust to everything (placebo, window, specs)
- **Realistic:** Fragile to everything (except segmentation)

**Why results are fragile:**

1. **Placebo tests show red flags**
   - The largest placebo estimate at a fake date exceeds the smallest real effect (68.8 vs 7.2 riders)
   - 25% of placebo tests are significant (vs 5% expected)
   - The model is picking up noise, not just signal

2. **Window sensitivity is high**
   - Cross-town: 83% variation (completely unreliable)
   - Suburban: 55% variation (very sensitive)
   - Downtown: 28% variation (least bad, but still high)

3. **Specification sensitivity is very high**
   - Suburban: 98% deviation (modeling choices dominate)
   - Cross-town: 90% deviation
   - Downtown: 62% deviation
   - Which model is "right"? Can't tell.

**Combined evidence (notebooks 04-06):**

**Downtown:**
- Point estimate: +53 riders
- Significant in main spec
- BUT: 62% spec sensitivity, 28% window sensitivity
- Assessment: Directional positive signal, but imprecise

**Suburban:**
- Point estimate: +23 riders
- NOT significant
- 98% spec sensitivity (worst case scenario)
- Assessment: Cannot reliably distinguish from noise

**Cross-town:**
- Point estimate: +7 riders
- NOT significant
- 83% window, 90% spec sensitivity
- Assessment: Effect too small to detect given noise

**Competitor confounder:**
Remains a fundamental limitation: it biases the counterfactual downward, potentially inflating all estimates.

**Final assessment:**

These results show **fragile evidence**, not just uncertainty:

**What happened:**
- High noise (3x baseline) overwhelmed small effects
- The competitor confounder created an identification problem
- Placebo tests reveal that we are partly measuring noise
- Sensitivity tests show that modeling choices dominate the results

**What we can say:**
- Downtown: Directional evidence of positive effect, but can't quantify precisely
- Suburban/Cross-town: Effects not reliably detectable above noise

**What we CANNOT say:**
- Cannot claim "results useful for decisions" - too fragile
- Cannot quantify effects with confidence
- Cannot rule out that estimates are mostly noise + competitor bias

**In practice, I would report:**

"Analysis inconclusive. ITS methodology applied correctly, but data quality insufficient for reliable conclusions. Competitor confounder (Jul 2023) and high noise prevent clean measurement. Downtown shows directional positive signal, but precision too low for ROI calculations. Suburban/Cross-town effects not detectable. 

Recommendations: 
(1) Wait 6-12 months for more post-data to improve precision
(2) Consider complementary analysis (surveys, A/B tests on future features)
(3) Acknowledge we cannot measure this intervention's impact reliably with current data"

**Key learning:**

This is what happens when ITS assumptions are badly violated:
- You get point estimates, but can't trust them
- Robustness checks reveal fragility
- Honest answer: "We can't measure this cleanly"

Being transparent about weak evidence is more valuable than overselling it. Sometimes the data just can't answer the question reliably, and acknowledging that is good analytics practice.