# ITS Model: Realistic Dataset

**Author:** Tomasz Solis  
**Date:** December 2025  
**Goal:** Apply interrupted time series (ITS) to messy data with confounders

## Context

EDA (notebook 04) identified:
- Small effects (+50/+30/+15 vs baseline +300/+200/+150)
- High noise (3x baseline variance)
- Confounders: Competitor (Jul 2023), gas spike (2022), winter (2023)
- Non-parallel pre-trends (segment-specific models needed)

**What to expect:**
- Wide confidence intervals
- Mixed statistical significance
- Estimates may be biased by competitor confounder

This is practice for real product analytics - small signals, noise, ambiguity.

In [None]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import statsmodels.api as sm
from statsmodels.regression.linear_model import OLS
from datetime import datetime
from pathlib import Path

fig_output_dir = Path("../outputs/figures")
fig_output_dir.mkdir(parents=True, exist_ok=True)  # ensure the figure save target exists
pio.templates.default = "plotly_white"

ROUTE_COLORS = {
    'Downtown': '#1f77b4',
    'Suburban': '#ff7f0e',
    'Cross-town': '#2ca02c'
}

INTERVENTION_DATE = datetime(2024, 1, 1)

print("✓ Setup complete")

✓ Setup complete


---
## 1. Load Data

In [2]:
df = pd.read_csv('../data/hard_mode/transit_ridership_realistic.csv')
df['date'] = pd.to_datetime(df['date'])

print(f"Loaded {len(df):,} observations")
print(f"Period: {df['date'].min().date()} to {df['date'].max().date()}")

Loaded 783 observations
Period: 2020-01-06 to 2024-12-30


---
## 2. Fit ITS Models

Using the same specification as the baseline:
```
ridership = β₀ + β₁(time) + β₂(post) + β₃(time_since) + ε
```

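Each route gets its own model because pre-trends are not parallel across routes. Here `time` is weeks since the first observation, `post` is an indicator for the post-intervention period, and `time_since` counts time elapsed since the intervention (0 before it), so β₂ is the immediate level change and β₃ the change in slope. Standard errors are Newey-West (HAC) to allow for autocorrelation in the residuals.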
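
The fitting function below reads `post_intervention` and `time_since_intervention` directly from the loaded CSV, so their construction is not shown in this notebook. As a minimal sketch of how such regressors can be derived from a date column, assuming weekly rows and a time-since counter that starts at 0 on the intervention date (the dataset may use a different convention); `build_its_regressors` and the toy dates are hypothetical:

```python
import pandas as pd

def build_its_regressors(dates, intervention_date):
    """Illustrative helper: build the three ITS regressors from a date column."""
    dates = pd.to_datetime(pd.Series(dates)).reset_index(drop=True)
    time = (dates - dates.min()).dt.days / 7               # weeks since first observation
    post = (dates >= intervention_date).astype(float)       # 1.0 on/after the intervention
    weeks_after = (dates - intervention_date).dt.days / 7   # negative before the intervention
    time_since = weeks_after.clip(lower=0.0)                 # 0 before, weeks elapsed after
    return pd.DataFrame({
        "time": time,
        "post_intervention": post,
        "time_since_intervention": time_since,
    })

# Toy example: eight weekly Mondays straddling 2024-01-01
toy_dates = pd.date_range("2023-12-04", periods=8, freq="7D")
X_toy = build_its_regressors(toy_dates, pd.Timestamp("2024-01-01"))
print(X_toy)
```

With this convention the level change β₂ is read off at the intervention week itself, because `time_since_intervention` is still 0 there.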

In [3]:
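def fit_its_model(data, route_type, maxlags=4):
    """Fit an ITS segmented regression for one route.

    Uses OLS with Newey-West (HAC) standard errors and returns the level-change
    estimate (beta_2) with its SE, p-value, and normal-approximation 95% CI.
    """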
    route_data = data[data['route_type'] == route_type].copy()
    route_data = route_data.sort_values('date').reset_index(drop=True)
    route_data['model_time'] = (route_data['date'] - route_data['date'].min()).dt.days / 7
    
    y = route_data['avg_ridership'].values
    X = pd.DataFrame({
        'time': route_data['model_time'].values.astype(float),
        'post_intervention': route_data['post_intervention'].values.astype(float),
        'time_since_intervention': route_data['time_since_intervention'].values.astype(float)
    })
    X = sm.add_constant(X, has_constant='add')
    
    model = OLS(y, X.values)
    results_fit = model.fit(cov_type='HAC', cov_kwds={'maxlags': maxlags})
    
    params = dict(zip(X.columns, results_fit.params))
    se = dict(zip(X.columns, results_fit.bse))
    pvalues = dict(zip(X.columns, results_fit.pvalues))
    
    beta_2 = params['post_intervention']
    se_2 = se['post_intervention']
    
    return {
        'route': route_type,
        'beta_1': params['time'],
        'beta_2': beta_2,
        'se_2': se_2,
        'p_value': pvalues['post_intervention'],
        'ci_lower': beta_2 - 1.96 * se_2,
        'ci_upper': beta_2 + 1.96 * se_2,
        'r_squared': results_fit.rsquared,
        'significant': pvalues['post_intervention'] < 0.05,
        'data': route_data,
        'fitted': results_fit.fittedvalues
    }

print("="*60)
print("FITTING ITS MODELS")
print("="*60)

results = {}
for route in ['Downtown', 'Suburban', 'Cross-town']:
    print(f"Fitting {route}...")
    results[route] = fit_its_model(df, route, maxlags=4)
    
print("\n✓ Models fitted")

FITTING ITS MODELS
Fitting Downtown...
Fitting Suburban...
Fitting Cross-town...

✓ Models fitted
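
The notebook passes `maxlags=4` without comment. One widely used rule of thumb for the Newey-West truncation lag is floor(4·(T/100)^(2/9)). The sketch below evaluates it for the per-route sample size, assuming the 783 loaded rows split evenly across the three routes. It is a plausibility check only, not a statement of how the value was chosen; `newey_west_rule_of_thumb` is a hypothetical helper.

```python
import math

def newey_west_rule_of_thumb(n_obs):
    """Truncation lag from the common floor(4 * (T/100)^(2/9)) heuristic."""
    return math.floor(4 * (n_obs / 100) ** (2 / 9))

n_per_route = 783 // 3  # assumption: rows are split evenly across three routes
suggested_lag = newey_west_rule_of_thumb(n_per_route)
print(f"T = {n_per_route}, suggested maxlags = {suggested_lag}")
```

For 261 weekly observations the heuristic gives 4, which agrees with the value used above.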


---
## 3. Results

In [4]:
print("="*60)
print("ITS RESULTS: REALISTIC DATA")
print("="*60)

true_effects = {'Downtown': 50, 'Suburban': 30, 'Cross-town': 15}

for route in ['Downtown', 'Suburban', 'Cross-town']:
    res = results[route]
    true = true_effects[route]
    error = res['beta_2'] - true
    error_pct = (error / true) * 100
    
    print(f"\n{route}:")
    print(f"  β₁ (pre-trend):    {res['beta_1']:+.2f} riders/week")
    print(f"  β₂ (level change): {res['beta_2']:+.1f} riders")
    print(f"  SE:                {res['se_2']:.1f}")
    print(f"  95% CI:            [{res['ci_lower']:.1f}, {res['ci_upper']:.1f}]")
    print(f"  p-value:           {res['p_value']:.3f} {'*' if res['significant'] else '(n.s.)'}")
    print(f"  R²:                {res['r_squared']:.3f}")
    print(f"  ")
    print(f"  True effect:       {true:+.0f} riders")
    print(f"  Error:             {error:+.1f} riders ({error_pct:+.0f}%)")
    
    in_ci = res['ci_lower'] <= true <= res['ci_upper']
    print(f"  {'✓' if in_ci else '⚠️'}  True effect {'within' if in_ci else 'OUTSIDE'} 95% CI")

print("\n" + "="*60)

ITS RESULTS: REALISTIC DATA

Downtown:
  β₁ (pre-trend):    +2.49 riders/week
  β₂ (level change): +53.0 riders
  SE:                22.5
  95% CI:            [8.9, 97.1]
  p-value:           0.018 *
  R²:                0.893
  
  True effect:       +50 riders
  Error:             +3.0 riders (+6%)
  ✓  True effect within 95% CI

Suburban:
  β₁ (pre-trend):    +1.63 riders/week
  β₂ (level change): +23.4 riders
  SE:                22.2
  95% CI:            [-20.1, 66.8]
  p-value:           0.292 (n.s.)
  R²:                0.868
  
  True effect:       +30 riders
  Error:             -6.6 riders (-22%)
  ✓  True effect within 95% CI

Cross-town:
  β₁ (pre-trend):    +1.09 riders/week
  β₂ (level change): +7.2 riders
  SE:                11.1
  95% CI:            [-14.6, 28.9]
  p-value:           0.519 (n.s.)
  R²:                0.830
  
  True effect:       +15 riders
  Error:             -7.8 riders (-52%)
  ✓  True effect within 95% CI
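
The printed p-values and intervals can be reproduced by hand from β₂ and its SE. The sketch below assumes the p-values are based on the normal distribution, which is consistent with the 1.96 multiplier in `fit_its_model` and with the printed numbers. Inputs are the rounded values from the output above, so agreement is only to about two decimals. `wald_summary` is a hypothetical helper.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def wald_summary(estimate, se, z_crit=1.96):
    """Return z statistic, two-sided p-value, and the CI bounds."""
    z = estimate / se
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p_value, estimate - z_crit * se, estimate + z_crit * se

# (beta_2, SE) as printed above, rounded to one decimal
reported = {
    "Downtown": (53.0, 22.5),
    "Suburban": (23.4, 22.2),
    "Cross-town": (7.2, 11.1),
}
summaries = {route: wald_summary(b, se) for route, (b, se) in reported.items()}
for route, (z, p, lo, hi) in summaries.items():
    print(f"{route}: z = {z:.2f}, p = {p:.3f}, 95% CI = [{lo:.1f}, {hi:.1f}]")
```

For Downtown, z = 53.0 / 22.5 ≈ 2.36 and the interval is 53.0 ± 1.96 × 22.5 = 53.0 ± 44.1, matching the printed [8.9, 97.1].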



---
## 4. Comparison to Baseline

In [5]:
print("="*60)
print("BASELINE VS REALISTIC COMPARISON")
print("="*60)

print("\nBaseline (clean data):")
print("  True effects:     +300, +200, +150 riders")
print("  Errors:           <3% (all routes)")
print("  Significance:     3/3 routes (all p < 0.001)")
print("  CI width:         ±10-15 riders")
print("  R²:               0.98-0.99")

print("\nRealistic (messy data):")
print("  True effects:     +50, +30, +15 riders (6-10x smaller)")

# Calculate from actual results
errors = [abs((results[r]['beta_2'] - true_effects[r]) / true_effects[r] * 100) for r in ['Downtown', 'Suburban', 'Cross-town']]
n_sig = sum([results[r]['significant'] for r in ['Downtown', 'Suburban', 'Cross-town']])
ci_widths = [(results[r]['ci_upper'] - results[r]['ci_lower']) / 2 for r in ['Downtown', 'Suburban', 'Cross-town']]
r2_vals = [results[r]['r_squared'] for r in ['Downtown', 'Suburban', 'Cross-town']]

print(f"  Errors:           {min(errors):.0f}-{max(errors):.0f}%")
print(f"  Significance:     {n_sig}/3 routes")
print(f"  CI width:         ±{min(ci_widths):.0f}-{max(ci_widths):.0f} riders (2-4x wider)")
print(f"  R²:               {min(r2_vals):.3f}-{max(r2_vals):.3f}")

print("\nKey differences:")
print("  - Much wider confidence intervals (reflects high noise)")
print("  - Mixed statistical significance (not all routes detectable)")
print("  - Larger % errors (small effects harder to estimate precisely)")
print("  - All true effects still within CIs (methodology working)")

print("="*60)

BASELINE VS REALISTIC COMPARISON

Baseline (clean data):
  True effects:     +300, +200, +150 riders
  Errors:           <3% (all routes)
  Significance:     3/3 routes (all p < 0.001)
  CI width:         ±10-15 riders
  R²:               0.98-0.99

Realistic (messy data):
  True effects:     +50, +30, +15 riders (6-10x smaller)
  Errors:           6-52%
  Significance:     1/3 routes
  CI width:         ±22-44 riders (2-4x wider)
  R²:               0.830-0.893

Key differences:
  - Much wider confidence intervals (reflects high noise)
  - Mixed statistical significance (not all routes detectable)
  - Larger % errors (small effects harder to estimate precisely)
  - All true effects still within CIs (methodology working)


---
## 5. Visualize Results with Counterfactuals

In [6]:
fig = make_subplots(
    rows=3, cols=1,
    subplot_titles=['Downtown Route', 'Suburban Route', 'Cross-town Route'],
    vertical_spacing=0.12
)

for i, route in enumerate(['Downtown', 'Suburban', 'Cross-town'], 1):
    res = results[route]
    route_data = res['data']
    
    # Actual
    fig.add_trace(
        go.Scatter(
            x=route_data['date'],
            y=route_data['avg_ridership'],
            mode='markers',
            name='Actual',
            marker=dict(size=3, color=ROUTE_COLORS[route], opacity=0.4),
            showlegend=(i==1)
        ),
        row=i, col=1
    )
    
    # Fitted
    fig.add_trace(
        go.Scatter(
            x=route_data['date'],
            y=res['fitted'],
            mode='lines',
            name='Fitted',
            line=dict(color=ROUTE_COLORS[route], width=2),
            showlegend=(i==1)
        ),
        row=i, col=1
    )
    
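    # Counterfactual: fitted line with only the level change (β₂) removed in the post period;
    # any slope change (β₃) stays in this line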
    counterfactual = res['fitted'].copy()
    post_mask = route_data['post_intervention'] == 1
    counterfactual[post_mask] = counterfactual[post_mask] - res['beta_2']
    
    fig.add_trace(
        go.Scatter(
            x=route_data['date'],
            y=counterfactual,
            mode='lines',
            name='Counterfactual',
            line=dict(color='gray', width=2, dash='dash'),
            showlegend=(i==1)
        ),
        row=i, col=1
    )
    
    # Intervention line
    fig.add_vline(
        x=INTERVENTION_DATE.timestamp() * 1000,
        line_dash="dot",
        line_color="red",
        row=i, col=1
    )
    
    # Annotation
    sig_text = '*' if res['significant'] else 'n.s.'
    fig.add_annotation(
        x=datetime(2024, 6, 1),
        y=route_data['avg_ridership'].max(),
        text=f"β₂ = {res['beta_2']:+.1f}<br>95% CI: [{res['ci_lower']:.0f}, {res['ci_upper']:.0f}]<br>{sig_text}",
        showarrow=False,
        bgcolor="white",
        bordercolor=ROUTE_COLORS[route],
        row=i, col=1
    )

fig.update_layout(
    height=1000,
    title_text='ITS Results: Realistic Dataset (with Counterfactuals)',
    showlegend=True
)

fig.update_xaxes(title_text="Date")
fig.update_yaxes(title_text="Average Daily Ridership")

fig.show()
fig.write_image(f"{fig_output_dir}/14_realistic_its_results.png", scale=2)
print("✓ Saved figure")

✓ Saved figure


**Visual observations:**
- High scatter around fitted lines (noise makes individual points hard to interpret)
- Small gaps between fitted and counterfactual (treatment effects)
- Downtown shows clearest gap (matches significant result)
- Suburban/Cross-town gaps ambiguous (matches non-significant results)
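
In the plotting cell, the dashed counterfactual is the fitted line with β₂ subtracted in the post period, so any post-intervention slope change (β₃) remains in it. Under the stated specification, the no-intervention projection is β₀ + β₁·time, which removes both terms. A minimal sketch of the difference on synthetic data, using NumPy least squares rather than the notebook's statsmodels fit; all values and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_weeks, start_post = 120, 90
time = np.arange(n_weeks, dtype=float)
post = (time >= start_post).astype(float)
time_since = np.clip(time - start_post, 0.0, None)

# Synthetic series: linear trend, +40 level change, +0.5/week slope change, Gaussian noise
y = 500 + 2.0 * time + 40 * post + 0.5 * time_since + rng.normal(0, 10, n_weeks)

# Same four-column design as the ITS specification: const, time, post, time_since
X = np.column_stack([np.ones(n_weeks), time, post, time_since])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

# Level-only counterfactual (what the plotting cell draws)
cf_level_only = fitted - beta[2] * post

# Full counterfactual: pre-intervention line extended, no level or slope change
cf_full = beta[0] + beta[1] * time

print(f"Gap between the two counterfactuals at the last week: "
      f"{cf_level_only[-1] - cf_full[-1]:+.1f}")
```

The two versions coincide when β₃ is zero; otherwise the gap between them grows linearly with time since the intervention.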

---
## 6. Interpretation and Limitations

### What the results show:

**Downtown:** Evidence of positive effect
- Estimate: +53 riders (p=0.018)
- 95% CI: [9, 97] riders
- Close to the true effect (+50; error of +6%)
- Wide CI reflects uncertainty but excludes zero

**Suburban:** Directional but not conclusive
- Estimate: +23 riders (p=0.292, not significant)
- 95% CI: [-20, 67] riders
- Underestimates the true effect (+30) by about 22%
- CI includes zero - can't rule out no effect

**Cross-town:** Highly uncertain
- Estimate: +7 riders (p=0.519, not significant)
- 95% CI: [-15, 29] riders
- Underestimates the true effect (+15) by about 52%
- CI includes zero - no statistical evidence of effect

### Key limitations:

**1. Competitor confounder (most serious)**
- Launched Jul 2023, 6 months before express lanes
- Depressed ridership in immediate pre-intervention period
- ITS projects this depressed trend as counterfactual
- **Implication:** Estimates may be upward biased (overestimate express lanes effect)

**2. High noise relative to effects**
- Signal-to-noise ratios: 0.4-0.8
- Small true effects (15-50 riders) vs large noise (σ=36-60)
- Wide CIs reflect genuine uncertainty
- Lower power to detect smaller effects
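
The "lower power" point can be made concrete with a normal-approximation power calculation. The sketch below treats the reported SEs as if they were the true sampling standard deviations of β₂ (an assumption; they are themselves estimates) and uses the known simulated effects. `approx_power` and `minimum_detectable_effect` are hypothetical helpers.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def approx_power(true_effect, se, z_crit=1.96):
    """Probability that a two-sided 5% z-test rejects, given the true effect."""
    shift = true_effect / se
    return normal_cdf(shift - z_crit) + normal_cdf(-shift - z_crit)

def minimum_detectable_effect(se, z_crit=1.96, z_power=0.8416):
    """Effect size detected with about 80% power at the two-sided 5% level."""
    return (z_crit + z_power) * se

# (true effect, reported SE) per route
scenarios = {
    "Downtown": (50, 22.5),
    "Suburban": (30, 22.2),
    "Cross-town": (15, 11.1),
}
power = {r: approx_power(e, se) for r, (e, se) in scenarios.items()}
mde = {r: minimum_detectable_effect(se) for r, (_, se) in scenarios.items()}
for r in scenarios:
    print(f"{r}: power ≈ {power[r]:.2f}, 80%-power MDE ≈ {mde[r]:.0f} riders")
```

Under these assumptions power is roughly 0.6 for Downtown and below 0.3 for the other two routes, so non-significant results for Suburban and Cross-town are the expected outcome even when the effects are real.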

**3. Non-significance is not evidence of no effect**
- Cross-town is not statistically significant, but
- The true effect (+15) is within the confidence interval
- "No significant effect" ≠ "proven no effect"
- The result reflects a lack of precision, not proof of absence

### What we can say:

**Conservative interpretation:**
- Results suggest positive effects on ridership, but estimates are imprecise
- Downtown shows clearest evidence (significant, captures true effect)
- Suburban/Cross-town show directional evidence but high uncertainty
- Competitor confounder is a serious limitation - true effects may be smaller than estimates
- All true effects fall within their 95% confidence intervals (consistent with the method working, though three wide intervals are a weak check)

**For business decisions:**
- If the threshold is "any positive effect": Downtown has moderate evidence (p=0.018, CI excludes zero); the others are weak
- If a precise ROI is needed: current uncertainty is too high for a confident cost-benefit analysis
- Recommendation: continue monitoring for 6-12 months to reduce uncertainty

### What this teaches:

Real product analytics often involves:
- Small effects relative to noise
- Confounders you can't fully control for
- Mixed statistical significance
- Need to communicate "what's knowable vs unknowable"

Good practice:
- Acknowledge limitations upfront
- Quantify uncertainty (confidence intervals)
- Qualify conclusions appropriately
- Distinguish "no evidence of effect" from "evidence of no effect"
- Help stakeholders understand uncertainty

---
## Summary

**Methodology:**
- Applied same ITS approach as baseline
- Segment-specific models (non-parallel pre-trends)
- Newey-West standard errors (autocorrelation)
- Estimated immediate level changes (β₂)

**Results:**
- Downtown: +53 riders (sig.), Suburban: +23 riders (n.s.), Cross-town: +7 riders (n.s.)
- 1 of 3 routes statistically significant (vs 3/3 in baseline)
- All true effects within 95% CIs (consistent with the method working as intended)
- Errors: 6-52% (vs <3% in baseline)

**Key differences from baseline:**
- 6-10x smaller effects
- 3x higher noise
- Mixed significance
- 2-4x wider confidence intervals

**Main limitation:**
Competitor launch (Jul 2023) creates downward bias in counterfactual, potentially inflating estimates.

**Initial assessment:**
Downtown shows a positive effect and the estimate is close to the true value. Suburban and Cross-town are not statistically significant: these could be real effects too small to detect reliably, or they could be noise.

**Critical question:** Are these results robust or fragile? 

With high noise, confounders, and small effects, the estimates might be sensitive to:
- Specification choices (lag structure, boundary periods)
- Time window selection (which periods included)
- Placebo timing (are we finding real effects or noise?)
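
To illustrate the placebo-timing idea, the sketch below fits the same segmented specification at the real break and at a fake break placed inside the pre-period, where no effect exists. It uses synthetic data and NumPy least squares. It is not the procedure from notebook 06, which is not shown here, and `its_level_change` is a hypothetical helper.

```python
import numpy as np

def its_level_change(y, time, break_at):
    """Estimate the level change (beta_2) for a break at `break_at`."""
    post = (time >= break_at).astype(float)
    time_since = np.clip(time - break_at, 0.0, None)
    X = np.column_stack([np.ones_like(time), time, post, time_since])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[2]

rng = np.random.default_rng(42)
time = np.arange(200, dtype=float)
true_break, true_effect = 150, 30.0

# Synthetic series: linear trend plus a +30 level shift at week 150
y = 400 + 1.5 * time + true_effect * (time >= true_break) + rng.normal(0, 5, time.size)

# Estimate at the real intervention date
real_estimate = its_level_change(y, time, true_break)

# Placebo: use pre-intervention data only and pretend the break was at week 75
pre = time < true_break
placebo_estimate = its_level_change(y[pre], time[pre], break_at=75)

print(f"Real break:    {real_estimate:+.1f}")
print(f"Placebo break: {placebo_estimate:+.1f}")
```

A placebo estimate close to zero is what a well-behaved model should produce. A placebo estimate comparable in size to the real one would suggest the "effect" may be noise or an artefact of the specification.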

**Next:** Robustness checks (notebook 06) will test whether these results hold up under stress tests or fall apart. This will determine if we can trust these estimates for decisions.