# EDA: Realistic Transit Dataset

**Author:** Tomasz Solis  
**Date:** December 2025  
**Goal:** Explore realistic (messy) transit ridership data

## Context

The baseline dataset had large, clean effects (+300/+200/+150 riders). This realistic dataset has:
- **6-10x smaller effects:** +50/+30/+15 riders
- **3x higher noise:** More week-to-week variance
- **Confounding events:** Competitor launch, gas spike, severe winter

This mirrors real product analytics: small signals in noisy data, with several things changing at once.

**What I'm checking:**
- Data quality
- Pre-intervention trends
- Visible confounders
- Whether ITS assumptions hold
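
For reference, a common way to write a single-route ITS model is segmented regression. This is the standard textbook form, shown as a sketch; it is not necessarily the exact specification fitted later, and the symbols are my own notation.

$$
Y_t = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t - T_0) D_t + \varepsilon_t
$$

Here $Y_t$ is ridership in week $t$, $T_0$ is the launch week, and $D_t = 1$ for $t \ge T_0$ (0 otherwise). $\beta_1$ is the pre-intervention trend, $\beta_2$ the level change at launch, and $\beta_3$ the change in slope. Reading $\beta_2$ and $\beta_3$ as intervention effects assumes the pre-trend would have continued unchanged without the launch, and that nothing else shifts ridership around $T_0$.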

In [1]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import statsmodels.api as sm
from datetime import datetime
from pathlib import Path

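fig_output_dir = Path("../outputs/figures")
fig_output_dir.mkdir(parents=True, exist_ok=True)  # make sure the folder exists before saving figures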
pio.templates.default = "plotly_white"

ROUTE_COLORS = {
    'Downtown': '#1f77b4',
    'Suburban': '#ff7f0e',
    'Cross-town': '#2ca02c'
}

INTERVENTION_DATE = datetime(2024, 1, 1)

print("✓ Setup complete")

✓ Setup complete


---
## 1. Load and Inspect

In [2]:
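df = pd.read_csv('../data/hard_mode/transit_ridership_realistic.csv')
df['date'] = pd.to_datetime(df['date'])
# Sort so the per-route diff() in the gap check compares consecutive weeks
df = df.sort_values(['route_type', 'date']).reset_index(drop=True)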

print(f"Dataset: {len(df):,} observations")
print(f"Period: {df['date'].min().date()} to {df['date'].max().date()}")
print(f"Routes: {df['route_type'].nunique()}")
print(f"\nMissing values: {df.isnull().sum().sum()}")
print(f"Date gaps: {(df.groupby('route_type')['date'].diff().dt.days.dropna() != 7).sum()}")

Dataset: 783 observations
Period: 2020-01-06 to 2024-12-30
Routes: 3

Missing values: 0
Date gaps: 0


In [3]:
# Summary stats by period (pre vs post intervention, per route)
print("SUMMARY BY ROUTE AND PERIOD")

for route in ['Downtown', 'Suburban', 'Cross-town']:
    route_data = df[df['route_type'] == route]
    pre = route_data[route_data['post_intervention'] == 0]['avg_ridership']
    post = route_data[route_data['post_intervention'] == 1]['avg_ridership']
    
    print(f"\n{route}:")
    print(f"  Pre:  mean={pre.mean():6.1f}, std={pre.std():5.1f}")
    print(f"  Post: mean={post.mean():6.1f}, std={post.std():5.1f}")
    print(f"  Naive diff: {post.mean() - pre.mean():+.1f} riders")

print("\nNote: Naive differences confounded by trends + events")
print("High pre-period variance suggests multiple confounders")

SUMMARY BY ROUTE AND PERIOD

Downtown:
  Pre:  mean= 716.1, std=164.8
  Post: mean=1082.3, std= 80.0
  Naive diff: +366.2 riders

Suburban:
  Pre:  mean= 547.3, std=110.4
  Post: mean= 783.4, std= 59.1
  Naive diff: +236.0 riders

Cross-town:
  Pre:  mean= 393.6, std= 77.0
  Post: mean= 551.1, std= 42.7
  Naive diff: +157.5 riders

Note: Naive differences confounded by trends + events
High pre-period variance suggests multiple confounders


**Initial observations:**
- Complete dataset, no missing values or gaps
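- Naive differences (+157.5 to +366.2 riders) are far larger than the true effects (+15 to +50), so most of the gap comes from trend and events, not the intervention
- High pre-period standard deviations (77-165 riders), but these raw values fold in four years of growth, not just week-to-week noise
- Lower post-period standard deviations (43-80 riders): stabilization, or simply a shorter window with less trend in it?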
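
A raw standard deviation taken over a trending period mixes trend with noise. The sketch below uses synthetic data (the level, slope, and noise values are made up, not taken from this dataset) to show how much the number can change once a linear trend is removed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical series: linear growth plus noise (illustrative values only)
weeks = np.arange(208)                      # four years of weekly points
noise_sd = 50.0                             # assumed week-to-week noise
y = 450 + 2.5 * weeks + rng.normal(0, noise_sd, size=weeks.size)

# Raw spread: includes the growth across the whole window
raw_sd = y.std(ddof=1)

# Fit and remove a linear trend, then measure what is left
slope, intercept = np.polyfit(weeks, y, deg=1)
resid = y - (intercept + slope * weeks)
resid_sd = resid.std(ddof=2)                # two fitted parameters

print(f"Raw std:       {raw_sd:.1f}")
print(f"Detrended std: {resid_sd:.1f}")
```

The same two steps could be run per route on `avg_ridership` to get a noise estimate that is comparable between the pre and post periods.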

---
## 2. Time Series Visualization

In [4]:
fig = go.Figure()

for route in ['Downtown', 'Suburban', 'Cross-town']:
    route_data = df[df['route_type'] == route].sort_values('date')
    fig.add_trace(go.Scatter(
        x=route_data['date'],
        y=route_data['avg_ridership'],
        mode='lines',
        name=route,
        line=dict(color=ROUTE_COLORS[route], width=1.5)
    ))

# Mark intervention
# (x is given in epoch milliseconds, a common workaround for add_vline annotations on date axes)
fig.add_vline(
    x=INTERVENTION_DATE.timestamp() * 1000,
    line_dash="dash",
    line_color="red",
    annotation_text="Express Lanes Launch",
    annotation_position="top"
)

# Mark confounders
fig.add_vline(
    x=datetime(2023, 7, 1).timestamp() * 1000,
    line_dash="dot",
    line_color="gray",
    annotation_text="Competitor Launch",
    annotation_position="bottom left"
)

fig.add_vrect(
    x0=datetime(2022, 3, 1).timestamp() * 1000,
    x1=datetime(2022, 6, 30).timestamp() * 1000,
    fillcolor="yellow",
    opacity=0.2,
    annotation_text="Gas Spike",
    annotation_position="top left"
)

fig.add_vrect(
    x0=datetime(2023, 1, 1).timestamp() * 1000,
    x1=datetime(2023, 2, 28).timestamp() * 1000,
    fillcolor="lightblue",
    opacity=0.2,
    annotation_text="Severe Winter",
    annotation_position="top left"
)

fig.update_layout(
    title='Weekly Transit Ridership: Realistic Dataset (2020-2024)',
    xaxis_title='Date',
    yaxis_title='Average Daily Ridership',
    height=500,
    hovermode='x unified'
)

fig.show()
fig.write_image(f"{fig_output_dir}/11_realistic_time_series.png", scale=2)
print("✓ Saved figure")

✓ Saved figure


**Visual observations:**
- Strong upward trends in all routes (pre-intervention)
- Confounders visible:
  - Gas spike (Mar-Jun 2022): Temporary bump, especially Suburban
  - Severe winter (Jan-Feb 2023): Dip across all routes
  - Competitor launch (Jul 2023): Possible flattening before intervention
- Intervention effect NOT visually obvious (small relative to trends + noise)
- High week-to-week variance throughout

---
## 3. Pre-Intervention Trends

In [5]:
df_pre = df[df['date'] < INTERVENTION_DATE].copy()

pre_trends = {}  # route -> fitted slope (riders/week)

print("PRE-INTERVENTION TREND ANALYSIS")

for route in ['Downtown', 'Suburban', 'Cross-town']:
    route_data = df_pre[df_pre['route_type'] == route].copy()
    route_data['weeks'] = (route_data['date'] - route_data['date'].min()).dt.days / 7
    
    X = sm.add_constant(route_data['weeks'])
    y = route_data['avg_ridership']
    model = sm.OLS(y, X).fit()
    
    trend = model.params['weeks']
    pre_trends[route] = trend
    
    print(f"\n{route}:")
    print(f"  Trend: {trend:+.2f} riders/week")
    print(f"  R²: {model.rsquared:.3f}")
    print(f"  4-year growth: {trend * 208:+.0f} riders")

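# Parallel trends check: gap between the steepest and shallowest slopes,
# expressed as a percentage of the shallowest (assumes all slopes are positive)
max_trend = max(pre_trends.values())
min_trend = min(pre_trends.values())
diff_pct = ((max_trend - min_trend) / min_trend) * 100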

print(f"\n{'='*60}")
print("PARALLEL TRENDS CHECK")
print(f"{'='*60}")
print(f"Fastest: {max_trend:.2f} riders/week")
print(f"Slowest: {min_trend:.2f} riders/week")
print(f"Difference: {diff_pct:.0f}%")

if diff_pct > 50:
    print("\n⚠️  Pre-trends NOT parallel (>50% difference)")
    print("Decision: Use segment-specific ITS models")
else:
    print("\n✓ Pre-trends reasonably parallel")


PRE-INTERVENTION TREND ANALYSIS

Downtown:
  Trend: +2.49 riders/week
  R²: 0.828
  4-year growth: +518 riders

Suburban:
  Trend: +1.63 riders/week
  R²: 0.794
  4-year growth: +340 riders

Cross-town:
  Trend: +1.09 riders/week
  R²: 0.730
  4-year growth: +227 riders

PARALLEL TRENDS CHECK
Fastest: 2.49 riders/week
Slowest: 1.09 riders/week
Difference: 128%

⚠️  Pre-trends NOT parallel (>50% difference)
Decision: Use segment-specific ITS models


In [6]:
# Visualize pre-trends
fig = go.Figure()

for route in ['Downtown', 'Suburban', 'Cross-town']:
    route_data = df_pre[df_pre['route_type'] == route].copy()
    
    # Scatter
    fig.add_trace(go.Scatter(
        x=route_data['date'],
        y=route_data['avg_ridership'],
        mode='markers',
        name=route,
        marker=dict(size=4, color=ROUTE_COLORS[route], opacity=0.4)
    ))
    
    # Trend line
    route_data['weeks'] = (route_data['date'] - route_data['date'].min()).dt.days / 7
    X = sm.add_constant(route_data['weeks'])
    y = route_data['avg_ridership']
    model = sm.OLS(y, X).fit()
    route_data['fitted'] = model.fittedvalues
    
    fig.add_trace(go.Scatter(
        x=route_data['date'],
        y=route_data['fitted'],
        mode='lines',
        name=f'{route} trend',
        line=dict(color=ROUTE_COLORS[route], width=2),
        showlegend=False
    ))

# Mark confounders
fig.add_vrect(
    x0=datetime(2022, 3, 1).timestamp() * 1000,
    x1=datetime(2022, 6, 30).timestamp() * 1000,
    fillcolor="yellow",
    opacity=0.2,
    annotation_text="Gas Spike"
)
fig.add_vrect(
    x0=datetime(2023, 1, 1).timestamp() * 1000,
    x1=datetime(2023, 2, 28).timestamp() * 1000,
    fillcolor="lightblue",
    opacity=0.2,
    annotation_text="Winter"
)
fig.add_vline(
    x=datetime(2023, 7, 1).timestamp() * 1000,
    line_dash="dot",
    line_color="gray",
    annotation_text="Competitor"
)

fig.update_layout(
    title='Pre-Intervention Trends (2020-2023)',
    xaxis_title='Date',
    yaxis_title='Average Daily Ridership',
    height=500
)

fig.show()
fig.write_image(f"{fig_output_dir}/12_realistic_pre_trends.png", scale=2)
print("✓ Saved figure")

✓ Saved figure


**Pre-trend findings:**
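- Same as baseline: non-parallel. The steepest slope (Downtown, +2.49 riders/week) is 128% above the shallowest (Cross-town, +1.09)
- Will use segment-specific ITS models, one per route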
- Confounders create visible deviations from linear trends
- R² values (0.73-0.83) lower than baseline (0.99+) due to noise
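
The 50% threshold is a heuristic. With noisier data it can help to ask whether the slope gap is large relative to its own uncertainty. A minimal sketch on synthetic data (slopes and noise levels are invented for illustration; `slope_and_se` is a hypothetical helper, not part of this project):

```python
import numpy as np

def slope_and_se(x, y):
    """OLS slope of y on x and its standard error (simple linear regression)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sxx = np.sum((x - x.mean()) ** 2)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    resid = y - (intercept + slope * x)
    sigma2 = np.sum(resid ** 2) / (len(x) - 2)   # residual variance
    return slope, np.sqrt(sigma2 / sxx)

# Hypothetical routes: different slopes, different noise (not real data)
rng = np.random.default_rng(1)
weeks = np.arange(208)
fast = 450 + 2.5 * weeks + rng.normal(0, 60, weeks.size)
slow = 280 + 1.1 * weeks + rng.normal(0, 40, weeks.size)

b_fast, se_fast = slope_and_se(weeks, fast)
b_slow, se_slow = slope_and_se(weeks, slow)

# Approximate z-statistic for the slope difference, treating routes as independent
z = (b_fast - b_slow) / np.hypot(se_fast, se_slow)
print(f"Slope gap: {b_fast - b_slow:.2f} riders/week, z = {z:.1f}")
```

This treats the weekly errors as independent, which ridership data often violates, so the z-statistic is likely optimistic. It is a sanity check alongside the percentage rule, not a replacement for it.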

---
## 4. Confounder Examination

The competitor launch (Jul 2023) is the most problematic confounder: it lands 6 months before the intervention, at the tail end of the pre-period.

In [7]:
# Focus on 2023 (competitor period)
competitor_start = datetime(2023, 7, 1)
mask = (df['date'] >= datetime(2023, 1, 1)) & (df['date'] < INTERVENTION_DATE)
df_competitor_period = df[mask].copy()

fig = go.Figure()

for route in ['Downtown', 'Suburban', 'Cross-town']:
    route_data = df_competitor_period[df_competitor_period['route_type'] == route]
    fig.add_trace(go.Scatter(
        x=route_data['date'],
        y=route_data['avg_ridership'],
        mode='lines+markers',
        name=route,
        line=dict(color=ROUTE_COLORS[route])
    ))

fig.add_vline(
    x=competitor_start.timestamp() * 1000,
    line_dash="dash",
    line_color="red",
    annotation_text="Competitor Launches"
)

fig.update_layout(
    title='Ridership Around Competitor Launch (2023)',
    xaxis_title='Date',
    yaxis_title='Average Daily Ridership',
    height=400
)

fig.show()
fig.write_image(f"{fig_output_dir}/13_competitor_period.png", scale=2)
print("✓ Saved figure")

✓ Saved figure


**Competitor confounder:**
- Launched Jul 2023, 6 months before express lanes
- Downtown shows visible decline after launch
- Creates downward pressure in immediate pre-intervention period

**Challenge for ITS:**
- Standard ITS projects pre-trend forward as counterfactual
- If the competitor depresses ridership in Jul-Dec 2023, the fitted pre-trend is pulled down and the counterfactual is biased downward
- A counterfactual that is too low may make express lanes look MORE effective than they are
- Need to acknowledge this in interpretation
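
To make the mechanism concrete, here is a noise-free sketch of how a dip at the end of the pre-period drags the projected counterfactual down. All numbers are hypothetical, and the dip is assumed to be confined to the last 26 pre-intervention weeks. If the competitor effect persisted into the post period, the net bias in the effect estimate could look different.

```python
import numpy as np

# Hypothetical, noise-free illustration; none of these values come from the dataset
weeks_pre = np.arange(208)             # weekly pre-intervention index
weeks_post = np.arange(208, 261)       # weekly post-intervention index
slope, intercept = 2.5, 450.0          # assumed underlying linear trend
competitor_drop = 40.0                 # assumed dip, last 26 pre-period weeks only

clean = intercept + slope * weeks_pre
contaminated = clean.copy()
contaminated[-26:] -= competitor_drop

def project(y_pre):
    """Fit a linear pre-trend and extend it over the post period."""
    b1, b0 = np.polyfit(weeks_pre, y_pre, deg=1)
    return b0 + b1 * weeks_post

# Negative values mean the counterfactual sits below where it should be
bias = project(contaminated) - project(clean)
print(f"Counterfactual bias, first post week: {bias[0]:+.1f} riders")
print(f"Counterfactual bias, last post week:  {bias[-1]:+.1f} riders")
```

In this setup the bias is negative and grows with the projection horizon, because the dip lowers both the fitted level and the fitted slope.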

---
## Summary

**Data quality:** ✓ Complete, no gaps or missing values

**Characteristics:**
- Small true effects: +50/+30/+15 riders
- High noise: σ = 36-60 riders (3x baseline)
- Signal-to-noise (true effect / σ): 0.4-0.8 (vs 12-15 in baseline)
- Non-parallel pre-trends: 128% difference

**Confounding events identified:**
1. Gas spike (Mar-Jun 2022): Temporary boost
2. Severe winter (Jan-Feb 2023): 2-month dip
3. **Competitor launch (Jul 2023):** ⚠️ Most problematic - creates downward bias in counterfactual

**Expected challenges:**
- Wide confidence intervals (high noise)
- Possible non-significance for smaller effects
- Competitor confounder complicates interpretation
- Need careful uncertainty communication
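
A rough back-of-envelope for how wide those intervals might be: the sketch below computes the 95% half-width for a simple pre/post difference in means at the two ends of the quoted noise range. It assumes independent, already-detrended weekly errors, which is optimistic. The counts of 208 and 53 weeks follow from 261 weekly rows per route and the 2024-01-01 cutoff.

```python
import math

n_pre, n_post = 208, 53    # weekly observations per route before / after 2024-01-01
z_95 = 1.96                # normal critical value for a 95% interval

half_widths = {}
for sigma in (36, 60):     # noise range quoted in the summary above
    # Standard error of a difference between two independent means
    se = sigma * math.sqrt(1 / n_pre + 1 / n_post)
    half_widths[sigma] = z_95 * se
    print(f"sigma={sigma}: 95% half-width ~ {half_widths[sigma]:.1f} riders")
```

Even under these assumptions the half-width comes out at roughly 11 to 18 riders, the same order as the +15 effect. Trend-extrapolation uncertainty and autocorrelation would widen it further.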

**Next:** Fit ITS models and see what we can conclude despite the messiness.