# Traffic jam detection and prediction

## Goal
Detect current traffic jams from the NDW speed feed and build a predictive model that forecasts whether a jam will occur in the near future (e.g. next 5–30 minutes).

## Data used
- Real-time / historical measurements: `df_speed`, `df_speed_enriched`, `df_speed_mega_enriched`
- Site/location info: `df_sites`, `df_sites_parsed`, `gdf_sites`, `gdf_sites_rd`
- Road segments / geometry: `gdf_speedloc`, `gdf_msi`, `gdf_join`
- Paths: `csv_path`, `output_dir`

## Detection (rule-based baseline)
- Simple threshold rule to label jam events:
    - low-speed: `avg_speed_kmh < speed_thresh` (e.g. 30 km/h)
    - optionally require flow > `min_flow` to ignore closed lanes (e.g. 50 veh/h)
    - or sudden drop: `speed_now < (1 - drop_pct) * historical_median_speed`
- Temporal consolidation:
    - require the rule to hold continuously for at least `min_duration` (e.g. 10–15 min) per `site_id` before calling it a jam
    - merge adjacent jammed sites on the same `road` (neighbouring `km` positions) into a single event
- Candidate API: `detect_jams(df, speed_thresh=30, min_flow=50, min_duration=10) -> jams_df`
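
A minimal sketch of the low-speed rule with temporal consolidation, using the candidate signature above. It assumes a `timestamp` column (the name is a guess, not taken from the data) and roughly one row per site per minute, so `min_duration` is counted in rows. It does not check for gaps in the feed and leaves out the sudden-drop rule and the spatial merge. The demo frame is made up for illustration.

```python
import pandas as pd

def detect_jams(df, speed_thresh=30, min_flow=50, min_duration=10):
    """Return one row per jam episode: site_id, start, end, n_obs, min_speed."""
    # Assumed columns: site_id, timestamp, avg_speed_kmh, flow_veh_per_hour
    df = df.sort_values(["site_id", "timestamp"]).reset_index(drop=True)
    slow = (df["avg_speed_kmh"] < speed_thresh) & (df["flow_veh_per_hour"] > min_flow)
    # A new run starts wherever the slow flag flips or a new site begins
    # (diff() is NaN on each site's first row, and NaN != 0 is True).
    run_id = slow.astype(int).groupby(df["site_id"]).diff().ne(0).cumsum()
    df = df.assign(slow=slow, run_id=run_id)
    runs = (
        df[df["slow"]]
        .groupby(["site_id", "run_id"])
        .agg(
            start=("timestamp", "min"),
            end=("timestamp", "max"),
            n_obs=("timestamp", "size"),
            min_speed=("avg_speed_kmh", "min"),
        )
        .reset_index()
    )
    # Keep only runs long enough to count as a jam
    jams = runs[runs["n_obs"] >= min_duration]
    return jams.drop(columns="run_id").reset_index(drop=True)

# Made-up demo data: site A is slow for 12 minutes, site B for only 3
ts = pd.date_range("2024-01-01 08:00", periods=20, freq="min")
demo = pd.concat(
    [
        pd.DataFrame({"site_id": "A", "timestamp": ts,
                      "avg_speed_kmh": [100] * 4 + [20] * 12 + [100] * 4,
                      "flow_veh_per_hour": 1000}),
        pd.DataFrame({"site_id": "B", "timestamp": ts,
                      "avg_speed_kmh": [100] * 8 + [20] * 3 + [100] * 9,
                      "flow_veh_per_hour": 1000}),
    ],
    ignore_index=True,
)
jams = detect_jams(demo)
```

With the default `min_duration=10`, only the 12-minute episode at site A survives; the 3-minute dip at site B is discarded as noise.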

## Label creation for ML
- Use the detection output to create binary labels:
    - label = 1 if a jam occurs within the prediction horizon (e.g. next 5/15/30 minutes), otherwise 0
- Join labels back to measurement times per `site_id` for supervised learning
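
One possible way to turn jam episodes into per-measurement labels is sketched below. `label_measurements` is a hypothetical helper, not an existing function in the notebook, and it assumes `jams_df` has `site_id`, `start` and `end` columns and that measurements carry a `timestamp` column. It reads "jam occurs within the horizon" as "a jam interval overlaps the look-ahead window (t, t + horizon]", which also labels rows where a jam is already in progress; other readings are possible.

```python
import pandas as pd

def label_measurements(df, jams_df, horizon="15min"):
    """Return site_id, timestamp, label (1 if a jam overlaps (t, t + horizon])."""
    horizon = pd.Timedelta(horizon)
    out = df[["site_id", "timestamp"]].reset_index(drop=True).copy()
    out["label"] = 0
    for site_id, site_jams in jams_df.groupby("site_id"):
        site_mask = out["site_id"] == site_id
        t = out.loc[site_mask, "timestamp"]
        hit = pd.Series(False, index=t.index)
        for jam in site_jams.itertuples():
            # Jam [start, end] overlaps the look-ahead window (t, t + horizon]
            hit |= (jam.start <= t + horizon) & (jam.end > t)
        out.loc[site_mask, "label"] = hit.astype(int)
    return out

# Made-up demo: one jam at site A from 08:10 to 08:12, none at site B
ts = pd.date_range("2024-01-01 08:00", periods=20, freq="min")
demo = pd.concat(
    [pd.DataFrame({"site_id": s, "timestamp": ts}) for s in ["A", "B"]],
    ignore_index=True,
)
jams_df = pd.DataFrame({
    "site_id": ["A"],
    "start": [pd.Timestamp("2024-01-01 08:10")],
    "end": [pd.Timestamp("2024-01-01 08:12")],
})
labels = label_measurements(demo, jams_df, horizon="5min")
```

The result can be merged back onto the measurements on `site_id` and `timestamp`. The inner loop is fine for a few jams per site; a large history would probably call for an interval join instead.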

## Feature engineering
- Instant features: `avg_speed_kmh`, `flow_veh_per_hour`
- Temporal features: rolling means / std (5, 15, 30 min), `speed_delta` (now − prev), time of day, weekday
- Spatial context: lane, `road`, `km`, neighboring sites' speeds (upstream/downstream)
- Historical features: median speed by site/hour (seasonality)
- Categorical encoding: `road`, `carriageway`, `lane`, `direction_ref`
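
The temporal features could be computed per site roughly as follows. `add_temporal_features` is a hypothetical helper name, the `timestamp` column is assumed, and the windows are counted in rows, which matches minutes only if each site reports once per minute. Windows include the current row and look backwards only, so they do not peek into the future.

```python
import pandas as pd

def add_temporal_features(df, windows=(5, 15, 30)):
    """Add per-site speed_delta, rolling mean/std, hour and weekday columns."""
    df = df.sort_values(["site_id", "timestamp"]).reset_index(drop=True).copy()
    by_site = df.groupby("site_id")["avg_speed_kmh"]
    # Change versus the previous measurement at the same site (NaN on the first row)
    df["speed_delta"] = by_site.diff()
    for w in windows:
        # Trailing windows of w rows, computed within each site
        df[f"speed_mean_{w}"] = by_site.transform(
            lambda s: s.rolling(w, min_periods=1).mean()
        )
        df[f"speed_std_{w}"] = by_site.transform(
            lambda s: s.rolling(w, min_periods=2).std()
        )
    df["hour"] = df["timestamp"].dt.hour
    df["weekday"] = df["timestamp"].dt.dayofweek  # Monday = 0
    return df

# Made-up demo: site A slows down steadily, site B stays constant
ts = pd.date_range("2024-01-01 08:00", periods=6, freq="min")
demo = pd.concat(
    [
        pd.DataFrame({"site_id": "A", "timestamp": ts,
                      "avg_speed_kmh": [100.0, 90.0, 80.0, 70.0, 60.0, 50.0]}),
        pd.DataFrame({"site_id": "B", "timestamp": ts,
                      "avg_speed_kmh": [50.0] * 6}),
    ],
    ignore_index=True,
)
feats = add_temporal_features(demo, windows=(3,))
```

Historical medians by site and hour are left out here on purpose: if they are computed over the full dataset they leak future information into earlier rows, so they are better fitted on the training period only.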

## Modeling approach
- Baseline: logistic regression with calibrated probabilities
- Tree-based: RandomForest, XGBoost or LightGBM for performance
- Handle class imbalance: class_weight, focal loss, oversample minority or undersample majority
- Cross-validation: time-series CV with rolling-window splits, so that no future data leaks into training
- Evaluation metrics: Precision, Recall, F1, PR-AUC (prefer PR when classes are imbalanced), ROC-AUC
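
As an illustration of the baseline, the sketch below runs a class-weighted logistic regression through rolling-window splits and scores each fold with average precision (a common PR-AUC estimate). The data is synthetic and exists only to make the example runnable; it says nothing about NDW data. It uses scikit-learn's `TimeSeriesSplit` with `max_train_size` to get a rolling rather than expanding window, and assumes rows are already sorted by time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import TimeSeriesSplit

# Synthetic, time-ordered stand-in for (X, y): jams are rare and more likely at low speed
rng = np.random.default_rng(0)
n = 2000
speed = rng.normal(90, 20, n)
delta = rng.normal(0, 5, n)
p_jam = 1 / (1 + np.exp((speed - 50) / 5))
y = (rng.random(n) < p_jam).astype(int)
X = np.column_stack([speed, delta])

# Rolling window: each fold trains on at most the 800 rows just before its test block
cv = TimeSeriesSplit(n_splits=4, max_train_size=800)

folds, scores = [], []
for train_idx, test_idx in cv.split(X):
    folds.append((train_idx, test_idx))
    # class_weight="balanced" reweights the rare jam class
    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[test_idx])[:, 1]
    scores.append(average_precision_score(y[test_idx], proba))

print([round(s, 3) for s in scores])
```

With several sites in one frame, sorting by timestamp before splitting matters: otherwise a fold can train on one site's afternoon and test on another site's morning.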

## Pipeline (high-level)
1. `jams_df = detect_jams(df_speed_mega_enriched, ...)`
2. `labels = create_labels(jams_df, horizon='15min')`
3. `X, y = build_features(df_speed_mega_enriched, labels, windows=[5,15,30])`
4. Train model with time-series CV:
     - `model = train_model(X_train, y_train, model='lgbm', eval_metric='pr_auc')`
5. Predict on live feed and save:
     - `preds = model.predict_proba(X_live)[:,1]`
     - save to `output_dir / "ndw_jam_predictions.csv"`
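
The save step might look like the sketch below. The temporary directory stands in for the notebook's `output_dir`, the probabilities stand in for `model.predict_proba(X_live)[:, 1]`, and the `timestamp` and `jam_probability` column names are assumptions; all values are placeholders.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Stand-in for the notebook's output_dir
output_dir = Path(tempfile.mkdtemp())

# Placeholder identifiers for the live rows that were scored
live = pd.DataFrame({
    "site_id": ["A", "B", "C"],
    "timestamp": pd.to_datetime(["2024-01-01 08:00"] * 3),
})
# Placeholder for model.predict_proba(X_live)[:, 1]
preds = np.array([0.05, 0.72, 0.31])

# Keep the keys next to the probability so the file can be joined back to sites
out = live.assign(jam_probability=preds)
out_path = output_dir / "ndw_jam_predictions.csv"
out.to_csv(out_path, index=False)
```

Writing `site_id` and `timestamp` alongside the probability keeps the file usable for the map and time-series plots later on.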

## Visualization / inspection
- Map jams and probabilities with geopandas (use `gdf_sites` / `gdf_sites_rd` geometry)
- Time series per site: plot speed + predicted jam probability
- Aggregate on road segments (`km`) and show current jam extents
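
Before mapping, current jam extents per road could be derived along these lines. `jam_extents` and the `in_jam` flag are hypothetical, `max_gap_km` is an assumed tuning parameter, and the road/km values are made up. The sketch treats jammed sites on the same `road` as one extent when consecutive jammed sites are at most `max_gap_km` apart; it ignores carriageway and direction, and does not check whether a free-flowing site sits in between.

```python
import pandas as pd

def jam_extents(current, max_gap_km=1.0):
    """current: one row per site with road, km, in_jam. Returns km ranges per road."""
    jammed = current[current["in_jam"]].sort_values(["road", "km"])
    # Distance to the previous jammed site on the same road (NaN for the first one)
    gap = jammed.groupby("road")["km"].diff()
    # A new extent starts on a new road or after a gap larger than max_gap_km
    new_extent = gap.isna() | (gap > max_gap_km)
    jammed = jammed.assign(extent_id=new_extent.cumsum())
    return (
        jammed.groupby(["road", "extent_id"])
        .agg(km_start=("km", "min"), km_end=("km", "max"), n_sites=("km", "size"))
        .reset_index()
        .drop(columns="extent_id")
    )

# Made-up snapshot of the latest measurement per site
current = pd.DataFrame({
    "road": ["A12", "A12", "A12", "A12", "A12", "A2"],
    "km": [10.0, 10.5, 11.0, 12.0, 15.0, 3.0],
    "in_jam": [True, True, True, False, True, True],
})
extents = jam_extents(current)
```

Each row of `extents` can then be drawn as a highlighted stretch on the map, or used as the spatial merge step of the detection rule.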

## Quick checklist for next code cells
- Implement `detect_jams(...)` using `df_speed_mega_enriched`
- Implement `build_features(...)` with per-site rolling windows (pandas `.groupby('site_id').rolling(...)`)
- Train with a lightweight model (e.g. LightGBM) and validate with time-based splits
- Create plotting cell: map jam events using `gdf_sites` or `gdf_join` and color by probability
- Save labels/predictions to `output_dir`

Note: reuse existing imports/variables in the notebook (do not re-import unless needed).