torviktor/ria-forecast
English | Русский

RIA Event Probability Forecasting

Spatial predictive model for three event categories in Ukraine: strike_ukraine, air_defense, strike_russia.

Produces per-cell probability forecasts for 1-day, 3-day, and 7-day horizons. The data comes from the publicly available RIA Novosti interactive map of events (one JSON per day, 2022-06-09 onwards). This repository is a research and portfolio artefact; it is not an operational warning system.


Quick Start

```
# Activate virtual environment
.venv\Scripts\activate          # Windows
source .venv/bin/activate       # Linux/macOS

# Generate predictions for the latest date in the dataset
python predict.py

# Predict for a specific date
python predict.py --date 2026-04-15

# Use the simple baseline model (faster, no model files needed)
python predict.py --model baseline

# Change the output path
python predict.py --output my_run/predictions.csv
```

Output is a CSV with ranked cells (see Output Format).


Installation

```
python -m venv .venv
.venv\Scripts\pip install -r requirements.txt

# Development extras (PyInstaller for desktop builds)
.venv\Scripts\pip install -r requirements-dev.txt
```

Requires Python 3.10+. Key dependencies: lightgbm>=4.0, h3>=4.0, pandas>=2.0.


Documentation

| File | Audience | Language |
|---|---|---|
| README.md | Developers cloning the repo | EN |
| USER_README.md | End users of the desktop build (ships inside the release ZIP) | RU |
| docs/01_technical_description.docx | Full technical description of the pipeline | RU |
| docs/02_probability_interpretation.docx | How to read the probabilities and the map | RU |
| docs/03_executive_summary.docx | ~3-page management summary | RU |

Figures used in the docs are stored under docs/diagrams/, screenshots under docs/screenshots/. Working drafts are kept in docs/drafts/ as development history.


Project Structure

```
ria_project/
├── predict.py                  # ← main inference entry point
├── app_launcher.py             # customtkinter GUI (entry point for the desktop build)
├── run_phase1.py               # Phase 1 orchestrator (cleaning, EDA, H3, panels)
├── run_phase1_finish.py        # Phase 1 finisher (data passport, training panels)
├── run_phase2.py               # Phase 2 training (baselines + LightGBM)
├── run_phase2_5.py             # Phase 2.5 ablation (Group C/D + is_unbalance)
├── build_dataset.py            # raw JSON → events_raw.parquet
├── build_map.py                # forecast map generator CLI
├── build_analytics_map.py      # analytics map generator CLI
├── build_exe.py                # PyInstaller onedir build of the GUI
├── build_package.py            # pack dist/RIA_Forecast into a release ZIP
├── ria_parser_v2.py            # RIA interactive-map parser
│
├── src/
│   ├── cleaning.py             # coord validation, dedup, category classification
│   ├── eda.py                  # H3 indexing, EDA tables/figures
│   ├── features.py             # feature engineering (Groups A–E)
│   ├── baselines.py            # RecencyWeightedRate, HistoricalRate, SpatialKDE
│   ├── lgbm_model.py           # LightGBM train / predict / save / load
│   ├── metrics.py              # ROC-AUC, PR-AUC, Brier, Precision@K
│   ├── splits.py               # expanding-window cross-validation
│   ├── risk_interpreter.py     # probability → risk label / colour / explanation
│   ├── map_builder.py          # forecast HTML map (folium)
│   ├── analytics_builder.py    # analytics HTML map (Leaflet)
│   ├── frontline_parser.py     # Deep State frontline → GeoJSON
│   ├── html_favicon.py         # inject favicon / brand badge into HTML maps
│   └── io.py                   # raw-data inventory, logging setup
│
├── configs/
│   ├── phase_2.yaml            # Phase 2 hyperparameters and paths
│   └── phase_2_5.yaml          # Phase 2.5 (Group C/D experiment)
│
├── data/
│   ├── raw/                    # (gitignored) daily RIA JSON files
│   ├── processed/              # cleaned parquet + geojson
│   └── reports/                # per-step quality reports
│
├── models/                     # (gitignored) trained LightGBM models — 540 + 108 files
│
├── output/
│   ├── phase_1/                # data_passport.md + EDA figures
│   ├── phase_2/                # validation_report.md, metrics, feature importance
│   └── phase_2_5/              # comparison_report.md + metrics
│
├── tools/                      # documentation build tooling (figures, docx, metrics report)
├── assets/icon.ico             # application / map icon
├── docs/                       # see "Documentation" above
├── requirements.txt            # runtime dependencies
├── requirements-dev.txt        # dev dependencies (PyInstaller)
└── RIA_Forecast.spec           # PyInstaller spec (generated by build_exe.py)
```

Data Availability

To keep the repository small, two large directories are excluded from version control:

  • data/raw/ — ~55 MB of daily JSON snapshots from the RIA interactive map. Rebuild with:
    python ria_parser_v2.py
    The parser is resumable; every day is stored as a separate JSON file, and already-fetched days are skipped.
  • models/ — ~280 MB of trained LightGBM artefacts. The baseline MVP (predict.py --model baseline) runs without them. To rebuild the LightGBM ensemble, run Phase 2 (see Reproducing Results).

Data in data/processed/ is partially tracked: the final curated artefacts (events_final.parquet, training_panel*.parquet, h3_r5_reference.parquet, ukraine_oblasts.geojson) are kept for reproducibility of later phases. Intermediate parquets are regenerated from raw data in seconds.

Generated map.html and analytics_map.html in the project root are artefacts of a pipeline run and are not tracked.


Data

| File | Description |
|---|---|
| events_final.parquet | Per-event rows: date, lat/lng, category, H3 cell, oblast |
| training_panel.parquet | Daily count per (H3-r5 cell, category). 404 active cells, 1412 dates |
| training_panel_regional.parquet | Daily count per (oblast, category). 22 oblasts |

Event categories (targets): strike_ukraine, air_defense, strike_russia.
Event categories (context features): battle, troops, infrastructure, sabotage.
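As a rough sketch of how per-event rows become the daily count panels, assuming pandas and the column names from the table above (the real aggregation lives in the Phase 1 pipeline, not in this snippet):

```python
import pandas as pd

# Illustrative toy events; the real input is events_final.parquet.
events = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02"]),
    "h3_r5": ["851fb4a7fffffff", "851fb4a7fffffff", "851fb467fffffff"],
    "category": ["strike_ukraine", "strike_ukraine", "air_defense"],
})

# Daily count per (H3-r5 cell, category) — the shape of training_panel.parquet.
panel = (
    events.groupby(["date", "h3_r5", "category"])
          .size()
          .rename("count")
          .reset_index()
)
print(panel)
```

Swapping `h3_r5` for an `oblast` column yields the regional panel in the same way.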


Models

MVP Model: RecencyWeightedRateBaseline

Exponentially decayed historical rate with a 30-day half-life:

```
P(event in next h days | cell c) = 1 - exp(-rate_c × h)
rate_c = Σ_t count_c(t) × exp(-0.693 × (T - t) / 30)
```

| Category | ROC-AUC | PR-AUC (h=1) |
|---|---|---|
| strike_ukraine | 0.931–0.942 | 0.562 |
| air_defense | 0.931–0.970 | 0.209 |
| strike_russia | 0.687–0.712 | 0.047–0.265 |
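The formula above fits in a few lines. This is an illustrative reimplementation with an assumed dict-based interface, not the repo's actual src/baselines.py:

```python
import math

HALF_LIFE = 30.0        # days; 0.693 in the formula is ln(2)
LN2 = math.log(2)

def recency_weighted_rate(daily_counts, T):
    """daily_counts: {day_index: event_count}; T: the 'today' index."""
    return sum(count * math.exp(-LN2 * (T - t) / HALF_LIFE)
               for t, count in daily_counts.items())

def event_probability(rate, horizon_days):
    # P(>=1 event in the next h days) under a constant-rate assumption
    return 1.0 - math.exp(-rate * horizon_days)

# Two events today, one 5 days ago, one 40 days ago (toy numbers)
rate = recency_weighted_rate({100: 2, 95: 1, 60: 1}, T=100)
p7 = event_probability(rate, 7)
```

A cell with recent activity gets a rate near its raw count; activity older than a few half-lives contributes almost nothing.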

LightGBM (Phase 2)

49 features in three groups:

| Group | Features |
|---|---|
| A — temporal | day-of-week, month, sin/cos encodings, days since conflict start |
| B — autoregressive | lags 1–28d, rolling stats 7–90d, EWMA 3/7/14d, days-since-last/first event |
| E — global | national totals 7/14d, trend, strike_russia share |

LightGBM improves over baseline for air_defense (+1.4–1.5% ROC-AUC). For strike_ukraine and strike_russia the baseline is competitive or better. Models trained with 12 expanding folds, 30-day test windows, 7-day gap.
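The expanding-window scheme might look roughly like this; the function name and signature are illustrative (the actual splitter is src/splits.py):

```python
from datetime import date, timedelta

def expanding_folds(start, end, n_folds=12, test_days=30, gap_days=7):
    """Each fold trains on all data up to a cutoff, skips a gap,
    then tests on the next `test_days` days."""
    folds = []
    for i in range(n_folds):
        test_end = end - timedelta(days=test_days * (n_folds - 1 - i))
        test_start = test_end - timedelta(days=test_days - 1)
        train_end = test_start - timedelta(days=gap_days + 1)
        folds.append((start, train_end, test_start, test_end))
    return folds

folds = expanding_folds(date(2022, 6, 9), date(2026, 4, 15))
```

The gap keeps lagged features computed near the training cutoff from overlapping the test window.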

Phase 2.5 Experiment (Group C + D + is_unbalance) — not used in production

Adding H3 spatial neighbour features (Group C) and cross-category features (Group D) with is_unbalance=True degraded performance for strike_ukraine (ΔPR ≈ −0.04) and air_defense (ΔPR ≈ −0.10) due to cross-category data drift after 2025 and calibration instability. Phase 2 architecture remains the production choice.


Output Format

predict.py produces output/predictions.csv:

| Column | Type | Description |
|---|---|---|
| cell_id | string | H3-r5 cell token (e.g. 851fb4a7fffffff) or oblast name |
| cell_type | string | h3_r5 or area |
| category | string | strike_ukraine, air_defense, or strike_russia |
| horizon_days | int | 1, 3, or 7 |
| forecast_date | date | First day of the forecast window |
| predicted_prob | float | Probability of ≥1 event in the horizon window (0–1) |
| rank | int | Rank within (category, horizon); 1 = highest probability |
| model | string | Name of the model used |
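A hypothetical way to consume the CSV: the sample rows below mimic the documented schema, and the same filtering works on `pd.read_csv("output/predictions.csv", parse_dates=["forecast_date"])`:

```python
import pandas as pd

# Toy rows in the predictions.csv schema (values are made up).
preds = pd.DataFrame({
    "cell_id": ["851fb4a7fffffff", "851fb467fffffff", "Kharkiv"],
    "cell_type": ["h3_r5", "h3_r5", "area"],
    "category": ["strike_ukraine"] * 3,
    "horizon_days": [7, 7, 7],
    "predicted_prob": [0.82, 0.57, 0.91],
    "rank": [2, 3, 1],
})

# Top cells for one (category, horizon) slice; rank 1 = highest probability.
top = (
    preds[(preds["category"] == "strike_ukraine") & (preds["horizon_days"] == 7)]
    .sort_values("rank")
    .head(2)[["cell_id", "predicted_prob"]]
)
print(top)
```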

Reproducing Results

```
# Build feature panels from raw event data (Phase 1)
python build_dataset.py
python run_phase1.py

# Train LightGBM models, baselines, write validation report (Phase 2)
python run_phase2.py --steps 1,5,6,7,8

# Run Phase 2.5 experiment (Group C/D + is_unbalance)
python run_phase2_5.py
```

Validation Summary

Temporal cross-validation: 12 expanding folds, 30-day test windows, 7-day gap. No data from the test window is visible at prediction time.

Leakage checks:

  • Smoke test (random labels → AUC ≤ 0.55): passed for all categories
  • days_since_last_event verified on shifted series (s.shift(1))
  • Sample weights (0.4×) applied to centroid-geocoded events
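The random-label smoke test can be illustrated as follows. This is a self-contained sketch, not the project's test code; AUC is computed via the Mann-Whitney rank statistic:

```python
import random

def roc_auc(scores, labels):
    """ROC-AUC as the probability a positive outranks a negative
    (Mann-Whitney U, with ties counted as half-wins)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(42)
scores = [rng.random() for _ in range(2000)]            # stand-in model scores
labels = [int(rng.random() < 0.1) for _ in range(2000)] # shuffled/random labels

auc = roc_auc(scores, labels)
# With random labels any feature set should score at chance level;
# an AUC well above 0.55 here would indicate leakage.
assert 0.40 < auc < 0.60
```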

Key finding: EWMA-14 (39% feature importance) dominates, confirming that the recency-weighted baseline captures most of the predictable signal.


License

MIT. See LICENSE.


Disclaimer

This project models patterns in publicly reported events aggregated by a news source (RIA Novosti interactive map). It is a research and portfolio artefact, not an operational intelligence tool. The author does not take a political position; category names (strike_ukraine, strike_russia, etc.) are technical labels inherited from the source data classification and should be read as descriptors of where an event is recorded, not as claims about its legitimacy or attribution.
