Spatial predictive model for three event categories in Ukraine: strike_ukraine, air_defense, strike_russia.
Produces per-cell probability forecasts for 1-day, 3-day, and 7-day horizons. The data comes from the publicly available RIA Novosti interactive map of events (one JSON per day, 2022-06-09 onwards). This repository is a research and portfolio artefact; it is not an operational warning system.
# Activate virtual environment
.venv\Scripts\activate # Windows
source .venv/bin/activate # Linux/Mac
# Generate predictions for the latest date in the dataset
python predict.py
# Predict for a specific date
python predict.py --date 2026-04-15
# Use the simple baseline model (faster, no model files needed)
python predict.py --model baseline
# Change output path
python predict.py --output my_run/predictions.csv
Output is a CSV with ranked cells (see Output Format).
python -m venv .venv
.venv\Scripts\pip install -r requirements.txt
# Development extras (PyInstaller for desktop builds)
.venv\Scripts\pip install -r requirements-dev.txt
Requires Python 3.10+. Key dependencies: lightgbm>=4.0, h3>=4.0, pandas>=2.0.
| File | Audience | Language |
|---|---|---|
| README.md | Developers cloning the repo | EN |
| USER_README.md | End users of the desktop build (ships inside the release ZIP) | RU |
| docs/01_technical_description.docx | Full technical description of the pipeline | RU |
| docs/02_probability_interpretation.docx | How to read the probabilities and the map | RU |
| docs/03_executive_summary.docx | ~3-page management summary | RU |
Figures used in the docs are stored under docs/diagrams/, screenshots under docs/screenshots/. Working drafts are kept in docs/drafts/ as development history.
ria_project/
├── predict.py # ← main inference entry point
├── app_launcher.py # customtkinter GUI (entry point for the desktop build)
├── run_phase1.py # Phase 1 orchestrator (cleaning, EDA, H3, panels)
├── run_phase1_finish.py # Phase 1 finisher (data passport, training panels)
├── run_phase2.py # Phase 2 training (baselines + LightGBM)
├── run_phase2_5.py # Phase 2.5 ablation (Group C/D + is_unbalance)
├── build_dataset.py # raw JSON → events_raw.parquet
├── build_map.py # forecast map generator CLI
├── build_analytics_map.py # analytics map generator CLI
├── build_exe.py # PyInstaller onedir build of the GUI
├── build_package.py # pack dist/RIA_Forecast into a release ZIP
├── ria_parser_v2.py # RIA interactive-map parser
│
├── src/
│ ├── cleaning.py # coord validation, dedup, category classification
│ ├── eda.py # H3 indexing, EDA tables/figures
│ ├── features.py # feature engineering (Groups A–E)
│ ├── baselines.py # RecencyWeightedRate, HistoricalRate, SpatialKDE
│ ├── lgbm_model.py # LightGBM train / predict / save / load
│ ├── metrics.py # ROC-AUC, PR-AUC, Brier, Precision@K
│ ├── splits.py # expanding-window cross-validation
│ ├── risk_interpreter.py # probability → risk label / colour / explanation
│ ├── map_builder.py # forecast HTML map (folium)
│ ├── analytics_builder.py # analytics HTML map (Leaflet)
│ ├── frontline_parser.py # Deep State frontline → GeoJSON
│ ├── html_favicon.py # inject favicon / brand badge into HTML maps
│ └── io.py # raw-data inventory, logging setup
│
├── configs/
│ ├── phase_2.yaml # Phase 2 hyperparameters and paths
│ └── phase_2_5.yaml # Phase 2.5 (Group C/D experiment)
│
├── data/
│ ├── raw/ # (gitignored) daily RIA JSON files
│ ├── processed/ # cleaned parquet + geojson
│ └── reports/ # per-step quality reports
│
├── models/ # (gitignored) trained LightGBM models — 540 + 108 files
│
├── output/
│ ├── phase_1/ # data_passport.md + EDA figures
│ ├── phase_2/ # validation_report.md, metrics, feature importance
│ └── phase_2_5/ # comparison_report.md + metrics
│
├── tools/ # documentation build tooling (figures, docx, metrics report)
├── assets/icon.ico # application / map icon
├── docs/ # see "Documentation" above
├── requirements.txt # runtime dependencies
├── requirements-dev.txt # dev dependencies (PyInstaller)
└── RIA_Forecast.spec # PyInstaller spec (generated by build_exe.py)
To keep the repository small, two large directories are excluded from version control:
- `data/raw/` — ~55 MB of daily JSON snapshots from the RIA interactive map. Rebuild with `python ria_parser_v2.py`. The parser is resumable: every day is stored as a separate JSON file, and already-fetched days are skipped.
- `models/` — ~280 MB of trained LightGBM artefacts. The baseline MVP (`predict.py --model baseline`) runs without them. To rebuild the LightGBM ensemble, run Phase 2 (see Reproducing Results).
Data in data/processed/ is partially tracked: the final curated artefacts (events_final.parquet, training_panel*.parquet, h3_r5_reference.parquet, ukraine_oblasts.geojson) are kept for reproducibility of later phases. Intermediate parquets are regenerated from raw data in seconds.
Generated map.html and analytics_map.html in the project root are artefacts of a pipeline run and are not tracked.
| File | Description |
|---|---|
| `events_final.parquet` | Raw events: date, lat/lng, category, H3 cell, oblast |
| `training_panel.parquet` | Daily count per (H3-r5 cell, category). 404 active cells, 1412 dates |
| `training_panel_regional.parquet` | Daily count per (oblast, category). 22 oblasts |
Event categories (targets): strike_ukraine, air_defense, strike_russia
Event categories (context features): battle, troops, infrastructure, sabotage
Exponentially decayed historical rate with half-life 30 days:
P(event in next h days | cell c) = 1 - exp(-rate_c × h)
rate_c = Σ_{t} count_c(t) × exp(-0.693 × (T-t) / 30)
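A minimal sketch of this baseline on a single cell's daily count series. It transcribes the two formulas above directly; the real implementation is `RecencyWeightedRate` in src/baselines.py, whose API is not assumed here:

```python
import numpy as np
import pandas as pd

def recency_weighted_prob(counts: pd.Series, horizon_days: int,
                          half_life: float = 30.0) -> float:
    # counts: daily events for one cell, date-indexed, ending at forecast date T
    ages = (counts.index.max() - counts.index).days        # T - t, in days
    weights = np.exp(-np.log(2) * ages.values / half_life) # 0.693 ≈ ln 2
    rate = float((counts.values * weights).sum())          # decayed rate_c
    # Poisson-style probability of ≥1 event in the next h days
    return 1.0 - np.exp(-rate * horizon_days)

idx = pd.date_range("2025-01-01", periods=60, freq="D")
counts = pd.Series(0.0, index=idx)
counts.iloc[-5] = 1.0   # a single recent event
p1 = recency_weighted_prob(counts, 1)
p7 = recency_weighted_prob(counts, 7)
```

Longer horizons monotonically increase the probability, and older events contribute exponentially less.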
| Category | ROC-AUC | PR-AUC (h=1) |
|---|---|---|
| strike_ukraine | 0.931–0.942 | 0.562 |
| air_defense | 0.931–0.970 | 0.209 |
| strike_russia | 0.687–0.712 | 0.047–0.265 |
49 features in three groups:
| Group | Features |
|---|---|
| A — temporal | day-of-week, month, sin/cos encodings, days since conflict start |
| B — autoregressive | lags 1–28d, rolling stats 7–90d, EWMA 3/7/14d, days-since-last/first event |
| E — global | national totals 7/14d, trend, strike_russia share |
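To illustrate the Group B idea, a few autoregressive features can be sketched on one cell's daily series. Column names are illustrative (the real set lives in src/features.py), and the current day is excluded via `shift(1)` so a day's label never feeds its own features:

```python
import pandas as pd

def add_group_b_features(g: pd.DataFrame) -> pd.DataFrame:
    out = g.copy()
    shifted = out["count"].shift(1)   # exclude the current day (no leakage)
    out["lag_1"] = shifted                           # yesterday's count
    out["roll_mean_7"] = shifted.rolling(7).mean()   # 7-day rolling mean
    out["ewma_14"] = shifted.ewm(halflife=14).mean() # EWMA, 14-day half-life
    return out

daily = pd.DataFrame({"count": range(10)},
                     index=pd.date_range("2025-01-01", periods=10, freq="D"))
feats = add_group_b_features(daily)
```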
LightGBM improves over baseline for air_defense (+1.4–1.5% ROC-AUC). For strike_ukraine
and strike_russia the baseline is competitive or better. Models trained with 12 expanding
folds, 30-day test windows, 7-day gap.
Adding H3 spatial neighbour features (Group C) and cross-category features (Group D) with
is_unbalance=True degraded performance for strike_ukraine (ΔPR ≈ −0.04) and
air_defense (ΔPR ≈ −0.10) due to cross-category data drift after 2025 and calibration
instability. Phase 2 architecture remains the production choice.
predict.py produces output/predictions.csv:
| Column | Type | Description |
|---|---|---|
| `cell_id` | string | H3-r5 cell token (e.g. `851fb4a7fffffff`) or oblast name |
| `cell_type` | string | `h3_r5` or `area` |
| `category` | string | `strike_ukraine`, `air_defense`, `strike_russia` |
| `horizon_days` | int | 1, 3, or 7 |
| `forecast_date` | date | First day of the forecast window |
| `predicted_prob` | float | Probability of ≥1 event in the horizon window (0–1) |
| `rank` | int | Rank within (category, horizon) — 1 = highest probability |
| `model` | string | Model name used |
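A quick sketch of consuming the CSV, using stub rows shaped like the table above (all values are made up for illustration):

```python
import pandas as pd

# Stub rows with the output/predictions.csv columns; in practice use
# pd.read_csv("output/predictions.csv", parse_dates=["forecast_date"])
preds = pd.DataFrame({
    "cell_id": ["851fb4a7fffffff", "851fb4a7fffffff", "Kharkiv"],
    "cell_type": ["h3_r5", "h3_r5", "area"],
    "category": ["strike_ukraine"] * 3,
    "horizon_days": [1, 7, 1],
    "forecast_date": pd.to_datetime(["2026-04-15"] * 3),
    "predicted_prob": [0.62, 0.91, 0.35],
    "rank": [1, 1, 2],
    "model": ["lightgbm"] * 3,
})

# Highest-risk cells for the 1-day strike_ukraine forecast
top = (preds[(preds["category"] == "strike_ukraine")
             & (preds["horizon_days"] == 1)]
       .sort_values("rank"))
```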
# Build feature panels from raw event data (Phase 1)
python build_dataset.py
python run_phase1.py
# Train LightGBM models, baselines, write validation report (Phase 2)
python run_phase2.py --steps 1,5,6,7,8
# Run Phase 2.5 experiment (Group C/D + is_unbalance)
python run_phase2_5.py
Temporal cross-validation: 12 expanding folds, 30-day test windows, 7-day gap. No data from the test window is visible at prediction time.
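The expanding-window scheme can be sketched as follows; this illustrates the idea (ever-growing train history, a gap, then a fixed test window) and is not the API of src/splits.py:

```python
import pandas as pd

def expanding_window_folds(dates, n_folds=12, test_days=30, gap_days=7):
    # Each fold trains on all history up to a cutoff, skips gap_days,
    # then tests on a test_days-long window counted back from the end.
    end = dates.max()
    folds = []
    for k in range(n_folds):
        test_end = end - pd.Timedelta(days=(n_folds - 1 - k) * test_days)
        test_start = test_end - pd.Timedelta(days=test_days - 1)
        train_end = test_start - pd.Timedelta(days=gap_days + 1)
        folds.append((dates[dates <= train_end],
                      dates[(dates >= test_start) & (dates <= test_end)]))
    return folds

dates = pd.date_range("2023-01-01", periods=400, freq="D")
folds = expanding_window_folds(dates, n_folds=3)
```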
Leakage checks:
- Smoke test (random labels → AUC ≤ 0.55): passed for all categories
- `days_since_last_event` verified on shifted series (`s.shift(1)`)
- Sample weights (0.4×) applied to centroid-geocoded events
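The random-label smoke test can be illustrated with a self-contained sketch: a genuinely informative score achieves high ROC-AUC, but once the labels are shuffled the same scores collapse toward 0.5, well under the 0.55 threshold. The `roc_auc` helper here is a standalone Mann–Whitney implementation, not the project's src/metrics.py:

```python
import numpy as np

def roc_auc(y, scores):
    # Mann-Whitney U formulation of ROC-AUC (no tied scores expected here)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=5000)
scores = y + rng.normal(scale=0.5, size=5000)   # informative scores

auc_real = roc_auc(y, scores)                   # high: signal present
auc_shuffled = roc_auc(rng.permutation(y), scores)  # near 0.5: no signal
```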
Key finding: EWMA-14 (39% feature importance) dominates, confirming that the recency-weighted baseline captures most of the predictable signal.
MIT. See LICENSE.
This project models patterns in publicly reported events aggregated by a news source (RIA Novosti interactive map). It is a research and portfolio artefact, not an operational intelligence tool. The author does not take a political position; category names (strike_ukraine, strike_russia, etc.) are technical labels inherited from the source data classification and should be read as descriptors of where an event is recorded, not as claims about its legitimacy or attribution.