Spatial predictive model for three event categories in Ukraine: strike_ukraine, air_defense, strike_russia.
Produces per-cell probability forecasts for 1-day, 3-day, and 7-day horizons. The data comes from the publicly available RIA Novosti interactive map of events (one JSON per day, 2022-06-09 onwards). This repository is a research and portfolio artefact; it is not an operational warning system.
# Activate virtual environment
.venv\Scripts\activate # Windows
source .venv/bin/activate # Linux/Mac
# Generate predictions for the latest date in the dataset
python predict.py
# Predict for a specific date
python predict.py --date 2026-04-15
# Use the simple baseline model (faster, no model files needed)
python predict.py --model baseline
# Change output path
python predict.py --output my_run/predictions.csv
Output is a CSV with ranked cells (see Output Format).
python -m venv .venv
.venv\Scripts\pip install -r requirements.txt
# Development extras (PyInstaller for desktop builds)
.venv\Scripts\pip install -r requirements-dev.txt
Requires Python 3.10+. Key dependencies: lightgbm>=4.0, h3>=4.0, pandas>=2.0.
| File | Audience | Language |
|---|---|---|
| README.md | Developers cloning the repo | EN |
| USER_README.md | End users of the desktop build (ships inside the release ZIP) | RU |
| docs/01_technical_description.docx | Full technical description of the pipeline | RU |
| docs/02_probability_interpretation.docx | How to read the probabilities and the map | RU |
| docs/03_executive_summary.docx | ~3-page management summary | RU |
Figures used in the docs are stored under docs/diagrams/, screenshots under docs/screenshots/. Working drafts are kept in docs/drafts/ as development history.
ria_project/
├── predict.py # ← main inference entry point
├── app_launcher.py # customtkinter GUI (entry point for the desktop build)
├── run_phase1.py # Phase 1 orchestrator (cleaning, EDA, H3, panels)
├── run_phase1_finish.py # Phase 1 finisher (data passport, training panels)
├── run_phase2.py # Phase 2 training (baselines + LightGBM)
├── run_phase2_5.py # Phase 2.5 ablation (Group C/D + is_unbalance)
├── build_dataset.py # raw JSON → events_raw.parquet
├── build_map.py # forecast map generator CLI
├── build_analytics_map.py # analytics map generator CLI
├── build_exe.py # PyInstaller onedir build of the GUI
├── build_package.py # pack dist/RIA_Forecast into a release ZIP
├── ria_parser_v2.py # RIA interactive-map parser
│
├── src/
│ ├── cleaning.py # coord validation, dedup, category classification
│ ├── eda.py # H3 indexing, EDA tables/figures
│ ├── features.py # feature engineering (Groups A–E)
│ ├── baselines.py # RecencyWeightedRate, HistoricalRate, SpatialKDE
│ ├── lgbm_model.py # LightGBM train / predict / save / load
│ ├── metrics.py # ROC-AUC, PR-AUC, Brier, Precision@K
│ ├── splits.py # expanding-window cross-validation
│ ├── risk_interpreter.py # probability → risk label / colour / explanation
│ ├── map_builder.py # forecast HTML map (folium)
│ ├── analytics_builder.py # analytics HTML map (Leaflet)
│ ├── frontline_parser.py # Deep State frontline → GeoJSON
│ ├── html_favicon.py # inject favicon / brand badge into HTML maps
│ └── io.py # raw-data inventory, logging setup
│
├── configs/
│ ├── phase_2.yaml # Phase 2 hyperparameters and paths
│ └── phase_2_5.yaml # Phase 2.5 (Group C/D experiment)
│
├── data/
│ ├── raw/ # (gitignored) daily RIA JSON files
│ ├── processed/ # cleaned parquet + geojson
│ └── reports/ # per-step quality reports
│
├── models/ # (gitignored) trained LightGBM models — 540 + 108 files
│
├── output/
│ ├── phase_1/ # data_passport.md + EDA figures
│ ├── phase_2/ # validation_report.md, metrics, feature importance
│ └── phase_2_5/ # comparison_report.md + metrics
│
├── tools/ # documentation build tooling (figures, docx, metrics report)
├── assets/icon.ico # application / map icon
├── docs/ # see "Documentation" above
├── requirements.txt # runtime dependencies
├── requirements-dev.txt # dev dependencies (PyInstaller)
└── RIA_Forecast.spec # PyInstaller spec (generated by build_exe.py)
To keep the repository small, two large directories are excluded from version control:
- `data/raw/` — ~55 MB of daily JSON snapshots from the RIA interactive map. Rebuild with `python ria_parser_v2.py`. The parser is resumable: every day is stored as a separate JSON file, and already-fetched days are skipped.
- `models/` — ~280 MB of trained LightGBM artefacts. The baseline MVP (`predict.py --model baseline`) runs without them. To rebuild the LightGBM ensemble, run Phase 2 (see Reproducing Results).
Data in data/processed/ is partially tracked: the final curated artefacts (events_final.parquet, training_panel*.parquet, h3_r5_reference.parquet, ukraine_oblasts.geojson) are kept for reproducibility of later phases. Intermediate parquets are regenerated from raw data in seconds.
Generated map.html and analytics_map.html in the project root are artefacts of a pipeline run and are not tracked.
| File | Description |
|---|---|
| `events_final.parquet` | Raw events: date, lat/lng, category, H3 cell, oblast |
| `training_panel.parquet` | Daily count per (H3-r5 cell, category). 404 active cells, 1412 dates |
| `training_panel_regional.parquet` | Daily count per (oblast, category). 22 oblasts |
Event categories (targets): strike_ukraine, air_defense, strike_russia
Event categories (context features): battle, troops, infrastructure, sabotage
Exponentially decayed historical rate with half-life 30 days:
P(event in next h days | cell c) = 1 - exp(-rate_c × h)
rate_c = Σ_{t} count_c(t) × exp(-0.693 × (T-t) / 30)
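A minimal sketch of this baseline on a single cell's daily count series. It transcribes the two formulas above directly; the real implementation is `RecencyWeightedRate` in src/baselines.py, whose API is not assumed here:

```python
import numpy as np
import pandas as pd

def recency_weighted_prob(counts: pd.Series, horizon_days: int,
                          half_life: float = 30.0) -> float:
    # counts: daily events for one cell, date-indexed, ending at forecast date T
    ages = (counts.index.max() - counts.index).days        # T - t, in days
    weights = np.exp(-np.log(2) * ages.values / half_life) # 0.693 ≈ ln 2
    rate = float((counts.values * weights).sum())          # decayed rate_c
    # Poisson-style probability of ≥1 event in the next h days
    return 1.0 - np.exp(-rate * horizon_days)

idx = pd.date_range("2025-01-01", periods=60, freq="D")
counts = pd.Series(0.0, index=idx)
counts.iloc[-5] = 1.0   # a single recent event
p1 = recency_weighted_prob(counts, 1)
p7 = recency_weighted_prob(counts, 7)
```

Longer horizons monotonically increase the probability, and older events contribute exponentially less.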
| Category | ROC-AUC | PR-AUC (h=1) |
|---|---|---|
| strike_ukraine | 0.931–0.942 | 0.562 |
| air_defense | 0.931–0.970 | 0.209 |
| strike_russia | 0.687–0.712 | 0.047–0.265 |
49 features in three groups:
| Group | Features |
|---|---|
| A — temporal | day-of-week, month, sin/cos encodings, days since conflict start |
| B — autoregressive | lags 1–28d, rolling stats 7–90d, EWMA 3/7/14d, days-since-last/first event |
| E — global | national totals 7/14d, trend, strike_russia share |
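To illustrate the Group B idea, a few autoregressive features can be sketched on one cell's daily series. Column names are illustrative (the real set lives in src/features.py), and the current day is excluded via `shift(1)` so a day's label never feeds its own features:

```python
import pandas as pd

def add_group_b_features(g: pd.DataFrame) -> pd.DataFrame:
    out = g.copy()
    shifted = out["count"].shift(1)   # exclude the current day (no leakage)
    out["lag_1"] = shifted                           # yesterday's count
    out["roll_mean_7"] = shifted.rolling(7).mean()   # 7-day rolling mean
    out["ewma_14"] = shifted.ewm(halflife=14).mean() # EWMA, 14-day half-life
    return out

daily = pd.DataFrame({"count": range(10)},
                     index=pd.date_range("2025-01-01", periods=10, freq="D"))
feats = add_group_b_features(daily)
```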
LightGBM improves over baseline for air_defense (+1.4–1.5% ROC-AUC). For strike_ukraine
and strike_russia the baseline is competitive or better. Models trained with 12 expanding
folds, 30-day test windows, 7-day gap.
Adding H3 spatial neighbour features (Group C) and cross-category features (Group D) with
is_unbalance=True degraded performance for strike_ukraine (ΔPR ≈ −0.04) and
air_defense (ΔPR ≈ −0.10) due to cross-category data drift after 2025 and calibration
instability. Phase 2 architecture remains the production choice.
predict.py produces output/predictions.csv:
| Column | Type | Description |
|---|---|---|
| `cell_id` | string | H3-r5 cell token (e.g. `851fb4a7fffffff`) or oblast name |
| `cell_type` | string | `h3_r5` or `area` |
| `category` | string | `strike_ukraine`, `air_defense`, `strike_russia` |
| `horizon_days` | int | 1, 3, or 7 |
| `forecast_date` | date | First day of the forecast window |
| `predicted_prob` | float | Probability of ≥1 event in the horizon window (0–1) |
| `rank` | int | Rank within (category, horizon) — 1 = highest probability |
| `model` | string | Model name used |
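A quick sketch of consuming the CSV, using stub rows shaped like the table above (all values are made up for illustration):

```python
import pandas as pd

# Stub rows with the output/predictions.csv columns; in practice use
# pd.read_csv("output/predictions.csv", parse_dates=["forecast_date"])
preds = pd.DataFrame({
    "cell_id": ["851fb4a7fffffff", "851fb4a7fffffff", "Kharkiv"],
    "cell_type": ["h3_r5", "h3_r5", "area"],
    "category": ["strike_ukraine"] * 3,
    "horizon_days": [1, 7, 1],
    "forecast_date": pd.to_datetime(["2026-04-15"] * 3),
    "predicted_prob": [0.62, 0.91, 0.35],
    "rank": [1, 1, 2],
    "model": ["lightgbm"] * 3,
})

# Highest-risk cells for the 1-day strike_ukraine forecast
top = (preds[(preds["category"] == "strike_ukraine")
             & (preds["horizon_days"] == 1)]
       .sort_values("rank"))
```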
# Build feature panels from raw event data (Phase 1)
python build_dataset.py
python run_phase1.py
# Train LightGBM models, baselines, write validation report (Phase 2)
python run_phase2.py --steps 1,5,6,7,8
# Run Phase 2.5 experiment (Group C/D + is_unbalance)
python run_phase2_5.py
Temporal cross-validation: 12 expanding folds, 30-day test windows, 7-day gap. No data from the test window is visible at prediction time.
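The expanding-window scheme can be sketched as follows; this illustrates the idea (ever-growing train history, a gap, then a fixed test window) and is not the API of src/splits.py:

```python
import pandas as pd

def expanding_window_folds(dates, n_folds=12, test_days=30, gap_days=7):
    # Each fold trains on all history up to a cutoff, skips gap_days,
    # then tests on a test_days-long window counted back from the end.
    end = dates.max()
    folds = []
    for k in range(n_folds):
        test_end = end - pd.Timedelta(days=(n_folds - 1 - k) * test_days)
        test_start = test_end - pd.Timedelta(days=test_days - 1)
        train_end = test_start - pd.Timedelta(days=gap_days + 1)
        folds.append((dates[dates <= train_end],
                      dates[(dates >= test_start) & (dates <= test_end)]))
    return folds

dates = pd.date_range("2023-01-01", periods=400, freq="D")
folds = expanding_window_folds(dates, n_folds=3)
```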
Leakage checks:
- Smoke test (random labels → AUC ≤ 0.55): passed for all categories
- `days_since_last_event` verified on shifted series (`s.shift(1)`)
- Sample weights (0.4×) applied to centroid-geocoded events
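The random-label smoke test can be illustrated with a self-contained sketch: a genuinely informative score achieves high ROC-AUC, but once the labels are shuffled the same scores collapse toward 0.5, well under the 0.55 threshold. The `roc_auc` helper here is a standalone Mann–Whitney implementation, not the project's src/metrics.py:

```python
import numpy as np

def roc_auc(y, scores):
    # Mann-Whitney U formulation of ROC-AUC (no tied scores expected here)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=5000)
scores = y + rng.normal(scale=0.5, size=5000)   # informative scores

auc_real = roc_auc(y, scores)                   # high: signal present
auc_shuffled = roc_auc(rng.permutation(y), scores)  # near 0.5: no signal
```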
Key finding: EWMA-14 (39% feature importance) dominates, confirming that the recency-weighted baseline captures most of the predictable signal.
MIT. See LICENSE.
This project models patterns in publicly reported events aggregated by a news source (RIA Novosti interactive map). It is a research and portfolio artefact, not an operational intelligence tool. The author does not take a political position; category names (strike_ukraine, strike_russia, etc.) are technical labels inherited from the source data classification and should be read as descriptors of where an event is recorded, not as claims about its legitimacy or attribution.