# Milestone II: Model Architecture & Experiments (Group 4)

Project: Next-Day Wildfire Spread Prediction on mNDWS

## Abstract
We propose a next-day (t+1) burned-area prediction model using the modified Next Day Wildfire Spread (mNDWS) dataset (500 m VIIRS, CONUS-West, 2018–2023). The task is framed as binary image segmentation of pixels likely to burn tomorrow given multimodal inputs (weather, vegetation/drought, fuels, topography, impervious/water). We outline a simple, reproducible baseline (logistic regression and a compact U-Net), an ablation plan to quantify feature-family importance (wind, fuels, vegetation/drought, topography), and robustness slices (high-wind, WUI).

## Introduction
Wildfire spread forecasts inform evacuations, resource allocation, and risk communication. Physics-based simulators can be accurate but require detailed configuration and are costly to run at regional scale. Traditional fire danger indices rely on fixed formulas and cannot capture nonlinear interactions among fuels, wind, and drought. We target a practical question: which 500 m pixels will burn tomorrow? Our objectives are: (i) a clear, reproducible baseline; (ii) insight into which feature families drive performance; (iii) evaluation of robustness under challenging conditions such as high winds and wildland–urban interface (WUI) areas; and (iv) calibrated decision thresholds usable by practitioners.

## Literature Survey
Prior work spans physics-based coupled fire–atmosphere models (e.g., FIRETEC, WRF-Fire) and modern deep learning approaches that treat spread prediction as segmentation. Physics-based models offer high physical fidelity but are computationally expensive and complex to configure. Deep learning models trained on multimodal remote-sensing and weather inputs have shown promising accuracy and scalability for short-horizon spread prediction. Key gaps include robust evaluation under high-wind and WUI conditions, calibrated uncertainty for thresholding, and ablations that quantify the importance of feature families (wind, fuels, vegetation/drought, topography).

References are documented in Deliverable 1 and will be expanded here with formal citations in the final report.

## Method
- Task: Binary image segmentation of next-day burned pixels (`t+1`) at 500 m resolution.
- Dataset: mNDWS (2018–2023, CONUS-West) with multimodal covariates (weather, vegetation/drought, 3-D fuels embedding, topography, impervious/water). We use the official train/val/test splits.

### Data Pipeline (from notebooks)
- Source format: TFRecords provided by mNDWS; converted to NumPy NPZ tiles for PyTorch-friendly loading (see repo notebooks).
- Channels: 9 core raw channels plus auxiliary channels (e.g., fuel1–3, impervious/water), giving a 16-channel input stack (`in_ch=16`) in the current model configurations.
- Normalization: per-channel mean/std computed over training data (see `compute_channel_stats`).
- Dataloaders: `batch_size=16`, `num_workers=0` for portability; optional `WeightedRandomSampler` to upweight positive tiles.
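The normalization and positive-tile upweighting steps above can be sketched in NumPy as follows. This is a minimal illustration, not the repository's `compute_channel_stats`: the function names `channel_stats`, `normalize`, and `tile_sample_weights`, the `(N, C, H, W)` tile layout, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def channel_stats(tiles):
    """Per-channel mean and std over training tiles shaped (N, C, H, W)."""
    mean = tiles.mean(axis=(0, 2, 3))
    std = tiles.std(axis=(0, 2, 3))
    # Floor the std so constant channels do not cause division by zero.
    return mean, np.maximum(std, 1e-6)

def normalize(tiles, mean, std):
    """Standardize each channel using statistics from the training split only."""
    return (tiles - mean[None, :, None, None]) / std[None, :, None, None]

def tile_sample_weights(masks, positive_tile_weight=5.0):
    """Sampling weight per tile: upweight tiles with at least one burned pixel."""
    has_positive = masks.reshape(len(masks), -1).any(axis=1)
    return np.where(has_positive, positive_tile_weight, 1.0)

# Synthetic stand-in data (assumption): 8 tiles, 16 channels, 32x32 pixels.
rng = np.random.default_rng(0)
train_x = rng.normal(loc=3.0, scale=2.0, size=(8, 16, 32, 32))
train_y = np.zeros((8, 32, 32), dtype=bool)
train_y[0, 4:6, 4:6] = True  # only the first tile contains fire

mean, std = channel_stats(train_x)
train_x_norm = normalize(train_x, mean, std)
weights = tile_sample_weights(train_y)
```

The resulting `weights` array has the shape expected by `torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)`. Validation and test tiles would be normalized with the same training-split `mean` and `std`.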

### Models Implemented
- Pixel Logistic Regression: implemented as a 1×1 convolution over the input stack (per-pixel classifier). Loss: `BCEWithLogitsLoss` with `pos_weight` computed from class frequency (observed ≈80 in our subset). Optimizer: `AdamW(lr=1e-3, weight_decay=1e-4)`.
- Compact U-Net (PyTorch): depth-3 encoder–decoder with skip connections; base width 64; instantiated as `UNet(in_ch=16, out_ch=1, base=64)`. Optimizer: `AdamW(lr=2e-4)` with `weight_decay` of `1e-4` or `5e-5` depending on the notebook, and a cosine annealing LR schedule; mixed precision is used in one notebook.
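To make the pixel logistic regression concrete, here is a hedged PyTorch sketch of a 1×1 convolution trained with a frequency-based `pos_weight`. The class name `PixelLogReg`, the helper `pos_weight_from_masks`, and the random tensors are assumptions for illustration; the notebooks may compute the weight differently.

```python
import torch
from torch import nn

class PixelLogReg(nn.Module):
    """Per-pixel logistic regression: one weight per input channel plus a bias."""
    def __init__(self, in_ch=16):
        super().__init__()
        self.linear = nn.Conv2d(in_ch, 1, kernel_size=1)

    def forward(self, x):
        return self.linear(x)  # raw logits, shape (N, 1, H, W)

def pos_weight_from_masks(masks):
    """Ratio of negative to positive pixels, as a 1-element tensor."""
    pos = masks.sum()
    neg = masks.numel() - pos
    return (neg / pos.clamp(min=1)).reshape(1)

# Synthetic batch (assumption): 4 tiles, 16 channels, about 5% positive pixels.
torch.manual_seed(0)
x = torch.randn(4, 16, 32, 32)
y = (torch.rand(4, 1, 32, 32) < 0.05).float()

model = PixelLogReg(in_ch=16)
pos_weight = pos_weight_from_masks(y)
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

# One optimization step.
optimizer.zero_grad()
logits = model(x)
loss = criterion(logits, y)
loss.backward()
optimizer.step()

n_params = sum(p.numel() for p in model.parameters())
```

With 16 input channels the model has 17 parameters (16 weights and one bias), which is what makes it a useful sanity baseline: any spatial context the U-Net exploits is unavailable to it.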

### Losses and Class Imbalance
- Losses tried: BCE-with-logits, Focal Loss (γ≈1.5, α≈0.5), Soft Dice, and Focal Tversky (α≈0.6–0.7, β≈0.3–0.4, γ≈0.75).
- Composites: 0.5·Focal + 0.5·Focal Tversky in one setup.
- Class imbalance: `pos_weight` in BCE and/or `WeightedRandomSampler` (e.g., weight 5.0 for tiles containing positives).
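The focal and Focal Tversky terms, and their 0.5/0.5 composite, can be written out as below. This is a NumPy sketch for exposition only (training would use the same expressions on PyTorch tensors so that gradients flow). It assumes one common convention in which the Tversky `alpha` weights false negatives and `beta` weights false positives; the notebooks may use the opposite assignment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def focal_loss(logits, target, gamma=1.5, alpha=0.5, eps=1e-7):
    """Mean of -alpha_t * (1 - p_t)^gamma * log(p_t) over all pixels."""
    p = np.clip(sigmoid(logits), eps, 1.0 - eps)
    p_t = np.where(target == 1, p, 1.0 - p)
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

def focal_tversky_loss(logits, target, alpha=0.7, beta=0.3, gamma=0.75, eps=1e-7):
    """(1 - TI)^gamma with TI = TP / (TP + alpha*FN + beta*FP), using soft counts."""
    p = sigmoid(logits)
    tp = np.sum(p * target)
    fn = np.sum((1.0 - p) * target)
    fp = np.sum(p * (1.0 - target))
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return float((1.0 - tversky) ** gamma)

def composite_loss(logits, target):
    """Equal-weight blend of the two terms."""
    return 0.5 * focal_loss(logits, target) + 0.5 * focal_tversky_loss(logits, target)

# Toy mask with a small burned patch (assumption: synthetic data).
target = np.zeros((16, 16))
target[5:8, 5:8] = 1.0
confident_right = np.where(target == 1, 20.0, -20.0)
confident_wrong = -confident_right

loss_right = composite_loss(confident_right, target)
loss_wrong = composite_loss(confident_wrong, target)
```

Setting `alpha > beta` in the Tversky term penalizes missed burned pixels more than false alarms, which pushes the model toward higher recall on the rare positive class.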

### Augmentation, TTA, and Thresholding
- Training augmentations: horizontal/vertical flips with geometry-aware handling of wind components.
- Test-time augmentation (TTA): horizontal/vertical flips with probability aggregation to improve stability.
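- Threshold selection: choose a separate threshold for each inference mode (with and without TTA) on the validation set using precision–recall curves, either maximizing F1 or matching the predicted-positive rate to the observed prevalence.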
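The geometry-aware flips and flip TTA described above might look like the following sketch. It assumes wind is stored as eastward (`u`) and northward (`v`) components at the hypothetical channel indices `U_CH` and `V_CH`, with tiles laid out as `(C, H, W)` or `(N, C, H, W)`. If the data instead stores wind speed and direction, the direction angle would need to be reflected rather than a component negated.

```python
import numpy as np

U_CH, V_CH = 0, 1  # hypothetical channel indices for eastward / northward wind

def hflip(x):
    """Mirror left-right and negate the eastward wind component."""
    out = x[..., ::-1].copy()
    out[..., U_CH, :, :] *= -1
    return out

def vflip(x):
    """Mirror up-down and negate the northward wind component."""
    out = x[..., ::-1, :].copy()
    out[..., V_CH, :, :] *= -1
    return out

def predict_tta(predict_fn, x):
    """Average probabilities over the identity, h-flip, and v-flip views.

    Each flipped prediction is flipped back before averaging so that all
    views are aligned with the original tile.
    """
    p_id = predict_fn(x)
    p_h = predict_fn(hflip(x))[..., ::-1]
    p_v = predict_fn(vflip(x))[..., ::-1, :]
    return (p_id + p_h + p_v) / 3.0

def toy_model(x):
    """Stand-in for a trained model: sigmoid of a non-wind channel."""
    return 1.0 / (1.0 + np.exp(-x[..., 2, :, :]))

rng = np.random.default_rng(0)
tile = rng.normal(size=(4, 8, 8))
probs_tta = predict_tta(toy_model, tile)
```

For the same flips used as training augmentation, the label mask is mirrored along the same axis as the inputs but has no sign change.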

### Metrics and Reporting
- Primary: Average Precision (AUPRC) computed via `sklearn.metrics.average_precision_score`.
- Also report: Precision–Recall curves, F1 at selected thresholds, training/validation loss curves.
- Efficiency: parameter count (U-Net base=64) and rough inference time per tile; run-time tracked in notebooks.
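For reference, the sketch below spells out what the primary metric computes and how an F1-maximizing threshold can be read off the same precision–recall sweep. It is a plain NumPy re-derivation for illustration, using the step-wise sum AP = Σ (Rₙ − Rₙ₋₁)·Pₙ that scikit-learn documents for `average_precision_score`. The helper names are assumptions, and inputs are expected as flattened 1-D arrays with at least one positive label.

```python
import numpy as np

def pr_curve(y_true, scores):
    """Precision and recall at each distinct score, from highest to lowest."""
    order = np.argsort(-scores, kind="stable")
    y, s = y_true[order], scores[order]
    # Index of the last element in each run of tied scores.
    last = np.r_[np.nonzero(np.diff(s))[0], len(s) - 1]
    tp = np.cumsum(y)[last]
    n_predicted = last + 1
    precision = tp / n_predicted
    recall = tp / y.sum()
    return precision, recall, s[last]

def average_precision(y_true, scores):
    """Step-wise AP: sum of recall increments weighted by precision."""
    precision, recall, _ = pr_curve(y_true, scores)
    return float(np.sum(np.diff(np.r_[0.0, recall]) * precision))

def best_f1_threshold(y_true, scores):
    """Threshold t (predict positive when score >= t) that maximizes F1."""
    precision, recall, thresholds = pr_curve(y_true, scores)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    i = int(np.argmax(f1))
    return float(thresholds[i]), float(f1[i])

y_val = np.array([0, 0, 1, 1])
s_val = np.array([0.1, 0.4, 0.35, 0.8])
ap = average_precision(y_val, s_val)         # 0.8333...
thr, f1 = best_f1_threshold(y_val, s_val)    # thr = 0.35, f1 = 0.8
```

The threshold should be chosen on validation scores and then applied unchanged to the test set, so that the reported test F1 is not tuned on test data.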

### Experiment Scope (current)
- Subset training to sanity-check pipeline: e.g., `max_samples≈1200/300/300` for train/val/test in early runs; `EPOCHS≈15–60`.
- Reproducibility: fixed seeds and consistent dataloader options within notebooks.
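One way to keep subset runs reproducible is to draw the subset indices from a seeded generator, as in this minimal sketch. The helper `subset_indices` and the total of 5000 tiles are hypothetical; the notebooks may simply take the first `max_samples` tiles instead.

```python
import numpy as np

def subset_indices(n_total, max_samples, seed=0):
    """Deterministic random subset of tile indices, capped at n_total."""
    rng = np.random.default_rng(seed)
    k = min(max_samples, n_total)
    return np.sort(rng.choice(n_total, size=k, replace=False))

# Hypothetical split size; 1200 matches the train subset used in early runs.
train_idx = subset_indices(n_total=5000, max_samples=1200, seed=0)
```

Calling the function again with the same seed returns the same indices, so loss curves from different notebooks are comparable on identical data.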

This section summarizes what is already implemented in the repository notebooks and will be the basis for expanded experiments and ablations below.

## Preliminary Experiments
Summary of exploratory runs captured in repo notebooks (small subsets used for speed):
- Pixel Logistic Regression (1×1 conv): trained with `BCEWithLogitsLoss` + `pos_weight` (≈80) and `AdamW(lr=1e-3)`. Produced stable optimization and reasonable AUPRC on validation subsets; used as sanity baseline.
- Compact U-Net (in_ch=16, base=64): trained with `AdamW(lr=2e-4)` and `weight_decay` of `1e-4` or `5e-5` for 15–60 epochs. Loss variants included BCE, Focal, Soft Dice, and Focal Tversky; a 0.5·Focal + 0.5·Focal Tversky combo performed robustly in early tests.
- Data handling: per-channel normalization, `batch_size=16`, positive-tile upweighting via `WeightedRandomSampler`.
- Thresholding and TTA: validation-based threshold selection; optional flip TTA improved stability on the test subset. Example artifacts in one run included validation-selected `thr_tta≈0.195` and test AUPRC in the ~0.39 range on a small held-out slice (illustrative only).

Planned next: scale beyond subset runs, add systematic ablations (feature families and loss variants), and log curves/metrics consistently.

## Error Analysis
The baseline logistic regression model is limited: it fails to anticipate much of the next-day fire spread, capturing only around 45% of true positives (recall ≈ 0.45).

(To be added: qualitative examples, failure modes by terrain/landcover, and analysis under high-wind and WUI slices.)
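The share of true positives captured is recall, and the planned high-wind and WUI analyses amount to computing it on a subset of pixels. The sketch below assumes a boolean `slice_mask` per pixel. The wind values and the cutoff of 8 are made-up numbers for illustration, not a definition of "high wind" used in this project.

```python
import numpy as np

def recall_on_slice(y_true, y_pred, slice_mask):
    """Recall, TP / (TP + FN), restricted to pixels where slice_mask is True."""
    positives = (y_true == 1) & slice_mask
    n_pos = positives.sum()
    if n_pos == 0:
        return float("nan")  # slice contains no burned pixels
    return float((y_pred[positives] == 1).sum() / n_pos)

# Toy flattened pixels (assumption: synthetic values).
y_true = np.array([1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])
wind_speed = np.array([2.0, 9.0, 10.0, 3.0, 8.0, 1.0])
high_wind = wind_speed >= 8.0  # hypothetical cutoff

overall = recall_on_slice(y_true, y_pred, np.ones_like(high_wind))
windy = recall_on_slice(y_true, y_pred, high_wind)
```

In this toy case overall recall is 0.5 while recall on the high-wind pixels is 0.0, the kind of gap the slice analysis is meant to surface.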

## Ablation Study Plan
Each ablation changes one factor at a time, with consistent seeds and tracked runs. Planned ablations:
- Feature families: remove/add wind, fuels, vegetation/drought, topography to quantify contribution.
- Loss function: BCE vs focal loss.
- Input resolution/stacking choices where applicable.
- Threshold calibration method (Youden J, top-k by prevalence, cost-weighted).
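The three threshold calibration rules in the last item can be compared with a sketch like the one below. The function names and the misclassification costs (`cost_fn=5.0`, `cost_fp=1.0`) are assumptions for illustration. The loops over every distinct score are written for clarity, so on full pixel-level data the candidate thresholds would likely need to be subsampled.

```python
import numpy as np

def youden_threshold(y_true, scores):
    """Threshold maximizing Youden's J = TPR - FPR."""
    pos = y_true == 1
    best_thr, best_j = None, -np.inf
    for t in np.unique(scores):
        pred = scores >= t
        j = pred[pos].mean() - pred[~pos].mean()
        if j > best_j:
            best_thr, best_j = t, j
    return float(best_thr), float(best_j)

def prevalence_threshold(scores, prevalence):
    """Threshold that flags roughly the top `prevalence` fraction of pixels."""
    return float(np.quantile(scores, 1.0 - prevalence))

def cost_weighted_threshold(y_true, scores, cost_fn=5.0, cost_fp=1.0):
    """Threshold minimizing cost_fn * FN + cost_fp * FP."""
    pos = y_true == 1
    thresholds = np.unique(scores)
    costs = []
    for t in thresholds:
        pred = scores >= t
        fn = np.sum(~pred & pos)
        fp = np.sum(pred & ~pos)
        costs.append(cost_fn * fn + cost_fp * fp)
    return float(thresholds[int(np.argmin(costs))])

y_val = np.array([0, 0, 1, 1])
s_val = np.array([0.1, 0.4, 0.35, 0.8])

thr_j, j_stat = youden_threshold(y_val, s_val)
thr_prev = prevalence_threshold(s_val, prevalence=y_val.mean())
thr_cost = cost_weighted_threshold(y_val, s_val)
```

Because burned pixels are rare, Youden's J (which weights the true-positive and false-positive rates equally) can select a much lower threshold than the prevalence-matching rule, which is part of what this ablation is intended to measure.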

## Next Steps
- Finalize data pipeline and input normalization using mNDWS stats.
- Train logistic regression baseline and compact U-Net on the full training split, beyond the current subset runs.
- Produce training/validation curves and core metrics.
- Run planned ablations and robustness slices (high wind, WUI).
- Compile error analysis examples and discussion.

## Member Contributions
- Robert Clay Harris: [TBD]
- Hannah Richardson: [TBD]
- Chelsey Blowe: [TBD]