# WiDS Datathon 2026 Notebook
This is where your analysis begins. Use this notebook for EDA, modeling, and explanations.

## Project Title & Team Info

**Project Title**: _Workshop 1: WiDS University Datathon 2026_  
**Team Name**: _Team Alert_  
**University**: _Bucharest University of Economic Studies_  
**Course**: _Software Open Source for Statistics and Data Science_  
**Term**: _1st Semester, 2025_  

**Team Members**:  

- Țilică Mihnea David (GitHub: [@David-Mihnea](https://github.com/David-Mihnea))
- Zamfir Robert Dan (GitHub: [@zamfirrobert20-prog](https://github.com/zamfirrobert20-prog))
- Radu Alexandru Claudiu (GitHub: [@raduclaudiu20-art](https://github.com/raduclaudiu20-art))
- Săndulescu Crina (GitHub: [@ccrinasandulescu](https://github.com/ccrinasandulescu))
- Sasu Sabrina (GitHub: [@sasusabrina22](https://github.com/sasusabrina22?tab=repositories))
- Sandu Bianca (GitHub: [@sandubianca](https://github.com/sandubianca))



### 🔹 Route 1: Accelerating Equitable Evacuations

**Core Question:**  
*How can we reduce delays in evacuation alerts and improve response times for the communities that are most at risk?*

This route focuses on analyzing how and when evacuation alerts are triggered, and how we can improve timeliness and fairness in communication, especially for vulnerable populations.

## Dataset Overview

Summarize the datasets you used and how you processed them.

- `evac_zone_status_geo_event_map.csv`: maps wildfire events to evacuation zones
- `evac_zones_gis_evaczone.csv`: defines evacuation zones as spatial entities, including their identifiers, names, activity status
- `geo_events_geoevent.csv`: records of geographic events, including wildfire incidents, with their location
- `geo_events_geoeventchangelog.csv`: time-stamped updates to wildfire events, capturing changes in the reported fields


**Load Data**

In [None]:
from google.colab import files
files.upload()

In [None]:
!unzip DataWids.zip -d data/

In [None]:
#libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use("seaborn-v0_8-whitegrid")
sns.set_theme()
plt.colormaps()

from sklearn.model_selection import train_test_split


In [None]:
geo_events    = pd.read_csv("/content/data/DataWids/geo_events_geoevent.csv", low_memory=True)
change_log   = pd.read_csv("/content/data/DataWids/geo_events_geoeventchangelog.csv", low_memory=True)
evac_zones      = pd.read_csv("/content/data/DataWids/evac_zones_gis_evaczone.csv", low_memory=True)
evac_map  = pd.read_csv("/content/data/DataWids/evac_zone_status_geo_event_map.csv", low_memory=True)

In [None]:
print("Loaded:")
for name, df in zip(["geo_events","change_log","evac_zones","evac_map"],
                    [geo_events,change_log,evac_zones,evac_map]):
    print(f"  {name:10s} {df.shape}")

**Data Cleaning**

In [None]:
import pandas as pd
import numpy as np
import json

def extract_json_field(js, key):
    if not isinstance(js, str) or "{" not in js:
        return None
    try:
        js = js.strip().strip('"').strip("'")
        parsed = json.loads(js)
        return parsed.get(key)
    except (ValueError, AttributeError):  # invalid JSON, or the parsed value is not a dict
        return None

def extract_change_value(js, key):
    parsed = extract_json_field(js, key)
    if isinstance(parsed, list) and len(parsed) >= 2:
        return parsed[1]
    return None

# Extract 'geo_event_type' from the 'data' column first
geo_events['geo_event_type'] = geo_events['data'].apply(lambda x: extract_json_field(x, 'geo_event_type'))

# Now filter based on the extracted 'geo_event_type'
geo_events = geo_events[geo_events["geo_event_type"] == "wildfire"].copy()

for col in ["is_prescribed", "is_fps", "containment", "acreage"]:
    geo_events[col] = geo_events["data"].apply(lambda x: extract_json_field(x, col))

fields_to_keep = ["id", "geo_event_type", "date_created", "date_modified", "name", "notification_type", "lat", "lng",
                  "is_prescribed", "is_fps", "containment", "acreage"]
geo_events = geo_events[fields_to_keep].copy()

change_log['rate_of_spread'] = change_log['changes'].apply(lambda x: extract_change_value(x, 'radio_traffic_indicates_rate_of_spread'))
change_log['structure_threat'] = change_log['changes'].apply(lambda x: extract_change_value(x, 'radio_traffic_indicates_structure_threat'))
change_log['spotting'] = change_log['changes'].apply(lambda x: extract_change_value(x, 'radio_traffic_indicates_spotting'))

change_log = change_log[["geo_event_id", "date_created", "rate_of_spread", "structure_threat", "spotting"]].copy()

print("After cleaning:")
print(f"  events    {geo_events.shape}")
print(f"  changes   {change_log.shape}")
print(f"  evac_zones {evac_zones.shape}")
print(f"  evac_map  {evac_map.shape}")

print("\nExtracted fields preview:")
print(geo_events[["id", "geo_event_type", "containment", "acreage", "is_fps", "is_prescribed"]].head(10))

print("\nNon-null counts for extracted fields:")
print(geo_events[["containment", "acreage", "is_fps", "is_prescribed"]].notna().sum())

In [None]:
def to_dt(df, cols):
    for c in cols:
        if c in df.columns:
            df[c] = pd.to_datetime(df[c], errors="coerce", utc=True)
    return df

geo_events = to_dt(geo_events, ["date_created", "date_modified"])
change_log = to_dt(change_log, ["date_created"])

In [None]:
merged = change_log.merge(
    geo_events,
    left_on="geo_event_id",
    right_on="id",
    how="inner",
    suffixes=("_log", "_evt")
)

merged['alert_lag_min'] = (merged['date_created_log'] - merged['date_created_evt']).dt.total_seconds() / 60

merged = merged.merge(
    evac_map[["geo_event_id", "uid_v2"]].drop_duplicates(),
    left_on="id",
    right_on="geo_event_id",
    how="left"
)

merged = merged.merge(
    evac_zones.drop_duplicates("uid_v2"),
    on="uid_v2",
    how="left",
    suffixes=("", "_evac")
)

print("Merged dataset:", merged.shape)

In [None]:
missing_values = pd.DataFrame({
    'Variable': merged.columns,
    'Missing values count': merged.isnull().sum().values,
    'Missing values %': (merged.isnull().sum().values / len(merged) * 100)})

unique_values = pd.DataFrame({
    'Variable': merged.columns,
    'Unique values count': merged.nunique().values})

feature_types = pd.DataFrame({
    'Variable': merged.columns,
    'Data type': merged.dtypes.astype(str)})

summary_df = (missing_values
    .merge(unique_values, on='Variable', how='left')
    .merge(feature_types, on='Variable', how='left'))

summary_df = summary_df.sort_values(by='Missing values %', ascending=False)

summary_df.style.background_gradient(cmap='rocket_r').format({
    'Missing values %': '{:.2f}',
    'Unique values count': '{:,}',
    'Missing values count': '{:,}'})

High Missingness (> 90% missing)

Variables: spotting (99.89%), structure_threat (99.59%), status (98.60%), rate_of_spread (97.04%).

Interpretation: These fields represent specialized radio traffic reports and official evacuation orders. Their high sparsity is expected, as "Extreme" spread or "Structure Threats" occur only in high-severity escalations. For our Alert Engine, these are not predictors, but critical triggers (when present, they override standard priority).

Moderate Missingness (10% - 60%)

Variables: external_status (54.70%), display_name (44.78%), uid_v2 (27.57%), containment (11.96%), is_fps (11.86%).

Interpretation: This missingness reflects the lifecycle of a wildfire. Many incidents are controlled quickly (indicated by is_fps) before they are assigned to specific evacuation zones or formal county-level naming.

Low/No Missingness (0% - 3%)

Variables: acreage (2.94%), date_created_log (0.00%), alert_lag_min (0.00%).

Interpretation: These are our most reliable data points. acreage serves as the primary physical metric, while the complete timeline data (date_created) allows for a robust Alert Lag analysis, which is the core of our predictive modeling.

Unique Values Insights

Incident Scale: Over 42,000 unique id values confirm a vast dataset of distinct wildfire events.

Priority Triggers: notification_type and is_prescribed show very low cardinality (2 unique values), making them ideal categorical filters for segregating planned burns from emergency wildfires.
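The grouping above can be reproduced programmatically. The following is a minimal sketch, assuming illustrative bin edges (10% and 90%) and a hypothetical helper name `missingness_buckets`; the toy frame is synthetic and only mimics the kind of columns found in `merged`.

```python
import numpy as np
import pandas as pd

def missingness_buckets(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the percentage of missing values per column plus a coarse bucket label."""
    pct = frame.isna().mean() * 100
    # Bin edges are illustrative assumptions, not values prescribed by the analysis.
    bucket = pd.cut(
        pct,
        bins=[-0.001, 10, 90, 100],
        labels=["low (<=10%)", "moderate (10-90%)", "high (>90%)"],
    )
    return (
        pd.DataFrame({"missing_pct": pct.round(2), "bucket": bucket.astype(str)})
        .sort_values("missing_pct", ascending=False)
    )

# Tiny synthetic frame (hypothetical values, not the datathon data)
toy = pd.DataFrame({
    "spotting": [np.nan] * 19 + ["yes"],          # 95% missing
    "containment": [100.0] * 15 + [np.nan] * 5,   # 25% missing
    "acreage": np.arange(20, dtype=float),        # complete
})
print(missingness_buckets(toy))
```

On the real data, the same helper could be called as `missingness_buckets(merged)` to list which columns fall into each group.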

In [None]:
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
%matplotlib inline

In [None]:
merged.info()

We do not use `pending_updates`: although it is stored as a numerical variable, it contains only NaN values.

# **Clustering**

In [None]:
df = merged[["containment", "acreage", "alert_lag_min"]].dropna()

In [None]:
for col in df.columns:
    # Drop NaN values for the current column before plotting
    data_to_plot = df[col].dropna()

    # Only plot if there are actual values remaining after dropping NaNs
    if not data_to_plot.empty:
        plt.figure(figsize=(4, 3))
        plt.hist(data_to_plot, bins=20)
        plt.title(f"Histogram for variable: {col}")
        plt.xlabel(col)
        plt.ylabel("Frequency")
        plt.grid(True)
        plt.show()
    else:
        print(f"Skipping histogram for '{col}' as it contains only NaN values.")

* **Containment:** The distribution is extremely concentrated near 100%, indicating that most recorded wildfire events are reported as nearly fully contained, with relatively few observations at lower containment levels.

* **Acreage:** The distribution is highly right-skewed, showing that the majority of fires affect relatively small areas, while a small number of extreme events account for very large burned acreages.

* **Alert lag (minutes):** The distribution is strongly right-skewed, with most alerts issued within relatively short time spans, but with a long tail of cases experiencing very large delays.
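Right skew of this kind can be quantified, and a log transform is one common way to reduce it before distance-based methods such as k-means. The sketch below uses synthetic log-normal data as a hypothetical stand-in for `acreage`; whether a log transform suits the real variables is an assumption that would need checking.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic, hypothetical stand-in for a heavy-tailed variable such as acreage
acreage_like = pd.Series(rng.lognormal(mean=2.0, sigma=1.5, size=5_000))

raw_skew = acreage_like.skew()
log_skew = np.log1p(acreage_like).skew()  # log1p is defined at zero, unlike log

print(f"skewness before log1p: {raw_skew:.2f}")
print(f"skewness after  log1p: {log_skew:.2f}")
```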


**Data Normalization (Min-Max Scaling)**


In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
Clus_dataSet = MinMaxScaler().fit_transform(df)
Clus_dataSet

In [None]:
for i, col in enumerate(df.columns):
    plt.figure(figsize=(6, 4))
    plt.hist(Clus_dataSet[:, i], bins=20)
    plt.title(f"Histogram for Min-Max-scaled variable: {col}")
    plt.xlabel(col)
    plt.ylabel("Frequency")
    plt.grid(True)
    plt.show()

* **Scaled acreage:** After Min-Max scaling, the distribution remains strongly right-skewed, with most observations concentrated near zero and a small number of extreme fires mapped close to the upper bound. Min-Max scaling is a linear rescaling to [0, 1], so it changes the range of the variable but not the shape of its distribution.

* **Scaled alert lag (minutes):** The scaled values are heavily concentrated near the lower end of the range, confirming that most alerts occur relatively quickly, while a limited set of incidents exhibits disproportionately large delays.

* **Scaled containment:** The distribution is almost entirely concentrated near one, reflecting that containment values are high for most observations; rescaling does not alter the underlying lack of variability in this variable.
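The scaling step uses `MinMaxScaler`, which maps each column to [0, 1] via (x - min) / (max - min). The short check below, on synthetic data with a hypothetical `acreage` column, illustrates that this matches the manual formula and leaves skewness unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
toy = pd.DataFrame({"acreage": rng.lognormal(2.0, 1.5, size=2_000)})  # synthetic values

scaled_sklearn = MinMaxScaler().fit_transform(toy)[:, 0]
col = toy["acreage"]
scaled_manual = ((col - col.min()) / (col.max() - col.min())).to_numpy()

same_values = np.allclose(scaled_sklearn, scaled_manual)
skew_before = col.skew()
skew_after = pd.Series(scaled_sklearn).skew()  # unchanged by a linear rescaling
print(same_values, round(skew_before, 3), round(skew_after, 3))
```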


In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np # Import numpy for NaN check

costs = []
K_range = range(2, 11) # Limited to k <= 10 for readability; this is sufficient for this data

# Handle NaN values by dropping rows that contain them
# Convert Clus_dataSet to a pandas DataFrame to use dropna easily, then back to numpy array
import pandas as pd
Clus_dataSet_cleaned = pd.DataFrame(Clus_dataSet).dropna().values

# Check if Clus_dataSet_cleaned is empty after dropping NaNs
if Clus_dataSet_cleaned.shape[0] == 0:
    print("Warning: Clus_dataSet became empty after dropping NaN values. Cannot perform KMeans.")
else:
    for k in K_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(Clus_dataSet_cleaned)
        costs.append(kmeans.inertia_)

    plt.figure(figsize=(10, 6))
    plt.plot(K_range, costs, marker='o', linestyle='--', color='b')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Inertia (Cost)')
    plt.title('Elbow Method: Determining Optimal Clusters for Alert Profiling')
    plt.xticks(K_range)
    plt.grid(True, alpha=0.3)
    plt.show()

**We choose 3 clusters for our analysis because inertia drops sharply from k = 2 to k = 3.**
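Reading an elbow plot is subjective, so a second criterion can help confirm the choice of k. The sketch below computes the silhouette score for several values of k on synthetic, well-separated data; the data and the range of k are assumptions for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (hypothetical, for illustration only)
X_toy, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.8,
    random_state=42,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)  # higher means tighter, better-separated clusters

best_k = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()}, "best k:", best_k)
```

On a dataset with hundreds of thousands of rows, the full silhouette computation is expensive; `silhouette_score` accepts a `sample_size` argument (with `random_state`) to evaluate it on a random subset instead.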

In [None]:
k_means = KMeans(init="k-means++", n_clusters=3, n_init=12, random_state=42)  # fixed seed keeps cluster numbering reproducible
k_means.fit(Clus_dataSet)
labels_km = k_means.labels_
print(labels_km)

In [None]:
df["Clus_km"] = labels_km
df.head(5)

**Cluster centers**

In [None]:
df.groupby('Clus_km').mean()

In [None]:
df['Clus_km'].value_counts()

The cluster centers and their sizes highlight clear quantitative differences between wildfire risk profiles. **Cluster 0**, which contains the majority of observations (≈ 384,000), is characterized by very high containment (≈ 99.1%), moderate average fire size (≈ 29,700 acres), and a mean alert lag of about **8,856 minutes**, indicating routine or controlled incidents that still experience non-negligible delays, possibly due to volume and operational load. **Cluster 1**, with roughly **121,700 observations**, shows similarly high containment (≈ 99%) but an extremely large average acreage (≈ 429,500 acres) and the longest alert lag (≈ 16,078 minutes), reflecting large-scale, complex mega-fires where coordination and scale drive significant delays despite stabilization. **Cluster 2**, the smallest group (≈ 12,600 observations), stands out with very low containment (≈ 23.7%), smaller average fire size (≈ 8,468 acres), and a high alert lag (≈ 13,324 minutes), suggesting that low containment and active fire dynamics can coincide with severe delays even when fires are not large in spatial extent.
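The per-cluster means reported here are expressed in original units, while k-means itself ran on Min-Max-scaled data. Because Min-Max scaling is an affine transform, the two views should agree: inverse-transforming the fitted centroids gives the same numbers as averaging each cluster in original units. The sketch below checks this on synthetic data; the column names are reused only for readability and the values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

cols = ["containment", "acreage", "alert_lag_min"]
# Synthetic stand-in data with three well-separated groups
values, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0, 0], [8, 8, 0], [0, 8, 8]],
    cluster_std=0.5,
    random_state=42,
)
toy = pd.DataFrame(values, columns=cols)

scaler = MinMaxScaler()                       # keep the fitted scaler so it can be reused
scaled = scaler.fit_transform(toy[cols])
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(scaled)
toy["cluster"] = km.labels_

centers_from_scaler = scaler.inverse_transform(km.cluster_centers_)
centers_from_means = toy.groupby("cluster")[cols].mean().to_numpy()
centers_agree = np.allclose(centers_from_scaler, centers_from_means, atol=1e-6)
print(centers_agree)
```

Keeping the fitted scaler in a variable, rather than calling `MinMaxScaler().fit_transform(...)` inline, makes it possible to map centroids back to original units later.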




In [None]:
# Plot a representation of the clusters,
# with acreage on the X axis and alert_lag_min on the Y axis

ax = df[df["Clus_km"] == 0][0:500].plot(
    kind='scatter',
    x='acreage',
    y='alert_lag_min',
    color='DarkBlue',
    label='Cluster 0: Routine/Controlled'
)

df[df["Clus_km"] == 1][0:500].plot(
    kind='scatter',
    x='acreage',
    y='alert_lag_min',
    color='Yellow',
    label='Cluster 1: Mega-Fires (Stable)',
    ax=ax
)

df[df["Clus_km"] == 2][0:500].plot(
    kind='scatter',
    x='acreage',
    y='alert_lag_min',
    color='Red',
    label='Cluster 2: ACTIVE RISK (23% Cont.)',
    ax=ax
)

# We will add logarithmic scale
plt.xscale('log')

plt.title('k-means results: Fire risk')
plt.xlabel('Surface (Acreage) - logarithmic scale')
plt.ylabel('Time (Minutes)')
plt.legend()
plt.show()

The scatter plot shows a clear separation of wildfire incidents into three distinct risk profiles based on burned area and alert delay. Cluster 0 (blue) groups routine or controlled fires, which generally have moderate acreage and shorter alert delays, although some variability remains due to operational complexity. Cluster 1 (yellow) corresponds to mega-fires, characterized by extremely large burned areas but relatively stable and consistent alert timing, reflected by the vertical concentration at very high acreage values. Cluster 2 (red) represents active-risk fires, where lower containment and ongoing fire dynamics are associated with longer and more variable alert delays, even at moderate acreage levels.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Features used for clustering
features = ["containment", "acreage", "alert_lag_min"]

# Cluster centers in original units.
# The points below are colored by df["Clus_km"], i.e. the k-means fit on the Min-Max-scaled
# data, so the plotted centers must describe that same partition. Refitting a second KMeans
# on StandardScaler output would give a different partition with arbitrary label numbering.
# Min-Max scaling is an affine transform, so the per-cluster mean in original units
# coincides with the inverse-transformed k-means centroid.
centers_original = df.groupby("Clus_km")[features].mean().sort_index().to_numpy()

# Coordinates of the centers on the plotted axes
idx_acreage = features.index("acreage")
idx_lag = features.index("alert_lag_min")
center_x = centers_original[:, idx_acreage]
center_y = centers_original[:, idx_lag]

# 7. Final plot for 3 clusters
plt.figure(figsize=(14, 8))

# We will use a Sample of 5000
df_plot = df.sample(n=5000, random_state=42).copy()

scatter = plt.scatter(
    df_plot["acreage"],
    df_plot["alert_lag_min"],
    c=df_plot["Clus_km"],
    cmap='viridis',
    alpha=0.5,
    s=60,
    edgecolors='none'
)

# Add centers (red X-es)
plt.scatter(
    center_x,
    center_y,
    marker="X",
    s=400,
    c="red",
    edgecolor="black",
    linewidth=2,
    label="Centroids (Cluster Centers)",
    zorder=10
)

# Add labels
for i, (x, y) in enumerate(zip(center_x, center_y)):
    plt.text(x, y, f"  C{i}", fontsize=14, fontweight='bold', color='red', zorder=11)


plt.xscale('log')
plt.xlim(0.01, 1000000)
plt.ylim(-1000, 150000)

plt.xlabel("Fire Size (Acreage) - Log Scale", fontsize=12)
plt.ylabel("Alert Lag (Minutes)", fontsize=12)
plt.title("K-means (k=3): Wildfire Risk Profiling\n(C2 = ACTIVE RISK - 23% Containment)", fontsize=15)

# Legend for 3 clusters
handles, _ = scatter.legend_elements(prop="colors", alpha=0.7)
# Map cluster numbers to the profiles identified from the cluster means
labels_3 = ["C0: Routine/Controlled", "C1: Mega-Fires (Stable)", "C2: ACTIVE RISK"]
plt.legend(handles, labels_3, title="Risk Segments", loc="upper left", bbox_to_anchor=(1, 1))

plt.grid(True, which="both", linestyle='--', alpha=0.1)
plt.tight_layout()
plt.show()

The K-means results with three clusters reveal distinct wildfire risk profiles when fire size and alert lag are considered jointly. Cluster C0 (Routine/Controlled) groups the majority of observations, spanning a wide range of fire sizes but generally associated with lower to moderate alert delays, reflecting incidents that are operationally managed despite variability in scale. Cluster C1 (Mega-Fires, Stable) is concentrated at very large acreage values, showing that extremely large fires tend to exhibit more consistent alert timing, likely due to sustained monitoring and established response protocols. Cluster C2 (Active Risk) is defined mainly by low containment (≈ 23.7% on average) rather than by fire size, and shows relatively high and variable alert lags, indicating situations where ongoing fire dynamics increase operational uncertainty and delay alert escalation.


In [None]:
from scipy.stats import f_oneway
import numpy as np

# ANOVA function for our risk variables (containment, acreage, alert_lag_min)
def anova_per_var(df, var_name, cluster_col="Clus_km"):
    # Extracting values for each cluster (0, 1, 2)
    groups = []
    for c in sorted(df[cluster_col].unique()):
        group_data = df.loc[df[cluster_col] == c, var_name].values
        groups.append(group_data)

    # F test and p value
    F, p = f_oneway(*groups)

    print(f"\nANOVA for variable: {var_name}")
    print(f"F-statistic = {F:.3f}")
    print(f"p-value = {p:.10f}")

    if p < 0.05:
        print(f"→ Differences in {var_name} between the 3 risk profiles are statistically significant (α = 0.05).")
    else:
        print(f"→ No statistically significant differences in {var_name} between clusters (α = 0.05).")

# Test
for col in features:
    anova_per_var(df, col)

# **Machine learning modelling**

In [None]:
# === ROUTE1 STEP 1: target (y) ===
T_DELAY_MIN = 60  # "delayed alert" (minutes)


ml_df = merged.copy()

ml_df = ml_df[ml_df["alert_lag_min"].notna()].copy()
ml_df = ml_df[ml_df["alert_lag_min"] >= 0].copy()

ml_df["delayed_alert"] = (ml_df["alert_lag_min"] > T_DELAY_MIN).astype(int)

ml_df["delayed_alert"].value_counts(dropna=False)


In [None]:
# === ROUTE1 STEP 2: features (X) ===

features = ["acreage", "containment", "is_fps", "is_prescribed"]

features = [c for c in features if c in ml_df.columns]

X = ml_df[features].copy()
y = ml_df["delayed_alert"].copy()

for c in ["acreage", "containment", "is_fps", "is_prescribed"]:
    if c in X.columns:
        X[c] = pd.to_numeric(X[c], errors="coerce")

mask = X.notna().all(axis=1) & y.notna()
X = X.loc[mask].copy()
y = y.loc[mask].copy()

print("X shape:", X.shape, "y shape:", y.shape)
print(y.value_counts(dropna=False))


**Train/Test split -- 70 to 30**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,
    random_state=42,
    stratify=y
)


**Decision tree**

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=6, random_state=42)
dt.fit(X_train, y_train)


**Random forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)


**XGBoost**

In [None]:
xgb_model = None
try:
    from xgboost import XGBClassifier
    xgb_model = XGBClassifier(
        n_estimators=400,
        max_depth=5,
        learning_rate=0.05,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42,
        eval_metric="logloss"
    )
    xgb_model.fit(X_train, y_train)
    print("XGBoost trained.")
except Exception as e:
    print("XGBoost not available / failed:", e)


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

def eval_model(model, name):
    print(f"\n================ {name} =================")

    # Predict on the test set
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)

    # Confusion Matrix
    fig, ax = plt.subplots(figsize=(5, 4))
    sns.heatmap(
        cm,
        annot=True,
        fmt="d",
        cmap="Blues",
        cbar=False,
        ax=ax
    )
    ax.set_title(f"Confusion Matrix – {name}")
    ax.set_xlabel("Predicted label")
    ax.set_ylabel("True label")
    plt.tight_layout()
    plt.show()

    # === Statistics ===
    print("Classification report:")
    print(classification_report(y_test, y_pred, digits=4))

    # ===  ROC AUC  ===
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_prob)
        print(f"ROC AUC: {auc:.4f}")
    else:
        print("ROC AUC: not available (no predict_proba)")


In [None]:
eval_model(dt, "Decision Tree")
eval_model(rf, "Random Forest")
if xgb_model is not None:  # XGBoost is optional; skip it if the import or fit failed
    eval_model(xgb_model, "XGBoost")


**Decision Tree**

The confusion matrix shows that the model correctly identifies 130,283 delayed alerts (true positives) while missing 601 delayed cases (false negatives), resulting in a very high recall for the delayed class of 0.9954. However, performance on the non-delayed class is weak: only 874 non-delayed alerts are correctly classified, while 9,613 are incorrectly flagged as delayed, which explains the very low recall of 0.0833 for class 0. The overall accuracy is 0.9278, but this is driven mainly by the dominance of the delayed class. The ROC AUC of 0.8355 indicates limited discrimination ability, reflecting the model's difficulty in separating non-delayed alerts from delayed ones.

**Random Forest**

The confusion matrix indicates 130,178 true positives and 706 false negatives, leading to a delayed-alert recall of 0.9946, meaning that almost all delayed alerts are detected. For the non-delayed class, the model correctly classifies 1,000 cases, but still misclassifies 9,487 non-delayed alerts as delayed, yielding a recall of 0.0954 for class 0. The overall accuracy is 0.9279, similar to the Decision Tree, but with slightly better identification of non-delayed cases. The ROC AUC of 0.8474 reflects improved class separation compared to the Decision Tree, although misclassification of non-delayed alerts remains substantial.

**XGBoost**

The confusion matrix shows 130,249 true positives and 635 false negatives, resulting in a delayed-alert recall of 0.9951, which means delayed alerts are almost never missed. For non-delayed alerts, 860 cases are correctly identified, while 9,627 are incorrectly labeled as delayed, corresponding to a recall of 0.0820 for class 0. The overall accuracy reaches 0.9274, again driven by strong performance on the delayed class. The ROC AUC of 0.8465 is comparable to Random Forest (slightly lower in this run) and clearly higher than the Decision Tree, although misclassification of non-delayed alerts remains substantial here as well.
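The figures quoted in these paragraphs follow directly from the four confusion-matrix counts. As a worked check, the sketch below recomputes them for the Decision Tree counts reported above, and adds two reference values: balanced accuracy and the accuracy of a trivial rule that always predicts "delayed". The helper name `metrics_from_counts` is hypothetical.

```python
def metrics_from_counts(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive summary metrics from binary confusion-matrix counts (class 1 = delayed)."""
    total = tn + fp + fn + tp
    recall_pos = tp / (tp + fn)   # share of delayed alerts that were detected
    recall_neg = tn / (tn + fp)   # share of non-delayed alerts that were detected
    return {
        "accuracy": (tp + tn) / total,
        "recall_delayed": recall_pos,
        "recall_non_delayed": recall_neg,
        "balanced_accuracy": (recall_pos + recall_neg) / 2,
        "always_delayed_baseline": (tp + fn) / total,
    }

# Decision Tree counts as reported in the text above
dt_metrics = metrics_from_counts(tn=874, fp=9_613, fn=601, tp=130_283)
for key, value in dt_metrics.items():
    print(f"{key:>24s}: {value:.4f}")
```

The baseline shows how close the overall accuracy is to what the class imbalance alone would produce, which is why class-0 recall and ROC AUC are the more informative numbers here.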

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve
)

def get_model_scores(model, X):

    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    if hasattr(model, "decision_function"):
        return model.decision_function(X)
    return None

def evaluate_models(models_dict, X_test, y_test):
    rows = []
    roc_data = {}

    for name, model in models_dict.items():
        y_pred = model.predict(X_test)
        scores = get_model_scores(model, X_test)

        # classic metrics
        acc = accuracy_score(y_test, y_pred)
        prec = precision_score(y_test, y_pred, zero_division=0)
        rec = recall_score(y_test, y_pred, zero_division=0)
        f1 = f1_score(y_test, y_pred, zero_division=0)

        # ROC AUC + ROC curve
        auc = np.nan
        fpr = tpr = thr = None
        if scores is not None:
            auc = roc_auc_score(y_test, scores)
            fpr, tpr, thr = roc_curve(y_test, scores)
            roc_data[name] = (fpr, tpr, auc)

        rows.append({
            "Model": name,
            "Accuracy": acc,
            "Precision": prec,
            "Recall": rec,
            "F1": f1,
            "ROC_AUC": auc
        })

    results_df = pd.DataFrame(rows).sort_values(by="ROC_AUC", ascending=False)
    return results_df, roc_data

# === MODELS ===
models = {
    "Decision Tree": dt,
    "Random Forest": rf
}
if xgb_model is not None:
    models["XGBoost"] = xgb_model

results_df, roc_data = evaluate_models(models, X_test, y_test)

print("=== Model Comparison Table ===")
display(results_df.style.format({
    "Accuracy": "{:.4f}",
    "Precision": "{:.4f}",
    "Recall": "{:.4f}",
    "F1": "{:.4f}",
    "ROC_AUC": "{:.4f}"
}))

# === ROC CURVES===
plt.figure(figsize=(8, 6))
for name, (fpr, tpr, auc) in roc_data.items():
    plt.plot(fpr, tpr, label=f"{name} (AUC={auc:.3f})")

#random line
plt.plot([0, 1], [0, 1], linestyle="--", label="Random (AUC=0.5)")

plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves – Model Comparison")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()


The results indicate that all three models reach very similar accuracy, approximately 92.7–92.8%. This figure must be read against the class imbalance: about 92.6% of test observations are delayed (130,884 of 141,371), so a rule that always predicts "delayed" would score almost the same. Accuracy alone therefore does not show that the selected features separate delayed from non-delayed alerts; ROC AUC and the recall for the non-delayed class are more informative for that purpose.

Among the evaluated models, Random Forest achieves the highest ROC AUC value (≈ 0.847), followed very closely by XGBoost (≈ 0.846), while the Decision Tree records a lower value (≈ 0.836). Although the differences between Random Forest and XGBoost are small, the ROC AUC values indicate that the ensemble-based models provide superior discrimination between delayed and non-delayed alerts across different decision thresholds.

The ROC curves further confirm this result, as the Random Forest and XGBoost curves consistently lie above the Decision Tree curve for most values of the false positive rate. This indicates that, for the same proportion of false alarms, these models are able to correctly identify a larger share of delayed alerts. In practical terms, both ensemble models provide more reliable probabilistic signals that can be adjusted to different operational thresholds, without a disproportionate increase in unnecessary alerts.

XGBoost and Random Forest both demonstrate strong and stable performance, while the single Decision Tree, although slightly weaker in terms of discrimination power, still performs well and offers greater interpretability. Overall, the results support the use of an ensemble-based approach for the alerting system, as these models offer the best balance between predictive accuracy and risk discrimination, which is essential for supporting timely and effective alert decisions.
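Adjusting the operating threshold to a tolerated false-alarm rate can be done directly from the ROC curve. The sketch below is a minimal illustration on a synthetic imbalanced problem; the 20% false-positive budget (`MAX_FPR`) and the data are assumptions, not values from this analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem (hypothetical; roughly 90% positives, like "delayed")
X_toy, y_toy = make_classification(
    n_samples=4_000, n_features=6, weights=[0.1, 0.9], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, probs)
MAX_FPR = 0.20                         # assumed operational tolerance for false alarms
ok = np.where(fpr <= MAX_FPR)[0]       # operating points within the budget
best = ok[np.argmax(tpr[ok])]          # highest recall among those points
print(f"threshold={thresholds[best]:.3f}  FPR={fpr[best]:.3f}  TPR={tpr[best]:.3f}")
```

In a real deployment the threshold would be chosen on a validation split rather than on the test set, so that the reported test metrics stay unbiased.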


In [None]:
best_model_name = results_df.iloc[0]["Model"]
final_model = models[best_model_name]

print("Final model selected:", best_model_name)


In [None]:
def alert_policy(p_delay: float) -> str:
    if p_delay >= 0.70:
        return "HIGH"
    elif p_delay >= 0.40:
        return "MEDIUM"
    else:
        return "LOW"


In [None]:
PLAYBOOK = {
    "LOW": {
        "EN": "Low risk. Monitor official updates and keep notifications enabled.",
        "RO": "Risc redus. Urmăriți actualizările oficiale și păstrați notificările active.",
        "ES": "Riesgo bajo. Siga las actualizaciones oficiales y mantenga las notificaciones activas."
    },
    "MEDIUM": {
        "EN": "Medium risk. Prepare for evacuation. Stay alert and review local guidance.",
        "RO": "Risc mediu. Pregătiți-vă pentru evacuare. Rămâneți în alertă și urmați indicațiile locale.",
        "ES": "Riesgo medio. Prepárese para evacuar. Manténgase alerta y siga las indicaciones locales."
    },
    "HIGH": {
        "EN": "High risk. If evacuation is ordered, leave immediately. Follow official instructions.",
        "RO": "Risc ridicat. Dacă există ordin de evacuare, plecați imediat. Urmați instrucțiunile oficiale.",
        "ES": "Riesgo alto. Si hay orden de evacuación, salga inmediatamente. Siga las instrucciones oficiales."
    }
}

def build_alert(p_delay: float, lang="RO"):
    level = alert_policy(p_delay)
    msg = PLAYBOOK[level].get(lang, PLAYBOOK[level]["EN"])
    return level, msg


In [None]:
def predict_delay_prob(model, X_row):
    if hasattr(model, "predict_proba"):
        return float(model.predict_proba(X_row)[:, 1][0])
    if hasattr(model, "decision_function"):
        score = float(model.decision_function(X_row)[0])
        return 1 / (1 + np.exp(-score))
    raise ValueError("Model has no probability or decision function.")

# Sample on records of X_test
demo_idx = X_test.sample(n=min(100, len(X_test)), random_state=42).index
demo_X = X.loc[demo_idx]

cols_show = [c for c in ["name", "acreage", "containment", "rate_of_spread", "structure_threat", "spotting", "alert_lag_min"] if c in ml_df.columns]
demo_info = ml_df.loc[demo_idx, cols_show].copy()

probs = []
levels = []
msg_ro = []
msg_en = []
msg_es = []
for i in range(demo_X.shape[0]):
    p = predict_delay_prob(final_model, demo_X.iloc[[i]])
    lv, mro = build_alert(p, "RO")
    _, men = build_alert(p, "EN")
    _, mes = build_alert(p, "ES")
    probs.append(p); levels.append(lv); msg_ro.append(mro); msg_en.append(men) ; msg_es.append(mes)

demo_info["p_delayed"] = probs
demo_info["alert_level"] = levels
demo_info["message_RO"] = msg_ro
demo_info["message_EN"] = msg_en
demo_info["message_ES"] = msg_es

demo_info.sort_values("p_delayed", ascending=False)

In [None]:
# What each cluster looks like
display(df.groupby("Clus_km")[["containment", "acreage", "alert_lag_min"]].mean())

T_DELAY_MIN = 60
df_tmp = df.copy()
df_tmp["delayed_alert"] = (df_tmp["alert_lag_min"] > T_DELAY_MIN).astype(int)

cluster_delay = df_tmp.groupby("Clus_km")["delayed_alert"].mean().rename("delayed_rate")
display(cluster_delay.to_frame())


In [None]:
import folium
from IPython.display import display, HTML

# Checking required columns, including the fields used in the popup
required_cols = {"lat", "lng", "alert_lag_min", "name", "containment", "acreage"}
if not required_cols.issubset(merged.columns):
    print("Missing required columns:", required_cols - set(merged.columns))
else:
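    # Drop incomplete rows first, then sample for performance.
    # Sampling must be capped by the filtered length, otherwise sample() can fail.
    geo_valid = merged.dropna(subset=["lat", "lng", "alert_lag_min"])
    geo_delay = geo_valid.sample(n=min(2000, len(geo_valid)), random_state=42).copy()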

    geo_delay["Lag (min)"] = geo_delay["alert_lag_min"].round(1)

    # Popup
    geo_delay["Popup"] = (
        "<b>" + geo_delay["name"].astype(str) + "</b><br>"
        "Alert lag: " + geo_delay["Lag (min)"].astype(str) + " min<br>"
        "Containment: " + geo_delay["containment"].astype(str) + "%<br>"
        "Acreage: " + geo_delay["acreage"].astype(str)
    )

    # Map
    m = folium.Map(
        location=[geo_delay["lat"].mean(), geo_delay["lng"].mean()],
        zoom_start=6,
        tiles="CartoDB positron"
    )

    # Markers according to ML thresholds
    for _, row in geo_delay.iterrows():
        lag = row["alert_lag_min"]

        if lag <= 60:
            color = "#4CAF50"      # green - OK
        elif lag <= 1440:
            color = "#F4A261"      # orange - delayed
        else:
            color = "#9B2226"      # dark red - severe delay

        folium.CircleMarker(
            location=[row["lat"], row["lng"]],
            radius=4,
            color=color,
            fill=True,
            fill_opacity=0.7,
            popup=folium.Popup(row["Popup"], max_width=300)
        ).add_to(m)

    # Legend
    legend_html = """
     <div style="
         position: fixed;
         bottom: 50px; left: 50px;
         width: 220px; height: 130px;
         background-color: white;
         border:2px solid grey;
         z-index:9999;
         font-size:14px;
         box-shadow: 2px 2px 4px rgba(0,0,0,0.3);
         padding: 8px;
     ">
         <b>Alert Lag (minutes)</b><br>
         <span style="color:#4CAF50;">●</span> &nbsp; ≤ 60 min (OK)<br>
         <span style="color:#F4A261;">●</span> &nbsp; 60–1440 min (Delayed)<br>
         <span style="color:#9B2226;">●</span> &nbsp; &gt; 1440 min (Severe delay)
     </div>
    """

    m.get_root().html.add_child(folium.Element(legend_html))

    # Save + display
    m.save("delay_map_ml_thresholds.html")
    display(HTML(m._repr_html_()))

    print("Map saved as: delay_map_ml_thresholds.html")



# **Results**

**Model Performance Metrics**

The supervised learning models demonstrate strong and consistent predictive performance in identifying delayed wildfire alerts. All evaluated models achieve an overall accuracy of approximately 92.7–92.8%, which suggests that alert delays can be predicted from the selected operational features. Because the precision and recall reported below imply that delayed alerts make up the large majority of cases, accuracy is best read together with recall and ROC AUC rather than on its own. Taken together, the results indicate that the problem formulation is appropriate and that the chosen variables capture meaningful signals related to alert delays.

Among the evaluated models, Random Forest and XGBoost exhibit the strongest overall performance, with very similar results across all metrics. Both models achieve high precision (≈ 0.93) and exceptionally high recall (above 99%) for delayed alerts, indicating that nearly all truly delayed cases are correctly identified. This property is essential in an alerting context, where failing to detect delayed alerts represents the most costly type of error. In terms of discrimination power, Random Forest attains the highest ROC AUC (≈ 0.847), closely followed by XGBoost (≈ 0.846), while the Decision Tree records a lower value (≈ 0.836). Although the numerical differences are modest, they are consistent across ROC curves and evaluation metrics, highlighting the advantage of ensemble-based methods over a single-tree model.
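
To make the metrics above concrete, the sketch below computes precision, recall, accuracy, and a rank-based ROC AUC from scratch. The labels and scores are small hypothetical values, not the notebook's results, and the helper names `confusion_counts` and `rank_auc` are illustrative only; in practice `sklearn.metrics` offers equivalent functions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = delayed alert."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def rank_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example.")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 8 delayed alerts (1) and 2 on-time alerts (0)
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.40, 0.60, 0.20]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cut-off is an assumption

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / len(y_true)
# Accuracy obtained by always predicting the more frequent class
majority_baseline = max(sum(y_true), len(y_true) - sum(y_true)) / len(y_true)
auc = rank_auc(y_true, scores)

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
print(f"majority baseline={majority_baseline:.3f} auc={auc:.3f}")
```

In this toy case accuracy is no higher than the majority-class baseline even though the ranking is good, which illustrates why recall and ROC AUC are reported alongside accuracy when one class dominates.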

**Clustering-Based Risk Profiling**

The unsupervised clustering analysis identifies three statistically distinct wildfire risk profiles, each characterized by clearly different numerical patterns. One cluster corresponds to very large-scale incidents with near-total containment (≈ 99%) but the longest alert delays (over 16,000 minutes) and the highest delayed alert rate (above 98%), highlighting the operational complexity associated with mega-fires. A second cluster also exhibits high containment (≈ 99.1%) but significantly smaller average fire size (≈ 29,700 acres) and shorter alert delays (≈ 8,856 minutes), indicating faster alert handling when incident scale is reduced. The third cluster is defined by very low containment (≈ 23.7%), moderate fire size (≈ 8,468 acres), and long alert delays (over 13,300 minutes), demonstrating that low containment alone can drive substantial alert delays, independently of fire scale.

**Key Findings and Visual Insights**

Visualizations support and reinforce these findings. Histograms reveal highly skewed distributions for acreage and alert lag, justifying normalization prior to clustering. The elbow method clearly indicates three clusters as the optimal choice, balancing interpretability and explanatory power. Scatter plots in logarithmic scale show clear separation between routine, mega-fire, and active-risk profiles, while confusion matrices and ROC curves illustrate the strong predictive capability of the supervised models, particularly the ensemble approaches. Spatial visualizations further contextualize alert delays geographically, linking model outputs to real-world locations.
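
The effect of normalizing a skewed feature before clustering can be sketched as follows. The preprocessing used in the notebook is not reproduced here; a `log1p` transform followed by z-scoring is one common choice and is an assumption, as are the toy acreage values and the helper names.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical acreage values with a long right tail
acreage = [1, 2, 5, 10, 20, 50, 100, 1000, 50000]

# log(1 + x) compresses the tail and stays defined for zero-acre records
log_acreage = [math.log1p(a) for a in acreage]
scaled = standardize(log_acreage)

print(f"skew raw={skewness(acreage):.2f}  skew log={skewness(log_acreage):.2f}")
```

Without a step like this, a distance-based method such as k-means would be dominated by the few largest incidents, since their raw acreage dwarfs every other feature.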

**Model Selection and Operational Implications**

Based on quantitative performance metrics and visual diagnostics, an ensemble-based model is selected for the alerting framework, with Random Forest and XGBoost both representing suitable choices due to their strong discrimination power and extremely high recall for delayed alerts. When combined with predefined probability thresholds, their outputs can be translated into actionable alert levels that distinguish low-risk situations from those requiring early escalation. The alignment between clustering-derived risk profiles and supervised predictions further validates the framework, as higher delay probabilities are consistently assigned to incidents characterized by extreme acreage or low containment.
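
The translation from a delay probability to an alert level can be sketched as a pair of cut-offs. The function name `sketch_alert_level`, the threshold values 0.50 and 0.80, and the level names `LOW` and `MEDIUM` are assumptions made for illustration; they are not the thresholds used by the notebook's `alert_policy`.

```python
def sketch_alert_level(p_delay, medium_at=0.50, high_at=0.80):
    """Map a delay probability to a level name using two cut-offs (assumed values)."""
    if not 0.0 <= p_delay <= 1.0:
        raise ValueError("p_delay must be a probability in [0, 1].")
    if not medium_at < high_at:
        raise ValueError("medium_at must be below high_at.")
    if p_delay >= high_at:
        return "HIGH"
    if p_delay >= medium_at:
        return "MEDIUM"
    return "LOW"


# Values at and around the cut-offs show which side each boundary falls on
for p in (0.10, 0.50, 0.79, 0.80, 0.97):
    print(p, sketch_alert_level(p))
```

Keeping the cut-offs as parameters makes it easy to tune them against the cost of missed delayed alerts without touching the model itself.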

**Limitations and Ethical Considerations**

Several limitations should be acknowledged. The analysis relies on historical incident data and does not incorporate real-time sensor inputs or physical fire-spread modeling. Population exposure and individual-level vulnerability are not directly modeled, and the system is designed to support decision-making rather than automate evacuation orders. From an ethical perspective, the framework prioritizes transparency and interpretability, ensuring that alert decisions can be audited and explained, while minimizing the risk of systematically missing high-risk delayed alerts.

## Team Contributions

| Name                   | Contributions                |
|------------------------|------------------------------|
| Țilică Mihnea David    | EDA, Model Validation        |
| Radu Alexandru Claudiu | EDA, Model Testing           |
| Zamfir Robert Dan      | Clustering, Model Training   |
| Sasu Sabrina           | Clustering, Model Validation |
| Săndulescu Crina       | Model Testing, Data Cleaning |
| Sandu Bianca Antonia   | Model Training, Data Cleaning |