# Random Forest Experiments with Engineered Features & Probabilities



In this notebook, we develop a hybrid model that integrates both the probabilistic baseline estimates (obtained in Notebook 5) and the engineered features used in previous machine learning models (lagged severity, time features, and weather data). The goal of this approach is to combine long-term temporal patterns (captured by the baseline probabilities) with recent contextual information (captured by lag features and weather), in order to improve the model's ability to predict traffic severity levels.

The previously calculated class probabilities are merged into the training dataset and provided as additional input features to a Random Forest classifier, which has shown competitive performance in earlier experiments.



In [1]:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

import joblib


#Load engineered data with lags and weather
df = pd.read_csv("../data/engineered_traffic_with_lags_and_weather.csv")

# Parse timestamp if not already
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['weekday'] = pd.to_datetime(df['timestamp']).dt.dayofweek


# Load baseline probabilities
prob_df = pd.read_csv("../results/baseline_probabilities.csv")

# Merge probabilities into main dataset
df_merged = pd.merge(
    df, 
    prob_df, 
    on=['road', 'hour', 'weekday'], 
    how='left'
)

# Fill missing probabilities with 0 (
df_merged[['prob_severity_0', 'prob_severity_1', 'prob_severity_2']] = df_merged[
    ['prob_severity_0', 'prob_severity_1', 'prob_severity_2']
].fillna(0)


# Prepare final features

# Define features to use
feature_cols = [
    'prev_1h_severity',
    'prev_2h_severity',
    'temperature_2m',
    'precipitation',
    'rain',
    'snowfall',
    'wind_speed_10m',
    'wind_gusts_10m',
    'cloud_cover',
    'prob_severity_0',
    'prob_severity_1',
    'prob_severity_2'
]

# Target variable
target = 'severity_level'

# Prepare train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df_merged[feature_cols],
    df_merged[target],
    test_size=0.3,
    random_state=42,
    stratify=df_merged[target]
)

# Train Random Forest model

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

#
# Evaluate

y_pred = model.predict(X_test)

report = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

print("Hybrid Model (RF + Probabilities + Weather + Recent History)")
print("\nAccuracy:", round(accuracy, 4))
print("\nClassification Report:\n", report)
print("\nConfusion Matrix:\n", conf_matrix)

# Save the model
joblib.dump(model, "../models/8_hybrid_model.joblib")
print("Hybrid model saved to ../models/8_hybrid_model.joblib")


Hybrid Model (RF + Probabilities + Weather + Recent History)

Accuracy: 0.7529

Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.92      0.86     18933
           1       0.25      0.13      0.17      3547
           2       0.15      0.06      0.09      1367

    accuracy                           0.75     23847
   macro avg       0.41      0.37      0.37     23847
weighted avg       0.69      0.75      0.71     23847


Confusion Matrix:
 [[17420  1166   347]
 [ 2974   449   124]
 [ 1119   162    86]]
Hybrid model saved to ../models/8_hybrid_model.joblib


**Results interpretation**

- The hybrid model achieves an accuracy of ~75%, comparable to previous models.
- It performs well on class 0 ("Good") — precision 0.81, recall 0.92.
- The model still struggles on minority classes (1="Minor", 2="Serious"), with limited recall and precision — likely due to class imbalance.
- Compared to the models in Notebooks 7a and 7b, adding baseline probabilities improves class 1 and 2 slightly (especially recall).
- Overall, this model demonstrates that combining prior probabilities and recent history adds some predictive benefit, though minority class prediction remains difficult.



## Class weighting

Since the dataset is heavily imbalanced, now I will apply class weighting in the Random Forest classifier, automatically adjusting the penalty for each class inversely proportional to its frequency.



In [2]:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split


df = pd.read_csv("../data/engineered_traffic_with_lags_and_weather.csv")

# Parse timestamp 
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['weekday'] = pd.to_datetime(df['timestamp']).dt.dayofweek

# Load baseline probabilities
prob_df = pd.read_csv("../results/baseline_probabilities.csv")

# Merge probabilities into main dataset
df_merged = pd.merge(
    df, 
    prob_df, 
    on=['road', 'hour', 'weekday'], 
    how='left'
)


prob_df = prob_df.rename(columns={
    'p_good': 'prob_severity_0',
    'p_minor': 'prob_severity_1',
    'p_serious': 'prob_severity_2'
})

df_merged = pd.merge(
    df, 
    prob_df, 
    on=['road', 'hour', 'weekday'], 
    how='left'
)

# Fillinh missing probabilities with 0 
df_merged[['prob_severity_0', 'prob_severity_1', 'prob_severity_2']] = df_merged[
    ['prob_severity_0', 'prob_severity_1', 'prob_severity_2']
].fillna(0)

# Prepare final features

# Define features to use
feature_cols = [
    'prev_1h_severity',
    'prev_2h_severity',
    'temperature_2m',
    'precipitation',
    'rain',
    'snowfall',
    'wind_speed_10m',
    'wind_gusts_10m',
    'cloud_cover',
    'prob_severity_0',
    'prob_severity_1',
    'prob_severity_2'
]

# Target variable
target = 'severity_level'

# Prepare train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df_merged[feature_cols],
    df_merged[target],
    test_size=0.3,
    random_state=42,
    stratify=df_merged[target]
)

# Train Random Forest model with class weighting

model_weighted = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',
    random_state=42
)
model_weighted.fit(X_train, y_train)

# Evaluate

y_pred_weighted = model_weighted.predict(X_test)

report_weighted = classification_report(y_test, y_pred_weighted)
conf_matrix_weighted = confusion_matrix(y_test, y_pred_weighted)
accuracy_weighted = accuracy_score(y_test, y_pred_weighted)

print("Hybrid Model (RF + Probabilities + Weather + Recent History) with Class Weighting")
print("\nAccuracy:", round(accuracy_weighted, 4))
print("\nClassification Report:\n", report_weighted)
print("\nConfusion Matrix:\n", conf_matrix_weighted)


# Save the model
joblib.dump(model_weighted, "../models/8b_weighted_model.joblib")
print("Weighted model saved to ../models/8b_weighted_model.joblib")

Hybrid Model (RF + Probabilities + Weather + Recent History) with Class Weighting

Accuracy: 0.7206

Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.87      0.84     18933
           1       0.22      0.16      0.18      3547
           2       0.12      0.07      0.09      1367

    accuracy                           0.72     23847
   macro avg       0.38      0.37      0.37     23847
weighted avg       0.68      0.72      0.70     23847


Confusion Matrix:
 [[16527  1842   564]
 [ 2844   564   139]
 [ 1075   198    94]]
Weighted model saved to ../models/8b_weighted_model.joblib


**Results interpretation**

- Overall accuracy dropped slightly (as expected, since we're forcing model to care more about rare classes).
- Recall improved for minority classes 1 and 2 (more correct detections of delays).
- Precision for rare classes remains low — not surprising given class imbalance.

So it appears that class weighting helps recover more rare cases.

## Additional Time-based Features
In this step, we extend the hybrid model by incorporating additional time-based features:

- is_weekend: indicates whether the day is Saturday or Sunday.

- is_rush_hour: flags typical morning (7–9 AM) and evening (4–6 PM) rush periods.

- day_of_week: captures the weekday index (Monday = 0, Sunday = 6).

These features aim to help the model capture temporal patterns in traffic behavior which may not be fully captured by hour alone. The model still uses recent traffic history, weather features, and baseline severity probabilities.

In [3]:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split


df = pd.read_csv("../data/engineered_traffic_with_lags_and_weather.csv")

# Parse timestamp if not already
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['weekday'] = df['timestamp'].dt.dayofweek

# Add additional time features
df['is_weekend'] = df['weekday'].isin([5, 6]).astype(int)
df['is_rush_hour'] = df['hour'].isin([7, 8, 9, 16, 17, 18]).astype(int)
df['day_of_week'] = df['weekday']  # (for modeling)


# Load baseline probabilities
prob_df = pd.read_csv("../results/baseline_probabilities.csv")

# Merge probabilities into main dataset
df_merged = pd.merge(
    df, 
    prob_df, 
    on=['road', 'hour', 'weekday'], 
    how='left'
)

# Rename columns if necessary (in case old file format)
if 'p_good' in df_merged.columns:
    df_merged.rename(columns={'p_good': 'prob_severity_0',
                               'p_minor': 'prob_severity_1',
                               'p_serious': 'prob_severity_2'}, inplace=True)

# Fill missing probabilities conservatively
df_merged[['prob_severity_0', 'prob_severity_1', 'prob_severity_2']] = df_merged[
    ['prob_severity_0', 'prob_severity_1', 'prob_severity_2']
].fillna(0)

# --------------------------------------------
# Features for model

feature_cols = [
    'prev_1h_severity',
    'prev_2h_severity',
    'temperature_2m',
    'precipitation',
    'rain',
    'snowfall',
    'wind_speed_10m',
    'wind_gusts_10m',
    'cloud_cover',
    'prob_severity_0',
    'prob_severity_1',
    'prob_severity_2',
    'is_weekend',
    'is_rush_hour',
    'day_of_week'
]

target = 'severity_level'

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    df_merged[feature_cols],
    df_merged[target],
    test_size=0.3,
    random_state=42,
    stratify=df_merged[target]
)

# Train model (keeping balanced weights from previous step)

model = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

# Evaluate

y_pred = model.predict(X_test)

report = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

print("Hybrid Model (RF + Probabilities + Weather + Recent History + Time Features)")
print("\nAccuracy:", round(accuracy, 4))
print("\nClassification Report:\n", report)
print("\nConfusion Matrix:\n", conf_matrix)

joblib.dump(model, "../models/8c_best_hybrid_model.joblib")
print("Model saved to ../models/8c_best_hybrid_model.joblib")


Hybrid Model (RF + Probabilities + Weather + Recent History + Time Features)

Accuracy: 0.7169

Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.87      0.84     18933
           1       0.21      0.16      0.19      3547
           2       0.12      0.08      0.09      1367

    accuracy                           0.72     23847
   macro avg       0.38      0.37      0.37     23847
weighted avg       0.68      0.72      0.70     23847


Confusion Matrix:
 [[16410  1936   587]
 [ 2800   582   165]
 [ 1063   201   103]]
Model saved to ../models/8c_best_hybrid_model.joblib


**Results interpretation**

- Accuracy dropped very slightly again (from ~0.72 before to 0.7169 now).

- Class 0 (Good) is still predicted quite well.

- Minor and Serious Delays (classes 1 and 2) continue to be under-predicted, though recall for class 1 is a bit better than for class 2.

- The time features I added (weekend, rush hour, day of week) don’t seem to strongly affect the model — likely because the traffic patterns in London don’t exhibit very sharp peaks in my current dataset, or the historical features already capture most of that variability.

Adding extra time-based features did not lead to significant improvements in overall model accuracy or class balance. This may indicate that temporal effects are already partially captured through the combination of recent history (lag features), weather, and baseline severity probabilities.



### Hyperparameter optimization


Next, we perform hyperparameter optimization on the Random Forest model using GridSearchCV. The goal is to explore different configurations (number of estimators, tree depth, splitting rules, and leaf sizes) to improve the model's ability to handle class imbalance and improve prediction accuracy across all severity levels. The macro F1 score is used as the optimization metric to give equal weight to each class.



In [4]:

from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'class_weight': ['balanced']  # keep class weighting active
}

# Initialize RandomForestClassifier
rf = RandomForestClassifier(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=3,
    scoring='f1_macro',  # we use macro f1 to balance class imbalance
    verbose=2,
    n_jobs=-1
)

# Fit
grid_search.fit(X_train, y_train)

# Best model
best_rf = grid_search.best_estimator_

# Predict with best model
y_pred = best_rf.predict(X_test)

# Evaluation
report = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

print("Random Forest with Hyperparameter Tuning")
print("Best parameters found:", grid_search.best_params_)
print("\nAccuracy:", round(accuracy, 4))
print("\nClassification Report:\n", report)
print("\nConfusion Matrix:\n", conf_matrix)

joblib.dump(best_rf, "../models/8d_rf_hyperparam_tuned.joblib")
print("Tuned model saved to ../models/8d_rf_hyperparam_tuned.joblib")

# Optional: save the full GridSearchCV object if you want to inspect later
joblib.dump(grid_search, "../models/8d_rf_gridsearchcv.joblib")


Fitting 3 folds for each of 24 candidates, totalling 72 fits
Random Forest with Hyperparameter Tuning
Best parameters found: {'class_weight': 'balanced', 'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 200}

Accuracy: 0.5607

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.58      0.71     18933
           1       0.26      0.43      0.32      3547
           2       0.15      0.65      0.25      1367

    accuracy                           0.56     23847
   macro avg       0.44      0.55      0.42     23847
weighted avg       0.77      0.56      0.62     23847


Confusion Matrix:
 [[10964  4101  3868]
 [  926  1512  1109]
 [  186   287   894]]
Tuned model saved to ../models/8d_rf_hyperparam_tuned.joblib


['../models/8d_rf_gridsearchcv.joblib']

**Results interpretation**

Best parameters found:

- n_estimators = 200
- max_depth = 10
- min_samples_leaf = 2
- min_samples_split = 2
- class_weight = balanced

Accuracy dropped from ~75% (original RF) to 56% — which at first looks worse.

BUT:

Recall for minority classes (severity 1 and severity 2) significantly increased.

For severity 2 (serious):

- Recall improved from ~6% → 65%.
- F1-score also improved compared to before.

The model sacrifices some accuracy on the majority class (severity 0), but gains sensitivity on minority classes.
This is a typical trade-off when handling imbalanced classification:

Before tuning, the model was overly biased toward predicting the majority class.
After tuning, it started identifying minority cases more often — which is valuable for rare but important traffic events.


**Conclusion** 

The drop in accuracy to 56% can still be considered an improvement, depending on the objective.
The goal is not just overall accuracy.
The problem is imbalanced classification. The vast majority of samples are “Good” (severity 0), so a model can cheat — by always predicting severity 0 — and still get ~80% accuracy.

But that kind of model is useless if it completely fails to detect the minority classes (which represent actual traffic disruptions).

| Class       | Recall Before | Recall After |
| ----------- | ------------- | ------------ |
| 0 (Good)    | 92%           | 58%          |
| 1 (Minor)   | 13%           | 43%          |
| 2 (Serious) | 6%            | 65%          |

So while the model is now less confident in always predicting "Good", it has become much better at detecting real disruptions.

If the objective is realistic traffic forecasting, we'd want the model to actually predict delays, not just "Good" status.

So this version is more balanced and more useful — even if the accuracy metric alone is lower.



### Fine-tuning 

Next, we try fine-tune the Random Forest model using a soft class weighting approach.
Specifically, we use class_weight='balanced_subsample' and limit the model complexity (e.g., max_depth=15) to balance overall accuracy and recall for minority classes.
The goal is to retain 65–70% accuracy while improving detection of severity levels 1 and 2.

In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

# Features from hybrid model (same as before)
feature_cols = [
    'prev_1h_severity',
    'prev_2h_severity',
    'temperature_2m',
    'precipitation',
    'rain',
    'snowfall',
    'wind_speed_10m',
    'wind_gusts_10m',
    'cloud_cover',
    'prob_severity_0',
    'prob_severity_1',
    'prob_severity_2'
]

target = 'severity_level'

# Re-split in case needed (or reuse X_train, X_test, etc.)
X = df_merged[feature_cols]
y = df_merged[target]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train with softer class weighting + moderate depth
model = RandomForestClassifier(
    n_estimators=150,
    max_depth=15,
    min_samples_split=2,
    min_samples_leaf=2,
    class_weight='balanced_subsample',
    random_state=42
)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)

print("Tuned Random Forest (Balanced Subsample, Medium Depth)\n")
print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

joblib.dump(model, "../models/8e_rf_moderate_tuned.joblib")
print("Model saved to ../models/8e_rf_moderate_tuned.joblib")


Tuned Random Forest (Balanced Subsample, Medium Depth)

Accuracy: 0.638

Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.70      0.78     18933
           1       0.27      0.39      0.32      3547
           2       0.16      0.43      0.24      1367

    accuracy                           0.64     23847
   macro avg       0.44      0.51      0.44     23847
weighted avg       0.75      0.64      0.68     23847


Confusion Matrix:
 [[13249  3401  2283]
 [ 1456  1383   708]
 [  402   382   583]]
Model saved to ../models/8e_rf_moderate_tuned.joblib


**Results interpretation**

Tuned Random Forest Summary (Balanced Subsample + Controlled Complexity)
 
Accuracy: 63.8%
(Down from ~75% in the original model, but much more balanced)

Recall improvements:

- Severity 1: increased to 39% (vs. ~13% before)

- Severity 2: increased to 43% (vs. ~6% before)

Trade-off:
The model sacrifices some overall accuracy (especially for class 0) to better capture minority classes (1 and 2). This leads to more equitable predictions in real-world use cases where detecting delays is critical.

-----
### Experiment: Adding Baseline Entropy as a Feature

I compute the entropy of the baseline severity probability distribution for each instance, capturing the uncertainty in the baseline estimate. Higher entropy implies more ambiguity. We add this as an extra feature to our existing model and evaluate its contribution to classification performance.

In [6]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from scipy.stats import entropy

# Load engineered traffic data with weather features
df = pd.read_csv("../data/engineered_traffic_with_lags_and_weather.csv")

# Load baseline probabilities (using correct column names)
prob_df = pd.read_csv("../results/baseline_probabilities.csv")
prob_df = prob_df.rename(columns={
    'prob_severity_0': 'p_good',
    'prob_severity_1': 'p_minor',
    'prob_severity_2': 'p_serious'
})

# Ensure timestamp-related columns exist
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['weekday'] = df['timestamp'].dt.weekday

# Merge probabilities into main dataset
df_merged = pd.merge(df, prob_df, on=['road', 'hour', 'weekday'], how='left')
df_merged[['p_good', 'p_minor', 'p_serious']] = df_merged[
    ['p_good', 'p_minor', 'p_serious']
].fillna(0)

# Compute entropy of the probability distribution
df_merged['baseline_entropy'] = entropy(
    df_merged[['p_good', 'p_minor', 'p_serious']].values.T,
    base=2
)

# Define features
feature_cols = [
    'prev_1h_severity', 'prev_2h_severity',
    'temperature_2m', 'precipitation', 'rain', 'snowfall',
    'wind_speed_10m', 'wind_gusts_10m', 'cloud_cover',
    'p_good', 'p_minor', 'p_serious',
    'baseline_entropy'
]
target = 'severity_level'

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    df_merged[feature_cols],
    df_merged[target],
    test_size=0.3,
    random_state=42,
    stratify=df_merged[target]
)

# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("Model with Baseline Entropy Feature")
print("\nAccuracy:", round(accuracy_score(y_test, y_pred), 4))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

joblib.dump(model, "../models/8f_rf_with_entropy.joblib")
print("Model with entropy saved to ../models/8f_rf_with_entropy.joblib")

Model with Baseline Entropy Feature

Accuracy: 0.7508

Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.92      0.86     18933
           1       0.25      0.13      0.17      3547
           2       0.14      0.06      0.08      1367

    accuracy                           0.75     23847
   macro avg       0.40      0.37      0.37     23847
weighted avg       0.69      0.75      0.71     23847


Confusion Matrix:
 [[17372  1205   356]
 [ 2965   452   130]
 [ 1126   160    81]]
Model with entropy saved to ../models/8f_rf_with_entropy.joblib


**Results interpretation**

- Overall accuracy stayed roughly the same (from ~0.7529 before to 0.7508 now).

- Recall for classes 1 and 2 slightly improved (compared to the original baseline), even if marginally.

- Entropy provided a tiny boost in class balance awareness, but not dramatically — likely because the Random Forest already leverages decision uncertainty fairly well.

---

### Combining all of the above into one model

In [7]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from scipy.stats import entropy

# Load main dataset
df = pd.read_csv("../data/engineered_traffic_with_lags_and_weather.csv")
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['weekday'] = df['timestamp'].dt.dayofweek

# Load and rename baseline probabilities
baseline_probs = pd.read_csv("../results/baseline_probabilities.csv")
baseline_probs = baseline_probs.rename(columns={
    'prob_severity_0': 'p_good',
    'prob_severity_1': 'p_minor',
    'prob_severity_2': 'p_serious'
})

# Merge probabilities
df = pd.merge(df, baseline_probs, on=['road', 'hour', 'weekday'], how='left')
df[['p_good', 'p_minor', 'p_serious']] = df[['p_good', 'p_minor', 'p_serious']].fillna(0)

# Calculate entropy
df['entropy'] = df[['p_good', 'p_minor', 'p_serious']].apply(lambda row: entropy(row + 1e-9, base=2), axis=1)

# Final feature list
features = [
    'prev_1h_severity', 'prev_2h_severity',
    'temperature_2m', 'precipitation', 'rain', 'snowfall',
    'wind_speed_10m', 'wind_gusts_10m', 'cloud_cover',
    'p_good', 'p_minor', 'p_serious',
    'entropy'
]
X = df[features]
y = df['severity_level']

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Tuned Random Forest
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_leaf=2,
    min_samples_split=2,
    class_weight='balanced_subsample',
    random_state=42
)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("Final Model: All Features + Tuning + Entropy")
print("\nAccuracy:", round(accuracy_score(y_test, y_pred), 4))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

joblib.dump(model, "../models/8g_final_model_with_entropy.joblib")
print("Final model saved to ../models/8g_final_model_with_entropy.joblib")


Final Model: All Features + Tuning + Entropy

Accuracy: 0.5587

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.58      0.71     18933
           1       0.26      0.42      0.32      3547
           2       0.15      0.66      0.24      1367

    accuracy                           0.56     23847
   macro avg       0.44      0.55      0.42     23847
weighted avg       0.77      0.56      0.62     23847


Confusion Matrix:
 [[10931  4058  3944]
 [  920  1494  1133]
 [  182   286   899]]
Final model saved to ../models/8g_final_model_with_entropy.joblib


---

### Comparison

| Model Variant                                      | Accuracy  | Recall (0) | Recall (1) | Recall (2) | Macro F1 | Notes                                  |
| -------------------------------------------------- | --------- | ---------- | ---------- | ---------- | -------- | -------------------------------------- |
| **Baseline Probability Only**                      | 0.795     | **1.00**   | 0.03       | 0.00       | 0.32     | Predicts mostly class 0                |
| **Recent History + Weather (RF)**                  | 0.7529    | 0.92       | 0.13       | 0.06       | 0.37     | Balanced, simple RF                    |
| **+ Class Weighting**                              | 0.7206    | 0.87       | 0.16       | 0.07       | 0.37     | Slight boost to class 1/2              |
| **+ Hyperparameter Tuning (Full Grid Search)**     | 0.5587    | 0.58       | 0.42       | **0.66**   | 0.42     | Boosts minority recall, hurts accuracy |
| ** Tuned RF (Balanced Subsample, Medium Depth)** | **0.638** | 0.70       | 0.39       | 0.43       | **0.44** | Best trade-off                       |
| **+ Entropy Feature**                              | 0.7508    | 0.92       | 0.13       | 0.06       | 0.37     | Didn’t improve much over base RF       |


The tuned RF seems to be the best choice, having a good balance, so I will now include entropy in that model to see if we have a further improvement.

### Incorporating entropy in Tuned RF

This experiment adds a new feature representing the entropy (uncertainty) of baseline class probabilities. It aims to help the model better judge when to rely more on its own prediction vs. when the baseline gives strong signals (i.e., low entropy = confident prior, high entropy = uncertain). We include it on top of the tuned Random Forest with class balancing.



In [8]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

# Load data
df = pd.read_csv("../data/engineered_traffic_with_lags_and_weather.csv")
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['weekday'] = df['timestamp'].dt.dayofweek

# Load baseline probabilities
prob_df = pd.read_csv("../results/baseline_probabilities.csv")

# Rename columns for consistency
prob_df = prob_df.rename(columns={
    'p_good': 'prob_severity_0',
    'p_minor': 'prob_severity_1',
    'p_serious': 'prob_severity_2'
})

# Merge baseline probabilities
df = pd.merge(df, prob_df, on=['road', 'hour', 'weekday'], how='left')
df[['prob_severity_0', 'prob_severity_1', 'prob_severity_2']] = df[[
    'prob_severity_0', 'prob_severity_1', 'prob_severity_2'
]].fillna(0)

# Compute entropy of probabilities
def compute_entropy(row):
    probs = np.array([row['prob_severity_0'], row['prob_severity_1'], row['prob_severity_2']])
    probs = np.clip(probs, 1e-9, 1)  # avoid log(0)
    return -np.sum(probs * np.log(probs))

df['prob_entropy'] = df.apply(compute_entropy, axis=1)

# Feature columns
feature_cols = [
    'prev_1h_severity', 'prev_2h_severity',
    'temperature_2m', 'precipitation', 'rain', 'snowfall',
    'wind_speed_10m', 'wind_gusts_10m', 'cloud_cover',
    'prob_severity_0', 'prob_severity_1', 'prob_severity_2',
    'prob_entropy'
]

# Target
target = 'severity_level'

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df[target],
    test_size=0.3, random_state=42, stratify=df[target]
)

# Train tuned RF with entropy
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_leaf=2,
    min_samples_split=2,
    class_weight='balanced_subsample',
    random_state=42
)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("Model with Entropy + All Previous Improvements\n")
print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

joblib.dump(model, "../models/8h_model_with_entropy_and_all_improvements.joblib")
print("Saved: 8h_model_with_entropy_and_all_improvements.joblib")


Model with Entropy + All Previous Improvements

Accuracy: 0.5587

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.58      0.71     18933
           1       0.26      0.42      0.32      3547
           2       0.15      0.66      0.24      1367

    accuracy                           0.56     23847
   macro avg       0.44      0.55      0.42     23847
weighted avg       0.77      0.56      0.62     23847

Confusion Matrix:
 [[10931  4058  3944]
 [  920  1494  1133]
 [  182   286   899]]
Saved: 8h_model_with_entropy_and_all_improvements.joblib


**Results interpretation**

- The overall accuracy dropped to ~56%, and Class 0 (Good): Very high precision, but low recall (~58%)

- Classes 1 & 2 (Minor/Serious): Recall improved, but precision remains low

- The model is now more willing to predict classes 1 and 2, which improves recall, but at the cost of many false positives for those classes.

This model is useful if the goal is to catch more minor/serious cases (better recall for classes 1–2), even at the expense of many false alarms.

For a balanced performance, the tuned RF (without entropy) is the best choice (and best overall candidate):
- Tuned Random Forest (Balanced Subsample, Medium Depth)
- Accuracy: 0.638, better balance of precision/recall across classes



Entropy is a valid idea conceptually (especially in a hybrid model), but it didn't help in this case, likely due to overlap with existing features and misleading signals from uncertain baseline distributions.

If included, it should ideally be paired with strong regularization or feature selection.



------

### Adding Feature Interactions

I will now experiment by adding Feature Interactions 


In [9]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load main dataset
df = pd.read_csv("../data/engineered_traffic_with_lags_and_weather.csv")
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['weekday'] = df['timestamp'].dt.dayofweek

# Load probabilities with correct column names
prob_df = pd.read_csv("../results/baseline_probabilities.csv")
# Check column names
# print(prob_df.columns)

# Merge using the correct names
df_merged = pd.merge(df, prob_df, on=["road", "hour", "weekday"], how="left")

# Fill missing values
df_merged[['prob_severity_0', 'prob_severity_1', 'prob_severity_2']] = df_merged[
    ['prob_severity_0', 'prob_severity_1', 'prob_severity_2']
].fillna(0)

# Add interaction features
df_merged['precip_x_cloud'] = df_merged['precipitation'] * df_merged['cloud_cover']
df_merged['temp_x_wind'] = df_merged['temperature_2m'] * df_merged['wind_speed_10m']
df_merged['rushhour_x_prev1h'] = df_merged['is_rush_hour'] * df_merged['prev_1h_severity']

# Feature list
feature_cols = [
    'prev_1h_severity', 'prev_2h_severity',
    'temperature_2m', 'precipitation', 'rain', 'snowfall',
    'wind_speed_10m', 'wind_gusts_10m', 'cloud_cover',
    'prob_severity_0', 'prob_severity_1', 'prob_severity_2',
    'precip_x_cloud', 'temp_x_wind', 'rushhour_x_prev1h'
]
target = 'severity_level'

# Split
X_train, X_test, y_train, y_test = train_test_split(
    df_merged[feature_cols], df_merged[target],
    test_size=0.3, random_state=42, stratify=df_merged[target]
)

#Tuned model
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_leaf=2,
    min_samples_split=2,
    class_weight="balanced_subsample",
    random_state=42
)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("Model with Feature Interactions (Best Config)")
print("\nAccuracy:", round(accuracy_score(y_test, y_pred), 4))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


joblib.dump(model, "../models/8i_rf_feature_interactions.joblib")
print("Saved: 8i_rf_feature_interactions.joblib")


Model with Feature Interactions (Best Config)

Accuracy: 0.5544

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.57      0.70     18933
           1       0.25      0.42      0.32      3547
           2       0.15      0.67      0.24      1367

    accuracy                           0.55     23847
   macro avg       0.44      0.55      0.42     23847
weighted avg       0.77      0.55      0.62     23847


Confusion Matrix:
 [[10815  4108  4010]
 [  888  1496  1163]
 [  176   281   910]]
Saved: 8i_rf_feature_interactions.joblib


**Results interpretation**

I tested adding the interaction terms: 

- precip_x_cloud = precipitation × cloud_cover
- temp_x_wind = temperature × wind_speed
- rushhour_x_prev1h = is_rush_hour × prev_1h_severity

They did not improve the model's performance. In fact, results were very similar — slightly worse than the previous best (without interactions)

The drop in overall accuracy suggests the added features might be introducing noise or redundancy, rather than new signal.

Therefore I'll leave out the interaction features, as they don’t provide measurable benefit here.



----
### Best Model Summary:

Model type: Random Forest Classifier

Key strategies included:

- Lag Features (Recent History): prev_1h_severity, prev_2h_severity

  Captures short-term temporal patterns in traffic severity per road

- Weather Features (Standardized): temperature_2m, precipitation, rain, snowfall, wind_speed_10m, wind_gusts_10m, cloud_cover

  External conditions affecting traffic.

- Baseline Probability Features: p_good, p_minor, p_serious

  Historical severity probabilities by road, hour, and weekday.

- Class Weighting: class_weight='balanced_subsample'

  Addresses imbalance by giving more weight to underrepresented classes (1 and 2).

- Hyperparameter Tuning: max_depth=10, min_samples_leaf=2, min_samples_split=2, n_estimators=200

  Tuned to improve generalization and performance across classes.


**Strong overall accuracy (~0.64)**

**Most balanced performance across classes (especially classes 1 & 2)**

**Good trade-off between precision and recall**

**Reasonable complexity and fast training/inference time**