 
# Machine Learning Models Based on Recent Traffic History & Weather

In this notebook, I train and evaluate the same machine learning models as in Notebook 7a, but now include weather features as additional predictors. The goal is to assess whether incorporating weather data improves prediction performance. I again compare Random Forest, Logistic Regression, and Gradient Boosting models, this time on the extended feature set.



In [1]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
import joblib

# Load data
df = pd.read_csv("../data/engineered_traffic_with_lags_and_weather_scaled.csv")

# Define features and target
features = [
    'prev_1h_severity', 'prev_2h_severity', 'hour', 'day_of_week', 'is_weekend', 'is_rush_hour',
    'temperature_2m', 'precipitation', 'rain', 'snowfall', 'wind_speed_10m', 'wind_gusts_10m', 'cloud_cover'
]
X = df[features]
y = df['severity_level']

# Train-test split (stratified so each split keeps the class proportions)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define models
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42)
}

# Train and evaluate, save each model
for name, model in models.items():
    print(f"\nModel: {name}")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred, zero_division=0))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

    # Save model
    filename = f"../models/7b_{name.lower().replace(' ', '_')}.joblib"
    joblib.dump(model, filename)
    print(f"Saved model to {filename}")




Model: Random Forest
              precision    recall  f1-score   support

           0       0.80      0.96      0.87     12622
           1       0.25      0.07      0.11      2365
           2       0.15      0.03      0.05       911

    accuracy                           0.77     15898
   macro avg       0.40      0.35      0.35     15898
weighted avg       0.68      0.77      0.71     15898

Confusion Matrix:
 [[12059   444   119]
 [ 2144   176    45]
 [  799    82    30]]
Saved model to ../models/7b_random_forest.joblib

Model: Logistic Regression
              precision    recall  f1-score   support

           0       0.79      1.00      0.89     12622
           1       0.00      0.00      0.00      2365
           2       0.00      0.00      0.00       911

    accuracy                           0.79     15898
   macro avg       0.26      0.33      0.30     15898
weighted avg       0.63      0.79      0.70     15898

Confusion Matrix:
 [[12620     2     0]
 [ 2365     0   

# Observations

Random Forest:

- Adding weather features slightly improved minority-class recall, especially for class 1.
- Precision for class 1 increased from 0.19 to 0.25.
- Accuracy dropped slightly (79% to 77%). This is expected rather than a problem: the model now tries to classify the minority classes correctly instead of just predicting class 0.
- Overall, it still struggles: recall is only 0.07 for class 1 and 0.03 for class 2.

Logistic Regression & Gradient Boosting:

- Exactly the same behavior as before: the weather features didn't help at all.
- Both models underperform heavily under this degree of class imbalance and with such weak signals. Logistic Regression, for example, assigns almost every test sample to class 0 (recall 0.00 for classes 1 and 2).
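The Random Forest figures in the classification report can be re-derived from its confusion matrix, which makes it easier to see where the minority-class errors go. The snippet below is a minimal sketch that hard-codes the matrix printed in the cell output. The variable names are illustrative and not part of the notebook.

```python
import numpy as np

# Random Forest confusion matrix copied from the cell output above
# (rows = true class, columns = predicted class).
cm = np.array([
    [12059, 444, 119],
    [2144, 176, 45],
    [799, 82, 30],
])

support = cm.sum(axis=1)          # true samples per class
predicted = cm.sum(axis=0)        # predictions per class
correct = np.diag(cm)             # correctly classified samples per class

recall = correct / support        # share of each true class that was found
precision = correct / predicted   # share of each predicted class that was right
accuracy = correct.sum() / cm.sum()

print("recall   :", np.round(recall, 2))     # [0.96 0.07 0.03]
print("precision:", np.round(precision, 2))  # [0.8  0.25 0.15]
print("accuracy :", round(float(accuracy), 2))  # 0.77
```

The second and third rows show that most class 1 and class 2 samples (2144 of 2365 and 799 of 911) are still predicted as class 0.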


Random Forest benefits a little from the weather features, which points to a weak but non-zero signal: weather data is probably only mildly correlated with traffic severity. The class imbalance still dominates the results.

Random Forest is the strongest candidate.
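To put the reported accuracies in context, it can help to compare them with a majority-class baseline. The sketch below is not part of the original notebook. It assumes the test-set class counts from the classification report, uses scikit-learn's `DummyClassifier`, and feeds it a placeholder feature column, since a most-frequent baseline ignores its inputs.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Test-set class counts (support column of the classification report).
support = {0: 12622, 1: 2365, 2: 911}
y_true = np.repeat(list(support.keys()), list(support.values()))

# Assumption: a single constant column stands in for the real features,
# because a most-frequent baseline never looks at them.
X_dummy = np.zeros((len(y_true), 1))

# Baseline that always predicts the most frequent class (class 0 here).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

base_accuracy = accuracy_score(y_true, y_base)
base_macro_recall = recall_score(y_true, y_base, average="macro", zero_division=0)
base_macro_f1 = f1_score(y_true, y_base, average="macro", zero_division=0)

print(f"accuracy:     {base_accuracy:.2f}")
print(f"macro recall: {base_macro_recall:.2f}")
print(f"macro F1:     {base_macro_f1:.2f}")
```

Under these assumptions the baseline reaches about 0.79 accuracy and 0.30 macro F1, the same values as in the Logistic Regression report. Random Forest's macro F1 of 0.35 is the only reported score that moves beyond this baseline, which is consistent with treating it as the strongest candidate despite its lower accuracy.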