# Experimenting with XGBoost 


This notebook mirrors the structure of Notebook 8 (Random Forest), but focuses entirely on XGBoost, with progressive experiments aiming to improve performance through feature engineering, tuning, imbalance handling, and SMOTE.

The aim is to see if it can outperform the best-tuned RF model.

### Baseline - using the engineered dataset with lags and weather but without using probabilities or enhancements

In [1]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib

# Load data
df = pd.read_csv("../data/engineered_traffic_with_lags_and_weather.csv")
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['weekday'] = df['timestamp'].dt.dayofweek

# Select baseline features
feature_cols = [
    'prev_1h_severity', 'prev_2h_severity',
    'temperature_2m', 'precipitation', 'rain', 'snowfall',
    'wind_speed_10m', 'wind_gusts_10m', 'cloud_cover'
]
target = 'severity_level'

X = df[feature_cols]
y = df[target]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Train XGBoost (same config)
model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=10,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="multi:softmax",
    num_class=3,
    eval_metric="mlogloss",
    use_label_encoder=False,
    random_state=42
)
model.fit(X_train, y_train)

#  Evaluate
y_pred = model.predict(X_test)

print("XGBoost Baseline (No Probabilities or Interaction Features)")
print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

#  Save
joblib.dump(model, "../models/9_xgboost_baseline.joblib")
print("Baseline XGBoost model saved to ../models/9_xgboost_baseline.joblib")


Parameters: { "use_label_encoder" } are not used.



XGBoost Baseline (No Probabilities or Interaction Features)
Accuracy: 0.783

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.98      0.88     18933
           1       0.25      0.04      0.06      3547
           2       0.18      0.02      0.04      1367

    accuracy                           0.78     23847
   macro avg       0.41      0.35      0.33     23847
weighted avg       0.68      0.78      0.71     23847


Confusion Matrix:
 [[18516   325    92]
 [ 3376   128    43]
 [ 1270    68    29]]
Baseline XGBoost model saved to ../models/9_xgboost_baseline.joblib


--------

### XGBoost model with enhanced features
To ensure a fair comparison, I replicated all the enhancements that were applied in the final Random Forest model (from Notebook 8).
This model adds baseline probabilities, interaction features, and time features (the latter only implicitly, through the rush-hour flag used in one interaction term).
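Because the baseline probabilities are joined on `road`, `hour`, and `weekday` and missing values are then filled with 0, it can be worth checking how many rows fail to match before filling. The sketch below uses small made-up frames (the road names and probability values are hypothetical, not taken from the project data) to show one way to do that with the `validate` and `indicator` options of the pandas merge.

```python
import pandas as pd

# Hypothetical toy data standing in for the traffic frame and the probability table
obs = pd.DataFrame({
    "road": ["A1", "A1", "B2"],
    "hour": [8, 9, 8],
    "weekday": [0, 0, 5],
})
probs = pd.DataFrame({
    "road": ["A1", "A1"],
    "hour": [8, 9],
    "weekday": [0, 0],
    "prob_severity_0": [0.70, 0.80],
    "prob_severity_1": [0.20, 0.15],
    "prob_severity_2": [0.10, 0.05],
})

keys = ["road", "hour", "weekday"]
merged = obs.merge(
    probs,
    on=keys,
    how="left",
    validate="many_to_one",  # raises if a key appears more than once in probs
    indicator=True,          # adds a _merge column marking unmatched rows
)

# Rows with no matching probability entry would be filled with zeros by fillna(0)
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"{unmatched} of {len(merged)} rows have no baseline probability")
```

If the unmatched count is large, filling with 0 gives those rows a probability vector that sums to 0, which the model may learn to treat as a signal in its own right.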



In [2]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import joblib

# Load data
df = pd.read_csv("../data/engineered_traffic_with_lags_and_weather.csv")
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['weekday'] = df['timestamp'].dt.dayofweek
df['is_rush_hour'] = df['hour'].isin([7, 8, 9, 16, 17, 18]).astype(int)

# Load and merge baseline probabilities
prob_df = pd.read_csv("../results/baseline_probabilities.csv")
df = pd.merge(df, prob_df, on=['road', 'hour', 'weekday'], how='left')
df[['prob_severity_0', 'prob_severity_1', 'prob_severity_2']] = df[
    ['prob_severity_0', 'prob_severity_1', 'prob_severity_2']
].fillna(0)

# Creating interaction features
df['precip_x_cloud'] = df['precipitation'] * df['cloud_cover']
df['temp_x_wind'] = df['temperature_2m'] * df['wind_speed_10m']
df['rushhour_x_prev1h'] = df['is_rush_hour'] * df['prev_1h_severity']

# Define features and target
feature_cols = [
    'prev_1h_severity', 'prev_2h_severity',
    'temperature_2m', 'precipitation', 'rain', 'snowfall',
    'wind_speed_10m', 'wind_gusts_10m', 'cloud_cover',
    'prob_severity_0', 'prob_severity_1', 'prob_severity_2',
    'precip_x_cloud', 'temp_x_wind', 'rushhour_x_prev1h'
]
target = 'severity_level'

X = df[feature_cols]
y = df[target]

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train XGBoost model
model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=10,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='multi:softmax',
    num_class=3,
    eval_metric='mlogloss',
    use_label_encoder=False,
    random_state=42
)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)

print("XGBoost with Engineered Features")
print("\nAccuracy:", round(accuracy_score(y_test, y_pred), 4))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Save model
joblib.dump(model, "../models/9b_xgboost_features.joblib")
print("XGBoost model saved to ../models/9b_xgboost_features.joblib")


Parameters: { "use_label_encoder" } are not used.



XGBoost with Engineered Features

Accuracy: 0.7795

Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.96      0.88     18933
           1       0.33      0.10      0.15      3547
           2       0.23      0.06      0.10      1367

    accuracy                           0.78     23847
   macro avg       0.46      0.37      0.38     23847
weighted avg       0.70      0.78      0.73     23847

Confusion Matrix:
 [[18158   587   188]
 [ 3112   346    89]
 [ 1161   121    85]]
XGBoost model saved to ../models/9b_xgboost_features.joblib


--------

### Hyperparameter Tuning

I will now try to improve XGBoost performance via a small grid search over key parameters:

- n_estimators
- max_depth
- learning_rate
- subsample
- colsample_bytree
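
As a quick sanity check on the cost of the search, the number of candidates is the product of the list lengths in the grid. The sketch below enumerates the same grid as the cell that follows, using only the standard library (the variable names `candidates` and `n_fits` are mine).

```python
from itertools import product

# Same grid as the tuning cell in this section
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [6, 10],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8],
    "colsample_bytree": [0.8],
}
cv = 3  # number of cross-validation folds

# Cartesian product of all parameter values, one dict per candidate
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_fits = len(candidates) * cv

print(len(candidates), "candidates,", n_fits, "fits")
```

This agrees with the "8 candidates, totalling 24 fits" line that GridSearchCV prints.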



In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import xgboost as xgb
import joblib

# Load and prepare data
df = pd.read_csv("../data/engineered_traffic_with_lags_and_weather.csv")
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['weekday'] = df['timestamp'].dt.dayofweek
df['is_rush_hour'] = df['hour'].isin([7, 8, 9, 16, 17, 18]).astype(int)

# Load baseline probabilities
probs = pd.read_csv("../results/baseline_probabilities.csv")
df = pd.merge(df, probs, on=["road", "hour", "weekday"], how="left")
df[['prob_severity_0', 'prob_severity_1', 'prob_severity_2']] = df[
    ['prob_severity_0', 'prob_severity_1', 'prob_severity_2']
].fillna(0)

# Feature engineering
df['precip_x_cloud'] = df['precipitation'] * df['cloud_cover']
df['temp_x_wind'] = df['temperature_2m'] * df['wind_speed_10m']
df['rushhour_x_prev1h'] = df['is_rush_hour'] * df['prev_1h_severity']

#  Define features and target
feature_cols = [
    'prev_1h_severity', 'prev_2h_severity',
    'temperature_2m', 'precipitation', 'rain', 'snowfall',
    'wind_speed_10m', 'wind_gusts_10m', 'cloud_cover',
    'prob_severity_0', 'prob_severity_1', 'prob_severity_2',
    'precip_x_cloud', 'temp_x_wind', 'rushhour_x_prev1h'
]
target = 'severity_level'

X = df[feature_cols]
y = df[target]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Set up parameter grid for tuning
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [6, 10],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8],
    'colsample_bytree': [0.8]
}

xgb_clf = xgb.XGBClassifier(
    objective='multi:softmax',
    num_class=3,
    eval_metric='mlogloss',
    use_label_encoder=False,
    random_state=42
)

# perform GridSearch
grid_search = GridSearchCV(
    estimator=xgb_clf,
    param_grid=param_grid,
    cv=3,
    verbose=1,
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

#Evaluate best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Params:", grid_search.best_params_)
print("\nXGBoost Tuned Model")
print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Saving the tuned model
joblib.dump(best_model, "../models/9c_xgboost_tuned.joblib")
print("tuned XGBoost model saved to ../models/9c_xgboost_tuned.joblib")


Fitting 3 folds for each of 8 candidates, totalling 24 fits


Parameters: { "use_label_encoder" } are not used.


Best Params: {'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 6, 'n_estimators': 100, 'subsample': 0.8}

XGBoost Tuned Model
Accuracy: 0.7939

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.99      0.89     18933
           1       0.41      0.04      0.07      3547
           2       0.30      0.01      0.01      1367

    accuracy                           0.79     23847
   macro avg       0.50      0.34      0.32     23847
weighted avg       0.71      0.79      0.71     23847

Confusion Matrix:
 [[18794   132     7]
 [ 3406   129    12]
 [ 1309    50     8]]
tuned XGBoost model saved to ../models/9c_xgboost_tuned.joblib


**Results (after tuning):**

- Accuracy: 0.7939
- Macro F1-score: 0.32
- Macro Recall: 0.34

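Compared with 9b, precision rose for class 1 (0.33 → 0.41) and class 2 (0.23 → 0.30), but recall fell (0.10 → 0.04 and 0.06 → 0.01), so macro F1 dropped from 0.38 to 0.32.

The tuned model predicts the minority classes less often and is right more often when it does, but class imbalance still leaves recall for classes 1 and 2 very low.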
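
To put the accuracy figure in context, it helps to compare it with a model that always predicts the majority class. The short calculation below uses the support counts from the classification report above. The variable names are mine.

```python
# Test-set support per class, copied from the classification report
support = {0: 18933, 1: 3547, 2: 1367}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of always predicting the most frequent class
majority_accuracy = support[majority_class] / total

# Macro recall of that trivial model: 1.0 for the majority class, 0 for the others
majority_macro_recall = 1.0 / len(support)

print("Majority class:", majority_class)
print("Majority-class accuracy:", round(majority_accuracy, 4))
print("Majority-class macro recall:", round(majority_macro_recall, 2))
```

The trivial model scores about 0.7939 accuracy on this split, so accuracy alone says little here. Macro recall and macro F1 are more informative for this problem.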

---------

### XGBoost with Enhanced Hyperparameter Tuning & Class Imbalance Handling

In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import xgboost as xgb
import joblib

# Load and prepare data
df = pd.read_csv("../data/engineered_traffic_with_lags_and_weather.csv")
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['weekday'] = df['timestamp'].dt.dayofweek
df['is_rush_hour'] = df['hour'].isin([7, 8, 9, 16, 17, 18]).astype(int)

# Load baseline probabilities
probs = pd.read_csv("../results/baseline_probabilities.csv")
df = pd.merge(df, probs, on=["road", "hour", "weekday"], how="left")
df[['prob_severity_0', 'prob_severity_1', 'prob_severity_2']] = df[
    ['prob_severity_0', 'prob_severity_1', 'prob_severity_2']
].fillna(0)

# Feature engineering
df['precip_x_cloud'] = df['precipitation'] * df['cloud_cover']
df['temp_x_wind'] = df['temperature_2m'] * df['wind_speed_10m']
df['rushhour_x_prev1h'] = df['is_rush_hour'] * df['prev_1h_severity']

# Define features and target
feature_cols = [
    'prev_1h_severity', 'prev_2h_severity',
    'temperature_2m', 'precipitation', 'rain', 'snowfall',
    'wind_speed_10m', 'wind_gusts_10m', 'cloud_cover',
    'prob_severity_0', 'prob_severity_1', 'prob_severity_2',
    'precip_x_cloud', 'temp_x_wind', 'rushhour_x_prev1h'
]
X = df[feature_cols]
y = df['severity_level']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Hyperparameter tuning
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.8],
    'colsample_bytree': [0.8],
    'scale_pos_weight': [1, 3, 5]  # intended imbalance handling; not used with this multi-class objective (see warning in output)
}

xgb_clf = xgb.XGBClassifier(
    objective="multi:softmax",
    num_class=3,
    eval_metric="mlogloss",
    use_label_encoder=False,
    random_state=42
)

grid = GridSearchCV(xgb_clf, param_grid, cv=3, scoring='accuracy', verbose=1, n_jobs=-1)
grid.fit(X_train, y_train)

# Evaluate best model
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Params:", grid.best_params_)
print("\nXGBoost Tuned (with imbalance handling)")
print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Save model
joblib.dump(best_model, "../models/9d_xgboost_tuned_imbalance.joblib")
print("Model saved to ../models/9d_xgboost_tuned_imbalance.joblib")


Fitting 3 folds for each of 54 candidates, totalling 162 fits


Parameters: { "scale_pos_weight", "use_label_encoder" } are not used.


Best Params: {'colsample_bytree': 0.8, 'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 200, 'scale_pos_weight': 1, 'subsample': 0.8}

XGBoost Tuned (with imbalance handling)
Accuracy: 0.7946

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.99      0.89     18933
           1       0.45      0.03      0.06      3547
           2       0.00      0.00      0.00      1367

    accuracy                           0.79     23847
   macro avg       0.42      0.34      0.32     23847
weighted avg       0.70      0.79      0.71     23847


Confusion Matrix:
 [[18835    98     0]
 [ 3431   115     1]
 [ 1324    43     0]]
Model saved to ../models/9d_xgboost_tuned_imbalance.joblib


**Results interpretation:**

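- This model achieved the best accuracy (0.7946).

- However, it never predicts class 2 correctly (class 2 F1-score is 0), and class 1 recall is only 0.03, which indicates that accuracy is dominated by class 0.

- `scale_pos_weight` had no effect: the warning in the output shows that XGBoost does not use it with this multi-class objective, so the search only varied the other parameters and selected on accuracy.

- Class 2 is very small and remains hard to predict without class weighting that works for multi-class problems, additional features, or resampling.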
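
One option I have not tried in this notebook is per-sample weighting, which does apply to multi-class problems. The sketch below computes "balanced" weights (each class ends up with the same total weight) using NumPy only. The helper name `balanced_sample_weights` is hypothetical, and the demo labels simply reuse the test-set class counts for illustration. As far as I know, `sklearn.utils.class_weight.compute_sample_weight("balanced", y)` produces the same values.

```python
import numpy as np

def balanced_sample_weights(y):
    """Weight each sample by n_samples / (n_classes * count of its class)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    class_weight = len(y) / (len(classes) * counts)
    # classes is sorted, so searchsorted maps each label to its class index
    return class_weight[np.searchsorted(classes, y)]

# Illustrative labels with the class counts from the test-set support column
y_demo = np.repeat([0, 1, 2], [18933, 3547, 1367])
weights = balanced_sample_weights(y_demo)

# After weighting, every class contributes the same total weight
per_class_total = [float(weights[y_demo == c].sum()) for c in (0, 1, 2)]
print(per_class_total)

# Hypothetical usage with this notebook's variables (not run here):
# model.fit(X_train, y_train, sample_weight=balanced_sample_weights(y_train))
```

Whether this beats resampling on this dataset is an open question. It would need the same evaluation as the other variants.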


--------

### XGBoost with enhanced features & SMOTE

Since 9c and 9d did not give promising results, I will now try further improvements on 9b.
9b is the best-balanced model so far across all classes.

Unlike the tuned models, it does not ignore class 2.
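
For intuition about what SMOTE does to the training split, here is a simplified NumPy sketch of its core idea: each synthetic sample is an interpolation between a minority sample and one of its nearest minority neighbours. This is not imbalanced-learn's implementation. The function name `smote_like_samples` and the random demo data are hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise squared Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest neighbours of every minority sample
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new point
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position on the segment between the two
    lam = rng.random((n_new, 1))
    return X_minority[base] + lam * (X_minority[picked] - X_minority[base])

# Hypothetical minority-class data: 20 samples with 3 features
X_min = np.random.default_rng(0).normal(size=(20, 3))
synthetic = smote_like_samples(X_min, n_new=50)
print(synthetic.shape)
```

Because the new points are built from training samples, resampling should only be applied after the train-test split, as the cell below does, so the test set keeps its original class distribution.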


In [5]:
# !pip install imbalanced-learn

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import xgboost as xgb
import pandas as pd
import joblib

# Load data
df = pd.read_csv("../data/engineered_traffic_with_lags_and_weather.csv")
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['weekday'] = df['timestamp'].dt.dayofweek
df['is_rush_hour'] = df['hour'].isin([7, 8, 9, 16, 17, 18]).astype(int)

# Merge probabilities
probs = pd.read_csv("../results/baseline_probabilities.csv")
df = pd.merge(df, probs, on=["road", "hour", "weekday"], how="left")
df[['prob_severity_0', 'prob_severity_1', 'prob_severity_2']] = df[
    ['prob_severity_0', 'prob_severity_1', 'prob_severity_2']
].fillna(0)

# Feature interactions
df['precip_x_cloud'] = df['precipitation'] * df['cloud_cover']
df['temp_x_wind'] = df['temperature_2m'] * df['wind_speed_10m']
df['rushhour_x_prev1h'] = df['is_rush_hour'] * df['prev_1h_severity']

# Feature set and target
feature_cols = [
    'prev_1h_severity', 'prev_2h_severity',
    'temperature_2m', 'precipitation', 'rain', 'snowfall',
    'wind_speed_10m', 'wind_gusts_10m', 'cloud_cover',
    'prob_severity_0', 'prob_severity_1', 'prob_severity_2',
    'precip_x_cloud', 'temp_x_wind', 'rushhour_x_prev1h'
]
target = 'severity_level'

X = df[feature_cols]
y = df[target]

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Apply SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Train XGBoost
model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="multi:softmax",
    num_class=3,
    eval_metric="mlogloss",
    random_state=42
)
model.fit(X_resampled, y_resampled)

# Evaluate
y_pred = model.predict(X_test)
print("XGBoost + SMOTE")
print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Save model
joblib.dump(model, "../models/9e_xgboost_smote.joblib")
print("Model saved to ../models/9e_xgboost_smote.joblib")




XGBoost + SMOTE
Accuracy: 0.7776

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.95      0.88     18933
           1       0.36      0.12      0.18      3547
           2       0.25      0.15      0.19      1367

    accuracy                           0.78     23847
   macro avg       0.48      0.40      0.41     23847
weighted avg       0.72      0.78      0.73     23847


Confusion Matrix:
 [[17926   595   412]
 [ 2947   412   188]
 [ 1026   136   205]]
Model saved to ../models/9e_xgboost_smote.joblib


**Results interpretation:**

Compared to the previous best model (9b). Note that 9e also uses `max_depth=6` rather than the `max_depth=10` used in 9b, so the differences below are not due to SMOTE alone.

Positives:
- Class 2 recall improved from ~0.06 (in 9b) to 0.15, roughly a 2.4x gain (85 → 205 correctly classified samples), which is significant for a minority class.
- Class 1 recall improved slightly from 0.10 → 0.12.
- F1-scores improved for both class 1 (0.15 → 0.18) and class 2 (0.10 → 0.19).
- Macro F1-score and macro recall also improved, from 0.38 to 0.41 (F1) and 0.37 → 0.40 (recall), giving a more balanced model across classes.
- Weighted F1 is unchanged at 0.73, so performance on the majority class (0) wasn’t sacrificed too much.

Negatives:
- Slight drop in overall accuracy, from 0.7795 to 0.7776. This is acceptable given the better minority class performance.
- Class 0’s recall decreased a bit (0.96 → 0.95), but is still very high.

**SMOTE helped balance performance across classes, especially for class 2, at a small cost to accuracy. This model (9e) is arguably more balanced than 9b or 9d and is a strong candidate for the final comparison against the best-performing RF model.**
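
To make the comparison across all five variants easier to see in one place, the sketch below recomputes accuracy, macro recall, and macro F1 directly from the confusion matrices printed in this notebook. The model labels and the `summarize` helper are my own naming.

```python
import numpy as np

# Confusion matrices copied from the cell outputs above (rows = true, columns = predicted)
confusion = {
    "9 baseline":      [[18516, 325, 92], [3376, 128, 43], [1270, 68, 29]],
    "9b features":     [[18158, 587, 188], [3112, 346, 89], [1161, 121, 85]],
    "9c tuned":        [[18794, 132, 7], [3406, 129, 12], [1309, 50, 8]],
    "9d tuned + spw":  [[18835, 98, 0], [3431, 115, 1], [1324, 43, 0]],
    "9e SMOTE":        [[17926, 595, 412], [2947, 412, 188], [1026, 136, 205]],
}

def summarize(cm):
    """Return accuracy, macro recall and macro F1 for one confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    actual = cm.sum(axis=1)      # samples per true class
    predicted = cm.sum(axis=0)   # samples per predicted class
    recall = tp / actual
    # Guard against classes that are never predicted
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "macro_recall": recall.mean(),
        "macro_f1": f1.mean(),
    }

results = {name: summarize(cm) for name, cm in confusion.items()}
for name, metrics in results.items():
    print(f"{name:16s}", {k: round(float(v), 4) for k, v in metrics.items()})
```

Ranking by macro F1 rather than accuracy puts 9e first and the two tuned models last, which matches the interpretation above.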