
Objective:
- Validate model stability using cross-validation
- Optimize model performance using hyperparameter tuning
- Compare baseline vs tuned models
- Check for overfitting

Models:
- Logistic Regression
- Random Forest

Target:
- BlueWins


In [1]:
import pandas as pd
import numpy as np
import joblib

from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report


In [2]:
# load data and models
x_train = pd.read_csv('../data/processed/x_train.csv')
y_train = pd.read_csv('../data/processed/y_train.csv').squeeze() # ensures target is 1D

logreg_model = joblib.load('../models/logistic_regression.pkl')
rf_model = joblib.load('../models/random_forest.pkl')


In [3]:
# cross-validation LR
lr_cv_scores = cross_val_score(logreg_model, x_train, y_train, cv=5, scoring='accuracy')

print("Logistic Regression CV Scores:", lr_cv_scores)
print("Mean Accuracy:", lr_cv_scores.mean())
print("Std Dev:", lr_cv_scores.std())

Logistic Regression CV Scores: [0.95833758 0.95979763 0.95837153 0.95925435 0.95731894]
Mean Accuracy: 0.9586160062476656
Std Dev: 0.0008512817527055732


In [4]:
# cross-validation RF
rf_cv_scores = cross_val_score(rf_model, x_train, y_train, cv=5, scoring='accuracy')

print("\nRandom Forest CV Scores:", rf_cv_scores)
print("Mean Accuracy:", rf_cv_scores.mean())
print("Std Dev:", rf_cv_scores.std())


Random Forest CV Scores: [0.95942413 0.95793012 0.96023904 0.96006927 0.95898272]
Mean Accuracy: 0.959329055040576
Std Dev: 0.0008321050461439898


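- Cross-validation evaluates each model on 5 different train/validation splits of the training data
- Low standard deviation (about 0.0008 for both models) indicates stable performance across folds
- Mean accuracies are close (0.9586 for Logistic Regression vs 0.9593 for Random Forest), so the two models are comparably effective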
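
To make "multiple data splits" concrete, here is a minimal sketch of how k-fold splitting partitions the rows. The helper `kfold_indices` is hypothetical and written only for illustration. It is deliberately simplified: for a classifier with an integer `cv`, `cross_val_score` uses stratified folds, which this sketch does not reproduce.

```python
def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds.

    Returns a list of (train_indices, val_indices) pairs, one per fold.
    Simplified illustration: no shuffling and no stratification.
    """
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        splits.append((train, val))
        start += size
    return splits


# Toy example: 10 rows, 5 folds. Each row is used for validation exactly once.
splits = kfold_indices(n_samples=10, k=5)
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train on {train_idx}, validate on {val_idx}")
```

Each fold produces one accuracy score, which is why the outputs above list five scores per model.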

In [None]:
# Hyperparameter tuning - RF
# Grid search tries every combination of the parameters below and scores each
# one with 5-fold cross-validation to find the best-performing combination.

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

grid_search = GridSearchCV(
    estimator=rf_model,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=2
)

grid_search.fit(x_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
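
The "24 candidates" in the log line above is the product of the number of values per parameter (2 x 3 x 2 x 2), and 5 folds per candidate gives the 120 fits. As a rough sketch, the same count can be checked with the standard library before launching a search; the variable names `cv_folds`, `candidates`, and `total_fits` are introduced here for illustration only.

```python
from itertools import product

# Same grid as in the tuning cell above.
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate.
candidates = [
    dict(zip(param_grid.keys(), values))
    for values in product(*param_grid.values())
]
n_candidates = len(candidates)
total_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {total_fits} fits")
```

With the default `refit=True`, `GridSearchCV` additionally retrains the best candidate once on the full training set, which is what `best_estimator_` returns.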


In [None]:
print("Best Parameters:", grid_search.best_params_)
print("Best CV Score:", grid_search.best_score_)

In [None]:
# evaluate the tuned model on the training data
# (training accuracy is optimistic; it does not measure generalization)
best_rf = grid_search.best_estimator_

y_pred_train = best_rf.predict(x_train)

print("Train Accuracy:", accuracy_score(y_train, y_pred_train))
print(classification_report(y_train, y_pred_train))


In [None]:
# compare the baseline RF with the tuned RF using mean CV accuracy
baseline_score = rf_cv_scores.mean()
tuned_score = grid_search.best_score_

print(f"Baseline RF CV Score: {baseline_score:.4f}")
print(f"Tuned RF CV Score: {tuned_score:.4f}")
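
One rough way to judge whether a difference in mean CV accuracy matters is to compare it with the fold-to-fold spread. The sketch below applies that idea to the fold scores printed earlier for the two baseline models, since the tuned score is not shown in the outputs. Treating "difference smaller than the larger standard deviation" as "within noise" is a rule of thumb assumed here, not a formal statistical test.

```python
from statistics import mean, pstdev

# Fold scores copied from the cross-validation outputs above.
lr_scores = [0.95833758, 0.95979763, 0.95837153, 0.95925435, 0.95731894]
rf_scores = [0.95942413, 0.95793012, 0.96023904, 0.96006927, 0.95898272]

# pstdev is the population standard deviation, matching NumPy's default std().
diff = mean(rf_scores) - mean(lr_scores)
spread = max(pstdev(lr_scores), pstdev(rf_scores))

# Heuristic only: a gap smaller than the fold-to-fold spread is hard to
# distinguish from split-to-split variation.
within_noise = abs(diff) < spread

print(f"Difference in means: {diff:.4f}")
print(f"Largest fold std dev: {spread:.4f}")
print("Within fold-to-fold noise:", within_noise)
```

The same comparison could be applied to `baseline_score` and `tuned_score` once the grid search has run.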


In [None]:
# check for overfitting by comparing training accuracy with CV accuracy
train_score = best_rf.score(x_train, y_train)
cv_score = grid_search.best_score_

print("Train Accuracy:", train_score) # how well the model fits the training data
print("Cross-Validation Accuracy:", cv_score) # estimate of how well the model generalizes to unseen data
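
The two numbers above are usually read together: a training accuracy far above the cross-validation accuracy suggests the model has memorized the training data. A minimal sketch of that comparison follows. The helper `overfit_gap`, its `tolerance` of 0.02, and the example scores are all assumptions made for illustration; they are not values from this notebook.

```python
def overfit_gap(train_score, cv_score, tolerance=0.02):
    """Return (gap, flagged) for a train vs cross-validation comparison.

    gap is train_score minus cv_score. flagged is True when the gap exceeds
    tolerance. The default tolerance is an arbitrary illustrative threshold.
    """
    gap = train_score - cv_score
    return gap, gap > tolerance


# Placeholder scores, not results from this notebook.
gap, flagged = overfit_gap(train_score=0.99, cv_score=0.96)
print(f"Gap: {gap:.4f}, possible overfitting: {flagged}")
```

In the notebook itself this would be called as `overfit_gap(train_score, cv_score)` with the values computed in the cell above.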


In [None]:
joblib.dump(best_rf, '../models/random_forest_tuned.pkl')


In [None]:
results = pd.DataFrame({
    "Model": ["Logistic Regression", "Random Forest (Baseline)", "Random Forest (Tuned)"],
    "CV Accuracy": [
        lr_cv_scores.mean(),
        rf_cv_scores.mean(),
        grid_search.best_score_
    ]
})

results

We validated both models using cross-validation and optimized the Random Forest using hyperparameter tuning. We tuned only the Random Forest because it is more likely to benefit from tuning. Going forward, we will use the tuned Random Forest model.