# 🏎️ Model 1: Random Forest - Multi-Target Prediction

**Goal:** Predict multiple F1 race outcomes using Random Forest

**Targets:**
- 🏆 **Classification**: win, podium, points_finish, top5 (binary 0/1)
- 📊 **Regression**: position (1-20)

**Process:**
1. Train baseline models (Classifier + Regressor)
2. Hyperparameter optimization for both
3. Compare baseline vs optimized
4. Evaluate and visualize all targets
5. Save best models

## Step 1 : Import Libraries

In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import joblib
import os
from datetime import datetime

import warnings
warnings.filterwarnings('ignore')

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    accuracy_score,
    r2_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    make_scorer
)

sns.set_style("darkgrid")
plt.rcParams['figure.figsize'] = (12,6)
plt.rcParams['font.size'] = 10

## Step 2 : Loading Data 

In [35]:
print("Loading processed data")

train_df = pd.read_parquet('../data/processed/train_data_v2.parquet')
test_df = pd.read_parquet('../data/processed/test_data_v2.parquet')

train_weights = np.load('../data/processed/train_weights.npy')
test_weights = np.load('../data/processed/test_weights.npy')

with open('../data/processed/metadata_v2.json', 'r') as f:
    metadata = json.load(f)

print(f"\n Data loaded successfully!")
print(f"   Training samples: {len(train_df)} (2024 season)")
print(f"   Test samples: {len(test_df)} (2025 season)")
print(f"   Features: {len(metadata['feature_columns'])}")
print(f"   Target columns: {metadata['target_columns']}")    

Loading processed data

 Data loaded successfully!
   Training samples: 460 (2024 season)
   Test samples: 385 (2025 season)
   Features: 72
   Target columns: ['win', 'podium', 'points_finish', 'top5', 'position']


## Step 3 : Preparing Features and Targets

In [36]:
feature_cols = metadata['feature_columns']

classification_targets = ['win','podium','points_finish','top5']
regression_target = 'position'
all_targets = classification_targets + [regression_target]

X_train = train_df[feature_cols]
X_test = test_df[feature_cols]

y_train_class = train_df[classification_targets]
y_test_class = test_df[classification_targets]

y_train_reg = train_df[regression_target]
y_test_reg = test_df[regression_target]

print(f"\n Data prepared for training:")
print(f"   Features shape: {X_train.shape}")
print(f"   Classification targets: {classification_targets}")
print(f"   Regression target: {regression_target}")

print(f"\n Target Distribution (Training):")
for target in classification_targets:
    pos_rate = train_df[target].mean() * 100
    print(f"   {target}: {pos_rate:.1f}% positive")
print(f"   position: mean={y_train_reg.mean():.2f}, std={y_train_reg.std():.2f}")


 Data prepared for training:
   Features shape: (460, 72)
   Classification targets: ['win', 'podium', 'points_finish', 'top5']
   Regression target: position

 Target Distribution (Training):
   win: 5.2% positive
   podium: 15.7% positive
   points_finish: 50.9% positive
   top5: 26.1% positive
   position: mean=10.38, std=5.81
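
The imbalance in `win` and `podium` is what the `class_weight='balanced'` setting used in Step 5 is meant to offset. scikit-learn documents the balanced weight of a class as `n_samples / (n_classes * class_count)`. The sketch below applies that formula by hand. The count of 24 winning rows is an assumption inferred from 5.2% of 460; the notebook does not print it. The helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(counts):
    """Weight per class: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Assumed counts for the 'win' target: 24 positives out of 460 rows (~5.2%).
win_counts = {0: 436, 1: 24}
win_weights = balanced_class_weights(win_counts)

for label, weight in win_weights.items():
    print(f"class {label}: weight = {weight:.3f}")
```

Under this assumption a winning row counts roughly 18 times as much as a non-winning row, so both classes carry the same total weight. When `sample_weight` is also passed to `fit`, scikit-learn multiplies it with these class weights.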


## Step 4 : Evaluation Functions

In [37]:
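def calculate_classification_metrics(y_true,y_pred,y_prob=None,weights=None,target_name="Target"):
    # Weighted binary metrics for one target; target_name is accepted but not used here.
    # AUC is only computed when positive-class probabilities (y_prob) are supplied.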
    acc = accuracy_score(y_true, y_pred, sample_weight = weights)
    prec = precision_score(y_true, y_pred, sample_weight = weights)
    rec = recall_score(y_true,y_pred,sample_weight = weights)
    f1 = f1_score(y_true,y_pred,sample_weight = weights)

    auc = roc_auc_score(y_true, y_prob,sample_weight = weights) if y_prob is not None else None

    return {
        'Accuracy' : acc,
        'Precision' : prec,
        'Recall' : rec,
        'F1' : f1,
        'AUC' : auc
    }


def calculate_regression_metrics(y_true, y_pred,weights=None):

    mae = mean_absolute_error(y_true, y_pred, sample_weight = weights)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred, sample_weight=weights))
    r2 = r2_score(y_true, y_pred,sample_weight = weights)

    # Note: the within-N shares below are unweighted, unlike MAE, RMSE and R2 above.
    within_1 = np.mean(np.abs(y_true - y_pred) <= 1) * 100
    within_2 = np.mean(np.abs(y_true - y_pred) <= 2) * 100
    within_3 = np.mean(np.abs(y_true - y_pred) <= 3) * 100
    
    return {
        'MAE': mae,
        'RMSE': rmse,
        'R2': r2,
        'Within_1': within_1,
        'Within_2': within_2,
        'Within_3': within_3
    }


def print_classification_metrics(metrics, dataset_name="Dataset"):

    print(f"\n{'='*70}")
    print(f" {dataset_name.upper()}")
    print(f"{'='*70}")
    print(f"   Accuracy:  {metrics['Accuracy']:.3f} ({metrics['Accuracy']*100:.1f}%)")
    print(f"   Precision: {metrics['Precision']:.3f}")
    print(f"   Recall:    {metrics['Recall']:.3f}")
    print(f"   F1 Score:  {metrics['F1']:.3f}")
    if metrics['AUC'] is not None:
        print(f"   ROC AUC:   {metrics['AUC']:.3f}")
    print(f"{'='*70}")


def print_regression_metrics(metrics, dataset_name="Dataset"):
    
    print(f"\n{'='*70}")
    print(f" {dataset_name.upper()}")
    print(f"{'='*70}")
    print(f"   MAE:  {metrics['MAE']:.3f} positions")
    print(f"   RMSE: {metrics['RMSE']:.3f} positions")
    print(f"   R²:   {metrics['R2']:.3f}")
    print(f"\n🎯 ACCURACY:")
    print(f"   Within 1 position:  {metrics['Within_1']:.1f}%")
    print(f"   Within 2 positions: {metrics['Within_2']:.1f}%")
    print(f"   Within 3 positions: {metrics['Within_3']:.1f}%")
    print(f"{'='*70}")

print(" Evaluation functions defined!")

 Evaluation functions defined!
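
To make the role of `weights` concrete, the sketch below recomputes weighted precision, recall, and F1 by hand on a tiny made-up example. The labels, predictions, and weights are illustrative only, and the helper name `weighted_prf` is hypothetical; the notebook itself relies on scikit-learn's `sample_weight` argument.

```python
def weighted_prf(y_true, y_pred, weights):
    """Weighted precision, recall and F1 for the positive class (label 1)."""
    rows = list(zip(y_true, y_pred, weights))
    tp = sum(w for t, p, w in rows if t == 1 and p == 1)  # weighted true positives
    fp = sum(w for t, p, w in rows if t == 0 and p == 1)  # weighted false positives
    fn = sum(w for t, p, w in rows if t == 1 and p == 0)  # weighted false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Made-up data: the first sample is a correctly predicted positive with double weight.
y_true = [1, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0]
weights = [2.0, 1.0, 1.0, 1.0, 1.0]

precision, recall, f1 = weighted_prf(y_true, y_pred, weights)
unweighted = weighted_prf(y_true, y_pred, [1.0] * len(y_true))

print(f"weighted:   precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"unweighted: precision={unweighted[0]:.3f} recall={unweighted[1]:.3f} f1={unweighted[2]:.3f}")
```

Doubling the weight of one correct positive prediction raises all three scores relative to the unweighted case, which is why weighted and unweighted metrics should not be compared directly.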


# PART A : CLASSIFICATION MODELS (win, podium, points_finish, top5)

## Step 5 : Baseline classification models

In [38]:
baseline_class_params = {
    'n_estimators': 100,
    'max_depth': 15,
    'min_samples_split': 10,
    'min_samples_leaf': 4,
    'max_features': 'sqrt',
    'class_weight': 'balanced',
    'random_state': 42,
    'n_jobs': -1
}

print("\n Baseline Classification Parameters:")
for param, value in baseline_class_params.items():
    if param not in ['random_state', 'n_jobs']:
        print(f"   {param}: {value}")

baseline_class_models = {}
baseline_class_results = {}

for target in classification_targets:
    print(f"\n  Training baseline classifier for '{target}'...")
    
    model = RandomForestClassifier(**baseline_class_params)
    model.fit(X_train, y_train_class[target], sample_weight=train_weights)
    
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    train_prob = model.predict_proba(X_train)[:, 1]
    test_prob = model.predict_proba(X_test)[:, 1]
    
    train_metrics = calculate_classification_metrics(
        y_train_class[target], train_pred, train_prob, train_weights, target
    )
    test_metrics = calculate_classification_metrics(
        y_test_class[target], test_pred, test_prob, test_weights, target
    )
    
    baseline_class_models[target] = model
    baseline_class_results[target] = {
        'train_metrics': train_metrics,
        'test_metrics': test_metrics,
        'train_pred': train_pred,
        'test_pred': test_pred,
        'train_prob': train_prob,
        'test_prob': test_prob
    }
    
    print(f"     {target}: Train Accuracy = {train_metrics['Accuracy']*100:.1f}%, "
          f"F1 = {train_metrics['F1']*100:.1f}%, AUC = {train_metrics['AUC']*100:.1f}%, "
          f"Precision = {train_metrics['Precision']*100:.1f}%, Recall = {train_metrics['Recall']*100:.1f}%")

    print(f"     {target}: Test Accuracy = {test_metrics['Accuracy']*100:.1f}%, "
          f"F1 = {test_metrics['F1']*100:.1f}%, AUC = {test_metrics['AUC']*100:.1f}%, "
          f"Precision = {test_metrics['Precision']*100:.1f}%, Recall = {test_metrics['Recall']*100:.1f}%")

print("\n✅ All baseline classification models trained!")


 Baseline Classification Parameters:
   n_estimators: 100
   max_depth: 15
   min_samples_split: 10
   min_samples_leaf: 4
   max_features: sqrt
   class_weight: balanced

  Training baseline classifier for 'win'...
     win: Train Accuracy = 97.4%, F1 = 80.0%, AUC = 99.8%, Precision = 66.7%, Recall = 100.0%
     win: Test Accuracy = 94.5%, F1 = 52.3%, AUC = 95.8%, Precision = 47.7%, Recall = 57.9%

  Training baseline classifier for 'podium'...
     podium: Train Accuracy = 91.4%, F1 = 78.1%, AUC = 98.4%, Precision = 65.7%, Recall = 96.1%
     podium: Test Accuracy = 90.9%, F1 = 75.3%, AUC = 95.2%, Precision = 65.4%, Recall = 88.8%

  Training baseline classifier for 'points_finish'...
     points_finish: Train Accuracy = 91.8%, F1 = 92.0%, AUC = 97.9%, Precision = 91.8%, Recall = 92.2%
     points_finish: Test Accuracy = 78.4%, F1 = 79.3%, AUC = 84.7%, Precision = 79.4%, Recall = 79.2%

  Training baseline classifier for 'top5'...
     top5: Train Accuracy = 93.3%, F1 = 88.3%, AUC =


### **Baseline Classification Results**

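- This cell trains the four baseline classifiers. The `win` and `points_finish` models show clear **overfitting**, while the `podium` model generalizes noticeably better.

- On the 2024 training data the `win` model reaches 100% recall and an 80.0% F1, but on the unseen 2025 test set its F1 falls to 52.3% and its recall to 57.9%. The `points_finish` model's F1 drops from 92.0% to 79.3%. The `podium` model loses less than 3 F1 points (78.1% to 75.3%).

- The large train-test gaps for `win` and `points_finish` suggest these models partly "memorized" the 2024 season instead of learning general patterns. Because only 5.2% of training rows are wins, the high `win` accuracy largely reflects class imbalance, so F1, precision, and recall are the more informative metrics for that target.

- The next step, hyperparameter tuning, aims to reduce this overfitting and improve performance on unseen races.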
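
The size of the gap can be read directly off the printed scores. The sketch below subtracts test F1 from training F1 for the three targets whose output is complete above; `top5` is left out because its output line is truncated. The 10-point threshold is an arbitrary illustration, not a rule used by the notebook.

```python
# F1 scores (percent) copied from the baseline output above.
baseline_f1 = {
    "win": {"train": 80.0, "test": 52.3},
    "podium": {"train": 78.1, "test": 75.3},
    "points_finish": {"train": 92.0, "test": 79.3},
}

GAP_THRESHOLD = 10.0  # assumed cut-off in percentage points, for illustration only

# Train-minus-test F1 gap per target, rounded to one decimal place.
f1_gaps = {name: round(s["train"] - s["test"], 1) for name, s in baseline_f1.items()}

for name, gap in f1_gaps.items():
    flag = "large gap" if gap > GAP_THRESHOLD else "small gap"
    print(f"{name}: train-test F1 gap = {gap:.1f} points ({flag})")
```

Tracking the same gap after tuning gives a simple check on whether the optimized models overfit less than the baselines.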

## Step 6 : Optimize Classification models