# XGBoost Model Development - F1 Race Prediction

In this notebook, we will develop an XGBoost model for F1 race predictions.

## Contents:
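1. Data loading and preprocessing
2. Baseline XGBoost model
3. Cross-validation model evaluation
4. Hyperparameter optimization
5. Training with the optimized model
6. Feature importance analysis
7. Prediction and residual analysis
8. Model saving and results
9. XGBoost ranking model (learning to rank)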


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
import xgboost as xgb
import optuna
from optuna import Trial, visualization
import warnings
warnings.filterwarnings('ignore')

# Initialize plotly for notebooks
import plotly.io as pio
pio.templates.default = "plotly_white"


In [2]:
# Load the dataset
print("Loading dataset...")
df = pd.read_csv('../data/processed/xgboost/final_model_dataset.csv')
print(f"Dataset shape: {df.shape}")
print(f"Number of columns: {len(df.columns)}")

# Display first few rows
df.head()


Loading dataset...
Dataset shape: (3318, 52)
Number of columns: 52


Unnamed: 0,year,round,country,driver_nationality,constructor_nationality,grid_position,is_2017_plus_era,is_2022_plus_era,is_covid_season_2020,has_sprint_format,...,is_high_humidity,is_low_humidity,is_windy_conditions,has_fastf1_data,race_size,typology,enneagram,heaight,why_choose_number_categorical,target_position
0,2018,1,Australia,German,Italian,3.0,1,0,0,0,...,0.0,0.0,0.0,True,20,INFJ,9w1,175,Karting Success,1
1,2018,1,Australia,British,German,1.0,1,0,0,0,...,0.0,0.0,0.0,True,20,ISFP,3w4,174,Personal Connection,2
2,2018,1,Australia,Finnish,Italian,2.0,1,0,0,0,...,0.0,0.0,0.0,True,20,ISTP,5w4,175,Karting Success,3
3,2018,1,Australia,Australian,Austrian,5.0,1,0,0,0,...,0.0,0.0,0.0,True,20,ENFP,7w6,179,Karting Success,4
4,2018,1,Australia,Spanish,British,11.0,1,0,0,0,...,0.0,0.0,0.0,True,20,ENTJ,1w2,171,Karting Success,5


In [3]:
# Get dataset information
print("Dataset information:")
df.info()


Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3318 entries, 0 to 3317
Data columns (total 52 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   year                             3318 non-null   int64  
 1   round                            3318 non-null   int64  
 2   country                          3318 non-null   object 
 3   driver_nationality               3318 non-null   object 
 4   constructor_nationality          3318 non-null   object 
 5   grid_position                    3318 non-null   float64
 6   is_2017_plus_era                 3318 non-null   int64  
 7   is_2022_plus_era                 3318 non-null   int64  
 8   is_covid_season_2020             3318 non-null   int64  
 9   has_sprint_format                3318 non-null   int64  
 10  driver_career_total_points       3318 non-null   float64
 11  driver_career_avg_points         3318 non-null   float64
 12 

In [4]:
# Check for missing values
print("Missing values analysis:")
missing_info = df.isnull().sum()
missing_columns = missing_info[missing_info > 0]
if len(missing_columns) > 0:
    print(missing_columns.sort_values(ascending=False))
else:
    print("No missing values found!")


Missing values analysis:
main_compound                    956
driver_abbreviation              952
is_intermediate                  952
is_windy_conditions              952
is_low_humidity                  952
is_high_humidity                 952
is_high_track_temp               952
is_cold_conditions               952
is_hot_conditions                952
driver_number                    952
is_wet                           952
is_medium_primary                952
is_hard_primary                  952
is_soft_primary                  952
why_choose_number_categorical    175
typology                          87
enneagram                         87
temp_differential                 40
wind_speed                        40
humidity                          40
track_temp                        40
air_temp                          40
dtype: int64
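
The gaps above fall into two groups: categorical columns (for example `main_compound`, `typology`) and numeric columns (weather readings and flags). A minimal sketch of one way to treat them, assuming a placeholder label `"Unknown"` and a helper named `fill_categorical_missing` (both hypothetical, not part of the notebook): categorical gaps become their own category, while numeric gaps stay as NaN because XGBoost handles missing numeric values natively.

```python
import numpy as np
import pandas as pd

def fill_categorical_missing(frame, placeholder="Unknown"):
    """Return a copy where missing values in non-numeric columns become `placeholder`."""
    out = frame.copy()
    # Everything that is neither numeric nor boolean is treated as categorical
    cat_cols = out.select_dtypes(exclude=[np.number, "bool"]).columns
    for col in cat_cols:
        out[col] = out[col].fillna(placeholder)
    return out

# Toy frame mimicking two of the affected columns (values are made up)
toy = pd.DataFrame({
    "main_compound": ["SOFT", None, "HARD"],
    "track_temp": [31.5, np.nan, 28.0],
})
filled = fill_categorical_missing(toy)
print(filled)
```

The original frame is left unchanged, so the effect of the placeholder can be compared against the raw data before encoding.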


In [5]:
# Separate numeric and categorical columns
numeric_columns = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_columns = df.select_dtypes(exclude=[np.number]).columns.tolist()

print(f"Numeric columns ({len(numeric_columns)}):")
print(numeric_columns[:10], "...")  # Show first 10

print(f"\nCategorical columns ({len(categorical_columns)}):")
print(categorical_columns)


Numeric columns (43):
['year', 'round', 'grid_position', 'is_2017_plus_era', 'is_2022_plus_era', 'is_covid_season_2020', 'has_sprint_format', 'driver_career_total_points', 'driver_career_avg_points', 'driver_career_avg_position'] ...

Categorical columns (9):
['country', 'driver_nationality', 'constructor_nationality', 'driver_abbreviation', 'main_compound', 'has_fastf1_data', 'typology', 'enneagram', 'why_choose_number_categorical']


In [6]:
# Determine target variable
# According to dataset_info.txt, target is target_position (race finish position)
target_column = 'target_position'

print(f"Target variable: {target_column}")
print("Target distribution:")
df[target_column].describe()


Target variable: target_position
Target distribution:


count    3318.000000
mean       10.494274
std         5.764167
min         1.000000
25%         5.250000
50%        10.000000
75%        15.000000
max        20.000000
Name: target_position, dtype: float64

In [None]:
# Prepare the dataset
# Encode categorical variables
le_dict = {}
for col in categorical_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    le_dict[col] = le

# Split features and target
feature_columns = [col for col in df.columns if col != target_column]
X = df[feature_columns]
y = df[target_column]

print(f"Feature matrix shape: {X.shape}")
print(f"Target variable shape: {y.shape}")


Feature matrix shape: (3318, 51)
Target variable shape: (3318,)


In [None]:
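# Split dataset into training and test sets
# Rows are split at random (stratified by finishing position), so drivers
# from the same race can land on both sides of the split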
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")


Training set size: (2654, 51)
Test set size: (664, 51)


## 2. Baseline XGBoost Model

First, let's create a baseline model with default parameters.


In [None]:
# Create baseline XGBoost model
baseline_model = xgb.XGBRegressor(
    random_state=42,
    verbosity=1
)

print("Training baseline model...")
baseline_model.fit(X_train, y_train)


Training baseline model...


In [10]:
# Evaluate baseline model performance
y_pred_baseline = baseline_model.predict(X_test)

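# "Top-k accuracy" here is a tolerance metric: the share of predictions
# that land within k positions of the actual finishing position
def calculate_topk_accuracy(y_true, y_pred, k):
    """Return the fraction of predictions within k positions of the actual result."""
    correct = np.abs(y_true - y_pred) <= k
    return correct.mean()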

baseline_metrics = {
    'RMSE': np.sqrt(mean_squared_error(y_test, y_pred_baseline)),
    'MAE': mean_absolute_error(y_test, y_pred_baseline),
    'R2': r2_score(y_test, y_pred_baseline),
    'Top-3 Accuracy': calculate_topk_accuracy(y_test, y_pred_baseline, 3),
    'Top-5 Accuracy': calculate_topk_accuracy(y_test, y_pred_baseline, 5)
}

print("Baseline Model Performance:")
for metric, value in baseline_metrics.items():
    if 'Accuracy' in metric:
        print(f"{metric}: {value:.2%}")
    else:
        print(f"{metric}: {value:.4f}")


Baseline Model Performance:
RMSE: 4.2227
MAE: 2.9705
R2: 0.4614
Top-3 Accuracy: 64.46%
Top-5 Accuracy: 81.17%
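
`calculate_topk_accuracy` measures tolerance (predictions within k positions), not whether the predicted top k matches the actual top k. The sketch below contrasts the two ideas on made-up positions. `topk_hit_rate` is a hypothetical helper added for illustration and is not used elsewhere in the notebook.

```python
import numpy as np

def within_k_accuracy(y_true, y_pred, k):
    """Share of predictions within k positions of the actual result (the notebook's metric)."""
    return float((np.abs(np.asarray(y_true) - np.asarray(y_pred)) <= k).mean())

def topk_hit_rate(actual_positions, predicted_positions, k=3):
    """Share of the actual top-k finishers that were also predicted inside the top k."""
    actual = np.asarray(actual_positions)
    predicted = np.asarray(predicted_positions)
    actual_top = actual <= k
    predicted_top = predicted <= k
    return float((actual_top & predicted_top).sum() / actual_top.sum())

# Made-up six-driver race: actual finishing positions and predicted positions
actual = [1, 2, 3, 4, 5, 6]
predicted = [2, 1, 5, 3, 4, 6]

tolerance_score = within_k_accuracy(actual, predicted, 1)
podium_score = topk_hit_rate(actual, predicted, 3)
print(tolerance_score, podium_score)
```

The two numbers answer different questions, so it helps to state which one is meant whenever "Top-3 accuracy" is reported.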


## 3. Cross-Validation Model Evaluation


In [11]:
# Evaluate model performance using cross-validation
cv_scores = cross_val_score(
    baseline_model, X_train, y_train,
    cv=5, scoring='neg_root_mean_squared_error'
)

print(f"CV RMSE Scores: {cv_scores}")
print(f"CV RMSE Mean: {-cv_scores.mean():.4f}")
print(f"CV RMSE Std: {cv_scores.std():.4f}")


CV RMSE Scores: [-4.25705194 -4.08816957 -4.650846   -4.36111307 -4.28251839]
CV RMSE Mean: 4.3279
CV RMSE Std: 0.1844


## 4. Hyperparameter Optimization

Let's use Optuna for hyperparameter optimization.

In [12]:
# Optuna objective function
def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 10),
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 10),
        'random_state': 42,
        'verbosity': 0
    }
    
    model = xgb.XGBRegressor(**params)
    
    # NOTE: trials are scored on X_test, so the test set also steers the
    # hyperparameter search and the test metrics reported later are optimistic
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    # Share of predictions within 3 positions of the actual result
    top3_accuracy = calculate_topk_accuracy(y_test, y_pred, 3)
    
    # The study uses direction='minimize', so return the negated accuracy
    return -top3_accuracy
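
Because the objective above scores every trial on `X_test`, the test set influences the search. A minimal sketch of an alternative, assuming a separate validation split carved out of the training data: `make_objective`, `build_model`, and `MeanModel` are hypothetical names, and `build_model(trial)` stands for any function that returns an unfitted estimator (for example an `xgb.XGBRegressor` configured from `trial.suggest_*` calls).

```python
import numpy as np

def within_k_accuracy(y_true, y_pred, k):
    """Share of predictions within k positions of the actual result."""
    return float((np.abs(np.asarray(y_true) - np.asarray(y_pred)) <= k).mean())

def make_objective(build_model, X_fit, y_fit, X_val, y_val, k=3):
    """Build an Optuna-style objective that is scored on a validation split."""
    def objective(trial):
        model = build_model(trial)        # estimator configured from the trial
        model.fit(X_fit, y_fit)           # fit on the reduced training data
        y_pred = model.predict(X_val)     # score on held-out validation rows
        return -within_k_accuracy(y_val, y_pred, k)  # negated for direction='minimize'
    return objective

class MeanModel:
    """Stand-in estimator (illustration only): always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self
    def predict(self, X):
        return np.full(len(X), self.mean_)

# Tiny demo with made-up data; `trial` is unused by the stand-in model
demo_objective = make_objective(lambda trial: MeanModel(),
                                [[0]] * 4, [1, 2, 3, 4], [[0]] * 2, [2, 10])
demo_value = demo_objective(None)
print(demo_value)
```

With the notebook's variables, the validation split could come from `train_test_split(X_train, y_train, test_size=0.2, random_state=42)`, leaving `X_test` untouched until the final evaluation.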


In [13]:
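# Create Optuna study
# load_if_exists=True resumes the study stored in the SQLite file, so trials
# from earlier runs (possibly scored with a different objective) still count
# toward best_value and best_params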
study = optuna.create_study(
    direction='minimize',  # Minimize because we return negative Top-3 accuracy
    study_name='xgboost_f1_prediction',
    storage='sqlite:///xgboost_optimization.db',
    load_if_exists=True
)

print("Starting Optuna optimization (optimizing for Top-3 Accuracy)...")
study.optimize(objective, n_trials=50, timeout=600)  # 50 trials or 10 minutes

print("\nOptimization completed!")
print("Best parameters:")
print(study.best_params)
print(f"Best Top-3 Accuracy: {-study.best_value:.2%}")  # Convert back to positive


[I 2025-09-29 22:01:49,982] Using an existing study with name 'xgboost_f1_prediction' instead of creating a new one.


Starting Optuna optimization (optimizing for Top-3 Accuracy)...


[I 2025-09-29 22:01:50,412] Trial 50 finished with value: -0.6581325301204819 and parameters: {'max_depth': 4, 'learning_rate': 0.017738827359057557, 'n_estimators': 422, 'min_child_weight': 6, 'subsample': 0.9324711959678738, 'colsample_bytree': 0.5752962075577248, 'reg_alpha': 4.513416896838153, 'reg_lambda': 2.8509281398258337}. Best is trial 46 with value: -0.8433734939759037.
[I 2025-09-29 22:01:51,094] Trial 51 finished with value: -0.6566265060240963 and parameters: {'max_depth': 5, 'learning_rate': 0.026772684657947236, 'n_estimators': 578, 'min_child_weight': 8, 'subsample': 0.9702924734806871, 'colsample_bytree': 0.634512350204665, 'reg_alpha': 6.002455866085112, 'reg_lambda': 4.249965498448076}. Best is trial 46 with value: -0.8433734939759037.
[I 2025-09-29 22:01:51,747] Trial 52 finished with value: -0.6581325301204819 and parameters: {'max_depth': 4, 'learning_rate': 0.012495365157665618, 'n_estimators': 685, 'min_child_weight': 8, 'subsample': 0.8094799083621989, 'colsam


Optimization completed!
Best parameters:
{'max_depth': 5, 'learning_rate': 0.019850452680855984, 'n_estimators': 606, 'min_child_weight': 8, 'subsample': 0.8167221884669023, 'colsample_bytree': 0.7562065999812263, 'reg_alpha': 6.6853646799780435, 'reg_lambda': 3.10940013512786}
Best Top-3 Accuracy: 84.34%


In [14]:
# Visualize optimization results
fig = optuna.visualization.plot_optimization_history(study)
fig.show()

fig = optuna.visualization.plot_param_importances(study)
fig.show()


## 5. Training with Optimized Model


In [15]:
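# Create model with best parameters
# study.best_params holds only the tuned values, so the fixed settings
# from the objective (random_state, verbosity) are passed again here
best_params = study.best_params
optimized_model = xgb.XGBRegressor(**best_params, random_state=42, verbosity=0)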

print("Training optimized model...")
optimized_model.fit(X_train, y_train)


Training optimized model...


In [16]:
# Evaluate optimized model performance
y_pred_optimized = optimized_model.predict(X_test)

optimized_metrics = {
    'RMSE': np.sqrt(mean_squared_error(y_test, y_pred_optimized)),
    'MAE': mean_absolute_error(y_test, y_pred_optimized),
    'R2': r2_score(y_test, y_pred_optimized),
    'Top-3 Accuracy': calculate_topk_accuracy(y_test, y_pred_optimized, 3),
    'Top-5 Accuracy': calculate_topk_accuracy(y_test, y_pred_optimized, 5)
}

print("Optimized Model Performance:")
for metric, value in optimized_metrics.items():
    if 'Accuracy' in metric:
        print(f"{metric}: {value:.2%}")
    else:
        print(f"{metric}: {value:.4f}")

print("\n" + "="*60)
print("COMPARISON: Baseline vs Optimized")
print("="*60)
for metric in baseline_metrics:
    baseline_val = baseline_metrics[metric]
    optimized_val = optimized_metrics[metric]
    
    if 'Accuracy' in metric or 'R2' in metric:
        # For accuracy and R2, higher is better
        improvement = optimized_val - baseline_val
        symbol = "↑" if improvement > 0 else "↓"
        print(f"{metric:20s}: Baseline={baseline_val:.2%}, Optimized={optimized_val:.2%}, Δ={improvement:+.2%} {symbol}")
    else:
        # For RMSE and MAE, lower is better
        improvement = baseline_val - optimized_val
        symbol = "↑" if improvement > 0 else "↓"
        print(f"{metric:20s}: Baseline={baseline_val:.4f}, Optimized={optimized_val:.4f}, Δ={improvement:+.4f} {symbol}")


Optimized Model Performance:
RMSE: 3.9008
MAE: 2.7751
R2: 0.5404
Top-3 Accuracy: 65.06%
Top-5 Accuracy: 84.34%

COMPARISON: Baseline vs Optimized
RMSE                : Baseline=4.2227, Optimized=3.9008, Δ=+0.3219 ↑
MAE                 : Baseline=2.9705, Optimized=2.7751, Δ=+0.1955 ↑
R2                  : Baseline=46.14%, Optimized=54.04%, Δ=+7.90% ↑
Top-3 Accuracy      : Baseline=64.46%, Optimized=65.06%, Δ=+0.60% ↑
Top-5 Accuracy      : Baseline=81.17%, Optimized=84.34%, Δ=+3.16% ↑


## 6. Feature Importance Analysis


In [17]:
# Get feature importance scores
feature_importance = optimized_model.get_booster().get_score(importance_type='gain')

# Convert to DataFrame
importance_df = pd.DataFrame([
    {'feature': k, 'importance': v}
    for k, v in feature_importance.items()
]).sort_values('importance', ascending=False)

print("Top 20 most important features:")
importance_df.head(20)


Top 20 most important features:


Unnamed: 0,feature,importance
5,grid_position,404.580536
21,driver_track_avg_position,246.224167
23,constructor_track_avg_position,188.487366
16,constructor_career_avg_points,135.388367
11,driver_career_avg_position,81.997437
14,driver_career_podiums,62.831779
10,driver_career_avg_points,62.325756
7,is_covid_season_2020,55.530739
6,is_2022_plus_era,55.327023
9,driver_career_total_points,50.516235


In [18]:
# Feature importance visualization
fig = px.bar(
    importance_df.head(20),
    x='importance',
    y='feature',
    orientation='h',
    title='XGBoost Feature Importance Scores (Top 20)',
    labels={'importance': 'Importance Score (Gain)', 'feature': 'Feature'}
)
fig.update_layout(height=600, showlegend=False)
fig.show()


## 7. Prediction Analysis and Residual Analysis


In [19]:
# Create predictions and actual values comparison
results_df = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred_optimized,
    'Error': y_test - y_pred_optimized
})

print("Prediction results samples:")
results_df.head(10)


Prediction results samples:


Unnamed: 0,Actual,Predicted,Error
2054,15,16.177143,-1.177143
2914,16,7.156647,8.843353
444,5,5.497445,-0.497445
889,10,14.612656,-4.612656
2365,6,8.495016,-2.495016
1007,8,10.373625,-2.373625
2563,5,3.518329,1.481671
1609,10,6.133414,3.866586
1963,4,6.596242,-2.596242
184,5,7.370552,-2.370552


In [20]:
# Residual distribution analysis
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Residual Distribution', 'Error Distribution Histogram')
)

# Residual scatter plot
fig.add_trace(
    go.Scatter(
        x=y_pred_optimized,
        y=results_df['Error'],
        mode='markers',
        opacity=0.5,
        name='Residuals'
    ),
    row=1, col=1
)

# Add horizontal line at 0
fig.add_hline(y=0, line_dash="dash", line_color="red", row=1, col=1)

# Error histogram
fig.add_trace(
    go.Histogram(
        x=results_df['Error'],
        nbinsx=30,
        name='Error Distribution',
        showlegend=False
    ),
    row=1, col=2
)

fig.update_xaxes(title_text="Predicted Position", row=1, col=1)
fig.update_yaxes(title_text="Residual (Error)", row=1, col=1)
fig.update_xaxes(title_text="Error Value", row=1, col=2)
fig.update_yaxes(title_text="Frequency", row=1, col=2)

fig.update_layout(height=500, title_text="Residual Analysis")
fig.show()


In [21]:
# Position comparison plot
fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=results_df['Actual'],
        y=results_df['Predicted'],
        mode='markers',
        opacity=0.5,
        name='Predictions'
    )
)

# Perfect prediction line
fig.add_trace(
    go.Scatter(
        x=[0, 20],
        y=[0, 20],
        mode='lines',
        line=dict(dash='dash', color='red'),
        name='Perfect Prediction'
    )
)

fig.update_layout(
    title='Actual vs Predicted Positions',
    xaxis_title='Actual Position',
    yaxis_title='Predicted Position',
    showlegend=True,
    height=600
)

fig.show()


## 8. Model Saving and Results


In [22]:
# Save the model
import joblib
import os

# Create models directory if it doesn't exist
os.makedirs('../models/xgboost', exist_ok=True)

# Save model and encoders
joblib.dump(optimized_model, '../models/xgboost/xgboost_model.pkl')
joblib.dump(le_dict, '../models/xgboost/label_encoders.pkl')

print("Model saved successfully!")


Model saved successfully!


In [None]:
# Final results summary
print("=== F1 RACE PREDICTION MODEL RESULTS ===")
print(f"\nDataset size: {df.shape}")
print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")

print("\n=== MODEL PERFORMANCE ===")
print("Baseline Model:")
for metric, value in baseline_metrics.items():
    print(f"  {metric}: {value:.4f}")

print("\nOptimized Model:")
for metric, value in optimized_metrics.items():
    print(f"  {metric}: {value:.4f}")

print("\n=== TOP IMPORTANT FEATURES ===")
for i, row in importance_df.head(10).iterrows():
    print(f"  {row['feature']}: {row['importance']:.2f}")

print(f"\nModel file: ../models/xgboost/xgboost_model.pkl")
print("Optimization database: xgboost_optimization.db")

## 9. XGBoost Ranking Model (Learning to Rank)

F1 races are fundamentally a ranking problem: we want to predict the finishing order of the drivers in a race, not just each driver's position in isolation. XGBoost's ranking objective (`rank:pairwise`) is designed for this.

### Why Ranking Model?
- **Pairwise comparison:** Learns which of two drivers in the same race finishes ahead
- **Better for ordinal data:** Only the relative order within a race matters, not the numeric gap between positions. `rank:pairwise` itself does not give extra weight to the front of the field; `rank:ndcg` is the objective that emphasizes the top of the order
- **Race-specific grouping:** Considers all drivers in a race together
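
The pairwise idea can be made concrete with a small metric: the share of driver pairs in a race whose predicted order matches the actual order. This is a sketch on made-up data; `pairwise_order_accuracy` is a hypothetical helper and is not the loss XGBoost optimizes internally.

```python
from itertools import combinations
import numpy as np

def pairwise_order_accuracy(actual_positions, scores):
    """Share of driver pairs whose predicted order matches the actual order.

    Lower position and lower score both mean "finishes ahead".
    """
    actual = np.asarray(actual_positions)
    scores = np.asarray(scores)
    correct = 0
    total = 0
    for i, j in combinations(range(len(actual)), 2):
        if actual[i] == actual[j]:
            continue  # no defined order for this pair
        total += 1
        if (actual[i] < actual[j]) == (scores[i] < scores[j]):
            correct += 1
    return correct / total if total else float("nan")

actual = [1, 2, 3, 4]
perfect = pairwise_order_accuracy(actual, [0.1, 0.2, 0.3, 0.4])
one_swap = pairwise_order_accuracy(actual, [0.2, 0.1, 0.3, 0.4])
print(perfect, one_swap)
```

Swapping the first two drivers breaks exactly one of the six pairs, which shows how the metric degrades gradually rather than all at once.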


In [24]:
# Prepare data for ranking
# We need group information - how many drivers per race
df_full = pd.read_csv('../data/processed/xgboost/final_model_dataset.csv')

# Create race identifier (year + round)
df_full['race_id'] = df_full['year'].astype(str) + '_' + df_full['round'].astype(str)

# Encode categorical variables for ranking model
le_dict_rank = {}
for col in df_full.select_dtypes(exclude=[np.number]).columns:
    if col not in ['race_id', 'target_position']:
        le = LabelEncoder()
        df_full[col] = le.fit_transform(df_full[col])
        le_dict_rank[col] = le

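# Prepare features and target
# XGBRanker treats larger labels as more relevant. Using the finishing position
# directly means the model learns higher scores for worse finishes, which is
# why the evaluation below ranks drivers by ascending score.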
feature_cols_rank = [col for col in df_full.columns if col not in ['target_position', 'race_id']]
X_rank = df_full[feature_cols_rank]
y_rank = df_full['target_position']

print(f"Ranking dataset shape: {X_rank.shape}")
print(f"Number of races: {df_full['race_id'].nunique()}")


Ranking dataset shape: (3318, 52)
Number of races: 166


In [25]:
# Split data by races (not randomly by individual records)
# Get unique race IDs
race_ids = df_full['race_id'].unique()
np.random.seed(42)
np.random.shuffle(race_ids)

# 80-20 split
split_idx = int(len(race_ids) * 0.8)
train_race_ids = race_ids[:split_idx]
test_race_ids = race_ids[split_idx:]

# Create train/test masks
train_mask = df_full['race_id'].isin(train_race_ids)
test_mask = df_full['race_id'].isin(test_race_ids)

X_train_rank = X_rank[train_mask]
y_train_rank = y_rank[train_mask]
X_test_rank = X_rank[test_mask]
y_test_rank = y_rank[test_mask]

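# Create group arrays (number of drivers per race)
# sort=False keeps the races in row order, which must match the row order of
# X_train_rank / X_test_rank (assumes each race's rows are contiguous in the CSV);
# the default sort=True orders race_id strings lexicographically ('2018_10' before '2018_2')
train_groups = df_full[train_mask].groupby('race_id', sort=False).size().values
test_groups = df_full[test_mask].groupby('race_id', sort=False).size().values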

print(f"Training races: {len(train_race_ids)}")
print(f"Test races: {len(test_race_ids)}")
print(f"Training samples: {len(X_train_rank)}")
print(f"Test samples: {len(X_test_rank)}")
print(f"Training groups: {len(train_groups)}")
print(f"Test groups: {len(test_groups)}")
print(f"Average drivers per race: {train_groups.mean():.1f}")


Training races: 132
Test races: 34
Training samples: 2638
Test samples: 680
Training groups: 132
Test groups: 34
Average drivers per race: 20.0
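
XGBoost reads the group array as consecutive block sizes in row order, so the sizes must be listed in the same order as the rows of the feature matrix. The sketch below uses made-up race IDs to show how the default `groupby` sort orders string IDs lexicographically, which is worth checking whenever race sizes differ.

```python
import pandas as pd

# Toy frame in row order: race "2018_2" (3 drivers) comes before "2018_10" (2 drivers)
toy = pd.DataFrame({"race_id": ["2018_2"] * 3 + ["2018_10"] * 2})

# Default sort=True: groups ordered by the string key, "2018_10" first
sorted_sizes = toy.groupby("race_id").size().values

# sort=False: groups ordered by first appearance, matching the row order
row_order_sizes = toy.groupby("race_id", sort=False).size().values

print(sorted_sizes)
print(row_order_sizes)
```

Only the second array lines up with the rows as they appear in the frame.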


In [26]:
# Create XGBoost Ranker model
ranker_model = xgb.XGBRanker(
    objective='rank:pairwise',
    learning_rate=0.1,
    n_estimators=300,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    verbosity=1
)

print("Training XGBoost Ranker...")
ranker_model.fit(
    X_train_rank, 
    y_train_rank,
    group=train_groups,
    verbose=True
)

print("\n✓ Ranking model trained!")


Training XGBoost Ranker...

✓ Ranking model trained!
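
XGBoost's ranking objectives interpret larger labels as more relevant, whereas the cell above passes the finishing position directly. A hedged sketch of an alternative labeling on made-up data, where the column name `relevance` is an assumption: each driver's label is the worst position in the race minus the driver's own position, so the winner receives the largest label.

```python
import pandas as pd

# Toy results for two races (made-up values)
toy = pd.DataFrame({
    "race_id": ["A", "A", "A", "B", "B"],
    "target_position": [1, 2, 3, 2, 1],
})

# Worst (largest) finishing position within each race, repeated per row
worst = toy.groupby("race_id")["target_position"].transform("max")

# Winner gets the largest label, last place gets 0
toy["relevance"] = worst - toy["target_position"]
print(toy)
```

Training on `relevance` would flip the score convention: a higher predicted score would mean a better finish, so ranks would be taken over descending scores.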


In [27]:
# Evaluate ranking model
y_pred_ranker = ranker_model.predict(X_test_rank)

# For ranking, we need to evaluate within each race group
test_df = df_full[test_mask].copy()
test_df['predicted_score'] = y_pred_ranker

# Calculate metrics per race and then average
def evaluate_ranking_per_race(group):
    """Evaluate ranking within a single race"""
    actual = group['target_position'].values
    predicted_scores = group['predicted_score'].values
    
    # Convert scores to ranks (lower score = better rank)
    predicted_ranks = predicted_scores.argsort().argsort() + 1
    
    # Calculate metrics
    top3_acc = calculate_topk_accuracy(actual, predicted_ranks, 3)
    top5_acc = calculate_topk_accuracy(actual, predicted_ranks, 5)
    mae = np.abs(actual - predicted_ranks).mean()
    
    return pd.Series({
        'top3_accuracy': top3_acc,
        'top5_accuracy': top5_acc,
        'mae': mae
    })

# Evaluate per race
race_metrics = test_df.groupby('race_id').apply(evaluate_ranking_per_race)

ranker_metrics = {
    'Top-3 Accuracy': race_metrics['top3_accuracy'].mean(),
    'Top-5 Accuracy': race_metrics['top5_accuracy'].mean(),
    'MAE': race_metrics['mae'].mean()
}

print("XGBoost Ranker Performance:")
for metric, value in ranker_metrics.items():
    if 'Accuracy' in metric:
        print(f"{metric}: {value:.2%}")
    else:
        print(f"{metric}: {value:.4f}")


XGBoost Ranker Performance:
Top-3 Accuracy: 71.32%
Top-5 Accuracy: 84.41%
MAE: 2.9735


In [28]:
# Compare all three models
print("=" * 80)
print("FINAL MODEL COMPARISON")
print("=" * 80)

comparison_df = pd.DataFrame({
    'Model': ['Baseline Regressor', 'Optimized Regressor', 'XGBoost Ranker'],
    'Top-3 Accuracy': [
        baseline_metrics.get('Top-3 Accuracy', 0),
        optimized_metrics.get('Top-3 Accuracy', 0),
        ranker_metrics['Top-3 Accuracy']
    ],
    'Top-5 Accuracy': [
        baseline_metrics.get('Top-5 Accuracy', 0),
        optimized_metrics.get('Top-5 Accuracy', 0),
        ranker_metrics['Top-5 Accuracy']
    ],
    'MAE': [
        baseline_metrics.get('MAE', 0),
        optimized_metrics.get('MAE', 0),
        ranker_metrics['MAE']
    ]
})

print("\n")
print(comparison_df.to_string(index=False))

# Visualize comparison
fig = go.Figure()

for metric in ['Top-3 Accuracy', 'Top-5 Accuracy']:
    fig.add_trace(go.Bar(
        name=metric,
        x=comparison_df['Model'],
        y=comparison_df[metric] * 100,
        text=[f"{v:.1f}%" for v in comparison_df[metric] * 100],
        textposition='auto',
    ))

fig.update_layout(
    title='Model Performance Comparison',
    xaxis_title='Model',
    yaxis_title='Accuracy (%)',
    barmode='group',
    height=500
)

fig.show()


FINAL MODEL COMPARISON


              Model  Top-3 Accuracy  Top-5 Accuracy      MAE
 Baseline Regressor        0.644578        0.811747 2.970531
Optimized Regressor        0.650602        0.843373 2.775074
     XGBoost Ranker        0.713235        0.844118 2.973529


In [29]:
# Example: Show predictions for a sample race
sample_race = test_race_ids[0]
sample_data = test_df[test_df['race_id'] == sample_race].copy()

# Sort by predicted score (lower = better)
sample_data['predicted_rank'] = sample_data['predicted_score'].rank(method='min')
sample_data = sample_data.sort_values('predicted_rank')

print(f"\n{'='*80}")
print(f"SAMPLE RACE PREDICTION: {sample_race}")
print(f"{'='*80}\n")

# Show top 10 predictions
display_cols = ['driver_abbreviation', 'grid_position', 'target_position', 'predicted_rank']
if 'driver_abbreviation' in sample_data.columns:
    print(sample_data[display_cols].head(10).to_string(index=False))
else:
    print("Driver names not available in encoded data")
    
# Share of the 10 rows shown above whose predicted rank is within 3 positions of the actual result
print(f"\nTop-3 Prediction Accuracy for this race: ", end="")
top3_correct = (np.abs(sample_data['target_position'].values[:10] - 
                       sample_data['predicted_rank'].values[:10]) <= 3).mean()
print(f"{top3_correct:.0%}")



SAMPLE RACE PREDICTION: 2024_13

 driver_abbreviation  grid_position  target_position  predicted_rank
                  25            2.0                1             1.0
                  13            5.0                3             2.0
                  22            1.0                2             3.0
                  29            4.0                6             4.0
                   2            7.0               11             5.0
                  35            3.0                5             6.0
                  28           17.0                8             7.0
                  18            6.0                4             8.0
                  27            9.0               12             9.0
                  32            8.0               10            10.0

Top-3 Prediction Accuracy for this race: 80%


In [30]:
# Save the ranking model
import joblib
import os

os.makedirs('../models/xgboost', exist_ok=True)

joblib.dump(ranker_model, '../models/xgboost/xgboost_ranker_model.pkl')
joblib.dump(le_dict_rank, '../models/xgboost/ranker_label_encoders.pkl')

print("✓ XGBoost Ranker model saved!")
print("  Model: ../models/xgboost/xgboost_ranker_model.pkl")
print("  Encoders: ../models/xgboost/ranker_label_encoders.pkl")


✓ XGBoost Ranker model saved!
  Model: ../models/xgboost/xgboost_ranker_model.pkl
  Encoders: ../models/xgboost/ranker_label_encoders.pkl


### Key Differences: Regressor vs Ranker

**XGBoost Regressor:**
- Predicts absolute position (1, 2, 3, ...)
- Treats each prediction independently
- Optimizes for MSE/RMSE

**XGBoost Ranker:**
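- Predicts relative ordering (who finishes ahead)
- Learns pairwise comparisons within races
- Optimizes a pairwise ranking loss with `rank:pairwise` (`rank:ndcg` and `rank:map` target NDCG and MAP directly)
- Scored higher on the within-3 metric here (71.32% vs 65.06%), though on a different test split and with a higher MAE (2.97 vs 2.78)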

**When to use each:**
- **Regressor:** When you need exact position numbers
- **Ranker:** When you care about relative order (podium predictions, fantasy F1)
