# Final Model - Top 20 - 2024
This notebook prepares the training and test data for the final model that predicts Top 20 finishes at the 2024 Tour de France. It loads the model previously tuned with GridSearchCV, evaluates it on the held-out 2024 edition, and saves the model and its predictions for future use or deployment.

## Key Steps:
- Loads the cleaned dataset and holds out the 2024 edition as the test set
- Loads the GradientBoostingClassifier that was tuned using a pipeline and GridSearchCV
- Evaluates on the 2024 data and saves the model and its predictions to disk
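
The tuning run itself is not shown in this notebook, which works with an already fitted model. As a rough illustration of what tuning a GradientBoostingClassifier with a pipeline and GridSearchCV can look like, here is a minimal sketch on synthetic data. The parameter grid, the `roc_auc` scoring choice, the `clf` step name, and the `X_demo`/`y_demo` arrays are assumptions for illustration, not the settings behind the saved model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: 200 rows, 6 features, a minority of positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 1.0).astype(int)

# Single-step pipeline; preprocessing steps could be added before "clf".
pipeline = Pipeline([
    ("clf", GradientBoostingClassifier(random_state=42)),
])

# Hypothetical grid: the values actually searched are not shown in this notebook.
param_grid = {
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [2, 3],
}

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(X_demo, y_demo)

# best_estimator_ is the pipeline refit on all of X_demo with the best parameters.
best_model = search.best_estimator_
print("Best params:", search.best_params_)
```

`best_model` is the kind of object that would then be written to `Models/final_model.pkl` with `joblib.dump`.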

## Import Libraries

In [1]:
from pathlib import Path
import sys
import joblib
import pandas as pd
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix

## Set Folder Path and Read CSVs

In [2]:
def find_project_root(start: Path, anchor_dirs=("src", "Data")) -> Path:
    """
    Walk up the directory tree until we find a folder that
    contains all anchor_dirs (e.g. 'src' and 'Data').
    """
    path = start.resolve()
    for parent in [path] + list(path.parents):
        if all((parent / d).is_dir() for d in anchor_dirs):
            return parent
    raise FileNotFoundError("Could not locate project root")

In [3]:
# Locate the project root regardless of notebook depth
project_root = find_project_root(Path.cwd())

# ----- Code modules --------------------------------------------------
src_path = project_root / "src"
if str(src_path) not in sys.path:
    sys.path.append(str(src_path))

from data_prep import preprocess_tdf_data   # import data preproc function

# ----- Data ----------------------------------------------------------
raw_data_path = project_root / "Data" / "Raw"
processed_data_path = project_root / "Data" / "Processed"
print("Raw data folder:", raw_data_path)
print("Processed data folder:", processed_data_path)

Raw data folder: C:\Users\Shaun Ricketts\Documents\GitHub\Tour-de-France-Top-20-Predictor\Data\Raw
Processed data folder: C:\Users\Shaun Ricketts\Documents\GitHub\Tour-de-France-Top-20-Predictor\Data\Processed


In [4]:
# project_root and src_path were already resolved in the previous cell via
# find_project_root(). Re-deriving them from Path.cwd().parents[1] would
# override that result with a depth-dependent guess, so only the import remains.
from data_prep import preprocess_tdf_data

In [5]:
# Load the prepared rider-level dataset (2011-2024 editions)
df = pd.read_csv(processed_data_path / "tdf_prepared_2011_2024.csv")

In [6]:
# import missing_value_handler
from missing_value_handler import FillWithSentinel

In [7]:
cleaner = FillWithSentinel()
df = cleaner.fit_transform(df)
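
The source of `FillWithSentinel` is not shown here. The data preview further down contains `999` in many position columns, so one plausible reading is that missing numeric values are replaced by a sentinel. The class below is a hypothetical stand-in written under that assumption; the name `SentinelFiller`, the default of 999, and the numeric-only behaviour are illustrative and may differ from the real `missing_value_handler` module.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SentinelFiller(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: replace NaN in numeric columns with a sentinel."""

    def __init__(self, sentinel=999):
        self.sentinel = sentinel

    def fit(self, X, y=None):
        # Remember which columns are numeric at fit time.
        self.numeric_cols_ = X.select_dtypes(include="number").columns.tolist()
        return self

    def transform(self, X):
        # Work on a copy so the caller's frame is not modified.
        X = X.copy()
        X[self.numeric_cols_] = X[self.numeric_cols_].fillna(self.sentinel)
        return X


demo = pd.DataFrame({
    "Best_Pos_BT_UWT": [1.0, np.nan, 19.0],
    "TDF_Pos": ["5", "DNF", None],  # non-numeric column is left untouched
})
filled = SentinelFiller().fit_transform(demo)
print(filled)
```

Under this reading, missing values in `TDF_Pos` survive the fill, which would explain why the notebook still drops them afterwards.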

In [8]:
# Filter out DNF or DSQ from TDF_Pos
df = df[~df['TDF_Pos'].isin(['DNF', 'DSQ'])]

In [9]:
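# Drop rows without a result; .copy() avoids SettingWithCopyWarning on later column assignments
df = df.dropna(subset=['TDF_Pos']).copy()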

In [10]:
# Convert TDF_Pos to numeric
df['TDF_Pos'] = pd.to_numeric(df['TDF_Pos'])

# 1 if TDF_Pos <= 20, else 0
df['is_top20'] = (df['TDF_Pos'] <= 20).astype(int)
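
To make the target construction concrete, here is a small self-contained sketch of the same steps on a toy frame. The six example values are made up for illustration.

```python
import pandas as pd

# Toy results column mixing finishers, non-finishers and a missing value.
toy = pd.DataFrame({"TDF_Pos": ["1", "20", "21", "DNF", "DSQ", None]})

# Drop non-finishers and missing results, then work on an explicit copy.
toy = toy[~toy["TDF_Pos"].isin(["DNF", "DSQ"])].dropna(subset=["TDF_Pos"]).copy()

# Positions arrive as strings, so convert before comparing.
toy["TDF_Pos"] = pd.to_numeric(toy["TDF_Pos"])

# 20th place counts as a Top 20 finish, 21st does not.
toy["is_top20"] = (toy["TDF_Pos"] <= 20).astype(int)
print(toy)
```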

In [11]:
# Keep 2015 onwards and exclude 2020 & 2021.
# .copy() lets later cells add prediction columns without SettingWithCopyWarning.
df = df[(df['Year'] >= 2015) & (~df['Year'].isin([2020, 2021]))].copy()

In [12]:
features = ['Best_Pos_BT_UWT', 'Best_Pos_BT_PT',
            'FC_Pos_YB', 'best_recent_tdf_result',
            'best_recent_other_gt_result', 'rode_giro']

train_mask = (df['Year'] <= 2023)
test_mask  = (df['Year'] == 2024)

X_train = df.loc[train_mask, features]
y_train = df.loc[train_mask, 'is_top20']
X_test  = df.loc[test_mask, features]
y_test  = df.loc[test_mask, 'is_top20']
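
The split above is by year rather than random, so the model is always scored on an edition it has not seen. A minimal sketch of that idea on a toy frame follows; the rows are invented and only `Year`, `FC_Pos_YB` and `is_top20` are borrowed from the real column names.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2015, 2019, 2020, 2021, 2023, 2024, 2024],
    "FC_Pos_YB": [7, 298, 50, 12, 56, 47, 161],
    "is_top20": [1, 0, 1, 1, 0, 0, 1],
})

# Same filter as in the notebook: 2015 onwards, without 2020 and 2021.
toy = toy[(toy["Year"] >= 2015) & (~toy["Year"].isin([2020, 2021]))].copy()

train_mask = toy["Year"] <= 2023
test_mask = toy["Year"] == 2024

# No row lands in both sets, and every remaining row lands in one of them.
assert not (train_mask & test_mask).any()
assert (train_mask | test_mask).all()

print("Train years:", sorted(toy.loc[train_mask, "Year"].unique()))
print("Test years:", sorted(toy.loc[test_mask, "Year"].unique()))
```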

In [13]:
df

Unnamed: 0,Rider_ID,Year,Age,TDF_Pos,Best_Pos_BT_UWT,Best_Pos_BT_PT,Best_Pos_AT_UWT_YB,Best_Pos_AT_PT_YB,Best_Pos_UWT_YB,Best_Pos_PT_YB,...,rode_giro,FC_Points,FC_Pos,Best_Pos_AT_UWT,Best_Pos_AT_PT,Best_Pos_UWT,Best_Pos_PT,Best_Pos_BT_UWT_YB,Best_Pos_BT_PT_YB,is_top20
5,3,2015,33,5.0,1.0,999,1.0,999,1.0,999,...,1.0,2299.0,7,999.0,999.0,1.0,999.0,1.0,999.0,1
7,3,2017,35,9.0,2.0,2.0,4.0,1.0,1.0,1.0,...,0.0,2231.0,6,5.0,999.0,2.0,2.0,1.0,999.0,1
25,8,2016,36,34.0,19.0,999,24.0,4.0,24.0,4.0,...,0.0,220.0,298,999.0,999.0,19.0,999.0,57.0,20.0,0
31,9,2016,32,84.0,35.0,26.0,13.0,9.0,13.0,9.0,...,0.0,220.0,298,999.0,999.0,35.0,26.0,21.0,16.0,0
51,13,2015,36,36.0,14.0,999,999,32.0,18.0,32.0,...,1.0,228.0,282,7.0,999.0,7.0,999.0,18.0,999.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20516,120433,2024,22,39.0,4.0,999,10.0,999,10.0,6.0,...,0.0,993.0,56,2.0,2.0,2.0,2.0,23.0,6.0,0
20610,126678,2024,21,41.0,13.0,999,42.0,999,15.0,1.0,...,0.0,1071.0,47,4.0,999.0,4.0,999.0,15.0,1.0,0
20802,153042,2024,25,23.0,12.0,9.0,999,7.0,60.0,7.0,...,0.0,508.0,161,999.0,15.0,12.0,9.0,60.0,56.0,0
20861,156417,2024,20,47.0,21.0,8.0,999,46.0,999,14.0,...,0.0,219.0,318,999.0,61.0,21.0,8.0,999.0,14.0,0


## Run the Model

In [None]:
model_dir = project_root / "Models"
model_dir.mkdir(parents=True, exist_ok=True)

In [None]:
# Load best model from GridSearchCV (already trained and saved)
model = joblib.load(model_dir / "final_model.pkl")


In [None]:
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_proba))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
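
For readers less familiar with these metrics, the sketch below works through a confusion matrix and ROC AUC on six hand-made predictions. The labels and probabilities are invented, and the explicit 0.5 cut-off is an assumption that mirrors how `predict` typically behaves for a binary classifier.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.2, 0.1, 0.05]

# Turn probabilities into hard labels with a 0.5 threshold.
y_hat = [int(p >= 0.5) for p in y_prob]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()
print(cm)

# ROC AUC uses the probabilities, not the thresholded labels:
# 7 of the 8 positive/negative pairs are ranked correctly.
auc = roc_auc_score(y_true, y_prob)
print("ROC AUC:", auc)  # 0.875
```

Because only around 20 riders per edition are positives, ROC AUC and the per-class rows of the classification report are usually more informative than overall accuracy.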


In [None]:
# Ensure the directory exists
(project_root / "Data" / "Processed").mkdir(parents=True, exist_ok=True)


In [None]:
joblib.dump(model, model_dir / "final_model.pkl")
print(f"Final model saved to: {model_dir / 'final_model.pkl'}")
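
A quick way to gain confidence in a saved model is to reload it and check that it predicts the same values. The sketch below does this with a small throwaway classifier in a temporary directory; the data, the `demo_model.pkl` file name and the model settings are placeholders, not the project's own.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Small synthetic problem, just enough to fit a model.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
clf = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Save and reload inside a temporary directory that is cleaned up afterwards.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_model.pkl"
    joblib.dump(clf, path)
    restored = joblib.load(path)

same = np.allclose(clf.predict_proba(X_demo), restored.predict_proba(X_demo))
print("Predictions identical after reload:", same)
```

Pickled scikit-learn models are generally only safe to reload with the same library version that wrote them, so recording that version next to the file is worth considering.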

In [None]:
df.loc[test_mask, 'predicted_top20_proba'] = y_proba
df.loc[test_mask, 'predicted_top20'] = y_pred

In [None]:
output_path = project_root / "Data" / "Processed" / "2024_predictions.csv"
df.loc[test_mask].to_csv(output_path, index=False)
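
Once the predictions are stored, a natural next step is to rank riders by predicted probability. The sketch below shows one way to do that on an invented four-row frame; only the column names follow the notebook, and the `rank` column is an illustrative addition.

```python
import pandas as pd

preds = pd.DataFrame({
    "Rider_ID": [3, 8, 9, 13],
    "predicted_top20_proba": [0.91, 0.12, 0.55, 0.30],
    "is_top20": [1, 0, 0, 1],
})

# Highest predicted probability first, with a 1-based rank column.
ranked = (
    preds.sort_values("predicted_top20_proba", ascending=False)
         .reset_index(drop=True)
)
ranked["rank"] = ranked.index + 1
print(ranked)
```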


## Conclusion

The final model, selected through GridSearchCV hyperparameter tuning and evaluated on the held-out 2024 edition, has been saved for deployment or further use. Its 2024 predictions are written to `Data/Processed/2024_predictions.csv`. Future work may involve testing the model on new race editions or extending the feature set.

Model path: `Models/final_model.pkl`
