# Date Range Testing
This notebook investigates how the choice of training date range affects model performance. Both a RandomForest and a HistGradientBoosting classifier are tested to see whether the effect differs between model types.

## Key Steps:
- Defines multiple training sets covering different date ranges, some of which exclude the Covid years (2020 and 2021)
- Trains each model on every range using the same pipeline, with 2024 always held out as the test set
- Compares Accuracy, Recall, F1 score and AUC across the ranges

In [1]:
import pandas as pd
import numpy as np
import os
import sys
from pathlib import Path

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score, recall_score
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold

from collections import defaultdict

from sklearn.ensemble import HistGradientBoostingClassifier

In [2]:
def find_project_root(start: Path, anchor_dirs=("src", "Data")) -> Path:
    """
    Walk up the directory tree until we find a folder that
    contains all anchor_dirs (e.g. 'src' and 'Data').
    """
    path = start.resolve()
    for parent in [path] + list(path.parents):
        if all((parent / d).is_dir() for d in anchor_dirs):
            return parent
    raise FileNotFoundError("Could not locate project root")
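
The root lookup can be tried in isolation with a throwaway directory tree. The sketch below repeats the function so that it runs on its own; the folder names `notebooks/experiments` are hypothetical and only stand in for wherever a notebook might live.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, anchor_dirs=("src", "Data")) -> Path:
    """Walk upwards from `start` until a folder contains every anchor dir."""
    path = Path(start).resolve()
    for parent in [path, *path.parents]:
        if all((parent / d).is_dir() for d in anchor_dirs):
            return parent
    raise FileNotFoundError("Could not locate project root")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "src").mkdir()
    (root / "Data").mkdir()

    # Hypothetical nested notebook folder, two levels below the root
    nested = root / "notebooks" / "experiments"
    nested.mkdir(parents=True)

    found = find_project_root(nested)
    found_matches_root = (found == root)
    print("Root found from nested folder:", found_matches_root)
```

Because the search starts at the given folder and moves outwards, the nearest folder containing both anchors is returned.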

In [3]:
# Locate the project root regardless of notebook depth
project_root = find_project_root(Path.cwd())

# ----- Code modules --------------------------------------------------
src_path = project_root / "src"
if str(src_path) not in sys.path:
    sys.path.append(str(src_path))

from data_prep import preprocess_tdf_data   # import data preproc function

# ----- Data ----------------------------------------------------------
data_raw_path = project_root / "Data" / "Raw"
print("Raw data folder:", data_raw_path)


Raw data folder: C:\Users\Shaun Ricketts\Documents\Projects\Cycling\Tour de France Predictor - 2025\Data\Raw


In [4]:
prepared_df = pd.read_csv(data_raw_path / "tdf_prepared_2011_2024.csv")

In [5]:
# import missing_value_handler
from missing_value_handler import FillWithSentinel

In [6]:
cleaner = FillWithSentinel()
final_df = cleaner.fit_transform(prepared_df)

In [7]:
# Filter out DNF or DSQ from TDF_Pos
final_df = final_df[~final_df['TDF_Pos'].isin(['DNF', 'DSQ'])]

In [8]:
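# Drop rows without a finishing position; copy so that the column
# assignments below do not raise SettingWithCopyWarning
final_df = final_df.dropna(subset=['TDF_Pos']).copy()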

In [9]:
# Convert TDF_Pos to numeric
final_df['TDF_Pos'] = pd.to_numeric(final_df['TDF_Pos'])

# 1 if TDF_Pos <= 20, else 0
final_df['is_top20'] = (final_df['TDF_Pos'] <= 20).astype(int)

In [10]:
final_df = final_df[final_df['Year'] >= 2012]

In [11]:
final_df

Unnamed: 0,Rider_ID,Year,TDF_Pos,Best_Pos_BT_UWT,Best_Pos_BT_PT,Best_Pos_AT_UWT_YB,Best_Pos_AT_PT_YB,Best_Pos_UWT_YB,Best_Pos_PT_YB,FC_Points_YB,FC_Pos_YB,best_tdf_result,best_other_gt_result,best_recent_tdf_result,best_recent_other_gt_result,tdf_debut,gt_debut,rode_giro,Age,is_top20
3,3,2013,4.0,3.0,2.0,1.0,999,1.0,999,1719.0,12,1.0,1.0,5.0,1.0,0.0,0.0,0.0,31,1
5,3,2015,5.0,1.0,999,1.0,999,1.0,999,2893.0,2,1.0,1.0,4.0,1.0,0.0,0.0,1.0,33,1
7,3,2017,9.0,2.0,2.0,4.0,1.0,1.0,1.0,2095.0,8,1.0,1.0,5.0,1.0,0.0,0.0,0.0,35,1
9,4,2012,32.0,3.0,6.0,999,999,1.0,2.0,1445.0,20,3.0,2.0,13.0,6.0,0.0,0.0,0.0,39,0
11,5,2012,76.0,86.0,24.0,111.0,999,64.0,39.0,93.0,402,8.0,3.0,40.0,15.0,0.0,0.0,0.0,32,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20516,120433,2024,39.0,4.0,999,10.0,999,10.0,6.0,483.0,175,999.0,999.0,999.0,999.0,1.0,0.0,0.0,22,0
20610,126678,2024,41.0,13.0,999,42.0,999,15.0,1.0,1055.0,58,999.0,42.0,999.0,42.0,1.0,0.0,0.0,21,0
20802,153042,2024,23.0,12.0,9.0,999,7.0,60.0,7.0,110.0,393,999.0,999.0,999.0,999.0,1.0,0.0,0.0,25,0
20861,156417,2024,47.0,21.0,8.0,999,46.0,999,14.0,252.0,280,999.0,999.0,999.0,999.0,1.0,1.0,0.0,20,0


In [12]:
feat5 = [
    "Best_Pos_BT_UWT",
    "Best_Pos_BT_PT",
    "Best_Pos_AT_UWT_YB",
    "Best_Pos_AT_PT_YB",
    "Age"]

feat7 = ['Best_Pos_BT_UWT', 'Best_Pos_BT_PT',
       'FC_Pos_YB', 'best_recent_tdf_result',
       'best_recent_other_gt_result', 'rode_giro', 'Age']

target = "is_top20"

In [13]:
# ───────────────────────────────────────────────────────────────
# 1.  Define training-only ranges  (2024 is always the test set)
# ───────────────────────────────────────────────────────────────
train_ranges = {
    "2012_2023"         : (2012, 2023),
    "2015_2023"         : (2015, 2023),
    "2018_2023"         : (2018, 2023),
    # The "no_*" sets reuse a full range; their excluded years are listed in drop_years
    "no_2020"           : (2012, 2023),
    "no_2020_2021"      : (2012, 2023),
    "2015_no_2020"      : (2015, 2023),
    "2015_no_2020_2021" : (2015, 2023),
}

drop_years = {
    "no_2020": {2020},
    "no_2020_2021": {2020, 2021},
    "2015_no_2020": {2020},
    "2015_no_2020_2021": {2020, 2021},
}
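
As a sanity check on the definitions above, each named training set can be expanded into its explicit list of seasons. This is only a sketch: the two dictionaries are copied from the cell above, while the helper name `years_for` is hypothetical and not part of the notebook.

```python
train_ranges = {
    "2012_2023": (2012, 2023),
    "2015_2023": (2015, 2023),
    "2018_2023": (2018, 2023),
    "no_2020": (2012, 2023),
    "no_2020_2021": (2012, 2023),
    "2015_no_2020": (2015, 2023),
    "2015_no_2020_2021": (2015, 2023),
}

drop_years = {
    "no_2020": {2020},
    "no_2020_2021": {2020, 2021},
    "2015_no_2020": {2020},
    "2015_no_2020_2021": {2020, 2021},
}


def years_for(key):
    """Return the sorted training years for a named set (hypothetical helper)."""
    start, end = train_ranges[key]
    excluded = drop_years.get(key, set())
    return [year for year in range(start, end + 1) if year not in excluded]


for key in train_ranges:
    years = years_for(key)
    print(f"{key:<18} {len(years):>2} seasons: {years}")
```

Printing the lists makes it easy to confirm that 2024 never appears in a training set and that the exclusions apply only to the sets named in `drop_years`.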

In [14]:
# Constant test slice (all 2024 rows)
X_test = final_df.loc[final_df["Year"] == 2024, feat5]
y_test = final_df.loc[final_df["Year"] == 2024, target]

In [15]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) 

In [16]:
# ------------------------------------------------------------------
# 1.  Prepare containers
# ------------------------------------------------------------------
pred_cols = defaultdict(list)   # one small prediction DataFrame per (model, training set)
metrics   = defaultdict(dict)   # test-set and CV metrics per (model, training set)

In [17]:
# ------------------------------------------------------------------
# 2.  Train/evaluate and SAVE the predictions
# ------------------------------------------------------------------
for model_name, pipe in {
        "rf": Pipeline([
            ("impute", SimpleImputer(strategy="mean")),
            ("rf", RandomForestClassifier(max_depth=10,
                                          min_samples_leaf=3,
                                          class_weight="balanced",
                                          random_state=42))
        ]),
        "hgb": Pipeline([
            ("impute", SimpleImputer(strategy="mean")),
            ("hgb", HistGradientBoostingClassifier(learning_rate=0.1,
                                                   max_depth=None,
                                                   l2_regularization=0.1,
                                                   random_state=42))
        ])
    }.items():

    for key, (start, end) in train_ranges.items():
        # ----------------- slice training data -----------------
        mask = (final_df["Year"].between(start, end))
        if key in drop_years:
            mask &= ~final_df["Year"].isin(drop_years[key])

        X_train, y_train = final_df.loc[mask, feat5], final_df.loc[mask, target]
        if X_train.empty:
            continue

        # ----------------- CV & fit -----------------
        cv_auc  = cross_val_score(pipe, X_train, y_train,
                                  cv=cv,  # the StratifiedKFold defined earlier
                                  scoring="roc_auc", n_jobs=-1)
        pipe.fit(X_train, y_train)

        # ----------------- 2024 predictions -----------------
        y_pred = pipe.predict(X_test)
        y_prob = pipe.predict_proba(X_test)[:, 1]

        # ----------------- store 2024 test metrics and CV AUC -----------------
        metrics[(model_name, key)] = {
            "AUC"     : round(roc_auc_score(y_test, y_prob), 3),
            "Accuracy": round(accuracy_score(y_test, y_pred), 3),
            "F1"      : round(f1_score(y_test, y_pred), 3),
            "Recall"  : round(recall_score(y_test, y_pred), 3),
            "CV AUC"  : f"{cv_auc.mean():.4f} ± {cv_auc.std():.4f}"
        }

        # store predictions in a tiny DataFrame for merging ..
        tmp = X_test.reset_index()[["index"]]          # keep original row-index
        tmp[["Rider_ID", "Year"]] = final_df.loc[tmp["index"], ["Rider_ID", "Year"]].values
        tmp[f"{model_name}_{key}_prob"] = y_prob
        tmp[f"{model_name}_{key}_pred"] = y_pred
        pred_cols[(model_name, key)] = tmp.drop(columns="index")

In [18]:
# Build a metrics table with one row per (model, training set)
df_metrics = pd.DataFrame.from_dict(metrics, orient='index')

# Name the two index levels
df_metrics.index = pd.MultiIndex.from_tuples(df_metrics.index, names=["Model", "Train_Set"])

# Move 'Model' and 'Train_Set' out of the index into ordinary columns
df_metrics = df_metrics.reset_index()

In [19]:
# ------------------------------------------------------------------
# 3.  Combine all prediction columns side-by-side
# ------------------------------------------------------------------
preds_2024 = pd.concat(pred_cols.values(), axis=1)
# Rider_ID and Year repeat in every block; keep only the first copy of each
preds_2024 = preds_2024.loc[:, ~preds_2024.columns.duplicated()]
#   Rider_ID | Year | rf_2015_no_2020_prob | rf_2015_no_2020_pred | hgb_2015_no_2020_prob | ...


In [20]:
# ------------------------------------------------------------------
# 4.  Join back to final_df and keep only the 2024 rows
# ------------------------------------------------------------------
pred_vs_act_2024 = final_df.merge(preds_2024, on=["Rider_ID", "Year"], how="left")
pred_vs_act_2024 = pred_vs_act_2024[pred_vs_act_2024["Year"]==2024]

In [21]:
pred_vs_act_2024 = pred_vs_act_2024[['Rider_ID', 'TDF_Pos', 
'rf_2012_2023_prob', 
'rf_2015_2023_prob', 
'rf_2018_2023_prob',
'rf_no_2020_prob', 
'rf_no_2020_2021_prob',  
'rf_2015_no_2020_prob',
'rf_2015_no_2020_2021_prob',
'hgb_2012_2023_prob', 
'hgb_2015_2023_prob',  
'hgb_2018_2023_prob',
'hgb_no_2020_prob', 
'hgb_no_2020_2021_prob', 
'hgb_2015_no_2020_prob', 
'hgb_2015_no_2020_2021_prob'
]]

pred_vs_act_2024_rf = pred_vs_act_2024[['Rider_ID', 'TDF_Pos', 
'rf_2012_2023_prob', 
'rf_2015_2023_prob', 
'rf_2018_2023_prob',
'rf_no_2020_prob', 
'rf_no_2020_2021_prob',  
'rf_2015_no_2020_prob',
'rf_2015_no_2020_2021_prob',
]]

pred_vs_act_2024_hgb = pred_vs_act_2024[['Rider_ID', 'TDF_Pos', 
'hgb_2012_2023_prob', 
'hgb_2015_2023_prob',  
'hgb_2018_2023_prob',
'hgb_no_2020_prob', 
'hgb_no_2020_2021_prob', 
'hgb_2015_no_2020_prob', 
'hgb_2015_no_2020_2021_prob'
]]

In [22]:
pred_vs_act_2024_rf  = pred_vs_act_2024_rf.copy()
pred_vs_act_2024_hgb = pred_vs_act_2024_hgb.copy()

# Select only the columns with the probability values (exclude Rider_ID, TDF_Pos)
prob_cols_rf = [
'rf_2012_2023_prob', 
'rf_2015_2023_prob', 
'rf_2018_2023_prob',
'rf_no_2020_prob', 
'rf_no_2020_2021_prob',  
'rf_2015_no_2020_prob',
'rf_2015_no_2020_2021_prob',
]

prob_cols_hgb = [
'hgb_2012_2023_prob', 
'hgb_2015_2023_prob',  
'hgb_2018_2023_prob',
'hgb_no_2020_prob', 
'hgb_no_2020_2021_prob', 
'hgb_2015_no_2020_prob', 
'hgb_2015_no_2020_2021_prob'
]

# Highest and lowest predicted probability per rider across the training sets
pred_vs_act_2024_rf['max_prob'] = pred_vs_act_2024_rf[prob_cols_rf].max(axis=1)
pred_vs_act_2024_hgb['max_prob'] = pred_vs_act_2024_hgb[prob_cols_hgb].max(axis=1)

pred_vs_act_2024_rf['min_prob'] = pred_vs_act_2024_rf[prob_cols_rf].min(axis=1)
pred_vs_act_2024_hgb['min_prob'] = pred_vs_act_2024_hgb[prob_cols_hgb].min(axis=1)

# Spread between the highest and lowest probability per rider
pred_vs_act_2024_rf['max_diff'] = pred_vs_act_2024_rf['max_prob'] - pred_vs_act_2024_rf['min_prob']
pred_vs_act_2024_hgb['max_diff'] = pred_vs_act_2024_hgb['max_prob'] - pred_vs_act_2024_hgb['min_prob']




In [23]:
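# Isolate the per-model probability columns. The summary columns from the
# previous cell (max_prob, min_prob, max_diff) are excluded so that
# idxmax/idxmin can only return a model column.
prob_cols_rf = pred_vs_act_2024_rf.columns.drop(['Rider_ID', 'TDF_Pos', 'max_prob', 'min_prob', 'max_diff'])
prob_cols_hgb = pred_vs_act_2024_hgb.columns.drop(['Rider_ID', 'TDF_Pos', 'max_prob', 'min_prob', 'max_diff'])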

# Find column with max value per row
pred_vs_act_2024_rf['max_model'] = pred_vs_act_2024_rf[prob_cols_rf].idxmax(axis=1)
pred_vs_act_2024_hgb['max_model'] = pred_vs_act_2024_hgb[prob_cols_hgb].idxmax(axis=1)

# Find column with min value per row
pred_vs_act_2024_rf['min_model'] = pred_vs_act_2024_rf[prob_cols_rf].idxmin(axis=1)
pred_vs_act_2024_hgb['min_model'] = pred_vs_act_2024_hgb[prob_cols_hgb].idxmin(axis=1)

In [24]:
pred_vs_act_2024_rf = pred_vs_act_2024_rf[['Rider_ID', 'TDF_Pos', 'max_prob', 'min_prob', 'max_diff', 'max_model', 'min_model']]
pred_vs_act_2024_hgb = pred_vs_act_2024_hgb[['Rider_ID', 'TDF_Pos', 'max_prob', 'min_prob', 'max_diff', 'max_model', 'min_model']]
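
The per-rider spread calculation in the cells above can be illustrated on a toy frame. The rider IDs and probabilities below are made up for illustration; only the column-naming pattern follows the notebook.

```python
import pandas as pd

# Hypothetical probabilities for three riders from three training sets
toy = pd.DataFrame({
    "Rider_ID": [1, 2, 3],
    "rf_2012_2023_prob": [0.80, 0.10, 0.45],
    "rf_2018_2023_prob": [0.50, 0.15, 0.40],
    "rf_no_2020_prob": [0.75, 0.30, 0.55],
})

# Select the model columns by prefix, before any summary columns exist
prob_cols = [c for c in toy.columns if c.startswith("rf_")]

toy["max_prob"] = toy[prob_cols].max(axis=1)
toy["min_prob"] = toy[prob_cols].min(axis=1)
toy["max_diff"] = toy["max_prob"] - toy["min_prob"]

# Which training set produced the highest / lowest probability per rider
toy["max_model"] = toy[prob_cols].idxmax(axis=1)
toy["min_model"] = toy[prob_cols].idxmin(axis=1)

print(toy[["Rider_ID", "max_diff", "max_model", "min_model"]])
```

A large `max_diff` flags a rider whose predicted probability depends heavily on which seasons were used for training.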

## Results

In [25]:
df_metrics.sort_values(by='F1', ascending=False)

Unnamed: 0,Model,Train_Set,AUC,Accuracy,F1,Recall,CV AUC
10,hgb,no_2020,0.921,0.922,0.703,0.65,0.8814 ± 0.0230
1,rf,2015_2023,0.94,0.894,0.694,0.85,0.8743 ± 0.0181
6,rf,2015_no_2020_2021,0.936,0.894,0.694,0.85,0.8848 ± 0.0400
11,hgb,no_2020_2021,0.924,0.922,0.686,0.6,0.8682 ± 0.0118
0,rf,2012_2023,0.944,0.879,0.667,0.85,0.8772 ± 0.0134
2,rf,2018_2023,0.925,0.887,0.667,0.8,0.8777 ± 0.0189
5,rf,2015_no_2020,0.935,0.879,0.667,0.85,0.8890 ± 0.0260
3,rf,no_2020,0.948,0.872,0.654,0.85,0.8892 ± 0.0252
7,hgb,2012_2023,0.927,0.915,0.647,0.55,0.8694 ± 0.0168
4,rf,no_2020_2021,0.938,0.872,0.64,0.8,0.8829 ± 0.0184


In [26]:
pred_vs_act_2024_rf.sort_values(by='max_diff', ascending=False).head(10)

Unnamed: 0,Rider_ID,TDF_Pos,max_prob,min_prob,max_diff,max_model,min_model
1830,37446,18.0,0.816437,0.429949,0.386488,rf_no_2020_prob,rf_2018_2023_prob
1964,72919,108.0,0.433886,0.090151,0.343735,rf_no_2020_prob,rf_2015_no_2020_2021_prob
1991,110247,62.0,0.421361,0.080994,0.340367,rf_2018_2023_prob,rf_2015_no_2020_2021_prob
1693,19784,31.0,0.77343,0.438252,0.335178,rf_no_2020_2021_prob,rf_2018_2023_prob
659,716,42.0,0.68398,0.370468,0.313513,rf_no_2020_2021_prob,rf_2015_no_2020_prob
1706,20384,11.0,0.815268,0.50817,0.307097,rf_no_2020_2021_prob,rf_2018_2023_prob
1775,27307,71.0,0.495154,0.18853,0.306624,rf_2018_2023_prob,rf_no_2020_prob
1630,16752,104.0,0.276436,0.000618,0.275818,rf_2015_no_2020_2021_prob,rf_2018_2023_prob
1618,16687,127.0,0.33378,0.061534,0.272246,rf_2012_2023_prob,rf_2015_no_2020_prob
1939,64106,74.0,0.659547,0.39668,0.262867,rf_2015_no_2020_2021_prob,rf_2018_2023_prob


## Decision
Training dataset chosen: 2015_no_2020_2021 (2015 - 2023, excluding 2020 and 2021)

Because the task is to identify top-20 finishers, positive cases are rare compared with negative ones,
so recall and F1 are more informative here than AUC and accuracy.
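
To see why accuracy can mislead with this class balance, consider a baseline that never predicts a top-20 finish. The sketch below assumes a field of 140 finishers with 20 positives (illustrative numbers, not taken from the data) and computes the metrics by hand.

```python
# Illustrative class balance: 20 top-20 riders among an assumed 140 finishers
y_true = [1] * 20 + [0] * 120
y_pred = [0] * 140  # baseline: never predict a top-20 finish

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print(f"accuracy={accuracy:.3f}  recall={recall:.3f}  f1={f1:.3f}")
```

Under these assumed numbers the baseline reaches roughly 0.857 accuracy while finding none of the top-20 riders, so recall and F1 are both 0.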

The top performers on F1 score were:
1. hgb_no_2020 (0.703)
2. rf_2015_2023 (0.694)
2. rf_2015_no_2020_2021 (0.694)
4. hgb_no_2020_2021 (0.686)
5. rf_2012_2023 (0.667, tied with rf_2018_2023 and rf_2015_no_2020)

All HistGradientBoostingClassifier models had markedly lower recall (0.65 at best, against 0.80 to 0.85 for RandomForest), so we will mainly focus on the RandomForest results.
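
The recall gap can be checked against the ten rows of the results table shown above. The sketch below re-enters those reported values by hand (it does not rerun the models) and summarises recall per model type.

```python
import pandas as pd

# (Model, Train_Set, Recall) copied from the results table above
rows = [
    ("hgb", "no_2020", 0.65),
    ("rf", "2015_2023", 0.85),
    ("rf", "2015_no_2020_2021", 0.85),
    ("hgb", "no_2020_2021", 0.60),
    ("rf", "2012_2023", 0.85),
    ("rf", "2018_2023", 0.80),
    ("rf", "2015_no_2020", 0.85),
    ("rf", "no_2020", 0.85),
    ("hgb", "2012_2023", 0.55),
    ("rf", "no_2020_2021", 0.80),
]
reported = pd.DataFrame(rows, columns=["Model", "Train_Set", "Recall"])

# Lowest and highest reported recall for each model type
recall_range = reported.groupby("Model")["Recall"].agg(["min", "max"])
print(recall_range)
```

Among these rows, the best HistGradientBoosting recall sits below the worst RandomForest recall.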

hgb_no_2020:
- Has the highest F1 (0.703) and the highest recall of the hgb models (0.65), but I don't believe the slight F1 gain over the best RandomForest runs (0.694) is worth the drop in recall from 0.85 to 0.65.

2015_2023:
- Same recall and F1 score as 2015_no_2020_2021, with slightly higher AUC (0.940 vs 0.936) but lower CV AUC (0.8743 vs 0.8848), which suggests less robust generalisation
- Keeps the potential noise of the Covid years without any benefit to model strength

2012_2023 (all data):
- Slightly lower F1 (0.667) and accuracy (0.879), which suggests older data may be less relevant to current racing

no_2020:
- Lower F1 scores (0.654, or 0.667 for 2015_no_2020) than 2015_no_2020_2021, which suggests that, from a 2015 start, dropping 2020 together with 2021 is stronger, possibly because the 2021 data reflects early-2021 results, which still had many Covid issues

2018_2023:
- Lower recall (0.80) and F1 score (0.667). Smaller dataset, possible underfitting.

Overall, 2015_no_2020_2021 offers the best compromise between data quantity, relevance and model quality, with consistently strong results across all metrics.