# Scenario Testing
This notebook tests different strategies for handling missing and ambiguous race results (e.g., DNF/DSQ) in the Tour de France dataset. It compares multiple imputation and fallback scenarios to assess their impact on predictive performance.

## Key Steps:
- Defines five data handling scenarios using sentinel values, nulls, or fallback logic
- Builds a classification pipeline to predict Top 20 finishes
- Evaluates each scenario using cross-validation and test performance on 2024 data

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import sys
from pathlib import Path

# visualisation tools
import matplotlib.pyplot as plt
import seaborn as sns

# sklearn - Core Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# sklearn - Evaluation
from sklearn.metrics import (
    accuracy_score,
    recall_score,
    f1_score,
    classification_report,
    confusion_matrix,
    roc_auc_score
)

## Set Folder Path and Read CSVs

In [2]:
def find_project_root(start: Path, anchor_dirs=("src", "Data")) -> Path:
    """
    Walk up the directory tree until we find a folder that
    contains all anchor_dirs (e.g. 'src' and 'Data').
    """
    path = start.resolve()
    for parent in [path] + list(path.parents):
        if all((parent / d).is_dir() for d in anchor_dirs):
            return parent
    raise FileNotFoundError("Could not locate project root")

In [3]:
# Locate the project root regardless of notebook depth
project_root = find_project_root(Path.cwd())

# ----- Code modules --------------------------------------------------
src_path = project_root / "src"
if str(src_path) not in sys.path:
    sys.path.append(str(src_path))

In [4]:
from data_prep import preprocess_tdf_data   # import data preproc function

In [5]:
# ----- Data ----------------------------------------------------------
raw_data_path = project_root / "Data" / "Raw"
processed_data_path = project_root / "Data" / "Processed"
print("Raw data folder:", raw_data_path)
print("Processed data folder:", processed_data_path)

Raw data folder: C:\Users\Shaun Ricketts\Documents\GitHub\Tour-de-France-Top-20-Predictor\Data\Raw
Processed data folder: C:\Users\Shaun Ricketts\Documents\GitHub\Tour-de-France-Top-20-Predictor\Data\Processed


In [6]:
prepared_df = pd.read_csv(processed_data_path / "tdf_prepared_2011_2024.csv",
                         usecols = ['Rider_ID', 'Year', 'Age', 'TDF_Pos', 'Best_Pos_BT_UWT',
                           'Best_Pos_BT_PT', 'Best_Pos_UWT_YB', 'Best_Pos_PT_YB', 'FC_Pos_YB', 'best_recent_tdf_result', 
                           'best_recent_other_gt_result', 'rode_giro'])

## Dealing with Nulls and DNFs (etc.)

- Scenario 1: Replace DNFs with Nulls
- Scenario 2: Replace DNFs and Nulls with Sentinel (999)
- Scenario 3: Replace DNFs with Sentinel and leave Nulls
- Scenario 4: Replace nulls/DNFs in Best_UWT results with weighted Best_PT results; if still null, use the sentinel
- Scenario 5: Replace nulls/DNFs in Best_UWT & Best_PT with the previous year's result; if still null, use the sentinel
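
The first three scenarios differ only in what replaces a DNF/DSQ marker and what replaces a missing value. A minimal sketch on a toy column may make the distinction concrete; the series `raw` and the names `s1`, `s2`, `s3` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

SENTINEL = 999  # assumed to match the notebook's `sentinel`

# Toy position column: a finish, a DNF, a missing result, a DSQ
raw = pd.Series(["12", "DNF", np.nan, "DSQ"], dtype=object)
is_marker = raw.isin(["DNF", "DSQ"])

# Blank out the markers, then convert what is left to numbers
numeric = pd.to_numeric(raw.mask(is_marker), errors="coerce")

s1 = numeric                            # Scenario 1: markers become nulls
s2 = numeric.fillna(SENTINEL)           # Scenario 2: markers and nulls become the sentinel
s3 = numeric.mask(is_marker, SENTINEL)  # Scenario 3: markers become the sentinel, nulls stay

print(pd.DataFrame({"raw": raw, "s1": s1, "s2": s2, "s3": s3}))
```

In Scenario 1 the remaining nulls are left for the pipeline's imputer to handle, whereas Scenarios 2 and 3 push some or all of them to a value far outside the range of real finishing positions.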

Set the sentinel value that stands in for a missing or non-finishing result.

Only the columns likely to be used in the final model are kept.

In [7]:
sentinel = 999

In [8]:
# Get cols with "DNF" value ("DNS" in same col as "DNF")
dnf_columns = (prepared_df == "DNF").any()[lambda x: x].index.tolist()

In [9]:
# Get cols with nulls
null_columns = (prepared_df.isnull()).any()[lambda x: x].index.tolist()
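
The `.any()[lambda x: x].index.tolist()` chain used in the two cells above is compact but easy to misread. The sketch below unpacks it on a toy frame; `toy`, `has_dnf`, `dnf_cols` and `null_cols` are illustrative names only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": ["1", "DNF", "3"],   # contains a DNF marker
    "b": [1.0, np.nan, 3.0],  # contains a null
    "c": [10, 20, 30],        # clean
})

# Step 1: one boolean per column, True if any cell equals "DNF"
has_dnf = (toy == "DNF").any()

# Step 2: keep only the True entries and read off their labels
dnf_cols = has_dnf[lambda s: s].index.tolist()

# The same idea for nulls, written as a boolean mask over the column index
null_cols = toy.columns[toy.isnull().any().to_numpy()].tolist()

print(dnf_cols, null_cols)
```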

In [10]:
prepared_df

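| | Rider_ID | Year | TDF_Pos | Best_Pos_BT_UWT | Best_Pos_BT_PT | Best_Pos_UWT_YB | Best_Pos_PT_YB | FC_Pos_YB | best_recent_tdf_result | best_recent_other_gt_result | rode_giro | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2011 | NaN | 67.0 | NaN | NaN | NaN | 1 | NaN | NaN | NaN | 40 |
| 1 | 3 | 2011 | 5.0 | 1.0 | NaN | NaN | NaN | 1 | 1.0 | 1.0 | 0.0 | 29 |
| 2 | 3 | 2012 | NaN | NaN | NaN | 1.0 | NaN | 2 | NaN | NaN | NaN | 30 |
| 3 | 3 | 2013 | 4.0 | 3.0 | 2.0 | 1.0 | NaN | 12 | 5.0 | 1.0 | 0.0 | 31 |
| 4 | 3 | 2014 | DNF | 1.0 | NaN | 3.0 | 2.0 | 13 | 4.0 | 1.0 | 0.0 | 32 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21229 | 220860 | 2023 | NaN | NaN | NaN | NaN | NaN | 500 | NaN | NaN | NaN | 19 |
| 21230 | 229373 | 2024 | NaN | NaN | NaN | NaN | NaN | 499 | NaN | NaN | NaN | 19 |
| 21231 | 230418 | 2024 | NaN | NaN | NaN | NaN | NaN | 499 | NaN | NaN | NaN | 20 |
| 21232 | 231012 | 2024 | NaN | NaN | NaN | NaN | NaN | 499 | NaN | NaN | NaN | 20 |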


In [11]:
prepared_df.columns

Index(['Rider_ID', 'Year', 'TDF_Pos', 'Best_Pos_BT_UWT', 'Best_Pos_BT_PT',
       'Best_Pos_UWT_YB', 'Best_Pos_PT_YB', 'FC_Pos_YB',
       'best_recent_tdf_result', 'best_recent_other_gt_result', 'rode_giro',
       'Age'],
      dtype='object')

### Scenario 1

In [12]:
# Replace "DNF"/"DSQ" with null in new <col>_null columns (cleaned copies, not indicator flags)
for col in dnf_columns:
    prepared_df[col + "_null"] = prepared_df[col].replace({"DNF": np.nan, "DSQ": np.nan})

In [13]:
# Scenario 1 feature columns
null_columns_list = [
 'Best_Pos_BT_UWT_null',
 'Best_Pos_BT_PT_null',]

### Scenario 2

In [14]:
for col in null_columns:
    # Treat DNF/DSQ as missing, then fill every missing value with the sentinel
    prepared_df[col + "_sent"] = prepared_df[col].replace({"DNF": np.nan, "DSQ": np.nan})
    # Note: the flag marks values that were null in the raw column; DNF/DSQ rows get flag 0
    prepared_df[col + '_sent_flag'] = prepared_df[col].isnull().astype(int)
    prepared_df[col + '_sent'] = prepared_df[col + "_sent"].fillna(sentinel)
    prepared_df[col + '_sent'] = prepared_df[col + '_sent'].astype(float).astype(int)

In [15]:
sent_columns_list = [col for col in prepared_df.columns if col.endswith("_sent")]
sent_flag_columns_list = [col for col in prepared_df.columns if col.endswith("_sent_flag")]

In [16]:
# Keep only the columns used as model features (overrides the list built in the previous cell)
sent_columns_list = [
 'Best_Pos_BT_UWT_sent',
 'Best_Pos_BT_PT_sent',
 'best_recent_tdf_result_sent',
 'best_recent_other_gt_result_sent']

In [17]:
sent_flag_columns_list = [
 'Best_Pos_BT_UWT_sent_flag',
 'Best_Pos_BT_PT_sent_flag',
 'best_recent_tdf_result_sent_flag',
 'best_recent_other_gt_result_sent_flag']

### Scenario 3

In [18]:
for col in dnf_columns:
    prepared_df[col + "_dnf_flag"] = prepared_df[col].isin(["DNF", "DSQ"]).astype(int)  # Boolean indicator
    prepared_df[col + "_dnf_sent"] = prepared_df[col].replace({"DNF": sentinel, "DSQ": sentinel})

In [19]:
dnf_flag_columns_list = [col for col in prepared_df.columns if col.endswith("_dnf_flag")]
dnf_sent_columns_list = [col for col in prepared_df.columns if col.endswith("_dnf_sent")]

In [20]:
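# Scenario 3 feature columns; the best_recent_* columns hold no DNF markers,
# so their Scenario 2 (_sent) versions are reused here
dnf_sent_columns_list = [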
 'Best_Pos_BT_UWT_dnf_sent',
 'Best_Pos_BT_PT_dnf_sent',
 'best_recent_tdf_result_sent',
 'best_recent_other_gt_result_sent']

In [21]:
dnf_flag_columns_list = [
 'Best_Pos_BT_UWT_dnf_flag',
 'Best_Pos_BT_PT_dnf_flag',
 'best_recent_tdf_result_sent_flag',
 'best_recent_other_gt_result_sent_flag']

### Scenario 4

Set a weight for use of pro-tour result

In [22]:
pt_weight_add = 3
pt_weight_mult = 1.5

In [23]:
def generate_filled_from_pt_cols(df, pt_weight_add=3, pt_weight_mult=1.5, sentinel=999):
    """Fill missing/DNF/DSQ UWT results with a weighted PT result, else the sentinel."""
    uwt_pt_pairs = [
        ("Best_Pos_BT_UWT", "Best_Pos_BT_PT"),
        # add more if needed
    ]

    filled_cols = []
    flag_cols = []

    df = df.copy()  # leave the caller's DataFrame untouched

    for uwt_col, pt_col in uwt_pt_pairs:
        def fill_with_pt(row):
            val = row[uwt_col]
            pt_val = row[pt_col]

            if pd.isna(val) or val in ["DNF", "DSQ"]:
                if pd.notna(pt_val) and pt_val not in ["DNF", "DSQ"]:
                    try:
                        # Weight the PT position (a larger number is a worse position)
                        return (float(pt_val) + pt_weight_add) * pt_weight_mult
                    except (TypeError, ValueError):
                        return sentinel
                else:
                    return sentinel
            else:
                try:
                    return float(val)
                except (TypeError, ValueError):
                    return sentinel

        filled_col_name = f"{uwt_col}_filled_from_pt_add{pt_weight_add}_mult{pt_weight_mult}"
        flag_col_name = f"{filled_col_name}_flag"

        df[filled_col_name] = df.apply(fill_with_pt, axis=1)
        df[flag_col_name] = (df[filled_col_name] == sentinel).astype(int)

        filled_cols.append(filled_col_name)
        flag_cols.append(flag_col_name)

    return df, filled_cols, flag_cols
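
The row-wise `apply` above can be slow on roughly 21,000 rows. As a rough alternative, the same fallback chain can be sketched with vectorised pandas calls. The helper name `fill_from_pt_vectorised` and the toy frame are assumptions for illustration. One behavioural difference: any non-numeric string in the UWT column is treated as missing and falls back to the PT value, whereas the row-wise version returns the sentinel for strings other than "DNF"/"DSQ".

```python
import numpy as np
import pandas as pd

def fill_from_pt_vectorised(uwt, pt, add=3, mult=1.5, sentinel=999):
    """Fallback chain: UWT position, else weighted PT position, else sentinel."""
    uwt_num = pd.to_numeric(uwt, errors="coerce")  # "DNF"/"DSQ" become NaN
    pt_num = pd.to_numeric(pt, errors="coerce")
    weighted_pt = (pt_num + add) * mult            # NaN stays NaN
    return uwt_num.fillna(weighted_pt).fillna(sentinel)

toy = pd.DataFrame({
    "Best_Pos_BT_UWT": ["34.0", np.nan, "DNF", np.nan],
    "Best_Pos_BT_PT": ["29.0", "10.0", "DSQ", np.nan],
})

filled = fill_from_pt_vectorised(toy["Best_Pos_BT_UWT"], toy["Best_Pos_BT_PT"])
print(filled.tolist())  # [34.0, 19.5, 999.0, 999.0]
```

Whether the speed-up matters depends on how many weight combinations are swept; for the five tested here the row-wise version is likely adequate.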


In [24]:
uwt_pt_pairs = [
    ("Best_Pos_BT_UWT", "Best_Pos_BT_PT"),
    #("Best_Pos_AT_UWT_YB", "Best_Pos_AT_PT_YB"),
    #("Best_Pos_UWT_YB", "Best_Pos_PT_YB"),
]

for uwt_col, pt_col in uwt_pt_pairs:
    def fill_with_pt(row):
        val = row[uwt_col]
        pt_val = row[pt_col]

        if pd.isna(val) or val in ["DNF", "DSQ"]:
            if pd.notna(pt_val) and pt_val not in ["DNF", "DSQ"]:
                try:
                    # Weight the PT position (a larger number is a worse position)
                    return (float(pt_val) + pt_weight_add) * pt_weight_mult
                except (TypeError, ValueError):
                    return sentinel
            else:
                return sentinel
        else:
            try:
                return float(val)
            except (TypeError, ValueError):
                return sentinel

    filled_col_name = f"{uwt_col}_filled_from_pt"
    prepared_df[filled_col_name] = prepared_df.apply(fill_with_pt, axis=1)
    # Flag rows where neither the UWT nor the PT result was usable
    prepared_df[filled_col_name + "_flag"] = (prepared_df[filled_col_name] == sentinel).astype(int)

In [25]:
filled_from_pt_columns_list = [
 'Best_Pos_BT_UWT_filled_from_pt',
 'best_recent_tdf_result_sent',
 'best_recent_other_gt_result_sent']

In [26]:
filled_from_pt_columns_flag_list = [
 'Best_Pos_BT_UWT_filled_from_pt_flag',
 'best_recent_tdf_result_sent_flag',
 'best_recent_other_gt_result_sent_flag']

### Scenario 5

In [27]:
bt_yb_pairs = [
    ("Best_Pos_BT_UWT", "Best_Pos_UWT_YB"),
    ("Best_Pos_BT_PT", "Best_Pos_PT_YB"),
]

for bt_col, yb_col in bt_yb_pairs:
    def fill_with_yb(row):
        val = row[bt_col]
        yb_val = row[yb_col]

        if pd.isna(val) or val in ["DNF", "DSQ"]:
            if pd.notna(yb_val) and yb_val not in ["DNF", "DSQ"]:
                try:
                    # Fall back to the previous year's result, unweighted
                    return float(yb_val)
                except (TypeError, ValueError):
                    return sentinel
            else:
                return sentinel
        else:
            try:
                return float(val)
            except (TypeError, ValueError):
                return sentinel

    filled_col_name = f"{bt_col}_filled_from_yb"
    prepared_df[filled_col_name] = prepared_df.apply(fill_with_yb, axis=1)
    # Flag rows where neither the current nor the previous-year result was usable
    prepared_df[filled_col_name + "_flag"] = (prepared_df[filled_col_name] == sentinel).astype(int)

In [28]:
filled_from_yb_columns_list = [
 'Best_Pos_BT_UWT_filled_from_yb',
 'Best_Pos_BT_PT_filled_from_yb',                              
 'best_recent_tdf_result_sent',
 'best_recent_other_gt_result_sent']

In [29]:
filled_from_yb_columns_flag_list = [
 'Best_Pos_BT_UWT_filled_from_yb_flag',
 'Best_Pos_BT_PT_filled_from_yb_flag',
 'best_recent_tdf_result_sent_flag',
 'best_recent_other_gt_result_sent_flag']

### Check Scenarios logic worked

Count number of sentinel values in each column

In [30]:
# Identify relevant columns to check
sentinel_cols = [col for col in prepared_df.columns if col.endswith('_sent') 
                 or col.endswith('_filled_from_pt') or col.endswith('_filled_from_yb')]

# Count the number of sentinel values in each
sentinel_counts = prepared_df[sentinel_cols].apply(lambda col: (col == sentinel).sum()).sort_values(ascending=False)

print("Sentinel value counts per column:")
print(sentinel_counts)

Sentinel value counts per column:
best_recent_tdf_result_sent         19427
best_recent_other_gt_result_sent    19425
TDF_Pos_sent                        19068
rode_giro_sent                      18608
Best_Pos_UWT_YB_sent                13477
Best_Pos_BT_UWT_sent                13094
Best_Pos_BT_UWT_filled_from_yb      11350
Best_Pos_PT_YB_sent                  9685
Best_Pos_BT_PT_sent                  8756
Best_Pos_BT_UWT_filled_from_pt       5919
Best_Pos_BT_PT_filled_from_yb        5042
Best_Pos_BT_PT_dnf_sent              1981
Best_Pos_PT_YB_dnf_sent              1031
Best_Pos_BT_UWT_dnf_sent             1009
Best_Pos_UWT_YB_dnf_sent              678
TDF_Pos_dnf_sent                      460
dtype: int64


Confirm new filled columns aren't empty or completely filled with sentinel

In [31]:
filled_cols = [col for col in prepared_df.columns if col.endswith('_filled_from_pt') or col.endswith('_filled_from_yb')]

for col in filled_cols:
    total = len(prepared_df)
    sentinel_count = (prepared_df[col] == sentinel).sum()
    null_count = prepared_df[col].isnull().sum()
    unique_vals = prepared_df[col].nunique(dropna=True)

    print(f"{col}:")
    print(f"  Total rows: {total}")
    print(f"  Sentinel count: {sentinel_count}")
    print(f"  Null count: {null_count}")
    print(f"  Unique non-null values: {unique_vals}")
    print()

Best_Pos_BT_UWT_filled_from_pt:
  Total rows: 21234
  Sentinel count: 5919
  Null count: 0
  Unique non-null values: 281

Best_Pos_BT_UWT_filled_from_yb:
  Total rows: 21234
  Sentinel count: 11350
  Null count: 0
  Unique non-null values: 172

Best_Pos_BT_PT_filled_from_yb:
  Total rows: 21234
  Sentinel count: 5042
  Null count: 0
  Unique non-null values: 168



Spot-check the logic of fallback columns (e.g. Scenario 4)

In [32]:
# Compare original, fallback, and final filled values
check_sample = prepared_df[
    ['Best_Pos_BT_UWT', 'Best_Pos_BT_PT', 'Best_Pos_BT_UWT_filled_from_pt']
].sample(10)

print(check_sample)

      Best_Pos_BT_UWT Best_Pos_BT_PT  Best_Pos_BT_UWT_filled_from_pt
20054            34.0           29.0                            34.0
15118             NaN           84.0                           130.5
11438             NaN            NaN                           999.0
21108             NaN          120.0                           184.5
3430             36.0            NaN                            36.0
3021              NaN          105.0                           162.0
19022             NaN           58.0                            91.5
426              63.0           20.0                            63.0
1899              NaN           40.0                            64.5
13416             NaN            NaN                           999.0
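
The fallback rows in this sample can be checked by hand against the Scenario 4 formula, (PT position + 3) × 1.5. The sketch below repeats that check in code, using the value pairs printed above; the helper `weighted_pt` and the `sample` list are illustrative and not part of the notebook.

```python
PT_WEIGHT_ADD = 3     # same values as pt_weight_add / pt_weight_mult
PT_WEIGHT_MULT = 1.5

def weighted_pt(pt_pos):
    """Apply the additive and multiplicative weights to a PT position."""
    return (pt_pos + PT_WEIGHT_ADD) * PT_WEIGHT_MULT

# (Best_Pos_BT_PT, Best_Pos_BT_UWT_filled_from_pt) for sampled rows with a missing UWT result
sample = [(84.0, 130.5), (120.0, 184.5), (105.0, 162.0), (58.0, 91.5), (40.0, 64.5)]

mismatches = [(pt, filled) for pt, filled in sample if weighted_pt(pt) != filled]
print(mismatches)  # an empty list means every sampled row matches the formula
```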


In [33]:
prepared_df[prepared_df["Year"]==2012].shape

(1486, 49)

In [34]:
prepared_df[prepared_df["Year"]==2023].shape

(1661, 49)

In [35]:
prepared_df[prepared_df["Year"]==2024].shape

(1629, 49)

Start from 2012: the dataset begins in 2011, so the "YB" (Year Before) features for 2011 rows have no prior season to draw on and are empty.

In [36]:
prepared_df = prepared_df[prepared_df['Year'] >= 2012]

In [37]:
# Filter out DNF or DSQ from TDF_Pos
prepared_df = prepared_df[~prepared_df['TDF_Pos'].isin(['DNF', 'DSQ'])]

In [38]:
# Filter out nulls from TDF_Pos; .copy() avoids SettingWithCopyWarning on the assignments below
prepared_df = prepared_df.dropna(subset=['TDF_Pos']).copy()

In [39]:
# Convert TDF_Pos to numeric
prepared_df['TDF_Pos'] = pd.to_numeric(prepared_df['TDF_Pos'])

# 1 if TDF_Pos <= 20, else 0
prepared_df['is_top20'] = (prepared_df['TDF_Pos'] <= 20).astype(int)

## Scenario Testing

In [40]:
core_features = ['Age', 'FC_Pos_YB']

In [41]:
scenario_dict = {}

# Static scenarios (Scenarios 1 to 3)
scenario_dict['null'] = {
    'X': prepared_df[core_features + null_columns_list],
    'y': prepared_df['is_top20']
}

scenario_dict['sent'] = {
    'X': prepared_df[core_features + sent_columns_list + sent_flag_columns_list],
    'y': prepared_df['is_top20']
}

scenario_dict['dnf_sent'] = {
    'X': prepared_df[core_features + dnf_sent_columns_list + dnf_flag_columns_list],
    'y': prepared_df['is_top20']
}

# Define wider range of PT weight scenarios including no weight
pt_weight_scenarios = [
    {"name": "filled_from_pt_no_weight", "add": 0, "mult": 1.0},
    {"name": "filled_from_pt_low_weight", "add": 1, "mult": 1.2},
    {"name": "filled_from_pt_medium_weight", "add": 3, "mult": 1.5},
    {"name": "filled_from_pt_high_weight", "add": 5, "mult": 2.0},
    {"name": "filled_from_pt_very_high_weight", "add": 7, "mult": 2.5},
]

for pt_scenario in pt_weight_scenarios:
    scenario_name = pt_scenario["name"]
    df_with_filled, filled_cols, flag_cols = generate_filled_from_pt_cols(
        prepared_df,
        pt_weight_add=pt_scenario["add"],
        pt_weight_mult=pt_scenario["mult"]
    )

    scenario_dict[scenario_name] = {
        "X": df_with_filled[core_features + filled_cols + flag_cols],
        "y": df_with_filled["is_top20"],
        "pt_add": pt_scenario["add"],
        "pt_mult": pt_scenario["mult"]
    }


A RandomForestClassifier is used because it performed best in initial tests, with very strong recall.
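
The classifier in the next cell is configured with `class_weight='balanced'`, which matters here because Top 20 finishers are a small minority of starters. scikit-learn documents the balanced weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to the 2024 test-set class counts reported later in this notebook (121 and 20); these counts are used purely for illustration, since the training-set counts are not shown.

```python
import numpy as np

# Illustrative labels: 121 riders outside the Top 20, 20 inside
y = np.array([0] * 121 + [1] * 20)

n_classes = 2
class_counts = np.bincount(y)                    # array([121, 20])
weights = len(y) / (n_classes * class_counts)    # one weight per class

for label, weight in enumerate(weights):
    print(f"class {label}: weight {weight:.3f}")
```

With these counts, an error on a Top 20 rider costs roughly six times as much as an error on any other rider, which pushes the model towards higher recall on the positive class.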

In [43]:
cv_splitter = StratifiedKFold(n_splits=5, shuffle=False)

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    #('scaler', StandardScaler()), 
    ('classifier', RandomForestClassifier(random_state=42, class_weight='balanced'))
])

param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5],
    'classifier__min_samples_leaf': [1, 2]
}

results = []

for scenario_name, scenario_data in scenario_dict.items():
    
    pt_add = scenario_data.get("pt_add")
    pt_mult = scenario_data.get("pt_mult")

    print("\n==============================")
    print(f"Scenario: {scenario_name} | PT add: {pt_add} | PT mult: {pt_mult}")
    print("==============================")

    y_binary = scenario_data['y']

    train_mask = (prepared_df['Year'] >= 2012) & (prepared_df['Year'] <= 2023)
    test_mask = (prepared_df['Year'] == 2024)

    X_train = scenario_data['X'].loc[train_mask]
    y_train = y_binary.loc[train_mask]
    X_test = scenario_data['X'].loc[test_mask]
    y_test = y_binary.loc[test_mask]

    grid_search = GridSearchCV(pipeline, param_grid, cv=cv_splitter, scoring='roc_auc', n_jobs=-1, verbose=0)
    grid_search.fit(X_train, y_train)

    best_model = grid_search.best_estimator_

    top20_probs = best_model.predict_proba(X_test)[:, 1]
    y_test_pred = best_model.predict(X_test)

    print(f"Best Parameters: {grid_search.best_params_}")
    print("Classification Report (Test Set - 2024):")
    print(classification_report(y_test, y_test_pred))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"AUC Score on Test Set: {roc_auc_score(y_test, top20_probs):.3f}")

    rf_model = best_model.named_steps['classifier']
    importances = rf_model.feature_importances_
    feature_names = scenario_data['X'].columns

    feature_importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importances
    }).sort_values(by='Importance', ascending=False)

    print("\nTop Feature Importances:")
    print(feature_importance_df.head(30))

    #plt.figure(figsize=(8, 5))
    #plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'])
    #plt.gca().invert_yaxis()
    #plt.title(f"Feature Importance - {scenario_name}")
    #plt.xlabel("Importance")
    #plt.tight_layout()
    #plt.show()

    results.append({
        "Scenario": scenario_name,
        "Accuracy": accuracy_score(y_test, y_test_pred),
        "Recall_1": recall_score(y_test, y_test_pred, pos_label=1),
        "F1_1": f1_score(y_test, y_test_pred, pos_label=1),
        "AUC": roc_auc_score(y_test, top20_probs)
    })


Scenario: null | PT add: None | PT mult: None
Best Parameters: {'classifier__max_depth': 20, 'classifier__min_samples_leaf': 2, 'classifier__min_samples_split': 5, 'classifier__n_estimators': 100}
Classification Report (Test Set - 2024):
              precision    recall  f1-score   support

           0       0.93      0.94      0.93       121
           1       0.61      0.55      0.58        20

    accuracy                           0.89       141
   macro avg       0.77      0.75      0.76       141
weighted avg       0.88      0.89      0.88       141

Confusion Matrix:
[[114   7]
 [  9  11]]
AUC Score on Test Set: 0.896

Top Feature Importances:
                Feature  Importance
2  Best_Pos_BT_UWT_null    0.499411
1             FC_Pos_YB    0.301404
3   Best_Pos_BT_PT_null    0.115888
0                   Age    0.083298

Scenario: sent | PT add: None | PT mult: None
Best Parameters: {'classifier__max_depth': None, 'classifier__min_samples_leaf': 2, 'classifier__min_samples_sp

In [44]:
# Summary of results
pd.DataFrame(results).sort_values(by='Recall_1', ascending=False)

| | Scenario | Accuracy | Recall_1 | F1_1 | AUC |
|---|---|---|---|---|---|
| 1 | sent | 0.929078 | 0.70 | 0.736842 | 0.961983 |
| 2 | dnf_sent | 0.914894 | 0.70 | 0.700000 | 0.966529 |
| 3 | filled_from_pt_no_weight | 0.851064 | 0.70 | 0.571429 | 0.913223 |
| 4 | filled_from_pt_low_weight | 0.865248 | 0.70 | 0.595745 | 0.911570 |
| 5 | filled_from_pt_medium_weight | 0.851064 | 0.70 | 0.571429 | 0.909917 |
| 6 | filled_from_pt_high_weight | 0.858156 | 0.65 | 0.565217 | 0.921074 |
| 7 | filled_from_pt_very_high_weight | 0.858156 | 0.65 | 0.565217 | 0.909091 |
| 0 | null | 0.886525 | 0.55 | 0.578947 | 0.896281 |
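
The summary metrics can be traced back to the confusion matrices printed during the run. As a worked check, the sketch below recomputes the `null` scenario's figures from its 2024 confusion matrix, `[[114, 7], [9, 11]]`, using the standard definitions; the variable names are illustrative.

```python
# Confusion matrix for the `null` scenario (rows = actual, columns = predicted)
tn, fp = 114, 7   # actual class 0
fn, tp = 9, 11    # actual class 1 (Top 20)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"accuracy={accuracy:.6f} precision={precision:.2f} recall={recall:.2f} f1={f1:.6f}")
# accuracy=0.886525 precision=0.61 recall=0.55 f1=0.578947
```

With only 20 positive cases in the test year, each additional Top 20 rider found moves recall by 0.05, so the recall differences between scenarios in the table correspond to no more than three riders.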
