# Data Preparation

This notebook imports the custom `preprocess_tdf_data` function and applies it to the raw dataset to clean and structure it. The processed data is saved to `Data/Processed/` for use in later analysis and modeling.

## Key Steps
- Locates the project root and adds `src/` to `sys.path`
- Applies `preprocess_tdf_data` to the raw data in `Data/Raw/`
- Saves the cleaned dataset to `Data/Processed/`


In [1]:
import sys
import os
from pathlib import Path

In [2]:
def find_project_root(start: Path, anchor_dirs=("src", "Data")) -> Path:
    """
    Walk up the directory tree until we find a folder that
    contains all anchor_dirs (e.g. 'src' and 'Data').
    """
    path = start.resolve()
    for parent in [path] + list(path.parents):
        if all((parent / d).is_dir() for d in anchor_dirs):
            return parent
    raise FileNotFoundError("Could not locate project root")
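To see how the upward search behaves, the sketch below builds a throwaway directory tree and calls the function from a nested folder. The folder names `notebooks/exploration` and the anchor `hypothetical_missing_dir` are made up for the illustration, and the function body is repeated only so the snippet runs on its own.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, anchor_dirs=("src", "Data")) -> Path:
    """Same logic as the cell above, repeated so this snippet is standalone."""
    path = start.resolve()
    for parent in [path] + list(path.parents):
        if all((parent / d).is_dir() for d in anchor_dirs):
            return parent
    raise FileNotFoundError("Could not locate project root")


with tempfile.TemporaryDirectory() as tmp:
    # Fake project layout: <root>/src, <root>/Data, <root>/notebooks/exploration
    root = Path(tmp).resolve()
    (root / "src").mkdir()
    (root / "Data").mkdir()
    nested = root / "notebooks" / "exploration"  # illustrative folder names
    nested.mkdir(parents=True)

    # Starting two levels down still resolves to the folder holding both anchors
    found = find_project_root(nested)
    found_matches_root = found == root

    # An anchor that exists nowhere up the tree triggers the error path
    try:
        find_project_root(nested, anchor_dirs=("hypothetical_missing_dir",))
        raised = False
    except FileNotFoundError:
        raised = True

print("Found expected root:", found_matches_root)
print("Missing anchor raises FileNotFoundError:", raised)
```

Requiring all anchor directories at once makes a false match less likely than testing for a single folder name such as `src`.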

In [3]:
# Locate the project root regardless of notebook depth
project_root = find_project_root(Path.cwd())

# ----- Code modules --------------------------------------------------
src_path = project_root / "src"
if str(src_path) not in sys.path:
    sys.path.append(str(src_path))

In [4]:
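# Fallback for the layout where the notebook sits one level below the
# project root. Redundant when the previous cell has already added src.
src_path = os.path.abspath(os.path.join(os.getcwd(), '..', 'src'))
if os.path.isdir(src_path) and src_path not in sys.path:
    sys.path.append(src_path)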

In [5]:
from data_prep import preprocess_tdf_data

In [6]:
# Data
raw_data_path = project_root / "Data" / "Raw"
processed_data_path = project_root / "Data" / "Processed"
print("Raw data folder:", raw_data_path)
print("Processed data folder:", processed_data_path)

Raw data folder: C:\Users\Shaun Ricketts\Documents\GitHub\Tour-de-France-Top-20-Predictor\Data\Raw
Processed data folder: C:\Users\Shaun Ricketts\Documents\GitHub\Tour-de-France-Top-20-Predictor\Data\Processed
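Before running the preprocessing, it can help to confirm that the raw folder exists and to create the output folder if it is missing. The helper `check_data_dirs` below is hypothetical and not part of the project. This notebook does not show whether `preprocess_tdf_data` creates the output folder itself, so treat this as a defensive sketch.

```python
from pathlib import Path


def check_data_dirs(raw_dir: Path, processed_dir: Path) -> int:
    """Hypothetical helper: validate the input folder and create the output folder.

    Returns the number of entries found directly inside raw_dir.
    """
    if not raw_dir.is_dir():
        raise FileNotFoundError(f"Raw data folder not found: {raw_dir}")
    # parents=True builds missing intermediate folders; exist_ok=True makes reruns safe
    processed_dir.mkdir(parents=True, exist_ok=True)
    return sum(1 for _ in raw_dir.iterdir())
```

In this notebook the call would look like `check_data_dirs(raw_data_path, processed_data_path)`, placed before the preprocessing cell.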


In [7]:
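def run_tdf_preprocessing():
    """Run preprocess_tdf_data on the raw folder and return the prepared DataFrame."""
    output_filename = "tdf_prepared_2011_2024.csv"
    output_full_path = os.path.join(processed_data_path, output_filename)

    # preprocess_tdf_data writes the CSV to output_full_path and returns the data
    df = preprocess_tdf_data(raw_data_path, output_full_path)
    print(f"Data preprocessing complete. Output saved to:\n{output_full_path}")
    return df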


In [8]:
df = run_tdf_preprocessing()

Wrote prepared data to C:\Users\Shaun Ricketts\Documents\GitHub\Tour-de-France-Top-20-Predictor\Data\Processed\tdf_prepared_2011_2024.csv
Data preprocessing complete. Output saved to:
C:\Users\Shaun Ricketts\Documents\GitHub\Tour-de-France-Top-20-Predictor\Data\Processed\tdf_prepared_2011_2024.csv


In [9]:
df

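| Unnamed: 0 | Rider_ID | Year | Age | TDF_Pos | Best_Pos_BT_UWT | Best_Pos_BT_PT | Best_Pos_AT_UWT_YB | Best_Pos_AT_PT_YB | Best_Pos_UWT_YB | Best_Pos_PT_YB | ... | gt_debut | rode_giro | FC_Points | FC_Pos | Best_Pos_AT_UWT | Best_Pos_AT_PT | Best_Pos_UWT | Best_Pos_PT | Best_Pos_BT_UWT_YB | Best_Pos_BT_PT_YB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 2011 | 31 | DNF | 64.0 | 39.0 |  |  |  |  | ... |  | 1.0 | 93.0 | 121 | 111.0 |  | 64.0 | 39.0 |  |  |
| 1 | 5 | 2012 | 32 | 76.0 | 86.0 | 24.0 | 111.0 |  | 64.0 | 39.0 | ... |  | 0.0 | 65.0 | 153 |  |  | 86.0 | 24.0 | 64.0 | 39.0 |
| 2 | 5 | 2013 | 33 |  | 67.0 | 42.0 |  |  | 86.0 | 24.0 | ... |  |  | 106.0 | 134 | 71.0 | 51.0 | 67.0 | 42.0 | 86.0 | 24.0 |
| 3 | 5 | 2014 | 34 |  | 65.0 | 32.0 | 71.0 | 51.0 | 67.0 | 42.0 | ... |  |  | 49.0 | 167 | 98.0 |  | 65.0 | 32.0 | 67.0 | 42.0 |
| 4 | 5 | 2015 | 35 |  | 92.0 | 68.0 | 98.0 |  | 65.0 | 32.0 | ... |  |  | 0.0 | 184 | 95.0 | 66.0 | 92.0 | 66.0 | 65.0 | 32.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2744 | 126678 | 2024 | 21 | 41.0 | 13.0 |  | 42.0 |  | 15.0 | 1.0 | ... |  | 0.0 | 1071.0 | 14 | 4.0 |  | 4.0 |  | 15.0 | 1.0 |
| 2745 | 126678 | 2025 | 22 |  | 15.0 | 4.0 | 4.0 |  | 4.0 |  | ... |  |  | 766.0 | 8 | 9999.0 |  | 15.0 | 4.0 | 13.0 |  |
| 2746 | 156417 | 2023 | 19 |  |  | 14.0 |  |  |  |  | ... |  |  | 252.0 | 76 |  | 46.0 |  | 14.0 |  |  |
| 2747 | 156417 | 2024 | 20 | 47.0 | 21.0 | 8.0 |  | 46.0 |  | 14.0 | ... | 1.0 | 0.0 | 219.0 | 69 |  | 61.0 | 21.0 | 8.0 |  | 14.0 |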
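The preview shows that `TDF_Pos` mixes numeric positions with the string `DNF` and with missing values, so it cannot be compared numerically as it stands. The sketch below is one possible way to handle that, using a handful of `TDF_Pos` values copied from the preview rather than the full dataset. The derived column names `did_not_finish`, `TDF_Pos_num` and `top20` are hypothetical, and the project's actual target definition is not shown in this notebook.

```python
import pandas as pd

# Toy rows taken from the preview above (not the full dataset)
preview = pd.DataFrame({
    "Rider_ID": [5, 5, 5, 126678, 156417],
    "Year": [2011, 2012, 2013, 2024, 2024],
    "TDF_Pos": ["DNF", "76.0", None, "41.0", "47.0"],
})

# Keep the DNF information before it is lost in the numeric conversion
preview["did_not_finish"] = preview["TDF_Pos"].eq("DNF")

# errors="coerce" turns anything non-numeric (here "DNF") into NaN
preview["TDF_Pos_num"] = pd.to_numeric(preview["TDF_Pos"], errors="coerce")

# Hypothetical target flag: comparisons against NaN evaluate to False
preview["top20"] = preview["TDF_Pos_num"].le(20)

print(preview)
```

Recording `did_not_finish` separately keeps riders who abandoned distinguishable from riders who did not start the race at all, since both end up as NaN after coercion.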
