# Exploration phase — further testing
Initial testing of different algorithms is complete. XGBRegressor and GradientBoostingRegressor, both tree-based boosting models, clearly performed best. In this phase of exploration, we tune these two further in order to train the best possible performance-predicting model for each circuit.

In [1]:
import time
from pathlib import Path

import pandas as pd
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.base import clone
from xgboost import XGBRegressor

In [2]:
project_path = Path.cwd().parent
bootstrap_project(project_path)
print(f"Project path: {project_path}")

Project path: /Users/kacperziebacz/Desktop/f1-pitstop-advisor-v2


In [3]:
with KedroSession.create(project_path=project_path) as session:
    context = session.load_context()

In [4]:
dfs = context.catalog.load("circuit_lap_data")

print(f"Loaded data for {len(dfs)} circuits:")
for circuit, df in dfs.items():
    print(f"  {circuit}: {df.shape[0]} laps, {df.shape[1]} features")

Loaded data for 22 circuits:
  Catalunya: 5034 laps, 14 features
  Spa-Francorchamps: 1592 laps, 14 features
  Silverstone: 2495 laps, 14 features
  Singapore: 3778 laps, 14 features
  Hungaroring: 4951 laps, 14 features
  Suzuka: 2754 laps, 14 features
  Paul Ricard: 945 laps, 14 features
  Austin: 918 laps, 14 features
  Miami: 2160 laps, 14 features
  Zandvoort: 5035 laps, 14 features
  Monte Carlo: 4244 laps, 14 features
  Montreal: 4307 laps, 14 features
  Monza: 3898 laps, 14 features
  Melbourne: 3071 laps, 14 features
  Spielberg: 1124 laps, 14 features
  Sakhir: 4415 laps, 14 features
  Imola: 2442 laps, 14 features
  Baku: 2734 laps, 14 features
  Mexico City: 3843 laps, 14 features
  Jeddah: 3430 laps, 14 features
  Yas Marina Circuit: 3307 laps, 14 features
  Las Vegas: 1799 laps, 14 features


In [5]:
tested_models = context.catalog.load("initial_models")

In [6]:
# Show data point count for each circuit
circuit_sizes = {}
for circuit, df in dfs.items():
    circuit_sizes[circuit] = df.shape[0]

circuit_sizes = pd.DataFrame({"DataPointCount": circuit_sizes}).sort_values(
    by="DataPointCount", ascending=False
)
circuit_sizes

Circuit,DataPointCount
Zandvoort,5035
Catalunya,5034
Hungaroring,4951
Sakhir,4415
Montreal,4307
Monte Carlo,4244
Monza,3898
Mexico City,3843
Singapore,3778
Jeddah,3430


In [7]:
best_params = {}
for model_type in ["GradientBoostingRegressor", "XGBRegressor"]:
    best_params[model_type] = {}
    for circuit in dfs.keys():
        best_params[model_type][circuit] = tested_models[model_type][
            circuit
        ].best_params_

In [8]:
for model_type, params in best_params.items():
    best_params[model_type] = pd.DataFrame(params).T

## Exploring parameters from previous testing
Below are the best parameters from initial exploration for GradientBoostingRegressor, for each circuit. Based on this we will determine the parameter value ranges to explore when further optimizing the models.

In [9]:
best_params["GradientBoostingRegressor"]

Circuit,learning_rate,max_depth,n_estimators,subsample
Catalunya,0.05,5.0,100.0,0.8
Spa-Francorchamps,0.05,3.0,200.0,1.0
Silverstone,0.1,3.0,200.0,0.8
Singapore,0.1,3.0,200.0,1.0
Hungaroring,0.01,5.0,100.0,0.8
Suzuka,0.05,3.0,200.0,0.8
Paul Ricard,0.1,3.0,200.0,1.0
Austin,0.1,3.0,200.0,0.8
Miami,0.1,3.0,200.0,0.8
Zandvoort,0.1,5.0,200.0,0.8


### Same for XGBRegressor

In [10]:
best_params["XGBRegressor"]

Circuit,colsample_bytree,learning_rate,max_depth,n_estimators,subsample
Catalunya,0.8,0.01,6.0,400.0,0.8
Spa-Francorchamps,0.8,0.1,3.0,100.0,0.8
Silverstone,0.8,0.3,3.0,200.0,1.0
Singapore,1.0,0.1,10.0,100.0,1.0
Hungaroring,1.0,0.01,10.0,200.0,0.8
Suzuka,1.0,0.1,3.0,200.0,1.0
Paul Ricard,0.8,0.1,3.0,400.0,1.0
Austin,1.0,0.3,3.0,200.0,0.8
Miami,0.8,0.3,3.0,100.0,0.8
Zandvoort,1.0,0.3,3.0,400.0,0.8


In [11]:
param_ranges = {}
for model_type, param_df in best_params.items():
    param_ranges[model_type] = pd.DataFrame(
        {"Min": param_df.min(axis="index"), "Max": param_df.max(axis="index")}
    )

Below we look at the exact parameter value ranges for both regression algorithms:

In [12]:
param_ranges["GradientBoostingRegressor"]

Parameter,Min,Max
learning_rate,0.01,0.1
max_depth,3.0,5.0
n_estimators,100.0,200.0
subsample,0.8,1.0


In [13]:
param_ranges["XGBRegressor"]

Parameter,Min,Max
colsample_bytree,0.8,1.0
learning_rate,0.01,0.3
max_depth,3.0,10.0
n_estimators,100.0,400.0
subsample,0.8,1.0


## Parameter search grids
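Based on the ranges above, I defined the grid searches below. Each grid stays close to the parameter values that won in the initial exploration and adds one parameter that was not tuned before (`min_samples_leaf` for GradientBoostingRegressor, `min_child_weight` for XGBRegressor). Every grid search is fitted once per circuit to find the best parameters for each model on each circuit.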

In [14]:
model_searches = {
    "GradientBoostingRegressor": GridSearchCV(
        GradientBoostingRegressor(random_state=42),
        {
            "n_estimators": [100, 150, 200],
            "learning_rate": [0.05, 0.1],
            "max_depth": [3, 4, 5],
            "subsample": [0.8, 0.9, 1.0],
            "min_samples_leaf": [1, 3],
        },
    ),
    "XGBRegressor": GridSearchCV(
        XGBRegressor(
            random_state=42, n_jobs=-1, objective="reg:squarederror", verbosity=0
        ),
        {
            "n_estimators": [100, 200, 300, 400],
            "max_depth": [3, 6, 10],
            "learning_rate": [0.01, 0.1, 0.3],
            "subsample": [0.8, 1.0],
            "colsample_bytree": [0.8, 1.0],
            "min_child_weight": [1, 3],
        },
    ),
}
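
To get a feel for the cost of these searches before running them, the number of candidate combinations can be counted directly from the grids. The sketch below uses only the standard library; the names `gbr_grid`, `xgb_grid`, `count_candidates` and `CV_FOLDS` are illustrative and do not appear in the notebook. It assumes the default 5-fold cross-validation that `GridSearchCV` uses when `cv` is not passed (scikit-learn 0.22 and later).

```python
from math import prod

# The two grids from the cell above, copied as plain dictionaries.
gbr_grid = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 4, 5],
    "subsample": [0.8, 0.9, 1.0],
    "min_samples_leaf": [1, 3],
}
xgb_grid = {
    "n_estimators": [100, 200, 300, 400],
    "max_depth": [3, 6, 10],
    "learning_rate": [0.01, 0.1, 0.3],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "min_child_weight": [1, 3],
}

# Assumption: GridSearchCV is left at its default of 5 folds.
CV_FOLDS = 5


def count_candidates(grid):
    """Number of parameter combinations an exhaustive grid search tries."""
    return prod(len(values) for values in grid.values())


for name, grid in [("GradientBoostingRegressor", gbr_grid), ("XGBRegressor", xgb_grid)]:
    candidates = count_candidates(grid)
    # Each candidate is fitted once per fold; the final refit adds one more fit.
    print(f"{name}: {candidates} candidates, {candidates * CV_FOLDS} CV fits per circuit")
```

The XGBRegressor grid has more than twice as many candidates as the GradientBoostingRegressor grid, which is consistent with its longer fitting times per circuit.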

In [15]:
# Fit every single circuit/GridSearch configuration
models_and_circuits = {}

for name in model_searches.keys():
    models_and_circuits[name] = {}

for circuit, data in dfs.items():
    print(f"Fitting models for {circuit}")
    circuit_start = time.time()

    # Drop the target LapTimeZScore and the text column Compound
    X = data.drop(["LapTimeZScore", "Compound"], axis="columns")
    y = data["LapTimeZScore"]

    print(f"  Shape: {X.shape}, Columns: {list(X.columns)}")

    # Keep only the rows where the target is present
    mask = ~y.isna()
    X = X[mask]
    y = y[mask]

    # Fill missing feature values with the column mean
    imputer = SimpleImputer(strategy="mean")
    X_imputed = imputer.fit_transform(X)
    X = pd.DataFrame(X_imputed, columns=X.columns, index=X.index)

    print(f"  Shape after imputation: {X.shape}")

    for name, model_search in model_searches.items():
        print(f"Fitting {name};".ljust(50), end="")
        model_start = time.time()

        # Clone so that every circuit gets its own unfitted search object
        model_search_copy = clone(model_search)
        model_search_copy.fit(X, y)
        models_and_circuits[name][circuit] = model_search_copy

        print(f"took {round(time.time() - model_start, 2)} seconds")

    print(
        f'Took a total of {round(time.time() - circuit_start, 2)} seconds to fit all models for circuit "{circuit}"\n'
    )

Fitting models for Catalunya
  Shape: (5034, 12), Columns: ['IsPitLap', 'TyreLife', 'FreshTyre', 'LapNumber', 'AirTemp', 'Humidity', 'Pressure', 'Rainfall', 'TrackTemp', 'WindDirection', 'WindSpeed', 'CompoundNumeric']
  Shape after imputation: (5034, 12)
Fitting GradientBoostingRegressor;                took 196.64 seconds
Fitting XGBRegressor;                             took 425.7 seconds
Took a total of 622.35 seconds to fit all models for circuit "Catalunya"

Fitting models for Spa-Francorchamps
  Shape: (1592, 12), Columns: ['IsPitLap', 'TyreLife', 'FreshTyre', 'LapNumber', 'AirTemp', 'Humidity', 'Pressure', 'Rainfall', 'TrackTemp', 'WindDirection', 'WindSpeed', 'CompoundNumeric']
  Shape after imputation: (1592, 12)
Fitting GradientBoostingRegressor;                took 60.9 seconds
Fitting XGBRegressor;                             took 395.75 seconds
Took a total of 456.65 seconds to fit all models for circuit "Spa-Francorchamps"

Fitting models for Silverstone
  Shape: (2495, 12), Col
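
One thing to note about the fitting loop: the imputer is fitted on all rows of a circuit before `GridSearchCV` splits them into folds, so every validation fold contributes to the column means used for imputation. With mean imputation the effect is probably small, but it can be avoided by placing the imputer inside a scikit-learn `Pipeline`, so that it is refitted on the training part of each fold. The sketch below shows the idea on synthetic data with a deliberately tiny grid. The data, the step names `imputer` and `model`, and the grid values are assumptions made for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption, not the project's lap data).
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=120)
X[rng.random(X.shape) < 0.1] = np.nan  # inject roughly 10% missing values

# The imputer is a pipeline step, so it is fitted only on the
# training portion of each cross-validation fold.
pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="mean")),
        ("model", GradientBoostingRegressor(random_state=42)),
    ]
)

# Parameters of a pipeline step are addressed as "<step>__<parameter>".
search = GridSearchCV(
    pipeline,
    {"model__n_estimators": [50, 100], "model__max_depth": [2, 3]},
)
search.fit(X, y)
print(search.best_params_)
```

With this layout, `search.best_estimator_` is the whole pipeline, so the same imputation is applied automatically at prediction time.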

In [16]:
context.catalog.save("second_phase_models", models_and_circuits)

## Results
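Below are the results of our testing. Each score is the R^2 reported in `best_score_`, i.e. the mean cross-validated R^2 of the best parameter combination found by the grid search. XGBRegressor scores higher on most circuits, but on some (Suzuka, for example) GradientBoostingRegressor comes out ahead.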

In [17]:
# Show scores for each GridSearch and circuit
all_scores = {}
for key in models_and_circuits.keys():
    scores = {}
    for circuit, model in models_and_circuits[key].items():
        scores[circuit] = model.best_score_
    all_scores[key] = scores

all_scores: pd.DataFrame = pd.DataFrame(all_scores)

all_scores

Circuit,GradientBoostingRegressor,XGBRegressor
Catalunya,0.706168,0.724938
Spa-Francorchamps,0.918907,0.919926
Silverstone,0.785578,0.800788
Singapore,0.647672,0.741458
Hungaroring,0.655903,0.684192
Suzuka,0.836539,0.82647
Paul Ricard,0.887277,0.897035
Austin,0.920182,0.921024
Miami,0.818684,0.827875
Zandvoort,0.863794,0.87086
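
For reference, the R^2 values in this table follow the usual definition: one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the target. A minimal pure-Python sketch of that formula is given below; the function name `r_squared` and the toy values are illustrative only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

# Perfect predictions leave no residual error.
perfect = r_squared(y_true, y_true)

# Always predicting the mean explains none of the variance.
baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])

print(perfect, baseline)
```

A score of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values mean it does worse than that.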


Overall statistics for both algorithms across all circuits. XGBRegressor has the higher mean and median score, the lower score variance, and the higher minimum score.

In [18]:
# Show score statistics for each model
# MinScore is very important. A good model should perform reasonably well for all tracks.
model_scores_df = pd.DataFrame(
    {
        "MeanScore": all_scores.mean(axis="index"),
        "MedianScore": all_scores.median(axis="index"),
        "ScoreVariance": all_scores.var(axis="index"),
        "MinScore": all_scores.min(axis="index"),
    }
)

model_scores_df.sort_values(by=["MeanScore"], ascending=False)

Model,MeanScore,MedianScore,ScoreVariance,MinScore
XGBRegressor,0.816156,0.834435,0.006849,0.67216
GradientBoostingRegressor,0.79669,0.832405,0.011413,0.538924


In [19]:
# Drop BestModelType if it already exists
all_scores.drop(labels=["BestModelType"], axis="columns", inplace=True, errors="ignore")
all_scores["BestModelType"] = all_scores.idxmax(axis="columns")
all_scores

Circuit,GradientBoostingRegressor,XGBRegressor,BestModelType
Catalunya,0.706168,0.724938,XGBRegressor
Spa-Francorchamps,0.918907,0.919926,XGBRegressor
Silverstone,0.785578,0.800788,XGBRegressor
Singapore,0.647672,0.741458,XGBRegressor
Hungaroring,0.655903,0.684192,XGBRegressor
Suzuka,0.836539,0.82647,GradientBoostingRegressor
Paul Ricard,0.887277,0.897035,XGBRegressor
Austin,0.920182,0.921024,XGBRegressor
Miami,0.818684,0.827875,XGBRegressor
Zandvoort,0.863794,0.87086,XGBRegressor
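
The same selection can be reproduced without pandas. The sketch below copies the scores of the ten circuits displayed above into a plain dictionary, picks the higher-scoring model type per circuit, and tallies the wins. The variable names are illustrative, and the tally covers only the displayed rows, not all 22 circuits.

```python
from collections import Counter

# Scores copied from the ten rows displayed above.
scores = {
    "Catalunya": {"GradientBoostingRegressor": 0.706168, "XGBRegressor": 0.724938},
    "Spa-Francorchamps": {"GradientBoostingRegressor": 0.918907, "XGBRegressor": 0.919926},
    "Silverstone": {"GradientBoostingRegressor": 0.785578, "XGBRegressor": 0.800788},
    "Singapore": {"GradientBoostingRegressor": 0.647672, "XGBRegressor": 0.741458},
    "Hungaroring": {"GradientBoostingRegressor": 0.655903, "XGBRegressor": 0.684192},
    "Suzuka": {"GradientBoostingRegressor": 0.836539, "XGBRegressor": 0.82647},
    "Paul Ricard": {"GradientBoostingRegressor": 0.887277, "XGBRegressor": 0.897035},
    "Austin": {"GradientBoostingRegressor": 0.920182, "XGBRegressor": 0.921024},
    "Miami": {"GradientBoostingRegressor": 0.818684, "XGBRegressor": 0.827875},
    "Zandvoort": {"GradientBoostingRegressor": 0.863794, "XGBRegressor": 0.87086},
}

# For each circuit, keep the model type with the higher score.
best_model_type = {
    circuit: max(by_model, key=by_model.get) for circuit, by_model in scores.items()
}

# Count how many circuits each model type wins.
wins = Counter(best_model_type.values())
print(wins)
```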


Finally, we export the chosen models to a file for further use. The file contains a Python dictionary whose keys are circuit names and whose values are the best fitted model for each circuit. For most circuits that is an XGBRegressor; for the rest, a GradientBoostingRegressor.

In [20]:
final_regressor_dictionary = {}
for circuit in all_scores.index:
    best_model_type = all_scores.loc[circuit, "BestModelType"]
    final_regressor_dictionary[circuit] = models_and_circuits[best_model_type][
        circuit
    ].best_estimator_

In [21]:
context.catalog.save("best_parameters", final_regressor_dictionary)
print("Saved the regressor dictionary")

Saved the regressor dictionary
