# XGBoost (eXtreme Gradient Boosting) for Train Delay Prediction

### First, we load the data, then configure an XGBoost model and fit it to our data.


The first step is to set up our imports and the logging configuration.

In [2]:
from xgboost import XGBRegressor
from sklearn.multioutput import MultiOutputRegressor

import logging  # imported explicitly instead of relying on `from utils import *`
import os
import sys

sys.path.append(os.path.dirname("/Users/mac/Desktop/train_delay_prediction/utils.py"))

from utils import *

logging.basicConfig(
    filename='xgboost_evaluation.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
)
logging.info("Starting eXtreme Gradient Boosting evaluation script.")

In [None]:
# PARAMETERS
percentage_of_data_usage = 1.0
train_months = [1]      # This will be updated dynamically ([1], [1,2], ..., [1,...,11])
test_months = [12]      # December is fixed for testing
suffix = ""             # Suffix to uniquely identify each progressive run ("_prog_1")
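
The comments above describe a progressive setup in which the training window grows from `[1]` to `[1, ..., 11]` while December stays fixed for testing. A minimal sketch of how those runs could be generated is shown below. The helper `progressive_runs` is hypothetical, and we assume the number in the suffix is the last training month of the run.

```python
def progressive_runs(first_month=1, last_train_month=11, test_month=12):
    """Yield (train_months, test_months, suffix) for each progressive run.

    Hypothetical helper: the suffix numbering is an assumption based on
    the "_prog_1" example in the parameters cell.
    """
    for end in range(first_month, last_train_month + 1):
        # Training window grows by one month per run: [1], [1, 2], ...
        train_months = list(range(first_month, end + 1))
        yield train_months, [test_month], f"_prog_{end}"


runs = list(progressive_runs())
for train_months, test_months, suffix in runs:
    print(train_months, test_months, suffix)
```

Each tuple can then be assigned to `train_months`, `test_months` and `suffix` before the rest of the notebook is executed.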

In [None]:
# Hyperparameter values that performed best:
n_estimators = 10
max_depth = 7

print(f"Running XGBoost with n_estimators={n_estimators} and max_depth={max_depth}")

In [3]:
combine_metrics()

Combining JSON metrics: 100%|██████████| 13/13 [00:00<00:00, 3072.75it/s]




Next, we load the data and split it in a way that is not biased. The training and test sets are separated by departure date, so they cover different months and stay independent of each other. This keeps information from the test period out of training, so the test score is not inflated by overfitting to the same period.

In [None]:
data = load_full_year_data(percentage_of_data_usage=percentage_of_data_usage, train_months=train_months, test_months=test_months)

X_train = data["X_train"]
y_train = data["y_train"]
X_test = data["X_test"]
y_test = data["y_test"]
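
To illustrate the idea of a date-based split, here is a minimal sketch using toy records. It is not the implementation of `load_full_year_data` (which lives in `utils` and is not shown here); the function `split_by_month`, the field names, and the sample values are all made up for illustration.

```python
from datetime import date


def split_by_month(records, train_months, test_months):
    """Split records on the month of their departure date (hypothetical helper)."""
    train = [r for r in records if r["departure_date"].month in train_months]
    test = [r for r in records if r["departure_date"].month in test_months]
    return train, test


# Toy records, invented for this example only.
records = [
    {"departure_date": date(2023, 1, 15), "delay_min": 3.0},
    {"departure_date": date(2023, 2, 3), "delay_min": 0.0},
    {"departure_date": date(2023, 12, 24), "delay_min": 12.0},
]

train, test = split_by_month(records, train_months=[1], test_months=[12])
print(len(train), len(test))
```

Because a record has exactly one departure month, no record can end up in both sets as long as `train_months` and `test_months` do not overlap.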

Now, we are going to set up a multi-output regression, fit the model to our data, and store the result of the training in a variable.

In [None]:
trained_models = {}

xgb_regressor = MultiOutputRegressor(
    XGBRegressor(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42,
        n_jobs=-1
    )
)

model_name = "XGBoost"
trained_model_data = train(xgb_regressor, X_train, y_train, model_name, savemodel=False)
trained_models[model_name] = trained_model_data
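
`MultiOutputRegressor` handles several targets by fitting one independent copy of the base regressor per target column. The sketch below mimics that strategy in plain Python so the mechanics are visible. Both classes are toy stand-ins written for this example: `MeanRegressor` replaces `XGBRegressor`, and `PerTargetRegressor` is not scikit-learn's class.

```python
class MeanRegressor:
    """Toy base model: always predicts the mean of the training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class PerTargetRegressor:
    """Fit one independent base model per target column (toy stand-in)."""

    def __init__(self, make_model):
        self.make_model = make_model

    def fit(self, X, Y):
        n_targets = len(Y[0])
        # One fresh model per column of Y.
        self.models_ = [
            self.make_model().fit(X, [row[j] for row in Y])
            for j in range(n_targets)
        ]
        return self

    def predict(self, X):
        per_target = [m.predict(X) for m in self.models_]
        # Transpose from (targets, samples) to (samples, targets).
        return [list(row) for row in zip(*per_target)]


X = [[0.0], [1.0], [2.0]]
Y = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # two targets per sample

model = PerTargetRegressor(MeanRegressor).fit(X, Y)
predictions = model.predict([[5.0]])
print(predictions)  # [[2.0, 20.0]]
```

One consequence of this design is that the targets are modelled independently, so any correlation between them is not exploited by the wrapper itself.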

We use several score metrics to measure accuracy and, eventually, to compare this model with the others. All of these metrics are saved to a .npy file and a .json file so that they can be stored and reloaded easily when needed.

In [None]:
logging.info(f"Starting evaluation {suffix}")
metrics = evaluate_2_fullyear(trained_model=trained_model_data, X_test=X_test, y_test=y_test, model_name=model_name + suffix)
logging.info(f"Evaluation complete {suffix}")
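
As a rough illustration of this step, the sketch below computes two common regression metrics and serializes them to JSON. The exact metrics used by `evaluate_2_fullyear` are not shown in this notebook, so MAE and RMSE are an assumption here, and `regression_metrics` and the sample values are made up.

```python
import json
import math


def regression_metrics(y_true, y_pred):
    """Return MAE and RMSE for one target (hypothetical helper)."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"mae": mae, "rmse": rmse}


# Toy delays in minutes, invented for this example only.
y_true = [0.0, 5.0, 10.0]
y_pred = [1.0, 5.0, 7.0]

metrics_example = regression_metrics(y_true, y_pred)

# Key the metrics by model name plus suffix so each run stays distinguishable.
serialized = json.dumps({"XGBoost_prog_1": metrics_example}, indent=2)
restored = json.loads(serialized)
print(serialized)
```

Writing to disk would use `json.dump` with an open file handle in the same way; the in-memory round trip is used here only to keep the example free of side effects.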

The next step is to plot some graphs to visualize the results. The most important one is the last graph, which shows which features have the most influence on our predictions. The call is currently commented out; uncomment it to compute the feature importance.

In [None]:
# calculate_feature_importance(
#     trained_models=trained_models,
#     X_test=X_test,
#     y_test=y_test,
#     top_features_threshold=0.01,
#     n_repeats=5
# )
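
The `n_repeats` argument suggests that `calculate_feature_importance` relies on permutation importance, although its implementation is not shown here, so this is an assumption. Under that assumption, a minimal sketch of the technique looks like this: shuffle one feature column at a time and measure how much the error increases. All names below are hypothetical, and the reading of `top_features_threshold` as a cut-off on the importance value is also a guess.

```python
import random


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Average increase in MAE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_absolute_error(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild X with only column j replaced by its shuffled version.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mean_absolute_error(y, predict(X_perm)) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_predict(X):
    """Toy model that only looks at feature 0."""
    return [2.0 * row[0] for row in X]


X = [[float(i), float((7 * i) % 5)] for i in range(20)]
y = [2.0 * row[0] for row in X]

importances = permutation_importance(toy_predict, X, y, n_repeats=5, seed=0)

# Assumed meaning of the threshold: keep features whose importance exceeds it.
top_features_threshold = 0.01
selected = [j for j, imp in enumerate(importances) if imp > top_features_threshold]
print(importances, selected)
```

Feature 1 is ignored by the toy model, so shuffling it leaves the error unchanged and its importance is zero, while feature 0 receives a clearly positive score.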

### We are done with training and evaluating this model for train delay prediction and will move on to the next one. Feel free to load the results wherever they are needed, check out the other models, or see the comparison of all models in the results notebook.