# XGBoost (eXtreme Gradient Boosting) for Train Delay Prediction

### First, we load the data, then configure an XGBoost model and fit it to our data.

The first step is to set up the imports and the logging configuration.

In [1]:
from xgboost import XGBRegressor
from sklearn.multioutput import MultiOutputRegressor

import logging
import os
import sys

sys.path.append(os.path.dirname("/Users/mac/Desktop/train_delay_prediction/utils.py"))

from utils import *

logging.basicConfig(
    filename='xgboost_evaluation.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
)
logging.info("Starting eXtreme Gradient Boosting evaluation script.")

In [2]:
# with open('config.json', 'r') as f:
#     config = json.load(f)

# n_estimators = config.get('n_estimators', 10)
# max_depth = config.get('max_depth', 10)  

# The ones that perform best for normal data:
# n_estimators = 10
# max_depth = 7

# The ones that perform best for data with more features:
n_estimators = 50
max_depth = 10


print(f"Running XGBoost with n_estimators={n_estimators} and max_depth={max_depth}")

Running XGBoost with n_estimators=50 and max_depth=10


Then we load the data and split it in a way that is not biased. This means separating the train and test sets by departure date, so that the two sets are independent and no information from the test dates leaks into training. Without this separation, the evaluation could look better than the model really is.

In [3]:
data = load_data_more_features(percentage_of_data_usage=1.0)

X_train = data["X_train"]
y_train = data["y_train"]
X_test = data["X_test"]
y_test = data["y_test"]
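
How `load_data_more_features` performs the split is not shown in this notebook. As a rough sketch of the idea only, assuming each record carries a departure date, a date-based split could assign whole days to either the train or the test set so that no day appears in both. The function `split_by_departure_date` and the record layout are hypothetical.

```python
from datetime import date
import random


def split_by_departure_date(records, test_fraction=0.2, seed=42):
    """Split records so that every departure date lands entirely in one set.

    `records` is a list of dicts with a "departure_date" key (hypothetical layout).
    """
    # Collect the distinct departure dates and shuffle them reproducibly.
    unique_dates = sorted({r["departure_date"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(unique_dates)

    # Reserve a fraction of the dates (not of the rows) for testing.
    n_test = max(1, int(len(unique_dates) * test_fraction))
    test_dates = set(unique_dates[:n_test])

    train = [r for r in records if r["departure_date"] not in test_dates]
    test = [r for r in records if r["departure_date"] in test_dates]
    return train, test


# Toy data: 10 days, 3 trains per day.
records = [
    {"departure_date": date(2024, 1, day), "train_id": t, "delay_min": day + t}
    for day in range(1, 11)
    for t in range(3)
]
train_rows, test_rows = split_by_departure_date(records, test_fraction=0.2)
```

A chronological variant (train on earlier dates, test on later ones) follows the same pattern: sort the dates instead of shuffling them and take the last ones as the test set.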

Now we are going to perform a multi-output regression, fit the model to our data, and store the predicted delays in a variable.

In [4]:
trained_models = {}

xgb_regressor = MultiOutputRegressor(
    XGBRegressor(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42,
        n_jobs=-1
    )
)

model_name = "XGBoost"
trained_model_data = train(xgb_regressor, X_train, y_train, model_name, savemodel=False)
trained_models[model_name] = trained_model_data
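
`MultiOutputRegressor` fits one independent copy of the base estimator per target column. The sketch below illustrates this on synthetic data. It swaps `XGBRegressor` for scikit-learn's `DecisionTreeRegressor` only to keep the example light, and the data shapes and targets are made up for illustration.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
# Two synthetic targets (stand-ins for two delay columns).
Y = np.column_stack([
    2.0 * X[:, 0] + rng.normal(scale=0.1, size=200),
    -1.0 * X[:, 1] + rng.normal(scale=0.1, size=200),
])

# Decision tree used here as a lightweight stand-in for XGBRegressor.
model = MultiOutputRegressor(DecisionTreeRegressor(max_depth=5, random_state=42))
model.fit(X, Y)

# One fitted estimator per target column.
n_fitted = len(model.estimators_)
# Predictions have one column per target.
predictions = model.predict(X[:5])
```

Because each target gets its own model, correlations between the targets are not exploited, but any single-output regressor can be used unchanged.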

We define some score metrics to measure accuracy and eventually compare our model to the others. All of these metrics are saved to a .npy and a .json file so that they can be stored and loaded easily when needed.

In [5]:
metrics_2 = evaluate_2(
    trained_model=trained_model_data,
    X_test=X_test,
    y_test=y_test,
    model_name=model_name,
)
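
The exact metrics computed by `evaluate_2` are defined in `utils.py` and are not shown here. As an illustration only, the sketch below computes per-target MAE and RMSE with NumPy and writes them to `.json` and `.npy` files. The function name, the choice of metrics, and the file names are assumptions.

```python
import json
import os
import tempfile

import numpy as np


def regression_metrics(y_true, y_pred):
    """Return per-target MAE and RMSE for arrays of shape (n_samples, n_targets)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    return {
        "mae": np.mean(np.abs(errors), axis=0).tolist(),
        "rmse": np.sqrt(np.mean(errors ** 2, axis=0)).tolist(),
    }


# Tiny hand-made example with two targets.
y_true = [[1.0, 10.0], [3.0, 20.0]]
y_pred = [[2.0, 10.0], [5.0, 24.0]]
metrics = regression_metrics(y_true, y_pred)

# Hypothetical output location and file names.
out_dir = tempfile.mkdtemp()
json_path = os.path.join(out_dir, "xgboost_metrics.json")
npy_path = os.path.join(out_dir, "xgboost_metrics.npy")

with open(json_path, "w") as f:
    json.dump(metrics, f, indent=2)
# A dict is stored in .npy as a pickled object array,
# so loading it back requires allow_pickle=True.
np.save(npy_path, metrics)

with open(json_path) as f:
    reloaded_json = json.load(f)
reloaded_npy = np.load(npy_path, allow_pickle=True).item()
```

The JSON file is human-readable and portable, while the `.npy` file is convenient when the results are consumed again from Python.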

The next step is to plot some graphs to visualize the results. The most important one is the last graph, which shows which features have the most influence on our predictions.

In [11]:
calculate_feature_importance(
    trained_models=trained_models,
    X_test=X_test,
    y_test=y_test,
    feature_mapping=data["columns_scheme"]["x"],
    top_features_threshold=0.01,
    n_repeats=5
)

Calculating feature importance: 100%|██████████| 1/1 [14:20<00:00, 860.65s/it]
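
The `n_repeats` argument suggests a permutation-style importance, though the implementation of `calculate_feature_importance` lives in `utils.py` and is not shown here. Under that assumption, the idea can be sketched as: shuffle one feature column at a time, re-score the model, and record how much the error grows. Everything below (the function name, the stand-in `predict` function, the synthetic data) is hypothetical.

```python
import numpy as np


def permutation_importance_mae(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MAE when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(np.abs(predict(X) - y))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Break the link between feature j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean(np.abs(predict(X_perm) - y)) - baseline)
        importances[j] = np.mean(increases)
    return importances


rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1]  # feature 2 is irrelevant by construction


def predict(X):
    # Stand-in for a fitted model that has learned the true relationship.
    return 3.0 * X[:, 0] + 0.5 * X[:, 1]


importances = permutation_importance_mae(predict, X, y, n_repeats=5)
# Feature indices ordered from most to least important.
ranking = np.argsort(importances)[::-1]
```

Each feature requires `n_repeats` extra prediction passes over the test set, which is one plausible reason the cell above takes several minutes on the full data.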


## Run With New Target

In [7]:
# data_newtarget = load_data_newtarget(percentage_of_data_usage=0.01)

# X_train_newtarget = data_newtarget["X_train"]
# y_train_newtarget = data_newtarget["y_train"]
# X_test_newtarget = data_newtarget["X_test"]
# y_test_newtarget = data_newtarget["y_test"]

In [8]:
# trained_models_newtarget = {}

# xgb_regressor_newtarget = MultiOutputRegressor(
#     XGBRegressor(
#         n_estimators=n_estimators,
#         max_depth=max_depth,
#         random_state=42,
#         n_jobs=-1
#     )
# )

# model_name = "XGBoost"
# trained_model_data_newtarget = train(xgb_regressor_newtarget, X_train_newtarget, y_train_newtarget, model_name, savemodel=False)
# trained_models_newtarget[model_name] = trained_model_data_newtarget

In [9]:
# metrics_2_newtarget = evaluate_2_newtarget(
#     trained_model=trained_model_data_newtarget,
#     X_test=X_test_newtarget,
#     y_test=y_test_newtarget,
#     model_name=model_name,
# )

In [10]:
logging.info("XGB evaluation completed.")

### We are done with the evaluation and train delay prediction for this model and will move on to the next one. Feel free to load the results wherever they are needed, check out the other models, or see the comparison of all models in the report.