# Random Forest Regression for Train Delay Prediction

### First, we load the data, then instantiate a Random Forest model and fit it to our data.

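### For the sake of simplicity and runtime, both the amount of data loaded and the number of trees can be reduced: `load_data` takes a `percentage_of_data_usage` argument, and the forest size is set by `n_estimators`. The cells below use `percentage_of_data_usage=1.0` and `n_estimators=50`.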

The first step is to set up the imports and the logging configuration.

In [1]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

import logging  # imported explicitly instead of relying on the star import from utils
import sys
import os

sys.path.append(os.path.dirname("/Users/mac/Desktop/train_delay_prediction/utils.py"))

from utils import *

logging.basicConfig(
    filename='random_forest_evaluation.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
)
logging.info("Starting Random Forest evaluation script.")

Next, we load the data and split it in a way that does not bias the evaluation. The train and test sets are separated by date of departure, so that no departure date appears in both. This avoids leakage between the two sets, which would otherwise hide overfitting behind overly optimistic test scores.

In [2]:
data = load_data(percentage_of_data_usage=1.0)

X_train = data["X_train"]
y_train = data["y_train"]
X_test = data["X_test"]
y_test = data["y_test"]
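The internals of `load_data` are not shown here, so the following is only a minimal sketch of what a split by departure date could look like. The function name `split_by_departure_date`, the toy dates, and the 20% test fraction are assumptions made for illustration, not the project's actual code.

```python
import numpy as np

def split_by_departure_date(departure_dates, test_fraction=0.2):
    """Return boolean (train_mask, test_mask); the latest days go to the test set."""
    days = np.asarray(departure_dates, dtype="datetime64[D]")
    unique_days = np.unique(days)  # np.unique returns the days sorted ascending
    n_test_days = max(1, int(round(len(unique_days) * test_fraction)))
    cutoff = unique_days[-n_test_days]
    test_mask = days >= cutoff
    return ~test_mask, test_mask

# Toy data (hypothetical): six trains spread over five departure days.
toy_dates = ["2024-01-01", "2024-01-01", "2024-01-02",
             "2024-01-03", "2024-01-04", "2024-01-05"]
toy_X = np.arange(12, dtype=float).reshape(6, 2)

train_mask, test_mask = split_by_departure_date(toy_dates, test_fraction=0.2)
toy_X_train, toy_X_test = toy_X[train_mask], toy_X[test_mask]
```

Because whole days are assigned to one side or the other, two trains that departed on the same day can never end up on opposite sides of the split.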

Now we are going to set up a multi-output regression, fit the model to our data, and store the trained model so that it can be used to predict delays.

In [3]:
trained_models = {}

rf_regressor = MultiOutputRegressor(
    RandomForestRegressor(
        n_estimators=50,
        max_depth=None,
        random_state=42,
        n_jobs=-1,
    )
)

model_name = "Random_Forest"
trained_model_data = train(rf_regressor, X_train, y_train, model_name, savemodel=False)
trained_models[model_name] = trained_model_data
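To make the role of the wrapper concrete, here is a small sketch on synthetic data. `MultiOutputRegressor` fits one independent copy of the base estimator per target column. The toy arrays and the two targets are assumptions for illustration only, and the project's `train` helper is not used here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 4))

# Two hypothetical targets, each driven by a different feature.
Y_toy = np.column_stack([
    3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200),
    -2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200),
])

toy_model = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=10, random_state=42)
)
toy_model.fit(X_toy, Y_toy)

# One fitted forest per target column is exposed in `estimators_`.
n_forests = len(toy_model.estimators_)
toy_pred = toy_model.predict(X_toy[:5])  # one row per sample, one column per target
```

`RandomForestRegressor` also accepts a two-dimensional `y` directly, in which case a single forest predicts all targets jointly. The wrapper trades that for a separate forest per target.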

We now compute several score metrics to measure prediction quality and, later, to compare this model with the others. All of these metrics are saved to both a .npy file and a .json file so that they can be stored and reloaded easily when needed.

In [4]:
metrics = evaluate_2(
    trained_model=trained_model_data,
    X_test=X_test,
    y_test=y_test,
    model_name=model_name,
)
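The body of `evaluate_2` is not shown, so the sketch below only illustrates the general idea: compute a few common regression metrics and write them to a .json and a .npy file. The choice of MAE, RMSE, and R², the helper name `compute_metrics`, and the file names are assumptions, not the project's actual implementation.

```python
import json
import os
import tempfile

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def compute_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2, each averaged uniformly over the output columns."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "r2": float(r2_score(y_true, y_pred)),
    }

# Toy values (hypothetical): two targets, three samples, one prediction off by 1.
y_true_toy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y_pred_toy = np.array([[1.0, 2.0], [3.0, 5.0], [5.0, 6.0]])
toy_metrics = compute_metrics(y_true_toy, y_pred_toy)

out_dir = tempfile.mkdtemp()
json_path = os.path.join(out_dir, "toy_metrics.json")
npy_path = os.path.join(out_dir, "toy_metrics.npy")

with open(json_path, "w") as f:
    json.dump(toy_metrics, f, indent=2)

# np.save stores a dict as a pickled 0-d object array,
# so loading it back requires allow_pickle=True and .item().
np.save(npy_path, toy_metrics)
loaded_from_npy = np.load(npy_path, allow_pickle=True).item()
```

The .json file is human-readable and portable, while the .npy file round-trips the Python dictionary directly at the cost of requiring pickle support on load.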

The next step is to plot some graphs to visualize the results. The last graph is especially important: it shows which features have the most influence on our predictions.

In [5]:
calculate_feature_importance(
    trained_models=trained_models,
    X_test=X_test,
    y_test=y_test,
    plots_folder="./plots",
    top_features_threshold=0.01,
    n_repeats=5
)
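The `n_repeats` argument and the use of the test set suggest a permutation-based importance, although the body of `calculate_feature_importance` is not shown. As a hedged illustration of that technique, the sketch below uses scikit-learn's `permutation_importance` on synthetic data and keeps the features whose mean importance exceeds a threshold. The data, the feature names, and the way the threshold is applied are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
# Only the first feature drives the target; the other two are pure noise.
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

demo_model = RandomForestRegressor(n_estimators=20, random_state=0)
demo_model.fit(X_demo[:200], y_demo[:200])

# Shuffle each feature n_repeats times on held-out data and record the score drop.
result = permutation_importance(
    demo_model, X_demo[200:], y_demo[200:], n_repeats=5, random_state=0
)

threshold = 0.01  # hypothetical cut-off on the mean drop in R^2
feature_names = ["feature_0", "feature_1", "feature_2"]  # hypothetical names
top_features = [
    name
    for name, importance in zip(feature_names, result.importances_mean)
    if importance > threshold
]
```

Measuring importance on held-out data reflects how much the model relies on each feature for generalization, rather than how often the feature was used for splits during training.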