# XGBoost Model Training & Hyperparameter Tuning

This notebook trains an XGBoost regression model and performs grid-based hyperparameter tuning to improve forecast accuracy.

In [None]:
import pandas as pd

df = pd.read_parquet(
    "s3://energy-consumption-forecasting-project/processed/pandas/pjme_energy_features.parquet"
)

In [None]:
split_date = "2017-01-01"

train_df = df[df["timestamp"] < split_date]
test_df = df[df["timestamp"] >= split_date]

x_train = train_df.drop(columns=["timestamp", "energy_mw"])
y_train = train_df["energy_mw"]

x_test = test_df.drop(columns=["timestamp", "energy_mw"])
y_test = test_df["energy_mw"]
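
The split above is chronological: rows before the cutoff train the model, and rows on or after it are held out. The sketch below shows the boundary behavior on a toy frame. The toy values are made up, and it assumes `timestamp` has a datetime dtype, which the notebook does not confirm.

```python
import pandas as pd

# Toy frame standing in for the real feature table (values are invented).
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-12-31 22:00", "2016-12-31 23:00",
        "2017-01-01 00:00", "2017-01-01 01:00",
    ]),
    "energy_mw": [100.0, 110.0, 120.0, 130.0],
})
split_date = "2017-01-01"

# Same comparison as the notebook: pandas parses the string as a timestamp.
toy_train = toy[toy["timestamp"] < split_date]
toy_test = toy[toy["timestamp"] >= split_date]

# Every row lands in exactly one partition, and training strictly precedes testing.
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train["timestamp"].max() < toy_test["timestamp"].min()
```

The row stamped exactly at midnight on the cutoff date goes to the test partition, because the test mask uses `>=`.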

In [None]:
from xgboost import XGBRegressor

baseline_model = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

baseline_model.fit(x_train, y_train)


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.03, 0.05, 0.1],
    "n_estimators": [200, 300],
    "subsample": [0.7, 0.8],
    "colsample_bytree": [0.7, 0.8]
}

# Create a template model; GridSearchCV overrides the tuned parameters on each clone
xgb = XGBRegressor(
    random_state=42,
    objective="reg:squarederror"
)

In [None]:
# Create the GridSearchCV controller
grid_search = GridSearchCV(
    estimator=xgb,  # the model I want to tune; GridSearchCV clones this estimator for every fit
    param_grid=param_grid,  # the knobs I want to turn; all 3 * 3 * 2 * 2 * 2 = 72 combinations are evaluated
    scoring="neg_root_mean_squared_error",  # RMSE should be minimized, but scikit-learn maximizes scores, so it uses negative RMSE
    cv=3,  # 3-fold cross-validation
    verbose=2,
    n_jobs=-1  # use all available CPU cores
)
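
The figure of 72 combinations is the product of the number of candidate values for each parameter. As a rough sketch of the search cost, the snippet below repeats the grid from this notebook and multiplies it out. The variable names `n_combinations`, `n_folds`, and `n_fits` are illustrative only.

```python
from math import prod

# Same grid as defined in the notebook.
param_grid = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.03, 0.05, 0.1],
    "n_estimators": [200, 300],
    "subsample": [0.7, 0.8],
    "colsample_bytree": [0.7, 0.8],
}

# One candidate per element of the Cartesian product of all value lists.
n_combinations = prod(len(values) for values in param_grid.values())

# Each candidate is fitted once per cross-validation fold (cv=3 in the notebook).
n_folds = 3
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)  # prints: 72 216
```

With the default `refit=True`, `GridSearchCV` performs one further fit of the best candidate on the full training set, so the total is 217 model fits.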

In [None]:
grid_search.fit(x_train, y_train)

best_model = grid_search.best_estimator_
grid_search.best_params_

Hyperparameter tuning reduced RMSE by ~8.8% compared to the baseline model. The tuned model was selected as the final forecasting model.
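
The notebook does not show how the RMSE comparison was computed. A minimal sketch of one way to obtain such a figure follows. The helpers `rmse` and `rmse_reduction_pct` are hypothetical names introduced here, and evaluating on the held-out test set is an assumption.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def rmse_reduction_pct(y_true, baseline_pred, tuned_pred):
    """Percentage by which the tuned model's RMSE is below the baseline's."""
    baseline_rmse = rmse(y_true, baseline_pred)
    tuned_rmse = rmse(y_true, tuned_pred)
    return 100.0 * (baseline_rmse - tuned_rmse) / baseline_rmse


# Toy check with made-up numbers: baseline errors of 2, tuned errors of 1.
toy_reduction = rmse_reduction_pct([0, 0, 0, 0], [2, 2, 2, 2], [1, 1, 1, 1])
print(toy_reduction)  # prints: 50.0
```

In this notebook the call would look like `rmse_reduction_pct(y_test, baseline_model.predict(x_test), best_model.predict(x_test))`, assuming both models are scored on the same test rows.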