## Goals: Training the *Final* Models

This notebook trains the final models on the full *baseline_dataset*; these models produce the final predictions on the evaluation data.

Here, we train a model designed to generalize across water stations in Brazil and France. However, you are not required to follow this approach and may opt to train separate models for different geographic *regions*.

This baseline model training example utilizes all available features, with hyperparameters chosen for quick execution rather than optimization. For hyperparameter tuning and feature selection explorations, refer to the `02_exploration` folder.

> **Note:** This notebook requires outputs from the `00 Preprocessing` notebooks.

<img src="../images/notebook-3.png" alt="Experiment Diagram" style="width:75%; text-align:center;" />

### 1. Data Import and Setup

This section imports the necessary libraries, sets up environment paths, and includes custom utility functions.

In [None]:
import os
import sys

import joblib
import numpy as np
import pandas as pd
import lightgbm as lgb

from quantile_forest import RandomForestQuantileRegressor
from mapie.regression import MapieQuantileRegressor
from interpret.glassbox import ExplainableBoostingRegressor

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..', '..', '..')))

from src.utils.model import split_dataset, compare_models_per_station

##### Constants
- **INPUT_DIR**: Directory for input data (same as in "02 - Feature Engineering").
- **MODEL_DIR**: Directory where trained models are saved.
- **DATASET_DIR**: Directory where the Zenodo dataset is unzipped.

##### Model Parameters

- **SEED**: 42 (for reproducibility)
- **NUMBER_OF_WEEK**: 4 (one model is trained per week)

##### FINAL_MODELS

- **mapie**: Combines LightGBM with MAPIE. **MAPIE** (Model Agnostic Prediction Interval Estimator) computes prediction intervals for any regression model using conformal methods.
- **qrf**: Quantile Random Forest (natively produces prediction intervals)
- **ebm**: Explainable Boosting Machine, used as an example of a model that does not natively produce prediction intervals but can be customised to do so.

In [None]:
INPUT_DIR = "../../../data/input/"
MODEL_DIR = "../../../models/"
DATASET_DIR = "../../../dataset/"

SEED = 42
NUMBER_OF_WEEK = 4  # Number of weekly horizons to predict; one model is trained per week

FINAL_MODELS = ["mapie",
                "qrf",
                # "ebm",  # lowercase, to match the `"ebm" in FINAL_MODELS` checks below
                ]
mapie_enbpi = {}
mapie = {}
qrf = {}
mapie_aci = {}

COLUMNS_TO_DROP = ["water_flow_week1", "station_code", "water_flow_week2", "water_flow_week3", "water_flow_week4"]


### 2. Data Loading
Load the baseline dataset and create the directory where the trained models are saved.

In [None]:
dataset_train = pd.read_csv(f"{INPUT_DIR}dataset_baseline.csv")

dataset_train = dataset_train.set_index("ObsDate")

# Create the output directory if it does not exist yet
os.makedirs(f"{MODEL_DIR}final/", exist_ok=True)

Pre-process the data: drop the unnecessary columns and set up one target per forecast week.

In [None]:
X_train = dataset_train.drop(columns=COLUMNS_TO_DROP)
y_train = {}
for i in range(0, NUMBER_OF_WEEK):
    y_train[i] = dataset_train[f"water_flow_week{i+1}"]


### 3. Model Training
#### a. LGBM + MAPIE

##### MAPIE Model Training Overview

- **Configuration:**  
  - Sets `ALPHA` (0.1) as the target miscoverage rate, so the prediction intervals aim for 90% coverage.
  - Defines `TIME_VALIDATION` as a split point for creating a validation set.
  - Configures LightGBM parameters (`LGBM_PARAMS`) for quantile regression.
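
To make the `"objective": "quantile"` setting more concrete, here is a minimal sketch of the pinball loss that quantile regression minimizes. It is the textbook definition written in NumPy on synthetic data, not LightGBM's internal implementation, and `pinball_loss` is an illustrative helper name that does not exist in this project.

```python
import numpy as np


def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball (quantile) loss for quantile level `alpha`."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-predictions are weighted by alpha, over-predictions by (1 - alpha).
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1.0) * diff)))


rng = np.random.default_rng(42)
y = rng.exponential(scale=10.0, size=5000)  # synthetic, skewed values (assumption)

# Among constant predictions, the loss is smallest near the empirical 10% quantile.
candidates = np.linspace(0.0, 30.0, 301)
losses = [pinball_loss(y, c, alpha=0.1) for c in candidates]
best = candidates[int(np.argmin(losses))]
print(best, np.quantile(y, 0.1))
```

The asymmetric weights are what push a model trained with this loss toward a chosen quantile instead of the mean.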




In [None]:
ALPHA = 0.1
TIME_VALIDATION = "2000-01-01 00:00:00"

LGBM_PARAMS = {
    "max_depth": 15,
    "learning_rate": 0.01,
    "n_estimators": 500,
    "colsample_bytree": 0.7,
    "objective": "quantile",
    "alpha": ALPHA
}

- **Data Preparation:**  
  - Splits `dataset_train` into training and validation subsets using `split_dataset`.
  - Removes unnecessary columns from both the training and validation datasets.
  - Extracts target variables for each week (from `water_flow_week1` to `water_flow_week4`).

- **Model Training:**  
  For each week:
  - Initializes a LightGBM regressor with the specified parameters.
  - Wraps it in a `MapieQuantileRegressor` to estimate prediction intervals.
  - Trains the model on the training data and calibrates it using the validation data.
  - Saves the trained model to `{MODEL_DIR}final/` with a timestamped filename.
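
The calibration step can be illustrated with a small NumPy sketch of the idea behind conformalized quantile regression, the strategy that `MapieQuantileRegressor` is based on. This is not MAPIE's code: `conformalize_interval`, the synthetic data, and the deliberately too-narrow "quantile model" are all assumptions made for illustration.

```python
import numpy as np


def conformalize_interval(lower_cal, upper_cal, y_cal, lower_new, upper_new, alpha):
    """Adjust quantile-based intervals using scores from a calibration set."""
    # Conformity score: distance by which each calibration target falls outside
    # its interval (negative when the target lies strictly inside).
    scores = np.maximum(lower_cal - y_cal, y_cal - upper_cal)
    n = len(scores)
    # Rank of the finite-sample corrected (1 - alpha) quantile, capped at n.
    k = min(n, int(np.ceil((n + 1) * (1 - alpha))))
    q_hat = np.sort(scores)[k - 1]
    return lower_new - q_hat, upper_new + q_hat


rng = np.random.default_rng(0)
ALPHA = 0.1


def simulate(n):
    x = rng.uniform(0, 10, size=n)
    y = x + rng.normal(scale=1.0, size=n)
    # Deliberately too-narrow interval predictions: x +/- 0.5
    return y, x - 0.5, x + 0.5


y_cal, lo_cal, hi_cal = simulate(2000)
y_new, lo_new, hi_new = simulate(2000)

lo_adj, hi_adj = conformalize_interval(lo_cal, hi_cal, y_cal, lo_new, hi_new, ALPHA)
raw_cov = float(np.mean((y_new >= lo_new) & (y_new <= hi_new)))
adj_cov = float(np.mean((y_new >= lo_adj) & (y_new <= hi_adj)))
print(f"raw coverage: {raw_cov:.3f}, conformalized coverage: {adj_cov:.3f}")
```

In this toy setup the raw intervals cover far fewer than 90% of the new targets, and the calibrated ones should land close to the `1 - ALPHA` target. This is why the notebook holds out a validation set for calibration instead of fitting on all the data.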

In [None]:
if "mapie" in FINAL_MODELS: 
    print("Training Mapie")


    train_mapie, val_mapie, val_temporal  = split_dataset(dataset_train, 0.75, TIME_VALIDATION)

    X_train_mapie = train_mapie.drop(columns=COLUMNS_TO_DROP)
    print(len(X_train_mapie.columns))
    y_train_mapie = {}
    for i in range(0, NUMBER_OF_WEEK):
        y_train_mapie[i] = train_mapie[f"water_flow_week{i+1}"]

    X_val = val_mapie.drop(columns=COLUMNS_TO_DROP)
    y_val = {}
    for i in range(NUMBER_OF_WEEK):
        y_val[i] = val_mapie[f"water_flow_week{i+1}"]

    for i in range(NUMBER_OF_WEEK):
        print(f"Training week {i}")
        # Initialize and train MapieQuantileRegressor
        regressor = lgb.LGBMRegressor(**LGBM_PARAMS)
        mapie[i] = MapieQuantileRegressor(estimator=regressor, method="quantile", cv="split", alpha=ALPHA)
        mapie[i].fit(X_train_mapie, y_train_mapie[i], X_calib=X_val, y_calib=y_val[i])
        
        # save model with date
        time = pd.Timestamp.now().strftime("%Y-%m-%d_%H-%M-%S")

        model_path = f"{MODEL_DIR}final/mapie_quantile_{time}_week_{i}.pkl"
        joblib.dump(mapie[i], model_path)


#### b. QRF

- **Training:**  
  Initializes a `RandomForestQuantileRegressor` with the following parameters:
  - 100 estimators
  - Maximum depth of 5
  - Minimum of 5 samples per leaf

  These parameters allow for relatively fast training, though they are not optimized for peak performance. 
  
  The model is then fitted using `X_train` and the corresponding weekly target `y_train[i]`.

In [None]:
if "qrf" in FINAL_MODELS:
    for i in range(NUMBER_OF_WEEK):
        print(f"Training week {i}")
        # Train RandomForestQuantileRegressor
        qrf[i] = RandomForestQuantileRegressor(n_estimators=100, max_depth=5, min_samples_leaf=5)
        qrf[i].fit(X_train, y_train[i])

        time = pd.Timestamp.now().strftime("%Y-%m-%d_%H-%M-%S")
        model_path = f"{MODEL_DIR}final/qrf_quantile_{time}_week_{i}.pkl"
        joblib.dump(qrf[i], model_path)
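
Because every saved file carries a timestamp, a later notebook has to pick the right file for each week. The sketch below shows one possible way to do that with the standard library only. `find_latest_model` is a hypothetical helper, not part of `src.utils`, and the demo uses empty placeholder files instead of real models.

```python
import glob
import os
import tempfile


def find_latest_model(model_dir, prefix, week):
    """Return the newest `{prefix}_<timestamp>_week_{week}.pkl` path, or None."""
    pattern = os.path.join(model_dir, f"{prefix}_*_week_{week}.pkl")
    # Timestamps formatted as %Y-%m-%d_%H-%M-%S sort chronologically as strings.
    matches = sorted(glob.glob(pattern))
    return matches[-1] if matches else None


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in [
        "qrf_quantile_2025-01-15_10-00-00_week_0.pkl",
        "qrf_quantile_2025-02-01_09-30-00_week_0.pkl",
        "qrf_quantile_2025-02-01_09-30-05_week_1.pkl",
    ]:
        open(os.path.join(tmp_dir, name), "w").close()

    latest = os.path.basename(find_latest_model(tmp_dir, "qrf_quantile", 0))
    missing = find_latest_model(tmp_dir, "qrf_quantile", 3)

print(latest)
```

The returned path can then be passed to `joblib.load`. Note that the week index in the filenames is zero-based, as in the training loops.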

#### c. Explainable Boosting Machine

EBM is an ensemble method that does not natively provide access to its individual members for performing quantile predictions or generating prediction intervals. To overcome this limitation, we manually construct an ensemble.


- **Ensemble Training:**  
  For each ensemble member (seed from 0 to 4):
  - A bootstrap sample is created from `X_train` and `y_train[i]` using sampling with replacement.
  - An `ExplainableBoostingRegressor` is instantiated with fixed parameters (e.g., `max_bins=128`, `learning_rate=0.05`, `interactions=3`, and `random_state=42` to ensure consistent binning) and then trained on the sampled data.
  - The trained model is appended to the list for the current week.
- **Saving the Ensemble:**  
  The ensemble (i.e., the list of EBM models for the week) is saved.
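
The bootstrap-ensemble idea can be sketched without EBM at all. In the example below a cubic polynomial fit stands in for the `ExplainableBoostingRegressor` and the data is synthetic; both are assumptions made to keep the sketch small and fast. It shows how member predictions can be turned into a mean prediction and an empirical interval.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data; the polynomial fit below is only a stand-in for an EBM.
x = rng.uniform(-3, 3, size=400)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

NUM_MEMBERS = 5
ALPHA = 0.1
x_new = np.linspace(-3, 3, 50)

member_preds = []
for _ in range(NUM_MEMBERS):
    # Bootstrap: draw row positions with replacement, same size as the data.
    idx = rng.integers(0, x.size, size=x.size)
    coeffs = np.polyfit(x[idx], y[idx], deg=3)
    member_preds.append(np.polyval(coeffs, x_new))
member_preds = np.array(member_preds)  # shape (NUM_MEMBERS, len(x_new))

# Point prediction: mean over members. Interval: empirical quantiles over members.
y_pred = member_preds.mean(axis=0)
lower, upper = np.quantile(member_preds, [ALPHA / 2, 1 - ALPHA / 2], axis=0)
print(y_pred[:3], lower[:3], upper[:3])
```

The spread between members reflects how much the fitted model changes across resamples, not the noise in individual observations, so intervals built this way tend to be too narrow for the observations themselves. With only five members the quantiles are also very coarse.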

In [None]:
if "ebm" in FINAL_MODELS:
    NUM_ENSEMBLES = 5
    ebm_ensembles = {}
    for i in range(NUMBER_OF_WEEK):
        print(f"Training EBM ensemble for week {i}")

        models_i = []
        for seed in range(NUM_ENSEMBLES):
            print(f"Training EBM ensemble {seed} for week {i}")
            # 1. Create your bootstrap sample or subset (if you want bagging)
            sample_indices = np.random.choice(len(X_train), size=len(X_train), replace=True)
            X_sample = X_train.iloc[sample_indices]
            # Use positional indexing, consistent with X_sample (the index holds dates, not positions)
            y_sample = y_train[i].iloc[sample_indices]
            
            # 2. Train an EBM with consistent binning parameters
            ebm_model = ExplainableBoostingRegressor(
                outer_bags=1,
                inner_bags=1,
                max_bins=128,
                learning_rate=0.05,
                interactions=3,
                early_stopping_rounds=100,
                random_state=SEED
            )
            ebm_model.fit(X_sample, y_sample)
            
            models_i.append(ebm_model)

        # Store the list of models for week i before saving
        ebm_ensembles[i] = models_i

        time = pd.Timestamp.now().strftime("%Y-%m-%d_%H-%M-%S")
        file_path = f"{MODEL_DIR}final/ebm_ensemble_{time}_week_{i}.pkl"

        # Save this week's ensemble (the list of EBM models for week i)
        joblib.dump(models_i, file_path)
        print(f"Saved EBM ensemble for week {i} to {file_path}")

### 4. Performance Evaluation on the Full Training Set

> **Note:**  
> The performance displayed here is calculated on the training set. This does not necessarily reflect the models' performance on unseen data.


In [None]:
y_train_stations = dataset_train["station_code"].values

for i in range(NUMBER_OF_WEEK):
    predictions = []
    baseline_day_before = dataset_train["water_flow_lag_1w"]
    predictions.append({"model": "Week before", "prediction": baseline_day_before, "dataset":"train", "stations": y_train_stations, "prediction_interval": None})
    if "mapie" in FINAL_MODELS:
        y_pred_mapie, y_pis_mapie = mapie[i].predict(X_train)
        predictions.append({"model": "LGBM+MAPIE", "prediction": y_pred_mapie, "dataset":"train", "stations": y_train_stations, "prediction_interval": y_pis_mapie})
    if "qrf" in FINAL_MODELS:
        y_pred_qrf = qrf[i].predict(X_train, quantiles="mean", aggregate_leaves_first=False)
        y_pis_qrf = qrf[i].predict(X_train, quantiles=[ALPHA/2, 1-ALPHA/2])
        predictions.append({"model": "QRF", "prediction": y_pred_qrf, "dataset":"train", "stations": y_train_stations, "prediction_interval": y_pis_qrf})
    if "ebm" in FINAL_MODELS:
        y_pred_ebm = []
        for model in ebm_ensembles[i]:
            y_pred_ebm.append(model.predict(X_train))
        y_pred_ebm = np.mean(y_pred_ebm, axis=0)
        predictions.append({"model": "EBM", "prediction": y_pred_ebm, "dataset":"train", "stations": y_train_stations, "prediction_interval": None})

    compare_models_per_station(
        y_train[i].values,
        predictions,
        y_train_stations,
        column_to_display="log_likelihood",
        title=f"WEEK {i}")

### 5. Coverage on the Full Training Set

> **Note:**  
> The coverage displayed here is calculated on the training set, so it does not necessarily reflect the interval coverage on unseen data. With `ALPHA = 0.1`, the target coverage is 90%.
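
As a reference for how the numbers printed below are obtained, here is a minimal sketch of empirical coverage and mean interval width on hand-made values. `interval_coverage` and `mean_interval_width` are illustrative helper names, not functions from `src.utils`.

```python
import numpy as np


def interval_coverage(y_true, lower, upper):
    """Fraction of targets that fall inside [lower, upper], bounds included."""
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    return float(np.mean((y_true >= lower) & (y_true <= upper)))


def mean_interval_width(lower, upper):
    """Average width of the intervals; narrower is better at equal coverage."""
    return float(np.mean(np.asarray(upper) - np.asarray(lower)))


y = np.array([1.0, 2.0, 3.0, 4.0])
lo = np.array([0.5, 2.5, 2.0, 3.0])
hi = np.array([1.5, 3.5, 3.0, 3.5])

print(interval_coverage(y, lo, hi))  # 2 of the 4 targets are inside their interval
print(mean_interval_width(lo, hi))
```

Coverage alone can be gamed by very wide intervals, so it is usually worth reading it together with the interval width.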


In [None]:
for i in range(NUMBER_OF_WEEK):

    predictions = []
    baseline_day_before = dataset_train["water_flow_lag_1w"]
    predictions.append({"model": "Week before", "prediction": baseline_day_before, "dataset":"train", "stations": y_train_stations, "prediction_interval": None})
    if "mapie" in FINAL_MODELS:
        y_pred_mapie, y_pis_mapie = mapie[i].predict(X_train)
        predictions.append({"model": "LGBM+MAPIE", "prediction": y_pred_mapie, "dataset":"train", "stations": y_train_stations, "prediction_interval": y_pis_mapie})
        coverage = (y_train[i].values >= y_pis_mapie[:,0,0]) & (y_train[i].values <= y_pis_mapie[:,1,0])
        print(f"MAPIE coverage of the prediction interval for week {i}: {coverage.mean()}")
    if "qrf" in FINAL_MODELS:
        y_pred_qrf = qrf[i].predict(X_train, quantiles="mean", aggregate_leaves_first=False)
        y_pis_qrf = qrf[i].predict(X_train, quantiles=[ALPHA/2, 1-ALPHA/2])
        predictions.append({"model": "QRF", "prediction": y_pred_qrf, "dataset":"train", "stations": y_train_stations, "prediction_interval": y_pis_qrf})
        coverage = (y_train[i].values >= y_pis_qrf[:,0]) & (y_train[i].values <= y_pis_qrf[:,1])
        print(f"QRF coverage of the prediction interval for week {i}: {coverage.mean()}")
    if "ebm" in FINAL_MODELS:
        # Stack per-member predictions: shape (n_members, n_samples)
        member_preds = np.array([model.predict(X_train) for model in ebm_ensembles[i]])
        y_pred_ebm = member_preds.mean(axis=0)
        # Empirical interval from the spread of the ensemble members (coarse with only a few members)
        y_pis_ebm = np.quantile(member_preds, [ALPHA/2, 1-ALPHA/2], axis=0).T
        predictions.append({"model": "EBM", "prediction": y_pred_ebm, "dataset":"train", "stations": y_train_stations, "prediction_interval": None})
        coverage = (y_train[i].values >= y_pis_ebm[:,0]) & (y_train[i].values <= y_pis_ebm[:,1])
        print(f"EBM coverage of the prediction interval for week {i}: {coverage.mean()}")