# Advancing Ensemble Learning Models for<br>Residential Building Electricity Consumption Forecasting

This notebook implements five ensemble learning models and tunes each one with a grid search over the hyperparameter values listed below.<br>The optimal hyperparameters for each model are saved to `optimal_hyperparameters.txt`.<br>Evaluation metrics (MAPE, CVRMSE, NMAE, and their harmonic mean) are then calculated to assess model performance.

In [None]:
# Install necessary libraries
!pip install pandas scikit-learn xgboost lightgbm catboost numpy

## Hyperparameters for Ensemble Learning Methods

Below are the hyperparameters used for each decision tree-based ensemble learning method.

| **Methodology**               | **Reference** | **Hyperparameters**                                                                                                                                                                                                                         |
|:------------------------------|:--------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Random Forest (RF)**             | [52]          | - Number of Trees: 128<br>- Features per Split: auto, sqrt, log2                                                                                                                                                                            |
| **Gradient Boosting Machine (GBM)** | [41]          | - Number of Iterations: 100, 250, 500<br>- Learning Rate: 0.01, 0.05, 0.1<br>- Depth: 5, 10<br>- Loss Type: quantile, Huber                                                                                                                |
| **Extreme Gradient Boosting (XGBoost)** | [41]          | - Number of Iterations: 250, 500, 1000<br>- Learning Rate: 0.01, 0.05, 0.1<br>- Depth: 6, 8, 10<br>- Subsampling Rate: 0.5, 0.75, 1.0<br>- Feature Sample by Tree/Level/Node: 0.5, 0.75, 1.0<br>- Booster Type: gbtree, dart                | 
| **Light Gradient Boosting Machine (LightGBM)** | [41]     | - Number of Iterations: 1000, 1500<br>- Learning Rate: 0.01, 0.05, 0.1<br>- Number of Leaves: 64<br>- Subsample: 0.5<br>- Feature Sample by Tree: 1.0<br>- Booster Type: gbdt, dart                                                      |
| **Categorical Boosting (CatBoost)**      | [53]          | - Learning Rate: 0.03, 0.1<br>- Maximum Tree Depth: 4, 6, 10<br>- L2 Regularization Levels: 1, 3, 5, 7, 9                                                                                                                                  |

# Dataset Selection and Splitting

This notebook provides a flexible setup for selecting between two datasets: 
* Household (Appliances Energy Prediction);
* Dormitory (University Residential Complex).

Based on the selected dataset, we will load, preprocess, and split the data into predefined training and test sets as specified in the paper.<br>The data split follows specific row indices rather than random splitting, ensuring consistency with the methodology.<br>The model development will then proceed with this structured dataset preparation.

## Step 1: Import Libraries

The necessary libraries for data manipulation, model training, hyperparameter tuning, and evaluation metrics are imported.

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error, mean_absolute_error
import numpy as np

## Step 2: Load and Split Dataset Based on User Selection

In this section, we define a function `load_and_split_data` that allows us to select either the Household dataset or the Dormitory dataset.<br>Each dataset is split according to specific row ranges as specified in the research paper.

- **Household Dataset**:
  - Training set: First 2311 rows
  - Test set: Rows from 2311 onward

- **Dormitory Dataset**:
  - Training set: First 20472 rows
  - Test set: Rows from 20472 onward

The function returns the `X_train`, `X_test`, `y_train`, and `y_test` sets based on the selected dataset.

In [None]:
# Function to load and split dataset based on user's selection and predefined ranges
def load_and_split_data(dataset_choice):
    """
    Loads the selected dataset (household or dormitory), prepares the feature set and target variable,
    and splits the data into predefined training and test sets based on specific row indices.
    
    Parameters:
    - dataset_choice (str): 'household' or 'dormitory' to specify which dataset to load
    
    Returns:
    - X_train, X_test, y_train, y_test: Split feature sets and target variables for training and testing
    """
    if dataset_choice == 'household':
        # Load household dataset
        data = pd.read_csv('Appliances Energy Prediction.csv')
        
        # Define features and target variable based on specified row ranges
        X_train = data.iloc[:2311, 1:-1]  # Training feature set
        y_train = data.iloc[:2311, -1]    # Training target set
        X_test = data.iloc[2311:, 1:-1]   # Test feature set
        y_test = data.iloc[2311:, -1]     # Test target set
        
    elif dataset_choice == 'dormitory':
        # Load dormitory dataset
        data = pd.read_csv('University Residential Complex.csv')
        
        # Define features and target variable based on specified row ranges
        X_train = data.iloc[:20472, 5:-1]  # Training feature set
        y_train = data.iloc[:20472, -1]    # Training target set
        X_test = data.iloc[20472:, 5:-1]   # Test feature set
        y_test = data.iloc[20472:, -1]     # Test target set
        
    else:
        raise ValueError("Invalid dataset choice. Please select 'household' or 'dormitory'.")
    
    return X_train, X_test, y_train, y_test
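
Both branches of `load_and_split_data` apply the same positional pattern: the first `n_train` rows form the training set, the remaining rows form the test set, and the last column is the target. The sketch below reproduces that pattern on a toy frame. The helper name `split_at_row` and the toy column values are assumptions made for illustration; they are not part of the notebook or the datasets.

```python
import pandas as pd


def split_at_row(data, n_train, first_feature_col):
    """Split by row position; the last column is treated as the target.

    Hypothetical helper that mirrors the iloc slices in load_and_split_data.
    """
    X = data.iloc[:, first_feature_col:-1]
    y = data.iloc[:, -1]
    return X.iloc[:n_train], X.iloc[n_train:], y.iloc[:n_train], y.iloc[n_train:]


# Toy frame (made-up values): one leading column to skip, two features, one target
toy = pd.DataFrame({
    "Timestamp": range(10),
    "Temp": [float(i) for i in range(10)],
    "Holi": [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
    "Consumption": [10.0 + i for i in range(10)],
})

X_tr, X_te, y_tr, y_te = split_at_row(toy, n_train=7, first_feature_col=1)
print(X_tr.shape, X_te.shape)   # (7, 2) (3, 2)
print(list(X_tr.columns))       # ['Temp', 'Holi']
```

With such a helper, the household split would correspond to `split_at_row(data, 2311, 1)` and the dormitory split to `split_at_row(data, 20472, 5)`.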

## Step 3: Select Dataset and Split Data

Here, we specify the dataset choice by setting the `dataset_choice` variable to either `'household'` or `'dormitory'`.<br>The `load_and_split_data` function is called, and the data is split into the training and test sets.<br>The shapes of these sets are displayed for verification.

- **Example Usage**:
  - Set `dataset_choice = 'household'` to load the household dataset.
  - Set `dataset_choice = 'dormitory'` to load the dormitory dataset.

In [None]:
# Example usage
# To select the dataset, pass either 'household' or 'dormitory' to the function
dataset_choice = 'household'  # Change to 'dormitory' as needed
X_train, X_test, y_train, y_test = load_and_split_data(dataset_choice)

# Display the shapes of the train and test sets
print("Training set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)

## Step 4: Define Feature Groups for Model Development

After selecting and splitting the dataset, the next step is to organize the input variables into three distinct feature groups as specified in the paper:

- **External Features**: Contains environmental and temporal variables.
- **Internal Features**: Represents historical consumption-related data.
- **Total Features**: A combination of both external and internal features.

This setup allows us to test model performance with different types of input data, providing flexibility for feature engineering and model comparison.

In [None]:
# Define feature groups based on input variable types
external_features = ["Hour_x", "Hour_y", "DOTW_x", "DOTW_y", "Holi", "Temp", "Humi", "WS", "THI", "WCT"]
internal_features = ["Cons_1", "Holi_1", "Cons_7", "Holi_7", "Cons_avg"]
total_features = external_features + internal_features  # Combining both external and internal features
target_variable = "Consumption"  # Dependent variable

# Display the feature groups for verification
print("External Features:", external_features)
print("Internal Features:", internal_features)
print("Total Features:", total_features)
print("Target Variable:", target_variable)

## Summary of Feature Grouping

The feature groups have been defined as follows:
- **External Features**: Includes environmental factors and time-specific variables.
- **Internal Features**: Contains historical consumption patterns and indicators.
- **Total Features**: An aggregate of external and internal features for comprehensive analysis.

These feature groups are now set up, and we are ready to proceed with model training and evaluation.<br>This approach will allow us to explore the individual contributions of each feature group to prediction accuracy.
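
Because the split selects columns by position while the models select them by name, it can help to confirm that every listed feature is present before training. A small sketch of such a check follows; the helper name `missing_columns` and the example column list are assumptions for illustration only.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in their original order."""
    available = set(available)
    return [name for name in required if name not in available]


# Feature groups as defined in this notebook
external_features = ["Hour_x", "Hour_y", "DOTW_x", "DOTW_y", "Holi",
                     "Temp", "Humi", "WS", "THI", "WCT"]
internal_features = ["Cons_1", "Holi_1", "Cons_7", "Holi_7", "Cons_avg"]
total_features = external_features + internal_features

# Made-up column list standing in for X_train.columns;
# "Cons_avg" is left out on purpose to show a failing check
example_columns = external_features + ["Cons_1", "Holi_1", "Cons_7", "Holi_7"]

print(missing_columns(example_columns, external_features))  # []
print(missing_columns(example_columns, total_features))     # ['Cons_avg']
```

In the notebook, `missing_columns(X_train.columns, total_features)` would be expected to return an empty list.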

# Model Setup and Evaluation Functions

We initialize five ensemble learning models (i.e., RF, GBM, XGBoost, LightGBM, and CatBoost) and perform hyperparameter tuning using GridSearchCV.<br>For each feature group (i.e., External, Internal, Total), we find the optimal hyperparameters for each model.<br>This step ensures that each model is optimized for the specific feature set.

The models will be trained on each feature group, and the optimal hyperparameters will be saved for future reference.

In [None]:
# Define hyperparameter grids for each model
rf_param_grid = {
    'n_estimators': [128],
    # None (all features) stands in for 'auto', which scikit-learn 1.3 removed
    'max_features': [None, 'sqrt', 'log2']
}
gbm_param_grid = {
    'n_estimators': [100, 250, 500],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [5, 10],
    'loss': ['quantile', 'huber']
}
xgb_param_grid = {
    'n_estimators': [250, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [6, 8, 10],
    'subsample': [0.5, 0.75, 1.0],
    # Only per-tree column sampling is searched; per-level and per-node sampling keep their defaults
    'colsample_bytree': [0.5, 0.75, 1.0],
    'booster': ['gbtree', 'dart']
}
lgbm_param_grid = {
    'n_estimators': [1000, 1500],
    'learning_rate': [0.01, 0.05, 0.1],
    'num_leaves': [64],
    'subsample': [0.5],
    'colsample_bytree': [1.0],
    'boosting_type': ['gbdt', 'dart']
}
catb_param_grid = {
    'learning_rate': [0.03, 0.1],
    'depth': [4, 6, 10],
    'l2_leaf_reg': [1, 3, 5, 7, 9]
}
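
An exhaustive grid search fits one model per combination of values and per cross-validation fold, so the grid sizes above translate directly into run time. The sketch below counts the combinations for each grid. The fold count of 5 matches the `cv=5` setting used in the tuning cells; the helper name `grid_size` is only illustrative.

```python
from math import prod


def grid_size(param_grid):
    """Number of hyperparameter combinations in an exhaustive grid."""
    return prod(len(values) for values in param_grid.values())


# Same grids as defined above, repeated so the example runs on its own
grids = {
    "Random Forest": {"n_estimators": [128], "max_features": [None, "sqrt", "log2"]},
    "Gradient Boosting": {"n_estimators": [100, 250, 500],
                          "learning_rate": [0.01, 0.05, 0.1],
                          "max_depth": [5, 10],
                          "loss": ["quantile", "huber"]},
    "XGBoost": {"n_estimators": [250, 500, 1000],
                "learning_rate": [0.01, 0.05, 0.1],
                "max_depth": [6, 8, 10],
                "subsample": [0.5, 0.75, 1.0],
                "colsample_bytree": [0.5, 0.75, 1.0],
                "booster": ["gbtree", "dart"]},
    "LightGBM": {"n_estimators": [1000, 1500],
                 "learning_rate": [0.01, 0.05, 0.1],
                 "num_leaves": [64],
                 "subsample": [0.5],
                 "colsample_bytree": [1.0],
                 "boosting_type": ["gbdt", "dart"]},
    "CatBoost": {"learning_rate": [0.03, 0.1],
                 "depth": [4, 6, 10],
                 "l2_leaf_reg": [1, 3, 5, 7, 9]},
}

cv_folds = 5
for name, grid in grids.items():
    n = grid_size(grid)
    print(f"{name}: {n} combinations, {n * cv_folds} cross-validation fits")
```

Each count applies to one feature group, and the notebook repeats the search for three groups. XGBoost is by far the largest grid, with 486 combinations per group.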

In [None]:
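# Define a function to calculate evaluation metrics
def calculate_metrics(y_true, y_pred):
    """
    Computes MAPE, CVRMSE, NMAE, and their harmonic mean (HM), all in percent.
    CVRMSE and NMAE normalize RMSE and MAE by the mean of the actual values.
    """
    mape = mean_absolute_percentage_error(y_true, y_pred) * 100
    cvrmse = (np.sqrt(mean_squared_error(y_true, y_pred)) / np.mean(y_true)) * 100
    nmae = (mean_absolute_error(y_true, y_pred) / np.mean(y_true)) * 100
    
    # Harmonic mean of the three error metrics (single summary score; lower is better)
    hm = 3 / ((1 / mape) + (1 / cvrmse) + (1 / nmae))
    
    return mape, cvrmse, nmae, hm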

# Define function for evaluation by Holi (Weekday, Holiday, All Days)
def evaluate_model(model, X_test, y_test, feature_group):
    """
    Evaluates a fitted model on the test set, split by the "Holi" flag
    (0 = Weekday, 1 = Holiday) and on all days combined.
    X_test must contain the "Holi" column even if it is not in feature_group.
    """
    results = {}
    for holi_value, label in [(0, "Weekday"), (1, "Holiday"), (None, "All Days")]:
        if holi_value is not None:
            # Build the row mask once and apply it to features and target alike
            mask = X_test["Holi"] == holi_value
            X_subset = X_test[mask]
            y_subset = y_test[mask]
        else:
            X_subset = X_test
            y_subset = y_test

        predictions = model.predict(X_subset[feature_group])
        mape, cvrmse, nmae, hm = calculate_metrics(y_subset, predictions)
        
        results[label] = {
            "MAPE": mape,
            "CVRMSE": cvrmse,
            "NMAE": nmae,
            "HM": hm
        }
    return results
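
To make the metric definitions concrete, the sketch below recomputes them in plain Python on three made-up values (illustrative numbers, not results from either dataset). It follows the same formulas as `calculate_metrics`: each error is expressed in percent, and HM is the harmonic mean of the three.

```python
from math import sqrt

y_true = [100.0, 200.0, 300.0]   # made-up actual values
y_pred = [110.0, 190.0, 330.0]   # made-up predictions

n = len(y_true)
mean_true = sum(y_true) / n
errors = [p - t for t, p in zip(y_true, y_pred)]

# MAPE: mean of |error| / |actual|, in percent
mape = 100 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
# CVRMSE: RMSE divided by the mean actual value, in percent
cvrmse = 100 * sqrt(sum(e * e for e in errors) / n) / mean_true
# NMAE: MAE divided by the mean actual value, in percent
nmae = 100 * (sum(abs(e) for e in errors) / n) / mean_true
# HM: harmonic mean of the three metrics
hm = 3 / (1 / mape + 1 / cvrmse + 1 / nmae)

print(f"MAPE={mape:.2f}%  CVRMSE={cvrmse:.2f}%  NMAE={nmae:.2f}%  HM={hm:.2f}%")
# MAPE=8.33%  CVRMSE=9.57%  NMAE=8.33%  HM=8.71%
```

MAPE divides by each actual value and HM divides by each metric, so both are undefined when an actual value or a metric is exactly zero.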

## Random Forest Regressor - Hyperparameter Tuning and Evaluation

This cell initializes the Random Forest Regressor and performs hyperparameter tuning.<br>It then evaluates the model's performance on the **External**, **Internal**, and **Total** feature groups.<br>For each feature group, the model is evaluated based on the **Holi** variable (i.e., Weekday, Holiday, All Days).

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Function to train and evaluate Random Forest for each feature group
def train_and_evaluate_rf(X_train, y_train, X_test, y_test, feature_group):
    rf = RandomForestRegressor(random_state=42)
    rf_grid = GridSearchCV(estimator=rf, param_grid=rf_param_grid, cv=5, scoring='neg_mean_absolute_percentage_error', n_jobs=-1)
    rf_grid.fit(X_train[feature_group], y_train)
    
    # Save the best hyperparameters for the current feature group
    rf_best_params = rf_grid.best_params_
    print(f"Random Forest best params for {feature_group}: {rf_best_params}")
    
    # Evaluate model and return both metrics and best params
    metrics = evaluate_model(rf_grid.best_estimator_, X_test, y_test, feature_group)
    return metrics, rf_best_params

# Initialize dictionaries to store results
rf_results = {}
rf_best_params = {}

# Evaluate Random Forest on External, Internal, and Total feature groups
for feature_group_name, features in zip(["External", "Internal", "Total"], [external_features, internal_features, total_features]):
    metrics, best_params = train_and_evaluate_rf(X_train, y_train, X_test, y_test, features)
    rf_results[feature_group_name] = metrics
    rf_best_params[feature_group_name] = best_params

# Display results for Random Forest
for feature_group, metrics in rf_results.items():
    print(f"\nFeature Group: {feature_group} - Random Forest")
    for segment, metric_values in metrics.items():
        print(f"  {segment}:")
        for metric_name, value in metric_values.items():
            print(f"    {metric_name}: {value:.2f}%")
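
The nested display loop above is repeated unchanged for every model in this notebook. One way to keep the output consistent is to move it into a small helper, sketched here under the assumption of a hypothetical function name `format_results`. The numbers in `example_results` are placeholders for illustration, not results produced by the notebook.

```python
def format_results(model_name, results):
    """Render nested {feature group: {segment: {metric: value}}} results as text lines."""
    lines = []
    for feature_group, segments in results.items():
        lines.append(f"Feature Group: {feature_group} - {model_name}")
        for segment, metric_values in segments.items():
            lines.append(f"  {segment}:")
            for metric_name, value in metric_values.items():
                lines.append(f"    {metric_name}: {value:.2f}%")
    return lines


# Placeholder numbers for illustration only
example_results = {
    "External": {
        "All Days": {"MAPE": 12.5, "CVRMSE": 20.0, "NMAE": 15.25, "HM": 15.34},
    }
}

print("\n".join(format_results("Random Forest", example_results)))
```

In the notebook, a call such as `print("\n".join(format_results("Random Forest", rf_results)))` would produce the same layout as the loop above.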

## Gradient Boosting Machine - Hyperparameter Tuning and Evaluation

This cell initializes the Gradient Boosting Regressor and performs hyperparameter tuning.<br>It then evaluates the model's performance on the **External**, **Internal**, and **Total** feature groups.<br>For each feature group, the model is evaluated based on the **Holi** variable (i.e., Weekday, Holiday, All Days).

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

# Function to train and evaluate Gradient Boosting for each feature group
def train_and_evaluate_gbm(X_train, y_train, X_test, y_test, feature_group):
    gbm = GradientBoostingRegressor(random_state=42)
    gbm_grid = GridSearchCV(estimator=gbm, param_grid=gbm_param_grid, cv=5, scoring='neg_mean_absolute_percentage_error', n_jobs=-1)
    gbm_grid.fit(X_train[feature_group], y_train)
    
    # Save the best hyperparameters for the current feature group
    gbm_best_params = gbm_grid.best_params_
    print(f"Gradient Boosting best params for {feature_group}: {gbm_best_params}")
    
    # Evaluate model and return metrics
    metrics = evaluate_model(gbm_grid.best_estimator_, X_test, y_test, feature_group)
    return metrics, gbm_best_params

# Evaluate Gradient Boosting on External, Internal, and Total feature groups
gbm_results = {}
gbm_best_params = {}

for feature_group_name, features in zip(["External", "Internal", "Total"], [external_features, internal_features, total_features]):
    metrics, best_params = train_and_evaluate_gbm(X_train, y_train, X_test, y_test, features)
    gbm_results[feature_group_name] = metrics
    gbm_best_params[feature_group_name] = best_params

# Display results for Gradient Boosting
for feature_group, metrics in gbm_results.items():
    print(f"\nFeature Group: {feature_group} - Gradient Boosting")
    for segment, metric_values in metrics.items():
        print(f"  {segment}:")
        for metric_name, value in metric_values.items():
            print(f"    {metric_name}: {value:.2f}%")

## XGBoost Regressor - Hyperparameter Tuning and Evaluation

This cell initializes the XGBoost Regressor and performs hyperparameter tuning.<br>It then evaluates the model's performance on the **External**, **Internal**, and **Total** feature groups.<br>For each feature group, the model is evaluated based on the **Holi** variable (i.e., Weekday, Holiday, All Days).

In [None]:
from xgboost import XGBRegressor

# Function to train and evaluate XGBoost for each feature group
def train_and_evaluate_xgb(X_train, y_train, X_test, y_test, feature_group):
    xgb = XGBRegressor(random_state=42, objective='reg:squarederror')
    xgb_grid = GridSearchCV(estimator=xgb, param_grid=xgb_param_grid, cv=5, scoring='neg_mean_absolute_percentage_error', n_jobs=-1)
    xgb_grid.fit(X_train[feature_group], y_train)
    
    # Save the best hyperparameters for the current feature group
    xgb_best_params = xgb_grid.best_params_
    print(f"XGBoost best params for {feature_group}: {xgb_best_params}")
    
    # Evaluate model and return metrics
    metrics = evaluate_model(xgb_grid.best_estimator_, X_test, y_test, feature_group)
    return metrics, xgb_best_params

# Evaluate XGBoost on External, Internal, and Total feature groups
xgb_results = {}
xgb_best_params = {}

for feature_group_name, features in zip(["External", "Internal", "Total"], [external_features, internal_features, total_features]):
    metrics, best_params = train_and_evaluate_xgb(X_train, y_train, X_test, y_test, features)
    xgb_results[feature_group_name] = metrics
    xgb_best_params[feature_group_name] = best_params

# Display results for XGBoost
for feature_group, metrics in xgb_results.items():
    print(f"\nFeature Group: {feature_group} - XGBoost")
    for segment, metric_values in metrics.items():
        print(f"  {segment}:")
        for metric_name, value in metric_values.items():
            print(f"    {metric_name}: {value:.2f}%")

## LightGBM Regressor - Hyperparameter Tuning and Evaluation

This cell initializes the LightGBM Regressor and performs hyperparameter tuning.<br>It then evaluates the model's performance on the **External**, **Internal**, and **Total** feature groups.<br>For each feature group, the model is evaluated based on the **Holi** variable (i.e., Weekday, Holiday, All Days).

In [None]:
from lightgbm import LGBMRegressor

# Function to train and evaluate LightGBM for each feature group
def train_and_evaluate_lgbm(X_train, y_train, X_test, y_test, feature_group):
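    # LightGBM applies `subsample` only when `subsample_freq` > 0 (default 0),
    # so bagging is enabled explicitly for the 0.5 rate in the grid to take effect
    lgbm = LGBMRegressor(random_state=42, subsample_freq=1)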
    lgbm_grid = GridSearchCV(estimator=lgbm, param_grid=lgbm_param_grid, cv=5, scoring='neg_mean_absolute_percentage_error', n_jobs=-1)
    lgbm_grid.fit(X_train[feature_group], y_train)
    
    # Save the best hyperparameters for the current feature group
    lgbm_best_params = lgbm_grid.best_params_
    print(f"LightGBM best params for {feature_group}: {lgbm_best_params}")
    
    # Evaluate model and return metrics
    metrics = evaluate_model(lgbm_grid.best_estimator_, X_test, y_test, feature_group)
    return metrics, lgbm_best_params

# Evaluate LightGBM on External, Internal, and Total feature groups
lgbm_results = {}
lgbm_best_params = {}

for feature_group_name, features in zip(["External", "Internal", "Total"], [external_features, internal_features, total_features]):
    metrics, best_params = train_and_evaluate_lgbm(X_train, y_train, X_test, y_test, features)
    lgbm_results[feature_group_name] = metrics
    lgbm_best_params[feature_group_name] = best_params

# Display results for LightGBM
for feature_group, metrics in lgbm_results.items():
    print(f"\nFeature Group: {feature_group} - LightGBM")
    for segment, metric_values in metrics.items():
        print(f"  {segment}:")
        for metric_name, value in metric_values.items():
            print(f"    {metric_name}: {value:.2f}%")

## CatBoost Regressor - Hyperparameter Tuning and Evaluation

This cell initializes the CatBoost Regressor and performs hyperparameter tuning.<br>It then evaluates the model's performance on the **External**, **Internal**, and **Total** feature groups.<br>For each feature group, the model is evaluated based on the **Holi** variable (i.e., Weekday, Holiday, All Days).

In [None]:
from catboost import CatBoostRegressor

# Function to train and evaluate CatBoost for each feature group
def train_and_evaluate_catb(X_train, y_train, X_test, y_test, feature_group):
    catb = CatBoostRegressor(random_state=42, verbose=0)
    catb_grid = GridSearchCV(estimator=catb, param_grid=catb_param_grid, cv=5, scoring='neg_mean_absolute_percentage_error', n_jobs=-1)
    catb_grid.fit(X_train[feature_group], y_train)
    
    # Save the best hyperparameters for the current feature group
    catb_best_params = catb_grid.best_params_
    print(f"CatBoost best params for {feature_group}: {catb_best_params}")
    
    # Evaluate model and return metrics
    metrics = evaluate_model(catb_grid.best_estimator_, X_test, y_test, feature_group)
    return metrics, catb_best_params

# Evaluate CatBoost on External, Internal, and Total feature groups
catb_results = {}
catb_best_params = {}

for feature_group_name, features in zip(["External", "Internal", "Total"], [external_features, internal_features, total_features]):
    metrics, best_params = train_and_evaluate_catb(X_train, y_train, X_test, y_test, features)
    catb_results[feature_group_name] = metrics
    catb_best_params[feature_group_name] = best_params

# Display results for CatBoost
for feature_group, metrics in catb_results.items():
    print(f"\nFeature Group: {feature_group} - CatBoost")
    for segment, metric_values in metrics.items():
        print(f"  {segment}:")
        for metric_name, value in metric_values.items():
            print(f"    {metric_name}: {value:.2f}%")
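
The five `train_and_evaluate_*` functions above differ only in the estimator and the grid they pass to `GridSearchCV`. One possible consolidation is sketched below. `tune_on_features` is a hypothetical name, and the toy data and the small forest are stand-ins chosen to keep the example fast; they are not settings from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV


def tune_on_features(estimator, param_grid, X_train, y_train, features,
                     cv=5, n_jobs=None):
    """Grid-search `estimator` on the given feature columns.

    Returns the refitted best estimator and its hyperparameters.
    The notebook's tuning cells use n_jobs=-1; None keeps this sketch single-process.
    """
    search = GridSearchCV(
        estimator=estimator,
        param_grid=param_grid,
        cv=cv,
        scoring="neg_mean_absolute_percentage_error",
        n_jobs=n_jobs,
    )
    search.fit(X_train[features], y_train)
    return search.best_estimator_, search.best_params_


# Toy data (made up): the target depends on "Temp" only
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "Temp": rng.uniform(0, 30, 60),
    "Humi": rng.uniform(20, 90, 60),
})
y_toy = 50 + 2 * X_toy["Temp"]

best_model, best_params = tune_on_features(
    RandomForestRegressor(n_estimators=10, random_state=42),
    {"max_features": [None, "sqrt"]},
    X_toy, y_toy, features=["Temp", "Humi"], cv=3,
)
print(best_params)
```

Each model section could then reduce to one call, for example `tune_on_features(CatBoostRegressor(random_state=42, verbose=0), catb_param_grid, X_train, y_train, features)`, followed by `evaluate_model` on the returned estimator.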

## Save Results to .txt and .csv Files

In [None]:
import csv

# Combine best hyperparameters for each model
all_best_params = {
    "Random Forest": rf_best_params,
    "Gradient Boosting": gbm_best_params,
    "XGBoost": xgb_best_params,
    "LightGBM": lgbm_best_params,
    "CatBoost": catb_best_params
}

# Combine evaluation results for each model
all_results = {
    "Random Forest": rf_results,
    "Gradient Boosting": gbm_results,
    "XGBoost": xgb_results,
    "LightGBM": lgbm_results,
    "CatBoost": catb_results
}

# Save best hyperparameters to a .txt file (one line per feature group)
with open("optimal_hyperparameters.txt", "w") as f:
    for model_name, params_by_group in all_best_params.items():
        f.write(f"Model: {model_name}\n")
        for feature_group, params in params_by_group.items():
            f.write(f"  {feature_group}: {params}\n")
        f.write("\n")

# Save evaluation metrics to a .csv file
with open("evaluation_metrics.csv", "w", newline="") as csvfile:
    fieldnames = ["Model", "Feature Group", "Segment", "MAPE", "CVRMSE", "NMAE", "HM"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    
    for model_name, feature_results in all_results.items():
        for feature_group, results in feature_results.items():
            for segment, metrics in results.items():
                row = {"Model": model_name, "Feature Group": feature_group, "Segment": segment}
                row.update(metrics)
                writer.writerow(row)
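
The text file is convenient to read but awkward to parse back into Python. If the tuned settings need to be reloaded later, JSON is one alternative. The sketch below round-trips a nested dictionary shaped like `all_best_params`. The file name `optimal_hyperparameters.json` and the parameter values are placeholders, not outputs of this notebook.

```python
import json
import os
import tempfile

# Placeholder values with the same model -> feature group -> params nesting
example_best_params = {
    "Random Forest": {
        "External": {"max_features": "sqrt", "n_estimators": 128},
        "Total": {"max_features": None, "n_estimators": 128},
    },
    "CatBoost": {
        "External": {"depth": 6, "l2_leaf_reg": 3, "learning_rate": 0.1},
    },
}

# Written to a temporary directory so the example leaves no files behind
path = os.path.join(tempfile.mkdtemp(), "optimal_hyperparameters.json")

with open(path, "w") as f:
    json.dump(example_best_params, f, indent=2)   # None is written as null

with open(path) as f:
    reloaded = json.load(f)

print(reloaded == example_best_params)  # True
```

A reloaded entry can be passed straight back to an estimator, for example `RandomForestRegressor(**reloaded["Random Forest"]["Total"])`.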