### **CS6501 - MACHINE LEARNING AND APPLICATIONS**
#### **NOTEBOOK-6: Final Model Training & Evaluation**

**Description:**  
This notebook trains the final Voting Regressor model on the full BER dataset using the selected 24-feature subset.  
It covers hyperparameter tuning on a stratified 10k sample, an 80/20 train-test split, evaluation on the held-out test set, and saving the trained model for future predictions.

#### --- Imports ---

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import VotingRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV
import warnings
import numpy as np
import pickle

In [2]:
warnings.filterwarnings('ignore')

#### --- Load Entire Cleaned Dataset ---

In [3]:
# Load the cleaned BER dataset
df = pd.read_csv('../dataset/BERPublicSearch_Cleaned.csv')  
print(f"Dataset shape: {df.shape}")

Dataset shape: (80000, 101)


### --- Define Target and the 24 Selected Features (Subset 2 from the Previous Notebook) ---

In [4]:
target_column = 'BerRating'
selected_features = [
    'UValueWall', 'GroundFloorArea(sq m)', 'DeclaredLossFactor', 'UValueRoof', 'UValueFloor',
    'TempAdjustment', 'DeliveredEnergyPumpsFans', 'HSMainSystemEfficiency', 'DistributionLosses',
    'TempFactorMultiplier', 'UValueWindow', 'HeatSystemResponseCat', 'PrimaryEnergyMainSpace',
    'CO2MainSpace', 'FirstFloorArea', 'FirstEnerProdDelivered', 'WHMainSystemEff',
    'DeliveredEnergyMainSpace', 'Year_of_Construction', 'PrimaryEnergyMainWater',
    'PrimaryEnergySecondarySpace', 'PrimaryEnergyPumpsFans', 'NoStoreys', 'PrimaryEnergySupplementaryWater'
]

### --- Stratified Sampling of 10,000 Records for Hyperparameter Optimization ---

In [5]:
num_bins = 10
df['BER_bin'] = pd.qcut(df['BerRating'], q=num_bins, duplicates='drop')

# Stratified sample of 10k rows
sample_size = 10_000
df_sampled, _ = train_test_split(
    df,
    train_size=sample_size,
    stratify=df['BER_bin'],
    random_state=42
)

# Drop the temporary bin column if not needed
df_sampled = df_sampled.drop(columns=['BER_bin'])

print("Stratified 10k sample created with shape:", df_sampled.shape)

Stratified 10k sample created with shape: (10000, 101)
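A quick way to confirm that binning the continuous target and stratifying on the bins preserves its distribution is to compare per-bin shares before and after sampling. The sketch below uses synthetic data as a stand-in for the BER dataset; the `toy` frame, its size, and the gamma-distributed target are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the BER data: a skewed continuous target (hypothetical values)
rng = np.random.default_rng(42)
toy = pd.DataFrame({"BerRating": rng.gamma(shape=2.0, scale=100.0, size=8_000)})

# Bin the continuous target into deciles so it can serve as a stratification key
toy["BER_bin"] = pd.qcut(toy["BerRating"], q=10, duplicates="drop")

toy_sample, _ = train_test_split(
    toy, train_size=1_000, stratify=toy["BER_bin"], random_state=42
)

# Share of rows per bin in the full data versus the sample
full_share = toy["BER_bin"].value_counts(normalize=True).sort_index()
sample_share = toy_sample["BER_bin"].value_counts(normalize=True).sort_index()
max_gap = (full_share - sample_share).abs().max()
print(f"Largest per-bin share difference: {max_gap:.4f}")
```

A small gap means the sample mirrors the target distribution of the full data at the resolution of the chosen bins.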


### --- Target and Feature split for sample ---

In [6]:
X_sample = df_sampled[selected_features]
y_sample = df_sampled[target_column]

### --- Random Forest Hyperparameter Tuning Using RandomizedSearchCV ---

In [7]:
rf = RandomForestRegressor(random_state=42)

In [8]:
# Hyperparameter grid
rf_param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [1.0, 'sqrt', 0.8, 0.5]  # 1.0 = all features; replaces 'auto', removed in scikit-learn 1.3
}

In [9]:
rf_random = RandomizedSearchCV(
    estimator=rf,
    param_distributions=rf_param_grid,
    n_iter=25,
    scoring='neg_root_mean_squared_error',
    cv=3,
    verbose=2,
    n_jobs=-1,
    random_state=42
)

In [10]:
rf_random.fit(X_sample, y_sample)

Fitting 3 folds for each of 25 candidates, totalling 75 fits


In [11]:
print("Best RF params:", rf_random.best_params_)
print("Best RF RMSE:", -rf_random.best_score_)

Best RF params: {'n_estimators': 400, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 30}
Best RF RMSE: 33.29527542002387


### --- XGBRegressor Hyperparameter Tuning Using RandomizedSearchCV ---

In [12]:
xgb = XGBRegressor(
    objective='reg:squarederror',
    eval_metric='rmse',
    random_state=42
)

In [13]:
xgb_param_grid = {
    'n_estimators': [200, 300, 400, 500],
    'learning_rate': [0.01, 0.03, 0.05, 0.1],
    'max_depth': [4, 5, 6, 7, 8],
    'subsample': [0.6, 0.7, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.7, 0.8, 1.0],
    'reg_lambda': [0.0, 0.5, 1.0, 2.0],
    'reg_alpha': [0.0, 0.1, 0.3, 1.0]
}

In [14]:
xgb_random = RandomizedSearchCV(
    estimator=xgb,
    param_distributions=xgb_param_grid,
    n_iter=30,
    scoring='neg_root_mean_squared_error',
    cv=3,
    verbose=2,
    n_jobs=-1,
    random_state=42
)

In [15]:
xgb_random.fit(X_sample, y_sample)

Fitting 3 folds for each of 30 candidates, totalling 90 fits


In [16]:
print("Best XGB params:", xgb_random.best_params_)
print("Best XGB RMSE:", -xgb_random.best_score_)

Best XGB params: {'subsample': 0.8, 'reg_lambda': 0.0, 'reg_alpha': 0.0, 'n_estimators': 400, 'max_depth': 6, 'learning_rate': 0.05, 'colsample_bytree': 0.7}
Best XGB RMSE: 24.30876232643045


### --- CatBoostRegressor Hyperparameter Tuning Using RandomizedSearchCV ---

In [17]:
cat = CatBoostRegressor(
    loss_function='RMSE',
    random_state=42,
    verbose=0   
)

In [18]:
cat_param_grid = {
    'iterations': [300, 500, 700, 900],
    'learning_rate': [0.01, 0.03, 0.05, 0.1],
    'depth': [4, 5, 6, 7, 8, 9],
    'subsample': [0.6, 0.7, 0.8, 1.0],
    'colsample_bylevel': [0.6, 0.7, 0.8, 1.0],
    'l2_leaf_reg': [1, 3, 5, 7, 9]
}

In [19]:
cat_random = RandomizedSearchCV(
    estimator=cat,
    param_distributions=cat_param_grid,
    n_iter=25,
    scoring='neg_root_mean_squared_error',
    cv=3,
    verbose=2,
    n_jobs=-1,
    random_state=42
)

In [20]:
cat_random.fit(X_sample, y_sample)

Fitting 3 folds for each of 25 candidates, totalling 75 fits


In [21]:
print("Best CatBoost params:", cat_random.best_params_)
print("Best CatBoost RMSE:", -cat_random.best_score_)

Best CatBoost params: {'subsample': 0.6, 'learning_rate': 0.1, 'l2_leaf_reg': 7, 'iterations': 700, 'depth': 6, 'colsample_bylevel': 0.6}
Best CatBoost RMSE: 21.472972411430604


-------------------------------------------------------------------
### BER Rating Predictor – Final Voting Regressor Model Training 
-------------------------------------------------------------------

### --- Separate features and target ---

In [22]:
X = df[selected_features]
y = df[target_column]
print(f"Model to be trained with {X.shape[0]} rows and {X.shape[1]} features.")

Model to be trained with 80000 rows and 24 features.


### --- Split Data into Train and Test Sets (80/20) ---

- **Purpose:** To evaluate the model's ability to generalize to unseen data and avoid overfitting.  
- **Training Set (80%):** Used to train the model and learn patterns from the features.  
- **Test Set (20%):** Held out completely during training; used to assess the final model performance on unseen data.  
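- **Key Note:** Hyperparameters were already tuned with 3-fold cross-validation on a stratified 10k sample, so no separate validation set is carved out here. Because that sample was drawn from the full dataset before this split, some test rows also took part in tuning, and the test metrics may be slightly optimistic.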

In [23]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} rows")
print(f"Test set: {X_test.shape[0]} rows")

Training set: 64000 rows
Test set: 16000 rows


### --- Base Models with Best Hyperparameters ---

- **Purpose:** Initialize the three base models for the Voting Regressor using the **best hyperparameters** obtained earlier from hyperparameter tuning on the stratified 10k sample.  

- **Models:**
  1. **RandomForestRegressor (RF):** Configured with optimal `n_estimators`, `max_depth`, `min_samples_split`, `min_samples_leaf`, and `max_features`.  
  2. **XGBRegressor (XGB):** Configured with optimal `n_estimators`, `max_depth`, `learning_rate`, `subsample`, `colsample_bytree`, `reg_lambda`, and `reg_alpha`.  
  3. **CatBoostRegressor (CatBoost):** Configured with optimal `iterations`, `depth`, `learning_rate`, `subsample`, `colsample_bylevel`, and `l2_leaf_reg`.  

- **Key Note:** Each model uses the best configuration found by its randomized search on the 10k sample; these settings are assumed to transfer to the larger training set.
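Instead of retyping the tuned values by hand, the winning settings can be applied to a fresh estimator programmatically, which avoids transcription errors. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, `search`, and the small parameter space are illustrative assumptions, not the searches run in this notebook.

```python
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem (stand-in for X_sample / y_sample)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions={"n_estimators": [10, 20], "max_depth": [3, 5, None]},
    n_iter=3,
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

# Unfitted copy of the base estimator with the winning settings applied
final_model = clone(search.estimator).set_params(**search.best_params_)

# scikit-learn maximizes scores, so RMSE is reported negated; flip the sign to read it
best_rmse = -search.best_score_
print(search.best_params_, round(best_rmse, 3))
```

The resulting `final_model` is unfitted and still has to be trained on the full training split.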


In [24]:
print("Best RF params:", rf_random.best_params_)

Best RF params: {'n_estimators': 400, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 30}


In [25]:
rf_model = RandomForestRegressor(
    n_estimators=400,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',
    max_depth=30,
    random_state=42,
    n_jobs=-1
)

In [26]:
print("Best XGB params:", xgb_random.best_params_)

Best XGB params: {'subsample': 0.8, 'reg_lambda': 0.0, 'reg_alpha': 0.0, 'n_estimators': 400, 'max_depth': 6, 'learning_rate': 0.05, 'colsample_bytree': 0.7}


In [27]:
xgb_model = XGBRegressor(
    n_estimators=400,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.7,
    reg_lambda=0.0,
    reg_alpha=0.0,
    objective='reg:squarederror',
    eval_metric='rmse',
    random_state=42,
    n_jobs=-1
)

In [28]:
print("Best CatBoost params:", cat_random.best_params_)

Best CatBoost params: {'subsample': 0.6, 'learning_rate': 0.1, 'l2_leaf_reg': 7, 'iterations': 700, 'depth': 6, 'colsample_bylevel': 0.6}


In [29]:
cat_model = CatBoostRegressor(
    iterations=700,
    learning_rate=0.1,
    depth=6,
    subsample=0.6,
    colsample_bylevel=0.6,
    l2_leaf_reg=7,
    loss_function='RMSE',
    random_state=42,
    verbose=False
)

### --- Combine Base Models into a Voting Regressor ---

- **Purpose:** Aggregate the predictions of the three base models (RF, XGB, CatBoost) to form a robust ensemble.  
- **Method:** Use a **Voting Regressor** that predicts the **average of the individual model predictions**, reducing variance and leveraging complementary strengths of each model.  

- **Key Note:** This ensemble benefits from:
  - **RF:** Reduces variance and provides stability.  
  - **XGB:** Captures complex feature interactions.  
  - **CatBoost:** Builds symmetric (oblivious) trees, giving a differently structured boosted model than XGB.  

- **Expected outcome:** A more accurate and better-generalized prediction of `BerRating` than a single model typically gives; this notebook does not benchmark the base models individually on the test set.
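The averaging behaviour described above can be checked directly: with no weights supplied, a `VotingRegressor` prediction is the plain mean of its fitted base models' predictions. The sketch below uses synthetic data and lightweight scikit-learn estimators as stand-ins for the tuned RF, XGB, and CatBoost models; the data and the choice of estimators are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.linear_model import Ridge

# Synthetic data standing in for X_train / y_train
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)

ensemble = VotingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
        ("gbr", GradientBoostingRegressor(random_state=0)),
        ("ridge", Ridge()),
    ]
)
ensemble.fit(X_demo, y_demo)

# Predictions of each fitted base model, one column per model
per_model = np.column_stack([est.predict(X_demo) for est in ensemble.estimators_])
manual_average = per_model.mean(axis=1)

# The ensemble prediction should match the unweighted mean of the columns
matches = np.allclose(manual_average, ensemble.predict(X_demo))
print(matches)
```

Passing `weights=[...]` to `VotingRegressor` would turn this into a weighted average, which is one option if a base model is clearly stronger than the others.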

In [33]:
voting_model = VotingRegressor(
    estimators=[
        ('rf', rf_model),
        ('xgb', xgb_model),
        ('cat', cat_model)
    ]
)
print("Voting Regressor initialized with RF, XGB, and CatBoost base models.")

Voting Regressor initialized with RF, XGB, and CatBoost base models.


In [34]:
voting_model.fit(X_train, y_train)

#### --- Predict On Test Data ---

In [35]:
y_test_pred = voting_model.predict(X_test)

### --- Compute Metrics for Model Performance ---

To evaluate the final Voting Regressor, we compute the following metrics on the **test set**:
- **RMSE:** Root mean squared error; penalizes large errors more heavily.  
- **MAE:** Mean absolute error; the average error magnitude.  
- **R²:** Proportion of the target's variance explained by the model.  

Evaluating on the **test set** measures **generalization to unseen data**.

In [36]:
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
test_r2 = r2_score(y_test, y_test_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)

In [37]:
print(f"Test Metrics -> RMSE: {test_rmse:.3f}, R²: {test_r2:.3f}, MAE: {test_mae:.3f}")

Test Metrics -> RMSE: 17.612, R²: 0.984, MAE: 9.803


# Final BER Rating Predictor – Voting Regressor Summary 

## **Ensemble Overview**
- **Voting Regressor** combining three tuned base models:
  1. **RandomForestRegressor (RF):** Reduces variance, robust predictions for numeric features.  
  2. **XGBRegressor (XGB):** Captures complex feature interactions for high accuracy.  
  3. **CatBoostRegressor (CatBoost):** Gradient boosting on symmetric trees; complements XGB with a differently structured model.

- **Purpose:** Combining models reduces individual biases and leverages complementary strengths for **better generalization**.

## **Dataset**
- **Training set:** 64,000 rows  
- **Test set:** 16,000 rows (20% of total)  
- **Features:** 24 numeric building and energy-related metrics.  
- **Hyperparameters:** Individually tuned on a stratified 10k sample.

## **Model Performance (Test Set)**
**Interpretation:**  
- **RMSE (~17.6):** Root mean squared error in `BerRating` units; it sits well above the MAE, which points to a minority of larger errors.  
- **R² (~0.984):** Model explains ~98% of the variance in `BerRating` on the test set.  
- **MAE (~9.8):** Average absolute error, interpretable in original units.  

## **Key Takeaways**
- The model **generalizes well**, as shown by test metrics.  
- The final model was fit on the **full 64,000-row training split**; to keep the test metrics honest, it should **not be re-tuned** against the test set.  
- Each base model contributes **complementary strengths**: RF stabilizes, XGB captures interactions, CatBoost adds a differently structured boosted model.  
- All three base models are tree-based, so **SHAP** can be applied to each of them for interpretability and further analysis of feature contributions.  

**Conclusion:** The final Voting Regressor is **accurate, robust, and ready for deployment**, capable of predicting `BerRating` and providing actionable insights for energy retrofit recommendations.


### --- Persist the Trained Model for Future Use ---

In [38]:
import os
os.makedirs("./models", exist_ok=True)  # create the output directory if it is missing
with open("./models/BER_PREDICTOR.pkl", "wb") as f:
    pickle.dump(voting_model, f)
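Loading the saved model later follows the reverse path. The sketch below shows a save/load round trip on a small stand-in ensemble written to a temporary directory; the synthetic data, the stand-in estimators, and the file name `demo_model.pkl` are assumptions for illustration. For the real model, new inputs would need the same 24 feature columns in the same order, and the file should be loaded with library versions compatible with those used for training.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge

# Small stand-in ensemble trained on synthetic data
X_demo, y_demo = make_regression(n_samples=100, n_features=4, random_state=0)
model = VotingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=10, random_state=0)),
        ("ridge", Ridge()),
    ]
).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.pkl"  # hypothetical file name

    # Save the fitted model
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back; only unpickle files from a trusted source
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```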