<a href="https://colab.research.google.com/github/xesmaze/cpsc499-sta-fall2024/blob/main/ENSEMBLMethods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**1. Bagging (Random Forest):**

  Random Forest is a Bagging method that builds multiple decision trees and aggregates their predictions to improve generalization and reduce overfitting.

**2. Boosting (XGBoost):**

  XGBoost is a powerful gradient boosting method that builds models sequentially, where each new model attempts to correct the errors made by the previous one.

  This example uses the xgboost library with a squared error loss function.

**3. Stacking:**

  Stacking combines the predictions of several base learners (e.g., Random Forest, SVR) and trains a final meta-learner (e.g., Ridge) to make the final prediction based on the outputs of the base learners.

  In this example, a `RandomForestRegressor` and `SVR` are used as base models, and a `Ridge` regression is used as the meta-learner.

  For each method, you can compare their respective Mean Squared Errors (MSEs) to understand their effectiveness on the Abalone dataset. The Stacking method typically combines the strengths of its base learners, often outperforming individual models.

In [23]:
# Bagging Example with Abalone Dataset
# Import necessary libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Abalone dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
columns = ["Sex", "Length", "Diameter", "Height", "Whole weight", "Shucked weight",
           "Viscera weight", "Shell weight", "Rings"]
data = pd.read_csv(url, names=columns)

# One-hot encode the 'Sex' column
data = pd.get_dummies(data, columns=["Sex"], drop_first=True)

# Split data into features and target
X = data.drop("Rings", axis=1)
y = data["Rings"]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest (Bagging example)
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = rf.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Random Forest MSE: {mse}")

Random Forest MSE: 5.10786495215311


In [24]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error


# Convert the data to DMatrix format for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set up parameters for XGBoost (without n_estimators)
params = {
    'objective': 'reg:squarederror',
    'max_depth': 6,
    'learning_rate': 0.1,
    'random_state': 42  # Set random seed for reproducibility
}

# Train the XGBoost model with num_boost_round instead of n_estimators
xg_reg = xgb.train(params, dtrain, num_boost_round=100)

# Make predictions
y_pred = xg_reg.predict(dtest)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"XGBoost MSE: {mse}")


XGBoost MSE: 5.10847911780039


In [25]:
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Base learners
base_learners = [
    ('rf', RandomForestRegressor(n_estimators=50, random_state=42)),
    ('svr', SVR(kernel='linear')),
]

# Meta-learner (final model)
stack_model = StackingRegressor(
    estimators=base_learners,
    final_estimator=Ridge()
)

# Train the Stacking model
stack_model.fit(X_train, y_train)

# Make predictions
y_pred = stack_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Stacking Model MSE: {mse}")


Stacking Model MSE: 4.8071347773329


**Pruning Example with Early Stopping at XGBoost**

Pruning in boosting algorithms like XGBoost usually refers to early stopping, where the training process halts if the performance on a validation set doesn't improve after a certain number of rounds (boosting iterations).

 We'll add early stopping as the pruning mechanism in XGBoost, where training stops if no improvement is seen in the validation error for a defined number of rounds.

In [26]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the Abalone dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
columns = ["Sex", "Length", "Diameter", "Height", "Whole weight", "Shucked weight",
           "Viscera weight", "Shell weight", "Rings"]
data = pd.read_csv(url, names=columns)

# One-hot encode the 'Sex' column
data = pd.get_dummies(data, columns=["Sex"], drop_first=True)

# Split data into features and target
X = data.drop("Rings", axis=1)
y = data["Rings"]

# Split the data into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Convert data to DMatrix format for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set up parameters for XGBoost (without n_estimators)
params = {
    'objective': 'reg:squarederror',
    'max_depth': 6,
    'learning_rate': 0.1,
    'random_state': 42  # Set random seed for reproducibility
}

# Train the XGBoost model with early stopping for pruning
xg_reg = xgb.train(params,
                   dtrain,
                   num_boost_round=100,          # Maximum number of boosting rounds
                   evals=[(dval, 'validation')], # Validation set to monitor
                   early_stopping_rounds=20,     # Stop if no improvement in 20 rounds
                   verbose_eval=True)            # Print progress

# Make predictions on the test set
y_pred = xg_reg.predict(dtest)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"XGBoost with Early Stopping MSE: {mse}")
print(f"XGBoost with Early Stopping RMSE: {rmse}")


[0]	validation-rmse:3.00877
[1]	validation-rmse:2.87303
[2]	validation-rmse:2.75336
[3]	validation-rmse:2.65315
[4]	validation-rmse:2.56681
[5]	validation-rmse:2.50281
[6]	validation-rmse:2.45023
[7]	validation-rmse:2.40445
[8]	validation-rmse:2.37101
[9]	validation-rmse:2.33862
[10]	validation-rmse:2.32155
[11]	validation-rmse:2.30271
[12]	validation-rmse:2.29208
[13]	validation-rmse:2.28186
[14]	validation-rmse:2.27022
[15]	validation-rmse:2.26831
[16]	validation-rmse:2.26045
[17]	validation-rmse:2.25583
[18]	validation-rmse:2.25162
[19]	validation-rmse:2.25121
[20]	validation-rmse:2.24850
[21]	validation-rmse:2.24659
[22]	validation-rmse:2.24914
[23]	validation-rmse:2.24994
[24]	validation-rmse:2.24872
[25]	validation-rmse:2.25090
[26]	validation-rmse:2.24861
[27]	validation-rmse:2.24767
[28]	validation-rmse:2.24974
[29]	validation-rmse:2.24912
[30]	validation-rmse:2.25346
[31]	validation-rmse:2.25459
[32]	validation-rmse:2.25063
[33]	validation-rmse:2.25453
[34]	validation-rmse:2.2

**What happens when you increase the maximum number of early stopping rounds to 10 and boosting rounds to 500?**

In [27]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# Load the Abalone dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
columns = ["Sex", "Length", "Diameter", "Height", "Whole weight", "Shucked weight",
           "Viscera weight", "Shell weight", "Rings"]
data = pd.read_csv(url, names=columns)

# One-hot encode the 'Sex' column
data = pd.get_dummies(data, columns=["Sex"], drop_first=True)

# Split data into features and target
X = data.drop("Rings", axis=1)
y = data["Rings"]

# Split the data into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Convert data to DMatrix format for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set up parameters for XGBoost (without n_estimators)
params = {
    'objective': 'reg:squarederror',
    'max_depth': 6,
    'learning_rate': 0.1,
}

# Train the XGBoost model with early stopping for pruning
xg_reg = xgb.train(params,
                   dtrain,
                   num_boost_round=500,          # Maximum number of boosting rounds
                   evals=[(dval, 'validation')], # Validation set to monitor
                   early_stopping_rounds=10,     # Stop if no improvement in 20 rounds
                   verbose_eval=True)            # Print progress

# Make predictions on the test set
y_pred = xg_reg.predict(dtest)

# Compute RMSE for the final evaluation
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"XGBoost with Early Stopping MSE: {mse}")
print(f"XGBoost with Early Stopping RMSE: {rmse}")

[0]	validation-rmse:3.00877
[1]	validation-rmse:2.87303
[2]	validation-rmse:2.75336
[3]	validation-rmse:2.65315
[4]	validation-rmse:2.56681
[5]	validation-rmse:2.50281
[6]	validation-rmse:2.45023
[7]	validation-rmse:2.40445
[8]	validation-rmse:2.37101
[9]	validation-rmse:2.33862
[10]	validation-rmse:2.32155
[11]	validation-rmse:2.30271
[12]	validation-rmse:2.29208
[13]	validation-rmse:2.28186
[14]	validation-rmse:2.27022
[15]	validation-rmse:2.26831
[16]	validation-rmse:2.26045
[17]	validation-rmse:2.25583
[18]	validation-rmse:2.25162
[19]	validation-rmse:2.25121
[20]	validation-rmse:2.24850
[21]	validation-rmse:2.24659
[22]	validation-rmse:2.24914
[23]	validation-rmse:2.24994
[24]	validation-rmse:2.24872
[25]	validation-rmse:2.25090
[26]	validation-rmse:2.24861
[27]	validation-rmse:2.24767
[28]	validation-rmse:2.24974
[29]	validation-rmse:2.24912
[30]	validation-rmse:2.25346
XGBoost with Early Stopping MSE: 4.46763098597263
XGBoost with Early Stopping RMSE: 2.120969891706594


lets investigate how changing the number of boost rounds effect the loss?

In [33]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import numpy as np
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import pandas as pd

# Function to calculate RMSE for different boosting rounds with fixed seed
def get_rmse(num_boost_rounds, dtrain, dtest, y_test):
    # Train the XGBoost model with a fixed random seed for reproducibility
    xg_reg = xgb.train(params, dtrain, num_boost_round=num_boost_rounds)

    # Make predictions
    y_pred = xg_reg.predict(dtest)

    # Compute RMSE
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    return rmse

# Number of boosting rounds to evaluate
boost_rounds = [20,30,40,50,100, 200, 300, 400, 500]
rmse_results = []

# Convert the data to DMatrix format for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set up parameters for XGBoost with a fixed random seed
params = {
    'objective': 'reg:squarederror',
    'max_depth': 6,
    'learning_rate': 0.1,
    'random_state': 42  # Set random seed for reproducibility
}

# Calculate RMSE for each number of boosting rounds
for num_round in boost_rounds:
    rmse = get_rmse(num_round, dtrain, dtest, y_test)
    rmse_results.append(rmse)

# Prepare data for Plotly
data = pd.DataFrame({'Boosting Rounds': boost_rounds, 'RMSE': rmse_results})

# Create Plotly figure
fig = go.Figure()

# Add line plot for RMSE
fig.add_trace(go.Scatter(
    x=data['Boosting Rounds'],
    y=data['RMSE'],
    mode='lines+markers',
    name='RMSE'
))

# Customize the layout
fig.update_layout(
    title='XGBoost RMSE vs Number of Boosting Rounds',
    xaxis_title='Number of Boosting Rounds',
    yaxis_title='Root Mean Squared Error (RMSE)',
    template='plotly'
)

# Show the plot
fig.show()
