# Step 5: Ensemble Modeling
Use a simple averaging ensemble of XGBoost, LightGBM, and Random Forest regressors.

## Overview:
This notebook implements a model ensemble technique that combines predictions from multiple machine learning models to improve prediction accuracy and robustness. Ensembling helps:

1. Reduce overfitting by averaging out the partly uncorrelated errors (variance) of the individual models
2. Improve prediction stability 
3. Potentially achieve better performance than any single model

We'll use a simple average of predictions from three different model types:
- XGBoost with tuned hyperparameters from Step 4
- Random Forest 
- LightGBM
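
To illustrate why averaging can help, here is a minimal sketch on synthetic data. It assumes three hypothetical predictors whose errors are independent and equally noisy; none of the names below come from this notebook, and real models have correlated errors, so the gain in practice is usually smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(loc=4.0, scale=1.0, size=10_000)  # synthetic targets

# Three hypothetical predictors: the truth plus independent noise of equal size
preds = [y_true + rng.normal(scale=0.3, size=y_true.size) for _ in range(3)]

def rmse(y, p):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y - p) ** 2)))

single_rmse = [rmse(y_true, p) for p in preds]
ensemble_rmse = rmse(y_true, np.mean(preds, axis=0))  # simple average of the three

print("single models:", [round(s, 3) for s in single_rmse])
print("simple average:", round(ensemble_rmse, 3))
```

Under these idealized assumptions the averaged error shrinks by roughly a factor of sqrt(3) relative to a single predictor.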

In [None]:
# Import necessary libraries for data manipulation and modeling
import pandas as pd  # For data manipulation and analysis
import numpy as np   # For numerical operations
from sklearn.ensemble import RandomForestRegressor  # Tree-based ensemble model
from lightgbm import LGBMRegressor  # Gradient boosting framework
from xgboost import XGBRegressor   # Gradient boosting library

In [None]:
# Load feature-engineered datasets that include the additional features we created
# These datasets should have better predictive power than the basic preprocessed ones
train = pd.read_csv('datasets/train_fe.csv')  # Load training data with engineered features
test = pd.read_csv('datasets/test_fe.csv')    # Load test data with engineered features
test_ids = pd.read_csv('datasets/test_ids.csv')['id']  # Load test IDs for submission file

# Separate features and target variable
X = train.drop(columns='Calories')  # Feature matrix (all columns except target)
y = train['Calories']  # Target variable (log1p-transformed Calories)
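
The target in `train_fe.csv` is stored on a log scale. Below is a minimal sketch of that transform and its inverse, assuming it was produced with `np.log1p` in an earlier step; the values are made up for illustration.

```python
import numpy as np

calories = np.array([1.0, 50.0, 150.0, 300.0])  # hypothetical raw values

y_log = np.log1p(calories)   # log(1 + x), the scale the models train on
recovered = np.expm1(y_log)  # exp(y) - 1, back to the original scale

print(np.round(y_log, 4))
print(recovered)
```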

## 1. Initialize Models
Use your tuned XGBoost parameters. We'll use default RF & LGBM for now.

For the ensemble, we're combining:
- XGBoost with the tuned hyperparameters from Step 4
- Random Forest with default settings (100 trees)
- LightGBM with default settings (100 boosting rounds)

This approach leans on our best-performing model (XGBoost) while adding other model types whose errors differ, which reduces variance and improves generalization.
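
This notebook averages the predictions by hand. As a point of comparison, scikit-learn's `VotingRegressor` performs the same unweighted average. The sketch below uses small stand-in models and synthetic data (both are assumptions for illustration) so that it runs without XGBoost or LightGBM installed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the feature matrix and target
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

voter = VotingRegressor(estimators=[
    ("ridge", Ridge()),
    ("tree", DecisionTreeRegressor(max_depth=4, random_state=42)),
    ("rf", RandomForestRegressor(n_estimators=20, random_state=42)),
])
voter.fit(X_demo, y_demo)

# With no weights given, the ensemble prediction is the plain mean of the fitted members
manual_avg = np.mean([est.predict(X_demo) for est in voter.estimators_], axis=0)
voter_preds = voter.predict(X_demo)
print(np.allclose(voter_preds, manual_avg))
```

Averaging by hand, as done here, keeps each fitted model available separately, which is convenient when individual predictions need to be inspected.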

In [None]:
# Initialize XGBoost model with carefully tuned hyperparameters from Step 4
xgb_model = XGBRegressor(
    n_estimators=761,      # Number of trees - optimized value from hyperparameter tuning
    max_depth=8,           # Maximum tree depth
    learning_rate=0.0433,  # Learning rate (step size)
    subsample=0.8292,      # Fraction of samples used for tree building
    colsample_bytree=0.6293,  # Fraction of features used per tree
    gamma=0.0251,          # Minimum loss reduction for further partition
    reg_alpha=0.8449,      # L1 regularization term
    reg_lambda=2.7842,     # L2 regularization term
    random_state=42,       # For reproducibility
    n_jobs=-1              # Use all available CPU cores
)

# Initialize Random Forest with default parameters (100 trees is the scikit-learn default)
# Random Forest provides a different modeling approach than gradient boosting
rf_model = RandomForestRegressor(
    n_estimators=100,   # Number of trees in the forest
    random_state=42,    # For reproducibility
    n_jobs=-1           # Use all available CPU cores
)

# Initialize LightGBM with default parameters (100 boosting rounds is the LightGBM default)
# LightGBM is another gradient boosting implementation, with a different tree-growth strategy than XGBoost
lgb_model = LGBMRegressor(
    n_estimators=100,   # Number of boosting rounds
    random_state=42     # For reproducibility
)

In [None]:
# Train all three models on the full training dataset
# We train each model separately before combining their predictions
xgb_model.fit(X, y)  # Train XGBoost model
rf_model.fit(X, y)   # Train Random Forest model
lgb_model.fit(X, y)  # Train LightGBM model

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005178 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1155
[LightGBM] [Info] Number of data points in the train set: 750000, number of used features: 11
[LightGBM] [Info] Start training from score 4.141144


In [None]:
# Generate predictions from each model on the test dataset
# The target is log1p-transformed, so these predictions are log1p(Calories)
xgb_preds_log = xgb_model.predict(test)  # XGBoost predictions
rf_preds_log = rf_model.predict(test)    # Random Forest predictions
lgb_preds_log = lgb_model.predict(test)  # LightGBM predictions

# Create ensemble prediction by simple averaging
# Equal weights (1/3) are given to each model's predictions
# Simple averaging is effective and doesn't require additional training
avg_preds_log = (xgb_preds_log + rf_preds_log + lgb_preds_log) / 3

# Transform predictions back to the original scale
# expm1() is the inverse of log1p()
avg_preds = np.expm1(avg_preds_log)  # Convert from log1p(Calories) to Calories
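
Averaging in log space and then applying `expm1` is not the same as averaging the back-transformed predictions. A small sketch with made-up numbers (the three rows stand in for per-model predictions) shows the difference: the log-space average behaves like a geometric mean of `1 + prediction` and is never larger than the arithmetic mean.

```python
import numpy as np

# Hypothetical per-model predictions on the original Calories scale
model_preds = np.array([
    [90.0, 200.0],   # model A
    [100.0, 260.0],  # model B
    [110.0, 140.0],  # model C
])

# Average the log1p values, then back-transform (the order used in this notebook)
log_space_avg = np.expm1(np.log1p(model_preds).mean(axis=0))

# Average directly on the original scale
arithmetic_avg = model_preds.mean(axis=0)

print(np.round(log_space_avg, 2))
print(arithmetic_avg)
```

Since the notebook evaluates with RMSLE, which measures error in log space, averaging in log space is the consistent choice.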

In [None]:
# Import libraries needed for cross-validation and evaluation
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_log_error

# Define a function to evaluate ensemble performance using cross-validation
def ensemble_cv_rmsle(models, X, y, n_splits=5):
    """
    Evaluate ensemble model performance using k-fold cross-validation
    
    Parameters:
    - models: List of models to ensemble
    - X: Feature matrix
    - y: Target values
    - n_splits: Number of cross-validation folds
    
    Returns:
    - Average RMSLE across all folds
    """
    # Setup k-fold cross-validation
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    rmsle_scores = []  # Store scores from each fold

    # Perform cross-validation
    for train_idx, val_idx in kf.split(X):
        # Split data for this fold
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

        # Train all models on the training portion
        # Note: fit() refits the passed-in models in place, replacing any earlier full-data fit
        for model in models:
            model.fit(X_train, y_train)

        # Generate and average predictions from all models
        # column_stack builds an (n_samples, n_models) array; mean(axis=1) averages the models for each sample
        preds_log = np.column_stack([model.predict(X_val) for model in models]).mean(axis=1)

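        # Calculate RMSLE for this fold. y is already log1p(Calories), so the RMSE
        # in log space is the RMSLE on the original scale; mean_squared_log_error
        # would apply log1p a second time.
        rmsle = np.sqrt(np.mean((y_val - preds_log) ** 2))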
        rmsle_scores.append(rmsle)

    # Return average RMSLE across all folds
    return np.mean(rmsle_scores)

# Evaluate the ensemble using cross-validation
models = [xgb_model, rf_model, lgb_model]  # List of models to ensemble
ensemble_rmsle = ensemble_cv_rmsle(models, X, y)  # Calculate cross-validated RMSLE
print(f"Cross-validated Ensemble RMSLE: {ensemble_rmsle:.5f}")

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005211 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1154
[LightGBM] [Info] Number of data points in the train set: 600000, number of used features: 11
[LightGBM] [Info] Start training from score 4.141163

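When the target is already on the log1p scale, the RMSLE on the original scale is the RMSE of the log-scale values. Here is a short check on made-up numbers, assuming `y_log` and `pred_log` play the roles of `y_val` and `preds_log` in the cross-validation function:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Hypothetical true values and predictions on the original scale
y_raw = np.array([10.0, 80.0, 150.0, 300.0])
pred_raw = np.array([12.0, 75.0, 160.0, 280.0])

# The same values on the log1p scale, as the models see them
y_log, pred_log = np.log1p(y_raw), np.log1p(pred_raw)

rmsle_from_raw = np.sqrt(mean_squared_log_error(y_raw, pred_raw))
rmse_in_log_space = np.sqrt(np.mean((y_log - pred_log) ** 2))

# Passing log-scale values to mean_squared_log_error applies log1p twice
double_logged = np.sqrt(mean_squared_log_error(y_log, pred_log))

print(rmsle_from_raw, rmse_in_log_space, double_logged)
```

The first two numbers agree, while the double-logged value is smaller and would understate the error.
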
In [None]:
# Create a submission dataframe with test IDs and ensemble predictions
submission = pd.DataFrame({
    'id': test_ids,             # Test instance identifiers
    'Calories': avg_preds       # Ensemble model predictions (in original scale)
})

# Save submission file with date in filename for tracking purposes
submission_path = 'datasets/submissions/submission_ensemble_avg_may20.csv'
submission.to_csv(submission_path, index=False)
print(f"Submission file '{submission_path}' created.")

# This submission uses the ensemble, which we expect to be more robust
# than any single model because it averages different model types

Submission file 'datasets/submissions/submission_ensemble_avg_may20.csv' created.
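
Before uploading, a few sanity checks on the submission frame can catch common mistakes. The helper below is a hypothetical sketch (its name and the specific checks are assumptions, not part of this notebook), demonstrated on a toy frame with the same two columns:

```python
import numpy as np
import pandas as pd

def check_submission(df: pd.DataFrame) -> None:
    """Raise AssertionError if the frame looks malformed (hypothetical helper)."""
    assert list(df.columns) == ["id", "Calories"], "unexpected columns"
    assert df["id"].is_unique, "duplicate ids"
    assert df["Calories"].notna().all(), "missing predictions"
    assert np.isfinite(df["Calories"]).all(), "non-finite predictions"
    assert (df["Calories"] >= 0).all(), "negative predictions"

# Toy stand-in for the real submission frame
toy = pd.DataFrame({"id": [1, 2, 3], "Calories": [35.2, 120.0, 201.7]})
check_submission(toy)
print("submission checks passed")
```

In the notebook itself, the same call would be `check_submission(submission)` just before `to_csv`.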
