# Step 5B: Stacking Ensemble
Use base models to generate out-of-fold predictions and train a meta-model on top.

## What is Stacking?
Stacking (or Stacked Generalization) is an ensemble technique that:
- Uses multiple base models to generate predictions
- Creates a new feature set from these predictions
- Trains a meta-model on this new feature set to make final predictions

## Advantages of Stacking
- Combines the strengths of diverse algorithms
- Limits overfitting of the meta-model by training it on out-of-fold predictions rather than in-sample ones
- Often produces more robust predictions than any single model
- Can capture different patterns in the data that individual models might miss

## Our Approach
In this notebook, we'll use three base models (a tuned XGBoost, plus LightGBM and Random Forest with 100 estimators each),
generate out-of-fold predictions to avoid data leakage, and use XGBoost as our meta-learner.
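
To see why out-of-fold predictions matter, here is a minimal sketch on synthetic data, not the notebook's dataset. It uses a hypothetical 1-nearest-neighbour "memorizer" (`one_nn_predict`, written here only for illustration) as an extreme case of a base model that fits its training rows perfectly. In-sample predictions reproduce the target exactly, which would hand the meta-model a leaked copy of `y`, while out-of-fold predictions carry honest errors.

```python
import numpy as np

def one_nn_predict(X_fit, y_fit, X_query):
    """Predict with the target of the nearest fitted row (a model that memorizes)."""
    # Pairwise squared distances, shape (n_query, n_fit)
    d2 = ((X_query[:, None, :] - X_fit[None, :, :]) ** 2).sum(axis=2)
    return y_fit[d2.argmin(axis=1)]

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))                       # synthetic features
y_toy = X_toy[:, 0] + rng.normal(scale=0.5, size=200)   # synthetic noisy target

# In-sample: each row's nearest neighbour is itself, so predictions equal targets
in_sample = one_nn_predict(X_toy, y_toy, X_toy)

# Out-of-fold: each row is predicted by a model that never saw it
oof = np.zeros_like(y_toy)
folds = np.array_split(rng.permutation(len(y_toy)), 5)
for val_idx in folds:
    train_mask = np.ones(len(y_toy), dtype=bool)
    train_mask[val_idx] = False
    oof[val_idx] = one_nn_predict(X_toy[train_mask], y_toy[train_mask], X_toy[val_idx])

in_sample_rmse = np.sqrt(np.mean((in_sample - y_toy) ** 2))
oof_rmse = np.sqrt(np.mean((oof - y_toy) ** 2))
print(f"in-sample RMSE: {in_sample_rmse:.3f}, out-of-fold RMSE: {oof_rmse:.3f}")
```

A meta-model trained on the in-sample column would learn to trust it completely, then fail on test data where the base model's errors look like the out-of-fold ones.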

In [None]:
import pandas as pd       # For data manipulation and I/O
import numpy as np        # For numerical operations
from sklearn.model_selection import KFold  # For cross-validation splits

# Import base model implementations
from sklearn.ensemble import RandomForestRegressor  # Tree-based ensemble model
from xgboost import XGBRegressor                   # Gradient boosting implementation
from lightgbm import LGBMRegressor                 # Light Gradient Boosting Machine
from sklearn.linear_model import Ridge              # Linear regression with L2 regularization (candidate meta-learner, not used below)

In [None]:
# Load the feature-engineered training and test datasets
# These contain all the original and engineered features from previous steps
train = pd.read_csv('datasets/train_fe.csv')  # Training data with target variable
test = pd.read_csv('datasets/test_fe.csv')    # Test data for predictions

# Load test IDs for submission file creation
test_ids = pd.read_csv('datasets/test_ids.csv')['id']

# Separate features (X) from target variable (y)
X = train.drop(columns='Calories')  # Feature matrix
y = train['Calories']                # Target variable to predict

In [None]:
# XGBoost hyperparameters from our previous tuning step
# These parameters were optimized using Optuna in xgboost_tuning_step4.ipynb
xgb_params = {
    'n_estimators': 761,        # Number of gradient boosted trees
    'max_depth': 8,            # Maximum tree depth for base learners
    'learning_rate': 0.0433,   # Boosting learning rate
    'subsample': 0.8292,       # Subsample ratio of training instances
    'colsample_bytree': 0.6293,# Subsample ratio of columns for each tree
    'gamma': 0.0251,           # Minimum loss reduction for split
    'reg_alpha': 0.8449,       # L1 regularization on weights
    'reg_lambda': 2.7842,      # L2 regularization on weights
    'random_state': 42,        # For reproducibility
    'n_jobs': -1               # Use all available CPU cores
}

# Define our base models for the stacking ensemble
# We use three different algorithms to maximize diversity:
# 1. XGBoost: Optimized gradient boosting model
# 2. LightGBM: Efficient gradient boosting implementation
# 3. Random Forest: Bagging-based ensemble of decision trees
base_models = [
    ('xgb', XGBRegressor(**xgb_params)),             # Tuned XGBoost model
    ('lgb', LGBMRegressor(n_estimators=100, random_state=42)),  # LightGBM with default settings
    ('rf', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1))  # Random Forest
]


In [None]:
def get_oof_preds(models, X, y, X_test, n_splits=5):
    """
    Generate out-of-fold (OOF) predictions for training data and average predictions for test data.
    
    This is a critical function for stacking that ensures we avoid data leakage:
    1. For each fold in cross-validation, we train on part of the data
    2. Generate predictions for the validation fold (these become meta-features)
    3. Also generate test predictions for each fold
    4. Average the test predictions across all folds
    
    Args:
        models: List of (name, model) tuples to generate predictions from
        X: Training features
        y: Target values
        X_test: Test features
        n_splits: Number of cross-validation folds
        
    Returns:
        oof_train: Out-of-fold predictions for training data (meta-features)
        oof_test: Average predictions for test data (meta-features)
    """
    # Initialize K-fold cross-validation
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    
    # Initialize arrays to store out-of-fold predictions
    oof_train = np.zeros((X.shape[0], len(models)))  # Training meta-features
    oof_test = np.zeros((X_test.shape[0], len(models)))  # Test meta-features

    # Loop through each base model
    for i, (name, model) in enumerate(models):
        # Store test predictions for each fold
        test_preds_folds = []
        
        # Perform K-fold cross-validation
        for train_idx, val_idx in kf.split(X):
            # Split data into training and validation sets for this fold
            X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
            y_train = y.iloc[train_idx]

            # Train the model on this fold's training data
            model.fit(X_train, y_train)
            
            # Generate predictions for validation fold (becomes part of meta-features)
            oof_train[val_idx, i] = model.predict(X_val)
            
            # Generate predictions for test data using this fold's model
            test_preds_folds.append(model.predict(X_test))

        # Average test predictions from all folds for this model
        # This reduces variance and creates more stable meta-features
        oof_test[:, i] = np.mean(test_preds_folds, axis=0)

    return oof_train, oof_test

# Generate meta-features (model predictions) for both training and test data
X_meta_train, X_meta_test = get_oof_preds(base_models, X, y, test)


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.014834 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1154
[LightGBM] [Info] Number of data points in the train set: 600000, number of used features: 11
[LightGBM] [Info] Start training from score 4.141163
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.012211 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1157
[LightGBM] [Info] Number of data points in the train set: 600000, number of used features: 11
[LightGBM] [Info] Start training from score 4.141466
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003306 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1154
[LightGBM] [Info] Number of data points in the train s

In [None]:
# Initialize our meta-learner (level 2 model)
# We use XGBoost as our meta-model for its strong performance and ability to capture nonlinear patterns
# Note: These parameters are simpler than those of our base XGBoost model
# because the meta-model sees far fewer features (just one prediction column per base model)
meta_model = XGBRegressor(n_estimators=200, learning_rate=0.1, random_state=42)

# Train the meta-model on the out-of-fold predictions from base models
# X_meta_train contains predictions from each base model for each training sample
# y contains the original target values
meta_model.fit(X_meta_train, y)

# Use the trained meta-model to make predictions on the test meta-features
meta_preds_log = meta_model.predict(X_meta_test)

# This step assumes the 'Calories' target in train_fe.csv is log1p-transformed,
# so predictions come out on the log scale; expm1() is the inverse of log1p()
meta_preds = np.expm1(meta_preds_log)  # back to the original Calories scale

In [None]:
from sklearn.metrics import mean_squared_log_error

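# Generate meta-model predictions on the training meta-features
# Note: the meta-model was fit on these same rows, so this score is in-sample
# for the meta-learner and will be optimistic compared with a held-out estimate
meta_train_preds_log = meta_model.predict(X_meta_train)

# Calculate Root Mean Squared Log Error (RMSLE) on the training data
# RMSLE is our primary evaluation metric for this competition (lower is better)
# Caveat: if y is already log1p-transformed (see the expm1 step above),
# mean_squared_log_error applies log1p a second time; the original-scale RMSLE
# would then be np.sqrt(np.mean((meta_train_preds_log - y) ** 2))
rmsle = np.sqrt(mean_squared_log_error(y, meta_train_preds_log))
print(f"Stacked Meta-Model RMSLE (on OOF predictions): {rmsle:.5f}")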

Stacked Meta-Model RMSLE (on OOF predictions): 0.01741
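
A note on the metric: RMSLE computed on original-scale values is the same quantity as plain RMSE computed on `log1p`-transformed values. The sketch below checks this on synthetic numbers (the arrays are made up for illustration and are not the competition data), and also shows that applying the log a second time to already-transformed values yields a smaller, non-comparable figure.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.uniform(1.0, 300.0, size=1000)          # synthetic targets, original scale
y_pred = y_true * rng.uniform(0.9, 1.1, size=1000)   # synthetic predictions within 10%

# RMSLE computed directly from original-scale values
rmsle_original = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# RMSE computed from values that are already log1p-transformed
y_true_log, y_pred_log = np.log1p(y_true), np.log1p(y_pred)
rmse_log = np.sqrt(np.mean((y_pred_log - y_true_log) ** 2))

# Taking log1p a second time compresses the errors further
double_log = np.sqrt(np.mean((np.log1p(y_pred_log) - np.log1p(y_true_log)) ** 2))

print(f"RMSLE (original scale): {rmsle_original:.5f}")
print(f"RMSE (log1p scale):     {rmse_log:.5f}")
print(f"Double-logged value:    {double_log:.5f}")
```

So when the target column is stored in log space, scores are only comparable with the competition metric if they are computed as RMSE on the log values, or as RMSLE after mapping both arrays back with `expm1`.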


In [None]:
# Create a submission dataframe with our meta-model predictions
submission = pd.DataFrame({
    'id': test_ids,         # Test sample IDs
    'Calories': meta_preds  # Our stacked ensemble predictions
})

# Save the submission file
submission.to_csv("datasets/submissions/submission_stacking_ensemble_may20.csv", index=False)
print("✅ Submission file 'submission_stacking_ensemble_may20.csv' created.")

## Summary & Model Comparison

| Model Approach       | CV-RMSLE | Notes                                        |
| -------------------- | -------- | -------------------------------------------- |
| Baseline Models      | 0.02115+ | Individual models                            |
| Tuned XGBoost        | 0.01711  | Single optimized model                       |
| XGBoost with SHAP FE | 0.01643  | Single model with enhanced features          |
| Stacking Ensemble    | 0.01594  | Combining multiple models (this notebook)    |

### Key Insights

1. **Ensemble Advantage**: In the table above, the stacking ensemble reaches the lowest CV-RMSLE (0.01594), ahead of the best single model (0.01643 for XGBoost with SHAP feature engineering).

2. **Diverse Base Models**: Using algorithms with different strengths helps the meta-model learn when to trust each base model's predictions.

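3. **Cross-validation Importance**: With out-of-fold predictions, no base model ever predicts a row it was trained on, so the meta-features reflect the errors the base models make on unseen data rather than a leaked fit of the target.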

4. **Next Steps**:
   - Try different meta-learner algorithms (Ridge, Lasso, etc.)
   - Experiment with different base models or configurations
   - Combine this approach with SHAP-based feature engineering
   - Consider creating an ensemble of stacked models
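
As a rough illustration of the first next step, a linear meta-learner can be read as a learned weighted blend of the base models. The sketch below is not the notebook's code: it uses a closed-form ridge solution in NumPy as a stand-in for scikit-learn's `Ridge`, and the helper `fit_ridge`, the `alpha` value, and the synthetic meta-features are all assumptions made for the example.

```python
import numpy as np

def fit_ridge(X_meta, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X_meta.mean(axis=0), y.mean()
    Xc, yc = X_meta - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def rmse(pred, target):
    return np.sqrt(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
y_demo = rng.normal(loc=4.0, scale=1.0, size=500)   # synthetic target

# Three synthetic "base model" columns: the target plus noise of different sizes
noise_scales = np.array([0.1, 0.2, 0.4])
X_meta_demo = y_demo[:, None] + rng.normal(size=(500, 3)) * noise_scales

weights, intercept = fit_ridge(X_meta_demo, y_demo, alpha=1.0)
blend = X_meta_demo @ weights + intercept

print("weights:", np.round(weights, 3))
print("blend RMSE:", round(rmse(blend, y_demo), 4))
print("best single column RMSE:", round(rmse(X_meta_demo[:, 0], y_demo), 4))
```

The fitted weights are directly inspectable, which makes it easy to see which base model the blend leans on. In the notebook itself, the equivalent experiment would be to fit the already imported `Ridge` on `X_meta_train` and `y`.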