# STACKING
Stacking is a meta-learning technique that involves training a model to combine the predictions of several base models. The idea behind stacking is to leverage the strengths of different models by combining them into an ensemble that is more accurate and robust than any of the individual models.

There are several stacking techniques that have been developed over the years, including:

* Simple Stacking: This is the simplest form of stacking, where the predictions of the base models are combined using a linear regression model.

* Bagging: This is a variation of stacking where the base models are trained on different subsets of the training data, and their predictions are combined using a simple majority vote.

* Boosting: This is another variation of stacking where the base models are trained sequentially, with each model trying to correct the errors of the previous model.

* Blending: This is a technique where the predictions of the base models are combined using a weighted average.

* Stacking with Meta-Features: This is a technique where the predictions of the base models are combined with additional features that are derived from the training data.

* Hierarchical Stacking: This is a technique where the base models are arranged in a hierarchical structure, with each level of the hierarchy combining the predictions of the models at the previous level.

* Multilevel Stacking: This is a technique where the base models are combined using a series of intermediate models, with each intermediate model combining the predictions of the models at the previous level.

## 1. SIMPLE STACKING
This is the simplest form of stacking, where the predictions of the base models are combined using a linear regression model.

In [3]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
import numpy as np

import warnings
warnings.filterwarnings("ignore")

# Load dataset
data = load_breast_cancer()

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

# First-level models
model1 = LogisticRegression(random_state=42)
model2 = RandomForestClassifier(random_state=42)
model3 = SVC(random_state=42, probability=True)

# Create arrays to hold predictions for training and test sets
train_predictions = np.zeros((X_train.shape[0], 3))
test_predictions = np.zeros((X_test.shape[0], 3))

# Use k-fold cross-validation to generate first-level predictions for each model
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for i, (train_idx, val_idx) in enumerate(kf.split(X_train, y_train)):
    X_train_fold = X_train[train_idx]
    y_train_fold = y_train[train_idx]
    X_val_fold = X_train[val_idx]

    # Train each model on the current training fold
    model1.fit(X_train_fold, y_train_fold)
    model2.fit(X_train_fold, y_train_fold)
    model3.fit(X_train_fold, y_train_fold)

    # Generate predictions for the current validation fold
    train_predictions[val_idx, 0] = model1.predict_proba(X_val_fold)[:, 1]
    train_predictions[val_idx, 1] = model2.predict_proba(X_val_fold)[:, 1]
    train_predictions[val_idx, 2] = model3.predict_proba(X_val_fold)[:, 1]

# Fit second-level model on first-level predictions
meta_model = LogisticRegression(random_state=42)
meta_model.fit(train_predictions, y_train)

# Generate test set predictions using first-level models and second-level model
test_predictions[:, 0] = model1.predict_proba(X_test)[:, 1]
test_predictions[:, 1] = model2.predict_proba(X_test)[:, 1]
test_predictions[:, 2] = model3.predict_proba(X_test)[:, 1]
meta_predictions = meta_model.predict(test_predictions)

# Evaluate accuracy of final predictions
accuracy = accuracy_score(y_test, meta_predictions)
print(f"Accuracy: {accuracy:.3f}")

Accuracy: 0.977


## 2. BAGGING

In [40]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Load the breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Define the base models
base_models = [RandomForestClassifier(n_estimators=10, random_state=42),
               SVC(random_state=42, probability=True),
               LogisticRegression(random_state=42),
               DecisionTreeClassifier(random_state=42),
               HistGradientBoostingClassifier(random_state=42)]

# Define the meta model
meta_model = LogisticRegression(random_state=42)

# Define the number of folds for cross-validation
n_folds = 5

# Split the data into training and test sets
split_index = int(0.8 * len(X))
X_train, y_train = X[:split_index], y[:split_index]
X_test, y_test = X[split_index:], y[split_index:]

# Define the K-Fold cross-validator
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

# Fit the base models using bagging
bagged_preds = []
for i, model in enumerate(base_models):
    fold_preds = []
    for train_index, val_index in kf.split(X_train):
        X_train_fold, X_val_fold = X_train[train_index], X_train[val_index]
        y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]
        
        # Fit the model to the bagged training set
        bag_indices = np.random.choice(train_index, size=int(0.8*len(train_index)), replace=True)
        X_bag, y_bag = X_train[bag_indices], y_train[bag_indices]
        model.fit(X_bag, y_bag)
        
        # Make predictions on the validation set
        fold_preds.append(model.predict(X_val_fold))
        
    # Concatenate the predictions vertically to form a new feature set
    fold_preds = np.concatenate(fold_preds, axis=0)
    bagged_preds.append(fold_preds)
    
# Concatenate the base model predictions horizontally to form a new training set
X_train_meta = np.concatenate(bagged_preds, axis=0).reshape((len(X_train),-1))

# Fit the meta model using the stacked data
meta_model.fit(X_train_meta, y_train)

# Make predictions on the test data using the meta-model
bagged_test_preds = []
for model in base_models:
    bagged_test_preds.append(model.predict(X_test))

X_test_meta = np.concatenate(bagged_test_preds, axis=0).reshape((len(X_test), -1))
test_preds = meta_model.predict(X_test_meta)

# Calculate the accuracy score
accuracy = accuracy_score(y_test, test_preds)
print(f"Accuracy: {accuracy}")

Accuracy: 0.7368421052631579


## Why simple stacking outperforms bagging

The difference in accuracy between bagging and simple stacking could be due to several reasons. Here are a few possibilities:

* Data variability: Simple stacking is better suited to situations where there is significant variability in the data, meaning that different models may perform better on different parts of the data. Bagging is better suited to situations where the models are very similar and the data is relatively stable. It could be that the breast cancer dataset does not have enough variability to benefit from bagging.

* Model selection: The choice of models used in the ensemble can greatly affect its performance. It could be that the models used in the bagging ensemble are not well-suited to the problem at hand, or that they are too similar to each other to provide any benefit.

* Ensemble size: The size of the ensemble can also affect its performance. It could be that the bagging ensemble is too small to provide any benefit, or that the simple stacking ensemble is too large and prone to overfitting.

* Hyperparameters: The hyperparameters of the models used in the ensemble can greatly affect its performance. It could be that the hyperparameters used in the bagging ensemble are not well-tuned, or that the simple stacking ensemble is overfitting due to hyperparameters that are too flexible.

## 3. Boosting
I must point out that boosting is not a stacking technique, but rather a type of ensemble learning method that combines weak models to create a strong model. However, we can still implement it using scikit-learn.

In [43]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# Load the dataset
data = load_breast_cancer()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Create the base estimator
# DecisionTreeClassifier(max_depth=1) is the default base classifer in 'AdaClassifier'
base_estimator = DecisionTreeClassifier(max_depth=2, random_state=42)

# Create the AdaBoost classifier
ada_boost = AdaBoostClassifier(base_estimator=base_estimator, n_estimators=10, random_state=42)

# Fit the AdaBoost classifier on the training data
ada_boost.fit(X_train, y_train)

# Make predictions on the test data
y_pred = ada_boost.predict(X_test)

"""
In this code, we first load the breast cancer dataset and split it into training and testing sets. 
Then, we create a Decision Tree classifier as the base estimator and an AdaBoost classifier with 10 estimators. 
We fit the AdaBoost classifier on the training data and make predictions on the test data. 
Finally, we evaluate the model's accuracy.
"""

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9649122807017544


## 4. BLENDING
Blending is another stacking technique that involves training multiple base models on the original training data, and then training a meta-model on the predictions made by the base models on a holdout validation set. The meta-model is then used to make predictions on the test data.

In [45]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Load the dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the base models
models = [
    LogisticRegression(random_state=42),
    DecisionTreeClassifier(random_state=42),
    KNeighborsClassifier(n_neighbors=3)
]

# Train the base models on the training set
for model in models:
    model.fit(X_train, y_train)

# Create the validation set by making predictions on the training set
X_val = np.zeros((len(X_train), len(models)))
for i, model in enumerate(models):
    X_val[:, i] = model.predict_proba(X_train)[:, 1]

# Train the meta-model on the validation set
meta_model = LogisticRegression(random_state=42)
meta_model.fit(X_val, y_train)

# Make predictions on the test data using the base models
base_preds = np.zeros((len(X_test), len(models)))
for i, model in enumerate(models):
    base_preds[:, i] = model.predict_proba(X_test)[:, 1]

# Make predictions on the test data using the meta-model
test_preds = meta_model.predict_proba(base_preds)[:, 1]

# Evaluate the performance of the stacking model

"""
In this example, we define three base models: a logistic regression model, a decision tree model, 
and a k-nearest neighbors model. 
We then train each of these models on the training data. 
Next, we create a validation set by making predictions on the training data using each of the base models. 
We concatenate these predictions horizontally to create a new training set for the meta-model. 
We train a logistic regression meta-model on this new training set.
"""
acc = accuracy_score(y_test, test_preds.round())
print("Accuracy:", acc)

Accuracy: 0.9473684210526315


## 5. STACKING WITH META FEATURES
In stacking with meta-features, the meta-features are additional features that are engineered from the predictions of the base models on the training data, and are then used as input to the meta-model along with the original features. These meta-features are typically created using out-of-fold predictions from the base models on the training data, and can include statistics such as mean, standard deviation, and maximum/minimum of the predictions, or even the original features concatenated with the base model predictions. The idea is to give the meta-model additional information about the base model predictions that it can use to make more accurate final predictions on the test data.

In [47]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
import numpy as np

# Load the breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Define the base models
base_models = [
    RandomForestClassifier(random_state=42),
    GradientBoostingClassifier(random_state=42)
]

# Define the meta-model
meta_model = LogisticRegression()

# Split the data into training and testing sets
split = int(len(X) * 0.7)
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# Train the base models using K-fold cross-validation
k_folds = KFold(n_splits=5, shuffle=True, random_state=42)

base_preds = []
for model in base_models:
    fold_preds = []
    for train_idx, val_idx in k_folds.split(X_train, y_train):
        X_train_fold, y_train_fold = X_train[train_idx], y_train[train_idx]
        X_val_fold, y_val_fold = X_train[val_idx], y_train[val_idx]
        model.fit(X_train_fold, y_train_fold)
        fold_preds.append(model.predict_proba(X_val_fold))
    base_preds.append(np.concatenate(fold_preds))

# Concatenate the base model predictions horizontally to form a new training set
X_train_meta = np.concatenate(base_preds, axis=1)

# Train the meta-model on the new training set
meta_model.fit(X_train_meta, y_train)

# Make predictions on the test data using the base models
base_test_preds = []
for model in base_models:
    model.fit(X_train, y_train)
    base_test_preds.append(model.predict_proba(X_test))

# Concatenate the base model test predictions horizontally to form a new test set
X_test_meta = np.concatenate(base_test_preds, axis=1)

# Make predictions on the test data using the meta-model
test_preds = meta_model.predict(X_test_meta)

# Calculate the accuracy of the stacking model

"""
In this example, we define two base models (Random Forest and Gradient Boosting), 
and a meta-model (Logistic Regression).
We then split the data into training and testing sets, and train the base models using K-fold cross-validation. 
We concatenate the base model predictions horizontally to form a new training set, 
and train the meta-model on this new set.
Finally, we make predictions on the test data using the base models, 
concatenate the predictions horizontally to form a new test set, 
and make predictions on this new set using the meta-model. 
The accuracy of the stacking model is calculated using the accuracy_score function from scikit-learn.
"""
accuracy = accuracy_score(y_test, test_preds)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9122807017543859


## 6. Hierarchical Stacking

In [62]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Load the breast cancer dataset
data = load_breast_cancer()

X, y = data.data, data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# First level base models
base_models1 = [
    RandomForestClassifier(n_estimators=50, random_state=42),
    GradientBoostingClassifier(random_state=42)
]

# First level blending model
blending_model1 = LogisticRegression()

# Second level base models
base_models2 = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    GradientBoostingClassifier(n_estimators=50, random_state=42)
]

# Second level blending model
blending_model2 = LogisticRegression()

# First level predictions
level1_preds = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(X_train):
    X_train_fold, X_val_fold = X_train[train_idx], X_train[val_idx]
    y_train_fold, y_val_fold = y_train[train_idx], y_train[val_idx]
    
    preds = []
    for model in base_models1:
        model.fit(X_train_fold, y_train_fold)
        preds.append(model.predict(X_val_fold))
        
    level1_preds.append(np.column_stack(preds))

# Concatenate the level 1 predictions horizontally to form a new training set
X_train_meta1 = np.concatenate(level1_preds, axis=1).reshape((len(X_train), -1))

# Fit the blending model on the level 1 predictions
blending_model1.fit(X_train_meta1, y_train)

# Second level predictions
level2_preds = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(X_train_meta1):
    X_train_fold, X_val_fold = X_train_meta1[train_idx], X_train_meta1[val_idx]
    y_train_fold, y_val_fold = y_train[train_idx], y_train[val_idx]
    
    preds = []
    for model in base_models2:
        model.fit(X_train_fold, y_train_fold)
        preds.append(model.predict(X_val_fold))
        
    level2_preds.append(np.column_stack(preds))

# Concatenate the level 2 predictions horizontally to form a new training set
X_train_meta2 = np.concatenate(level2_preds, axis=1).reshape((len(X_train), -1))

# Fit the blending model on the level 2 predictions
blending_model2.fit(X_train_meta2, y_train)

# # Make predictions on the test data
# level1_test_preds = []
# for model in base_models1:
#     model.fit(X_train, y_train)
#     level1_test_preds.append(model.predict(X_test))

# X_test_meta1 = np.column_stack(level1_test_preds)

# test_preds = blending_model1.predict(X_test_meta1)

# Predictions on the test data
test_level_0 = np.empty((X_test.shape[0], len(base_models1)))
for i, model in enumerate(base_models1):
    test_level_0[:, i] = model.predict(X_test)
    
test_level_1 = np.empty((X_test.shape[0], len(base_models2)))
for i, model in enumerate(base_models2):
    test_level_1[:, i] = model.predict(test_level_0)

# Meta model prediction
# test_preds = blending_model2.predict(test_level_1)
# blending_pred1 = blending_model1.predict_proba(test_x)[:, 1]
# blending_pred2 = blending_model2.predict_proba(test_x)[:, 1]
# test_preds = (blending_pred1 + blending_pred2) / 2

accuracy = accuracy_score(y_test, test_preds)
print(f"Accuracy: {accuracy}")

Accuracy: 0.6228070175438597


## 7. Multilevel Stacking

In [72]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Initialize models for level 0
level0_models = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    RandomForestClassifier(n_estimators=200, random_state=42),
    RandomForestClassifier(n_estimators=300, random_state=42),
    RandomForestClassifier(n_estimators=400, random_state=42),
]

# Initialize models for level 1
level1_models = [
    LogisticRegression(random_state=42),
    LogisticRegression(C=0.5, random_state=42),
    LogisticRegression(C=0.8, random_state=42),
    LogisticRegression(C=1.0, random_state=42),
]

# Initialize model for level 2 (meta-model)
meta_model = LogisticRegression(random_state=42)

# Split data into train and test sets
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Initialize arrays to store base model predictions and meta features
level0_train_preds = np.zeros((X.shape[0], len(level0_models)))
level0_test_preds = np.zeros((X.shape[0], len(level0_models)))
level1_train_preds = np.zeros((X.shape[0], len(level1_models)))
level1_test_preds = np.zeros((X.shape[0], len(level1_models)))

# Train base models and generate base model predictions for level 0
for i, model in enumerate(level0_models):
    print(f'Training level 0 model {i+1}/{len(level0_models)}')
    for train_idx, val_idx in kf.split(X):
        X_train, y_train = X[train_idx], y[train_idx]
        X_val, y_val = X[val_idx], y[val_idx]

        model.fit(X_train, y_train)

        # Store base model predictions
        level0_train_preds[val_idx, i] = model.predict(X_val)
        level0_test_preds[:, i] += model.predict(X) / kf.n_splits

# Generate meta features for level 1 using base model predictions from level 0
for i, model in enumerate(level1_models):
    print(f'Training level 1 model {i+1}/{len(level1_models)}')
    for train_idx, val_idx in kf.split(X):
        X_train, y_train = level0_train_preds[train_idx], y[train_idx]
        X_val, y_val = level0_train_preds[val_idx], y[val_idx]

        model.fit(X_train, y_train)

        # Store meta features
        level1_train_preds[val_idx, i] = model.predict(X_val)
        level1_test_preds[:, i] += model.predict(level0_test_preds) / kf.n_splits

# Train meta-model using meta features from level 1
meta_model.fit(level1_train_preds, y)

# Make predictions on test data using meta-model and meta features from level 1
test_preds = meta_model.predict(level1_test_preds)

# Calculate accuracy score
print(f'Accuracy: {accuracy_score(y, meta_model.predict(level1_train_preds))}')

Training level 0 model 1/4
Training level 0 model 2/4
Training level 0 model 3/4
Training level 0 model 4/4
Training level 1 model 1/4
Training level 1 model 2/4
Training level 1 model 3/4
Training level 1 model 4/4
Accuracy: 0.9630931458699473


# Inference
If the test set is in a separate file, you can load it into a pandas DataFrame or a numpy array, and then preprocess it in the same way as you did for the training set. Once you have preprocessed the test set, you can use the base models trained in the first stage of the Multilevel Stacking to generate predictions for the test set. These predictions will then be used as input to the second stage of the stacking, where the meta-model will generate the final predictions.

To use the pre-trained base models for prediction on the test set, you can simply call the predict method of each base model with the preprocessed test set as input. Once you have obtained the base model predictions for the test set, you can concatenate them horizontally and feed them to the meta-model for final prediction.

In [89]:
# base model 0
test_level0 = np.zeros((X_test.shape[0], len(level0_models)))
for i, model in enumerate(level0_models):
    print(f'Test level 0 model {i+1}/{len(level0_models)}')
    test_level0[:, i] += model.predict(X_test)
    
# base model 1
test_level1 = np.zeros((X_test.shape[0], len(level1_models)))
for i, model in enumerate(level1_models):
    print(f'Test level 1 model {i+1}/{len(level1_models)}')
    test_level1[:, i] += model.predict(test_level0)
    
    
test_preds = meta_model.predict(test_level1)
# Calculate accuracy score
print(f'Accuracy: {accuracy_score(y, meta_model.predict(level1_train_preds))}')

Test level 0 model 1/4
Test level 0 model 2/4
Test level 0 model 3/4
Test level 0 model 4/4
Test level 1 model 1/4
Test level 1 model 2/4
Test level 1 model 3/4
Test level 1 model 4/4
Accuracy: 0.9630931458699473
