# Simple Blending Ensemble for Classification

Blending and stacking have become increasingly popular methods in Kaggle competitions. Both use the predictions made by several base models as features for a meta-model, which produces the final prediction. In this notebook, we create a simple blended model for Kaggle's "Tabular Playground Series - Sep 2021" competition. The notebook is meant to serve as an example of how to blend models; as such, we use less computationally expensive base models to generate predictions.

Prior to jumping in, it should be noted that there is a subtle difference between model blending and stacking, which can be summarized as follows:

**Stacked Model**: An ensemble of models in which the meta-model is trained on out-of-fold predictions made by the base models during k-fold cross validation. 

**Blended Model**: An ensemble of models in which the meta-model is trained on predictions made by the base models on a holdout dataset (e.g., the validation dataset).
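
The two definitions above differ only in where the meta-model's training features come from. The following is a minimal sketch of that difference, assuming synthetic data from `make_classification` and a single random forest as the base model (neither is part of this notebook's pipeline):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in data (an assumption; not the competition dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
base = RandomForestClassifier(n_estimators=20, random_state=0)

# Stacking: every row gets a prediction from a model that did not see it during fitting.
oof_pred = cross_val_predict(base, X_demo, y_demo, cv=5, method="predict_proba")[:, 1]

# Blending: the base model is fitted once, and only the holdout rows are predicted.
X_tr, X_ho, y_tr, y_ho = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
holdout_pred = base.fit(X_tr, y_tr).predict_proba(X_ho)[:, 1]

print(oof_pred.shape, holdout_pred.shape)
```

With stacking the meta-model can train on as many rows as the training set has, while with blending it only sees the holdout rows.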

**Import Necessary Libraries/Modules**

In [None]:
import numpy as np 
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB
import xgboost as xgb
        
%load_ext skip_kernel_extension_py

**Import Datasets**

In [None]:
#Read data into DataFrames.
train = pd.read_csv("../input/tabular-playground-series-sep-2021/train.csv", index_col=0)
test = pd.read_csv("../input/tabular-playground-series-sep-2021/test.csv", index_col=0)

In [None]:
#Split features and target.
y = train['claim']
X = train.drop('claim', axis=1)

**Exploratory Data Analysis**

In [None]:
#Display samples of DataFrames.
for i in [train, test]:
    display(i.head())

In [None]:
#Display nulls.
print(f'\nRows with NaNs in training set: {train.isnull().any(axis=1).sum()}')
print(f'Columns with NaNs in training set: {train.isnull().any(axis=0).sum()}')
print(f'Rows in training set: {len(train)}')
print(f'\nRows with NaNs in testing set: {test.isnull().any(axis=1).sum()}')
print(f'Columns with NaNs in testing set: {test.isnull().any(axis=0).sum()}')
print(f'Rows in testing set: {len(test)}')

In [None]:
#Examine distribution of target values.
display(y.value_counts())

From the counts above, the two classes are almost equally represented in the training dataset. As such, we don't need to resample the data (e.g., upsampling, downsampling, or SMOTE).
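
One way to put a number on "nearly equal" is to compare class shares rather than raw counts. This is a small sketch on a hypothetical toy target; in the notebook the same calls would be applied to `y`:

```python
import pandas as pd

# Toy target (an assumption for illustration; the notebook's target is y = train['claim']).
y_demo = pd.Series([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])

# Class shares instead of raw counts.
class_share = y_demo.value_counts(normalize=True)

# Ratio of the rarest to the most common class; 1.0 means perfectly balanced.
balance_ratio = class_share.min() / class_share.max()
print(class_share.to_dict(), balance_ratio)
```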

In [None]:
#Create correlation map.
corrmap = train.corr()
corr_mask = np.triu(corrmap)
f, ax = plt.subplots(figsize=(15, 10))
corr_plot = sns.heatmap(corrmap, square=True, mask=corr_mask, cmap="Blues")
plt.show()

We can see that there is no strong linear correlation between any pair of features, nor between any feature and the target. Since the features are not strongly correlated with one another, there is little obviously redundant information, so we keep all of them. This will likely prove more useful than selecting the features with the highest correlations with the target, since none of the features shows a significant correlation in that regard.
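
The heatmap can be complemented with a ranked list of feature-target correlations. The sketch below uses a hypothetical toy frame in which `f1` tracks the target and `f2` is pure noise; keep in mind that Pearson correlation only captures linear relationships:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data (hypothetical): f1 tracks the target, f2 is noise.
target = pd.Series(rng.integers(0, 2, size=500), name="claim")
features = pd.DataFrame({
    "f1": target + rng.normal(0, 0.5, size=500),
    "f2": rng.normal(0, 1, size=500),
})

# Absolute Pearson correlation of each feature with the target, strongest first.
target_corr = features.corrwith(target).abs().sort_values(ascending=False)
print(target_corr)
```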

In [None]:
#Compare distributions for all features.
plt.figure(figsize=(24, 6*(118/4)))
for i in tqdm(range(len(X.columns.tolist()))):
    plt.subplot(30, 4, i+1)
    sns.histplot(X[f'f{i+1}'], kde=True)
    sns.histplot(test[f'f{i+1}'], kde=True, color='green')
plt.show()

The distributions of values for all features appear to be quite similar in the training and testing datasets. As such, there does not appear to be any dataset shift. Additionally, we expect the distribution of target values to be similar between the training and testing datasets.
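
Visual comparison of histograms can be backed by a numerical check. The sketch below uses adversarial validation, a different technique from the plots above: a classifier is trained to tell training rows from testing rows, and an AUC near 0.5 suggests that it cannot. The two arrays are synthetic stand-ins drawn from the same distribution, not the competition data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions): both sets come from the same distribution.
train_demo = rng.normal(size=(500, 5))
test_demo = rng.normal(size=(500, 5))

# Label the origin of each row: 0 = train, 1 = test.
X_adv = np.vstack([train_demo, test_demo])
is_test = np.r_[np.zeros(len(train_demo), dtype=int), np.ones(len(test_demo), dtype=int)]

# Out-of-fold probabilities that a row belongs to the test set.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
adv_pred = cross_val_predict(clf, X_adv, is_test, cv=5, method="predict_proba")[:, 1]

# AUC near 0.5 means the classifier cannot separate the two sets.
adv_auc = roc_auc_score(is_test, adv_pred)
print(f"Adversarial AUC: {adv_auc:.3f}")
```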

**Feature Engineering**

In [None]:
%%skip True

#Create new feature as total NaNs in each row.
#Thank you to BIZEN for this idea.
X['nan_count'] = X.isna().sum(axis=1)
test['nan_count'] = test.isna().sum(axis=1)

In [None]:
%%skip True

#Impute NaNs using SimpleImputer with mean.
imputer = SimpleImputer(strategy='mean')
X = pd.DataFrame(imputer.fit_transform(X), index=X.index, columns=X.columns)
test = pd.DataFrame(imputer.transform(test), index=test.index, columns=test.columns)

X.to_csv('imputed_train_blending.csv',index=False)
test.to_csv('imputed_test_blending.csv', index=False)

Note: The above cells were originally run to fill NaNs while adding an extra feature that records how many NaNs were originally in each observation. To save time, we imported a custom script which allows us to use the '%%skip' cell-magic command to skip the execution of certain cells while preserving the notebook's appearance. Rather than running the above cells, we instead import the pre-imputed data that we saved during the previous execution.
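
To make the effect of the two skipped cells concrete, here is the same sequence on a hypothetical three-row frame. The NaN count has to be computed before imputation, since that information is lost afterwards:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical values) with missing entries.
demo = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [np.nan, np.nan, 6.0]})

# Count NaNs per row before imputing.
demo["nan_count"] = demo.isna().sum(axis=1)

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(strategy="mean")
demo_imputed = pd.DataFrame(imputer.fit_transform(demo), index=demo.index, columns=demo.columns)
print(demo_imputed)
```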

In [None]:
#Load imputed datasets.
test = pd.read_csv("../input/imputed-data-blended/imputed_test_blending.csv")
X = pd.read_csv("../input/imputed-data-blended/imputed_train_blending.csv")
display(X.head())
display(test.head())

In [None]:
#Split data into training and validation sets.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=12345)

### **Build Blending Ensemble Model** 

**Step 1: Create list of models to be used**

Here we make a list of base (aka level-0) models to be used for our blending ensemble. While we believe that blending the output of gradient boosting models would provide the best results, we are choosing these simpler models to save time, as this notebook is just meant to serve as an introduction to blending. In a later notebook, we will stack gradient boosting models using Sklearn's StackingClassifier; we will edit this notebook with a link to the other one in due course.

Note, we are not providing any customized hyperparameter values for these models. If one were to use ensemble blending as their method of choice for a Kaggle competition, then it is highly recommended that one find the optimal hyperparameters for each base model using optuna or hyperopt.
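
As a rough idea of what such a search looks like, the sketch below tunes a random forest with scikit-learn's `RandomizedSearchCV`. This is a stand-in for optuna or hyperopt, not their API, and both the synthetic data and the search space are assumptions chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption; not the competition dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical search space for one base model.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

# Try five random combinations and score each with 3-fold cross-validated AUC.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

The tuned estimator is available as `search.best_estimator_` and could replace the default-parameter model in the list of base models.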

In [None]:
models = [('dtc', DecisionTreeClassifier()),
        ('etc', ExtraTreesClassifier()),
        ('ada', AdaBoostClassifier()),
        ('rf', RandomForestClassifier())]

**Step 2: Create function to fit models**

This function fits each base model to the training dataset and obtains predictions on the validation dataset. Unlike stacking, where one uses out-of-fold predictions as features for the meta-model, along with test-set predictions from models fitted to the whole training set, simple blending only needs the base models' predictions on the validation set as the meta-model's features.

Below, each model's predictions are reshaped to (len(prediction), 1) and appended to a list. The predictions are then stacked, thus forming an array with each individual base-model's predictions as a feature. Finally, we fit a meta-model (in this case an XGBoost classifier) to the array of base-model predictions and the actual validation target values, and we return the fitted model.
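
The reshape-and-stack step can be seen in isolation on a few hypothetical probability vectors:

```python
import numpy as np

# Hypothetical probabilities from three base models on four validation rows.
pred_a = np.array([0.1, 0.8, 0.4, 0.9])
pred_b = np.array([0.2, 0.7, 0.5, 0.6])
pred_c = np.array([0.3, 0.9, 0.2, 0.8])

# Turn each 1-D vector into a single column, then join the columns side by side.
columns = [p.reshape(len(p), 1) for p in (pred_a, pred_b, pred_c)]
meta_features = np.hstack(columns)

print(meta_features.shape)  # (4, 3): one row per observation, one column per base model

# np.column_stack performs both steps at once.
same_result = np.array_equal(meta_features, np.column_stack([pred_a, pred_b, pred_c]))
```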

In [None]:
def fit_models(models, X_train, X_valid, y_train, y_valid):

    #Create variable in which to store predictions for meta-model.
    preds_for_meta = []
    
    #Loop through models in model list.
    for name, model in tqdm(models):
        
        #Fit model and obtain predictions.
        model.fit(X_train, y_train)
        pred = model.predict_proba(X_valid)[:, 1]
        
        #Obtain base model ROC AUC score.
        roc_base = roc_auc_score(y_valid, pred)
        
        print(f'{model} score: {roc_base}')
        
        #Reshape prediction into single-column matrix.
        pred = pred.reshape(len(pred), 1)
        
        #Append prediction to variable for meta-model.
        preds_for_meta.append(pred)
        
    #Create 2D array from predictions.
    meta_features = np.hstack(preds_for_meta)
    
    #Define blender for model.
    meta_model = xgb.XGBClassifier(n_estimators=7000,
                                 #tree_method='gpu_hist', 
                                 #gpu_id = 0,
                                 random_state = 5,
                                 learning_rate=.03)
    
    #Fit meta model on predictions from base models.
    meta_model.fit(meta_features, y_valid.values.ravel(),
                 verbose=False,
                 eval_set=[(meta_features, y_valid.values.ravel())],
                 eval_metric='auc',
                 early_stopping_rounds=300)
    
    print(f'Meta AUC: {roc_auc_score(y_valid, meta_model.predict_proba(meta_features)[:, 1])}')
    
    return meta_model

**Step 3: Create function to make predictions with meta-model on test set**

This function takes as its inputs the list of base models, our meta-model, and the test dataset. We obtain predictions from each base model, reshape the prediction, and append it to a list. The list is then stacked in order to serve as features for the meta-model. Finally, we obtain predictions from the meta-model on the test set, though indirectly by using the base-models' predictions as features. We return the meta-model's predictions.

In [None]:
def meta_predict(models, meta_model, X_test):
    
    #Variable to store base models' test predictions.
    preds_for_meta = []
    
    #Loop through models to make predictions.
    for name, model in tqdm(models):
        
        #Make predictions with base models (class-1 probabilities, matching fit_models).
        pred = model.predict_proba(X_test)[:, 1]
        
        #Reshape prediction.
        pred = pred.reshape(len(pred), 1)
        
        #Append prediction to meta-model prediction list.
        preds_for_meta.append(pred)
        
    #Reshape all predictions into 2D array.
    meta_features = np.hstack(preds_for_meta)
    
    #Make prediction using meta-model.
    meta_preds = meta_model.predict_proba(meta_features)[:, 1]
    
    return meta_preds

**Step 4: Obtain meta-model from fit_models function**

In the following cell, we simply execute our fit_models function, which returns our meta-model.

In [None]:
%%time
meta_model = fit_models(models, X_train, X_valid, y_train, y_valid)

As we can see, the AUC-ROC score for the meta-model is approximately .8682, which is better than any of the individual base models' scores; additionally, this score is not just the average of the base models' scores. Note, however, that this score is computed on the same validation data that the meta-model was fitted on, so it is likely optimistic. While the difference between the meta-model's AUC-ROC score and the base models' scores seems, prima facie, quite small, such a difference may prove significant in the context of Kaggle competitions.
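
Because `fit_models` scores the meta-model on the same rows it was fitted on, a less biased estimate requires setting part of the holdout predictions aside. The sketch below shows one way to do that, assuming synthetic data, two small base models, and a logistic regression as the meta-model (swapped in for XGBoost to keep the example light):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the competition dataset).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)

# Fit base models on the training part and predict the holdout part.
bases = [DecisionTreeClassifier(max_depth=3, random_state=0),
         RandomForestClassifier(n_estimators=30, random_state=0)]
ho_features = np.column_stack([m.fit(X_tr, y_tr).predict_proba(X_ho)[:, 1] for m in bases])

# Split the holdout predictions: one half fits the meta-model, the other half scores it.
F_fit, F_eval, y_fit, y_eval = train_test_split(ho_features, y_ho, test_size=0.5, random_state=0)
meta = LogisticRegression().fit(F_fit, y_fit)

in_sample_auc = roc_auc_score(y_fit, meta.predict_proba(F_fit)[:, 1])
held_out_auc = roc_auc_score(y_eval, meta.predict_proba(F_eval)[:, 1])
print(f"In-sample AUC: {in_sample_auc:.4f} | Held-out AUC: {held_out_auc:.4f}")
```

The held-out figure is the one to compare against the base models' scores; the price is that the meta-model is fitted on fewer rows.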

**Step 5: Obtain prediction from test dataset**

Finally, with our meta-model developed, we simply execute the meta_predict function to obtain the predictions on the test set.

In [None]:
test_pred = meta_predict(models, meta_model, test)

**Submit Prediction**

In [None]:
test = pd.read_csv("/kaggle/input/tabular-playground-series-sep-2021/test.csv", usecols=['id'], low_memory=False)
predictions = pd.DataFrame()
predictions["id"] = test['id']
predictions["claim"] = test_pred
predictions.to_csv('submission.csv', index=False, header=predictions.columns)

## Conclusion

The overall aim of this notebook was to demonstrate how one would build a blending ensemble classification model. This involved preprocessing our data, feature engineering, and creating custom functions to train and evaluate the base models as well as our meta-model. The meta-model was trained on features that each consist of the predictions made by one base model; to avoid long waits, we used simple Sklearn models as our base models. Overall, the results suggest that blending models can provide a greater AUC-ROC score than any of the individual base models.

If one were to use this method for an actual competition, or for business purposes, then it is highly recommended that one try various base models and select those which provide the best results in the context of one's end goal. With the chosen base models, one should attempt to find the optimal hyperparameters for each model; hyperopt or optuna is highly recommended for this. Finally, one should use the optimized base models and try to determine the best hyperparameters for the meta-model.

Thanks for reading through this notebook. Any suggestions for improvement would be much appreciated!