# Stacking with Two Levels (Base: 5 models) for TPS Sep 2021
In this notebook, I use a 2-level stacking model to predict claim probabilities for insurance policies (Tabular Playground Series, Sep 2021). Five base models form the first level, and a meta-learner trained on their out-of-fold predictions forms the second.

The levels are:
1. RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, HistGradientBoostingClassifier, and GaussianNB
2. XGBClassifier
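
As a rough illustration of the two-level idea (not the code used in this notebook), scikit-learn's `StackingClassifier` wires base models and a meta-learner together and handles the out-of-fold predictions internally. The sketch below runs on synthetic data from `make_classification`, uses only two of the base models, and swaps in `LogisticRegression` as the meta-learner so that it runs quickly; all of these choices are assumptions for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the competition data (assumption for the demo)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Level 1: base models; Level 2: meta-learner trained on their out-of-fold probabilities
stack_demo = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
        ("gnb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
    cv=5,
)
stack_demo.fit(X_tr, y_tr)

# Probability of the positive class for each validation row
proba_demo = stack_demo.predict_proba(X_va)[:, 1]
print(proba_demo[:5])
```

The rest of the notebook builds the same structure by hand, which makes each step (out-of-fold prediction, concatenation, meta-learner fit) explicit.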

## Imports

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# organization
from sklearn.pipeline import Pipeline

# data preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# models
# Level 1
# the experimental import below is only needed on scikit-learn versions before 1.0
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier, HistGradientBoostingClassifier)
from sklearn.naive_bayes import GaussianNB

# Level 2
from xgboost import XGBClassifier

# Cross-validation and out-of-folds prediction
from sklearn.model_selection import KFold

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


## Data Imports and Preprocessing

In [None]:
train_data = pd.read_csv("../input/tabular-playground-series-sep-2021/train.csv")
test_data = pd.read_csv("../input/tabular-playground-series-sep-2021/test.csv")

In [None]:
features = [col for col in train_data.columns if col != "claim" and col != "id"] # keeping track of the features
X = train_data[features]
X_test = test_data[features]
y = train_data["claim"]
X.head()

In [None]:
# preprocessing of data
preprocessor = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy = "mean")),
    ("scaler", StandardScaler())
])

imputed_X = pd.DataFrame(preprocessor.fit_transform(X))
imputed_X_test = pd.DataFrame(preprocessor.transform(X_test))  # do not overwrite X here; its column names are needed below

imputed_X.columns = X.columns
imputed_X_test.columns = X_test.columns

X = imputed_X
X_test = imputed_X_test

In [None]:
# check that no missing values remain after imputation (both counts should be 0)
print(X.isna().sum().sum())
print(X_test.isna().sum().sum())
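
A quick way to see what the preprocessing pipeline does is to run it on a tiny frame. The sketch below uses a hypothetical two-column frame (`toy`, `f1`, and `f2` are made-up names) with one missing value per column, then counts the missing values that remain.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({"f1": [1.0, 2.0, np.nan, 3.0], "f2": [10.0, np.nan, 30.0, 20.0]})

toy_preprocessor = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# The transformers return a plain array, so the column names are restored by hand
toy_clean = pd.DataFrame(toy_preprocessor.fit_transform(toy), columns=toy.columns)

# Total number of missing values left after imputation
n_missing = int(toy_clean.isna().sum().sum())
print(n_missing)  # 0

# Each column is centred at (approximately) zero after scaling
print(toy_clean.mean().abs().max() < 1e-9)
```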

## Training and Levels
I will train a 2-level model in this section. The levels will be:
1. Level 1:
    * RandomForestClassifier
    * ExtraTreesClassifier
    * AdaBoostClassifier
    * HistGradientBoostingClassifier
    * GaussianNB
2. Level 2:
    * XGBClassifier

In [None]:
# see references for these functions

# Parameters to be used later
ntrain = X.shape[0]
ntest = X_test.shape[0]
SEED = 0 # for reproducibility
NFOLDS = 5 # folds for out-of-fold prediction
kf = KFold(n_splits= NFOLDS)

# Class to extend the sklearn classifier
class SklearnHelper(object):
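    def __init__(self, clf, seed=0, params=None):
        # copy so the caller's dict is not modified; also handles params=None
        params = dict(params or {})
        if clf is not GaussianNB:  # GaussianNB does not accept random_state
            params['random_state'] = seed
        self.clf = clf(**params)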

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        # return the probability of the positive class (claim = 1), not a hard label
        return self.clf.predict_proba(x)[:, 1]
    
    def fit(self,x,y):
        return self.clf.fit(x,y)
    
    def feature_importances(self,x,y):
        print(self.clf.fit(x,y).feature_importances_)

# function prevents train-test contamination (with edits to make it up-to-date)
def get_oof(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(x_train, y_train)):
        # KFold yields positional indices, so select rows by position with .iloc
        x_tr = x_train.iloc[train_index]
        y_tr = y_train.iloc[train_index]
        x_te = x_train.iloc[test_index]

        clf.train(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)

    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)
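
To make the out-of-fold idea concrete, the sketch below runs the same kind of loop on a small synthetic dataset, with `LogisticRegression` as a stand-in model and positive-class probabilities as the stored predictions (all assumptions for the demo). Every training row receives exactly one prediction, from a model that was not fitted on it, and the test predictions are averaged over the folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data: the first 100 rows act as "train", the last 20 as "test"
X_all, y_all = make_classification(n_samples=120, n_features=5, random_state=0)
X_tr_demo, y_tr_demo, X_te_demo = X_all[:100], y_all[:100], X_all[100:]

n_folds = 5
kf_demo = KFold(n_splits=n_folds)

oof_demo = np.zeros(len(X_tr_demo))
times_predicted = np.zeros(len(X_tr_demo), dtype=int)
test_preds_per_fold = np.empty((n_folds, len(X_te_demo)))

for i, (fit_idx, holdout_idx) in enumerate(kf_demo.split(X_tr_demo)):
    model = LogisticRegression().fit(X_tr_demo[fit_idx], y_tr_demo[fit_idx])
    # Held-out rows are predicted by a model that was not fitted on them
    oof_demo[holdout_idx] = model.predict_proba(X_tr_demo[holdout_idx])[:, 1]
    times_predicted[holdout_idx] += 1
    test_preds_per_fold[i] = model.predict_proba(X_te_demo)[:, 1]

# Average the per-fold test predictions into a single column
test_demo = test_preds_per_fold.mean(axis=0)

print(times_predicted.min(), times_predicted.max())  # 1 1
print(oof_demo.reshape(-1, 1).shape, test_demo.reshape(-1, 1).shape)  # (100, 1) (20, 1)
```

Because the level 2 model only ever sees predictions made on held-out rows, it cannot learn from labels that the level 1 models have already memorised.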

## Level 1

In [None]:
# Parameters for level 1 models

# Random Forest Parameters
rf_params = {
    'n_jobs': -1,
    'n_estimators': 250,
    'max_depth': 6,
    'min_samples_leaf': 20,
    'max_features' : 'sqrt',
    'verbose': 0
}

# Extra Trees Parameters
et_params = {
    'n_jobs': -1,
    'n_estimators': 250,
    'max_depth': 8,
    'min_samples_leaf': 20,
    'verbose': 0
}


# AdaBoost parameters
ada_params = {
    'n_estimators': 150,
    'learning_rate' : 0.75
}

# Histogram Gradient Boosting parameters
hgb_params = {
    'max_iter': 250,
    'max_depth': 5,
    'min_samples_leaf': 2,
    'verbose': 0
}

# Gaussian Naive-Bayes Classifier parameters (none needed)
gnb_params = {}

In [None]:
# Level 1 Model Creation
rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)
print("RandomForest model created")
et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)
print("ExtraTrees model created")
ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)
print("AdaBoost model created")
hgb = SklearnHelper(clf=HistGradientBoostingClassifier, seed=SEED, params=hgb_params)
print("HistGradientBoosting model created")
gnb = SklearnHelper(clf=GaussianNB, params=gnb_params)
print("GaussianNB model created")

In [None]:
# Level 1 Model Training
rf_oof_train, rf_oof_test = get_oof(rf, X, y, X_test) # Random Forest
print("Random Forest training done")
et_oof_train, et_oof_test = get_oof(et, X, y, X_test) # Extra Trees
print("Extra Trees training done")
ada_oof_train, ada_oof_test = get_oof(ada, X, y, X_test) # AdaBoost
print("AdaBoost training done")
hgb_oof_train, hgb_oof_test = get_oof(hgb, X, y, X_test) # Histogram Gradient Boost
print("HistGradientBoosting training done")
gnb_oof_train, gnb_oof_test = get_oof(gnb, X, y, X_test) # Gaussian Naive Bayes
print("GaussianNB training done")

## Set-up for Level 2
From here, we need to combine the out-of-fold prediction arrays into one feature matrix for train and one for test, with one column per level 1 model.

In [None]:
X_final = np.concatenate(( rf_oof_train, et_oof_train, ada_oof_train, hgb_oof_train, gnb_oof_train), axis=1)
X_test_final = np.concatenate(( rf_oof_test, et_oof_test, ada_oof_test, hgb_oof_test, gnb_oof_test), axis=1)
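
The shape of the result is easiest to see on a tiny example. The sketch below uses three hypothetical prediction columns (`model_a`, `model_b`, `model_c` are made-up names) with four rows each.

```python
import numpy as np

# Hypothetical out-of-fold columns from three base models, four rows each
model_a = np.array([0.1, 0.8, 0.4, 0.9]).reshape(-1, 1)
model_b = np.array([0.2, 0.7, 0.5, 0.6]).reshape(-1, 1)
model_c = np.array([0.0, 1.0, 0.3, 0.8]).reshape(-1, 1)

# Stack the columns side by side: one row per sample, one column per base model
stacked = np.concatenate((model_a, model_b, model_c), axis=1)
print(stacked.shape)  # (4, 3)
print(stacked[1])     # predictions of all three models for the second sample
```

With the five level 1 models used here, `X_final` therefore has five columns, and each row holds what every base model predicted for that policy.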

## Level 2
### Final Model with XGBoost

In [None]:
stacked_model = XGBClassifier(
    n_estimators= 2000,
    objective= 'binary:logistic',
    n_jobs = -1,
    learning_rate = 0.01)

In [None]:
stacked_model.fit(X_final, y)

## Final Predictions and Submission
The stacked model is already trained, so here I'll generate the final predictions on the test set and create the submission file.

In [None]:
# submit the probability of a claim (class 1) rather than a hard 0/1 label
predictions = stacked_model.predict_proba(X_test_final)[:, 1]
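
When probabilities are wanted, the difference between `predict` and `predict_proba` matters. The sketch below uses `LogisticRegression` on synthetic data as a stand-in for the stacked model (an assumption to keep the demo light); `XGBClassifier` exposes the same two methods through its scikit-learn interface.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and a stand-in classifier (assumptions for the demo)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf_demo = LogisticRegression().fit(X_demo, y_demo)

labels = clf_demo.predict(X_demo[:5])             # hard class labels (0 or 1)
probs = clf_demo.predict_proba(X_demo[:5])[:, 1]  # probability of class 1

print(labels)
print(probs.round(3))
```

`predict_proba` returns one column per class, so `[:, 1]` selects the column for the positive class.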

In [None]:
submission = pd.read_csv("../input/tabular-playground-series-sep-2021/sample_solution.csv")
submission.claim = predictions
submission.to_csv("submission.csv", index=False, header=True)
print("Final submission created!")

## References
I referred to the [Introduction to Ensembling/Stacking in Python](https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python) notebook by Anisotropic for the SklearnHelper class and the get_oof function (for out-of-fold predictions).