## <center>Tabular Playground Series - Sep 2021</center>
### <center>Stacking solution (LightGBM + CatBoost + XGBoost)</center>

This notebook contains a full solution for building a stacking pipeline and evaluating its predictions.  
In this competition we predict whether a customer made a claim on an insurance policy.

#### Dataset:
The dataset used for this competition is synthetic (generated using a CTGAN) but based on a real dataset. The original dataset deals with predicting whether a claim will be made on an insurance policy.
* 'f1' - 'f118': continuous features
* 'claim': binary-valued target; a prediction may be any number from 0.0 to 1.0, representing the probability of a claim.

### Import libraries

In [None]:
import os
import time
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting
import seaborn as sns # plotting
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, RobustScaler
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import Pool 
from sklearn import metrics
import shap
from tqdm import tqdm

pd.set_option('display.max_rows', 120)
pd.set_option('display.max_columns', 200)

SEED = 91 # random seed

# 1. Load data and first look

In [None]:
PATH = '/kaggle/input/tabular-playground-series-sep-2021/' # you can use your own local path

print(f"Files in directory {PATH.split('/')[-2]}:")
for _, _, filenames in os.walk(PATH):
    for filename in filenames:
        print('  '+os.path.join(filename))

In [None]:
try:
    df_train = pd.read_csv(PATH+'train.csv', index_col=0)
    df_test = pd.read_csv(PATH+'test.csv', index_col=0)
    print('All data has been loaded successfully!')
except Exception as err:
    print(repr(err))

In [None]:
#df_train = df_train.sample(frac=0.15, random_state=SEED)

In [None]:
full_length_data = len(df_train) + len(df_test)
print(f"train: {len(df_train)} ({100*len(df_train)/full_length_data:.0f}%)")
print(f"test:  {len(df_test)} ({100*len(df_test)/full_length_data:.0f}%)")

In [None]:
df_train.info()

In [None]:
df_test.info()

Check for missing values

In [None]:
df_train.isnull().sum().sort_values(ascending=False)

In [None]:
print(f"claim = 0: {len(df_train[df_train['claim'] == 0])} ({100*len(df_train[df_train['claim'] == 0])/len(df_train):.2f}%)")
print(f"claim = 1: {len(df_train[df_train['claim'] == 1])} ({100*len(df_train[df_train['claim'] == 1])/len(df_train):.2f}%)")
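
The same class balance can also be read off in a single call. A minimal sketch on a toy frame (the `toy` data is invented for illustration and stands in for `df_train`):

```python
import pandas as pd

# Toy stand-in for df_train; the values are invented for illustration.
toy = pd.DataFrame({'claim': [0, 1, 1, 0, 1, 0, 0, 1]})

# Share of each class as a percentage, ordered by class label.
balance = toy['claim'].value_counts(normalize=True).sort_index() * 100
print(balance.round(2))
```

A roughly even split like this suggests that plain accuracy is not badly misleading, although ROC-AUC remains the metric used for scoring in this notebook.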

# 2. Data preprocessing

### Feature engineering

In [None]:
def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df_new = df.copy()
    # Compare by equality: `x not in 'claim'` would be a substring test
    features = [x for x in df.columns.values if x != 'claim']

    df_new['num_missing'] = df_new[features].isna().sum(axis=1)
    df_new['num_missing_std'] = df_new[features].isna().std(axis=1).astype('float')
    df_new['abs_sum'] = df_new[features].abs().sum(axis=1)
    df_new['median'] = df_new[features].median(axis=1)
    df_new['std'] = df_new[features].std(axis=1)
    df_new['min'] = df_new[features].abs().min(axis=1)
    df_new['max'] = df_new[features].abs().max(axis=1)
    #df_new['sem'] = df_new[features].sem(axis=1)
    df_new['avg'] = df_new[features].mean(axis=1)
    
    return df_new

### Preprocess nan values

Idea taken from www.kaggle.com/dlaststark/tps-sep-single-xgboost-model  

I have modified the choices using the following rationale:
* Mean: normal distribution  
* Median: unimodal and skewed  
* Mode: all other cases  
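
The rationale above can be partly automated. The sketch below is only a rough heuristic, not the procedure used to build the dictionary in the next cell: `suggest_fill` and its `skew_tol` cut-off are hypothetical, and it only separates the mean and median cases by sample skewness. Spotting the "all other cases" group (for example multimodal features) still needs a look at the histograms.

```python
import numpy as np
import pandas as pd

def suggest_fill(col: pd.Series, skew_tol: float = 0.5) -> str:
    """Hypothetical helper: 'Mean' for roughly symmetric data, else 'Median'."""
    return 'Mean' if abs(col.skew()) < skew_tol else 'Median'

rng = np.random.default_rng(0)
symmetric = pd.Series(rng.normal(size=5000))     # close to a normal distribution
skewed = pd.Series(rng.exponential(size=5000))   # unimodal with a long right tail

print(suggest_fill(symmetric), suggest_fill(skewed))
```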

In [None]:
def preprocess_na(df: pd.DataFrame) -> pd.DataFrame:
    df_new = df.copy()
    features = df.columns.tolist()
    
    fill_value_dict = {
        'f1': 'Mean', 
        'f2': 'Median', 
        'f3': 'Median', 
        'f4': 'Median', 
        'f5': 'Mode', 
        'f6': 'Mean', 
        'f7': 'Median', 
        'f8': 'Median', 
        'f9': 'Median', 
        'f10': 'Median', 
        'f11': 'Mean', 
        'f12': 'Median', 
        'f13': 'Mean', 
        'f14': 'Median', 
        'f15': 'Mean', 
        'f16': 'Median', 
        'f17': 'Median', 
        'f18': 'Median', 
        'f19': 'Median', 
        'f20': 'Median', 
        'f21': 'Median', 
        'f22': 'Mean', 
        'f23': 'Mode', 
        'f24': 'Median', 
        'f25': 'Median', 
        'f26': 'Median', 
        'f27': 'Median', 
        'f28': 'Median', 
        'f29': 'Mode', 
        'f30': 'Median', 
        'f31': 'Median', 
        'f32': 'Median', 
        'f33': 'Median', 
        'f34': 'Mean', 
        'f35': 'Median', 
        'f36': 'Mean', 
        'f37': 'Median', 
        'f38': 'Median', 
        'f39': 'Median', 
        'f40': 'Mode', 
        'f41': 'Median', 
        'f42': 'Mode', 
        'f43': 'Mean', 
        'f44': 'Median', 
        'f45': 'Median', 
        'f46': 'Mean', 
        'f47': 'Mode', 
        'f48': 'Mean', 
        'f49': 'Mode', 
        'f50': 'Mode', 
        'f51': 'Median', 
        'f52': 'Median', 
        'f53': 'Median', 
        'f54': 'Mean', 
        'f55': 'Mean', 
        'f56': 'Mode', 
        'f57': 'Mean', 
        'f58': 'Median', 
        'f59': 'Median', 
        'f60': 'Median', 
        'f61': 'Median', 
        'f62': 'Median', 
        'f63': 'Median', 
        'f64': 'Median', 
        'f65': 'Mode', 
        'f66': 'Median', 
        'f67': 'Median', 
        'f68': 'Median', 
        'f69': 'Mean', 
        'f70': 'Mode', 
        'f71': 'Median', 
        'f72': 'Median', 
        'f73': 'Median', 
        'f74': 'Mode', 
        'f75': 'Mode', 
        'f76': 'Mean', 
        'f77': 'Mode', 
        'f78': 'Median', 
        'f79': 'Mean', 
        'f80': 'Median', 
        'f81': 'Mode', 
        'f82': 'Median', 
        'f83': 'Mode', 
        'f84': 'Median', 
        'f85': 'Median', 
        'f86': 'Median', 
        'f87': 'Median', 
        'f88': 'Median', 
        'f89': 'Median', 
        'f90': 'Mean', 
        'f91': 'Mode', 
        'f92': 'Median', 
        'f93': 'Median', 
        'f94': 'Median', 
        'f95': 'Median', 
        'f96': 'Median', 
        'f97': 'Mean', 
        'f98': 'Median', 
        'f99': 'Median', 
        'f100': 'Mode', 
        'f101': 'Median', 
        'f102': 'Median', 
        'f103': 'Median', 
        'f104': 'Median', 
        'f105': 'Median', 
        'f106': 'Median', 
        'f107': 'Median', 
        'f108': 'Median', 
        'f109': 'Mode', 
        'f110': 'Median', 
        'f111': 'Median', 
        'f112': 'Median', 
        'f113': 'Mean', 
        'f114': 'Median', 
        'f115': 'Median', 
        'f116': 'Mode', 
        'f117': 'Median', 
        'f118': 'Mean'
    }


    for col in tqdm(features):
        strategy = fill_value_dict.get(col)
        if strategy == 'Mean':
            fill_value = df_new[col].mean()
        elif strategy == 'Median':
            fill_value = df_new[col].median()
        elif strategy == 'Mode':
            fill_value = df_new[col].mode().iloc[0]
        else:
            # The target and engineered columns have no entry; skip them
            continue

        # Assign back instead of relying on chained inplace fillna
        df_new[col] = df_new[col].fillna(fill_value)
    
    return df_new

In [None]:
print(f"Number of features before preprocess: train_df={df_train.shape[1]} test_df={df_test.shape[1]}")

df_train = add_features(df_train)
df_train = preprocess_na(df_train)
df_test = add_features(df_test)
df_test = preprocess_na(df_test)

print(f"After: train_df={df_train.shape[1]} test_df={df_test.shape[1]}")
df_train.head()
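
Note the order of the two calls above: the row-wise features are added before the missing values are filled. A small sketch on a made-up frame (`toy` is an assumption for illustration) shows why the order matters for a feature such as `num_missing`:

```python
import numpy as np
import pandas as pd

# Invented data with a different number of gaps in each row.
toy = pd.DataFrame({'f1': [1.0, np.nan, 3.0], 'f2': [np.nan, np.nan, 6.0]})

# Counting first keeps the missingness signal.
count_then_fill = toy.isna().sum(axis=1)

# Filling first erases it: every count collapses to zero.
filled = toy.fillna(toy.median())
fill_then_count = filled.isna().sum(axis=1)

print(count_then_fill.tolist(), fill_then_count.tolist())
```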

In [None]:
TARGET = 'claim'

X = df_train.copy()
y = X.pop(TARGET)

I tested StandardScaler, RobustScaler and MinMaxScaler; the last one gave the best score. If you have thoughts on why, please share them in the comments.

In [None]:
scaler = MinMaxScaler()
# list.remove() returns None, so pass the column index directly to keep the feature names
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(df_test), columns=df_test.columns)
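
As a small illustration of the scaling step, the sketch below fits `MinMaxScaler` on a toy training frame only and reuses it for a toy test frame while keeping the column names. The `train` and `test` frames are invented; this is a sketch of the pattern rather than the exact cell above.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented frames standing in for the real train/test features.
train = pd.DataFrame({'a': [0.0, 5.0, 10.0], 'b': [1.0, 2.0, 3.0]})
test = pd.DataFrame({'a': [2.5, 20.0], 'b': [2.0, 0.0]})

scaler = MinMaxScaler()
# Learn min/max from the training data only, then apply the same mapping to test.
train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

print(test_scaled)
```

Test values outside the training range land outside [0, 1], which is expected when the scaler is fitted on the training set alone.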

# 3. Model

In [None]:
model_results = {'model': [], 'score': [], 'training_time': []}

def add_model_result(dic, model, score, time=None, fi=None):
    '''Save results of every model'''
    dic['model'].append(model)
    dic['score'].append(score)
    if time:
        dic['training_time'].append(time)

In [None]:
models = {
    'XGB 1': { # https://www.kaggle.com/mlanhenke/tps-09-simple-blend-stacking-xgb-lgbm-catb
        'model': XGBClassifier(
            eval_metric='auc',
            max_depth=4,
            alpha=10,
            subsample=0.65,
            colsample_bytree=0.7,
            colsample_bylevel = 0.8675692743597421,
            objective='binary:logistic',
            use_label_encoder=False,
            learning_rate=0.012,
            n_estimators=10000,
            min_child_weight = 366,
            tree_method='gpu_hist',
            gpu_id=0,
            predictor='gpu_predictor',
            n_jobs=-1,
        ),
        'feature_importance': 0
    },
    
    'XGB 2': { # https://www.kaggle.com/mlanhenke/tps-09-simple-blend-stacking-xgb-lgbm-catb
        'model': XGBClassifier(
            eval_metric='auc',
            max_depth=3,
            subsample=0.5,
            colsample_bytree=0.5,
            learning_rate=0.01187431306013263,
            objective='binary:logistic',
            use_label_encoder=False,
            n_estimators=10000,
            tree_method='gpu_hist',
            gpu_id=0,
            predictor='gpu_predictor',
            n_jobs=-1,
            seed=SEED
        ),
        'feature_importance': 0
    },

#     'LGBM 1': {
#         'model': LGBMClassifier(
#             num_leaves = 28,
#             n_estimators = 3000,
#             max_depth = 8,
#             min_child_samples = 202,
#             learning_rate = 0.11682677767413432,
#             bagging_fraction = 0.5036513634677549,
#             colsample_bytree = 0.7519268943195143,
#             n_jobs = 4,
#             random_seed = SEED
#         ),
#         'feature_importance': 0
#     },

#     'LGBM 2': { # https://www.kaggle.com/tensorchoko/tabular-sep-2021-lightgbm
#         'model': LGBMClassifier(
#             learning_rate = 0.03,
#             num_iterations = 30000,
#             objective ='binary',
#             metric = 'binary_logloss',
#             feature_pre_filter = False,
#             lambda_l1 = 0.0,
#             lambda_l2 = 0.0,
#             num_leaves = 123,
#             feature_fraction = 1.0,
#             bagging_fraction = 1.0,
#             bagging_freq = 0,
#             min_child_samples = 20,
#             n_jobs = 4,
#             random_seed = SEED+1
#         ),
#         'feature_importance': 0
#     },

#     'LGBM 3': { # https://www.kaggle.com/mlanhenke/tps-09-simple-blend-stacking-xgb-lgbm-catb
#         'model': LGBMClassifier(
#             max_depth = 4,
#             objective = 'binary',
#             metric = 'auc',
#             n_estimators = 5000,
#             learning_rate = 0.1,
#             reg_alpha = 18,
#             reg_lambda = 17,
#             num_leaves = 7,
#             colsample_bytree = 0.3,
#             device = 'gpu',
#             n_jobs = 4,
#             random_seed = SEED
#         ),
#         'feature_importance': 0
#     },
    
    'LGBM 4': { # https://www.kaggle.com/realtimshady/single-simple-lightgbm
        'model': LGBMClassifier(
            max_depth = 4,
            objective = 'binary',
            metric = 'auc',
            n_estimators = 30000,
            learning_rate = 0.02,
            reg_alpha = 25.2,
            reg_lambda = 90,
            num_leaves = 148,
            subsample = 0.71,
            subsample_freq = 1,
            colsample_bytree = 0.98,
            min_child_samples = 99,
            min_child_weight = 152,
            #n_jobs = 4,
            device = 'gpu',
            random_seed = 3407
        ),
        'feature_importance': 0
    },

    'LGBM 5': { # https://www.kaggle.com/hiro5299834/tps-sep-2021-single-lgbm
        'model': LGBMClassifier(
            objective = 'binary',
            metric = 'AUC',
            n_estimators = 20000, #20000,
            learning_rate = 0.01, #5e-3,
            subsample = 0.6,
            subsample_freq = 1,
            colsample_bytree = 0.4,
            reg_alpha = 10.0,
            reg_lambda = 1e-1,
            min_child_weight = 256,
            min_child_samples = 20,
            importance_type = 'gain',
            random_seed = SEED
        ),
        'feature_importance': 0
    },

    'LGBM 6': { # https://www.kaggle.com/towhidultonmoy/tuned-lightgbm
        'model': LGBMClassifier(
            objective = 'binary',
            boosting_type = 'gbdt', #gbdt
            num_leaves = 6, #6 #2^(max_depth)
            max_depth = 2, #2  
            learning_rate = 0.1, #0.1
            n_estimators = 41000, #40000
            reg_alpha = 25.0,
            reg_lambda = 76.7,
            bagging_seed = 7014, #42
            feature_fraction_seed = 7014, #42
            subsample = 0.985,
            subsample_freq = 1,
            colsample_bytree = 0.69,
            min_child_samples = 54,
            min_child_weight = 256,
            device = 'gpu',
            random_seed = SEED
        ),
        'feature_importance': 0
    },
    
    'CatBoost 1': {
        'model': CatBoostClassifier(
            class_weights = [1,1.15],
            depth = 7,
            learning_rate = 0.02,
            iterations = 16000,
            bootstrap_type = 'Bernoulli',
            subsample = 0.98,
            task_type = 'GPU',
            #thread_count = 4,
            random_seed = 3407
        ),
        'feature_importance': 0
    },

    'CatBoost 2': { # https://www.kaggle.com/brendanartley/sep-21-tab-series-lgbm-optuna
        'model': CatBoostClassifier(
            iterations = 15585, 
            objective = 'CrossEntropy', 
            bootstrap_type = 'Bernoulli', 
            od_wait = 1144, 
            learning_rate = 0.023575206684596582, 
            reg_lambda = 36.30433203563295, 
            random_strength = 43.75597655616195, 
            depth = 7, 
            min_data_in_leaf = 11, 
            leaf_estimation_iterations = 1, 
            subsample = 0.8227911142845009,
            task_type = 'GPU',
            #thread_count = 4,
            random_seed = SEED
        ),
        'feature_importance': 0
    },
    
#     'CatBoost 3': { # https://www.kaggle.com/kennethquisado/xgboost-10fold-cv-blend
#         'model': CatBoostClassifier(
#             eval_metric = 'AUC',
#             #n_estimators = 10000,
#             max_depth = 6,
#             learning_rate = 0.04,
#             grow_policy = "SymmetricTree",
#             l2_leaf_reg = 3.0,
#             random_strength = 1.0,
#             task_type = 'GPU',
#             #thread_count = 4,
#             random_seed = SEED+2
#         ),
#         'feature_importance': 0
#     },
}

In [None]:
N_FOLD =  5
kfold = KFold(n_splits = N_FOLD, random_state = SEED, shuffle = True)

model_results_level0 = {'model': [], 'score': [], 'training_time': []}
predicted_probabilities = pd.DataFrame(X.index, columns=['id'])
test_predicted_probabilities = pd.DataFrame(X_test.index, columns=['id'])

for m in models:
    print(f"{m}:")
    predictions_valid  = np.zeros(X.shape[0])
    probabilities_valid = np.zeros(X.shape[0])
    test_predicted_probabilities[m] = np.zeros(X_test.shape[0])
    score = 0
    
    start_time = time.time()
    # Iterate through each fold
    for fold, (train_idx, valid_idx) in enumerate(kfold.split(X)):
        X_train = X.iloc[train_idx]
        X_valid = X.iloc[valid_idx]
        y_train = y.iloc[train_idx]
        y_valid = y.iloc[valid_idx]        

        model = models[m]['model']
        model.fit(X_train, y_train,
                  eval_set = [(X_valid, y_valid)],
                  early_stopping_rounds = 120,
                  verbose = False
                 )

        # Mean of the predictions
        test_predicted_probabilities[m] += model.predict_proba(X_test)[:,1] / N_FOLD

        # Mean of feature importance
        models[m]['feature_importance'] += model.feature_importances_ / N_FOLD

        # Out of Fold predictions
        predictions_valid[valid_idx] = model.predict(X_valid)
        probabilities_valid[valid_idx] = model.predict_proba(X_valid)[:,1]
        # ROC-AUC is a ranking metric: score it on probabilities, not hard labels
        fold_score = metrics.roc_auc_score(y_valid, probabilities_valid[valid_idx])
        print(f"Fold {fold} | ROC-AUC: {fold_score:.3f}")

        score += fold_score / N_FOLD

    predicted_probabilities[m] = probabilities_valid
    add_model_result(model_results_level0, m, score, time.time()-start_time)
    print(f"Overall ROC-AUC: {score:.6f}\n")
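
ROC-AUC measures how well the scores rank positives above negatives, so it is usually computed from predicted probabilities; thresholded labels throw most of the ranking away. The sketch below illustrates this with a hypothetical `pairwise_auc` helper (written with NumPy for this example, not taken from scikit-learn) and made-up values:

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Hypothetical helper: share of (positive, negative) pairs ranked correctly, ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

y_true = [0, 0, 1, 1]
proba = np.array([0.1, 0.2, 0.3, 0.9])   # invented probabilities
labels = (proba > 0.5).astype(int)       # hard labels at a 0.5 threshold

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, labels)
print(auc_from_proba, auc_from_labels)
```

Here the probabilities rank every positive above every negative, while the hard labels tie one positive with both negatives and so score lower.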

In [None]:
model_results = pd.DataFrame(model_results_level0).sort_values('score', ascending=False)
model_results

In [None]:
predicted_probabilities[TARGET] = y.reset_index()[TARGET]
predicted_probabilities = predicted_probabilities.drop('id', axis=1)
predicted_probabilities
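
The table above holds out-of-fold (OOF) probabilities: each row was predicted by a model that never saw that row during training, which is what makes these columns safe inputs for the second-level model. A minimal sketch of the mechanics, assuming synthetic data and a plain `LogisticRegression` in place of the boosted models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data, invented for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Start with NaN so any row that is never predicted is easy to spot.
oof = np.full(len(X_demo), np.nan)
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_demo):
    clf = LogisticRegression().fit(X_demo[train_idx], y_demo[train_idx])
    # Each validation row receives a prediction exactly once.
    oof[valid_idx] = clf.predict_proba(X_demo[valid_idx])[:, 1]

print(np.isnan(oof).sum())
```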

### Feature importance

In [None]:
df_fi = pd.concat([pd.DataFrame(models[m]['feature_importance'], index=df_test.columns, columns=[m]) for m in models],
                  axis=1)
df_fi = df_fi.fillna(0).apply(lambda x: x/sum(x)*100)
df_fi['overall'] = df_fi.apply(lambda x: sum(x), axis=1)
df_fi = df_fi.apply(lambda x: x/sum(x)*100)
df_fi.sort_values('overall', ascending=False)
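
The normalisation above puts every model on a common percentage scale before combining them, since the libraries report importances in different units. A small sketch with invented numbers (`raw`, `model_a` and `model_b` are placeholders); averaging the per-model percentages gives the same `overall` column as summing and renormalising:

```python
import pandas as pd

# Invented importances on very different scales.
raw = pd.DataFrame({'model_a': [30.0, 10.0, 0.0], 'model_b': [1.0, 1.0, 2.0]},
                   index=['f1', 'f2', 'f3'])

# Scale each model's importances so that they sum to 100.
pct = raw / raw.sum() * 100

# Average across models; the result also sums to 100.
pct['overall'] = pct.mean(axis=1)
print(pct.sort_values('overall', ascending=False))
```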

# 4. Evaluation

In [None]:
THRESHOLD = 0.5 # threshold to decide claim or not

In [None]:
plt.figure(figsize=(17, 11))
plt.subplots_adjust(hspace=0.5, wspace=0.3)
#sns.set_palette("Spectral")
for i, m in enumerate(models):
    plt.subplot(3, 3, i+1)
    predictions = predicted_probabilities[m].apply(lambda x: 1 if x > THRESHOLD else 0)
    df_cm = pd.DataFrame(metrics.confusion_matrix(y, predictions), columns=np.unique(y), index = np.unique(y))
    df_cm.index.name = 'Actual'
    df_cm.columns.name = 'Predicted'
    sns.heatmap(df_cm, cmap="Blues", annot=True, fmt='g')
    plt.title(f"{m} (acc={metrics.accuracy_score(y, predictions):.3f})")
plt.show()
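
Each heatmap above is a confusion matrix obtained by cutting the out-of-fold probabilities at the chosen threshold. The sketch below reproduces that step on a handful of made-up values using NumPy only (the arrays and the `threshold` name are assumptions for illustration):

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 1])
proba = np.array([0.2, 0.7, 0.4, 0.8, 0.9])   # invented probabilities
threshold = 0.5

# Vectorised thresholding instead of a per-row lambda.
y_pred = (proba > threshold).astype(int)

# Rows: actual class, columns: predicted class.
cm = np.zeros((2, 2), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)

accuracy = np.trace(cm) / cm.sum()
print(cm, accuracy)
```

Moving the threshold shifts counts between the columns of the matrix but leaves the ROC-AUC of the underlying probabilities unchanged.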

In [None]:
# Plot ROC curve
def plot_roc_curve(fpr=None, tpr=None):
    """Plot the ROC curve against the diagonal of a random model"""
    plt.figure(figsize=(5,5))
    plt.title('ROC-curve', fontsize=16)
    plt.xlabel('False Positive Rate', fontsize=14)
    plt.ylabel('True Positive Rate', fontsize=14)
    
    plt.plot(fpr, tpr)

    # ROC-curve of random model
    plt.plot([0, 1], [0, 1], linestyle='--')
    
    plt.ylim([0.0, 1.0])
    plt.xlim([0.0, 1.0])
    plt.grid(True)
    
    plt.show()

In [None]:
%%time

predictions_valid = np.zeros((predicted_probabilities.shape[0],))
probabilities_valid = np.zeros(X.shape[0])
final_predicted_probabilities = 0
score = 0

X = predicted_probabilities[models.keys()]
y = predicted_probabilities[TARGET]

N_FOLD = 7
kf = KFold(n_splits = N_FOLD, random_state = 99, shuffle = True)

# Iterate through each fold
for fold, (train_idx, valid_idx) in enumerate(kf.split(X)):
    X_train = X.iloc[train_idx]
    X_valid = X.iloc[valid_idx]
    y_train = y.iloc[train_idx]
    y_valid = y.iloc[valid_idx] 

    model = LogisticRegression(C=0.55, solver='saga', penalty='elasticnet', l1_ratio=.15, max_iter=150, n_jobs=-1)
    model.fit(X_train, y_train)
    
    # Mean of the predictions
    final_predicted_probabilities += model.predict_proba(test_predicted_probabilities[models.keys()])[:,1] / N_FOLD
    
    # Out of Fold predictions
    predictions_valid[valid_idx] = model.predict(X_valid)
    probabilities_valid[valid_idx] = model.predict_proba(X_valid)[:,1]
    # Score the fold on probabilities; hard labels would discard the ranking
    fold_score = metrics.roc_auc_score(y_valid, probabilities_valid[valid_idx])
    print(f"Fold {fold} | ROC-AUC: {fold_score:.3f}")

    score += fold_score / N_FOLD
    
print(f"Overall ROC-AUC: {score:.6f}")

### Plot final AUC-ROC

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y, probabilities_valid)
plot_roc_curve(fpr, tpr)
plt.show()

In [None]:
import plotly.figure_factory as ff
fig = ff.create_distplot([probabilities_valid], ['LogisticRegression'], bin_size=0.1, show_hist=False, show_rug=False)
fig.show()

# 5. Submit predictions

Save the predicted probabilities to a CSV file

In [None]:
output = pd.DataFrame({'id': df_test.index,
                        'claim': final_predicted_probabilities})
output.to_csv('submission.csv', index=False)

### _If you find it useful please upvote_
### _Thank you!_