# **TPS - Sep 2021**

## **XGBoost & LightGBM & CatBoost Stacking**

### Thank you for visiting my notebook :)
### This notebook is for beginners like me **who want to study stacking ensembles!**

#### A **stacking ensemble** is a nice technique for improving your score.
#### As the image below shows, a stacking ensemble needs several base models for classification and a meta-model for the final prediction!

#### Here's what you need to do.
**Step 1. Prepare your train and test data for training and prediction (preprocessing).**

**Step 2. Select some base models, train them, and build the stacking datasets from their predictions!**

**Step 3. Select a final model to be the meta-model!**

**Step 4. Train the meta-model on the stacking datasets and predict with it ;)**

![Stacking Ensemble](http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier_files/stackingclassification_overview.png)
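
The four steps above can be sketched end to end on toy data. This is only a minimal illustration, not the pipeline used in this notebook: the synthetic dataset, the two scikit-learn base models, and all variable names are assumptions chosen so the example runs quickly without a GPU.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 1: toy data standing in for the preprocessed train/test sets.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 2: each base model produces out-of-fold (OOF) predictions on train
# and fold-averaged predictions on test.
base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
oof = np.zeros((len(X_train), len(base_models)))
test_meta = np.zeros((len(X_test), len(base_models)))

for j, model in enumerate(base_models):
    for tr_idx, va_idx in skf.split(X_train, y_train):
        model.fit(X_train[tr_idx], y_train[tr_idx])
        oof[va_idx, j] = model.predict_proba(X_train[va_idx])[:, 1]
        test_meta[:, j] += model.predict_proba(X_test)[:, 1] / n_splits

# Steps 3 and 4: the meta-model learns from the stacked predictions.
meta = LogisticRegression()
meta.fit(oof, y_train)
stacked_auc = roc_auc_score(y_test, meta.predict_proba(test_meta)[:, 1])
print(f"Stacked AUC on held-out toy data: {stacked_auc:.3f}")
```

The important detail is that the meta-model only ever sees predictions made on rows the base model did not train on, which is what keeps the stacked features honest.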

# **Import Libraries**

In [None]:
import gc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

# **Load Data**

In [None]:
train = pd.read_csv('../input/tabular-playground-series-sep-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-sep-2021/test.csv')

all_data = pd.concat([train, test])

In [None]:
all_data2 = all_data.drop(columns = ['id', 'claim'])
all_data2

# **Handle missing values**

*   We can fill missing values with each column's mean.
*   Or, we can train a model on the complete data to predict the missing values.
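
To make the two options concrete, here is a small sketch on synthetic data. It is an illustration under assumptions, not what this notebook does later: the toy columns are made up, and scikit-learn's `IterativeImputer` (still marked experimental) is used as one possible way to "predict" missing values.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)
# Toy data: the second column is roughly twice the first.
x0 = rng.normal(size=200)
X_full = np.column_stack([x0, 2 * x0 + rng.normal(scale=0.1, size=200)])

# Hide about 20% of the second column.
X_missing = X_full.copy()
mask = rng.random(200) < 0.2
X_missing[mask, 1] = np.nan

# Option 1: fill with the column mean.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X_missing)
# Option 2: predict the missing entries from the other column.
model_filled = IterativeImputer(random_state=0).fit_transform(X_missing)

# Compare the imputed values with the hidden ground truth.
mean_err = np.abs(mean_filled[mask, 1] - X_full[mask, 1]).mean()
model_err = np.abs(model_filled[mask, 1] - X_full[mask, 1]).mean()
print(f"mean imputation error:  {mean_err:.3f}")
print(f"model imputation error: {model_err:.3f}")
```

Model-based imputation helps here only because the toy columns are strongly correlated; on real data the gain depends on how well the other features explain the missing one.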


### **Distribution of Missing values**

In [None]:
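# Distribution of the number of missing values per feature

plt.figure(figsize = (12, 6))
# 'id' and 'claim' are already dropped, so every remaining column is a feature
missing_values = all_data2.isnull().sum()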
sns.histplot(missing_values, color='violet');
plt.show()

print('\n')
print('-------- Distribution of Missing values --------')
print('Min:', missing_values.min())
print('Max:', missing_values.max())
print('Mean:', missing_values.mean())
print('------------------------------------------------')

# **Feature Generation**

*   Thanks to BIZEN's notebook, we can use the number of missing values in each row as a feature! [Check BIZEN's Notebook!](https://www.kaggle.com/hiro5299834/tps-sep-2021-single-lgbm)

*   **I also made one indicator column per feature that marks whether its value was missing (Y/N), so the models can learn from the missingness itself.**
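
Here is what those two kinds of features look like on a tiny frame. This is just a sketch: the three-column toy DataFrame and the `missing_` prefix are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the competition features.
df = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0],
    "f2": [np.nan, np.nan, 6.0],
    "f3": [7.0, 8.0, 9.0],
})
features = df.columns.tolist()

# Row-wise count of missing values.
df["n_missing"] = df[features].isna().sum(axis=1)

# One boolean flag per feature, True where the value was missing.
flags = df[features].isna().add_prefix("missing_")

out = pd.concat([df, flags], axis=1)
print(out)
```

The first row gets `n_missing = 1`, the second `2`, and the third `0`, and each `missing_*` column records where the gap was.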

In [None]:
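features = all_data2.columns.tolist()  # the original feature columns only

# Compute row-wise statistics on the original features only, so the new
# columns (e.g. n_missing) do not leak into std / min / max
all_data2['n_missing'] = all_data2[features].isna().sum(axis=1)
all_data2['std'] = all_data2[features].std(axis=1)
all_data2['min'] = all_data2[features].min(axis=1)
all_data2['max'] = all_data2[features].max(axis=1)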
all_data2

In [None]:
miss_one_hot = all_data2.iloc[:, :118].isna()
miss_one_hot.columns = [f'missing_f_{i}' for i in range(118)]
miss_one_hot

In [None]:
all_data3 = pd.concat([all_data2, miss_one_hot], axis = 1)

In [None]:
del all_data, all_data2, test, miss_one_hot
gc.collect()

In [None]:
# Copy the slices so later in-place assignments do not write to a view of all_data3
train2 = all_data3[:len(train)].copy()
test2 = all_data3[len(train):].copy()
y = train['claim']

In [None]:
del train
gc.collect()

In [None]:
sc = StandardScaler()
si = SimpleImputer()

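# Fit the scaler and imputer on the training features only, then apply the
# same fitted transforms to the test features
train2.iloc[:, :118] = si.fit_transform(sc.fit_transform(train2.iloc[:, :118]))
test2.iloc[:, :118] = si.transform(sc.transform(test2.iloc[:, :118]))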
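
One way to make sure the test set is transformed with statistics learned from the training set is to wrap both steps in a scikit-learn pipeline. The sketch below uses made-up toy arrays (`X_tr`, `X_te`) and is only an illustration of the pattern, not a replacement for the cell above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
X_te = np.array([[np.nan, 20.0], [4.0, 40.0]])

# StandardScaler ignores NaNs when fitting and passes them through,
# so the imputer can run after it.
prep = make_pipeline(StandardScaler(), SimpleImputer(strategy="mean"))

X_tr_ready = prep.fit_transform(X_tr)  # learn mean/std and fill values on train
X_te_ready = prep.transform(X_te)      # reuse the train statistics on test

print(X_te_ready)
```

Because the scaled training columns have mean 0, the missing test value is filled with (approximately) 0, and the test value 40 in the second column becomes 2.0, i.e. two training standard deviations above the training mean.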

# **Modeling**

### **Stacking Data Loader**

In [None]:
def Stacking_Data_Loader(model, model_name, x_train, y_train, x_test, fold):
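    '''
    Put in your model, its name ('cat', 'xgb', or anything else for LightGBM),
    your train and test datasets, and the number of folds!
    This function returns out-of-fold train predictions and fold-averaged
    test predictions for the stacking ensemble :)
    '''

    stk = StratifiedKFold(n_splits = fold, random_state = 42, shuffle = True)
    
    # Arrays that collect the predictions
    train_fold_pred = np.zeros((x_train.shape[0], 1))
    test_pred = np.zeros((x_test.shape[0], fold))
    
    for counter, (train_index, valid_index) in enumerate(stk.split(x_train, y_train)):
        # Slice the function arguments (not the global train2 / y) and use
        # new names so the full x_train / y_train are not overwritten
        x_tr, y_tr = x_train.iloc[train_index], y_train.iloc[train_index]
        x_valid, y_valid = x_train.iloc[valid_index], y_train.iloc[valid_index]

        print('------------ Fold', counter+1, 'Start! ------------')
        # Note: these fit() keyword arguments match the library versions this
        # notebook was written for; XGBoost 2.0+ and LightGBM 4.0+ take early
        # stopping settings in the constructor or via callbacks instead
        if model_name == 'cat':
            model.fit(x_tr, y_tr, eval_set=[(x_valid, y_valid)])
        elif model_name == 'xgb':
            model.fit(x_tr, y_tr, eval_set=[(x_valid, y_valid)], eval_metric = 'auc', verbose = 500, early_stopping_rounds = 200)
        else:
            model.fit(x_tr, y_tr, eval_set=[(x_valid, y_valid)], eval_metric = 'auc', verbose = 100, early_stopping_rounds = 200)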
            
        print('------------ Fold', counter+1, 'Done! ------------')
        
        train_fold_pred[valid_index, :] = model.predict_proba(x_valid)[:, 1].reshape(-1, 1)
        test_pred[:, counter] = model.predict_proba(x_test)[:, 1]
    
    test_pred_mean = np.mean(test_pred, axis = 1).reshape(-1, 1)

    print('Done!')
    
    return train_fold_pred, test_pred_mean
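
The out-of-fold array only makes sense if every training row lands in exactly one validation fold. Here is a quick sanity check of that property on toy labels (the arrays `X_toy` and `y_toy` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_toy = np.array([0, 1] * 10)      # 20 balanced toy labels
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for splitting

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
times_validated = np.zeros(len(y_toy), dtype=int)

for train_idx, valid_idx in skf.split(X_toy, y_toy):
    # A row is never in both the training and validation part of a fold.
    assert np.intersect1d(train_idx, valid_idx).size == 0
    times_validated[valid_idx] += 1

print(times_validated)  # every entry is 1
```

This is why the function can write each fold's validation predictions straight into `train_fold_pred[valid_index]` without overwriting anything.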

### **Modeling**

#### Model's HyperParameters
* LGBM2 Param : https://www.kaggle.com/hiro5299834/tps-sep-2021-single-lgbm
* Cat2 Param : https://www.kaggle.com/mlanhenke/tps-09-single-catboostclassifier-0-81676

Thanks for Sharing!

In [None]:
lgb1_params = {
    'objective': 'binary',
    'n_estimators': 10000,
    'random_state': 42,
    'learning_rate': 0.095,
    'subsample': 0.6,
    'subsample_freq': 1,
    'colsample_bytree': 0.4,
    'reg_alpha': 10.0,
    'reg_lambda': 1e-1,
    'min_child_weight': 256,
    'min_child_samples': 20,
    'device' : 'gpu',
    'max_depth' : 3,
    'num_leaves' : 7
}

lgb2_params = {
    'max_depth' : 3,
    'num_leaves' : 7,
    'n_estimators' : 5000,
    'colsample_bytree' : 0.3,
    'subsample' : 0.5,
    'random_state' : 42,
    'reg_alpha' : 18,
    'reg_lambda' : 17,
    'learning_rate' : 0.095,
    'device' : 'gpu',
    'objective' : 'binary'
}

xgb1_params = {
      'tree_method' : 'gpu_hist', 
      'learning_rate' : 0.01,
      'n_estimators' : 50000,
      'colsample_bytree' : 0.3,
      'subsample' : 0.75,
      'reg_alpha' : 19,
      'reg_lambda' : 19,
      'max_depth' : 5, 
      'predictor' : 'gpu_predictor'
}

cat1_params = {
     'depth' : 5,
     'grow_policy' : 'SymmetricTree',
     'l2_leaf_reg' : 3.0,
     'random_strength' : 1.0,
     'learning_rate' : 0.02,
     'iterations' : 10000,
     'loss_function' : 'CrossEntropy',
     'eval_metric' : 'AUC',
     'use_best_model' : True,
     'early_stopping_rounds' : 200,
     'task_type' : 'GPU',
     'verbose' : 1000,
}

cat2_params = {
    'iterations': 15585, 
    'objective': 'CrossEntropy', 
    'bootstrap_type': 'Bernoulli', 
    'learning_rate': 0.023575206684596582, 
    'reg_lambda': 36.30433203563295, 
    'random_strength': 43.75597655616195, 
    'depth': 7, 
    'min_data_in_leaf': 11, 
    'leaf_estimation_iterations': 1, 
    'subsample': 0.8227911142845009,
    'task_type' : 'GPU',
    'eval_metric' : 'AUC',
    'verbose' : 1000,
    'early_stopping_rounds' : 200,
}

In [None]:
lgbm_1 = LGBMClassifier(**lgb1_params)
lgbm_2 = LGBMClassifier(**lgb2_params)

# xgb = XGBClassifier(**xgb1_params)

# cat_1 = CatBoostClassifier(**cat1_params)
# cat_2 = CatBoostClassifier(**cat2_params)

In [None]:
del all_data3
gc.collect()

### **Stacking**

* Make the train and test prediction arrays!
* Concatenate the 5 arrays into 1 dataset
* Thanks to Kenneth Q's nice notebook (https://www.kaggle.com/kennethquisado/xgboost-10fold-cv-blend)

In [None]:
# cat1_train, cat1_test = Stacking_Data_Loader(cat_1, 'cat', train2, y, test2, 5)
# del cat_1
# gc.collect()

# cat2_train, cat2_test = Stacking_Data_Loader(cat_2, 'cat', train2, y, test2, 5)
# del cat_2
# gc.collect()

lgbm1_train, lgbm1_test = Stacking_Data_Loader(lgbm_1, 'lgbm', train2, y, test2, 5)
del lgbm_1
gc.collect()

lgbm2_train, lgbm2_test = Stacking_Data_Loader(lgbm_2, 'lgbm', train2, y, test2, 5)
del lgbm_2
gc.collect()

# xgb_train, xgb_test = Stacking_Data_Loader(xgb, 'xgb', train2, y, test2, 5)
# del xgb
# gc.collect()

### **Final Stacking Datasets!**

In [None]:
# Load precomputed stacking arrays (the dataset name indicates CatBoost and XGBoost predictions)
stack_x_train = np.load('../input/catboost-xgboost-stacking-datasets/stack_x_train.npy')
stack_x_test = np.load('../input/catboost-xgboost-stacking-datasets/stack_x_test (1).npy')

stack_x_train = np.concatenate((stack_x_train, lgbm1_train, lgbm2_train), axis = 1)
stack_x_test = np.concatenate((stack_x_test, lgbm1_test, lgbm2_test), axis = 1)

stack_x_train

In [None]:
# stack_x_train = np.concatenate((cat1_train, cat2_train, xgb_train, lgbm1_train, lgbm2_train), axis = 1)
# stack_x_test = np.concatenate((cat1_test, cat2_test, xgb_test, lgbm1_test, lgbm2_test), axis = 1)
# stack_x_train
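
The concatenation step turns each model's column of predictions into one feature for the meta-model. A tiny sketch with made-up numbers (the three arrays below are hypothetical stand-ins for the real prediction arrays):

```python
import numpy as np

# Hypothetical out-of-fold predictions from three base models,
# each shaped (n_rows, 1) like the arrays returned by the loader.
cat_pred = np.array([[0.1], [0.8], [0.4], [0.9]])
xgb_pred = np.array([[0.2], [0.7], [0.5], [0.8]])
lgbm_pred = np.array([[0.3], [0.9], [0.3], [0.7]])

# axis=1 places the arrays side by side: one row per sample,
# one column per base model.
stacked = np.concatenate((cat_pred, xgb_pred, lgbm_pred), axis=1)
print(stacked.shape)  # (4, 3)
```

With five base models the result would have five columns, and the row order must match the order of the labels used to train the meta-model.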

In [None]:
n_splits = 5
stk = StratifiedKFold(n_splits = n_splits)

test_pred = 0
fold = 1
total_auc = 0

for train_index, valid_index in stk.split(stack_x_train, y):
    x_train, y_train = stack_x_train[train_index], y[train_index]
    x_valid, y_valid = stack_x_train[valid_index], y[valid_index]
    
    # Meta-model: logistic regression on the base models' predictions
    lr = LogisticRegression(n_jobs = -1, random_state = 42, C = 1000, max_iter = 1000)
    lr.fit(x_train, y_train)
    
    valid_pred = lr.predict_proba(x_valid)[:, 1]
    # Average (rather than sum) the fold predictions so they stay in [0, 1]
    test_pred += lr.predict_proba(stack_x_test)[:, 1] / n_splits
    auc = roc_auc_score(y_valid, valid_pred)
    total_auc += auc / n_splits
    print('Fold', fold, 'AUC :', auc)
    fold += 1
    
print('Mean AUC score :', total_auc)
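
A side note on combining the fold predictions: AUC depends only on how the rows are ranked, so summing or averaging the fold predictions gives the same score, but averaging keeps the values in the probability range. The numbers below are made up purely to illustrate this.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
# Hypothetical probabilities from three folds of a meta-model.
fold_preds = np.array([
    [0.10, 0.60, 0.45, 0.80, 0.20, 0.70],
    [0.20, 0.50, 0.55, 0.90, 0.10, 0.60],
    [0.15, 0.70, 0.50, 0.85, 0.30, 0.65],
])

summed = fold_preds.sum(axis=0)     # can exceed 1
averaged = fold_preds.mean(axis=0)  # stays within [0, 1]

auc_sum = roc_auc_score(y_true, summed)
auc_mean = roc_auc_score(y_true, averaged)
print(auc_sum, auc_mean)
```

Both scores are identical here, so the choice only matters if the submitted values need to be read as probabilities.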

# **Submission!**

In [None]:
sub = pd.read_csv('../input/tabular-playground-series-sep-2021/sample_solution.csv')
sub['claim'] = test_pred
sub.to_csv('sub.csv', index = False)

# Done!


## If you found this notebook helpful, please don't forget to upvote!