# Stacking Ensemble

This notebook walks through an example of using a stacking ensemble.

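Stacking is an ensemble technique in which several models are trained on the same cross-validation folds, and a new model is then trained using each model's cross-validation (out-of-fold) predictions as input. The models trained on the original features are called level-1 or base models, and the model trained on the cross-validation predictions is called the level-2 or ensemble model. A level-3 model can in turn be trained on the cross-validation predictions of the level-2 models (see the figure below).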

![Screen%20Shot%202021-04-25%20at%2010.16.21%20PM.png](attachment:Screen%20Shot%202021-04-25%20at%2010.16.21%20PM.png)
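
To make the shared-fold idea concrete, here is a minimal, self-contained sketch of two-level stacking. It is not the code used later in this notebook: the synthetic data from `make_classification`, the two base models, the logistic-regression level-2 model, and all `*_demo` names are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic 3-class data, used only for illustration (an assumption, not the notebook's data).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)

# The same folds are shared by every model at every level.
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Level 1: base models trained on the original features.
level1_models = {
    'rf': RandomForestClassifier(n_estimators=50, random_state=0),
    'lr': LogisticRegression(max_iter=1000),
}

# Out-of-fold class probabilities: each row is predicted by a model that never saw it.
oof_demo = [cross_val_predict(model, X_demo, y_demo, cv=cv_demo, method='predict_proba')
            for model in level1_models.values()]

# Level 2: the stacked out-of-fold predictions become the new feature matrix.
X_level2 = np.hstack(oof_demo)
p_level2 = cross_val_predict(LogisticRegression(max_iter=1000), X_level2, y_demo,
                             cv=cv_demo, method='predict_proba')

print(X_level2.shape, p_level2.shape)
```

With two base models and three classes, the level-2 feature matrix has 2 × 3 = 6 columns, one per (model, class) pair.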

In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
pip install kaggler --upgrade

Collecting kaggler
  Downloading https://files.pythonhosted.org/packages/ab/4c/3dffc99732532278a7386c04dd9fd3a27b409739e3a67a9956f6ce08af11/Kaggler-0.9.4.tar.gz (820kB)

In [3]:
import lightgbm as lgb
import numpy as np
import pandas as pd
from pprint import pprint
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, confusion_matrix
import warnings

from kaggler.preprocessing import LabelEncoder
from kaggler.model import AutoLGB


In [4]:
import kaggler
print(kaggler.__version__)

0.9.4


In [5]:
pd.set_option('display.max_columns', 100)  # full option name; the short form is ambiguous in newer pandas
warnings.simplefilter('ignore')

## Load Data

In [6]:
algo_name = 'esb'
model_name = f'{algo_name}'

predict_val_file = f'{model_name}.val.txt'
predict_tst_file = f'{model_name}.tst.txt'
submission_file = f'{model_name}.sub.csv'

index_col = 'index'
target_col = 'credit'

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
trn = pd.read_csv('/content/drive/MyDrive/Kaggle_Study/나은/creditcard-user-overdue-prediction/train.csv', index_col=index_col)
print(trn.shape)
trn.head()

(26457, 19)


index,gender,car,reality,child_num,income_total,income_type,edu_type,family_type,house_type,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,work_phone,phone,email,occyp_type,family_size,begin_month,credit
0,F,N,N,0,202500.0,Commercial associate,Higher education,Married,Municipal apartment,-13899,-4709,1,0,0,0,,2.0,-6.0,1.0
1,F,N,Y,1,247500.0,Commercial associate,Secondary / secondary special,Civil marriage,House / apartment,-11380,-1540,1,0,0,1,Laborers,3.0,-5.0,1.0
2,M,Y,Y,0,450000.0,Working,Higher education,Married,House / apartment,-19087,-4434,1,0,1,0,Managers,2.0,-22.0,2.0
3,F,N,Y,0,202500.0,Commercial associate,Secondary / secondary special,Married,House / apartment,-15088,-2092,1,0,1,0,Sales staff,2.0,-37.0,0.0
4,F,Y,Y,0,157500.0,State servant,Higher education,Married,House / apartment,-15037,-2105,1,0,0,0,Managers,2.0,-26.0,2.0


In [9]:
tst = pd.read_csv('/content/drive/MyDrive/Kaggle_Study/나은/creditcard-user-overdue-prediction/test.csv', index_col=index_col)
print(tst.shape)
tst.head()

(10000, 18)


index,gender,car,reality,child_num,income_total,income_type,edu_type,family_type,house_type,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,work_phone,phone,email,occyp_type,family_size,begin_month
26457,M,Y,N,0,112500.0,Pensioner,Secondary / secondary special,Civil marriage,House / apartment,-21990,365243,1,0,1,0,,2.0,-60.0
26458,F,N,Y,0,135000.0,State servant,Higher education,Married,House / apartment,-18964,-8671,1,0,1,0,Core staff,2.0,-36.0
26459,F,N,Y,0,69372.0,Working,Secondary / secondary special,Married,House / apartment,-15887,-217,1,1,1,0,Laborers,2.0,-40.0
26460,M,Y,N,0,112500.0,Commercial associate,Secondary / secondary special,Married,House / apartment,-19270,-2531,1,1,0,0,Drivers,2.0,-41.0
26461,F,Y,Y,0,225000.0,State servant,Higher education,Married,House / apartment,-17822,-9385,1,1,0,0,Managers,2.0,-8.0


In [10]:
sub = pd.read_csv('/content/drive/MyDrive/Kaggle_Study/나은/creditcard-user-overdue-prediction/sample_submission.csv', index_col=index_col)
print(sub.shape)
sub.head()

(10000, 3)


index,0,1,2
26457,0,0,0
26458,0,0,0
26459,0,0,0
26460,0,0,0
26461,0,0,0


## Feature Engineering

### Label Encoding for Categorical Features

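Label-encode the categorical variables. The `LabelEncoder` from the `kaggler` package groups rare values (in the code below, values that appear fewer than 10 times) into a single new category, and it also treats missing values as a new category.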
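
The grouping behavior described above can be sketched with plain pandas. This is not `kaggler`'s implementation: the helper name `encode_with_rare_group`, the `'__missing__'` placeholder, and the choice of code `0` for rare or unseen values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def encode_with_rare_group(train_col, test_col, min_obs=10, missing='__missing__'):
    """Hypothetical helper: integer-encode a column, pooling rare values into code 0."""
    # Treat missing values as an ordinary category of their own.
    trn_filled = train_col.fillna(missing)
    tst_filled = test_col.fillna(missing)

    # Keep only the values seen at least `min_obs` times in the training data.
    counts = trn_filled.value_counts()
    frequent = counts[counts >= min_obs].index

    # Frequent values get codes 1..K; anything else (rare or unseen) maps to 0.
    mapping = {value: code for code, value in enumerate(frequent, start=1)}
    trn_codes = trn_filled.map(mapping).fillna(0).astype(int)
    tst_codes = tst_filled.map(mapping).fillna(0).astype(int)
    return trn_codes, tst_codes


# Toy data: 'c' is rare in training, 'z' never appears in training.
demo_trn = pd.Series(['a'] * 3 + ['b'] * 3 + ['c'] + [np.nan] * 3)
demo_tst = pd.Series(['a', 'c', 'z', np.nan])
trn_codes, tst_codes = encode_with_rare_group(demo_trn, demo_tst, min_obs=2)
print(tst_codes.tolist())
```

The mapping is fitted on the training column only and then reused for the test column, mirroring the `fit_transform` / `transform` split used in the next cell.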

In [11]:
cat_cols = [x for x in trn.columns if trn[x].dtype == 'object']
num_cols = [x for x in trn.columns if x not in cat_cols + [target_col]]
feature_cols = num_cols + cat_cols
print(len(feature_cols), len(cat_cols), len(num_cols))

18 8 10


In [12]:
lbe = LabelEncoder(min_obs=10)
trn[cat_cols] = lbe.fit_transform(trn[cat_cols])
tst[cat_cols] = lbe.transform(tst[cat_cols])

In [13]:
for col in ['child_num', 'family_size']:
    trn[col] = np.log2(1 + trn[col])
    tst[col] = np.log2(1 + tst[col])
    
trn['DAYS_BIRTH'] = np.log2(1 - trn['DAYS_BIRTH'])
tst['DAYS_BIRTH'] = np.log2(1 - tst['DAYS_BIRTH'])

scaler = StandardScaler()
trn[num_cols] = scaler.fit_transform(trn[num_cols])
tst[num_cols] = scaler.transform(tst[num_cols])

## Level-1 Base Model Training

In [14]:
n_est = 1000
seed = 42
n_fold = 5
n_class = 3

lgb_params = {
    'metric': 'multi_logloss',
    'n_estimators': n_est,
    'objective': 'multiclass',
    'random_state': seed,
    'learning_rate': 0.01,
    'min_child_samples': 20,
    'reg_alpha': 3e-5,
    'reg_lambda': 9e-2,
    'num_leaves': 63,
    'colsample_bytree': 0.8,
    'subsample': 0.8,
    'num_class': n_class
}

xgb_params = {
    'eval_metric': 'mlogloss',  # XGBoost names this setting `eval_metric`, not `metric`
    'objective': 'multi:softprob',
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'learning_rate': 0.01,
    'random_state': seed,
    'num_class': n_class,
    'max_depth': 6,
    'n_estimators': n_est,
    # `min_child_samples` is a LightGBM parameter, so it is omitted here
    'reg_alpha': 3e-5,
    'reg_lambda': 9e-2,
}

rf_params = {
    'max_depth': 20,
    'min_samples_leaf': 4,
    'random_state': seed
}

In [15]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier


base_models = {'rf': RandomForestClassifier(**rf_params), 
               'lgb': LGBMClassifier(**lgb_params),
               'xgb': XGBClassifier(),  # note: xgb_params is not passed, so XGBoost defaults apply
               'et': ExtraTreesClassifier(bootstrap=True, 
                                          criterion='entropy', 
                                          max_features=0.55, 
                                          min_samples_leaf=8, 
                                          min_samples_split=4, 
                                          n_estimators=100)}

In [19]:
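from sklearn.base import clone

X = trn[feature_cols]
y = trn[target_col]
X_tst = tst[feature_cols]

# The same splitter is reused for every base model and for the level-2 model
cv = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=seed)

p_dict = {}      # out-of-fold predictions per base model
p_tst_dict = {}  # fold-averaged test predictions per base model
for name in base_models:
    print(f'Training {name}:')
    p = np.zeros((X.shape[0], n_class), dtype=float)
    p_tst = np.zeros((X_tst.shape[0], n_class), dtype=float)
    for i, (i_trn, i_val) in enumerate(cv.split(X, y)):
        # clone() returns a fresh, unfitted estimator with the same parameters
        clf = clone(base_models[name])
        # cv.split yields positional indices, so index y with .iloc as well
        clf.fit(X.iloc[i_trn], y.iloc[i_trn])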

        p[i_val] = clf.predict_proba(X.iloc[i_val])
        p_tst += clf.predict_proba(X_tst) / n_fold

    p_dict[name] = p
    p_tst_dict[name] = p_tst
    print(f'\tCV Log Loss: {log_loss(y, p):.6f}')

Training rf:
	CV Log Loss: 0.730307
Training lgb:
	CV Log Loss: 0.742768
Training xgb:
	CV Log Loss: 0.798341
Training et:
	CV Log Loss: 0.776546


## Level-2 Stacking

In [18]:
# Level-2 features: the base models' class probabilities, stacked column-wise
X = pd.DataFrame(np.hstack([x for _, x in p_dict.items()]))
X_tst = pd.DataFrame(np.hstack([x for _, x in p_tst_dict.items()]))

print(X)
print(y)
p = np.zeros((X.shape[0], n_class), dtype=float)
p_tst = np.zeros((X_tst.shape[0], n_class), dtype=float)
for i_cv, (i_trn, i_val) in enumerate(cv.split(X, y)):
    if i_cv == 0:
        # Tune hyperparameters on the first fold only, then reuse them for the rest
        clf = AutoLGB(objective='multiclass', metric='multi_logloss', params={'num_class': n_class},
                      feature_selection=False, n_est=10000)
        print(X.iloc[i_trn], y.iloc[i_trn])
        clf.tune(X.iloc[i_trn], y.iloc[i_trn])
        n_best = clf.n_best
        features = clf.features
        params = clf.params
        print(f'best iteration: {n_best}')
        print(f'selected features ({len(features)}): {features}')
        pprint(params)
        clf.fit(X.iloc[i_trn], y.iloc[i_trn])
    else:
        # cv.split yields positional indices, so index y with .iloc as well
        train_data = lgb.Dataset(X[features].iloc[i_trn], label=y.iloc[i_trn])
        clf = lgb.train(params, train_data, n_best)

    p[i_val] = clf.predict(X[features].iloc[i_val])
    p_tst += clf.predict(X_tst[features]) / n_fold

             0         1         2         3         4         5         6   \
0      0.162460  0.164950  0.672589  0.206100  0.079631  0.714269  0.178365   
1      0.422140  0.177707  0.400154  0.478632  0.117802  0.403566  0.165739   
2      0.112419  0.221551  0.666030  0.112022  0.150591  0.737387  0.146300   
3      0.134362  0.158854  0.706785  0.150916  0.154780  0.694304  0.129216   
4      0.148379  0.179409  0.672211  0.172696  0.150607  0.676697  0.132653   
...         ...       ...       ...       ...       ...       ...       ...   
26452  0.123656  0.795309  0.081035  0.102203  0.888780  0.009017  0.193546   
26453  0.099353  0.402851  0.497796  0.147776  0.387059  0.465165  0.127830   
26454  0.090881  0.270567  0.638552  0.072000  0.158388  0.769611  0.105093   
26455  0.079010  0.170515  0.750475  0.078673  0.110124  0.811203  0.115188   
26456  0.057779  0.186150  0.756071  0.057600  0.176710  0.765690  0.096614   

             7         8         9         10      

IndexError: ignored

In [None]:
print(f'CV Log Loss: {log_loss(y, p):.6f}')
np.savetxt(predict_val_file, p, fmt='%.6f')
np.savetxt(predict_tst_file, p_tst, fmt='%.6f')

CV Log Loss: 0.715961


## Save the Submission File

In [None]:
sub[sub.columns] = p_tst
sub.head()

index,0,1,2
26457,0.07873,0.157264,0.764006
26458,0.256918,0.208637,0.534445
26459,0.064199,0.08366,0.852141
26460,0.140809,0.116727,0.742464
26461,0.125426,0.15726,0.717314


In [None]:
sub.to_csv(submission_file)

In [None]:
submission_file

'esb.sub.csv'