# Second-Level Learning Model Using Stacking

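Hello. I only recently started Kaggle as a hobby, so this kernel has plenty of room for improvement,  
but I enjoyed putting it together! Feedback of any kind is always welcome!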

The stacking model in this kernel is implemented as follows.

1. Five First-Level Classifiers (DecisionTree, RandomForest, ExtraTrees, AdaBoost, GBM) each produce predictions in 0/1 form.
2. These predictions are collected and fed into a Second-Level Classifier (XGBoost) to produce the final result.
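
The two steps above can be sketched on synthetic data as follows. This is only a minimal illustration under stated assumptions: the data come from `make_classification` rather than the competition files, only two first-level models are used, and `LogisticRegression` stands in for the XGBoost second-level model so that the sketch depends on scikit-learn alone.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the competition dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# First level: each classifier yields out-of-fold 0/1 predictions.
first_level = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    RandomForestClassifier(n_estimators=20, random_state=0),
]
meta_features = np.column_stack(
    [cross_val_predict(clf, X_demo, y_demo, cv=5) for clf in first_level]
)

# Second level: a meta-classifier learns from the stacked predictions.
# LogisticRegression is a stand-in for the XGBoost model used in this kernel.
meta_clf = LogisticRegression().fit(meta_features, y_demo)
print(meta_features.shape)  # one column per first-level classifier
```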

**This kernel is a baseline; I plan to improve its performance later through Feature Engineering, Parameter Tuning, and so on.  
Any feedback or ideas for improving performance are gratefully received!**

# 1. Converting the Data into a Trainable Form

In [None]:
import numpy as np
import pandas as pd
import xgboost as xgb
from matplotlib import pyplot as plt
import seaborn as sns

from category_encoders.ordinal import OrdinalEncoder
from sklearn.model_selection import train_test_split, KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.metrics import f1_score, accuracy_score


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train_df = pd.read_csv('../input/kakr-4th-competition/train.csv')
test_df = pd.read_csv('../input/kakr-4th-competition/test.csv')

In [None]:
train_df.drop(['id'], axis=1, inplace=True)
test_df.drop(['id'], axis=1, inplace=True)

First, I converted the target 'income' into True/False form, then separated X and y.

In [None]:
y = train_df['income'] != '<=50K'
X = train_df.drop(['income'], axis=1)

Next, the labels are encoded with an Ordinal Encoder.

In [None]:
LE_encoder = OrdinalEncoder(list(X.columns))

X = LE_encoder.fit_transform(X, y)
test_df = LE_encoder.transform(test_df)
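
To make the idea of ordinal encoding concrete, here is a plain-pandas sketch. It is not the `category_encoders` implementation: the toy frames, the column values, the choice to start codes at 1, and the use of -1 for categories unseen during fitting are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy frames; the values are made up for illustration.
train_demo = pd.DataFrame({"workclass": ["Private", "State-gov", "Private", "Self-emp"]})
test_demo = pd.DataFrame({"workclass": ["State-gov", "Never-worked"]})

# Learn the category -> integer mapping from the training data only.
categories = pd.unique(train_demo["workclass"])
mapping = {cat: code for code, cat in enumerate(categories, start=1)}

# Apply the same mapping to both sets; unseen test categories become -1.
train_codes = train_demo["workclass"].map(mapping)
test_codes = test_demo["workclass"].map(mapping).fillna(-1).astype(int)

print(train_codes.tolist())  # [1, 2, 1, 3]
print(test_codes.tolist())   # [2, -1]
```

The key point is that the mapping is fitted on the training set and then reused unchanged on the test set, which mirrors the `fit_transform` / `transform` split in the cell above.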

Once the encoding is done, the data looks like this.

In [None]:
X['income'] = y
X.head(5)

Only the 'native_country' column came out as float, so I cast it to the same type as the other columns.

In [None]:
test_df['native_country'] = test_df['native_country'].astype(np.int64)

Finally, X_train, y_train, and X_test are split out and stored.

In [None]:
y_train = X['income'].values
X_train = X.drop(['income'], axis=1).values
X_test = test_df.values

# 2. Building the First-Level Classifiers

## 1) Model Creation

We create a class called Model_Creation.  
It exists so that the five classifiers can be created quickly and uniformly.

In [None]:
class Model_Creation(object):
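    def __init__(self, clf, seed=0, params=None):
        # Copy the params so the caller's dict is not mutated, and tolerate params=None.
        params = dict(params or {})
        params['random_state'] = seed
        self.clf = clf(**params)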
        
    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)
        
    def predict(self, x):
        return self.clf.predict(x)
    
    def fit(self, x, y):
        return self.clf.fit(x, y)
    
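    def feature_importances(self, x, y):
        # Fit on the given data, then print and return the importances
        # so that callers can store them in a variable.
        importances = self.clf.fit(x, y).feature_importances_
        print(importances)
        return importances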
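
As a small usage sketch of this kind of wrapper: the `ModelWrapper` class below is a hypothetical, stripped-down stand-in for `Model_Creation`, written only to show how a classifier class and a parameter dict are combined, and why copying the dict keeps a shared parameter dict untouched.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


class ModelWrapper:
    """Hypothetical minimal stand-in for Model_Creation."""

    def __init__(self, clf, seed=0, params=None):
        params = dict(params or {})      # work on a copy of the caller's dict
        params["random_state"] = seed    # inject the seed into the copy only
        self.clf = clf(**params)         # instantiate the classifier class


shared_params = {"min_samples_split": 2}
dt_demo = ModelWrapper(DecisionTreeClassifier, seed=0, params=shared_params)
rf_demo = ModelWrapper(RandomForestClassifier, seed=1, params=shared_params)

print(shared_params)  # the shared dict has no 'random_state' key added
print(dt_demo.clf.random_state, rf_demo.clf.random_state)
```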

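Next, we define the get_scores function.
It returns four values.  

1. Training Prediction: the out-of-fold 0/1 predictions for X_train, where each row is predicted by a model trained on the other folds. These are later combined and become the input to the Second-Level Classifier.  
2. Test Prediction: the predictions for X_test, combined across the folds. Like the training predictions, these are later passed to the Second-Level Classifier.
3. First-level Accuracy Score: the accuracy of the out-of-fold training predictions against y_train.
4. First-level F1 Score: the F1-Score of the out-of-fold training predictions against y_train.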

In [None]:
train_size = train_df.shape[0]
test_size = test_df.shape[0]
SEED = 0
NFOLDS = 5
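# Recent scikit-learn versions raise an error if random_state is set without shuffle=True.
kf = KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)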

def get_scores(clf, x_train_get, y_train_get, x_test_get):
    pred_train = np.zeros((train_size,))
    pred_test = np.zeros((test_size,))
    pred_test_kfold = np.empty((NFOLDS, test_size))
        
    for i, (train_index, val_index) in enumerate(kf.split(x_train_get)):
        x_train = x_train_get[train_index]
        y_train = y_train_get[train_index]
        x_val = x_train_get[val_index]
        
        clf.train(x_train, y_train)
        
        pred_train[val_index] = clf.predict(x_val)
        pred_test_kfold[i, :] = clf.predict(x_test_get)
        
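    # Majority vote across folds: a mean of 0/1 votes >= 0.5 means most folds predicted 1.
    # (Casting the mean directly with astype(int) would truncate e.g. 0.8 down to 0.)
    pred_test[:] = pred_test_kfold.mean(axis=0) >= 0.5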
    
    pred_train = pred_train.astype(int)
    pred_test = pred_test.astype(int)
    
    clf_acc_score = accuracy_score(y_train_get, pred_train)
    clf_f1_score = f1_score(y_train_get, pred_train)
    
    return pred_train.reshape(-1, 1), pred_test.reshape(-1, 1), clf_acc_score, clf_f1_score
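
Two details of this function are easy to miss, so here is a small sketch with made-up numbers. First, K-fold splitting assigns every training row to exactly one validation fold, which is why the out-of-fold array ends up fully filled. Second, when 0/1 votes from several folds are averaged, thresholding at 0.5 gives a majority vote, whereas a bare `astype(int)` truncates toward zero. The array sizes and vote values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# 1) Each training row lands in exactly one validation fold.
n_rows_demo = 10
times_validated = np.zeros(n_rows_demo, dtype=int)
for _, val_idx in KFold(n_splits=5).split(np.zeros((n_rows_demo, 1))):
    times_validated[val_idx] += 1
print(times_validated.tolist())  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# 2) Combining 0/1 test predictions from 5 folds (rows) for 3 test samples (columns).
fold_votes = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
])
mean_vote = fold_votes.mean(axis=0)          # [0.8, 0.2, 0.8]
truncated = mean_vote.astype(int)            # truncation: [0, 0, 0]
majority = (mean_vote >= 0.5).astype(int)    # majority vote: [1, 0, 1]
print(truncated.tolist(), majority.tolist())
```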

The parameters are defined in the most basic way for now.  
These values will need tuning later.

In [None]:
dt_params = {
    'max_depth' : 3,
    'min_samples_split' : 2
}

ada_params = {
    'n_estimators': 100,
    'learning_rate' : 0.75
}

rf_params = {
    'n_estimators' : 100,
    'min_samples_split' : 2
}

et_params = {
    'n_estimators': 100,
    'min_samples_leaf': 2,
}

gb_params = {
    'n_estimators': 100,
    'min_samples_leaf': 2,
}
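
One possible way to approach the tuning mentioned above is a cross-validated grid search. The sketch below is not part of this kernel's pipeline: it uses synthetic data, and the candidate values in `param_grid_demo` are arbitrary assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the competition dataset).
X_tune, y_tune = make_classification(n_samples=200, n_features=6, random_state=0)

# Hypothetical candidate values; a real search would use a grid suited to the data.
param_grid_demo = {"max_depth": [2, 3, 5], "min_samples_split": [2, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid_demo,
    scoring="f1",  # same metric family as the F1-Score reported in this kernel
    cv=3,
)
search.fit(X_tune, y_tune)
print(search.best_params_)
```

The resulting `best_params_` dict has the same shape as `dt_params`, so it could in principle be passed to the wrapper class in its place.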

The five models are created as shown below, and their results are stored in variables.

In [None]:
dt_model = Model_Creation(clf=DecisionTreeClassifier, seed=SEED, params=dt_params)
rf_model = Model_Creation(clf = RandomForestClassifier, seed = SEED, params = rf_params)
et_model = Model_Creation(clf = ExtraTreesClassifier, seed = SEED, params = et_params)
ada_model = Model_Creation(clf = AdaBoostClassifier, seed = SEED, params = ada_params)
gb_model = Model_Creation(clf = GradientBoostingClassifier, seed = SEED, params = gb_params)

In [None]:
dt_train_result, dt_test_result, dt_acc_score, dt_f1_score = get_scores(clf=dt_model, x_train_get=X_train, y_train_get=y_train, x_test_get=X_test)
rf_train_result, rf_test_result, rf_acc_score, rf_f1_score = get_scores(clf=rf_model, x_train_get=X_train, y_train_get=y_train, x_test_get=X_test)
et_train_result, et_test_result, et_acc_score, et_f1_score = get_scores(clf=et_model, x_train_get=X_train, y_train_get=y_train, x_test_get=X_test)
ada_train_result, ada_test_result, ada_acc_score, ada_f1_score = get_scores(clf=ada_model, x_train_get=X_train, y_train_get=y_train, x_test_get=X_test)
gb_train_result, gb_test_result, gb_acc_score, gb_f1_score = get_scores(clf=gb_model, x_train_get=X_train, y_train_get=y_train, x_test_get=X_test)

## 2) Performance & Feature Importance

Let's check the performance of the First-level Classifiers.

In [None]:
print('Accuracy score of DecisionTreeClassifier :', round(dt_acc_score, 4) * 100, '%')
print('F1-Score of DecisionTreeClassifier :', round(dt_f1_score, 4) * 100)

print('Accuracy score of RandomForestClassifier :', round(rf_acc_score, 4) * 100, '%')
print('F1-Score of RandomForestClassifier :', round(rf_f1_score, 4) * 100)

print('Accuracy score of ExtraTreesClassifier :', round(et_acc_score, 4) * 100, '%')
print('F1-Score of ExtraTreesClassifier :', round(et_f1_score, 4) * 100)

print('Accuracy score of AdaBoost :', round(ada_acc_score, 4) * 100, '%')
print('F1-Score of AdaBoost :', round(ada_f1_score, 4) * 100)

print('Accuracy score of Gradient Boosting Machine :', round(gb_acc_score, 4) * 100, '%')
print('F1-Score of Gradient Boosting Machine :', round(gb_f1_score, 4) * 100)

Let's see which features each First-level Classifier weighted most heavily during training.

In [None]:
dt_features = dt_model.feature_importances(X_train, y_train)
rf_features = rf_model.feature_importances(X_train, y_train)
et_features = et_model.feature_importances(X_train, y_train)
ada_features = ada_model.feature_importances(X_train, y_train)
gb_features = gb_model.feature_importances(X_train, y_train)

In [None]:
dt_features = [0.00158387, 0, 0, 0, 0.21968882, 0.52720064, 0, 0, 0, 0, 0.25152667, 0, 0, 0]

rf_features = [0.14884607, 0.03715319, 0.1672759,  0.0335612, 0.08463101, 0.09981356,
 0.07028089, 0.07831672, 0.0139579, 0.01390901, 0.11619356, 0.03545653,
 0.08397304, 0.01663142]

et_features = [0.09821158, 0.03551721, 0.03277514, 0.05533726, 0.14090152, 0.13328737,
 0.05471377, 0.14526666, 0.01215226, 0.04887652, 0.13526527, 0.0378438,
 0.06095664, 0.00889499]

ada_features = [0.11, 0.02, 0.02, 0.01, 0.07, 0.06, 0.17, 0.08, 0.01, 0.04, 0.21, 0.17, 0.03, 0]

gb_features = [0.06090052, 0.00509677, 0.00435494, 0.00256431, 0.19704072, 0.38849472,
 0.02393962, 0.00597061, 0.00107229, 0.00193148, 0.20452907, 0.06179946,
 0.04085863, 0.00144688]

In [None]:
X_tmp = X.drop(['income'], axis=1)
X_tmp.head(5)

In [None]:
feature_df = pd.DataFrame({
    'Features' : X_tmp.columns.values,
    'DT_feat_importances_' : dt_features,
    'RF_feat_importances_' : rf_features,
    'ET_feat_importances_' : et_features,
    'ADA_feat_importances_' : ada_features,
    'GB_feat_importances_' : gb_features
})

feature_df gathers the importances from each classifier.

In [None]:
feature_df.head(5)

Let's visualize this with Seaborn.

In [None]:
def plot_feature_importance(clf_features):
    plt.figure(figsize = (20, 12))
    feat_plot = sns.barplot(data=feature_df, x='Features', y=clf_features)
    feat_plot.set_title(clf_features, fontdict={'fontsize' : 20})
    for p in feat_plot.patches:
        feat_plot.annotate(format(p.get_height(), '.2f'),
                        (p.get_x() + p.get_width() / 2., p.get_height()),
                         ha = 'center', va = 'center',
                         xytext = (0, 9),  # place the label 9 points above the bar top
                         textcoords = 'offset points')

In [None]:
plot_feature_importance('DT_feat_importances_')

In [None]:
plot_feature_importance('RF_feat_importances_')

In [None]:
plot_feature_importance('ET_feat_importances_')

In [None]:
plot_feature_importance('ADA_feat_importances_')

In [None]:
plot_feature_importance('GB_feat_importances_')

# 3. Building the Second-Level Classifier

In [None]:
first_level_pred = pd.DataFrame({
    'DecisionTree' : dt_train_result.ravel(),
    'RandomForest' : rf_train_result.ravel(),
    'ExtraTrees' : et_train_result.ravel(),
    'AdaBoost' : ada_train_result.ravel(),
    'GBM' : gb_train_result.ravel()
    
})
first_level_pred.head()

The values below show the consensus among the five classifiers.

In [None]:
first_level_pred.value_counts()
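
Beyond counting each combination, the consensus can also be summarized as an agreement rate. The sketch below uses a made-up three-classifier frame (the values are assumptions, not results from this kernel) to show how many rows are unanimous and how many classifiers vote 1 on each row.

```python
import pandas as pd

# Hypothetical 0/1 predictions from three classifiers on four samples.
preds_demo = pd.DataFrame({
    "DecisionTree": [0, 1, 1, 0],
    "RandomForest": [0, 1, 0, 0],
    "GBM":          [0, 1, 1, 1],
})

# A row is unanimous when all classifiers output the same label.
unanimous = preds_demo.nunique(axis=1) == 1
print(unanimous.mean())                    # 0.5 (2 of 4 rows are unanimous)

# Number of classifiers voting 1 on each row.
print(preds_demo.sum(axis=1).tolist())     # [0, 3, 2, 1]
```

Rows without unanimity are the ones where the Second-level Classifier has something to decide.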

Now these predictions are bundled together and fed into the Second-level Classifier.

In [None]:
X_train_secondlevel = np.concatenate((dt_train_result, rf_train_result, et_train_result,
                                     ada_train_result, gb_train_result), axis=1)
X_test_secondlevel = np.concatenate((dt_test_result, rf_test_result, et_test_result,
                                     ada_test_result, gb_test_result), axis=1)

In [None]:
xgb_model = xgb.XGBClassifier(
    n_estimators = 100,
    max_depth = 4,
    min_child_weight = 2,
    objective = 'binary:logistic').fit(X_train_secondlevel, y_train)
final_pred = xgb_model.predict(X_test_secondlevel)

In [None]:
sample_submission = pd.read_csv('/kaggle/input/kakr-4th-competition/sample_submission.csv')
sample_submission['prediction'] = final_pred.astype(int)
sample_submission.to_csv('submission.csv', index=False)

# 4. References

1. [[KaKr] Exploratory Data Analysis (EDA) Explanation + Examples](https://www.kaggle.com/subinium/kakr-eda)  
2. [Hi Kaggle~ EDA + LightGBM + PyCaret](https://www.kaggle.com/teddylee777/eda-lightgbm-pycaret)  
3. [Introduction to Ensembling/Stacking in Python](https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python)