<div class="alert alert-block" style="border: 1px solid #455A64;background-color:#ECEFF1;">
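This material and the accompanying video content are protected under Article 25(2) of the Copyright Act. Publishing or posting this content, or any portion of its text, externally is prohibited. Copyright law will be applied strictly to these materials in particular.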
</div>

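### Selecting features by importance
> The degree to which a feature contributes to computing the classification probability is called **feature importance** <br>
> Machine learning methods are sometimes applied using only the features that have a meaningful effect on the result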

### Preparing the data

In [6]:
import warnings
warnings.filterwarnings('ignore')

import pickle
with open('titanic_step3_feature_encoding.pickle', 'rb') as pickle_filename:
    df_onehot = pickle.load(pickle_filename)
with open('titanic_step3_feature_encoding_y.pickle', 'rb') as pickle_filename:
    y_train = pickle.load(pickle_filename)
ntrain = 891
X_train, X_test = df_onehot[:ntrain], df_onehot[ntrain:]
X_train.head()

Unnamed: 0,Pclass_0,Pclass_1,Pclass_2,Sex_0,Sex_1,Age_0,Age_1,Age_2,Age_3,Age_4,...,HighChance_0,HighChance_1,HighChance_2,HighChance_3,HighChance_4,HighChance_5,HighChance_6,LowChance_0,LowChance_1,LowChance_2
0,0,0,1,1,0,0,0,0,1,0,...,1,0,0,0,0,0,0,1,0,0
1,1,0,0,0,1,0,0,0,0,1,...,0,0,1,0,0,0,0,1,0,0
2,0,0,1,0,1,0,0,0,1,0,...,1,0,0,0,0,0,0,1,0,0
3,1,0,0,0,1,0,0,0,0,1,...,0,0,1,0,0,0,0,1,0,0
4,0,0,1,1,0,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,1


### Installing XGBoost and LightGBM

In [7]:
!pip install lightgbm



In [8]:
!pip install xgboost



### Importing the libraries and models

In [9]:
import numpy as np # the models may use these libraries internally
import pandas as pd

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

from sklearn.neighbors import KNeighborsClassifier             # 1. K-Nearest Neighbor(KNN)
from sklearn.linear_model import LogisticRegression            # 2. Logistic Regression
from sklearn.svm import SVC                                                # 3. SVC
from sklearn.tree import DecisionTreeClassifier                   # 4. Decision Tree
from sklearn.ensemble import RandomForestClassifier       # 5. Random Forest
from sklearn.ensemble import ExtraTreesClassifier             # 6. Extra Tree
from sklearn.ensemble import GradientBoostingClassifier  # 7. GBM
from sklearn.naive_bayes import GaussianNB                     # 8. GaussianNB
from xgboost import XGBClassifier                                     # 9. XGBoost
from lightgbm import LGBMClassifier                                 # 10. LightGBM

### Testing with default settings
> Hyperparameter tuning can optimize each model further, but with default values we can start predicting right away

In [31]:
knn_model = KNeighborsClassifier()
logreg_model = LogisticRegression()
svc_model = SVC()
decision_model = DecisionTreeClassifier()
random_model = RandomForestClassifier()
extra_model = ExtraTreesClassifier()
gbm_model = GradientBoostingClassifier()
nb_model = GaussianNB()
xgb_model = XGBClassifier(eval_metric='logloss')
lgbm_model = LGBMClassifier()

models = [
    knn_model,
    logreg_model,
    svc_model,
    decision_model,
    random_model,
    extra_model,
    gbm_model,
    nb_model,
    xgb_model,
    lgbm_model
]

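k_fold = KFold(n_splits=10, shuffle=True, random_state=0)           # use K-Fold
results = dict()
y = y_train.values.ravel()                                          # flatten the labels to a 1-D array
for alg in models:
    alg.fit(X_train, y)                                             # fit on all training rows; feature_importances_ is read from this fit later
    score = cross_val_score(alg, X_train, y, cv=k_fold, scoring='accuracy')
    results[alg.__class__.__name__] = np.mean(score)*100            # mean accuracy in percent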

In [32]:
results

{'KNeighborsClassifier': 82.04119850187267,
 'LogisticRegression': 83.38951310861424,
 'SVC': 83.05118601747814,
 'DecisionTreeClassifier': 80.46941323345817,
 'RandomForestClassifier': 82.48938826466915,
 'ExtraTreesClassifier': 82.48813982521848,
 'GradientBoostingClassifier': 83.27590511860174,
 'GaussianNB': 70.8039950062422,
 'XGBClassifier': 81.36828963795256,
 'LGBMClassifier': 81.7003745318352}
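One detail of the loop is worth making explicit: `cross_val_score` fits clones of the estimator on each fold and leaves the object passed in unfitted, so the separate `fit` call is what later makes attributes such as `feature_importances_` available. The sketch below illustrates this on synthetic data from `make_classification`, which is an assumption for illustration and not the Titanic features; `n_estimators=20` and `n_splits=5` are chosen only to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical; not the Titanic features)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=20, random_state=0)
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)

# cross_val_score fits a fresh clone per fold; `model` itself is not fitted
scores = cross_val_score(model, X, y, cv=k_fold, scoring='accuracy')
fitted_after_cv = hasattr(model, 'estimators_')

# An explicit fit is what populates the fitted attributes
model.fit(X, y)
fitted_after_fit = hasattr(model, 'estimators_')

print(len(scores), fitted_after_cv, fitted_after_fit)
```

The cross-validated score and the fitted model therefore come from different fits: the score estimates generalization, while the importances read later describe the model trained on all rows.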

### Sorting by accuracy, highest first

- How to sort a dictionary by its values

In [33]:
sorted(results.items(), key=lambda x: x[1], reverse=True) # reverse=True sorts in descending order

[('LogisticRegression', 83.38951310861424),
 ('GradientBoostingClassifier', 83.27590511860174),
 ('SVC', 83.05118601747814),
 ('RandomForestClassifier', 82.48938826466915),
 ('ExtraTreesClassifier', 82.48813982521848),
 ('KNeighborsClassifier', 82.04119850187267),
 ('LGBMClassifier', 81.7003745318352),
 ('XGBClassifier', 81.36828963795256),
 ('DecisionTreeClassifier', 80.46941323345817),
 ('GaussianNB', 70.8039950062422)]
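The sorting idiom can be tried in isolation. The following is a minimal sketch with a small hypothetical dictionary (the keys are shortened names and the values are rounded, purely for illustration); it also shows that wrapping the result in `dict` keeps the ranked order, since dictionaries preserve insertion order in Python 3.7 and later.

```python
# Hypothetical scores, shortened and rounded for illustration
scores = {'KNN': 82.04, 'LogReg': 83.39, 'NB': 70.80}

# items() yields (name, score) pairs; the key function picks the score
ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# The first pair is the best-scoring entry
best_name, best_score = ranked[0]

# Converting back to a dict keeps the ranked order
ranked_dict = dict(ranked)

print(ranked)
```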

### Computing importance using only the better-performing models

In [35]:
tree_models = [
    random_model,
    extra_model,
    gbm_model,
    xgb_model
]

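### Tree-based models measure feature importance
- While building its trees, the model quantifies how important each feature is and stores the values in feature_importances_
- These values can be used to filter out features with low importance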
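To make the two points concrete, here is a minimal sketch on synthetic data. The column names `f0` to `f5` and the mean-importance threshold are hypothetical choices for illustration. For scikit-learn forests the importances are non-negative and normalized to sum to 1, so pairing them with the column names gives a readable ranking.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical column names (not the Titanic features)
X_arr, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                               n_redundant=0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(6)])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One value per column, labeled with the column name
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))

# Keep columns at or above the mean importance (a hypothetical threshold)
selected = importance[importance >= importance.mean()].index.tolist()
print(selected)
```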

In [36]:
for alg in tree_models:
    try:
        print(alg.__class__.__name__)
        print(alg.feature_importances_)
    except AttributeError:                     # the model exposes no importances (or is not fitted)
        print(alg.__class__.__name__, "X")

RandomForestClassifier
[1.64634606e-02 1.59927779e-02 2.92367497e-02 8.41168635e-02
 6.21882942e-02 4.53343579e-03 9.96866440e-03 1.33175555e-02
 2.24452366e-02 1.72908690e-02 1.68118325e-02 5.37556682e-03
 1.34588586e-03 2.02426366e-02 1.78299371e-02 1.11396672e-02
 5.63538217e-03 3.38711418e-03 5.36842055e-03 2.37478639e-02
 8.23924718e-03 8.80315665e-03 1.88026471e-03 7.68328766e-03
 3.77614899e-03 7.57696078e-03 1.85237058e-03 2.53218230e-04
 1.80727368e-02 1.39772855e-02 9.96718372e-03 8.97076787e-02
 1.67711803e-02 2.02420954e-02 1.04966968e-02 3.28975994e-04
 2.01423397e-04 3.46910726e-03 1.62603171e-03 8.13301585e-04
 6.19674823e-04 6.25480593e-04 2.27692284e-04 5.94562335e-05
 0.00000000e+00 1.62088107e-02 1.33018797e-02 1.41840086e-02
 7.32939840e-03 6.97233749e-03 7.77096279e-03 2.65584060e-03
 2.01204089e-03 1.44945052e-03 5.63936960e-03 5.30526928e-03
 5.88032552e-03 1.50930960e-02 9.89321053e-04 4.00069244e-03
 1.19153441e-03 3.78125779e-04 4.96047646e-03 2.63380516e-03
 

- Building a DataFrame of importances for each model

In [37]:
random_model_importance = pd.DataFrame({'Feature':X_train.columns, 'random_model':random_model.feature_importances_})
extra_model_importance = pd.DataFrame({'Feature':X_train.columns, 'extra_model':extra_model.feature_importances_})
gbm_model_importance = pd.DataFrame({'Feature':X_train.columns, 'gbm_model':gbm_model.feature_importances_})
xgb_model_importance = pd.DataFrame({'Feature':X_train.columns, 'xgb_model':xgb_model.feature_importances_})

### Merging multiple DataFrames
- dataframes = [each DataFrame, ...]
- functools.reduce(lambda left, right: pd.merge(left, right, on=['shared column']), dataframes)
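The pattern can be sketched on three tiny frames. The frame names and values here are hypothetical. `reduce` merges the first two frames, then merges that result with the third, and so on; because `pd.merge` matches on the key rather than on position, the row order of each input does not matter.

```python
from functools import reduce
import pandas as pd

# Hypothetical per-model importance frames sharing a 'Feature' key
a = pd.DataFrame({'Feature': ['x', 'y'], 'model_a': [0.7, 0.3]})
b = pd.DataFrame({'Feature': ['x', 'y'], 'model_b': [0.6, 0.4]})
c = pd.DataFrame({'Feature': ['y', 'x'], 'model_c': [0.2, 0.8]})  # rows in a different order

# Equivalent to pd.merge(pd.merge(a, b, on='Feature'), c, on='Feature')
merged = reduce(lambda left, right: pd.merge(left, right, on='Feature'), [a, b, c])
print(merged)
```

Note that the default join is an inner join, so a feature missing from any one frame would be dropped from the result.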

In [38]:
from functools import reduce
data_frames = [
    random_model_importance,
    extra_model_importance,
    gbm_model_importance,
    xgb_model_importance
]
importances = reduce(lambda  left,right: pd.merge(left, right, on=['Feature']), data_frames)

In [39]:
importances.head()

Unnamed: 0,Feature,random_model,extra_model,gbm_model,xgb_model
0,Pclass_0,0.016463,0.020246,0.015749,0.003858
1,Pclass_1,0.015993,0.017342,4e-06,0.004995
2,Pclass_2,0.029237,0.030793,0.113087,0.068942
3,Sex_0,0.084117,0.073755,0.088014,0.01561
4,Sex_1,0.062188,0.073793,0.024531,0.0


- Computing the average importance of each feature

In [40]:
importances['avg'] = importances.mean(axis=1, numeric_only=True)  # skip the string 'Feature' column
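The row-wise average should cover only the numeric score columns. In recent pandas versions, a mean over rows that still include the string `Feature` column can raise a `TypeError`, so one option is to select the score columns explicitly. A minimal sketch with a hypothetical two-model frame:

```python
import pandas as pd

# Hypothetical merged importances for two models
imp = pd.DataFrame({'Feature': ['x', 'y'],
                    'model_a': [0.7, 0.3],
                    'model_b': [0.5, 0.5]})

# Every column except the label column holds a score
score_cols = imp.columns.drop('Feature')

# Average across models, one value per feature (row)
imp['avg'] = imp[score_cols].mean(axis=1)
print(imp)
```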

- Sorting by importance

In [41]:
importances = importances.sort_values(by='avg', ascending=False)

### Selecting only the high-importance features

In [44]:
importances = importances[:50]
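The slice is positional, so it yields the top 50 only because the frame was sorted in descending order first. As a sketch of the same idea on a toy frame, `DataFrame.nlargest` combines the sort and the cut in one call; the cutoff `k = 2` and the values are hypothetical.

```python
import pandas as pd

# Hypothetical averaged importances
imp = pd.DataFrame({'Feature': ['a', 'b', 'c', 'd'],
                    'avg': [0.1, 0.4, 0.2, 0.3]})

k = 2  # hypothetical cutoff

# Sort-then-slice, as in the notebook
by_slice = imp.sort_values(by='avg', ascending=False)[:k]

# One-step equivalent
top_k = imp.nlargest(k, 'avg')

print(top_k)
```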

- Building DataFrames from only the selected columns

In [45]:
train_importance = X_train[importances['Feature'].tolist()]
test_importance = X_test[importances['Feature'].tolist()]
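Indexing with a list of column names returns those columns in the order of the list, so using the same list for both frames keeps the train and test columns aligned. A minimal sketch on a toy frame, where the column names and the split point are hypothetical:

```python
import pandas as pd

# Toy combined frame, split into train and test by position
full = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [0, 0, 1, 1], 'c': [1, 1, 0, 0]})
train, test = full[:3], full[3:]

# Hypothetical list of selected columns, most important first
keep = ['c', 'a']

# The same list applied to both frames gives identical column order
train_sel = train[keep]
test_sel = test[keep]

print(train_sel.columns.tolist(), test_sel.columns.tolist())
```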

In [46]:
train_importance.head()

Unnamed: 0,Initial_0,Sex_0,Pclass_2,Sex_1,HighChance_0,LowChance_0,Cabin_0,Fare_0,Ticket_Num_Cut_2,Embarked_0,...,Age_6,Ticket_Num_Cut_3,Cabin_4,Ticket_Num_Cut_6,HighChance_3,Ticket_Initial2_2,Cabin_1,Family_3,Ticket_Initial2_15,Cabin_6
0,1,1,1,0,1,1,1,1,0,1,...,0,1,0,0,0,0,0,0,0,0
1,0,0,0,1,0,1,0,0,0,0,...,0,1,0,0,0,0,1,0,0,0
2,0,0,1,1,1,1,1,1,0,1,...,0,0,0,0,0,1,0,0,0,0
3,0,0,0,1,0,1,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
4,1,1,1,0,1,0,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0


### Applying machine learning with only the high-importance features

In [47]:
knn_model = KNeighborsClassifier()
logreg_model = LogisticRegression()
svc_model = SVC()
decision_model = DecisionTreeClassifier()
random_model = RandomForestClassifier()
extra_model = ExtraTreesClassifier()
gbm_model = GradientBoostingClassifier()
nb_model = GaussianNB()
xgb_model = XGBClassifier(eval_metric='logloss')
lgbm_model = LGBMClassifier()

models = [
    knn_model,
    logreg_model,
    svc_model,
    decision_model,
    random_model,
    extra_model,
    gbm_model,
    nb_model,
    xgb_model,
    lgbm_model
]

k_fold = KFold(n_splits=10, shuffle=True, random_state=0)           # use K-Fold
results = dict()
y = y_train.values.ravel()                                          # flatten the labels to a 1-D array
for alg in models:
    alg.fit(train_importance, y)                                    # fit on the selected features only
    score = cross_val_score(alg, train_importance, y, cv=k_fold, scoring='accuracy')
    results[alg.__class__.__name__] = np.mean(score)*100            # mean accuracy in percent

In [48]:
sorted(results.items(), key=lambda x: x[1], reverse=True) # reverse=True sorts in descending order

[('LogisticRegression', 83.27590511860174),
 ('SVC', 82.71535580524345),
 ('ExtraTreesClassifier', 82.71161048689137),
 ('GradientBoostingClassifier', 82.49063670411985),
 ('RandomForestClassifier', 82.15106117353308),
 ('LGBMClassifier', 82.14981273408239),
 ('KNeighborsClassifier', 81.92759051186019),
 ('XGBClassifier', 81.9250936329588),
 ('DecisionTreeClassifier', 80.80649188514357),
 ('GaussianNB', 77.76903870162297)]
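To see which models gained or lost from the feature selection, the two result listings can be compared directly. The sketch below copies the accuracies from the two outputs, rounded to two decimals, and prints the change per model. These are single runs, so differences well under one percentage point may simply be noise.

```python
# Accuracies (%) copied from the two result cells, rounded to two decimals
before = {'LogisticRegression': 83.39, 'GradientBoostingClassifier': 83.28,
          'SVC': 83.05, 'RandomForestClassifier': 82.49,
          'ExtraTreesClassifier': 82.49, 'KNeighborsClassifier': 82.04,
          'LGBMClassifier': 81.70, 'XGBClassifier': 81.37,
          'DecisionTreeClassifier': 80.47, 'GaussianNB': 70.80}
after = {'LogisticRegression': 83.28, 'SVC': 82.72,
         'ExtraTreesClassifier': 82.71, 'GradientBoostingClassifier': 82.49,
         'RandomForestClassifier': 82.15, 'LGBMClassifier': 82.15,
         'KNeighborsClassifier': 81.93, 'XGBClassifier': 81.93,
         'DecisionTreeClassifier': 80.81, 'GaussianNB': 77.77}

# Change in accuracy (percentage points) after keeping the top 50 features
delta = {name: round(after[name] - before[name], 2) for name in before}

# Print from largest gain to largest drop
for name, change in sorted(delta.items(), key=lambda pair: pair[1], reverse=True):
    print(f'{name:28s} {change:+.2f}')
```

In these numbers the largest change is GaussianNB, which gains about 7 points, while most other models move by less than one point in either direction.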

<div class="alert alert-block" style="border: 2px solid #E65100;background-color:#FFF3E0;padding:10px">
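<font size="4em" style="font-weight:bold;color:#BF360C;">Understanding the big picture</font><br>
<font size="4em" style="color:#BF360C;">Results may differ slightly by environment, but on my PC the best accuracy was about 83.39% with all features and about 83.28% with the top 50 features</font><br>
<font size="4em" style="color:#BF360C;">The accuracy actually looks slightly lower, but it can be improved through hyperparameter tuning</font><br>
<font size="4em" style="color:#BF360C;">For iterative performance work, dropping unnecessary columns speeds up execution, and working with only the high-importance columns can ultimately raise accuracy</font>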
</div>

In [49]:
import pickle
with open('titanic_step4_importance_train.pickle', 'wb') as pickle_filename:
    pickle.dump(train_importance, pickle_filename)
with open('titanic_step4_importance_test.pickle', 'wb') as pickle_filename:
    pickle.dump(test_importance, pickle_filename)
with open('titanic_step4_importance_train_y.pickle', 'wb') as pickle_filename:
    pickle.dump(y_train, pickle_filename)
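Before the next step relies on these files, it can be worth confirming that a DataFrame survives the round trip unchanged. The sketch below does this with a toy frame in a temporary directory; the file name `example.pickle` and the frame contents are hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy stand-in for a selected-feature frame
df = pd.DataFrame({'Sex_0': [1, 0], 'Pclass_2': [1, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.pickle')  # hypothetical file name
    # Write in binary mode, then read back in binary mode
    with open(path, 'wb') as f:
        pickle.dump(df, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# Values, dtypes, and column order should all match
round_trip_ok = restored.equals(df)
print(round_trip_ok)
```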
