#### Ensemble - RandomForest & ExtraTree
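- Bagging ensemble => random samples drawn with replacement (bootstrap) + the same base model (DT)
    * Representative algorithm: RandomForestClassifier / RandomForestRegressor
- Pasting-style ensemble => random samples drawn without replacement + the same base model (DT)
    * Representative algorithm: ExtraTreesClassifier / ExtraTreesRegressor (in scikit-learn, `bootstrap=False` by default, so each tree is trained on the full training set and split thresholds are chosen at random)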
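
The difference between the two ensemble styles can be checked directly on the estimators. The following is a minimal sketch that assumes synthetic data from `make_classification` (not the wine dataset used in this notebook) and a reduced `n_estimators=50` to keep it quick; the variable names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the wine dataset)
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Same base model (decision trees), different sampling and split strategies
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
et = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

for name, model in [("RandomForest", rf), ("ExtraTrees", et)]:
    # bootstrap=True means each tree sees a sample drawn with replacement
    print(f"{name}: bootstrap={model.bootstrap}, "
          f"test accuracy={model.score(X_test, y_test):.3f}")
```

Which of the two scores higher depends on the data, so the accuracies printed here say nothing about the wine dataset.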

[Goal] Wine classification => classify wines into 2 classes (0 and 1)

[1] Loading modules and preparing the data

In [40]:
# Load modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [41]:
# Path to the data file
DATA_FILE='../DATA/wine.csv'

# CSV => DataFrame
wineDF=pd.read_csv(DATA_FILE)

In [42]:
# Class distribution of the target/label
wineDF['class'].value_counts()

class
1.0    4898
0.0    1599
Name: count, dtype: int64

In [43]:
for i in wineDF.columns:
    print(f"Unique values of {i} : {wineDF[i].unique()}\n")

Unique values of alcohol : [ 9.4         9.8        10.          9.5        10.5         9.2
  9.9         9.1         9.3         9.          9.7        10.1
 10.6         9.6        10.8        10.3        13.1        10.2
 10.9        10.7        12.9        10.4        13.         14.
 11.5        11.4        12.4        11.         12.2        12.8
 12.6        12.5        11.7        11.3        12.3        12.
 11.9        11.8         8.7        13.3        11.2        11.6
 11.1        13.4        12.1         8.4        12.7        14.9
 13.2        13.6        13.5        10.03333333  9.55        8.5
 11.06666667  9.56666667 10.55        8.8        13.56666667 11.95
  9.95        9.23333333  9.25        9.05       10.75        8.6
  8.9        13.9        13.7         8.         14.2        11.94
 12.89333333 11.46666667 10.98       11.43333333 10.53333333  9.53333333
 10.93333333 11.36666667 11.33333333  9.73333333 11.05        9.75
 11.35       11.45       14.05       12.33333333 12.75

[2] Preparing for training

In [44]:
# Split into features (independent variables) and target/label (dependent variable)
from sklearn.model_selection import train_test_split

featureDF=wineDF[wineDF.columns[:-1]]
targetSR=wineDF[wineDF.columns[-1]]

print(f'featureDF : {featureDF.shape}  targetSR : {targetSR.shape}')

featureDF : (6497, 3)  targetSR : (6497,)


In [45]:
# Split into training and test datasets
X_train, X_test, y_train, y_test=train_test_split(featureDF, targetSR, test_size=0.2 , stratify=targetSR , random_state=1)
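
The `stratify=targetSR` argument keeps the class ratio the same in both splits, which matters here because the classes are imbalanced (4898 vs 1599). The sketch below illustrates the effect under an assumption: the labels are rebuilt from the class counts shown earlier, and the single feature column is only a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the class counts above (1599 zeros, 4898 ones);
# the feature column is a placeholder, not real wine data.
y = np.array([0] * 1599 + [1] * 4898)
X = np.arange(len(y)).reshape(-1, 1)

# Split once with stratification and once without, using the same seed
_, _, y_tr_strat, y_te_strat = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)
_, _, y_tr_plain, y_te_plain = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# The mean of a 0/1 label vector is the share of class 1
print(f"overall share of class 1  : {y.mean():.4f}")
print(f"stratified train / test   : {y_tr_strat.mean():.4f} / {y_te_strat.mean():.4f}")
print(f"unstratified train / test : {y_tr_plain.mean():.4f} / {y_te_plain.mean():.4f}")
```

With stratification the share of class 1 in each split stays very close to the overall share; without it the shares can drift apart by chance.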

In [46]:
print(f'X_train : {X_train.shape}  y_train : {y_train.shape}')
print(f'X_test : {X_test.shape}  y_test : {y_test.shape}')

X_train : (5197, 3)  y_train : (5197,)
X_test : (1300, 3)  y_test : (1300,)


[3] Training
- Learning type: supervised learning > classification
- Algorithm: ensemble > bagging - RandomForestClassifier

In [47]:
from sklearn.ensemble import RandomForestClassifier

In [48]:
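# Create the instance => by default the forest holds 100 internal DT models, each trained on its own bootstrap sample.
#                        random_state fixes the sampling so the run is reproducible.
#                        oob_score=True : the samples left out of each bootstrap sample (out-of-bag) are used for validation
lf_model=RandomForestClassifier(random_state=7, oob_score=True)

# Train
lf_model.fit(X_train, y_train)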

In [49]:
# Attributes learned during fit
print(f'classes_ : {lf_model.classes_}')
print(f'n_classes_ : {lf_model.n_classes_}')
print()
print(f'feature_names_in_ : {lf_model.feature_names_in_}')
print(f'n_features_in_ : {lf_model.n_features_in_}')
print(f'feature_importances_ : {lf_model.feature_importances_}')

classes_ : [0. 1.]
n_classes_ : 2

feature_names_in_ : ['alcohol' 'sugar' 'pH']
n_features_in_ : 3
feature_importances_ : [0.23572103 0.49995154 0.26432743]


In [50]:
# Base estimator and the fitted trees
print(f'estimator_ : {lf_model.estimator_}')
for est in lf_model.estimators_:
    print(est)

estimator_ : DecisionTreeClassifier()
DecisionTreeClassifier(max_features='sqrt', random_state=327741615)
DecisionTreeClassifier(max_features='sqrt', random_state=976413892)
DecisionTreeClassifier(max_features='sqrt', random_state=1202242073)
DecisionTreeClassifier(max_features='sqrt', random_state=1369975286)
DecisionTreeClassifier(max_features='sqrt', random_state=1882953283)
DecisionTreeClassifier(max_features='sqrt', random_state=2053951699)
DecisionTreeClassifier(max_features='sqrt', random_state=959775639)
DecisionTreeClassifier(max_features='sqrt', random_state=1956722279)
DecisionTreeClassifier(max_features='sqrt', random_state=2052949340)
DecisionTreeClassifier(max_features='sqrt', random_state=1322904761)
DecisionTreeClassifier(max_features='sqrt', random_state=165338510)
DecisionTreeClassifier(max_features='sqrt', random_state=1133316631)
DecisionTreeClassifier(max_features='sqrt', random_state=4812360)
DecisionTreeClassifier(max_features='sqrt', random_state=372560217)
... (output truncated)
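
Each entry in `estimators_` is an independently fitted tree, and the forest combines them by averaging their predicted class probabilities. The sketch below checks this on synthetic data (an assumption; `forest` and `tree_mean` are illustrative names, and `n_estimators=25` is chosen only to keep it quick).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the wine dataset)
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the class probabilities predicted by each fitted tree
tree_mean = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)

# Compare the manual average with the forest's own probabilities
matches = np.allclose(tree_mean, forest.predict_proba(X))
print(f"manual average equals forest.predict_proba: {matches}")

# The predicted label is the class with the highest averaged probability
manual_pred = forest.classes_[tree_mean.argmax(axis=1)]
print(f"first five labels from the manual average: {manual_pred[:5]}")
```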

In [51]:
print(f'oob_score_ : {lf_model.oob_score_}')

oob_score_ : 0.89532422551472
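
The OOB score works because a bootstrap sample of size n leaves out roughly 1/e (about 36.8%) of the original rows, and those unseen rows act as a validation set for that tree. The following is a small simulation of one bootstrap draw using only the standard library; the sample size of 10,000 is an arbitrary choice for illustration.

```python
import math
import random

random.seed(0)
n_samples = 10_000

# One bootstrap sample: draw n_samples row indices with replacement
in_bag = set(random.choices(range(n_samples), k=n_samples))
# Rows that were never drawn are "out-of-bag" for this tree
oob = set(range(n_samples)) - in_bag

in_bag_fraction = len(in_bag) / n_samples
oob_fraction = len(oob) / n_samples

print(f"in-bag fraction : {in_bag_fraction:.3f}")
print(f"OOB fraction    : {oob_fraction:.3f}")
print(f"theory (1/e)    : {math.exp(-1):.3f}")
```

Because every tree has its own out-of-bag rows, `oob_score_` gives a validation estimate without holding out a separate dataset.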


[4] Performance evaluation

In [52]:
train_score=lf_model.score(X_train, y_train)
test_score=lf_model.score(X_test, y_test)

In [53]:
print(f'train_score : {train_score} test_score : {test_score}')

train_score : 0.9973061381566288 test_score : 0.9


[5] Tuning

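- RandomizedSearchCV: a hyperparameter optimization class
    * Well suited to hyperparameters with wide search ranges
    * Samples a specified number of hyperparameter combinations (`n_iter`) from the specified ranges and evaluates each with cross-validation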
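
The sampling behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the configuration used in this notebook: the data is synthetic, the ranges are expressed with `scipy.stats.randint`, and `n_iter`, `cv`, and `random_state` are set to small explicit values.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption; not the wine dataset)
X, y = make_classification(n_samples=400, n_features=5, random_state=0)

# Distributions are sampled from; plain lists are sampled uniformly
param_distributions = {
    "max_depth": randint(2, 15),         # integers 2..14 (upper bound exclusive)
    "min_samples_leaf": randint(5, 16),  # integers 5..15
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # number of sampled combinations
    cv=3,            # cross-validation folds per combination
    random_state=0,  # makes the sampled combinations reproducible
)
search.fit(X, y)

print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
print(f"candidates evaluated: {len(search.cv_results_['params'])}")
```

Only `n_iter` combinations are evaluated regardless of how large the search space is, which is why this approach scales better than an exhaustive grid when the ranges are wide.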

In [54]:
# Load modules
from sklearn.model_selection import RandomizedSearchCV

In [55]:
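# Hyperparameter search space for RandomForestClassifier
# Note: in scikit-learn, 'entropy' and 'log_loss' both select the Shannon information gain criterion
params={'max_depth': range(2,15), 'min_samples_leaf': range(5,16), 'criterion': ['gini','entropy','log_loss']}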

In [56]:
rf_model=RandomForestClassifier(random_state=7)

In [57]:
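# Defaults: n_iter=10 sampled candidates, cv=5 folds. No random_state is set, so the sampled candidates can differ between runs.
searchCV=RandomizedSearchCV(rf_model, param_distributions=params)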

In [58]:
searchCV.fit(X_train, y_train)

In [59]:
# Search results
print(f'[searchCV.best_score_] {searchCV.best_score_}')
print(f'[searchCV.best_params_] {searchCV.best_params_}')
print(f'[searchCV.best_estimator_] {searchCV.best_estimator_}')

cv_resultDF=pd.DataFrame(searchCV.cv_results_)
cv_resultDF

[searchCV.best_score_] 0.8714701636188643
[searchCV.best_params_] {'min_samples_leaf': 10, 'max_depth': 14, 'criterion': 'entropy'}
[searchCV.best_estimator_] RandomForestClassifier(criterion='entropy', max_depth=14, min_samples_leaf=10,
                       random_state=7)


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_min_samples_leaf | param_max_depth | param_criterion | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.144261 | 0.013311 | 0.005541 | 0.00642 | 7 | 2 | entropy | {'min_samples_leaf': 7, 'max_depth': 2, 'crite... | 0.753846 | 0.753846 | 0.754572 | 0.753609 | 0.753609 | 0.753896 | 0.000354 | 10 |
| 1 | 0.249571 | 0.000817 | 0.013479 | 0.006264 | 14 | 11 | log_loss | {'min_samples_leaf': 14, 'max_depth': 11, 'cri... | 0.875962 | 0.836538 | 0.876805 | 0.883542 | 0.869105 | 0.86839 | 0.01657 | 4 |
| 2 | 0.199994 | 0.021096 | 0.013318 | 0.00666 | 9 | 5 | log_loss | {'min_samples_leaf': 9, 'max_depth': 5, 'crite... | 0.852885 | 0.8375 | 0.860443 | 0.872955 | 0.856593 | 0.856075 | 0.011485 | 9 |
| 3 | 0.246482 | 0.006619 | 0.010002 | 0.00786 | 7 | 8 | log_loss | {'min_samples_leaf': 7, 'max_depth': 8, 'crite... | 0.872115 | 0.840385 | 0.876805 | 0.879692 | 0.870067 | 0.867813 | 0.014127 | 6 |
| 4 | 0.183647 | 0.001415 | 0.013035 | 0.006651 | 14 | 5 | gini | {'min_samples_leaf': 14, 'max_depth': 5, 'crit... | 0.852885 | 0.834615 | 0.866218 | 0.87488 | 0.854668 | 0.856653 | 0.01362 | 8 |
| 5 | 0.273755 | 0.007613 | 0.013421 | 0.005204 | 6 | 10 | log_loss | {'min_samples_leaf': 6, 'max_depth': 10, 'crit... | 0.873077 | 0.846154 | 0.875842 | 0.880654 | 0.873917 | 0.869929 | 0.012174 | 2 |
| 6 | 0.25222 | 0.007414 | 0.017242 | 0.001118 | 11 | 10 | gini | {'min_samples_leaf': 11, 'max_depth': 10, 'cri... | 0.873077 | 0.838462 | 0.879692 | 0.881617 | 0.869105 | 0.86839 | 0.015626 | 3 |
| 7 | 0.252373 | 0.007581 | 0.017655 | 0.00112 | 13 | 11 | entropy | {'min_samples_leaf': 13, 'max_depth': 11, 'cri... | 0.874038 | 0.838462 | 0.871992 | 0.884504 | 0.871992 | 0.868198 | 0.015576 | 5 |
| 8 | 0.275062 | 0.008895 | 0.011512 | 0.004803 | 10 | 14 | entropy | {'min_samples_leaf': 10, 'max_depth': 14, 'cri... | 0.876923 | 0.835577 | 0.879692 | 0.885467 | 0.879692 | 0.87147 | 0.018161 | 1 |
| 9 | 0.226159 | 0.014267 | 0.013663 | 0.006553 | 10 | 7 | gini | {'min_samples_leaf': 10, 'max_depth': 7, 'crit... | 0.869231 | 0.833654 | 0.876805 | 0.87488 | 0.866218 | 0.864157 | 0.015719 | 7 |
