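# Ensemble - RandomForest & ExtraTrees
- Bagging ensemble -> bootstrap samples (random sampling with replacement) + the same base model (decision tree, DT)
    - Representative algorithm: RandomForest (Classifier/Regressor)
- Pasting-style ensemble -> random samples without replacement + the same base model (DT)
    - Representative algorithm: ExtraTrees (Classifier/Regressor). In scikit-learn, `bootstrap=False` is the default, so each tree is fit on the whole training set and the randomness comes from randomly drawn split thresholds.

[Goal] Wine classification => binary classification into classes 0 and 1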
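
The difference between sampling with and without replacement can be sketched with NumPy alone. This is a minimal illustration on ten made-up row indices (the sizes and the seed are assumptions), not how scikit-learn implements its forests internally.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n_samples = 10
indices = np.arange(n_samples)

# Bagging: draw n_samples indices WITH replacement (duplicates are possible).
bagging_idx = rng.choice(indices, size=n_samples, replace=True)

# Pasting: draw a subset WITHOUT replacement (no duplicates).
pasting_idx = rng.choice(indices, size=7, replace=False)

# Rows never drawn by the bootstrap sample are "out-of-bag" for that tree.
oob_idx = np.setdiff1d(indices, bagging_idx)

print("bagging   :", np.sort(bagging_idx))
print("pasting   :", np.sort(pasting_idx))
print("out-of-bag:", oob_idx)
```

The out-of-bag rows are what `oob_score=True` uses for validation in a bootstrap-based forest.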

## [1] Module Loading and Data Preparation

In [79]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

In [80]:
wine_df = pd.read_csv('../DATA/wine.csv')

In [81]:
wine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   alcohol  6497 non-null   float64
 1   sugar    6497 non-null   float64
 2   pH       6497 non-null   float64
 3   class    6497 non-null   float64
dtypes: float64(4)
memory usage: 203.2 KB


In [82]:
wine_df.head(2)

|   | alcohol | sugar | pH   | class |
|---|---------|-------|------|-------|
| 0 | 9.4     | 1.9   | 3.51 | 0.0   |
| 1 | 9.8     | 2.6   | 3.2  | 0.0   |


In [83]:
wine_df['class'].unique()

array([0., 1.])

In [84]:
# Class distribution of the target/label

wine_df['class'].value_counts()

class
1.0    4898
0.0    1599
Name: count, dtype: int64

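The classes are imbalanced (4898 vs. 1599), so there are two ways to even them out:
- Downsampling (reduce class 1.0 to the count of class 0.0)
- Upsampling (increase class 0.0 to the count of class 1.0)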
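
Both options can be sketched with `sklearn.utils.resample`. The tiny `toy_df` below is made-up data standing in for `wine_df` (an assumption for illustration); in practice the resampling would normally be applied to the training split only, so that the test set keeps the original class ratio.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data standing in for wine_df (hypothetical values, same 'class' column name).
toy_df = pd.DataFrame({
    "sugar": [1.9, 2.6, 2.3, 1.8, 6.1, 7.0, 5.5, 8.1, 4.2, 3.9],
    "class": [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
})
minority = toy_df[toy_df["class"] == 0.0]
majority = toy_df[toy_df["class"] == 1.0]

# Downsampling: shrink the majority class to the minority size (no replacement).
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=1)
down_df = pd.concat([minority, majority_down])

# Upsampling: grow the minority class by sampling with replacement.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=1)
up_df = pd.concat([majority, minority_up])

print(down_df["class"].value_counts())
print(up_df["class"].value_counts())
```

A lighter alternative that avoids resampling is passing `class_weight='balanced'` to the classifier.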

In [85]:
wine_df.describe()

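|       | alcohol   | sugar    | pH       | class    |
|-------|-----------|----------|----------|----------|
| count | 6497.0    | 6497.0   | 6497.0   | 6497.0   |
| mean  | 10.491801 | 5.443235 | 3.218501 | 0.753886 |
| std   | 1.192712  | 4.757804 | 0.160787 | 0.430779 |
| min   | 8.0       | 0.6      | 2.72     | 0.0      |
| 25%   | 9.5       | 1.8      | 3.11     | 1.0      |
| 50%   | 10.3      | 3.0      | 3.21     | 1.0      |
| 75%   | 11.3      | 8.1      | 3.32     | 1.0      |
| max   | 14.9      | 65.8     | 4.01     | 1.0      |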


To do: train with scaled and unscaled features and compare the results.
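
One way to run that comparison is to wrap the scaler and the model in a pipeline and score both variants on the same split. The sketch below uses synthetic data from `make_classification` rather than the wine data (an assumption, so the scores are illustrative only). Tree-based models split on feature order rather than magnitude, so the two scores are expected to be nearly the same.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-feature binary problem (stand-in for alcohol / sugar / pH).
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=1)
X[:, 1] *= 100  # put one feature on a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Same forest, once on raw features and once behind a StandardScaler.
raw_model = RandomForestClassifier(n_estimators=50, random_state=7)
scaled_model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=7),
)

raw_score = raw_model.fit(X_tr, y_tr).score(X_te, y_te)
scaled_score = scaled_model.fit(X_tr, y_tr).score(X_te, y_te)
print(f"raw: {raw_score:.3f}, scaled: {scaled_score:.3f}")
```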

## [2] Training Preparation

In [86]:
# Split into training and test datasets

from sklearn.model_selection import train_test_split

In [87]:
feature_df = wine_df[wine_df.columns[:-1]]
target_sr = wine_df[wine_df.columns[-1]]

print(f'feature_df : {feature_df.shape}, target_sr : {target_sr.shape}')

feature_df : (6497, 3), target_sr : (6497,)


In [88]:
# Split into training and test datasets (stratified on the target)
x_train, x_test , y_train, y_test = train_test_split(feature_df, target_sr, test_size=0.2, stratify=target_sr, random_state=1)

In [89]:
print(f'x_train : {x_train.shape}, y_train : {y_train.shape}')
print(f'x_test : {x_test.shape} , y_test : {y_test.shape}')

x_train : (5197, 3), y_train : (5197,)
x_test : (1300, 3) , y_test : (1300,)


## [3] Training

In [90]:
# Learning type : supervised learning - classification
# Algorithm : ensemble - bagging - RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

In [91]:
# Create the instance
# - Each internal DT gets its own training data (100 trees by default; 300 here)
# - random_state fixes the random draws so the result is reproducible
# - oob_score: use the rows left out of each bootstrap sample for validation
 
# lf_model = RandomForestClassifier(random_state=7,oob_score=True)
lf_model = ExtraTreesClassifier(random_state=7, n_estimators=300) # n_estimators: number of trees to build (default: 100)


# Train
lf_model.fit(x_train, y_train)

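After switching to ExtraTreesClassifier, the dataset is too small to show a clear performance difference.

-> Increase `n_estimators` (`n_iter` applies to `RandomizedSearchCV`, not to the forest itself)

--> Training is fast, but the score did not improve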
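
To compare the two ensembles on equal footing, one option is cross-validation with a simple timer. The sketch below runs on synthetic data from `make_classification` (an assumption; it is not the wine dataset), so the accuracy and timing it prints are illustrative only.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 3-feature binary problem as a stand-in for the wine data.
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=1)

results = {}
for name, model in [
    ("RandomForest", RandomForestClassifier(n_estimators=100, random_state=7)),
    ("ExtraTrees", ExtraTreesClassifier(n_estimators=100, random_state=7)),
]:
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
    elapsed = time.perf_counter() - start
    results[name] = (scores.mean(), elapsed)
    print(f"{name}: mean accuracy={scores.mean():.3f}, time={elapsed:.2f}s")
```

ExtraTrees draws split thresholds at random instead of searching for the best one, which is why it tends to train faster per tree.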

In [92]:
# Fitted model attributes
print(f'classes_ : {lf_model.classes_}')
print(f'n_classes_ : {lf_model.n_classes_}')
print()
print(f'feature_names_in_ : {lf_model.feature_names_in_}')
print(f'n_features_in_ : {lf_model.n_features_in_}')
print(f'feature_importances_ : {lf_model.feature_importances_}')


classes_ : [0. 1.]
n_classes_ : 2

feature_names_in_ : ['alcohol' 'sugar' 'pH']
n_features_in_ : 3
feature_importances_ : [0.19251569 0.52909084 0.27839348]


In [93]:
# print(f'oob_score_ : {lf_model.oob_score_}')
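
The `oob_score_` attribute exists only when the forest is built with `oob_score=True`, which in turn requires bootstrap sampling. `RandomForestClassifier` bootstraps by default, while `ExtraTreesClassifier` needs `bootstrap=True` set explicitly. A minimal sketch on synthetic data (an assumption, not the wine dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=1)

# RandomForest bootstraps by default, so oob_score=True is enough.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=7)
rf.fit(X, y)
print(f"RandomForest oob_score_ : {rf.oob_score_:.3f}")

# ExtraTrees needs bootstrap=True before out-of-bag scoring is possible.
et = ExtraTreesClassifier(n_estimators=100, bootstrap=True,
                          oob_score=True, random_state=7)
et.fit(X, y)
print(f"ExtraTrees oob_score_   : {et.oob_score_:.3f}")

# With the default bootstrap=False, asking for an OOB score is rejected.
oob_rejected = False
try:
    ExtraTreesClassifier(oob_score=True, random_state=7).fit(X, y)
except ValueError as err:
    oob_rejected = True
    print(f"ValueError: {err}")
```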

In [94]:
# print(f'estimators_samples_ : {lf_model.estimators_samples_}')


In [95]:
print(f'estimator_ : {lf_model.estimator_}')

for est in lf_model.estimators_:
    print(est)


estimator_ : ExtraTreeClassifier()
ExtraTreeClassifier(random_state=327741615)
ExtraTreeClassifier(random_state=976413892)
ExtraTreeClassifier(random_state=1202242073)
ExtraTreeClassifier(random_state=1369975286)
ExtraTreeClassifier(random_state=1882953283)
ExtraTreeClassifier(random_state=2053951699)
ExtraTreeClassifier(random_state=959775639)
ExtraTreeClassifier(random_state=1956722279)
ExtraTreeClassifier(random_state=2052949340)
ExtraTreeClassifier(random_state=1322904761)
ExtraTreeClassifier(random_state=165338510)
ExtraTreeClassifier(random_state=1133316631)
ExtraTreeClassifier(random_state=4812360)
ExtraTreeClassifier(random_state=372560217)
ExtraTreeClassifier(random_state=309457262)
ExtraTreeClassifier(random_state=1801189930)
ExtraTreeClassifier(random_state=1152936666)
ExtraTreeClassifier(random_state=68334472)
ExtraTreeClassifier(random_state=2146978983)
ExtraTreeClassifier(random_state=119248870)
ExtraTreeClassifier(random_state=769786948)
ExtraTreeClassifier(random_state=15

## [4] Performance Evaluation

In [96]:
train_score = lf_model.score(x_train, y_train)
test_score = lf_model.score(x_test, y_test)

print(f'train_score : {train_score}, test_score : {test_score}')

train_score : 0.9973061381566288, test_score : 0.8992307692307693


-> Overfitting: training accuracy (about 0.997) is far above test accuracy (about 0.899)
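
A common way to narrow such a gap is to limit how far each tree can grow, for example with `min_samples_leaf`. The sketch below measures the train/test gap for a few leaf sizes on noisy synthetic data (an assumption; the values 1, 5, and 20 are arbitrary choices for illustration, not tuned settings for the wine data).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# flip_y adds label noise so that fully grown trees can memorize it.
X, y = make_classification(n_samples=2000, n_features=3, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

gaps = {}
for leaf in (1, 5, 20):
    model = ExtraTreesClassifier(n_estimators=100, min_samples_leaf=leaf,
                                 random_state=7)
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    gaps[leaf] = train_acc - test_acc  # larger gap suggests more overfitting
    print(f"min_samples_leaf={leaf:2d}  train={train_acc:.3f}  "
          f"test={test_acc:.3f}  gap={gaps[leaf]:.3f}")
```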

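## [5] Tuning
- RandomizedSearchCV: a hyperparameter optimization class
    - Well suited to hyperparameters with a wide search range
    - Samples hyperparameter combinations from the given ranges a fixed number of times (`n_iter`)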
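
The sampling behavior can be sketched as follows. Here the ranges are given as `scipy.stats.randint` distributions and the search runs on synthetic data with a small `n_iter` and `cv` to keep it fast; the data, the distributions, and the budget are all assumptions for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=1)

# randint(low, high) samples integers in [low, high - 1].
param_distributions = {
    "max_depth": randint(2, 16),
    "min_samples_leaf": randint(5, 16),
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=7),
    param_distributions=param_distributions,
    n_iter=5,         # number of sampled combinations
    cv=3,             # folds per combination
    random_state=7,   # makes the sampled combinations repeatable
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Only `n_iter` combinations are evaluated, so the cost stays fixed no matter how wide the ranges are, at the price of possibly missing the best combination.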

In [97]:
# Load the module
from sklearn.model_selection import RandomizedSearchCV

In [98]:
# Hyperparameter search space for RandomForestClassifier
params = {'max_depth' : range(2,16), 
          'min_samples_leaf' : range(5,16),
          'criterion' : ['gini','entropy','log_loss']}

In [99]:
rf_model = RandomForestClassifier(random_state=7)

In [100]:
search_cv = RandomizedSearchCV(rf_model, param_distributions=params, n_iter=50, verbose=4) 
# verbose: print progress for each fit

In [101]:
search_cv.fit(x_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV 1/5] END criterion=gini, max_depth=4, min_samples_leaf=5;, score=0.842 total time=   0.1s
[CV 2/5] END criterion=gini, max_depth=4, min_samples_leaf=5;, score=0.838 total time=   0.1s
[CV 3/5] END criterion=gini, max_depth=4, min_samples_leaf=5;, score=0.854 total time=   0.1s
[CV 4/5] END criterion=gini, max_depth=4, min_samples_leaf=5;, score=0.858 total time=   0.1s
[CV 5/5] END criterion=gini, max_depth=4, min_samples_leaf=5;, score=0.848 total time=   0.1s
[CV 1/5] END criterion=gini, max_depth=4, min_samples_leaf=10;, score=0.841 total time=   0.1s
[CV 2/5] END criterion=gini, max_depth=4, min_samples_leaf=10;, score=0.838 total time=   0.1s
[CV 3/5] END criterion=gini, max_depth=4, min_samples_leaf=10;, score=0.857 total time=   0.1s
[CV 4/5] END criterion=gini, max_depth=4, min_samples_leaf=10;, score=0.859 total time=   0.1s
[CV 5/5] END criterion=gini, max_depth=4, min_samples_leaf=10;, score=0.838 total time= 

In [102]:
# Search results
print(f'[search_cv.best_score_] {search_cv.best_score_}')
print(f'[search_cv.best_params_] {search_cv.best_params_}')
print(f'[search_cv.best_estimator_] {search_cv.best_estimator_}')

cv_result_df = pd.DataFrame(search_cv.cv_results_)
cv_result_df

[search_cv.best_score_] 0.8776253053972015
[search_cv.best_params_] {'min_samples_leaf': 5, 'max_depth': 15, 'criterion': 'log_loss'}
[search_cv.best_estimator_] RandomForestClassifier(criterion='log_loss', max_depth=15, min_samples_leaf=5,
                       random_state=7)


|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_min_samples_leaf | param_max_depth | param_criterion | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.161218 | 0.008098 | 0.01438 | 0.002386 | 5 | 4 | gini | {'min_samples_leaf': 5, 'max_depth': 4, 'crite... | 0.842308 | 0.8375 | 0.853705 | 0.857555 | 0.847931 | 0.8478 | 0.007298 | 38 |
| 1 | 0.161324 | 0.010435 | 0.015474 | 0.001294 | 10 | 4 | gini | {'min_samples_leaf': 10, 'max_depth': 4, 'crit... | 0.841346 | 0.838462 | 0.856593 | 0.85948 | 0.838306 | 0.846837 | 0.009253 | 40 |
| 2 | 0.280169 | 0.01103 | 0.016553 | 0.004843 | 7 | 12 | log_loss | {'min_samples_leaf': 7, 'max_depth': 12, 'crit... | 0.885577 | 0.846154 | 0.881617 | 0.885467 | 0.875842 | 0.874931 | 0.014819 | 3 |
| 3 | 0.170349 | 0.011115 | 0.012898 | 0.00711 | 12 | 4 | log_loss | {'min_samples_leaf': 12, 'max_depth': 4, 'crit... | 0.834615 | 0.832692 | 0.852743 | 0.85948 | 0.843118 | 0.84453 | 0.010309 | 42 |
| 4 | 0.247851 | 0.006112 | 0.013522 | 0.004502 | 13 | 14 | entropy | {'min_samples_leaf': 13, 'max_depth': 14, 'cri... | 0.872115 | 0.8375 | 0.877767 | 0.884504 | 0.871992 | 0.868776 | 0.016297 | 12 |
| 5 | 0.210356 | 0.003023 | 0.021788 | 0.003238 | 15 | 8 | log_loss | {'min_samples_leaf': 15, 'max_depth': 8, 'crit... | 0.872115 | 0.836538 | 0.873917 | 0.881617 | 0.866218 | 0.866081 | 0.015569 | 27 |
| 6 | 0.233443 | 0.006557 | 0.0152 | 0.005801 | 15 | 9 | log_loss | {'min_samples_leaf': 15, 'max_depth': 9, 'crit... | 0.870192 | 0.838462 | 0.876805 | 0.883542 | 0.865255 | 0.866851 | 0.015475 | 24 |
| 7 | 0.134416 | 0.009539 | 0.009629 | 0.007876 | 8 | 2 | gini | {'min_samples_leaf': 8, 'max_depth': 2, 'crite... | 0.755769 | 0.767308 | 0.770934 | 0.753609 | 0.753609 | 0.760246 | 0.007379 | 46 |
| 8 | 0.243205 | 0.006625 | 0.009819 | 0.003778 | 14 | 12 | entropy | {'min_samples_leaf': 14, 'max_depth': 12, 'cri... | 0.875 | 0.839423 | 0.875842 | 0.883542 | 0.871992 | 0.86916 | 0.015349 | 11 |
| 9 | 0.230566 | 0.007932 | 0.017444 | 0.004711 | 13 | 12 | gini | {'min_samples_leaf': 13, 'max_depth': 12, 'cri... | 0.875 | 0.833654 | 0.871992 | 0.882579 | 0.870067 | 0.866659 | 0.017044 | 25 |
