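#### Cross-validation
A method that repeatedly splits a validation set off from the training set and evaluates the model on it, so that the score does not depend on a single split.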

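k-fold cross-validation: the training set is divided into k folds, and each fold serves as the validation set once while the remaining folds are used for training.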
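To make the fold mechanics concrete, here is a minimal pure-Python sketch of how k-fold splitting can partition sample indices. The helper name `k_fold_indices` is hypothetical and is not part of scikit-learn; the library's own splitters are what the cells in this notebook actually use.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for k contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


# Illustrative sizes (assumption): 10 samples, 5 folds.
folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print("validation:", val_idx, "train:", train_idx)
```

Every sample lands in exactly one validation fold, so averaging the k validation scores uses all of the training data for evaluation.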


In [1]:
import pandas as pd
wine = pd.read_csv('https://bit.ly/wine_csv_data')

In [2]:
data = wine[['alcohol', 'sugar', 'pH']].to_numpy()
target = wine['class'].to_numpy()

In [3]:
from sklearn.model_selection import train_test_split

train_input, test_input, train_target, test_target = train_test_split(
    data, target, test_size=0.2, random_state=42)

In [4]:
sub_input, val_input, sub_target, val_target = train_test_split(
    train_input, train_target, test_size=0.2, random_state=42)

In [5]:
print(sub_input.shape, val_input.shape)

(4157, 3) (1040, 3)


In [6]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt.fit(sub_input, sub_target)

print(dt.score(sub_input, sub_target))
print(dt.score(val_input, val_target))

0.9971133028626413
0.864423076923077


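#### Implementing cross-validation
- cross_validate(): the cross-validation function (5-fold by default)
  - cv parameter: selects the splitter
  - Regression models: KFold splitter by default
  - Classification models: StratifiedKFold splitter by default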
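The difference between the two splitters is easiest to see on imbalanced, sorted labels. The sketch below uses a toy label array (an assumption, not the wine data) and compares the share of class 1 in each validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy data (assumption): 100 samples, 80 of class 0 followed by 20 of class 1.
X = np.zeros((100, 1))
y = np.array([0] * 80 + [1] * 20)


def positive_rates(splitter):
    """Fraction of class-1 samples in each validation fold."""
    return [float(y[val_idx].mean()) for _, val_idx in splitter.split(X, y)]


kfold_rates = positive_rates(KFold(n_splits=5))
stratified_rates = positive_rates(StratifiedKFold(n_splits=5))

print("KFold:          ", kfold_rates)
print("StratifiedKFold:", stratified_rates)
```

Without shuffling, KFold puts all class-1 samples into the last fold, while StratifiedKFold keeps the 80:20 ratio in every fold. That is why stratification is the sensible default for classifiers.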

In [7]:
from sklearn.model_selection import cross_validate

scores = cross_validate(dt, train_input, train_target)
print(scores)

{'fit_time': array([0.01228666, 0.01009226, 0.01159787, 0.0127492 , 0.01009631]), 'score_time': array([0.00148225, 0.00142932, 0.00182366, 0.00179696, 0.00147963]), 'test_score': array([0.86923077, 0.84615385, 0.87680462, 0.84889317, 0.83541867])}


In [8]:
import numpy as np

print(scores['test_score'])
print(np.mean(scores['test_score']))

[0.86923077 0.84615385 0.87680462 0.84889317 0.83541867]
0.855300214703487


In [9]:
from sklearn.model_selection import StratifiedKFold

scores = cross_validate(dt, train_input, train_target, cv=StratifiedKFold())    # same as the cross_validate default for a classifier
print(np.mean(scores['test_score']))

0.855300214703487


In [10]:
splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(dt, train_input, train_target, cv=splitter)
print(np.mean(scores['test_score']))

0.8574181117533719


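#### Hyperparameter tuning
- Hyperparameter: a parameter set by the user rather than learned by the model. <br>
  Find the best values with cross-validation, then retrain the final model on the entire training set, validation folds included.
- Hyperparameters depend on each other: the best value of one changes with the others, so they have to be searched together.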
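The tuning procedure described above can be sketched by hand before handing it to a search class. This example uses synthetic data from `make_classification` and an arbitrary list of candidate values; both are assumptions for illustration, not the notebook's wine data or its search grid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): not the wine dataset.
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# Candidate values chosen only for illustration.
candidates = [0.0001, 0.001, 0.01]
mean_scores = []
for value in candidates:
    model = DecisionTreeClassifier(min_impurity_decrease=value, random_state=42)
    scores = cross_validate(model, X, y)  # 5-fold cross-validation by default
    mean_scores.append(float(np.mean(scores['test_score'])))

# Pick the candidate with the highest mean validation score.
best_value = candidates[int(np.argmax(mean_scores))]

# Retrain on all of the training data with the chosen value.
final_model = DecisionTreeClassifier(min_impurity_decrease=best_value, random_state=42)
final_model.fit(X, y)

print(best_value, max(mean_scores))
```

GridSearchCV automates this loop, including the final retraining step.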

In [11]:
from sklearn.model_selection import GridSearchCV
params = {'min_impurity_decrease': [0.0001, 0.0002, 0.0003, 0.0004, 0.0005]}    # candidate values to search

In [12]:
gs = GridSearchCV(DecisionTreeClassifier(random_state=42), params, n_jobs=-1)   # n_jobs=-1: use all CPU cores

In [13]:
gs.fit(train_input, train_target)

In [14]:
dt = gs.best_estimator_ # model already retrained on the full training set with the best hyperparameters
print(dt.score(train_input, train_target))

0.9615162593804117


In [15]:
gs.cv_results_

{'mean_fit_time': array([0.01152868, 0.0096148 , 0.01396627, 0.00748835, 0.01119633]),
 'std_fit_time': array([0.00281555, 0.0014586 , 0.00352156, 0.00080061, 0.00374091]),
 'mean_score_time': array([0.00298395, 0.0014502 , 0.0014441 , 0.00126519, 0.00466547]),
 'std_score_time': array([2.86879757e-03, 3.24671835e-05, 4.90254925e-05, 4.40316193e-05,
        4.19192904e-03]),
 'param_min_impurity_decrease': masked_array(data=[0.0001, 0.0002, 0.0003, 0.0004, 0.0005],
              mask=[False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'min_impurity_decrease': 0.0001},
  {'min_impurity_decrease': 0.0002},
  {'min_impurity_decrease': 0.0003},
  {'min_impurity_decrease': 0.0004},
  {'min_impurity_decrease': 0.0005}],
 'split0_test_score': array([0.86923077, 0.87115385, 0.86923077, 0.86923077, 0.86538462]),
 'split1_test_score': array([0.86826923, 0.86346154, 0.85961538, 0.86346154, 0.86923077]),
 'split2_test_score': array([0.8825794 , 0.8

In [16]:
gs.cv_results_['mean_test_score']

array([0.86819297, 0.86453617, 0.86492226, 0.86780891, 0.86761605])

In [17]:
print(gs.best_params_)    # 0.0001 gave the best mean validation score

{'min_impurity_decrease': 0.0001}


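#### Training a decision tree with grid search
- np.arange: NumPy's counterpart of Python's range; unlike range, it also accepts float steps
- 1,350 hyperparameter combinations (9 × 15 × 10)
- Limitation: choosing the range and spacing of the candidate values relies on experience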
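The combination count follows directly from the sizes of the three candidate lists. A short check, reusing the grid from this section and assuming the default 5-fold cross-validation:

```python
import numpy as np

params = {'min_impurity_decrease': np.arange(0.0001, 0.001, 0.0001),
          'max_depth': range(5, 20, 1),
          'min_samples_split': range(2, 100, 10)}

# Number of candidate values per hyperparameter.
sizes = {name: len(values) for name, values in params.items()}

# The grid is the Cartesian product of the candidate lists.
n_combinations = int(np.prod(list(sizes.values())))

# Each combination is trained once per fold (5 folds assumed).
n_fits = n_combinations * 5

print(sizes)
print(n_combinations, n_fits)
```

Because the count is a product, adding one more hyperparameter or a finer spacing multiplies the number of model fits, which is the practical cost of an exhaustive grid.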

In [18]:
params = {'min_impurity_decrease': np.arange(0.0001, 0.001, 0.0001),
          'max_depth': range(5, 20, 1),
          'min_samples_split': range(2, 100, 10)
          }

In [19]:
gs = GridSearchCV(DecisionTreeClassifier(random_state=42), params, n_jobs=-1)
gs.fit(train_input, train_target)

In [20]:
print(gs.best_params_)

{'max_depth': 14, 'min_impurity_decrease': 0.0004, 'min_samples_split': 12}


In [21]:
gs.cv_results_['mean_test_score'].shape   # 1,350 hyperparameter combinations, one cross-validation run each

(1350,)

In [22]:
print(np.max(gs.cv_results_['mean_test_score']))

0.8683865773302731


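#### Random search
- With too many parameter combinations, listing every candidate value explicitly stops being practical
- Instead, pass probability distributions from which parameter values are sampled
- The distributions come from SciPy (scipy.stats)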
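Before passing distributions to a search, it helps to sample from them directly and check the range they cover. The sketch below uses `scipy.stats.uniform` and `scipy.stats.randint` with illustrative arguments; the sample size and seed are arbitrary assumptions.

```python
from scipy.stats import uniform, randint

# uniform(loc, scale) covers [loc, loc + scale], not [low, high].
float_dist = uniform(0.0001, 0.001)
# randint(low, high) draws integers from low up to high - 1.
int_dist = randint(20, 50)

# Draw samples with a fixed seed so the run is repeatable.
float_samples = float_dist.rvs(size=1000, random_state=42)
int_samples = int_dist.rvs(size=1000, random_state=42)

print(float_samples.min(), float_samples.max())
print(int_samples.min(), int_samples.max())
```

Note that the second argument of `uniform` is a width, so `uniform(0.0001, 0.001)` can return values slightly above 0.001.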

In [23]:
from scipy.stats import uniform, randint

In [24]:
from sklearn.model_selection import RandomizedSearchCV

In [25]:
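params = {'min_impurity_decrease': uniform(0.0001, 0.001),  # uniform(loc, scale): floats in [0.0001, 0.0011]
          'max_depth': randint(20, 50),                     # integers from 20 to 49
          'min_samples_split': randint(2, 25),              # integers from 2 to 24
          'min_samples_leaf': randint(1, 25),               # integers from 1 to 24
          }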

In [26]:
rs = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), params,
                        n_iter=100, n_jobs=-1, random_state=42)
rs.fit(train_input, train_target)

In [27]:
print(rs.cv_results_['params'])

[{'max_depth': 26, 'min_impurity_decrease': 0.0008965429868602329, 'min_samples_leaf': 15, 'min_samples_split': 12}, {'max_depth': 27, 'min_impurity_decrease': 0.0006986584841970366, 'min_samples_leaf': 7, 'min_samples_split': 20}, {'max_depth': 42, 'min_impurity_decrease': 0.00015808361216819946, 'min_samples_leaf': 24, 'min_samples_split': 22}, {'max_depth': 23, 'min_impurity_decrease': 0.0002428668179219408, 'min_samples_leaf': 3, 'min_samples_split': 23}, {'max_depth': 40, 'min_impurity_decrease': 0.0010699098521619944, 'min_samples_leaf': 12, 'min_samples_split': 7}, {'max_depth': 21, 'min_impurity_decrease': 0.0002818249672071006, 'min_samples_leaf': 21, 'min_samples_split': 2}, {'max_depth': 31, 'min_impurity_decrease': 0.000711653160488281, 'min_samples_leaf': 12, 'min_samples_split': 18}, {'max_depth': 46, 'min_impurity_decrease': 0.0007118528947223795, 'min_samples_leaf': 10, 'min_samples_split': 17}, {'max_depth': 34, 'min_impurity_decrease': 0.000556069984217036, 'min_sampl

In [28]:
dt = rs.best_estimator_
print(dt.score(test_input, test_target))

0.86
