**Table of contents**<a id='toc0_'></a>
- [Cross Validation](#toc1_)
- [K-fold Cross Validation](#toc2_)
  - [KFold](#toc2_1_)
  - [StratifiedKFold](#toc2_2_)
  - [A simpler way than looping over folds to check each score - cross_val_score](#toc2_3_)
  - [Viewing the train accuracy score alongside](#toc2_4_)
- [Hyperparameter tuning](#toc3_)
  - [Tuning targets](#toc3_1_)
  - [GridSearchCV, which finds the best hyperparameters](#toc3_2_)
  - [Applying GridSearch to a Pipeline](#toc3_3_)
  - [Summarizing the performance results in a table](#toc3_4_)

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Cross Validation](#toc0_)

- A way to evaluate a model objectively

# <a id='toc2_'></a>[K-fold Cross Validation](#toc0_)

Split the data into k equal parts and rotate which part serves as the validation data.

<br>

- Split the data into train data and test data
    - run k-fold validation on the train data,
    - then do the final evaluation on the test data
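
The three steps above can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification` (an assumption, not the wine dataset), and the names `X_demo`, `kf`, and `fold_scores` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the wine dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=13)

# 1) Hold out a test set that the k-fold loop never sees
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=13
)

# 2) k-fold validation on the train portion only
kf = KFold(n_splits=5, shuffle=True, random_state=13)
fold_scores = []
for tr_idx, val_idx in kf.split(X_tr):
    model = DecisionTreeClassifier(max_depth=2, random_state=13)
    model.fit(X_tr[tr_idx], y_tr[tr_idx])
    fold_scores.append(accuracy_score(y_tr[val_idx], model.predict(X_tr[val_idx])))

# 3) Final evaluation on the untouched test data
final_model = DecisionTreeClassifier(max_depth=2, random_state=13).fit(X_tr, y_tr)
test_acc = accuracy_score(y_te, final_model.predict(X_te))

print('mean CV accuracy:', np.mean(fold_scores))
print('test accuracy:', test_acc)
```

Note that the cells further down run the folds on the full `X, y` rather than on the train split only, so their scores are not a held-out estimate in the sense of this outline.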


In [13]:
import pandas as pd
import numpy as np

In [2]:
red_url = 'https://raw.githubusercontent.com/PinkWink/forML_study_data/refs/heads/main/data/winequality-red.csv'
white_url = 'https://raw.githubusercontent.com/PinkWink/forML_study_data/refs/heads/main/data/winequality-white.csv'

red_wine = pd.read_csv(red_url, sep=';')
white_wine = pd.read_csv(white_url, sep=';')

red_wine['color'] = 1
white_wine['color'] = 0

wine = pd.concat([red_wine, white_wine])

wine['taste'] = [1. if grade > 5 else 0. for grade in wine['quality']]

X = wine.drop(['taste', 'quality'], axis=1)
y = wine['taste']

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

wine_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
wine_tree.fit(X_train, y_train)

y_pred_tr = wine_tree.predict(X_train)
y_pred_test = wine_tree.predict(X_test)

print('Train accuracy: ', accuracy_score(y_train, y_pred_tr))
print('Test accuracy: ', accuracy_score(y_test, y_pred_test))

Train accuracy:  0.7294593034442948
Test accuracy:  0.7161538461538461


## <a id='toc2_1_'></a>[KFold](#toc0_)

In [4]:
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5)
wine_tree_cv = DecisionTreeClassifier(max_depth=2, random_state=13)

In [7]:
for train_idx, test_idx in kfold.split(X, y):
    print(len(train_idx), len(test_idx))

5197 1300
5197 1300
5198 1299
5198 1299
5198 1299


In [11]:
cv_accuracy = []

for train_idx, test_idx in kfold.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    wine_tree_cv.fit(X_train, y_train)
    pred = wine_tree_cv.predict(X_test)
    cv_accuracy.append(accuracy_score(y_test, pred))

cv_accuracy

[0.6007692307692307,
 0.6884615384615385,
 0.7090069284064665,
 0.7628945342571208,
 0.7867590454195535]

- K-fold is what lets you check your model's performance objectively

In [None]:
np.mean(cv_accuracy)

0.709578255462782
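
The fold scores above climb from about 0.60 to 0.79. One possible contributor (an assumption, not verified here) is that `KFold` does not shuffle by default, so each fold is a contiguous block of rows, and `wine` was built by concatenating red rows followed by white rows. The sketch below uses a hypothetical 10-row index array to show the difference between the default and `shuffle=True`.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten rows labelled by their position (toy data, assumption)
X_idx = np.arange(10).reshape(-1, 1)

# Default: folds are contiguous blocks in the original row order
ordered = [val.tolist() for _, val in KFold(n_splits=5).split(X_idx)]

# shuffle=True: rows are permuted before being cut into folds
shuffled = [
    val.tolist()
    for _, val in KFold(n_splits=5, shuffle=True, random_state=13).split(X_idx)
]

print(ordered)   # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
print(shuffled)
```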

## <a id='toc2_2_'></a>[StratifiedKFold](#toc0_)

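- Used when some labels have far more or far fewer samples than others, so the label distribution is skewed
- In this dataset, it keeps the ratio of the `taste` labels (the `y` passed to `split`) the same in every fold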
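
To illustrate what stratification does, here is a small sketch on toy labels (80 zeros followed by 20 ones, an assumption chosen to make the effect obvious). `StratifiedKFold` balances whatever is passed as `y` to `split`; the helper `positive_ratio_per_fold` is a hypothetical name.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels, sorted by class: 80 zeros then 20 ones (assumption)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((100, 1))  # feature values do not affect how the folds are cut

def positive_ratio_per_fold(splitter):
    """Share of label 1 in the validation part of each fold."""
    return [float(y_toy[val_idx].mean()) for _, val_idx in splitter.split(X_toy, y_toy)]

kfold_ratios = positive_ratio_per_fold(KFold(n_splits=5))
skfold_ratios = positive_ratio_per_fold(StratifiedKFold(n_splits=5))

print(kfold_ratios)   # [0.0, 0.0, 0.0, 0.0, 1.0]
print(skfold_ratios)  # [0.2, 0.2, 0.2, 0.2, 0.2]
```

With plain `KFold`, four folds contain no positive samples at all, while `StratifiedKFold` keeps the 20% share in every fold.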

In [18]:
from sklearn.model_selection import StratifiedKFold

skfold = StratifiedKFold(n_splits=5)
wine_tree_cv = DecisionTreeClassifier(max_depth=2, random_state=13)

cv_accuracy = []

for train_idx, test_idx in skfold.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    wine_tree_cv.fit(X_train, y_train)
    pred = wine_tree_cv.predict(X_test)
    cv_accuracy.append(accuracy_score(y_test, pred))
    
cv_accuracy

[0.5523076923076923,
 0.6884615384615385,
 0.7143956889915319,
 0.7321016166281755,
 0.7567359507313318]

In [None]:
np.mean(cv_accuracy)

0.6888004974240539

## <a id='toc2_3_'></a>[A simpler way than looping over folds to check each score - cross_val_score](#toc0_)

In [22]:
from sklearn.model_selection import cross_val_score

skfold = StratifiedKFold(n_splits=5)
wine_tree_cv = DecisionTreeClassifier(max_depth=2, random_state=13)

cross_val_score(wine_tree_cv, X, y, scoring=None, cv=skfold)

array([0.55230769, 0.68846154, 0.71439569, 0.73210162, 0.75673595])

- A larger depth does not necessarily give better accuracy

In [None]:
wine_tree_cv = DecisionTreeClassifier(max_depth=5, random_state=13)

cross_val_score(wine_tree_cv, X, y, scoring=None, cv=skfold)

array([0.51307692, 0.63076923, 0.69745958, 0.7582756 , 0.74903772])

## <a id='toc2_4_'></a>[Viewing the train accuracy score alongside](#toc0_)

In [24]:
from sklearn.model_selection import cross_validate

cross_validate(wine_tree_cv, X, y, scoring=None, cv=skfold, return_train_score=True)

{'fit_time': array([0.02633309, 0.02590895, 0.02259874, 0.02061987, 0.02094579]),
 'score_time': array([0.00289488, 0.0041101 , 0.00399423, 0.00390291, 0.00240517]),
 'test_score': array([0.51307692, 0.63076923, 0.69745958, 0.7582756 , 0.74903772]),
 'train_score': array([0.78795459, 0.77968058, 0.77568295, 0.76356291, 0.76279338])}

-> The gap between test_score and train_score is fairly large -> `overfitting`
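
The gap can be made explicit by subtracting the two arrays. The sketch below copies the `train_score` and `test_score` values from the output above into hypothetical variables rather than rerunning `cross_validate`.

```python
import numpy as np

# Values copied from the cross_validate output above
test_score = np.array([0.51307692, 0.63076923, 0.69745958, 0.7582756, 0.74903772])
train_score = np.array([0.78795459, 0.77968058, 0.77568295, 0.76356291, 0.76279338])

# Per-fold difference between train and validation accuracy
gap = train_score - test_score

print(gap.round(4))
print('mean gap:', round(float(gap.mean()), 4))
print('fold with the largest gap:', int(gap.argmax()))
```

The gap is far from uniform: it is largest in the first fold and nearly vanishes in the last two, so it may be worth checking the fold composition before attributing everything to overfitting.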

# <a id='toc3_'></a>[Hyperparameter tuning](#toc0_)

- Hyperparameter: a setting that you adjust in order to secure model performance

## <a id='toc3_1_'></a>[Tuning targets](#toc0_)

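- For a Decision Tree, the main setting worth tuning here is max_depth (other settings such as min_samples_split exist but are not covered here)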

## <a id='toc3_2_'></a>[GridSearchCV, which finds the best hyperparameters](#toc0_)

In [25]:
from sklearn.model_selection import GridSearchCV

params = {'max_depth': [2, 4, 7, 10]}
wine_tree = DecisionTreeClassifier(max_depth=2, random_state=13)

gridsearch = GridSearchCV(estimator=wine_tree, param_grid=params, cv=5)
gridsearch.fit(X, y)

In [26]:
import pprint

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(gridsearch.cv_results_)

{   'mean_fit_time': array([0.0150382 , 0.01863437, 0.02784567, 0.03964667]),
    'mean_score_time': array([0.00543776, 0.00306163, 0.00285697, 0.00319338]),
    'mean_test_score': array([0.6888005 , 0.66356523, 0.65694688, 0.64678605]),
    'param_max_depth': masked_array(data=[2, 4, 7, 10],
             mask=[False, False, False, False],
       fill_value='?',
            dtype=object),
    'params': [   {'max_depth': 2},
                  {'max_depth': 4},
                  {'max_depth': 7},
                  {'max_depth': 10}],
    'rank_test_score': array([1, 2, 3, 4], dtype=int32),
    'split0_test_score': array([0.55230769, 0.51230769, 0.52461538, 0.51153846]),
    'split1_test_score': array([0.68846154, 0.63153846, 0.60538462, 0.61307692]),
    'split2_test_score': array([0.71439569, 0.72363356, 0.67975366, 0.66897614]),
    'split3_test_score': array([0.73210162, 0.73210162, 0.7382602 , 0.71901463]),
    'split4_test_score': array([0.75673595, 0.7182448 , 0.73672055, 0.7213241

In [27]:
gridsearch.best_estimator_

In [None]:
gridsearch.best_score_, gridsearch.best_params_

(0.6888004974240539, {'max_depth': 2})
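
Because `GridSearchCV` uses `refit=True` by default, `best_estimator_` is already retrained with the best parameters on all the data passed to `fit`. A hedged sketch of combining this with a held-out test set follows; it uses synthetic data and the hypothetical names `search`, `X_tr`, and `X_te`, not the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the wine dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=13)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=13, stratify=y_demo
)

# Search over max_depth using 5-fold CV on the train portion only
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=13),
    param_grid={'max_depth': [2, 4, 7, 10]},
    cv=5,
)
search.fit(X_tr, y_tr)

# best_estimator_ was refit on all of X_tr, so it can score the test set directly
test_acc = search.best_estimator_.score(X_te, y_te)

print(search.best_params_)
print('CV score:', round(search.best_score_, 3), 'test score:', round(test_acc, 3))
```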

## <a id='toc3_3_'></a>[Applying GridSearch to a Pipeline](#toc0_)

In [29]:
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

estimators = [
    ('scaler', StandardScaler()),
    ('clf', DecisionTreeClassifier(random_state=13))
]

pipe = Pipeline(estimators)

In [30]:
param_grid = [{
    'clf__max_depth': [2, 4, 7, 10]
}]

GridSearch = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=5)
GridSearch.fit(X, y)
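
The key `clf__max_depth` follows the pipeline naming rule: step name, two underscores, then the parameter name of that step. A small sketch with a hypothetical `demo_pipe` shows where these names come from.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', DecisionTreeClassifier(random_state=13)),
])

# Nested parameter names have the form <step name>__<parameter name>
tunable = sorted(name for name in demo_pipe.get_params() if '__' in name)
print('clf__max_depth' in tunable)   # True

# The same name works with set_params, which is what GridSearchCV relies on
demo_pipe.set_params(clf__max_depth=4)
print(demo_pipe.named_steps['clf'].max_depth)   # 4
```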

In [31]:
GridSearch.best_score_

0.6888004974240539

In [None]:
GridSearch.cv_results_

{'mean_fit_time': array([0.01729393, 0.02156749, 0.03255463, 0.04327192]),
 'std_fit_time': array([0.00320101, 0.00092985, 0.00216581, 0.00192267]),
 'mean_score_time': array([0.00308399, 0.00337319, 0.00359254, 0.00337229]),
 'std_score_time': array([0.00060817, 0.00039547, 0.00133404, 0.00057936]),
 'param_clf__max_depth': masked_array(data=[2, 4, 7, 10],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'clf__max_depth': 2},
  {'clf__max_depth': 4},
  {'clf__max_depth': 7},
  {'clf__max_depth': 10}],
 'split0_test_score': array([0.55230769, 0.51230769, 0.52461538, 0.51153846]),
 'split1_test_score': array([0.68846154, 0.63153846, 0.60769231, 0.61461538]),
 'split2_test_score': array([0.71439569, 0.72363356, 0.67821401, 0.66666667]),
 'split3_test_score': array([0.73210162, 0.73210162, 0.7382602 , 0.71901463]),
 'split4_test_score': array([0.75673595, 0.7182448 , 0.73672055, 0.72209392]),
 'mean_test_score': array([0.688

## <a id='toc3_4_'></a>[Summarizing the performance results in a table](#toc0_)

In [33]:
score_df = pd.DataFrame(GridSearch.cv_results_)
score_df

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_clf__max_depth | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.017294 | 0.003201 | 0.003084 | 0.000608 | 2 | {'clf__max_depth': 2} | 0.552308 | 0.688462 | 0.714396 | 0.732102 | 0.756736 | 0.6888 | 0.071799 | 1 |
| 1 | 0.021567 | 0.00093 | 0.003373 | 0.000395 | 4 | {'clf__max_depth': 4} | 0.512308 | 0.631538 | 0.723634 | 0.732102 | 0.718245 | 0.663565 | 0.083905 | 2 |
| 2 | 0.032555 | 0.002166 | 0.003593 | 0.001334 | 7 | {'clf__max_depth': 7} | 0.524615 | 0.607692 | 0.678214 | 0.73826 | 0.736721 | 0.6571 | 0.081689 | 3 |
| 3 | 0.043272 | 0.001923 | 0.003372 | 0.000579 | 10 | {'clf__max_depth': 10} | 0.511538 | 0.614615 | 0.666667 | 0.719015 | 0.722094 | 0.646786 | 0.078244 | 4 |


In [34]:
score_df[['params', 'rank_test_score', 'mean_test_score', 'std_test_score']]

| | params | rank_test_score | mean_test_score | std_test_score |
|---|---|---|---|---|
| 0 | {'clf__max_depth': 2} | 1 | 0.6888 | 0.071799 |
| 1 | {'clf__max_depth': 4} | 2 | 0.663565 | 0.083905 |
| 2 | {'clf__max_depth': 7} | 3 | 0.6571 | 0.081689 |
| 3 | {'clf__max_depth': 10} | 4 | 0.646786 | 0.078244 |
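
One way to read this summary is to compare the differences between mean scores with the spread across folds. The sketch below rebuilds the table from the values shown above in a hypothetical `summary` frame and counts how many depths fall within one standard deviation of the best mean; that threshold is only a reading aid chosen for illustration, not a rule from the source.

```python
import pandas as pd

# Values copied from the summary table above
summary = pd.DataFrame({
    'max_depth': [2, 4, 7, 10],
    'mean_test_score': [0.6888, 0.663565, 0.6571, 0.646786],
    'std_test_score': [0.071799, 0.083905, 0.081689, 0.078244],
})

# Row with the highest mean CV accuracy
best_row = summary.loc[summary['mean_test_score'].idxmax()]
best_depth = int(best_row['max_depth'])

# Depths whose mean lies within one std of the best mean (illustrative threshold)
threshold = best_row['mean_test_score'] - best_row['std_test_score']
within_one_std = summary[summary['mean_test_score'] >= threshold]

print(best_depth, within_one_std['max_depth'].tolist())   # 2 [2, 4, 7, 10]
```

All four depths land inside that band, which suggests the ranking rests on differences smaller than the fold-to-fold variation.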
