GridSearchCV
- Breast cancer data classification

1. Data preprocessing and exploration

In [2]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

In [3]:
import pandas as pd
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
df.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [4]:
df.shape

(569, 31)

In [5]:
# 0 = malignant, 1 = benign
df.target.value_counts()

1    357
0    212
Name: target, dtype: int64

In [6]:
cancer.target_names

array(['malignant', 'benign'], dtype='<U9')

2. Splitting the data into training and test sets

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target,
    stratify=cancer.target, test_size=0.2, random_state=2023
)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((455, 30), (114, 30), (455,), (114,))

In [8]:
# Distribution of the y values
import numpy as np
np.unique(y_train, return_counts=True)

(array([0, 1]), array([170, 285]))

In [9]:
np.unique(y_test, return_counts=True)

(array([0, 1]), array([42, 72]))
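
The class counts above suggest that `stratify=cancer.target` keeps the malignant/benign ratio roughly the same in both splits. A minimal sketch to check this, where `class_ratio` is a helper introduced only for this illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target,
    stratify=cancer.target, test_size=0.2, random_state=2023,
)

def class_ratio(y):
    """Hypothetical helper: fraction of samples in each class (order: 0, 1)."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

full_ratio = class_ratio(cancer.target)   # ratio in the whole dataset
train_ratio = class_ratio(y_train)        # ratio in the training split
test_ratio = class_ratio(y_test)          # ratio in the test split

print('full :', full_ratio)
print('train:', train_ratio)
print('test :', test_ratio)
```

With stratification the three ratios should differ only by rounding, since each split can be off by at most about one sample per class.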

3. Training

In [11]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(random_state=2023)
dtc.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': 2023,
 'splitter': 'best'}

In [12]:
# Classification is supervised learning, so both X and y are needed: fit on X_train, y_train
dtc.fit(X_train, y_train)

4. Prediction

In [13]:
pred = dtc.predict(X_test)

In [14]:
rf = pd.DataFrame({'y actual': y_test, 'y predicted': pred})
rf.head()

,y actual,y predicted
0,0,0
1,1,1
2,1,1
3,1,1
4,1,1


5. Evaluation

In [15]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred)

0.9210526315789473

In [16]:
dtc.score(X_test, y_test)

0.9210526315789473
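
Both calls above return the same number because, for a classifier, `score` reports accuracy: the fraction of test samples whose predicted label matches the true label. A minimal sketch that computes this fraction by hand (the name `manual_accuracy` is introduced here only for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target,
    stratify=cancer.target, test_size=0.2, random_state=2023,
)

dtc = DecisionTreeClassifier(random_state=2023).fit(X_train, y_train)
pred = dtc.predict(X_test)

# Accuracy = number of correct predictions / number of test samples
manual_accuracy = np.mean(pred == y_test)

print(manual_accuracy)
print(accuracy_score(y_test, pred))
print(dtc.score(X_test, y_test))
```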

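Applying GridSearchCV
- Used during training
- Summary of the GridSearchCV constructor arguments
 1. estimator : a classifier, regressor, pipeline, etc.
 2. param_grid : the hyperparameters to tune and their candidate values, passed as a dictionary.
 3. scoring : the metric used to measure predictive performance. For classification it is commonly set to accuracy; if left as None, the estimator's own score method is used.
 4. cv : the number of folds the training data is split into for cross-validation.
 5. refit : defaults to True. When True, the estimator is retrained on the whole training set with the best hyperparameters found.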


In [17]:
params = {
    'max_depth': [2, 5, 8],
    'min_samples_split': [2, 3, 4]
}

In [20]:
from sklearn.model_selection import GridSearchCV
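grid_dt = GridSearchCV(dtc,                 # estimator: the DecisionTreeClassifier
                       param_grid=params,   # hyperparameter combinations to try
                       scoring='accuracy',  # evaluation metric: accuracy
                       cv=5)                # number of cross-validation folds
# 3 x 3 parameter combinations x 5 folds = 45 fits, plus one final refit with the best parameters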

In [21]:
# Run the search
grid_dt.fit(X_train, y_train)

In [22]:
# Best parameter combination
grid_dt.best_params_

{'max_depth': 5, 'min_samples_split': 2}

In [23]:
# Best score (mean cross-validated accuracy of the best combination)
grid_dt.best_score_

0.9472527472527472
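
The best score is the mean accuracy over the cross-validation folds for the winning combination, not a test-set score. A short sketch that inspects `cv_results_` to see every combination and to count the cross-validation fits; the variable names `cv_table`, `n_candidates`, and `n_cv_fits` are introduced only for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target,
    stratify=cancer.target, test_size=0.2, random_state=2023,
)

params = {'max_depth': [2, 5, 8], 'min_samples_split': [2, 3, 4]}
grid_dt = GridSearchCV(DecisionTreeClassifier(random_state=2023),
                       param_grid=params, scoring='accuracy', cv=5)
grid_dt.fit(X_train, y_train)

# One row per parameter combination, with its mean/std accuracy across folds
cv_table = pd.DataFrame(grid_dt.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(cv_table.sort_values('rank_test_score'))

n_candidates = len(cv_table)                   # number of combinations tried
n_cv_fits = n_candidates * grid_dt.n_splits_   # fits during cross-validation
best_row = cv_table.loc[grid_dt.best_index_]   # row behind best_params_ / best_score_
print(n_candidates, n_cv_fits)
```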

Repeat the search while narrowing the parameter ranges

In [24]:
params = {
    'max_depth': [4, 5, 6],
    'min_samples_split': [2, 3, 4]
}

grid_dt = GridSearchCV(dtc, param_grid=params, scoring='accuracy', cv=5)
grid_dt.fit(X_train, y_train)

In [26]:
grid_dt.best_params_

{'max_depth': 5, 'min_samples_split': 2}

Predict and evaluate with the best model (the best estimator)

In [27]:
best_dt=grid_dt.best_estimator_
best_dt.score(X_test, y_test)

0.8947368421052632
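
Because `refit` defaults to True, the fitted search object itself can also be used for prediction; it delegates to `best_estimator_`. A minimal sketch checking that both routes agree (the names `same_predictions`, `grid_score`, and `best_score_on_test` are introduced only for this example):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target,
    stratify=cancer.target, test_size=0.2, random_state=2023,
)

params = {'max_depth': [4, 5, 6], 'min_samples_split': [2, 3, 4]}
grid_dt = GridSearchCV(DecisionTreeClassifier(random_state=2023),
                       param_grid=params, scoring='accuracy', cv=5)  # refit=True by default
grid_dt.fit(X_train, y_train)

best_dt = grid_dt.best_estimator_

# Predicting through the search object or through the refitted estimator
same_predictions = np.array_equal(grid_dt.predict(X_test), best_dt.predict(X_test))

# With scoring='accuracy', both scores are test-set accuracy
grid_score = grid_dt.score(X_test, y_test)
best_score_on_test = best_dt.score(X_test, y_test)

print(same_predictions, grid_score, best_score_on_test)
```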

In [28]:
params = {
    'max_depth': [7, 8, 10],
    'min_samples_split': [3, 4, 5]
}
grid_dt = GridSearchCV(dtc, param_grid=params, scoring='accuracy', cv=6)
grid_dt.fit(X_train, y_train)


In [29]:
grid_dt.best_params_

{'max_depth': 8, 'min_samples_split': 3}

In [30]:
best_dt = grid_dt.best_estimator_
best_dt.score(X_test, y_test)

0.9210526315789473

In [31]:
params = {
    'max_depth': [7, 8, 10],
    'min_samples_split': [3, 4, 5]
}
grid_dt = GridSearchCV(dtc, param_grid=params, scoring='accuracy', cv=5)
grid_dt.fit(X_train, y_train)

In [32]:
grid_dt.best_params_

{'max_depth': 7, 'min_samples_split': 4}

In [33]:
best_dt = grid_dt.best_estimator_
best_dt.score(X_test, y_test)

0.9210526315789473
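
The last two runs use the same grid and differ only in `cv`, yet select different parameters. A small sketch that puts the two settings side by side in one loop; the `summary` dictionary is a name introduced only for this example, and exact values may vary with the scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target,
    stratify=cancer.target, test_size=0.2, random_state=2023,
)

params = {'max_depth': [7, 8, 10], 'min_samples_split': [3, 4, 5]}

summary = {}
for n_folds in (5, 6):
    grid = GridSearchCV(DecisionTreeClassifier(random_state=2023),
                        param_grid=params, scoring='accuracy', cv=n_folds)
    grid.fit(X_train, y_train)
    summary[n_folds] = {
        'best_params': grid.best_params_,                        # winner for this cv setting
        'cv_score': grid.best_score_,                            # mean accuracy across folds
        'test_score': grid.best_estimator_.score(X_test, y_test) # accuracy on the held-out test set
    }

for n_folds, row in summary.items():
    print(n_folds, row)
```

When the candidates score within a fraction of a percent of each other, a different fold layout can be enough to change which one ranks first.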