# 3.4 Exercise: Breast Cancer Classification


* **Dataset**:
  https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer


## Step0. Loading the data

In [1]:
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_breast_cancer
import warnings
warnings.filterwarnings('ignore')

breast_cancer = load_breast_cancer()

breast_cancer_df = pd.DataFrame(data=breast_cancer.data, columns=breast_cancer.feature_names)
breast_cancer_df['label'] = breast_cancer.target
breast_cancer_df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |
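
The `label` column holds integer codes rather than class names. As a quick check, the following minimal sketch reloads the dataset so that it runs on its own and reads the mapping from the `target_names` attribute of the returned object. The variable names `data` and `label_names` are illustrative.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names[i] is the class name for label i;
# str() converts the NumPy strings to plain Python strings for printing
label_names = {i: str(name) for i, name in enumerate(data.target_names)}
print(label_names)      # {0: 'malignant', 1: 'benign'}

# 569 samples with 30 numeric features each (the label is stored separately)
print(data.data.shape)  # (569, 30)
```

So label 0 is the malignant class and label 1 is the benign class, which matters when reading precision and recall later on.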


## Step1. Data preprocessing

In [2]:
breast_cancer_df.shape

(569, 31)

In [3]:
breast_cancer_df['label'].unique()

array([0, 1])

In [4]:
malignant = (breast_cancer_df['label'] == 0).sum()
benign = (breast_cancer_df['label'] == 1).sum()

print('Malignant: {0}, Benign: {1}'.format(malignant, benign))

Malignant: 212, Benign: 357
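
The two classes are moderately imbalanced, which gives a useful reference point for the accuracies reported later in this section. A small sketch of that arithmetic, using only the counts printed above (the variable names are illustrative):

```python
# Class counts taken from the output above
malignant, benign = 212, 357
total = malignant + benign

# Share of each class in the full dataset
malignant_ratio = malignant / total
benign_ratio = benign / total

# A classifier that always predicts the majority class (label 1)
# would already reach this accuracy
baseline_accuracy = max(malignant, benign) / total

print(f'label 0: {malignant_ratio:.3f}, label 1: {benign_ratio:.3f}')
print(f'majority-class baseline accuracy: {baseline_accuracy:.3f}')
```

Any model worth keeping should clear the baseline of roughly 0.63 by a wide margin.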


In [5]:
# Check for missing values
breast_cancer_df.isnull().sum()

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
label                      0
dtype: int64

In [6]:
# Check for duplicate rows
breast_cancer_df.duplicated().sum()

0

Splitting the data

In [7]:
from sklearn.model_selection import train_test_split

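# Separate the features (independent variables) from the target (dependent variable)
X = breast_cancer_df.iloc[:, :30]
y = breast_cancer_df['label']

# Split into training and test sets, stratified by label.
# No random_state is set, so the split and the scores below change from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)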
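
Passing `stratify=y` asks `train_test_split` to keep the class proportions the same in both parts. The sketch below checks this on the same dataset. It reloads the data so that it is self-contained, and it sets `random_state=42` purely so the check is repeatable, which is an assumption added here and not something the cell above does.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# random_state is added only to make this sketch repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# The default test_size is 0.25, so 569 rows split into 426 / 143
print(len(y_train), len(y_test))

# Fraction of label-1 samples in the full data and in each part
print(f'all:   {y.mean():.3f}')
print(f'train: {y_train.mean():.3f}')
print(f'test:  {y_test.mean():.3f}')
```

The three fractions should agree to within about one percentage point, which is what stratification is meant to guarantee.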

## Step2. Training, prediction, and evaluation

Train several classifiers on the data and check each one's accuracy on the test set.



* DecisionTreeClassifier
* KNeighborsClassifier
* SVC (support vector machine)
* RandomForestClassifier
* LogisticRegression
* GradientBoostingClassifier
* XGBClassifier
* LGBMClassifier

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

tree_model = DecisionTreeClassifier(random_state=42)
neighbor_model = KNeighborsClassifier(n_neighbors=5)
svm_model = SVC(random_state=42)
forest_model = RandomForestClassifier(n_estimators=300, random_state=42)
logistic_model = LogisticRegression(random_state=42)
gbm_model = GradientBoostingClassifier(random_state=42)
xgb_model = XGBClassifier(n_estimators=300, random_state=42)
lgb_model = LGBMClassifier(n_estimators=300, random_state=42)

model_list = [tree_model, neighbor_model, svm_model, forest_model, logistic_model, gbm_model, xgb_model, lgb_model]

for model in model_list:
    model.fit(X_train , y_train)
    score = model.score(X_test, y_test)
    model_name = model.__class__.__name__
    print('{0} accuracy: {1:.2f}'.format(model_name, score))

DecisionTreeClassifier accuracy: 0.91
KNeighborsClassifier accuracy: 0.93
SVC accuracy: 0.92
RandomForestClassifier accuracy: 0.95
LogisticRegression accuracy: 0.92
GradientBoostingClassifier accuracy: 0.94
XGBClassifier accuracy: 0.97
LGBMClassifier accuracy: 0.97
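
These accuracies come from a single random split, and the models are trained on unscaled features whose values range from below 0.1 to above 2000 in the rows shown in Step0. KNeighborsClassifier, SVC, and LogisticRegression are all sensitive to feature scale. One option, sketched here as an assumption rather than as part of the original exercise, is to wrap those three models in a pipeline with `StandardScaler` and score them with 5-fold cross-validation. The names `candidates` and `cv_means`, the choice of `cv=5`, and `max_iter=1000` are illustrative additions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scale-sensitive models from the list above
candidates = {
    'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=5),
    'SVC': SVC(random_state=42),
    'LogisticRegression': LogisticRegression(random_state=42, max_iter=1000),
}

cv_means = {}
for name, estimator in candidates.items():
    # The scaler is fitted inside each fold, so no held-out statistics leak in
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
    cv_means[name] = scores.mean()
    print(f'{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})')
```

Averaging over folds makes the comparison less dependent on one lucky or unlucky split, and putting the scaler inside the pipeline keeps the evaluation honest.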


In [9]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

def classifier_evaluation(y_test, y_pred):
    """Print accuracy, precision, recall, F1-score, and the confusion matrix."""
    confusion = confusion_matrix(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    # precision, recall, and F1 use the default pos_label=1, which is the benign class here
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    print('Accuracy: {0:.2f}, Precision: {1:.2f}, Recall: {2:.2f}, F1-score: {3:.2f}'.format(accuracy, precision, recall, f1))
    print('Confusion matrix', confusion, sep='\n')

In [10]:
model_list = [neighbor_model, svm_model, forest_model, logistic_model, gbm_model, xgb_model, lgb_model]

for model in model_list:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    model_name = model.__class__.__name__
    print('\n{0} metrics:'.format(model_name))
    classifier_evaluation(y_test, y_pred)


KNeighborsClassifier metrics:
Accuracy: 0.93, Precision: 0.93, Recall: 0.97, F1-score: 0.95
Confusion matrix
[[46  7]
 [ 3 87]]

SVC metrics:
Accuracy: 0.92, Precision: 0.90, Recall: 0.99, F1-score: 0.94
Confusion matrix
[[43 10]
 [ 1 89]]

RandomForestClassifier metrics:
Accuracy: 0.95, Precision: 0.95, Recall: 0.98, F1-score: 0.96
Confusion matrix
[[48  5]
 [ 2 88]]

LogisticRegression metrics:
Accuracy: 0.92, Precision: 0.93, Recall: 0.93, F1-score: 0.93
Confusion matrix
[[47  6]
 [ 6 84]]

GradientBoostingClassifier metrics:
Accuracy: 0.94, Precision: 0.96, Recall: 0.94, F1-score: 0.95
Confusion matrix
[[49  4]
 [ 5 85]]

XGBClassifier metrics:
Accuracy: 0.97, Precision: 0.97, Recall: 0.99, F1-score: 0.98
Confusion matrix
[[50  3]
 [ 1 89]]

LGBMClassifier metrics:
Accuracy: 0.97, Precision: 0.96, Recall: 1.00, F1-score: 0.98
Confusion matrix
[[49  4]
 [ 0 90]]
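
`precision_score`, `recall_score`, and `f1_score` treat label 1 as the positive class by default, and in this dataset label 1 is benign. The precision and recall figures above therefore describe how well benign cases are identified. The malignant-class figures can be read off the printed confusion matrices directly, because scikit-learn puts true labels in rows and predicted labels in columns. A minimal sketch, where the helper name `malignant_metrics` is made up for illustration:

```python
def malignant_metrics(confusion):
    """Precision and recall with label 0 (malignant) as the positive class.

    `confusion` is laid out as scikit-learn prints it:
    rows are true labels, columns are predicted labels, in the order [0, 1].
    """
    (tp, fn), (fp, tn) = confusion  # label 0 counts as "positive" here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Confusion matrices copied from the output above
matrices = {
    'SVC': [[43, 10], [1, 89]],
    'XGBClassifier': [[50, 3], [1, 89]],
    'LGBMClassifier': [[49, 4], [0, 90]],
}

for name, confusion in matrices.items():
    precision, recall = malignant_metrics(confusion)
    print(f'{name}: malignant precision {precision:.2f}, recall {recall:.2f}')
```

For the SVC this gives a malignant recall of 43/53, about 0.81, even though its reported recall is 0.99: ten of the 53 malignant test cases were predicted as benign. The same numbers can be obtained from scikit-learn by passing `pos_label=0` to `precision_score` and `recall_score`.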


### Tuning RandomForestClassifier hyperparameters with GridSearchCV

* 'n_estimators': [100, 300, 500]
* 'max_depth': [6, 8, 10, 12]
* 'min_samples_leaf': [8, 12, 18]
* 'min_samples_split': [8, 16, 20]

In [11]:
from sklearn.model_selection import GridSearchCV

params = {
    'n_estimators':[100, 300, 500],
    'max_depth' : [6, 8, 10, 12], 
    'min_samples_leaf' : [8, 12, 18],
    'min_samples_split' : [8, 16, 20]
}

grid_cv = GridSearchCV(forest_model , param_grid=params , cv=2)
grid_cv.fit(X_train , y_train)

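print('Best parameters:', grid_cv.best_params_)
print('\n -- Test results -- ')
# Predict with the tuned forest; `model` would still be the LGBMClassifier
# left over from the loop in the previous cell
y_pred = grid_cv.best_estimator_.predict(X_test)
classifier_evaluation(y_test, y_pred)

Best parameters: {'max_depth': 6, 'min_samples_leaf': 8, 'min_samples_split': 8, 'n_estimators': 300}

 -- Test results -- 
Accuracy: 0.97, Precision: 0.96, Recall: 1.00, F1-score: 0.98
Confusion matrix
[[49  4]
 [ 0 90]]

Note: the test results shown here were recorded with the original call `model.predict(X_test)`, where `model` was the LGBMClassifier left over from the previous loop. That is why they match the LGBMClassifier figures above exactly. Re-run the corrected cell to see the scores of the tuned RandomForestClassifier.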
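
After `fit`, a `GridSearchCV` object keeps the winning configuration in `best_params_`, its mean cross-validated score in `best_score_`, and, with the default `refit=True`, a model retrained on the whole training set in `best_estimator_`. The sketch below walks through that flow on a deliberately small grid so that it runs quickly. The grid values, `cv=3`, and `random_state=42` are assumptions made for this example and are not the settings used above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Small illustrative grid (2 candidates) to keep the run short
small_params = {'n_estimators': [50], 'max_depth': [4, 8]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=small_params,
    cv=3,
)
search.fit(X_train, y_train)

print('best params:', search.best_params_)
print(f'mean CV accuracy of best candidate: {search.best_score_:.3f}')

# Evaluate the refitted best model on the held-out test set
best_forest = search.best_estimator_
test_accuracy = accuracy_score(y_test, best_forest.predict(X_test))
print(f'test accuracy: {test_accuracy:.3f}')
```

Calling `search.predict(X_test)` is equivalent to `search.best_estimator_.predict(X_test)`, so either form evaluates the tuned model.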
