Breast Cancer Diagnosis
====


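This exercise uses load_breast_cancer (the breast cancer dataset), one of scikit-learn's toy datasets,    
to classify each tumor as malignant or benign from a set of health measurements.   

- The load_breast_cancer dataset contains 569 samples.
- There are 30 features (health measurements recorded for each patient).
- The label is binary: 0 for malignant, 1 for benign.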
    
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer

## 1) Import the required modules

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

## 2) Prepare the data

In [2]:
breast_cancer = load_breast_cancer()

# Check what breast_cancer contains with the keys() method
breast_cancer.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

## 3) Understand the data

### Select the feature data

In [3]:
breast_cancer_data = breast_cancer.data

print(breast_cancer_data.shape)

(569, 30)


### Inspect the data

In [4]:
breast_cancer_data[10]

array([1.602e+01, 2.324e+01, 1.027e+02, 7.978e+02, 8.206e-02, 6.669e-02,
       3.299e-02, 3.323e-02, 1.528e-01, 5.697e-02, 3.795e-01, 1.187e+00,
       2.466e+00, 4.051e+01, 4.029e-03, 9.269e-03, 1.101e-02, 7.591e-03,
       1.460e-02, 3.042e-03, 1.919e+01, 3.388e+01, 1.238e+02, 1.150e+03,
       1.181e-01, 1.551e-01, 1.459e-01, 9.975e-02, 2.948e-01, 8.452e-02])

### Select the label data

In [5]:
breast_cancer_label = breast_cancer.target

print(breast_cancer_label.shape)

(569,)


### Print the target names

In [6]:
breast_cancer.target_names

array(['malignant', 'benign'], dtype='<U9')
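
The order of `target_names` determines the label encoding: index 0 is malignant and index 1 is benign. The short sketch below counts the samples per class to make that mapping concrete. The variable names `dataset` and `class_counts` are illustrative and are not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# target holds integer codes; target_names[code] gives the class name
counts = np.bincount(dataset.target)
class_counts = {str(name): int(n) for name, n in zip(dataset.target_names, counts)}
print(class_counts)  # {'malignant': 212, 'benign': 357}
```

Because malignant is encoded as 0, metrics that default to treating label 1 as the positive class report on the benign class unless told otherwise.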

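### Read the dataset description

In [7]:
# print() renders the line breaks in the description instead of showing the raw string
print(breast_cancer.DESCR)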



## 4) Split into train and test data

In [8]:
import pandas as pd

breast_cancer_df = pd.DataFrame(data=breast_cancer_data, columns=breast_cancer.feature_names)
breast_cancer_df.tail()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 564 | 21.56 | 22.39 | 142.0 | 1479.0 | 0.111 | 0.1159 | 0.2439 | 0.1389 | 0.1726 | 0.05623 | ... | 25.45 | 26.4 | 166.1 | 2027.0 | 0.141 | 0.2113 | 0.4107 | 0.2216 | 0.206 | 0.07115 |
| 565 | 20.13 | 28.25 | 131.2 | 1261.0 | 0.0978 | 0.1034 | 0.144 | 0.09791 | 0.1752 | 0.05533 | ... | 23.69 | 38.25 | 155.0 | 1731.0 | 0.1166 | 0.1922 | 0.3215 | 0.1628 | 0.2572 | 0.06637 |
| 566 | 16.6 | 28.08 | 108.3 | 858.1 | 0.08455 | 0.1023 | 0.09251 | 0.05302 | 0.159 | 0.05648 | ... | 18.98 | 34.12 | 126.7 | 1124.0 | 0.1139 | 0.3094 | 0.3403 | 0.1418 | 0.2218 | 0.0782 |
| 567 | 20.6 | 29.33 | 140.1 | 1265.0 | 0.1178 | 0.277 | 0.3514 | 0.152 | 0.2397 | 0.07016 | ... | 25.74 | 39.42 | 184.6 | 1821.0 | 0.165 | 0.8681 | 0.9387 | 0.265 | 0.4087 | 0.124 |
| 568 | 7.76 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.0 | 0.0 | 0.1587 | 0.05884 | ... | 9.456 | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.0 | 0.0 | 0.2871 | 0.07039 |


In [9]:
# Add the label column
breast_cancer_df["label"] = breast_cancer_label

breast_cancer_df.tail()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 564 | 21.56 | 22.39 | 142.0 | 1479.0 | 0.111 | 0.1159 | 0.2439 | 0.1389 | 0.1726 | 0.05623 | ... | 26.4 | 166.1 | 2027.0 | 0.141 | 0.2113 | 0.4107 | 0.2216 | 0.206 | 0.07115 | 0 |
| 565 | 20.13 | 28.25 | 131.2 | 1261.0 | 0.0978 | 0.1034 | 0.144 | 0.09791 | 0.1752 | 0.05533 | ... | 38.25 | 155.0 | 1731.0 | 0.1166 | 0.1922 | 0.3215 | 0.1628 | 0.2572 | 0.06637 | 0 |
| 566 | 16.6 | 28.08 | 108.3 | 858.1 | 0.08455 | 0.1023 | 0.09251 | 0.05302 | 0.159 | 0.05648 | ... | 34.12 | 126.7 | 1124.0 | 0.1139 | 0.3094 | 0.3403 | 0.1418 | 0.2218 | 0.0782 | 0 |
| 567 | 20.6 | 29.33 | 140.1 | 1265.0 | 0.1178 | 0.277 | 0.3514 | 0.152 | 0.2397 | 0.07016 | ... | 39.42 | 184.6 | 1821.0 | 0.165 | 0.8681 | 0.9387 | 0.265 | 0.4087 | 0.124 | 0 |
| 568 | 7.76 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.0 | 0.0 | 0.1587 | 0.05884 | ... | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.0 | 0.0 | 0.2871 | 0.07039 | 1 |
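
The DataFrame above is assembled by hand from `data`, `feature_names`, and `target`. As an alternative sketch, the loader can return a ready-made DataFrame through its `as_frame` option (pandas must be installed). The names `bunch` and `df` are illustrative; note that the label column in this frame is called `target`, not `label`.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True asks scikit-learn to return pandas objects (requires pandas)
bunch = load_breast_cancer(as_frame=True)
df = bunch.frame  # 30 feature columns plus a "target" column

print(df.shape)
print(df["target"].value_counts())
```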


In [10]:
X_train, X_test, y_train, y_test = train_test_split(breast_cancer_data, breast_cancer_label, 
                                                    test_size=0.2, random_state=20)

print('X_train size: ', len(X_train), ', X_test size: ', len(X_test))
print('y_train size: ', len(y_train), ', y_test size: ', len(y_test))

X_train size:  455 , X_test size:  114
y_train size:  455 , y_test size:  114
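
A purely random split does not guarantee that both splits keep the class ratio of the full dataset. In the split above, the reports in section 6 show 48 malignant samples among the 114 test samples (about 42%), against about 37% (212 of 569) overall. The sketch below shows one possible variation, assuming a stratified split is acceptable for this exercise; `stratify=y` is not used in the notebook, so results obtained with it would differ from the ones shown here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the malignant/benign ratio nearly identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=20, stratify=y
)

malignant_share_all = (y == 0).mean()
malignant_share_test = (y_te == 0).mean()
print(f"malignant share, full data: {malignant_share_all:.3f}")
print(f"malignant share, test set:  {malignant_share_test:.3f}")
```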


## 5) Train several models

### Try a Decision Tree

In [11]:
# Decision Tree model
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier(random_state=200)
print(decision_tree._estimator_type)

classifier


In [12]:
# Train the model
decision_tree.fit(X_train, y_train)

DecisionTreeClassifier(random_state=200)

In [13]:
# Predict and check recall (recall_score defaults to pos_label=1, the benign class)
from sklearn.metrics import recall_score
y_pred_dt = decision_tree.predict(X_test)

recall = recall_score(y_test, y_pred_dt)
recall

0.9696969696969697

### Try a Random Forest

In [14]:
# Random Forest model
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(random_state=200)

# Train
random_forest.fit(X_train, y_train)

# Predict and check recall
y_pred_rf = random_forest.predict(X_test)

recall = recall_score(y_test, y_pred_rf)
recall

1.0

### Try an SVM

In [15]:
# SVM model
from sklearn import svm
svm_model = svm.SVC()

# Train
svm_model.fit(X_train, y_train)

# Predict and check recall
y_pred_svm = svm_model.predict(X_test)

recall = recall_score(y_test, y_pred_svm)
recall

1.0

### Try an SGD Classifier

In [16]:
# SGD Classifier model
from sklearn.linear_model import SGDClassifier
sgd_model = SGDClassifier()

# Train
sgd_model.fit(X_train, y_train)

# Predict and check recall
y_pred_sgd = sgd_model.predict(X_test)

recall = recall_score(y_test, y_pred_sgd)
recall

0.4696969696969697

### Try Logistic Regression

In [17]:
# Logistic Regression model
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression()  # default solver; solver='liblinear' is an alternative

# Train
logistic_model.fit(X_train, y_train)

# Predict and check recall
y_pred_lr = logistic_model.predict(X_test)

recall = recall_score(y_test, y_pred_lr)
recall

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.9545454545454546
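
The warning above suggests either raising `max_iter` or scaling the data. The following is a minimal sketch of both suggestions combined, not the notebook's own method: a `StandardScaler` placed in front of `LogisticRegression` in a pipeline. The value `max_iter=1000` and the names `scaled_lr` and `y_pred_scaled` are assumptions made for illustration, and the resulting recall is not reported here because it was not part of the original run.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=20)

# The scaler is fit on the training split only, then applied to the test split
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

y_pred_scaled = scaled_lr.predict(X_te)
print(recall_score(y_te, y_pred_scaled))
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training. The same wrapping may also help the SGD Classifier, which is sensitive to feature scale.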

## 6) Evaluate the models

In [18]:
# Precision, Recall, F1 score
# Use classification_report from sklearn.metrics to check all of these metrics at once

# Decision Tree model
print("[ Decision Tree ]")
print(classification_report(y_test, y_pred_dt))

[ Decision Tree ]
              precision    recall  f1-score   support

           0       0.96      0.92      0.94        48
           1       0.94      0.97      0.96        66

    accuracy                           0.95       114
   macro avg       0.95      0.94      0.95       114
weighted avg       0.95      0.95      0.95       114
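
In the report, rows `0` and `1` are the malignant and benign classes, and each row's recall treats that class as the positive one. The sketch below uses small made-up labels (not the notebook's predictions) to show two options of `classification_report` that make the rows easier to read and to use in code: `target_names` and `output_dict`.

```python
from sklearn.metrics import classification_report

# Toy labels for illustration only (0 = malignant, 1 = benign)
y_true = [0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 1, 0]

# target_names replaces the 0/1 row labels with readable class names
print(classification_report(y_true, y_hat, target_names=["malignant", "benign"]))

# output_dict=True returns the same numbers as a nested dict
report = classification_report(
    y_true, y_hat, target_names=["malignant", "benign"], output_dict=True
)
malignant_recall = report["malignant"]["recall"]  # 2 of 3 malignant samples found
```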



In [19]:
# Random Forest model
print("[ Random Forest ]")
print(classification_report(y_test, y_pred_rf))

[ Random Forest ]
              precision    recall  f1-score   support

           0       1.00      0.96      0.98        48
           1       0.97      1.00      0.99        66

    accuracy                           0.98       114
   macro avg       0.99      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114



In [20]:
# SVM model
print("[ SVM ]")
print(classification_report(y_test, y_pred_svm))

[ SVM ]
              precision    recall  f1-score   support

           0       1.00      0.83      0.91        48
           1       0.89      1.00      0.94        66

    accuracy                           0.93       114
   macro avg       0.95      0.92      0.93       114
weighted avg       0.94      0.93      0.93       114



In [21]:
# SGD Classifier model
print("[ SGD Classifier ]")
print(classification_report(y_test, y_pred_sgd))

[ SGD Classifier ]
              precision    recall  f1-score   support

           0       0.58      1.00      0.73        48
           1       1.00      0.47      0.64        66

    accuracy                           0.69       114
   macro avg       0.79      0.73      0.69       114
weighted avg       0.82      0.69      0.68       114



In [22]:
# Logistic Regression model
print("[ Logistic Regression ]")
print(classification_report(y_test, y_pred_lr))

[ Logistic Regression ]
              precision    recall  f1-score   support

           0       0.93      0.90      0.91        48
           1       0.93      0.95      0.94        66

    accuracy                           0.93       114
   macro avg       0.93      0.93      0.93       114
weighted avg       0.93      0.93      0.93       114



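In breast cancer classification, a patient who actually has cancer must not be classified as cancer-free, so recall is the most important of these metrics. In this dataset the cancer class (malignant) is label 0, so the clinically relevant recall is the recall in row 0 of the reports above.    
    
    Recall = TP / (TP + FN)
    - Higher recall is better.
    - TP (True Positive) counts actual positives that were predicted positive, so higher is better.
    - FN (False Negative), in the denominator, counts actual positives that were predicted negative, so lower is better.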
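
To make the formula concrete, here is a small worked sketch in plain Python with made-up labels. It also shows why the choice of positive class matters: scikit-learn's `recall_score` treats label 1 as positive by default (`pos_label=1`), which in this dataset is the benign class. The helper `recall` below is hypothetical and written only for illustration.

```python
def recall(y_true, y_pred, positive):
    """Recall = TP / (TP + FN) for the class treated as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)

# Toy labels for illustration only (0 = malignant, 1 = benign)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

recall_malignant = recall(y_true, y_pred, positive=0)  # 3 / (3 + 1) = 0.75
recall_benign = recall(y_true, y_pred, positive=1)     # 6 / (6 + 0) = 1.0
print(recall_malignant, recall_benign)
```

The one missed malignant case lowers malignant recall to 0.75, while benign recall stays at 1.0 and would hide the miss if it were the only number reported.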
    
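The macro-average recall of each model, read from the reports above, with the recall of the malignant class (label 0) in parentheses:
- Decision Tree: 0.94 (0.92)
- Random Forest: 0.98 (0.96)
- SVM: 0.92 (0.83)
- SGD Classifier: 0.73 (1.00)
- Logistic Regression: 0.93 (0.90)

By macro-average recall, Random Forest predicted best on this breast cancer classification problem. The SGD Classifier reaches a malignant recall of 1.00 only because it also labels many benign samples as malignant (benign recall 0.47, accuracy 0.69).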
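
The comparison can also be computed directly rather than read from the reports. The sketch below refits the five models on the same split and prints both the malignant-class recall (`pos_label=0`) and the macro-average recall. Two settings are assumptions added for reproducibility and are not in the notebook: `random_state=200` for the SGD Classifier and `max_iter=10000` for Logistic Regression. Because of them, the numbers may not match the reports above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=20)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=200),
    "Random Forest": RandomForestClassifier(random_state=200),
    "SVM": SVC(),
    "SGD Classifier": SGDClassifier(random_state=200),  # seed is an assumption
    "Logistic Regression": LogisticRegression(max_iter=10000),  # assumption
}

results = {}
for name, model in models.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        # pos_label=0 treats malignant as the positive class
        "malignant": recall_score(y_te, y_hat, pos_label=0),
        # macro averages the recall of both classes with equal weight
        "macro": recall_score(y_te, y_hat, average="macro"),
    }
    print(f"{name:20s} malignant={results[name]['malignant']:.2f} "
          f"macro={results[name]['macro']:.2f}")
```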
