# EP2_3: Breast Cancer Diagnosis
* Diagnose whether a breast tumor is malignant or benign (a True/False decision).
* Uses 569 breast cancer samples, each described by 30 features.

## 1. Preparing and Inspecting the Data

In [1]:
# (1) Import the required modules
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [2]:
# (2) Load the data
bc = load_breast_cancer()

In [3]:
print(bc.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])


In [4]:
# (3) Understand the data
# Assign the feature data
bc_data = bc.data
print(bc_data.shape)
# Assign the label data
bc_label = bc.target
# Print the target names
print("<Target Names>")
print(bc.target_names)
# Print the dataset description
print("<DESCR>")
print(bc.DESCR)

(569, 30)
<Target Names>
['malignant' 'benign']
<DESCR>
.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resul
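The target names above show that label 0 means malignant and label 1 means benign. As a quick check of how the 569 samples divide between the two classes, a minimal sketch (the variable name `counts` is illustrative and not part of the original notebook):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()

# np.bincount counts how many samples carry each integer label (0, 1).
counts = np.bincount(bc.target)

# target_names is ordered by label, so index 0 is 'malignant' and index 1 is 'benign'.
for name, count in zip(bc.target_names, counts):
    print(f"{name}: {count} samples ({count / len(bc.target):.1%})")
```

The classes are not evenly balanced, which is worth keeping in mind when reading accuracy figures later on.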

## 2. Splitting into Train and Test Sets

In [5]:
# (4) Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(bc_data, bc_label, test_size=0.2, random_state=7, stratify=bc_label)

print('X_train size:', len(X_train), ', X_test size:', len(X_test))
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

X_train size: 455 , X_test size: 114
(455, 30) (455,)
(114, 30) (114,)
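The split above passes `stratify`, which should keep the malignant/benign ratio roughly the same in both subsets. A small sketch to confirm that, assuming the same split arguments as the cell above (the helper `malignant_share` is a hypothetical name introduced only for this illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

bc = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    bc.data, bc.target, test_size=0.2, random_state=7, stratify=bc.target
)

def malignant_share(labels):
    """Return the fraction of samples labeled 0 (malignant)."""
    return float(np.mean(labels == 0))

# With stratification, the three shares should be nearly identical.
print(f"full dataset: {malignant_share(bc.target):.3f}")
print(f"train set:    {malignant_share(y_train):.3f}")
print(f"test set:     {malignant_share(y_test):.3f}")
```

Without `stratify`, a small test set could end up with noticeably more or fewer malignant cases than the full dataset, which would make per-class metrics harder to compare.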


## 3. Training and Evaluating Models

In [6]:
# (5) Train several different models

# Try a Decision Tree
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier(random_state=32)
decision_tree.fit(X_train, y_train)
y_pred1 = decision_tree.predict(X_test)
print("Decision Tree: \n", classification_report(y_test, y_pred1))
print(confusion_matrix(y_test, y_pred1))

# Try a Random Forest
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(random_state=32)
random_forest.fit(X_train, y_train)
y_pred2 = random_forest.predict(X_test)
print("Random Forest: \n", classification_report(y_test, y_pred2))
print(confusion_matrix(y_test, y_pred2))

# Try an SVM
from sklearn import svm
svm_model = svm.SVC()
svm_model.fit(X_train, y_train)
y_pred3 = svm_model.predict(X_test)
print("SVM          : \n", classification_report(y_test, y_pred3))
print(confusion_matrix(y_test, y_pred3))

# Try an SGD Classifier
from sklearn.linear_model import SGDClassifier
sgd_model = SGDClassifier()
sgd_model.fit(X_train, y_train)
y_pred4 = sgd_model.predict(X_test)
print("SGD Classifier:\n", classification_report(y_test, y_pred4))
print(confusion_matrix(y_test, y_pred4))

# Try Logistic Regression
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
y_pred5 = logistic_model.predict(X_test)
print("Logistic Reg.: \n", classification_report(y_test, y_pred5))
print(confusion_matrix(y_test, y_pred5))

Decision Tree: 
               precision    recall  f1-score   support

           0       0.88      1.00      0.93        42
           1       1.00      0.92      0.96        72

    accuracy                           0.95       114
   macro avg       0.94      0.96      0.94       114
weighted avg       0.95      0.95      0.95       114

[[42  0]
 [ 6 66]]
Random Forest: 
               precision    recall  f1-score   support

           0       0.93      0.95      0.94        42
           1       0.97      0.96      0.97        72

    accuracy                           0.96       114
   macro avg       0.95      0.96      0.95       114
weighted avg       0.96      0.96      0.96       114

[[40  2]
 [ 3 69]]
SVM          : 
               precision    recall  f1-score   support

           0       0.95      0.90      0.93        42
           1       0.95      0.97      0.96        72

    accuracy                           0.95       114
   macro avg       0.95      0.94      

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
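The warning above comes from the Logistic Regression solver stopping before it converged, and it suggests either raising `max_iter` or scaling the data. One possible way to follow both suggestions is sketched below, assuming the same split as in section 2. The names `scaled_logistic` and `y_pred_scaled` and the value `max_iter=1000` are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

bc = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    bc.data, bc.target, test_size=0.2, random_state=7, stratify=bc.target
)

# The pipeline fits the scaler on the training data only and then applies
# the same scaling to the test data, so no test information leaks into training.
scaled_logistic = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
scaled_logistic.fit(X_train, y_train)
y_pred_scaled = scaled_logistic.predict(X_test)

print(classification_report(y_test, y_pred_scaled, target_names=bc.target_names))

# n_iter_ reports how many iterations the solver actually used.
print("iterations used:", scaled_logistic.named_steps["logisticregression"].n_iter_)
```

Scaling matters here because the 30 features span very different ranges (area versus smoothness, for example), which slows the solver down. SVM and SGD are also scale-sensitive, so the same pipeline pattern may help them too.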


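## Results
Overall, the models achieved high prediction scores.

Among the reports shown above, Random Forest reached the highest accuracy (0.96). Decision Tree followed at 0.95 accuracy, with a macro-average recall of 0.96.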

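The most dangerous error is classifying a malignant tumor as benign. With malignant (label 0) treated as the positive class, that error is a false negative (FN), so the FN count should be as low as possible. In the confusion matrices above it is the top-right entry: 0 for Decision Tree and 2 for Random Forest.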
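Because scikit-learn treats label 1 (benign) as the positive class by default, the missed malignant cases have to be read out explicitly. A minimal sketch for the Decision Tree, assuming the same split and `random_state` values as above (the names `missed_malignant`, `false_alarms`, and `malignant_recall` are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

bc = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    bc.data, bc.target, test_size=0.2, random_state=7, stratify=bc.target
)

decision_tree = DecisionTreeClassifier(random_state=32).fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)

# Rows are true labels and columns are predicted labels, both ordered [0, 1].
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
missed_malignant = cm[0, 1]  # truly malignant (0), predicted benign (1)
false_alarms = cm[1, 0]      # truly benign (1), predicted malignant (0)

# pos_label=0 makes recall measure the share of malignant cases that were caught.
malignant_recall = recall_score(y_test, y_pred, pos_label=0)

print("missed malignant cases:", missed_malignant)
print("benign flagged as malignant:", false_alarms)
print(f"malignant recall: {malignant_recall:.2f}")
```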

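#### Recall is the preferred evaluation metric here
The reasons are: 1) recall is less affected by the overall class distribution of the samples; 2) it shows, for each label, how many of the samples that truly carry that label were identified correctly; and 3) precision, which measures how many of the model's predictions for a label are actually correct, is not the priority in this task.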
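To illustrate the first reason, consider a baseline that ignores the features and always predicts the majority class. The sketch below uses scikit-learn's `DummyClassifier` for this, assuming the same split as above. The baseline is an addition for illustration only and was not one of the models in the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

bc = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    bc.data, bc.target, test_size=0.2, random_state=7, stratify=bc.target
)

# "most_frequent" always predicts the majority training label, which is benign (1).
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred_baseline = baseline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_baseline)
# pos_label=0 measures how many malignant cases were caught.
malignant_recall = recall_score(y_test, y_pred_baseline, pos_label=0)

print(f"baseline accuracy:         {accuracy:.2f}")
print(f"baseline malignant recall: {malignant_recall:.2f}")
```

The baseline gets 72 of the 114 test samples right simply because benign cases are more common, yet it catches none of the malignant ones. Accuracy rewards that behavior, while recall for the malignant class exposes it.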