## load_breast_cancer

### (1) Data preparation

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [2]:
# Load the dataset
breast_cancer = load_breast_cancer()
print(dir(breast_cancer))

['DESCR', 'data', 'feature_names', 'filename', 'frame', 'target', 'target_names']


In [3]:
# Inspect the available keys
breast_cancer.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [4]:
# Assign the features and labels of breast_cancer
# features
breast_cancer_data = breast_cancer.data

# labels
breast_cancer_label = breast_cancer.target

print(breast_cancer_data.shape)
print(breast_cancer_label.shape)

(569, 30)
(569,)
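
Before training, it helps to confirm which integer label stands for which diagnosis, since every metric later in this notebook depends on it. A minimal sketch, using only `target_names` and `target` from the loaded dataset (`class_counts` is a name introduced here for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()

# target_names is indexed by label value: position 0 names label 0, and so on.
target_names = breast_cancer.target_names

# Count how many samples carry each label.
class_counts = np.bincount(breast_cancer.target)

for label, (name, count) in enumerate(zip(target_names, class_counts)):
    print(label, name, count)
```

Label 0 is malignant and label 1 is benign, so the "0" rows in the classification reports below describe the malignant class.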


In [5]:
print(breast_cancer.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

### (2) Splitting into train and test sets

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(breast_cancer_data,
                                                    breast_cancer_label,
                                                    test_size=0.2,
                                                    random_state=7)
print(f'X_train size: {len(X_train)}, X_test size: {len(X_test)}')

X_train size: 455, X_test size: 114


In [7]:
X_train.shape, y_train.shape

((455, 30), (455,))
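
The split above is not stratified, so the share of malignant samples can differ between the training and test sets. The sketch below is one possible variant, not the notebook's original method: it passes `stratify=y` so both sets keep roughly the same class ratio. The names `train_ratio` and `test_ratio` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the malignant/benign proportions similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

# Fraction of malignant samples (label 0) in each subset.
train_ratio = np.mean(y_train == 0)
test_ratio = np.mean(y_test == 0)
print(f"malignant share, train: {train_ratio:.3f}, test: {test_ratio:.3f}")
```

Changing the split changes every score reported later, so results from a stratified split are not directly comparable with the outputs shown in this notebook.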

### (3) Training and evaluating several models

- Trying a Decision Tree

In [8]:
# Train the model
decision_tree = DecisionTreeClassifier(random_state=32)
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)


# Evaluate
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.92      0.82      0.87        40
           1       0.91      0.96      0.93        74

    accuracy                           0.91       114
   macro avg       0.91      0.89      0.90       114
weighted avg       0.91      0.91      0.91       114

[[33  7]
 [ 3 71]]
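
To make the confusion matrix easier to read, the sketch below recomputes two numbers from it by hand. The matrix values are copied from the Decision Tree output above; in scikit-learn, rows are true labels and columns are predicted labels, and label 0 is malignant in this dataset.

```python
import numpy as np

# Confusion matrix copied from the Decision Tree output above.
# Rows: true labels, columns: predicted labels (0 = malignant, 1 = benign).
cm = np.array([[33, 7],
               [3, 71]])

# Row 0 holds the 40 truly malignant samples: 33 caught, 7 missed.
true_malignant, missed_malignant = cm[0]
malignant_recall = true_malignant / cm[0].sum()

# Correct predictions lie on the diagonal.
accuracy = np.trace(cm) / cm.sum()

print(f"malignant recall: {malignant_recall:.3f}")  # 33 / 40 = 0.825
print(f"accuracy: {accuracy:.3f}")                  # 104 / 114 = 0.912
```

The 7 in the top-right cell counts malignant tumors predicted as benign, which is the error this task most needs to avoid.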


- Trying a Random Forest

In [9]:
random_forest = RandomForestClassifier(random_state=32)  # changed here
random_forest.fit(X_train, y_train)  # changed here
y_pred = random_forest.predict(X_test)  # changed here

# Evaluate
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        40
           1       1.00      1.00      1.00        74

    accuracy                           1.00       114
   macro avg       1.00      1.00      1.00       114
weighted avg       1.00      1.00      1.00       114

[[40  0]
 [ 0 74]]


- Trying an SVM

In [10]:
svm_model = svm.SVC()
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

# Evaluate
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      0.72      0.84        40
           1       0.87      1.00      0.93        74

    accuracy                           0.90       114
   macro avg       0.94      0.86      0.89       114
weighted avg       0.92      0.90      0.90       114

[[29 11]
 [ 0 74]]
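
The SVM above is trained on raw features whose scales differ by orders of magnitude, and RBF-kernel SVMs are generally sensitive to that. The notebook imports `StandardScaler` but never uses it, so the sketch below shows one way it could be applied. It assumes the same split parameters as cell 6; the pipeline and the names `scaled_svm` and `malignant_recall` are additions for illustration, not part of the original run.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

# The pipeline fits the scaler on the training data only,
# then applies the same transformation to the test data.
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_train, y_train)
y_pred = scaled_svm.predict(X_test)

# pos_label=0 asks for recall on the malignant class.
malignant_recall = recall_score(y_test, y_pred, pos_label=0)
print(f"malignant recall with scaling: {malignant_recall:.3f}")
```

Wrapping the scaler in a pipeline avoids leaking test-set statistics into training.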


- Trying an SGD Classifier

In [11]:
sgd_model = SGDClassifier()
sgd_model.fit(X_train, y_train)
y_pred = sgd_model.predict(X_test)

# Evaluate
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.90      0.85        40
           1       0.94      0.88      0.91        74

    accuracy                           0.89       114
   macro avg       0.87      0.89      0.88       114
weighted avg       0.89      0.89      0.89       114

[[36  4]
 [ 9 65]]
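
`SGDClassifier()` is created above without a `random_state`, so its scores can change from run to run because the training data is shuffled randomly. A minimal sketch of making the run repeatable, assuming the same split as cell 6 (the helper `fit_predict` and the seed value 0 are hypothetical choices made here):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

def fit_predict(seed):
    """Train an SGD classifier with a fixed seed and return test predictions."""
    model = SGDClassifier(random_state=seed)
    model.fit(X_train, y_train)
    return model.predict(X_test)

# Two runs with the same seed should give identical predictions.
first_run = fit_predict(0)
second_run = fit_predict(0)
same_predictions = np.array_equal(first_run, second_run)
print("identical predictions:", same_predictions)
```

Fixing the seed makes the result reproducible, but it does not make it better; comparing several seeds gives a fairer picture of how stable the model is.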


- Trying Logistic Regression

In [12]:
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
y_pred = logistic_model.predict(X_test)

# Evaluate
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.82      0.90        40
           1       0.91      1.00      0.95        74

    accuracy                           0.94       114
   macro avg       0.96      0.91      0.93       114
weighted avg       0.94      0.94      0.94       114

[[33  7]
 [ 0 74]]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
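
The warning above offers two remedies: raise `max_iter` or scale the data. The sketch below applies both and checks whether a `ConvergenceWarning` is still raised. It assumes the same split as cell 6; `max_iter=1000` is an arbitrary choice made here, and the resulting scores are not part of the original notebook.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

# Standardize the features, then fit logistic regression with a higher iteration cap.
scaled_logistic = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Record any warnings emitted during fitting instead of printing them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    scaled_logistic.fit(X_train, y_train)

convergence_warnings = [
    w for w in caught if issubclass(w.category, ConvergenceWarning)
]
print("convergence warnings:", len(convergence_warnings))
print("test accuracy:", scaled_logistic.score(X_test, y_test))
```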


### (4) Evaluating the models

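- The task is to separate malignant from benign tumors using 30 features across 569 breast cancer samples.
- A malignant tumor must not be classified as benign, so recall is used as the evaluation metric rather than accuracy. In this dataset label 0 is malignant, so the relevant figure is the recall of class 0.
- Random Forest is selected because it reaches the highest recall, 1.00, on this test split.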

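- After reading about AI and medical diagnosis, I learned that extracting the right raw data can make cancer diagnosis more accurate. According to the article cited below, malignant thyroid nodules have lower oxygen saturation than benign ones, and diagnosing with photoacoustic ultrasound data built on that observation was three times more accurate than the previous approach. Identifying and preparing the attributes that drive changes in the label seems to be the first step in machine learning.
Source: https://www.breaknews.com/817397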
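
The recall criterion from the evaluation notes rests on a single test split of 114 samples, so a perfect score there may not hold on other splits. The sketch below is an additional check that is not part of the original notebook: it estimates malignant recall for the Random Forest with 5-fold cross-validation. The scorer name `malignant_recall` and the choice of `cv=5` are assumptions made here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Recall computed on label 0, the malignant class in this dataset.
malignant_recall = make_scorer(recall_score, pos_label=0)

# Same model settings as cell 9, evaluated on five different train/test partitions.
scores = cross_val_score(
    RandomForestClassifier(random_state=32),
    X,
    y,
    cv=5,
    scoring=malignant_recall,
)
print("malignant recall per fold:", scores.round(3))
print(f"mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

If the per-fold values vary noticeably, the 1.00 from the single split is better read as an optimistic estimate than as the model's expected performance.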