In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [2]:
# Prepare the data
breast = load_breast_cancer()

In [3]:
# Explore the data
data = breast.data
label = breast.target
target_name = breast.target_names
desc = breast.DESCR

In [4]:
print(desc)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [5]:
print(target_name)

['malignant' 'benign']


In [6]:
# Split into train and test sets
random_seed = 25

x_train, x_test, y_train, y_test = train_test_split(
    data,
    label,
    test_size=0.2,
    random_state=random_seed
)

In [7]:
# Train several models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression

decision = DecisionTreeClassifier(random_state=random_seed)
random_forest = RandomForestClassifier(random_state=random_seed)
svm_model = svm.SVC(random_state=random_seed)
sgd = SGDClassifier(random_state=random_seed)
logistic = LogisticRegression(random_state=random_seed)

decision.fit(x_train, y_train)
random_forest.fit(x_train, y_train)
svm_model.fit(x_train, y_train)
sgd.fit(x_train, y_train)
logistic.fit(x_train, y_train)

decision_pred = decision.predict(x_test)
random_pred = random_forest.predict(x_test)
svm_pred = svm_model.predict(x_test)
sgd_pred = sgd.predict(x_test)
logistic_pred = logistic.predict(x_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
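
The warning above suggests either raising `max_iter` or scaling the data. As a minimal sketch of the scaling option, the features can be standardized inside a pipeline before `LogisticRegression` is fitted. The names `scaled_logistic` and `scaled_pred` are my own and not part of the notebook, and the scores reported below were produced without this step.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

random_seed = 25

# Same split settings as the notebook cell above.
data, label = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    data, label, test_size=0.2, random_state=random_seed
)

# Hypothetical variant: standardize each feature, then fit logistic regression.
# The scaler is fitted on the training data only, so no test information leaks in.
scaled_logistic = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=random_seed),
)
scaled_logistic.fit(x_train, y_train)
scaled_pred = scaled_logistic.predict(x_test)
```

Wrapping both steps in one pipeline keeps the scaling and the model together, so `predict` applies the same transformation to the test set automatically.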


In [8]:
# Evaluate
print(classification_report(y_test, decision_pred))
print(classification_report(y_test, random_pred))
print(classification_report(y_test, svm_pred))
print(classification_report(y_test, sgd_pred))
print(classification_report(y_test, logistic_pred))

              precision    recall  f1-score   support

           0       0.92      0.85      0.88        39
           1       0.92      0.96      0.94        75

    accuracy                           0.92       114
   macro avg       0.92      0.90      0.91       114
weighted avg       0.92      0.92      0.92       114

              precision    recall  f1-score   support

           0       0.92      0.90      0.91        39
           1       0.95      0.96      0.95        75

    accuracy                           0.94       114
   macro avg       0.93      0.93      0.93       114
weighted avg       0.94      0.94      0.94       114

              precision    recall  f1-score   support

           0       0.97      0.77      0.86        39
           1       0.89      0.99      0.94        75

    accuracy                           0.91       114
   macro avg       0.93      0.88      0.90       114
weighted avg       0.92      0.91      0.91       114

              preci

In [9]:
print(f'data.shape : {data.shape}\n'
      f'label.shape : {label.shape}')

data.shape : (569, 30)
label.shape : (569,)


### Data

|Dataset size|Number of features|Feature type|
|:---:|:--:|:--:|
|569|30|float|

### Model performance comparison (macro avg, F1 score)

|random_forest|logistic_regression|decision_tree|support_vector_machine|stochastic_gradient_descent|
|:---:|:---:|:---:|:---:|:---:|
|0.93|0.93|0.91|0.90|0.86|
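
The table values were read off the `classification_report` output by hand. As a rough sketch, the same macro-averaged F1 scores could be collected programmatically with `f1_score(..., average="macro")`. The `models` and `macro_f1` dictionaries are names I introduce for illustration, and exact scores can vary slightly between scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

random_seed = 25

data, label = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    data, label, test_size=0.2, random_state=random_seed
)

# Hypothetical helper mapping: table column name -> untrained model.
models = {
    "random_forest": RandomForestClassifier(random_state=random_seed),
    "logistic_regression": LogisticRegression(random_state=random_seed),
    "decision_tree": DecisionTreeClassifier(random_state=random_seed),
    "support_vector_machine": SVC(random_state=random_seed),
    "stochastic_gradient_descent": SGDClassifier(random_state=random_seed),
}

macro_f1 = {}
for name, model in models.items():
    # LogisticRegression may emit a ConvergenceWarning on the unscaled data.
    model.fit(x_train, y_train)
    macro_f1[name] = f1_score(y_test, model.predict(x_test), average="macro")

# Print the models from best to worst macro F1.
for name, score in sorted(macro_f1.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} {score:.2f}")
```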

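### Analysis

Dataset size : moderate (569 samples)   
Class balance : imbalanced (the test split has 39 malignant and 75 benign samples)   
Number of features : moderate (30)   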
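
The class balance can be checked directly from the labels. This is a small sketch using `numpy.bincount`; the variable names `class_counts` and `class_ratio` are mine.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()

# Count how many samples carry each label (0 = malignant, 1 = benign).
class_counts = np.bincount(breast.target)
class_ratio = class_counts / class_counts.sum()

for name, count, ratio in zip(breast.target_names, class_counts, class_ratio):
    print(f"{name:10s} {count:4d} ({ratio:.1%})")
```

On the full dataset this gives 212 malignant and 357 benign samples, so roughly 37% of the samples belong to the minority class.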

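### Metrics = macro avg, F1 score

- F1 score : it takes both precision and recall into account, so I think it gives a more complete picture than either metric alone.
- macro avg : I like that it reflects per-class performance regardless of how many samples each class has.   
    weighted avg is dominated by the class with more samples, so performance on the smaller class is largely masked.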
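
To make the difference concrete, here is a small worked example that recomputes both averages from the per-class F1 scores and supports in the SVM report above. The inputs are the rounded values printed in the report, so the results only agree with it to two decimal places.

```python
# Per-class F1 scores and supports copied from the SVM classification report.
f1_per_class = {"0": 0.86, "1": 0.94}
support = {"0": 39, "1": 75}

# macro avg: plain mean of the per-class scores, each class counts equally.
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

# weighted avg: mean of the per-class scores weighted by class support.
total = sum(support.values())
weighted_f1 = sum(f1_per_class[c] * support[c] for c in f1_per_class) / total

print(f"macro avg    : {macro_f1:.2f}")     # 0.90
print(f"weighted avg : {weighted_f1:.2f}")  # 0.91
```

The weighted average is pulled toward the score of class 1 because that class has nearly twice as many test samples, while the macro average sits exactly halfway between the two classes.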