**Classification model evaluation**
- **Confusion matrix**
  - ```python
    from sklearn.metrics import confusion_matrix
    matrix = confusion_matrix(y_train, pred)
    ```
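  - A matrix that cross-tabulates actual labels against predicted labels
  - ```
                Predicted P  Predicted N
    Actual P        TP           FN
    Actual N        FP           TN
    ```
  - Note: `confusion_matrix` sorts labels in ascending order, so for 0/1 labels it returns `[[TN, FP], [FN, TP]]` (negative class first)
- **Accuracy**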
  - ```python
    from sklearn.metrics import accuracy_score
    score = accuracy_score(y_train, pred)
    ```
  - `correct predictions / all samples`
  - `(TP + TN) / TOTAL`, where `TOTAL = TP + TN + FP + FN`
- **Precision**
  - ```python
    from sklearn.metrics import precision_score
    score = precision_score(y_train, pred)
    ```
  - `correctly predicted positives / all samples predicted positive`
  - `TP / (TP + FP)`
- **Recall**
  - ```python
    from sklearn.metrics import recall_score
    score = recall_score(y_train, pred)
    ```
  - `correctly predicted positives / all actual positives`
  - `TP / (TP + FN)`
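
The formulas above can be checked by hand against scikit-learn. The sketch below uses made-up toy labels (`y_true` and `y_pred` are hypothetical, not taken from the lecture data) and unpacks the matrix with `ravel()`, relying on the ascending label order that `confusion_matrix` uses for 0/1 labels.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical toy labels for illustration only (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For 0/1 labels the matrix is [[TN, FP], [FN, TP]]
# (rows = actual, columns = predicted), so ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# The hand-computed values should agree with the library functions
assert np.isclose(accuracy, accuracy_score(y_true, y_pred))
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))

print(f'TP={tp} FN={fn} FP={fp} TN={tn}')
print(f'accuracy={accuracy}, precision={precision}, recall={recall}')
```

With these toy labels there are 3 true positives, 1 false negative, 1 false positive and 5 true negatives, so accuracy is 8/10 while precision and recall are both 3/4.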


## Lecture 01 - Classification model evaluation

In [62]:
# import pandas
import pandas as pd

pd.options.display.max_columns = 5
pd.options.display.max_rows = 10

In [63]:
# 1. load dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

In [64]:
# 2. set X, y
X = pd.DataFrame(data=data.data, columns=data.feature_names)
y = pd.Series(data=data.target)

print(X.head())

   mean radius  mean texture  ...  worst symmetry  worst fractal dimension
0        17.99         10.38  ...          0.4601                  0.11890
1        20.57         17.77  ...          0.2750                  0.08902
2        19.69         21.25  ...          0.3613                  0.08758
3        11.42         20.38  ...          0.6638                  0.17300
4        20.29         14.34  ...          0.2364                  0.07678

[5 rows x 30 columns]


In [65]:
# 3. check X, y
print(X.info())                 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [66]:
print(X.isnull().sum())

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
                          ..
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
Length: 30, dtype: int64


In [67]:
print(X.describe())

       mean radius  mean texture  ...  worst symmetry  worst fractal dimension
count   569.000000    569.000000  ...      569.000000               569.000000
mean     14.127292     19.289649  ...        0.290076                 0.083946
std       3.524049      4.301036  ...        0.061867                 0.018061
min       6.981000      9.710000  ...        0.156500                 0.055040
25%      11.700000     16.170000  ...        0.250400                 0.071460
50%      13.370000     18.840000  ...        0.282200                 0.080040
75%      15.780000     21.800000  ...        0.317900                 0.092080
max      28.110000     39.280000  ...        0.663800                 0.207500

[8 rows x 30 columns]


In [68]:
print(y.value_counts())

1    357
0    212
dtype: int64


In [69]:
print(y.value_counts() / len(y))

1    0.627417
0    0.372583
dtype: float64
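
The class ratio above matters when reading accuracy: a model that always predicts the majority class (1, benign in this dataset) already reaches about 0.627 accuracy. A minimal sketch of that baseline follows, using `DummyClassifier`, which is not part of the original lecture and is shown here only as a reference point.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

X, y = load_breast_cancer(return_X_y=True)

# Baseline that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, pred)

# Recall for class 0 (malignant): the baseline never predicts 0,
# so it finds none of the 212 malignant samples
recall_class0 = recall_score(y, pred, pos_label=0)

print(f'Baseline accuracy: {baseline_accuracy:.6f}')
print(f'Recall (class 0):  {recall_class0}')
```

The baseline accuracy equals the majority-class share (357 / 569), while recall for class 0 is zero. This is one reason to look at precision and recall alongside accuracy.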


In [70]:
# 4. split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2)

print(len(X_train), len(X_test))
print(len(y_train), len(y_test))

455 114
455 114
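
The split above is random, so the scores change from run to run. As a sketch of one way to make it repeatable, the variant below fixes `random_state` (the value 42 is an arbitrary choice, not from the lecture) and passes `stratify=y` so both subsets keep roughly the same class ratio.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y preserves the class proportions in both subsets;
# random_state makes the split reproducible (42 is arbitrary)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Labels are 0/1, so the mean is the share of class 1
train_ratio = y_train.mean()
test_ratio = y_test.mean()

print(len(X_train), len(X_test))
print(f'share of class 1: train={train_ratio:.3f}, test={test_ratio:.3f}')
```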


In [71]:
# 5. build model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

score = model.score(X_train, y_train)
print(f'SCORE(TRAIN): {score}')

score = model.score(X_test, y_test)
print(f' SCORE(TEST): {score}\n')

pred = model.predict(X_test[:10])
print(f'PREDICT: {pred}')
print(f' ANSWER: {y_test[:10].values}\n')

# weights and bias
print(f'WEIGHT: {model.coef_}')
print(f'  BIAS: {model.intercept_}\n')

# evaluation metrics
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
pred = model.predict(X_train)

matrix = confusion_matrix(y_train, pred)
print(f'Confusion matrix\n{matrix}\n')

score = accuracy_score(y_train, pred)
print(f' Accuracy: {score}')

score = precision_score(y_train, pred)
print(f'Precision: {score}')

score = recall_score(y_train, pred)
print(f'   Recall: {score}')

SCORE(TRAIN): 0.9648351648351648
 SCORE(TEST): 0.9385964912280702

PREDICT: [1 1 0 0 0 1 1 1 1 1]
 ANSWER: [1 1 0 0 0 1 1 1 1 1]

WEIGHT: [[ 0.93820373  0.21284791 -0.3640111   0.02655963 -0.13871486 -0.21257181
  -0.48490997 -0.24761432 -0.22108582 -0.03272396 -0.07194268  1.68413916
   0.49477002 -0.14989175 -0.02267109 -0.01318917 -0.09614012 -0.03842796
  -0.05184831 -0.00246896  0.40845622 -0.45945504 -0.12716058 -0.01163199
  -0.29788307 -0.65203753 -1.38944079 -0.50876167 -0.69034601 -0.11071522]]
  BIAS: [30.9795594]

Confusion matrix
[[157   9]
 [  7 282]]

 Accuracy: 0.9648351648351648
Precision: 0.9690721649484536
   Recall: 0.9757785467128027
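
The metrics above are computed on the training data. As a complementary sketch, the same metrics can be computed on the held-out test set, which gives a better idea of how the model behaves on unseen samples. The fixed `random_state=42` and the extra `pos_label=0` recall are assumptions added for illustration, not part of the original lecture.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Evaluate on samples the model has not seen during training
pred = model.predict(X_test)

matrix = confusion_matrix(y_test, pred)   # [[TN, FP], [FN, TP]]
accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)  # positive class defaults to 1
recall = recall_score(y_test, pred)

# Recall with class 0 (malignant) treated as the positive class
recall_malignant = recall_score(y_test, pred, pos_label=0)

print(f'Confusion matrix\n{matrix}\n')
print(f' Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'   Recall: {recall}')
print(f'   Recall (class 0): {recall_malignant}')
```

By default `precision_score` and `recall_score` treat label 1 as the positive class. In this dataset label 1 is benign, so `pos_label=0` is the variant to use when missed malignant cases are the main concern.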
