## ch04
- https://github.com/thampiman/interpretable-ai-book/blob/master/Chapter_04/chapter_04.ipynb

<div style="text-align: right"> <b>Author : Kwang Myung Yu</b></div>
<div style="text-align: right"> Initial upload: 2023.10.05</div>
<div style="text-align: right"> Last update: 2023.10.05</div>

In [1]:
import os
import sys
import time
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from scipy import stats
import warnings; warnings.filterwarnings('ignore')
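#plt.style.use('ggplot')
# matplotlib >= 3.6 renamed the seaborn styles to 'seaborn-v0_8-*'; fall back to the old name otherwise.
whitegrid_style = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use(whitegrid_style)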
%matplotlib inline

In [2]:
import math
import numpy as np
np.random.seed(24)
import pandas as pd
from tqdm import tqdm

from sympy import *
import operator
from IPython.display import display

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score
from sklearn.preprocessing import StandardScaler

import torch
torch.manual_seed(24)
from torch.autograd import Variable
import torch.utils.data as data_utils
import torch.nn.init as init
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')
sns.set_palette("bright")

## Diagnostics+ AI: Breast cancer diagnosis

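- Let's return to Diagnostics+, the diagnostics center introduced in Chapters 1 and 2.
- The center wants to extend its AI capabilities to breast cancer diagnosis and has digitized images of fine needle aspirates of breast masses from about 570 patients.
- From these digitized images, features describing the characteristics of the cell nuclei present in each image were computed.
- Each cell nucleus is described by the following 10 features: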
    - Radius
    - Texture
    - Perimeter
    - Area
    - Smoothness
    - Compactness
    - Concavity
    - Concave points
    - Symmetry
    - Fractal dimension

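For each of these 10 features, the mean, the standard error, and the largest ("worst") value are computed over all nuclei present in a patient's image.  
Each patient therefore has 30 features in total.  
Given these input features, the goal of the AI system is to predict whether the cells are benign or malignant and to provide a confidence score that helps the doctor with the diagnosis.  
This is summarized in Figure 4.1.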

![Alt text](image.png)
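The count of 30 features follows from 10 nucleus measurements times 3 summary statistics. As a quick check, the sketch below groups the column names shipped with scikit-learn's `load_breast_cancer` by the statistic they carry. The grouping keys (`mean`, `error`, `worst`) are read off the dataset's feature names, and the `groups` dictionary is only an illustrative helper, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Illustrative helper: bucket every feature name by its summary statistic.
groups = {'mean': [], 'error': [], 'worst': []}
for name in data['feature_names']:
    words = str(name).split()
    for statistic in groups:
        if statistic in words:
            groups[statistic].append(str(name))

for statistic, names in groups.items():
    print(statistic, len(names), names[:3])
```

Each bucket should hold one column per nucleus measurement, which is where the 10 x 3 = 30 input features come from.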

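Given this information, how can we formulate the problem as a machine learning problem?  
Because the goal of the model is to predict whether a given breast mass is benign or malignant, we can formulate it as a binary classification problem.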
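A minimal sketch of how that binary target is encoded in the scikit-learn copy of the dataset, where label 0 means malignant and 1 means benign. The variable names (`labels`, `class_counts`) are illustrative and do not appear in the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
y = data['target']

# Map each integer label to its class name and count the patients per class.
labels, counts = np.unique(y, return_counts=True)
class_counts = {str(data['target_names'][label]): int(count)
                for label, count in zip(labels, counts)}
print(class_counts)
```

Checking the class counts early is useful because an imbalanced target affects how accuracy and the confidence scores should be read later.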

### Load and Prepare data

In [3]:
data = load_breast_cancer()

In [4]:
X = data['data']
y = data['target']
# Hold out 30% of the data, then split that part in half: roughly 70/15/15 train/val/test.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=24)
X_val, X_test, y_val, y_test = train_test_split(X_val, y_val, test_size=0.5, random_state=24)
# torch.autograd.Variable is deprecated (a no-op wrapper since PyTorch 0.4); plain tensors suffice.
X_train = torch.from_numpy(X_train)
X_val = torch.from_numpy(X_val)
y_train = torch.from_numpy(y_train)
y_val = torch.from_numpy(y_val)
X_test = torch.from_numpy(X_test)
y_test = torch.from_numpy(y_test)
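The two calls to `train_test_split` give roughly a 70/15/15 train/validation/test split. The arithmetic can be sketched as follows, assuming scikit-learn's behaviour of rounding the held-out share up when `test_size` is a fraction. The variable names are illustrative.

```python
import math

n_samples = 569  # rows in the breast cancer dataset

# First split: 30% is held out, the remaining 70% is used for training.
n_holdout = math.ceil(0.3 * n_samples)
n_train = n_samples - n_holdout

# Second split: the held-out part is halved into validation and test sets.
n_test = math.ceil(0.5 * n_holdout)
n_val = n_holdout - n_test

print(n_train, n_val, n_test)  # 398 85 86
```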

In [5]:
X[:5]

array([[1.799e+01, 1.038e+01, 1.228e+02, 1.001e+03, 1.184e-01, 2.776e-01,
        3.001e-01, 1.471e-01, 2.419e-01, 7.871e-02, 1.095e+00, 9.053e-01,
        8.589e+00, 1.534e+02, 6.399e-03, 4.904e-02, 5.373e-02, 1.587e-02,
        3.003e-02, 6.193e-03, 2.538e+01, 1.733e+01, 1.846e+02, 2.019e+03,
        1.622e-01, 6.656e-01, 7.119e-01, 2.654e-01, 4.601e-01, 1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, 1.326e+03, 8.474e-02, 7.864e-02,
        8.690e-02, 7.017e-02, 1.812e-01, 5.667e-02, 5.435e-01, 7.339e-01,
        3.398e+00, 7.408e+01, 5.225e-03, 1.308e-02, 1.860e-02, 1.340e-02,
        1.389e-02, 3.532e-03, 2.499e+01, 2.341e+01, 1.588e+02, 1.956e+03,
        1.238e-01, 1.866e-01, 2.416e-01, 1.860e-01, 2.750e-01, 8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, 1.203e+03, 1.096e-01, 1.599e-01,
        1.974e-01, 1.279e-01, 2.069e-01, 5.999e-02, 7.456e-01, 7.869e-01,
        4.585e+00, 9.403e+01, 6.150e-03, 4.006e-02, 3.832e-02, 2.058e-02,
        2.250e-02, 4.571e-03, 2.357e

In [6]:
X.shape

(569, 30)

In [7]:
X_train.shape

torch.Size([398, 30])

In [8]:
X_test.shape

torch.Size([86, 30])
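One detail worth checking at this point: `torch.from_numpy` keeps the NumPy dtype, so tensors built from this dataset are `float64`, while `torch.nn` layers use `float32` parameters by default. The sketch below shows the cast that is typically needed before feeding such tensors to a model. It uses random stand-in data rather than the real features, and the single `Linear` layer is only a placeholder, not the model from the book.

```python
import numpy as np
import torch

features = np.random.rand(4, 30)        # float64 stand-in for 4 patients x 30 features
tensor64 = torch.from_numpy(features)   # shares memory with the array, dtype preserved
tensor32 = tensor64.float()             # float32 copy, matching default layer weights

layer = torch.nn.Linear(30, 1)          # placeholder layer with float32 parameters
output = layer(tensor32)
print(tensor64.dtype, tensor32.dtype, tuple(output.shape))
```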

In [9]:
df_data = pd.DataFrame(X, columns=data['feature_names'])
df_data['target'] = y
df_benign = df_data[df_data['target'] == 1].reset_index(drop=True)
df_malignant = df_data[df_data['target'] == 0].reset_index(drop=True)
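The label convention used above (1 for benign, 0 for malignant) follows the scikit-learn dataset. As a small illustration of how the two groups can be compared, the sketch below computes per-class means for two columns. The choice of columns and the `diagnosis` column name are arbitrary and not taken from the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data['data'], columns=data['feature_names'])

# Replace the integer target with readable class names (0 = malignant, 1 = benign).
df['diagnosis'] = pd.Series(data['target']).map({0: 'malignant', 1: 'benign'})

# Average nucleus size per class for two of the 30 features.
class_means = df.groupby('diagnosis')[['mean radius', 'mean area']].mean()
print(class_means)
```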

In [10]:
df_data.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


In [11]:

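# Convert the tensors back to NumPy arrays before building the per-split DataFrames.
df_train = pd.DataFrame(X_train.numpy(), columns=data['feature_names'])
df_train['target'] = y_train.numpy()
df_val = pd.DataFrame(X_val.numpy(), columns=data['feature_names'])
df_val['target'] = y_val.numpy()
df_test = pd.DataFrame(X_test.numpy(), columns=data['feature_names'])
df_test['target'] = y_test.numpy()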

### EDA