# Chapter 6: Best Practices for Model Evaluation and Hyperparameter Tuning
<!--  ![machine-learning-process](images/ch6_machine-learning-process.png) -->

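## Topics covered in Chapter 6
- Obtaining unbiased estimates of model performance
- Diagnosing problems common to machine learning algorithms
- Tuning machine learning models
- Evaluating predictive models using different performance metrics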

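## 6.1 Streamlining workflows with pipelines
### 6.1.1 The Breast Cancer Wisconsin dataset
Dataset overview
- Data derived from images of breast tumors. Each sample is labeled as malignant (breast cancer) or benign
- Created by Dr. H. Wolberg of the University of Wisconsin Hospitals
- The class label is binary
  - malignant
  - benign
- There are 3 x 10 = 30 features
  - The mean, the worst (most extreme) value, and the standard error of each of the following: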
    - a) radius (mean of distances from center to points on the perimeter)
    - b) texture (standard deviation of gray-scale values)
    - c) perimeter
    - d) area
    - e) smoothness (local variation in radius lengths)
    - f) compactness (perimeter^2 / area - 1.0)
    - g) concavity (severity of concave portions of the contour)
    - h) concave points (number of concave portions of the contour)
    - i) symmetry
    - j) fractal dimension ("coastline approximation" - 1)
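
The 3 x 10 structure can be checked directly from the feature names. The following is a minimal sketch, assuming the copy of the dataset bundled with scikit-learn (`load_breast_cancer`), where each name starts with `mean`, ends with `error`, or starts with `worst`. The `groups` dictionary is a helper introduced here for illustration only.

```python
from sklearn.datasets import load_breast_cancer

feature_names = load_breast_cancer().feature_names

# Group the 30 feature names by the statistic they report.
groups = {
    'mean': [n for n in feature_names if n.startswith('mean ')],
    'error': [n for n in feature_names if n.endswith(' error')],
    'worst': [n for n in feature_names if n.startswith('worst ')],
}

# Each statistic should cover the same 10 measurements (radius, texture, ...).
for stat, names in groups.items():
    print(stat, len(names))
```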

In [48]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)

print('[data size]')
print(' - X_train: ', X_train.shape)
print(' - X_test:  ', X_test.shape)
print(' - y_train: ', y_train.shape)
print(' - y_test:  ', y_test.shape)

[data size]
 - X_train:  (455, 30)
 - X_test:   (114, 30)
 - y_train:  (455,)
 - y_test:   (114,)
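
The split above passes `stratify=y`, which asks `train_test_split` to keep the class proportions roughly equal in both subsets. A small sketch of how this could be verified, using the same split settings as above:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Fraction of each integer class label (0 and 1) in the full data and in each subset.
fractions = {}
for name, labels in [('all', y), ('train', y_train), ('test', y_test)]:
    fractions[name] = np.bincount(labels) / len(labels)
    print(name, fractions[name])
```

With stratification, the three printed rows should be nearly identical. Without it, a small test set could end up with a noticeably different class balance.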


In [49]:
import pandas as pd

df_X = pd.DataFrame(data.data, columns=data.feature_names)
data.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

In [41]:
df_X.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [50]:
data.target_names

array(['malignant', 'benign'], dtype='<U9')
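
The position of each name in `target_names` is the integer label used in `data.target`, so in scikit-learn's copy of the dataset 0 means malignant and 1 means benign. A short sketch that prints this mapping together with the number of samples per class:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# np.bincount counts how often each integer label (0, 1) occurs in the target.
counts = np.bincount(data.target)

for label, (name, count) in enumerate(zip(data.target_names, counts)):
    print('{} -> {}: {} samples'.format(label, name, count))
```

Keeping this mapping in mind matters later when interpreting metrics such as precision and recall, because label 1 (the default "positive" class in many scikit-learn metrics) is benign here, not malignant.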

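### Combining transformers and estimators in a pipeline

#### Goal
Perform linear classification using logistic regression

#### Steps
1. Standardize the features
2. Use principal component analysis (PCA) to compress the features from 30 dimensions into a 2-dimensional subspace
3. Make predictions with logistic regression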


In [47]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(StandardScaler(), 
                         PCA(n_components=2), 
                         LogisticRegression(random_state=1))
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

print('Test accuracy: {}'.format(pipeline.score(X_test, y_test)))

Test accuracy: 0.956140350877193
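
To see what the fitted pipeline does internally, its steps can be inspected individually. The sketch below rebuilds the same pipeline and then looks inside it. It assumes a scikit-learn version that supports pipeline slicing (0.21 or later). The variable names `X_test_2d` and `manual_pred` are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

pipeline = make_pipeline(StandardScaler(),
                         PCA(n_components=2),
                         LogisticRegression(random_state=1))
pipeline.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name.
print(list(pipeline.named_steps))

# Slicing returns a sub-pipeline; here, every step except the classifier.
# It applies the scaler and the PCA that were fitted on the training data.
X_test_2d = pipeline[:-1].transform(X_test)
print(X_test_2d.shape)

# Feeding the transformed data to the final estimator reproduces pipeline.predict.
manual_pred = pipeline[-1].predict(X_test_2d)
print((manual_pred == pipeline.predict(X_test)).all())
```

The key point is that calling `predict` on the pipeline only transforms the test data with parameters learned from the training data. The test data is never used to fit the scaler or the PCA.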


#### Why pipelines are needed

Without a pipeline, the same workflow would have to be implemented as follows: