# Section 8: Generalization Performance and Overfitting

## hold-out

hold-out
- sklearn.model_selection.train_test_split
    - train_test_split(X, y)
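    - Pass X and y as arguments; it returns the training and test portions of each
    - Set the test_size argument to the fraction of data used for testing (default: 0.25)
    - By convention, unpack the result as X_train, X_test, y_train, y_test = train_test_split(X, y)
    - Passing random_state=0 fixes the sampling so that the same split is obtained on every run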
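
The behavior of test_size and random_state described above can be checked on a small array. This is a minimal sketch on synthetic data; the names X_demo and y_demo are made up for illustration and are not part of the tips example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 20 samples, 2 features
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Default test_size=0.25 -> 15 training rows and 5 test rows
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
default_shapes = (X_tr.shape, X_te.shape)

# The same random_state reproduces exactly the same split
_, X_te_again, _, y_te_again = train_test_split(X_demo, y_demo, random_state=0)
same_split = np.array_equal(X_te, X_te_again) and np.array_equal(y_te, y_te_again)

# test_size=0.3 -> 6 of the 20 rows go to the test set
_, X_te_30, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

print(default_shapes, same_split, X_te_30.shape)
```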

MSE
- sklearn.metrics.mean_squared_error
    - Pass the true values and the predicted values, as in mean_squared_error(y_test, y_pred); it returns their mean squared error
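
As a quick check of what mean_squared_error computes, the sketch below compares it with the mean of squared differences written out in NumPy. The values in y_true_demo and y_pred_demo are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions, chosen only for illustration
y_true_demo = np.array([3.0, 2.0, 4.0, 5.0])
y_pred_demo = np.array([2.5, 2.0, 5.0, 4.0])

# Library call: true values first, predictions second
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

# Same quantity by hand: squared errors are 0.25, 0, 1, 1 -> mean 0.5625
mse_manual = np.mean(np.square(y_true_demo - y_pred_demo))

print(mse_sklearn, mse_manual)
```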

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

df = sns.load_dataset('tips')

y_col = 'tip'
X = df.drop(columns=[y_col])

# Get the list of numeric columns to standardize
numeric_cols = X.select_dtypes(include=np.number).columns.to_list()

X = pd.get_dummies(X, drop_first=True)
y = df[y_col]

# Split the sample data into training:test = 7:3
X_train, X_test, y_train, y_test =  train_test_split(X, y, test_size=0.3, random_state=0)

In [2]:
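# Standardization
# Standardize after splitting the sample data into training and test sets:
# fit the scaler on the training data only, then apply it to the test data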
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
# Standardize only the numeric columns
X_train_scaled[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test_scaled[numeric_cols] = scaler.transform(X_test[numeric_cols])

In [3]:
# Train a linear regression model
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

In [4]:
# Accuracy on the test data (MSE)
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, y_pred)
# Equivalent to np.mean(np.square(y_test - y_pred))

0.9550808988617148

## LOOCV

- sklearn.model_selection.LeaveOneOut
    1. Create an instance
    2. Iterate with the split(X) method
        - Yields train_index and test_index
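
To see what split(X) yields, the following sketch runs LeaveOneOut on four toy samples. X_demo is a made-up array for illustration; each iteration holds out exactly one sample as the test set.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Four toy samples with one feature (an assumption for illustration)
X_demo = np.array([[10.0], [20.0], [30.0], [40.0]])

loo_demo = LeaveOneOut()

# Collect the (train_index, test_index) pairs as plain lists
splits = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in loo_demo.split(X_demo)
]

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

With n samples there are n splits, so the model is fitted n times.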

In [5]:
X = df['total_bill'].values.reshape(-1,1)
y = df['tip']

In [6]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import mean_squared_error

loo = LeaveOneOut()   

In [7]:
model = LinearRegression()
mse_list = []
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    # y is a pandas Series, so select rows by position with .iloc
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Train the model
    model.fit(X_train, y_train)
    # Predict on the test data
    y_pred = model.predict(X_test)
    # MSE
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse) 

In [8]:
print(f"MSE(LOOCV): {np.mean(mse_list)}")
print(f"std: {np.std(mse_list)}")

MSE(LOOCV): 1.0675673489857436
std: 2.099794455177631


## Cross-Validation

Cross Validation
- sklearn.model_selection.cross_val_score
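    - cross_val_score(model, X, y, cv=cv) runs the whole cross-validation (CV) in a single call
    - Pass a CV splitter object such as LeaveOneOut() to the cv argument
    - n_jobs sets the number of CPU cores to use
    - scoring sets the evaluation metric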
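
The arguments listed above can be combined as in the sketch below. It uses synthetic data (X_demo, y_demo, and the coefficients used to generate them are assumptions for illustration) and shows that 'neg_mean_squared_error' returns negated values, so the sign is flipped to recover the MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic regression data (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 50, size=(30, 1))
y_demo = 1.0 + 0.1 * X_demo[:, 0] + rng.normal(0, 0.5, size=30)

scores_demo = cross_val_score(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=LeaveOneOut(),                  # one split per sample
    scoring='neg_mean_squared_error',  # higher is better, so the MSE is negated
    n_jobs=1,                          # number of CPU cores; -1 would use all of them
)

# Flip the sign to get the per-split MSE back
mse_per_split = -scores_demo
print(len(scores_demo), mse_per_split.mean())
```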

In [9]:
from sklearn.model_selection import cross_val_score
cv = LeaveOneOut()
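# 'neg_mean_squared_error' returns the negated MSE for each split, so the mean printed below is negative
scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')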
print(f"MSE(LOOCV): {np.mean(scores)}")
print(f"std: {np.std(scores)}")

MSE(LOOCV): -1.0675673489857436
std: 2.099794455177631


## k-Fold

k-fold
- sklearn.model_selection.KFold
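    1. Create a KFold() instance
        - Set the n_splits argument to k (default: k=5)
        - Setting the shuffle argument to True shuffles the data before splitting (default: False)
    2. Iterate with the .split(X) method
        - Yields train_index and test_index
- cross_val_score() can compute metrics such as MSE in a single call
    - Because cross_val_score takes the model, X, and y and handles the splitting internally, there is no place to standardize each fold separately
    - In that case, use a Pipeline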
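
The effect of n_splits and shuffle can be seen by listing the test indices of each fold. This is a small sketch on ten toy samples; X_demo is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples (an assumption for illustration)
X_demo = np.arange(10).reshape(-1, 1)

# Defaults: n_splits=5, shuffle=False -> test folds are contiguous blocks
test_blocks = [test_idx.tolist() for _, test_idx in KFold().split(X_demo)]
print(test_blocks)

# shuffle=True mixes the samples before splitting; random_state makes it reproducible
cv_shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_blocks = [test_idx.tolist() for _, test_idx in cv_shuffled.split(X_demo)]
print(shuffled_blocks)
```

In both cases every sample appears in exactly one test fold.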

Repeated k-Fold CV
- sklearn.model_selection.RepeatedKFold
    - Runs k-Fold CV multiple times
    - Set the n_repeats argument to the number of repetitions
    - Otherwise used in the same way as the KFold class
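
As a rough illustration of n_repeats, the sketch below counts the splits produced by RepeatedKFold on twenty toy samples (X_demo is an assumption for illustration). The total is n_splits × n_repeats.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Twenty toy samples (an assumption for illustration)
X_demo = np.arange(20).reshape(-1, 1)

rkf_demo = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)

# 5 folds repeated 3 times -> 15 train/test splits in total
n_total = rkf_demo.get_n_splits(X_demo)

# Each test fold holds 20 / 5 = 4 samples
test_sizes = [len(test_idx) for _, test_idx in rkf_demo.split(X_demo)]

print(n_total, test_sizes)
```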


In [10]:
from sklearn.model_selection import KFold

k = 5
cv = KFold(n_splits=k, shuffle=True, random_state=0)
model = LinearRegression()
mse_list = []

for train_index, test_index in cv.split(X):
    X_train, X_test = X[train_index], X[test_index]
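    # y is a pandas Series, so select rows by position with .iloc
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Standardization (not applied in this cell)
    
    # Train the model
    model.fit(X_train, y_train)
    # Predict on the test data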
    y_pred = model.predict(X_test)
    # MSE
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)

In [11]:
print(mse_list)

[0.8213090642766288, 1.0745842125927971, 1.0880123892600384, 1.3323867714930204, 1.084763004349474]


In [12]:
print(f"MSE(KFoldCV): {np.mean(mse_list)}")
print(f"std: {np.std(mse_list)}")

MSE(KFoldCV): 1.0802110883943918
std: 0.16170100507039514


In [13]:
scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')

In [14]:
scores

array([-0.82130906, -1.07458421, -1.08801239, -1.33238677, -1.084763  ])

In [15]:
from sklearn.model_selection import KFold, RepeatedKFold

k = 5
n_repeats = 3
# cv = KFold(n_splits=k, shuffle=True, random_state=0)
cv = RepeatedKFold(n_splits=k, n_repeats=n_repeats, random_state=0)
mse_list = []

for train_index, test_index in cv.split(X):
    X_train, X_test = X[train_index], X[test_index]
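    # y is a pandas Series, so select rows by position with .iloc
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Standardization (not applied in this cell)
    
    # Train the model
    model.fit(X_train, y_train)
    # Predict on the test data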
    y_pred = model.predict(X_test)
    # MSE
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)

In [16]:
print(f"MSE(KFoldCV): {np.mean(mse_list)}")
print(f"std: {np.std(mse_list)}")

MSE(KFoldCV): 1.0746387233165984
std: 0.26517178540898434


In [17]:
mse_list

[0.8213090642766288,
 1.0745842125927971,
 1.0880123892600384,
 1.3323867714930204,
 1.084763004349474,
 1.1587839131131425,
 1.6042084002514578,
 1.0307086207441927,
 0.7120290668798744,
 0.8472985410140899,
 0.8856103319481907,
 1.5248521639391936,
 0.6332659028150582,
 1.200354200262607,
 1.121414266809207]

In [18]:
scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
scores

array([-0.82130906, -1.07458421, -1.08801239, -1.33238677, -1.084763  ,
       -1.15878391, -1.6042084 , -1.03070862, -0.71202907, -0.84729854,
       -0.88561033, -1.52485216, -0.6332659 , -1.2003542 , -1.12141427])

## Pipeline

### Pipeline + KFold

In [19]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[('scaler', StandardScaler()), ('model', LinearRegression())])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y , scoring='neg_mean_squared_error', cv=cv)

In [20]:
pipeline

Pipeline(steps=[('scaler', StandardScaler()), ('model', LinearRegression())])

In [21]:
scores

array([-0.82130906, -1.07458421, -1.08801239, -1.33238677, -1.084763  ])
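
To make explicit what the Pipeline does inside cross_val_score, the sketch below standardizes each fold by hand (fitting the scaler on the training fold only) and compares the result with the Pipeline version. The data is synthetic, and the names ending in _demo are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with two features (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 50, size=(40, 2))
y_demo = 1.0 + 0.1 * X_demo[:, 0] - 0.05 * X_demo[:, 1] + rng.normal(0, 0.5, size=40)

cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual version: fit the scaler on each training fold, then transform both folds
manual_mse = []
for train_idx, test_idx in cv_demo.split(X_demo):
    scaler_demo = StandardScaler().fit(X_demo[train_idx])
    model_demo = LinearRegression().fit(scaler_demo.transform(X_demo[train_idx]), y_demo[train_idx])
    y_hat = model_demo.predict(scaler_demo.transform(X_demo[test_idx]))
    manual_mse.append(mean_squared_error(y_demo[test_idx], y_hat))

# Pipeline version: cross_val_score refits scaler + model on every training fold
pipe_demo = Pipeline(steps=[('scaler', StandardScaler()), ('model', LinearRegression())])
pipe_scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=cv_demo, scoring='neg_mean_squared_error')

# The two approaches should agree fold by fold (scores are negated MSE)
matches = np.allclose(manual_mse, -pipe_scores)
print(matches)
```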

### Without Pipeline

In [22]:
# Without Pipeline
# Standardization + linear regression
X_train, X_test, y_train, y_test =  train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression()
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
model.fit(X_train_scaled, y_train)
# Apply the scaler fitted on the training data to the test data before predicting
X_test_scaled = scaler.transform(X_test)
y_pred = model.predict(X_test_scaled)

In [23]:
y_pred

In [24]:
# With Pipeline
X_train, X_test, y_train, y_test =  train_test_split(X, y, test_size=0.3, random_state=0)
pipeline = Pipeline(steps=[('scaler', StandardScaler()), ('model', LinearRegression())])
pipeline.fit(X_train, y_train)
# Predict with the fitted pipeline, which scales X_test internally before the regression step
y_pred = pipeline.predict(X_test)

In [25]:
y_pred