<a href="https://colab.research.google.com/github/saiku122/AIJobcolle/blob/master/MachineLearning/python/P02S02_Regression-2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Regression-2: Ridge vs. OLS by holdout and cross-validation

In [1]:
!git clone https://github.com/saiku122/AIJobcolle.git

Cloning into 'AIJobcolle'...
remote: Enumerating objects: 67, done.
remote: Counting objects: 100% (67/67), done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 67 (delta 17), reused 41 (delta 8), pack-reused 0
Unpacking objects: 100% (67/67), done.


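Let's build an ordinary least squares (OLS) regression model and a ridge regression model, then compare both their predictive performance and their fitted coefficients.<br>We use the Boston Housing dataset.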
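For reference, the two models differ only by a penalty term. The notation below is introduced here for illustration and does not appear in the original notebook: $w$ is the coefficient vector, $b$ the intercept, and $\alpha \ge 0$ the penalty strength, corresponding to the `alpha` argument of scikit-learn's `Ridge`.

$$
\text{OLS:}\quad \min_{w,\,b}\ \sum_{i=1}^{n} \left(y_i - w^\top x_i - b\right)^2
\qquad\qquad
\text{Ridge:}\quad \min_{w,\,b}\ \sum_{i=1}^{n} \left(y_i - w^\top x_i - b\right)^2 + \alpha \lVert w \rVert_2^2
$$

Setting $\alpha = 0$ recovers OLS. The intercept $b$ is not penalized, and because the penalty depends on the scale of each feature, the features are standardized before fitting.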

In [2]:
cd /content/AIJobcolle/MachineLearning/python

/content/AIJobcolle/MachineLearning/python


In [3]:
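# import data for regression
import pandas as pd
from IPython.display import display
# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2
from sklearn.datasets import load_boston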

# set data by role
dataset = load_boston()
X = pd.DataFrame(dataset.data,
                 columns=dataset.feature_names)
y = pd.Series(dataset.target, name='y')

# check the shape
print('--------------------------------------------')
print('X shape: (%i,%i)' %X.shape)
print('--------------------------------------------')
display(X.join(y).head())

--------------------------------------------
X shape: (506,13)
--------------------------------------------


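| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |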


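Let's use a holdout split to check whether OLS or ridge is the better predictive model. On this dataset you will probably not see a large performance gap between the two. However, as the ridge `alpha` increases, you can see the sum of the absolute coefficient values decrease.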

In [4]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# split the data into training and test sets for the holdout evaluation
X_train,X_test,y_train,y_test = train_test_split(X,
                                                 y,
                                                 test_size=0.20,
                                                 random_state=1)
# make pipelines
pipelines = {
   'ols': Pipeline([('scl',StandardScaler()), ('est',LinearRegression())])
  ,'ridge1': Pipeline([('scl',StandardScaler()),('est',Ridge(alpha=1.0))])
  ,'ridge2': Pipeline([('scl',StandardScaler()),('est',Ridge(alpha=20.0))])
}

# build models
scores = {}
for pipe_name, est in pipelines.items():
    est.fit(X_train, y_train)
    scores[('train',pipe_name)]=r2_score(y_train, est.predict(X_train))
    scores[('test',pipe_name)]=r2_score(y_test, est.predict(X_test))

display(pd.Series(scores).unstack())
                                        
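# compare the sum of absolute regression coefficients
# (this illustrates the role of ridge's regularization term; it is not a measure of model performance)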
print('OLS coefficient total:%.6f'%(np.absolute(pipelines['ols'].named_steps['est'].coef_).sum()))
print('Ridge coefficient total:%.6f'%(np.absolute(pipelines['ridge1'].named_steps['est'].coef_).sum()))
print('Ridge coefficient total:%.6f'%(np.absolute(pipelines['ridge2'].named_steps['est'].coef_).sum()))

| | ols | ridge1 | ridge2 |
|---|---|---|---|
| test | 0.763417 | 0.763404 | 0.758157 |
| train | 0.729359 | 0.729336 | 0.725291 |


OLS coefficient total:22.063408
Ridge coefficient total:21.710242
Ridge coefficient total:18.013609
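
The shrinkage effect can also be reproduced without the Boston data. The following is a minimal sketch on synthetic data from `make_regression`, used here as an assumption so that the cell runs on any scikit-learn version. The alpha grid and the names `X_syn`, `y_syn`, and `l2_norms` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: this is NOT the Boston Housing data)
X_syn, y_syn = make_regression(n_samples=200, n_features=13,
                               noise=10.0, random_state=0)

# Fit ridge for increasing alpha and record the size of the coefficient vector
alphas = [0.1, 1.0, 20.0, 100.0]
l2_norms = []
for alpha in alphas:
    pipe = Pipeline([('scl', StandardScaler()), ('est', Ridge(alpha=alpha))])
    pipe.fit(X_syn, y_syn)
    coef = pipe.named_steps['est'].coef_
    l2_norms.append(np.linalg.norm(coef))
    print('alpha=%6.1f  sum|coef|=%.4f  ||coef||_2=%.4f'
          % (alpha, np.absolute(coef).sum(), l2_norms[-1]))
```

The L2 norm of the ridge coefficients shrinks as `alpha` grows, because that is exactly the quantity the penalty targets. The sum of absolute values, which the notebook prints, usually shrinks as well, but that is an empirical tendency rather than a guarantee.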


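Next, let's evaluate the models with cross-validation (k-fold). You will notice a spread in the scores that was not visible with holdout. When scores vary this much, the basic remedies are (1) revisiting how outliers in the training data are handled, (2) considering a simpler algorithm, and (3) collecting more samples. Here, however, the point to take away is that <b>cross-validation can reveal concerns about the robustness of model accuracy that a single holdout split may have missed</b>.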

In [5]:
from sklearn.model_selection import cross_val_score

# evaluate each pipeline with 5-fold cross-validation
for pipe_name, est in pipelines.items():
    cv_results = cross_val_score(est,
                                 X,
                                 y,
                                 cv=5,
                                 scoring='r2')
    print('----------')
    print('algorithm:', pipe_name)
    print('cv_results:', cv_results)
    print('avg +- std_dev', cv_results.mean(),'+-', cv_results.std())

----------
algorithm: ols
cv_results: [ 0.63919994  0.71386698  0.58702344  0.07923081 -0.25294154]
avg +- std_dev 0.35327592439588207 +- 0.376567839332623
----------
algorithm: ridge1
cv_results: [ 0.64344111  0.71648023  0.58788768  0.08218971 -0.23681375]
avg +- std_dev 0.3586369955712166 +- 0.3722111586754402
----------
algorithm: ridge2
cv_results: [ 0.69338962  0.74221138  0.59160501  0.12868283 -0.04192937]
avg +- std_dev 0.42279189511615056 +- 0.31818730396014044
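
One thing to keep in mind when reading these numbers: for a regressor, `cv=5` in `cross_val_score` uses `KFold` without shuffling, so each fold is a contiguous block of rows. If the rows are ordered in some way, the folds can look quite different from one another. The sketch below illustrates this on synthetic data that is deliberately sorted by the target. The data, the sorting step, and all variable names are assumptions made for illustration; whether row ordering explains the Boston results above is not established here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, deliberately sorted by target (assumption for illustration)
X_syn, y_syn = make_regression(n_samples=300, n_features=13,
                               noise=20.0, random_state=0)
order = np.argsort(y_syn)
X_sorted, y_sorted = X_syn[order], y_syn[order]

pipe = Pipeline([('scl', StandardScaler()), ('est', Ridge(alpha=1.0))])

# Contiguous folds (what cv=5 does for a regressor) vs. shuffled folds
cv_plain = KFold(n_splits=5)
cv_shuffled = KFold(n_splits=5, shuffle=True, random_state=0)

scores_plain = cross_val_score(pipe, X_sorted, y_sorted,
                               cv=cv_plain, scoring='r2')
scores_shuffled = cross_val_score(pipe, X_sorted, y_sorted,
                                  cv=cv_shuffled, scoring='r2')

print('contiguous folds:', scores_plain)
print('avg +- std_dev', scores_plain.mean(), '+-', scores_plain.std())
print('shuffled folds:  ', scores_shuffled)
print('avg +- std_dev', scores_shuffled.mean(), '+-', scores_shuffled.std())
```

Passing a `KFold` object instead of an integer is a small change, and comparing both variants helps separate instability caused by the model from instability caused by how the rows happen to be ordered.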


<b>[Try it yourself]</b> Change the random_state used in the holdout check from 1 to 0. Do you observe a spread in the scores?
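
One way to approach this exercise is to repeat the holdout split over several seeds and look at the spread of the test scores. The following is a minimal sketch on synthetic data from `make_regression` (an assumption, so that the cell runs without `load_boston`). The seed range and the names `X_syn`, `y_syn`, and `test_scores` are hypothetical; swap in the notebook's `X` and `y` to run it on the Boston data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: this is NOT the Boston Housing data)
X_syn, y_syn = make_regression(n_samples=506, n_features=13,
                               noise=30.0, random_state=0)

# Repeat the same 80/20 holdout with different seeds
test_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn,
                                              test_size=0.20,
                                              random_state=seed)
    pipe = Pipeline([('scl', StandardScaler()), ('est', LinearRegression())])
    pipe.fit(X_tr, y_tr)
    test_scores.append(r2_score(y_te, pipe.predict(X_te)))

test_scores = np.array(test_scores)
print('test R2 per seed:', np.round(test_scores, 4))
print('avg +- std_dev', test_scores.mean(), '+-', test_scores.std())
```

A single holdout score is one draw from this distribution, which is why a lone seed can hide how much the estimate moves.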