<a href="https://colab.research.google.com/github/tomonari-masada/course2023-sml/blob/main/07_linear_regression_1_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 7

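* Improve the predictive performance as measured by RMSE.
* Do not change the split between the test set and the rest of the data.
 * You are free to use the non-test portion however you like.
 * You may merge the training set and the validation set and run cross-validation.
* You may use ridge regression and the Lasso.
* You may use higher-order polynomial features (cf. `sklearn.preprocessing.PolynomialFeatures`).
* Evaluate RMSE on the test set only once, at the very end.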
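
For reference, the RMSE used as the evaluation metric can be sketched in a few lines of NumPy. The function name `rmse` and the toy values are illustrative assumptions, not part of the assignment code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally shaped arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values (made up for illustration, not taken from the housing data).
print(rmse([100.0, 200.0, 300.0], [110.0, 190.0, 300.0]))
```

Because the errors are squared before averaging, a few large misses on expensive districts can dominate the score.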

In [1]:
import numpy as np
from scipy import stats, special
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import PolynomialFeatures

%config InlineBackend.figure_format = 'retina'

In [2]:
import os
import tarfile
import urllib.request  # standard library; replaces the Python 2/3 shim from six

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
  os.makedirs(housing_path, exist_ok=True)
  tgz_path = os.path.join(housing_path, "housing.tgz")
  urllib.request.urlretrieve(housing_url, tgz_path)
  # Extract the archive; the context manager closes the file even on error.
  with tarfile.open(tgz_path) as housing_tgz:
    housing_tgz.extractall(path=housing_path)

In [3]:
fetch_housing_data()

In [4]:
def load_housing_data(housing_path=HOUSING_PATH):
  csv_path = os.path.join(housing_path, "housing.csv")
  return pd.read_csv(csv_path)

(You do not need to follow the details above this point.)

In [5]:
housing = load_housing_data()
housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [6]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


## 1) Convert `ocean_proximity` to 0/1 numeric data

* Use pandas' `get_dummies` to convert the values of the categorical variable `ocean_proximity` into 0/1 indicator columns.

In [7]:
housing_dummies = pd.get_dummies(housing['ocean_proximity'])

In [8]:
housing_dummies.head()

| | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 |


In [9]:
housing_num = housing.drop('ocean_proximity', axis=1)

In [10]:
housing = pd.concat([housing_num, housing_dummies], axis=1)

In [11]:
housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | 0 | 0 | 0 | 1 | 0 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | 0 | 0 | 0 | 1 | 0 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | 0 | 0 | 0 | 1 | 0 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | 0 | 0 | 0 | 1 | 0 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | 0 | 0 | 0 | 1 | 0 |


In [12]:
X = housing_num.drop('median_house_value', axis=1)
y = housing_num["median_house_value"].copy()
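
Note that the cell above builds `X` from `housing_num`, so the indicator columns created in this section are not among the features (the shapes printed later show 8 columns). If you want to try them as extra features, one possible approach is sketched below on a small made-up frame. The names `toy`, `toy_full`, `X_with_dummies`, and `y_toy` are hypothetical, and `dtype=int` is passed because recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Toy frame standing in for `housing`; values are invented for illustration.
toy = pd.DataFrame({
    "median_income": [8.3, 3.8, 5.6],
    "median_house_value": [452600.0, 342200.0, 341300.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# dtype=int forces 0/1 columns instead of booleans.
dummies = pd.get_dummies(toy["ocean_proximity"], dtype=int)
toy_full = pd.concat([toy.drop("ocean_proximity", axis=1), dummies], axis=1)

# Drop only the target so the indicator columns stay in the feature matrix.
X_with_dummies = toy_full.drop("median_house_value", axis=1)
y_toy = toy_full["median_house_value"].copy()
print(X_with_dummies.columns.tolist())
```

Whether the extra columns actually lower the cross-validated RMSE would have to be checked; this sketch only shows the mechanics.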

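## 2) Fill missing values in the test data with the training-data median
* Strictly speaking, the model should be evaluated on predictions for every test instance, so we fill in the missing entries instead of dropping those rows.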

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

In [14]:
print(X_train.shape, X_valid.shape, X_test.shape)

(12384, 8) (4128, 8) (4128, 8)


In [15]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12384 entries, 17244 to 8472
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           12384 non-null  float64
 1   latitude            12384 non-null  float64
 2   housing_median_age  12384 non-null  float64
 3   total_rooms         12384 non-null  float64
 4   total_bedrooms      12384 non-null  float64
 5   population          12384 non-null  float64
 6   households          12384 non-null  float64
 7   median_income       12384 non-null  float64
dtypes: float64(8)
memory usage: 870.8 KB


In [16]:
X_valid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4128 entries, 2071 to 788
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           4128 non-null   float64
 1   latitude            4128 non-null   float64
 2   housing_median_age  4128 non-null   float64
 3   total_rooms         4128 non-null   float64
 4   total_bedrooms      4128 non-null   float64
 5   population          4128 non-null   float64
 6   households          4128 non-null   float64
 7   median_income       4128 non-null   float64
dtypes: float64(8)
memory usage: 290.2 KB


In [17]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4128 entries, 20046 to 3665
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           4128 non-null   float64
 1   latitude            4128 non-null   float64
 2   housing_median_age  4128 non-null   float64
 3   total_rooms         4128 non-null   float64
 4   total_bedrooms      3921 non-null   float64
 5   population          4128 non-null   float64
 6   households          4128 non-null   float64
 7   median_income       4128 non-null   float64
dtypes: float64(8)
memory usage: 290.2 KB


* Also keep a version of the test set in which instances with missing values are simply dropped.

In [18]:
na_index = X_test.isna().any(axis=1)
X_test_original = X_test[~ na_index]
y_test_original = y_test[~ na_index]

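* Fill the missing entries with the median.
 * Only the test data has entries with a missing `total_bedrooms` value.
 * Here we fill them with the median of the training data.
 * Because the fill value is computed from the training data alone, no information leaks from the test set.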

In [19]:
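# Series.median() skips NaN by default, so no explicit mask is needed.
median_total_bedrooms = X_train.total_bedrooms.median()
X_test = X_test.copy()  # work on a copy, not on a slice of the original frame
X_test["total_bedrooms"] = X_test["total_bedrooms"].fillna(median_total_bedrooms)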
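
As an alternative sketch, the same "fit on training data, apply to test data" pattern can be written with scikit-learn's `SimpleImputer`. The toy frames `train` and `test` below are invented for illustration and are not the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (invented): the training part is complete, the test part has a gap.
train = pd.DataFrame({"total_bedrooms": [100.0, 200.0, 400.0],
                      "households": [90.0, 180.0, 350.0]})
test = pd.DataFrame({"total_bedrooms": [np.nan, 250.0],
                     "households": [120.0, 240.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)  # medians are computed from the training data only
test_filled = pd.DataFrame(imputer.transform(test),
                           columns=test.columns, index=test.index)
print(test_filled)
```

The imputer stores one median per column, so it would also cover missing values that show up in other columns later.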

* Confirm that no missing entries remain.

In [20]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4128 entries, 20046 to 3665
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           4128 non-null   float64
 1   latitude            4128 non-null   float64
 2   housing_median_age  4128 non-null   float64
 3   total_rooms         4128 non-null   float64
 4   total_bedrooms      4128 non-null   float64
 5   population          4128 non-null   float64
 6   households          4128 non-null   float64
 7   median_income       4128 non-null   float64
dtypes: float64(8)
memory usage: 290.2 KB


## 3) Take the logarithm of the target variable
* When evaluating RMSE, use np.exp() to map predictions back to the original scale.

In [21]:
y_train = np.log(y_train)
y_valid = np.log(y_valid)
y_test = np.log(y_test)
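
The log/exp round trip can also be delegated to scikit-learn's `TransformedTargetRegressor`, which applies `func` before fitting and `inverse_func` after predicting. This is a sketch on synthetic data (the arrays `X_toy` and `y_toy` are assumptions), not the approach used in the cells of this notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
# Synthetic positive target whose logarithm is linear in the features.
y_toy = np.exp(1.0 + X_toy @ np.array([0.5, -0.3, 0.2]) + 0.1 * rng.normal(size=200))

model = TransformedTargetRegressor(regressor=LinearRegression(),
                                   func=np.log, inverse_func=np.exp)
model.fit(X_toy, y_toy)
pred = model.predict(X_toy)  # already mapped back to the original scale
rmse_original_scale = np.sqrt(np.mean((y_toy - pred) ** 2))
print(rmse_original_scale)
```

Predictions obtained this way are always positive, which suits a price target.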

## 4) Merge the training and validation data into one set for cross-validation

In [22]:
X_train = pd.concat([X_train, X_valid])

In [23]:
print(X_train.shape)

(16512, 8)


In [24]:
y_train = pd.concat([y_train, y_valid])

In [25]:
print(y_train.shape)

(16512,)


* Use 10-fold cross-validation.

In [26]:
kf = KFold(n_splits=10, shuffle=True, random_state=123)
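
To see what this splitter does, the sketch below runs a `KFold` with the same settings over a dummy array with as many rows as the merged training set and checks that every row is held out exactly once. The names `kf_demo`, `fold_sizes`, and `seen` are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 16512  # size of the merged training set printed above
kf_demo = KFold(n_splits=10, shuffle=True, random_state=123)

fold_sizes = []
seen = np.zeros(n_samples, dtype=int)
for train_index, valid_index in kf_demo.split(np.zeros((n_samples, 1))):
    fold_sizes.append(len(valid_index))
    seen[valid_index] += 1  # count how often each row is held out

print(fold_sizes)
print(seen.min(), seen.max())
```

Fixing `random_state` keeps the folds identical across all the experiments that follow, so their RMSE values can be compared fold by fold.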

## 5) Engineer the features
* Use `sklearn.preprocessing.PolynomialFeatures`.

### 5-1) For comparison, first run cross-validation on the original features

* Linear regression without regularization

In [27]:
reg = LinearRegression()
rmses = []
print(f'\tRMSE:', end=' ')
for train_index, valid_index in kf.split(X_train):
  reg.fit(X_train.values[train_index], y_train.values[train_index])
  y_valid_pred = reg.predict(X_train.values[valid_index])
  y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
  rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
  rmses.append(rmse)
  print(f'{rmse:.3f}', end=', ')
print()
rmses = np.array(rmses)
print(f'mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 65608.163, 66978.500, 69715.565, 68205.479, 67808.092, 65811.269, 70637.762, 68987.730, 68590.921, 72062.025, 
mean RMSE: 68440.6 (1934.4)
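
The same evaluation loop is repeated for every model below, so it may help to wrap it in a helper. The following is a minimal sketch under some assumptions: the function name `cv_rmse` is hypothetical, the data are synthetic, and the RMSE is computed as `np.sqrt(mean_squared_error(...))` to avoid relying on the `squared` argument, which newer scikit-learn releases deprecate.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def cv_rmse(model, X, y_log, kf):
    """Return per-fold RMSE on the original scale for a model fit on log targets."""
    rmses = []
    for train_index, valid_index in kf.split(X):
        fitted = clone(model).fit(X[train_index], y_log[train_index])
        pred = fitted.predict(X[valid_index])
        # Cap predictions at the largest log target seen in the training fold.
        pred = np.minimum(pred, y_log[train_index].max())
        mse = mean_squared_error(np.exp(y_log[valid_index]), np.exp(pred))
        rmses.append(np.sqrt(mse))
    return np.array(rmses)

# Synthetic stand-in data (not the housing data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = 12.0 + X_demo @ np.array([0.3, -0.2, 0.1, 0.0]) + 0.1 * rng.normal(size=500)

scores = cv_rmse(LinearRegression(), X_demo, y_demo,
                 KFold(n_splits=10, shuffle=True, random_state=123))
print(f"mean RMSE: {scores.mean():.1f} ({scores.std():.1f})")
```

The cap mirrors the clipping line in the cell above: since the model predicts log prices, one extreme prediction would otherwise be blown up by `np.exp` and dominate the RMSE.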


* Ridge regression

In [28]:
for alpha in 10.0 ** np.arange(-3, 4):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print(f'\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    reg.fit(X_train.values[train_index], y_train.values[train_index])
    y_valid_pred = reg.predict(X_train.values[valid_index])
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 65608.164, 66978.501, 69715.566, 68205.480, 67808.093, 65811.270, 70637.763, 68987.730, 68590.922, 72062.026, 
alpha=1.0e-03 | mean RMSE: 68440.6 (1934.4)
	RMSE: 65608.166, 66978.510, 69715.570, 68205.489, 67808.098, 65811.278, 70637.770, 68987.732, 68590.926, 72062.036, 
alpha=1.0e-02 | mean RMSE: 68440.6 (1934.4)
	RMSE: 65608.189, 66978.599, 69715.615, 68205.579, 67808.144, 65811.358, 70637.838, 68987.746, 68590.974, 72062.134, 
alpha=1.0e-01 | mean RMSE: 68440.6 (1934.4)
	RMSE: 65608.422, 66979.484, 69716.066, 68206.484, 67808.611, 65812.160, 70638.523, 68987.890, 68591.446, 72063.121, 
alpha=1.0e+00 | mean RMSE: 68441.2 (1934.5)
	RMSE: 65610.799, 66988.364, 69720.624, 68215.566, 67813.329, 65820.214, 70645.408, 68989.388, 68596.210, 72073.014, 
alpha=1.0e+01 | mean RMSE: 68447.3 (1935.1)
	RMSE: 65639.436, 67079.334, 69770.863, 68309.136, 67864.750, 65903.738, 70717.067, 69010.118, 68647.349, 72173.951, 
alpha=1.0e+02 | mean RMSE: 68511.6 (1941.4)
	RMSE: 66217.180, 68085.433,

* Lasso

In [29]:
for alpha in 10.0 ** np.arange(-3, 4):
  reg = Lasso(alpha=alpha, random_state=42)
  rmses = []
  print(f'\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    reg.fit(X_train.values[train_index], y_train.values[train_index])
    y_valid_pred = reg.predict(X_train.values[valid_index])
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 65631.688, 67038.294, 69749.911, 68266.233, 67847.201, 65866.671, 70689.390, 69007.426, 68630.179, 72128.076, 
alpha=1.0e-03 | mean RMSE: 68485.5 (1938.3)
	RMSE: 65984.348, 67692.472, 70214.915, 68947.756, 68331.510, 66491.140, 71279.027, 69336.603, 69103.749, 72837.479, 
alpha=1.0e-02 | mean RMSE: 69021.9 (1976.4)
	RMSE: 75413.574, 78164.504, 80543.191, 80162.791, 78271.640, 76982.541, 81690.593, 78894.276, 78800.813, 83208.019, 
alpha=1.0e-01 | mean RMSE: 79213.2 (2154.9)
	RMSE: 108716.987, 109399.156, 111990.575, 109889.650, 109304.745, 109580.356, 114760.454, 111404.495, 113389.669, 112926.999, 
alpha=1.0e+00 | mean RMSE: 111136.3 (1960.1)
	RMSE: 112440.004, 112442.754, 115580.547, 113506.066, 113028.271, 113399.826, 118308.972, 114867.657, 116259.668, 116376.554, 
alpha=1.0e+01 | mean RMSE: 114621.0 (1874.1)
	RMSE: 117163.152, 116156.651, 119897.336, 117538.872, 117174.950, 117766.269, 121569.332, 118084.545, 120584.168, 120216.743, 
alpha=1.0e+02 | mean RMSE: 118615.2 (170

### 5-2) Add second-order terms

* Linear regression without regularization

In [30]:
poly = PolynomialFeatures(2)
scaler = MinMaxScaler()

reg = LinearRegression()
rmses = []
print(f'\tRMSE:', end=' ')
for train_index, valid_index in kf.split(X_train):
  # Scale first, then expand to degree-2 polynomial features (both fit on the training fold only).
  X_train_transformed = poly.fit_transform(scaler.fit_transform(X_train.values[train_index]))
  X_valid_transformed = poly.transform(scaler.transform(X_train.values[valid_index]))
  reg.fit(X_train_transformed, y_train.values[train_index])
  y_valid_pred = reg.predict(X_valid_transformed)
  y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
  rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
  rmses.append(rmse)
  print(f'{rmse:.1f}', end=', ')
print()
rmses = np.array(rmses)
print(f'mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 61125.6, 61104.0, 64637.0, 62001.0, 61899.9, 60897.4, 63691.5, 64874.6, 62961.3, 67686.2, 
mean RMSE: 63087.9 (2056.5)


In [31]:
poly = PolynomialFeatures(2)
scaler = StandardScaler()

reg = LinearRegression()
rmses = []
print(f'\tRMSE:', end=' ')
for train_index, valid_index in kf.split(X_train):
  # Scale first, then expand to degree-2 polynomial features (both fit on the training fold only).
  X_train_transformed = poly.fit_transform(scaler.fit_transform(X_train.values[train_index]))
  X_valid_transformed = poly.transform(scaler.transform(X_train.values[valid_index]))
  reg.fit(X_train_transformed, y_train.values[train_index])
  y_valid_pred = reg.predict(X_valid_transformed)
  y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
  rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
  rmses.append(rmse)
  print(f'{rmse:.1f}', end=', ')
print()
rmses = np.array(rmses)
print(f'mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 61125.6, 61104.0, 64637.0, 62001.0, 61899.9, 60897.4, 63691.5, 64874.6, 62961.3, 67686.2, 
mean RMSE: 63087.9 (2056.5)
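
The two runs above report the same RMSE values. With unregularized least squares this is what one would expect: an affine rescaling of the inputs does not change the space spanned by the polynomial features, so the choice of scaler should not change the predictions beyond numerical error. It starts to matter once a Ridge or Lasso penalty is added. The sketch below checks this on synthetic data, using `make_pipeline` to chain the steps; all names and values are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 10.0, size=(300, 3))
y_demo = (1.0 + X_demo[:, 0] * X_demo[:, 1]
          - 0.5 * X_demo[:, 2] ** 2 + rng.normal(size=300))

preds = {}
for name, scaler in [("minmax", MinMaxScaler()), ("standard", StandardScaler())]:
    # The pipeline fits the scaler and the polynomial expansion on the training rows only.
    pipe = make_pipeline(scaler, PolynomialFeatures(2), LinearRegression())
    pipe.fit(X_demo[:200], y_demo[:200])
    preds[name] = pipe.predict(X_demo[200:])

# Largest difference between the two sets of predictions.
print(np.max(np.abs(preds["minmax"] - preds["standard"])))
```

A pipeline also removes the need to call `fit_transform` and `transform` by hand inside the cross-validation loop.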


* Ridge regression

In [32]:
poly = PolynomialFeatures(2)
scaler = MinMaxScaler()

for alpha in 10.0 ** np.arange(-5, 3):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = poly.fit_transform(scaler.fit_transform(X_train.values[train_index]))
    X_valid_transformed = poly.transform(scaler.transform(X_train.values[valid_index]))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 61123.179, 61103.621, 64636.080, 62001.003, 61895.664, 60898.035, 63688.701, 64874.509, 62961.210, 67683.144, 
alpha=1.0e-05 | mean RMSE: 63086.5 (2056.2)
	RMSE: 61106.397, 61095.913, 64630.337, 62000.309, 61879.934, 60903.543, 63667.922, 64871.920, 62961.508, 67659.564, 
alpha=1.0e-04 | mean RMSE: 63077.7 (2052.3)
	RMSE: 61161.037, 61041.098, 64645.646, 62006.311, 61942.652, 60999.582, 63601.209, 64862.491, 63005.680, 67591.003, 
alpha=1.0e-03 | mean RMSE: 63085.7 (2021.6)
	RMSE: 61343.325, 61109.118, 64913.394, 62147.295, 62223.085, 61448.934, 63833.555, 64939.205, 63229.557, 67774.665, 
alpha=1.0e-02 | mean RMSE: 63296.2 (2003.6)
	RMSE: 61510.220, 61414.067, 65225.228, 62614.221, 62540.795, 61713.701, 64256.179, 64973.618, 63434.981, 68159.643, 
alpha=1.0e-01 | mean RMSE: 63584.3 (2014.2)
	RMSE: 62418.326, 63023.018, 66148.650, 64371.589, 64194.165, 62260.602, 66351.360, 66124.407, 64748.228, 69619.107, 
alpha=1.0e+00 | mean RMSE: 64925.9 (2118.4)
	RMSE: 65959.191, 67205.235,

In [33]:
poly = PolynomialFeatures(2)
scaler = StandardScaler()

for alpha in 10.0 ** np.arange(-5, 3):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = poly.fit_transform(scaler.fit_transform(X_train.values[train_index]))
    X_valid_transformed = poly.transform(scaler.transform(X_train.values[valid_index]))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 61125.617, 61103.974, 64637.025, 62001.048, 61899.874, 60897.374, 63691.539, 64874.564, 62961.342, 67686.151, 
alpha=1.0e-05 | mean RMSE: 63087.9 (2056.5)
	RMSE: 61125.616, 61103.973, 64637.024, 62001.047, 61899.874, 60897.373, 63691.538, 64874.563, 62961.341, 67686.153, 
alpha=1.0e-04 | mean RMSE: 63087.9 (2056.5)
	RMSE: 61125.602, 61103.970, 64637.014, 62001.040, 61899.869, 60897.368, 63691.534, 64874.550, 62961.336, 67686.165, 
alpha=1.0e-03 | mean RMSE: 63087.8 (2056.6)
	RMSE: 61125.461, 61103.941, 64636.915, 62000.972, 61899.825, 60897.320, 63691.490, 64874.419, 62961.286, 67686.291, 
alpha=1.0e-02 | mean RMSE: 63087.8 (2056.6)
	RMSE: 61124.061, 61103.651, 64635.923, 62000.294, 61899.391, 60896.838, 63691.061, 64873.119, 62960.787, 67687.543, 
alpha=1.0e-01 | mean RMSE: 63087.3 (2056.9)
	RMSE: 61111.116, 61101.005, 64626.569, 61994.413, 61896.093, 60892.521, 63687.759, 64860.642, 62956.172, 67699.918, 
alpha=1.0e+00 | mean RMSE: 63082.6 (2060.3)
	RMSE: 61027.953, 61106.258,

* Lasso

In [34]:
poly = PolynomialFeatures(2)
scaler = MinMaxScaler()

for alpha in 10.0 ** np.arange(-5, 3):
  reg = Lasso(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = poly.fit_transform(scaler.fit_transform(X_train.values[train_index]))
    X_valid_transformed = poly.transform(scaler.transform(X_train.values[valid_index]))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

(Warning residue pointing at `cd_fast.enet_coordinate_descent`, interleaved with the output for alpha=1.0e-05 and alpha=1.0e-04, is omitted.)
	RMSE: 61351.024, 61243.742, 65091.308, 62468.424, 62229.912, 61776.709, 63819.362, 64615.782, 63153.479, 68030.463, 
alpha=1.0e-05 | mean RMSE: 63378.0 (1993.3)
	RMSE: 62423.456, 63037.277, 66081.138, 64506.486, 63971.713, 62348.125, 66436.466, 66003.785, 64542.847, 69596.810, 
alpha=1.0e-04 | mean RMSE: 64894.8 (2104.6)
	RMSE: 68073.671, 69414.051, 71612.693, 70767.017, 70896.917, 68997.703, 73539.052, 71843.655, 72297.279, 75383.823, 
alpha=1.0e-03 | mean RMSE: 71282.6 (2068.5)
	RMSE: 81886.429, 84240.211, 86595.716, 87090.065, 85413.077, 84620.420, 88776.921, 85195.664, 87042.031, 89417.435, 
alpha=1.0e-02 | mean RMSE: 86027.8 (2119.5)
	RMSE: 118129.692, 117029.965, 120651.582, 118474.958, 118086.381, 118395.189, 122211.901, 118638.001, 121252.012, 120866.908, 
alpha=1.0e-01 | mean RMSE: 119373.7 (1626.4)
	RMSE: 118129.692, 117029.965, 120651.582, 118474.958, 118086.381, 118395.189, 122211.901, 118638.001, 121252.012, 120866.908, 
alpha=1.0e+00 | mean RMSE: 119373.7 (1626.4)
	RMSE: 118129.692, 117029.965, 120651.582, 118474.958, 118086.381, 118395.189, 122211.901, 118638.001, 121252.012, 120866.908, 
alpha=1.0e+01 | mean RMSE: 119373.7 (1626.4)
	RMSE: 118129.692, 117029.965, 120651.582, 118474.958, 118086.381, 1
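
The `cd_fast.enet_coordinate_descent` fragments in this Lasso output look like the tail of scikit-learn's `ConvergenceWarning`, which is raised when coordinate descent stops before reaching its tolerance. Assuming that is the cause, one common remedy is to raise `max_iter` (or loosen `tol`). The sketch below counts such warnings on synthetic data for a small and a large `max_iter`; the helper name `count_convergence_warnings` and the data are hypothetical.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] * X_demo[:, 2] + 0.1 * rng.normal(size=300)

def count_convergence_warnings(max_iter):
    """Fit the pipeline and count ConvergenceWarning messages raised during fit."""
    pipe = make_pipeline(StandardScaler(), PolynomialFeatures(2),
                         Lasso(alpha=1e-5, max_iter=max_iter, random_state=42))
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        pipe.fit(X_demo, y_demo)
    return sum(issubclass(w.category, ConvergenceWarning) for w in caught)

for max_iter in (10, 100_000):
    print(max_iter, count_convergence_warnings(max_iter))
```

A fit that has not converged can still give a usable RMSE, but comparisons across alpha values are more trustworthy once the warnings are gone.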

In [35]:
poly = PolynomialFeatures(2)
scaler = StandardScaler()

for alpha in 10.0 ** np.arange(-5, 3):
  reg = Lasso(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = poly.fit_transform(scaler.fit_transform(X_train.values[train_index]))
    X_valid_transformed = poly.transform(scaler.transform(X_train.values[valid_index]))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

(Warning residue pointing at `cd_fast.enet_coordinate_descent`, interleaved with the output for alpha=1.0e-05, 1.0e-04, and 1.0e-03, is omitted.)
	RMSE: 61149.238, 61087.658, 64620.582, 62032.734, 61940.662, 60991.410, 63663.205, 64828.104, 62943.041, 67727.127, 
alpha=1.0e-05 | mean RMSE: 63098.4 (2045.4)
	RMSE: 61127.722, 61081.062, 64579.140, 62013.005, 61977.310, 60965.263, 63683.679, 64795.892, 62931.675, 67769.607, 
alpha=1.0e-04 | mean RMSE: 63092.4 (2054.3)
	RMSE: 61292.519, 61452.058, 64617.017, 62431.676, 62403.302, 61207.144, 64149.628, 64865.373, 63236.756, 68041.477, 
alpha=1.0e-03 | mean RMSE: 63369.7 (2018.1)
	RMSE: 65692.489, 66915.507, 69362.813, 68156.175, 67928.309, 66383.691, 70702.277, 69487.507, 69055.314, 73088.941, 
alpha=1.0e-02 | mean RMSE: 68677.3 (2072.2)
	RMSE: 85677.779, 88094.747, 90799.679, 90722.494, 89320.846, 88436.683, 92601.370, 89040.112, 91311.153, 92976.113, 
alpha=1.0e-01 | mean RMSE: 89898.1 (2110.4)
	RMSE: 118129.692, 117029.965, 120651.582, 118474.958, 118086.381, 118395.189, 122211.901, 118638.001, 121252.012, 120866.908, 
alpha=1.0e+00 | mean RMSE: 119373.7 (1626.4)
	RMSE: 118129.692, 117029.965, 120651.582, 118474.958, 118086.381, 118395.189, 122211.901, 118638.001, 121252.012, 120866.908, 
alpha=1.0e+01 | mean RMSE: 119373.7 (1626.4)
	RMSE: 118129.692, 117029.965, 120651.582, 118474.958, 118086.381, 118395.189, 122211.901, 118638.001, 121252.012, 120866.908, 
alpha=1.0e+02 | mean RMSE: 119373.7 (1626.4)
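
The identical RMSE rows for alpha >= 1.0 above are consistent with the Lasso having shrunk every coefficient to zero, in which case each fold simply predicts the training-fold mean of the log target. The sketch below illustrates that behaviour on synthetic data; it is not a re-run of the housing experiment.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 6))
y_demo = (12.0 + X_demo @ np.array([0.4, -0.3, 0.2, 0.0, 0.0, 0.1])
          + 0.1 * rng.normal(size=400))

for alpha in (1e-3, 1e-1, 1e0, 1e1):
    model = Lasso(alpha=alpha, random_state=42).fit(X_demo, y_demo)
    n_nonzero = int(np.sum(model.coef_ != 0))
    # With every coefficient at zero, predictions collapse to the intercept.
    print(f"alpha={alpha:.0e}  nonzero coefficients: {n_nonzero}  "
          f"intercept: {model.intercept_:.3f}")
```

Once the model has collapsed to the intercept, increasing alpha further cannot change the predictions, which is why the search range can stop there.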


### 5-3) Add terms up to third order

* Linear regression without regularization

In [36]:
poly = PolynomialFeatures(3)
scaler = MinMaxScaler()

reg = LinearRegression()
rmses = []
print(f'\tRMSE:', end=' ')
for train_index, valid_index in kf.split(X_train):
  X_train_transformed = poly.fit_transform(scaler.fit_transform(X_train.values[train_index]))
  X_valid_transformed = poly.transform(scaler.transform(X_train.values[valid_index]))
  reg.fit(X_train_transformed, y_train.values[train_index])
  y_valid_pred = reg.predict(X_valid_transformed)
  y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
  rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
  rmses.append(rmse)
  print(f'{rmse:.1f}', end=', ')
print()
rmses = np.array(rmses)
print(f'mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 58616.0, 58101.6, 61680.5, 59146.7, 59369.7, 59643.7, 61798.1, 61453.8, 60188.7, 64196.5, 
mean RMSE: 60419.5 (1755.8)


In [37]:
poly = PolynomialFeatures(3)
scaler = StandardScaler()

reg = LinearRegression()
rmses = []
print(f'\tRMSE:', end=' ')
for train_index, valid_index in kf.split(X_train):
  X_train_transformed = poly.fit_transform(scaler.fit_transform(X_train.values[train_index]))
  X_valid_transformed = poly.transform(scaler.transform(X_train.values[valid_index]))
  reg.fit(X_train_transformed, y_train.values[train_index])
  y_valid_pred = reg.predict(X_valid_transformed)
  y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
  rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
  rmses.append(rmse)
  print(f'{rmse:.1f}', end=', ')
print()
rmses = np.array(rmses)
print(f'mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 58616.0, 58101.6, 61680.5, 59146.7, 59369.7, 59643.7, 61798.1, 61453.8, 60188.7, 64196.5, 
mean RMSE: 60419.5 (1755.8)


* Ridge regression

In [38]:
poly = PolynomialFeatures(3)
scaler = MinMaxScaler()

for alpha in 10.0 ** np.arange(-5, 3):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = scaler.fit_transform(poly.fit_transform(X_train.values[train_index]))
    X_valid_transformed = scaler.transform(poly.transform(X_train.values[valid_index]))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 59532.065, 58865.160, 62471.071, 60256.365, 59018.453, 60765.470, 62397.356, 61888.065, 61619.768, 64611.699, 
alpha=1.0e-05 | mean RMSE: 61142.5 (1714.0)
	RMSE: 59645.567, 59104.297, 62650.626, 60467.657, 58982.816, 60776.882, 62688.825, 62552.942, 61860.911, 65268.682, 
alpha=1.0e-04 | mean RMSE: 61399.9 (1875.7)
	RMSE: 59838.113, 59409.918, 62625.333, 60647.908, 59238.610, 60295.788, 62540.747, 62997.801, 61940.270, 65773.555, 
alpha=1.0e-03 | mean RMSE: 61530.8 (1935.5)
	RMSE: 60046.080, 59758.318, 62563.598, 60429.588, 59668.161, 59514.543, 62673.945, 63302.338, 62099.744, 66005.032, 
alpha=1.0e-02 | mean RMSE: 61606.1 (1997.3)
	RMSE: 60689.454, 60411.116, 63302.127, 61080.790, 60621.113, 60294.142, 63065.365, 63650.828, 62259.676, 66882.425, 
alpha=1.0e-01 | mean RMSE: 62225.7 (1971.0)
	RMSE: 61760.816, 61444.851, 64644.982, 62541.139, 62203.131, 62005.609, 64120.822, 64707.994, 63399.388, 68140.553, 
alpha=1.0e+00 | mean RMSE: 63496.9 (1917.9)
	RMSE: 62976.721, 63290.475,

In [39]:
poly = PolynomialFeatures(3)
scaler = StandardScaler()

for alpha in 10.0 ** np.arange(-5, 3):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = scaler.fit_transform(poly.fit_transform(X_train.values[train_index]))
    X_valid_transformed = scaler.transform(poly.transform(X_train.values[valid_index]))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 59117.665, 58718.155, 62014.510, 59573.106, 59523.598, 59803.955, 62488.216, 61052.259, 60396.813, 64090.581, 
alpha=1.0e-05 | mean RMSE: 60677.9 (1628.5)
	RMSE: 59391.899, 58944.179, 62294.941, 59731.825, 59652.555, 60209.216, 62744.942, 61373.098, 60678.796, 64364.885, 
alpha=1.0e-04 | mean RMSE: 60938.6 (1646.5)
	RMSE: 59542.290, 59032.466, 62516.103, 59520.869, 59474.496, 60165.392, 62563.321, 61585.994, 60969.123, 64417.744, 
alpha=1.0e-03 | mean RMSE: 60978.8 (1672.5)
	RMSE: 59597.253, 59014.167, 62519.356, 59538.009, 59132.972, 60313.693, 62268.157, 61690.047, 61720.942, 64503.774, 
alpha=1.0e-02 | mean RMSE: 61029.8 (1708.4)
	RMSE: 59613.446, 59260.989, 62402.534, 60317.668, 58926.838, 60469.550, 62229.223, 62458.996, 61763.268, 64949.746, 
alpha=1.0e-01 | mean RMSE: 61239.2 (1766.6)
	RMSE: 59670.081, 59547.571, 62266.018, 60633.873, 59276.177, 59921.802, 62296.073, 62794.014, 61808.355, 65694.933, 
alpha=1.0e+00 | mean RMSE: 61390.9 (1892.5)
	RMSE: 60068.227, 60051.473,
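
Rather than reading the best alpha off the printed rows, the mean RMSE per alpha can be collected in a dictionary and the minimum picked programmatically. This is a sketch on synthetic data; `mean_rmse`, `best_alpha`, and the data are assumptions, and the pipeline applies the scaler after the polynomial expansion, as in the two Ridge cells above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
y_demo = 12.0 + 0.3 * X_demo[:, 0] - 0.2 * X_demo[:, 1] ** 2 + 0.1 * rng.normal(size=400)

kf_demo = KFold(n_splits=10, shuffle=True, random_state=123)
mean_rmse = {}
for alpha in 10.0 ** np.arange(-5, 3):
    fold_rmses = []
    for train_index, valid_index in kf_demo.split(X_demo):
        pipe = make_pipeline(PolynomialFeatures(3), StandardScaler(), Ridge(alpha=alpha))
        pipe.fit(X_demo[train_index], y_demo[train_index])
        # Cap log-scale predictions at the training-fold maximum before exponentiating.
        pred = np.minimum(pipe.predict(X_demo[valid_index]), y_demo[train_index].max())
        err = np.exp(y_demo[valid_index]) - np.exp(pred)
        fold_rmses.append(np.sqrt(np.mean(err ** 2)))
    mean_rmse[alpha] = float(np.mean(fold_rmses))

best_alpha = min(mean_rmse, key=mean_rmse.get)
print(f"best alpha: {best_alpha:.1e}, mean RMSE: {mean_rmse[best_alpha]:.1f}")
```

When the differences between neighbouring alpha values are smaller than the fold-to-fold standard deviation, the choice among them is largely arbitrary.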

### 5-4) Add terms up to fourth order

In [40]:
poly = PolynomialFeatures(4)
scaler = StandardScaler()

reg = LinearRegression()
rmses = []
print(f'\tRMSE:', end=' ')
for train_index, valid_index in kf.split(X_train):
  X_train_transformed = poly.fit_transform(scaler.fit_transform(X_train.values[train_index]))
  X_valid_transformed = poly.transform(scaler.transform(X_train.values[valid_index]))
  reg.fit(X_train_transformed, y_train.values[train_index])
  y_valid_pred = reg.predict(X_valid_transformed)
  y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
  rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
  rmses.append(rmse)
  print(f'{rmse:.1f}', end=', ')
print()
rmses = np.array(rmses)
print(f'mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 55876.8, 56330.9, 58727.0, 59546.0, 55853.2, 55936.6, 59185.1, 58817.3, 57557.6, 62954.0, 
mean RMSE: 58078.5 (2140.7)


In [41]:
poly = PolynomialFeatures(4)
scaler = MinMaxScaler()

reg = LinearRegression()
rmses = []
print(f'\tRMSE:', end=' ')
for train_index, valid_index in kf.split(X_train):
  X_train_transformed = poly.fit_transform(scaler.fit_transform(X_train.values[train_index]))
  X_valid_transformed = poly.transform(scaler.transform(X_train.values[valid_index]))
  reg.fit(X_train_transformed, y_train.values[train_index])
  y_valid_pred = reg.predict(X_valid_transformed)
  y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
  rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
  rmses.append(rmse)
  print(f'{rmse:.1f}', end=', ')
print()
rmses = np.array(rmses)
print(f'mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 55876.8, 56330.9, 58727.0, 59546.0, 55853.2, 55936.6, 59185.1, 58817.3, 57557.6, 62954.0, 
mean RMSE: 58078.5 (2140.7)
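
In [40] and In [41] print the same RMSE values. For unregularized least squares this is expected: `StandardScaler` and `MinMaxScaler` both apply a per-feature affine map, and polynomials of degree at most 4 in the rescaled features span the same function space either way, so the fitted predictions coincide up to rounding. The sketch below checks this on synthetic data (the data and the `*_demo` names are assumptions for illustration only).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(200, 2))  # synthetic inputs (assumption)
y_demo = X_demo[:, 0] ** 2 - X_demo[:, 1] + rng.normal(scale=0.1, size=200)
X_fit, y_fit, X_new = X_demo[:150], y_demo[:150], X_demo[150:]

preds = []
for scaler_demo in (StandardScaler(), MinMaxScaler()):
    poly_demo = PolynomialFeatures(4)
    # Fit the transformers on the fitting part only, as in the cells above.
    Z_fit = poly_demo.fit_transform(scaler_demo.fit_transform(X_fit))
    Z_new = poly_demo.transform(scaler_demo.transform(X_new))
    preds.append(LinearRegression().fit(Z_fit, y_fit).predict(Z_new))

# Largest difference between the two sets of predictions.
max_gap = np.max(np.abs(preds[0] - preds[1]))
print(f'largest prediction gap: {max_gap:.2e}')
```

The equivalence does not carry over to Ridge, because the penalty acts on the coefficients and those depend on how the features are scaled. This is consistent with In [42] and In [43] giving different results.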


In [42]:
poly = PolynomialFeatures(4)
scaler = MinMaxScaler()

for alpha in 10.0 ** np.arange(-5, 3):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = poly.fit_transform(scaler.fit_transform(X_train.values[train_index]))
    X_valid_transformed = poly.transform(scaler.transform(X_train.values[valid_index]))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 55051.920, 56262.811, 58281.415, 57116.744, 55403.682, 53933.122, 58194.477, 57039.603, 57643.146, 61677.355, 
alpha=1.0e-05 | mean RMSE: 57060.4 (2040.4)
	RMSE: 55715.914, 55797.550, 58913.201, 57045.222, 55898.646, 55574.830, 58585.765, 57750.130, 58336.731, 62708.484, 
alpha=1.0e-04 | mean RMSE: 57632.6 (2084.5)
	RMSE: 56999.180, 56437.266, 60364.940, 58140.998, 56935.117, 56790.094, 59781.972, 59478.392, 59211.113, 63519.593, 
alpha=1.0e-03 | mean RMSE: 58765.9 (2075.6)
	RMSE: 58345.010, 57937.717, 61848.788, 58977.859, 58097.480, 57896.421, 60760.494, 61139.148, 59871.613, 64738.130, 
alpha=1.0e-02 | mean RMSE: 59961.3 (2093.5)
	RMSE: 59936.429, 59378.452, 63168.396, 60773.898, 59745.549, 59296.852, 62528.873, 62870.715, 61225.315, 66092.577, 
alpha=1.0e-01 | mean RMSE: 61501.7 (2056.5)
	RMSE: 61678.687, 61583.069, 64935.285, 62968.502, 62156.669, 60784.585, 65172.866, 64729.129, 63091.911, 68320.369, 
alpha=1.0e+00 | mean RMSE: 63542.1 (2149.6)
	RMSE: 64350.654, 65238.914,

In [43]:
poly = PolynomialFeatures(4)
scaler = StandardScaler()

for alpha in 10.0 ** np.arange(-5, 3):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = poly.fit_transform(scaler.fit_transform(X_train.values[train_index]))
    X_valid_transformed = poly.transform(scaler.transform(X_train.values[valid_index]))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 55876.674, 56331.009, 58726.328, 59544.916, 55851.584, 55936.274, 59184.718, 58817.491, 57557.531, 62954.568, 
alpha=1.0e-05 | mean RMSE: 58078.1 (2141.0)
	RMSE: 55874.967, 56332.223, 58721.217, 59536.332, 55834.896, 55934.886, 59182.504, 58816.534, 57557.026, 62961.244, 
alpha=1.0e-04 | mean RMSE: 58075.2 (2143.6)
	RMSE: 55859.816, 56344.671, 58677.512, 59455.113, 55722.110, 55919.196, 59161.182, 58807.013, 57552.529, 63196.537, 
alpha=1.0e-03 | mean RMSE: 58069.6 (2203.9)
	RMSE: 55818.991, 56435.730, 58553.368, 58732.692, 55576.423, 55779.364, 59024.082, 58743.939, 57522.095, 63475.448, 
alpha=1.0e-02 | mean RMSE: 57966.2 (2250.6)
	RMSE: 55771.089, 56582.584, 58596.845, 58436.922, 55270.253, 55380.979, 58769.491, 58494.423, 57427.980, 63441.498, 
alpha=1.0e-01 | mean RMSE: 57817.2 (2286.4)
	RMSE: 55849.602, 56459.313, 58749.613, 57429.417, 55480.720, 54654.070, 58256.102, 57573.162, 57288.001, 63586.150, 
alpha=1.0e+00 | mean RMSE: 57532.6 (2349.5)
	RMSE: 56070.765, 55924.006, 58341.770, 57509.184, 55846.861, 54347.018, 58463.492, 58017.531, 57489.109, 63366.634, 
alpha=1.0e+01 | mean RMSE: 57537.6 (2313.2)
	RMSE: 56639.376, 57408.631, 59857.821, 59804.426, 58362.549, 55923.368, 60103.895, 59324.314, 59454.629, 64027.868, 
alpha=1.0e+02 | 

In [44]:
poly = PolynomialFeatures(4)
scaler = StandardScaler()
scaler2 = StandardScaler()

for alpha in 10.0 ** np.arange(-5, 3):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = scaler2.fit_transform(poly.fit_transform(scaler.fit_transform(X_train.values[train_index])))
    X_valid_transformed = scaler2.transform(poly.transform(scaler.transform(X_train.values[valid_index])))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 55965.851, 56535.177, 58508.539, 58221.427, 55226.392, 55518.948, 58938.983, 58012.631, 57409.576, 63090.925, 
alpha=1.0e-05 | mean RMSE: 57742.8 (2166.2)
	RMSE: 56131.077, 56579.011, 59154.543, 58357.039, 55286.409, 55383.699, 59078.356, 58045.875, 57362.205, 62181.505, 
alpha=1.0e-04 | mean RMSE: 57756.0 (1991.0)
	RMSE: 56534.988, 56108.072, 59111.268, 58029.193, 55109.440, 55967.446, 59423.806, 57614.664, 57434.377, 61430.494, 
alpha=1.0e-03 | mean RMSE: 57676.4 (1807.3)
	RMSE: 56833.816, 56273.478, 58356.648, 57764.019, 55756.530, 55960.350, 58522.135, 57460.081, 57470.955, 61428.045, 
alpha=1.0e-02 | mean RMSE: 57582.6 (1567.5)
	RMSE: 55901.815, 56717.788, 58215.488, 57307.064, 56122.994, 54008.779, 58251.438, 57066.405, 56982.120, 60768.492, 
alpha=1.0e-01 | mean RMSE: 57134.2 (1678.3)
	RMSE: 55069.343, 56782.927, 58376.995, 57639.667, 56750.818, 55279.497, 58732.541, 57439.361, 57798.235, 60827.674, 
alpha=1.0e+00 | mean RMSE: 57469.7 (1593.1)
	RMSE: 55442.745, 56381.047,
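
The scaler, polynomial expansion, second scaler and regressor chained by hand in In [44] can also be expressed as one estimator with `make_pipeline`, which keeps the fit-on-train / transform-on-validation discipline automatic. The sketch below is an illustration on synthetic data (the data, the `alpha=0.1` value and all variable names are assumptions) and checks that both routes give the same predictions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(120, 3))  # synthetic stand-in (assumption)
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)
X_fit, y_fit, X_new = X_demo[:100], y_demo[:100], X_demo[100:]

# Manual chain, mirroring the structure of the cell above.
s1, p, s2 = StandardScaler(), PolynomialFeatures(4), StandardScaler()
Z_fit = s2.fit_transform(p.fit_transform(s1.fit_transform(X_fit)))
Z_new = s2.transform(p.transform(s1.transform(X_new)))
manual_pred = Ridge(alpha=0.1).fit(Z_fit, y_fit).predict(Z_new)

# The same steps wrapped in a single estimator.
pipe = make_pipeline(
    StandardScaler(), PolynomialFeatures(4), StandardScaler(), Ridge(alpha=0.1)
)
pipe_pred = pipe.fit(X_fit, y_fit).predict(X_new)

print(np.allclose(manual_pred, pipe_pred))
```

A pipeline like this can be passed directly to helpers such as `cross_val_score`, so the transformers are refit inside every fold without extra bookkeeping.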

In [45]:
poly = PolynomialFeatures(4)
scaler = MinMaxScaler()
scaler2 = MinMaxScaler()

for alpha in 10.0 ** np.arange(-5, 3):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = scaler2.fit_transform(poly.fit_transform(scaler.fit_transform(X_train.values[train_index])))
    X_valid_transformed = scaler2.transform(poly.transform(scaler.transform(X_train.values[valid_index])))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 55762.214, 56600.219, 57805.727, 57671.581, 55593.631, 52827.577, 58224.887, 56735.889, 57256.694, 61660.720, 
alpha=1.0e-05 | mean RMSE: 57013.9 (2133.9)
	RMSE: 54987.504, 56482.737, 57987.799, 57331.492, 55750.247, 53619.271, 58083.441, 57964.635, 57533.054, 62014.465, 
alpha=1.0e-04 | mean RMSE: 57175.5 (2135.9)
	RMSE: 55611.160, 56059.336, 58698.827, 58484.456, 56400.681, 55442.379, 58860.287, 58684.238, 57946.339, 62156.515, 
alpha=1.0e-03 | mean RMSE: 57834.4 (1936.7)
	RMSE: 56891.590, 56204.160, 59812.401, 58506.599, 56808.000, 57263.065, 59889.875, 59591.170, 58724.956, 63262.957, 
alpha=1.0e-02 | mean RMSE: 58695.5 (1986.2)
	RMSE: 58249.413, 57132.037, 61031.383, 58670.416, 57741.125, 58505.727, 60744.345, 60637.466, 59758.188, 64425.708, 
alpha=1.0e-01 | mean RMSE: 59689.6 (2027.3)
	RMSE: 59654.303, 58817.156, 62532.578, 60136.893, 59465.069, 59068.007, 61995.178, 62136.572, 61206.882, 66177.783, 
alpha=1.0e+00 | mean RMSE: 61119.0 (2115.4)
	RMSE: 61723.057, 61625.862,

In [46]:
poly = PolynomialFeatures(4)
scaler = MinMaxScaler()
scaler2 = MinMaxScaler()

for alpha in 10.0 ** np.arange(-9, -5):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = scaler2.fit_transform(poly.fit_transform(scaler.fit_transform(X_train.values[train_index])))
    X_valid_transformed = scaler2.transform(poly.transform(scaler.transform(X_train.values[valid_index])))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 56081.286, 56608.472, 58472.972, 58181.679, 55226.927, 55456.180, 58960.319, 58194.152, 57382.881, 62992.225, 
alpha=1.0e-09 | mean RMSE: 57755.7 (2137.8)
	RMSE: 56153.506, 56445.211, 59101.215, 58147.845, 55186.568, 56039.237, 59060.814, 57814.390, 57368.105, 61880.571, 
alpha=1.0e-08 | mean RMSE: 57719.7 (1861.6)
	RMSE: 56749.228, 56135.162, 58810.551, 57163.829, 55059.690, 55605.023, 58517.534, 57211.374, 57386.379, 61791.241, 
alpha=1.0e-07 | mean RMSE: 57443.0 (1825.3)
	RMSE: 56772.685, 56064.151, 58152.425, 57773.010, 55514.751, 54877.957, 58637.518, 57229.098, 57083.543, 61957.414, 
alpha=1.0e-06 | mean RMSE: 57406.3 (1876.9)


In [47]:
poly = PolynomialFeatures(4)
scaler = StandardScaler()
scaler2 = MinMaxScaler()

for alpha in 10.0 ** np.arange(-7, 1):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = scaler2.fit_transform(poly.fit_transform(scaler.fit_transform(X_train.values[train_index])))
    X_valid_transformed = scaler2.transform(poly.transform(scaler.transform(X_train.values[valid_index])))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 56567.848, 56319.455, 59106.887, 57986.205, 55117.190, 56017.971, 59388.349, 57961.050, 57404.251, 61448.588, 
alpha=1.0e-07 | mean RMSE: 57731.8 (1783.2)
	RMSE: 57129.182, 56350.102, 58222.319, 57681.477, 55742.398, 55953.650, 58533.385, 57544.655, 57434.029, 61452.319, 
alpha=1.0e-06 | mean RMSE: 57604.4 (1551.8)
	RMSE: 56048.815, 56728.860, 58208.130, 57398.294, 56058.331, 54000.894, 58226.651, 57486.027, 57063.993, 60743.877, 
alpha=1.0e-05 | mean RMSE: 57196.4 (1668.9)
	RMSE: 54767.093, 56902.837, 58091.087, 57419.677, 56375.058, 55113.483, 58448.177, 56977.964, 57405.314, 60537.502, 
alpha=1.0e-04 | mean RMSE: 57203.8 (1570.7)
	RMSE: 54793.507, 56320.377, 58626.650, 58322.930, 56002.117, 55534.071, 58728.508, 57281.587, 58140.595, 61730.328, 
alpha=1.0e-03 | mean RMSE: 57548.1 (1913.4)
	RMSE: 55931.819, 56497.470, 59569.470, 58731.213, 56935.220, 56333.852, 59297.016, 58368.728, 59049.680, 62810.466, 
alpha=1.0e-02 | mean RMSE: 58352.5 (1956.0)
	RMSE: 57455.450, 57650.536,

In [48]:
poly = PolynomialFeatures(4)
scaler = MinMaxScaler()
scaler2 = StandardScaler()

for alpha in 10.0 ** np.arange(-7, 1):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = scaler2.fit_transform(poly.fit_transform(scaler.fit_transform(X_train.values[train_index])))
    X_valid_transformed = scaler2.transform(poly.transform(scaler.transform(X_train.values[valid_index])))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 55968.571, 56491.208, 58584.643, 58993.273, 55323.351, 55302.453, 59003.036, 58977.353, 57313.199, 63154.434, 
alpha=1.0e-07 | mean RMSE: 57911.2 (2260.4)
	RMSE: 55979.598, 56571.938, 58406.003, 58526.582, 55200.218, 55656.480, 58931.091, 58974.582, 57247.341, 63154.712, 
alpha=1.0e-06 | mean RMSE: 57864.9 (2206.0)
	RMSE: 55943.178, 56568.526, 58526.396, 58111.599, 55223.296, 55445.166, 58836.003, 58125.612, 57349.640, 62978.224, 
alpha=1.0e-05 | mean RMSE: 57710.8 (2142.2)
	RMSE: 56046.401, 56372.244, 59046.364, 57347.455, 55160.672, 56018.692, 58677.662, 57341.975, 57335.430, 61998.660, 
alpha=1.0e-04 | mean RMSE: 57534.6 (1874.3)
	RMSE: 56498.960, 56012.257, 58463.191, 57252.716, 55143.024, 55500.412, 58644.298, 57270.977, 57354.775, 62176.272, 
alpha=1.0e-03 | mean RMSE: 57431.7 (1922.5)
	RMSE: 56179.835, 56691.782, 57955.240, 57621.704, 55745.868, 53914.186, 58744.765, 57216.415, 57095.751, 62396.154, 
alpha=1.0e-02 | mean RMSE: 57356.2 (2100.3)
	RMSE: 55936.870, 56690.121,

### 5-5) Changing the random seed

* Run 10-fold cross-validation with a different `random_state` value to check whether similar performance is obtained.
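
As a rough sketch of this kind of check, the loop below repeats 10-fold cross-validation under two shuffling seeds and compares the mean RMSE. It runs on synthetic data with a degree-2 expansion to stay fast; the data, the model settings and the `*_demo` names are assumptions for illustration, and `42` is simply an arbitrary second seed next to the `23456` used in the next cell.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(300, 3))  # synthetic stand-in (assumption)
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=300)

model_demo = make_pipeline(
    MinMaxScaler(), PolynomialFeatures(2), MinMaxScaler(), Ridge(alpha=1e-5)
)

mean_rmse_by_seed = {}
for seed in (42, 23456):
    # Only the shuffling seed changes; model and data stay fixed.
    kf_demo = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(
        model_demo, X_demo, y_demo, cv=kf_demo,
        scoring='neg_root_mean_squared_error',
    )
    mean_rmse_by_seed[seed] = -scores.mean()
    print(f'seed={seed} | mean RMSE: {-scores.mean():.4f} ({scores.std():.4f})')
```

If the mean RMSE moves by much less than the fold-to-fold standard deviation when the seed changes, the ranking of hyperparameters is unlikely to be an artifact of one particular split.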

In [49]:
kf = KFold(n_splits=10, shuffle=True, random_state=23456)

poly = PolynomialFeatures(4)
scaler = MinMaxScaler()
scaler2 = MinMaxScaler()

for alpha in 10.0 ** np.arange(-8, 0):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = scaler2.fit_transform(poly.fit_transform(scaler.fit_transform(X_train.values[train_index])))
    X_valid_transformed = scaler2.transform(poly.transform(scaler.transform(X_train.values[valid_index])))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred), squared=False)
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 60673.064, 55416.765, 55407.022, 58969.291, 55181.389, 60541.268, 59091.628, 55810.507, 57722.679, 56232.956, 
alpha=1.0e-08 | mean RMSE: 57504.7 (2064.3)
	RMSE: 60692.410, 55145.570, 55202.582, 58956.678, 55214.082, 59917.979, 58995.125, 55538.572, 58356.427, 56104.248, 
alpha=1.0e-07 | mean RMSE: 57412.4 (2071.1)
	RMSE: 60521.181, 54904.455, 54165.384, 58482.577, 55419.735, 59799.059, 60220.091, 55989.557, 57796.402, 55695.833, 
alpha=1.0e-06 | mean RMSE: 57299.4 (2240.2)
	RMSE: 59712.900, 55021.919, 54143.318, 57942.401, 55340.369, 59687.695, 60107.891, 56011.801, 56878.333, 55872.238, 
alpha=1.0e-05 | mean RMSE: 57071.9 (2053.5)
	RMSE: 60076.231, 55845.848, 53687.271, 58422.005, 55176.802, 60291.427, 59405.992, 55385.600, 57080.293, 56166.212, 
alpha=1.0e-04 | mean RMSE: 57153.8 (2163.5)
	RMSE: 61681.755, 56768.241, 53921.582, 59064.534, 56112.692, 60315.400, 60262.263, 56061.403, 57370.974, 57277.598, 
alpha=1.0e-03 | mean RMSE: 57883.6 (2267.6)
	RMSE: 62823.097, 57240.166,

* `alpha=1.0e-05` looks like a good choice: it gives the lowest mean RMSE (57071.9) among the values shown for this seed.
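
The selection can be reproduced from the printed summary alone. The sketch below copies the mean RMSE values of the complete rows in the In [49] output into a dictionary (the truncated final row is left out) and picks the smallest one; the variable names are illustrative.

```python
# Mean RMSE per alpha, copied from the complete rows of the In [49] output.
mean_rmse = {
    1.0e-08: 57504.7,
    1.0e-07: 57412.4,
    1.0e-06: 57299.4,
    1.0e-05: 57071.9,
    1.0e-04: 57153.8,
    1.0e-03: 57883.6,
}

# alpha with the smallest mean validation RMSE.
best_alpha = min(mean_rmse, key=mean_rmse.get)

# Distance to the second-best alpha, to judge how clear-cut the choice is.
gap_to_runner_up = sorted(mean_rmse.values())[1] - mean_rmse[best_alpha]

print(f'best alpha: {best_alpha:.1e} (mean RMSE {mean_rmse[best_alpha]:.1f})')
print(f'gap to runner-up: {gap_to_runner_up:.1f}')
```

The gap to the runner-up is small compared with the fold-to-fold standard deviation of roughly 2000 reported in the same output, so neighbouring values of `alpha` would be defensible as well.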

## 6) Evaluating the tuned method on the test data

In [50]:
poly = PolynomialFeatures(4)
scaler = MinMaxScaler()
scaler2 = MinMaxScaler()

X_train_transformed = scaler2.fit_transform(poly.fit_transform(scaler.fit_transform(X_train)))
X_test_transformed = scaler2.transform(poly.transform(scaler.transform(X_test)))

reg = Ridge(alpha=1.0e-5, random_state=42)
reg.fit(X_train_transformed, y_train)
y_test_pred = reg.predict(X_test_transformed)
y_test_pred[y_test_pred > y_train.max()] = y_train.max()
rmse = mean_squared_error(np.exp(y_test), np.exp(y_test_pred), squared=False)
print(f'test RMSE: {rmse:.1f}')

test RMSE: 61197.6
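
The evaluation steps used in the cell above (cap the predictions at the largest training target, undo the log transform with `np.exp`, then compute the RMSE) could be wrapped in a small helper. The sketch below is one possible version; the function name and the toy numbers are hypothetical, and it assumes, as the `np.exp` calls suggest, that the targets are on a log scale.

```python
import numpy as np

def rmse_original_scale(y_true_log, y_pred_log, cap_log):
    """RMSE on the original scale, with log-scale predictions capped at cap_log."""
    # Cap without modifying the caller's array.
    y_pred_capped = np.minimum(np.asarray(y_pred_log, dtype=float), cap_log)
    errors = np.exp(np.asarray(y_true_log, dtype=float)) - np.exp(y_pred_capped)
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy check: the second prediction (400) is capped at 300 before scoring,
# so the errors are 0 and 100 and the RMSE is sqrt(5000).
y_true_log_toy = np.log([100.0, 200.0])
y_pred_log_toy = np.log([100.0, 400.0])
toy_rmse = rmse_original_scale(y_true_log_toy, y_pred_log_toy, cap_log=np.log(300.0))
print(f'{toy_rmse:.2f}')  # 70.71
```

Computing the square root explicitly also avoids relying on the `squared=False` argument of `mean_squared_error`, which newer scikit-learn releases have deprecated.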


* Evaluate on the test set built by dropping the instances that contain missing values.

In [51]:
poly = PolynomialFeatures(4)
scaler = MinMaxScaler()
scaler2 = MinMaxScaler()

X_train_transformed = scaler2.fit_transform(poly.fit_transform(scaler.fit_transform(X_train)))
X_test_transformed = scaler2.transform(poly.transform(scaler.transform(X_test_original)))

reg = Ridge(alpha=1.0e-5, random_state=42)
reg.fit(X_train_transformed, y_train)
y_test_pred = reg.predict(X_test_transformed)
y_test_pred[y_test_pred > y_train.max()] = y_train.max()
rmse = mean_squared_error(y_test_original, np.exp(y_test_pred), squared=False)
print(f'test RMSE: {rmse:.1f}')

test RMSE: 57907.2


* For reference, check the test performance of plain linear regression with none of the refinements above.

In [52]:
reg = LinearRegression()
reg.fit(X_train, y_train)
y_test_pred = reg.predict(X_test)
y_test_pred[y_test_pred > y_train.max()] = y_train.max()
rmse = mean_squared_error(np.exp(y_test), np.exp(y_test_pred), squared=False)
print(f'test RMSE: {rmse:.1f}')

test RMSE: 70139.7


* Evaluate the same baseline on the test set built by dropping the instances that contain missing values.

In [53]:
reg = LinearRegression()
reg.fit(X_train, y_train)
y_test_pred = reg.predict(X_test_original)
y_test_pred[y_test_pred > y_train.max()] = y_train.max()
rmse = mean_squared_error(y_test_original, np.exp(y_test_pred), squared=False)
print(f'test RMSE: {rmse:.1f}')

test RMSE: 69744.5
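
To put the four test results side by side, the sketch below copies the RMSE values printed by In [50] to In [53] and computes the relative reduction of the tuned model over plain linear regression. The dictionary layout and labels are illustrative only.

```python
# Test RMSE values copied from the outputs of In [50] to In [53].
test_rmse = {
    'X_test': {'LinearRegression': 70139.7, 'poly4 + Ridge': 61197.6},
    'X_test_original': {'LinearRegression': 69744.5, 'poly4 + Ridge': 57907.2},
}

reduction = {}
for name, result in test_rmse.items():
    # Fraction of the baseline RMSE removed by the tuned model.
    reduction[name] = 1.0 - result['poly4 + Ridge'] / result['LinearRegression']
    print(f'{name}: {100 * reduction[name]:.1f}% lower RMSE than the baseline')
```

On these numbers the tuned model lowers the test RMSE by about 12.7% on `X_test` and about 17.0% on `X_test_original`.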
