<a href="https://colab.research.google.com/github/tomonari-masada/course2022-sml/blob/main/07_linear_regression_2_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

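# Assignment 7

* Improve the predictive performance as measured by RMSE.
* Do not change how the data is split into the test set and the rest.
 * You are free to use the non-test portion however you like.
 * You may merge the training set and the validation set and run cross-validation.
* You may use ridge regression and Lasso.
* You may use higher-order polynomial features (cf. `sklearn.preprocessing.PolynomialFeatures`).
* Evaluate RMSE on the test set only once, at the very end.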

In [1]:
import numpy as np
from scipy import stats, special
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error

%config InlineBackend.figure_format = 'retina'

In [2]:
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
  os.makedirs(housing_path, exist_ok=True)
  tgz_path = os.path.join(housing_path, "housing.tgz")
  urllib.request.urlretrieve(housing_url, tgz_path)
  housing_tgz = tarfile.open(tgz_path)
  housing_tgz.extractall(path=housing_path)
  housing_tgz.close()

In [3]:
fetch_housing_data()

In [4]:
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

(You do not need to follow the details above this point.)

In [5]:
housing = load_housing_data()
housing.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [6]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


## 1) Convert `ocean_proximity` to 0/1 numeric data

* Use pandas `get_dummies` to convert the values of the categorical variable `ocean_proximity` into 0/1 numeric data.
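
As a small illustration of what `get_dummies` produces, the sketch below uses a made-up three-row series (the values are illustrative, not taken from the dataset). Recent pandas versions (2.0 and later) return boolean columns by default, so `dtype=int` is passed here, as an assumption about the environment, to get 0/1 integers.

```python
import pandas as pd

# Toy categorical column (hypothetical values, for illustration only).
toy = pd.Series(['NEAR BAY', 'INLAND', 'NEAR BAY'], name='ocean_proximity')

# One column per category; dtype=int forces 0/1 instead of True/False.
toy_dummies = pd.get_dummies(toy, dtype=int)
print(toy_dummies)
```

Each row has exactly one 1, in the column of its category.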

In [7]:
housing_dummies = pd.get_dummies(housing['ocean_proximity'])

In [8]:
housing_dummies.head()

Unnamed: 0,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
0,0,0,0,1,0
1,0,0,0,1,0
2,0,0,0,1,0
3,0,0,0,1,0
4,0,0,0,1,0


In [9]:
housing_num = housing.drop('ocean_proximity', axis=1)

In [10]:
housing = pd.concat([housing_num, housing_dummies], axis=1)

In [11]:
housing.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,0,0,0,1,0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,0,0,0,1,0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,0,0,0,1,0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,0,0,0,1,0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,0,0,0,1,0


In [12]:
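# Note: X is built from housing_num, so the one-hot columns added to `housing` above are not used as features below.
X = housing_num.drop('median_house_value', axis=1)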
y = housing_num["median_house_value"].copy()

## 2) Fill missing values in the test data with the training-data median

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

In [14]:
print(X_train.shape, X_valid.shape, X_test.shape)

(12384, 8) (4128, 8) (4128, 8)


In [15]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12384 entries, 17244 to 8472
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           12384 non-null  float64
 1   latitude            12384 non-null  float64
 2   housing_median_age  12384 non-null  float64
 3   total_rooms         12384 non-null  float64
 4   total_bedrooms      12384 non-null  float64
 5   population          12384 non-null  float64
 6   households          12384 non-null  float64
 7   median_income       12384 non-null  float64
dtypes: float64(8)
memory usage: 870.8 KB


In [16]:
X_valid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4128 entries, 2071 to 788
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           4128 non-null   float64
 1   latitude            4128 non-null   float64
 2   housing_median_age  4128 non-null   float64
 3   total_rooms         4128 non-null   float64
 4   total_bedrooms      4128 non-null   float64
 5   population          4128 non-null   float64
 6   households          4128 non-null   float64
 7   median_income       4128 non-null   float64
dtypes: float64(8)
memory usage: 290.2 KB


In [17]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4128 entries, 20046 to 3665
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           4128 non-null   float64
 1   latitude            4128 non-null   float64
 2   housing_median_age  4128 non-null   float64
 3   total_rooms         4128 non-null   float64
 4   total_bedrooms      3921 non-null   float64
 5   population          4128 non-null   float64
 6   households          4128 non-null   float64
 7   median_income       4128 non-null   float64
dtypes: float64(8)
memory usage: 290.2 KB


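* Fill the missing entries with the median.
 * Only the test data has entries with a missing `total_bedrooms` value.
 * Here they are filled with the median of the training data.
 * This is not a problem, because the fill value uses only information obtained from the training data.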
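
The same idea can be sketched with scikit-learn's `SimpleImputer`, which learns the median in `fit` and applies it in `transform`. The toy frames and their values below are hypothetical; this is an alternative formulation, not the code this notebook runs.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames (hypothetical values); only the test frame has a missing entry.
toy_train = pd.DataFrame({'total_bedrooms': [100.0, 300.0, 200.0]})
toy_test = pd.DataFrame({'total_bedrooms': [150.0, np.nan]})

imputer = SimpleImputer(strategy='median')
imputer.fit(toy_train)  # learns the median (200.0) from the training data only

# transform() returns a NumPy array, so wrap it back into a DataFrame.
toy_test_filled = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)
print(toy_test_filled)
```

Fitting on the training frame and only transforming the test frame keeps test-set information out of the fill value.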

In [18]:
median_total_bedrooms = X_train['total_bedrooms'].median()  # Series.median() skips NaN by default
X_test = X_test.fillna({'total_bedrooms': median_total_bedrooms})

* Confirm that no missing entries remain.

In [19]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4128 entries, 20046 to 3665
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           4128 non-null   float64
 1   latitude            4128 non-null   float64
 2   housing_median_age  4128 non-null   float64
 3   total_rooms         4128 non-null   float64
 4   total_bedrooms      4128 non-null   float64
 5   population          4128 non-null   float64
 6   households          4128 non-null   float64
 7   median_income       4128 non-null   float64
dtypes: float64(8)
memory usage: 290.2 KB


## 3) Take the logarithm of the target variable
* When evaluating with RMSE, use `np.exp()` to convert back to the original values.
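
To make the back-transformation concrete, the sketch below compares RMSE on the log scale with RMSE in the original units. The three house values and predictions are made up for illustration.

```python
import numpy as np

# Hypothetical house values and log-scale predictions, for illustration only.
y_true = np.array([100000.0, 200000.0, 400000.0])
y_log_pred = np.log(np.array([110000.0, 190000.0, 380000.0]))

# RMSE on the log scale: not comparable to amounts in the original units.
rmse_log = np.sqrt(np.mean((np.log(y_true) - y_log_pred) ** 2))

# RMSE in the original units: undo the log with np.exp() before comparing.
rmse_orig = np.sqrt(np.mean((y_true - np.exp(y_log_pred)) ** 2))
print(rmse_log, rmse_orig)
```

Only the second quantity is in the units of `median_house_value`, which is why every evaluation cell in this notebook applies `np.exp()` to both the targets and the predictions.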

In [20]:
y_train = np.log(y_train)
y_valid = np.log(y_valid)
y_test = np.log(y_test)

## 4) Merge the training and validation data into one set for cross-validation

In [21]:
X_train = pd.concat([X_train, X_valid])

In [22]:
print(X_train.shape)

(16512, 8)


In [23]:
y_train = pd.concat([y_train, y_valid])

In [24]:
print(y_train.shape)

(16512,)


* Cross-validation is run with 10 folds.

In [25]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, shuffle=True, random_state=123)
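
As a quick check of what `KFold.split` yields, the sketch below uses ten toy samples and five folds (both choices are arbitrary, for illustration). It shows that the splitter returns integer positions and that every sample lands in exactly one validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(20).reshape(10, 2)  # 10 toy samples (hypothetical data)
toy_kf = KFold(n_splits=5, shuffle=True, random_state=0)

seen = []
for train_index, valid_index in toy_kf.split(toy_X):
    # Each split yields integer positions, not the data itself.
    print(len(train_index), len(valid_index), valid_index.tolist())
    seen.extend(valid_index.tolist())
```

Because the indices are positions, the loops in this notebook index `X_train.values` and `y_train.values` rather than the DataFrame labels.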

## 5) Engineer the features
* Use `sklearn.preprocessing.PolynomialFeatures`.
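
For orientation, the sketch below shows which columns `PolynomialFeatures(2)` generates for a toy two-feature sample, and how the number of generated columns grows with the degree for 8 input features. The toy values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

toy = np.array([[2.0, 3.0]])  # one sample with features a=2, b=3
poly2 = PolynomialFeatures(2)
print(poly2.fit_transform(toy))  # columns: 1, a, b, a^2, a*b, b^2

# Number of output columns for 8 input features, by polynomial degree.
n_out = {}
for degree in range(1, 6):
    p = PolynomialFeatures(degree).fit(np.zeros((1, 8)))
    n_out[degree] = p.n_output_features_
print(n_out)
```

With 8 input features the count goes 9, 45, 165, 495, 1287 for degrees 1 to 5, so regularization becomes more relevant as the degree grows.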

### 5-1) For comparison, first run cross-validation on the original data

* Linear regression without regularization

In [26]:
reg = LinearRegression()
rmses = []
print(f'\tRMSE:', end=' ')
for train_index, valid_index in kf.split(X_train):
  reg.fit(X_train.values[train_index], y_train.values[train_index])
  y_valid_pred = reg.predict(X_train.values[valid_index])
  y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
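  # np.sqrt(MSE) works across scikit-learn versions; the `squared` argument was removed in recent releases.
  rmse = np.sqrt(mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred)))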
  rmses.append(rmse)
  print(f'{rmse:.3f}', end=', ')
print()
rmses = np.array(rmses)
print(f'mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 65608.163, 66978.500, 69715.565, 68205.479, 67808.092, 65811.269, 70637.762, 68987.730, 68590.921, 72062.025, 
mean RMSE: 68440.6 (1934.4)
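
The cross-validation loop above recurs with different models throughout this notebook. One way to factor it out is a small helper; `cv_rmse` below is a hypothetical name, and the demo runs on synthetic data rather than the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def cv_rmse(model, X, log_y, kf):
    """Return per-fold RMSE in original units for a model fitted on log targets.

    Hypothetical helper mirroring the loop above; X and log_y are NumPy arrays.
    """
    rmses = []
    for train_index, valid_index in kf.split(X):
        model.fit(X[train_index], log_y[train_index])
        pred = model.predict(X[valid_index])
        # Cap predictions at the largest training target, as in the loop above.
        pred = np.minimum(pred, log_y[train_index].max())
        mse = mean_squared_error(np.exp(log_y[valid_index]), np.exp(pred))
        rmses.append(np.sqrt(mse))
    return np.array(rmses)

# Synthetic data (an assumption, not the housing data) to exercise the helper.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
log_y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + 12.0 + rng.normal(scale=0.1, size=200)

demo_kf = KFold(n_splits=5, shuffle=True, random_state=0)
demo_rmses = cv_rmse(LinearRegression(), X_demo, log_y_demo, demo_kf)
print(f'mean RMSE: {demo_rmses.mean():.1f} ({demo_rmses.std():.1f})')
```

With this notebook's variables, a call such as `cv_rmse(LinearRegression(), X_train.values, y_train.values, kf)` would be expected to correspond to the loop above.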


* Ridge regression

In [27]:
for alpha in 10.0 ** np.arange(-3, 4):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print(f'\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    reg.fit(X_train.values[train_index], y_train.values[train_index])
    y_valid_pred = reg.predict(X_train.values[valid_index])
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = np.sqrt(mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred)))
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 65608.164, 66978.501, 69715.566, 68205.480, 67808.093, 65811.270, 70637.763, 68987.730, 68590.922, 72062.026, 
alpha=1.0e-03 | mean RMSE: 68440.6 (1934.4)
	RMSE: 65608.166, 66978.510, 69715.570, 68205.489, 67808.098, 65811.278, 70637.770, 68987.732, 68590.926, 72062.036, 
alpha=1.0e-02 | mean RMSE: 68440.6 (1934.4)
	RMSE: 65608.189, 66978.599, 69715.615, 68205.579, 67808.144, 65811.358, 70637.838, 68987.746, 68590.974, 72062.134, 
alpha=1.0e-01 | mean RMSE: 68440.6 (1934.4)
	RMSE: 65608.422, 66979.484, 69716.066, 68206.484, 67808.611, 65812.160, 70638.523, 68987.890, 68591.446, 72063.121, 
alpha=1.0e+00 | mean RMSE: 68441.2 (1934.5)
	RMSE: 65610.799, 66988.364, 69720.624, 68215.566, 67813.329, 65820.214, 70645.408, 68989.388, 68596.210, 72073.014, 
alpha=1.0e+01 | mean RMSE: 68447.3 (1935.1)
	RMSE: 65639.436, 67079.334, 69770.863, 68309.136, 67864.750, 65903.738, 70717.067, 69010.118, 68647.349, 72173.951, 
alpha=1.0e+02 | mean RMSE: 68511.6 (1941.4)
	RMSE: 66217.180, 68085.433,

* Lasso

In [28]:
for alpha in 10.0 ** np.arange(-3, 4):
  reg = Lasso(alpha=alpha, random_state=42)
  rmses = []
  print(f'\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    reg.fit(X_train.values[train_index], y_train.values[train_index])
    y_valid_pred = reg.predict(X_train.values[valid_index])
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = np.sqrt(mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred)))
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 65631.688, 67038.294, 69749.911, 68266.233, 67847.201, 65866.671, 70689.390, 69007.426, 68630.179, 72128.076, 
alpha=1.0e-03 | mean RMSE: 68485.5 (1938.3)
	RMSE: 65984.348, 67692.472, 70214.915, 68947.756, 68331.510, 66491.140, 71279.027, 69336.603, 69103.749, 72837.479, 
alpha=1.0e-02 | mean RMSE: 69021.9 (1976.4)
	RMSE: 75413.574, 78164.504, 80543.191, 80162.791, 78271.640, 76982.541, 81690.593, 78894.276, 78800.813, 83208.019, 
alpha=1.0e-01 | mean RMSE: 79213.2 (2154.9)
	RMSE: 108716.987, 109399.156, 111990.575, 109889.650, 109304.745, 109580.356, 114760.454, 111404.495, 113389.669, 112926.999, 
alpha=1.0e+00 | mean RMSE: 111136.3 (1960.1)
	RMSE: 112440.004, 112442.754, 115580.547, 113506.066, 113028.271, 113399.826, 118308.972, 114867.657, 116259.668, 116376.554, 
alpha=1.0e+01 | mean RMSE: 114621.0 (1874.1)
	RMSE: 117163.152, 116156.651, 119897.336, 117538.872, 117174.950, 117766.269, 121569.332, 118084.545, 120584.168, 120216.743, 
alpha=1.0e+02 | mean RMSE: 118615.2 (170

### 5-2) Add second-degree terms

In [29]:
from sklearn.preprocessing import PolynomialFeatures

* Linear regression without regularization

In [30]:
poly = PolynomialFeatures(2)
reg = LinearRegression()
rmses = []
print(f'\tRMSE:', end=' ')
for train_index, valid_index in kf.split(X_train):
  X_train_transformed = poly.fit_transform(X_train.values[train_index])
  X_valid_transformed = poly.transform(X_train.values[valid_index])
  reg.fit(X_train_transformed, y_train.values[train_index])
  y_valid_pred = reg.predict(X_valid_transformed)
  y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
  rmse = np.sqrt(mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred)))
  rmses.append(rmse)
  print(f'{rmse:.1f}', end=', ')
print()
rmses = np.array(rmses)
print(f'mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 61125.6, 61104.0, 64637.0, 62001.0, 61899.9, 60897.4, 63691.5, 64874.6, 62961.3, 67686.2, 
mean RMSE: 63087.9 (2056.5)


* Ridge regression
 * Also apply scaling.
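
Scaling matters for Ridge because the penalty acts on coefficient magnitudes, which depend on the scale of each feature. As an alternative to chaining `fit_transform` calls by hand, the three steps can be sketched as one scikit-learn pipeline. The data below is synthetic and `alpha=0.1` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (an assumption for illustration, not the housing data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2)) * np.array([1.0, 1000.0])  # very different scales
y_demo = X_demo[:, 0] + 0.001 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# fit() fits every step on the training rows only; predict() reuses those fits.
model = make_pipeline(PolynomialFeatures(2), StandardScaler(), Ridge(alpha=0.1))
model.fit(X_demo[:80], y_demo[:80])
pred = model.predict(X_demo[80:])
print(pred.shape)
```

A pipeline keeps the scaler and the polynomial expansion from ever being fitted on validation rows, which is the same discipline the manual loop follows.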

In [31]:
poly = PolynomialFeatures(2)
scaler = StandardScaler()
for alpha in 10.0 ** np.arange(-5, 3):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = scaler.fit_transform(poly.fit_transform(X_train.values[train_index]))
    X_valid_transformed = scaler.transform(poly.transform(X_train.values[valid_index]))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = np.sqrt(mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred)))
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 61125.258, 61103.812, 64636.739, 62000.755, 61899.553, 60897.153, 63691.108, 64873.990, 62961.024, 67686.175, 
alpha=1.0e-05 | mean RMSE: 63087.6 (2056.6)
	RMSE: 61122.079, 61102.334, 64634.206, 61998.170, 61896.713, 60895.198, 63687.294, 64868.868, 62958.193, 67686.386, 
alpha=1.0e-04 | mean RMSE: 63084.9 (2056.8)
	RMSE: 61091.902, 61088.790, 64612.257, 61976.650, 61872.648, 60878.153, 63654.299, 64821.530, 62932.110, 67688.465, 
alpha=1.0e-03 | mean RMSE: 63061.7 (2059.3)
	RMSE: 60964.888, 61038.562, 64512.378, 61904.690, 61792.105, 60811.059, 63520.784, 64565.125, 62789.678, 67713.474, 
alpha=1.0e-02 | mean RMSE: 62961.3 (2066.2)
	RMSE: 60929.503, 61124.266, 64337.179, 61882.315, 61907.442, 60823.754, 63504.905, 64360.939, 62674.316, 67877.886, 
alpha=1.0e-01 | mean RMSE: 62942.3 (2067.5)
	RMSE: 61256.583, 61618.419, 64397.568, 62122.325, 62477.092, 61359.135, 64045.361, 64675.134, 63055.714, 68211.215, 
alpha=1.0e+00 | mean RMSE: 63321.9 (2016.4)
	RMSE: 61635.353, 62022.052,

* Lasso
 * Also apply scaling.

In [32]:
poly = PolynomialFeatures(2)
scaler = StandardScaler()
for alpha in 10.0 ** np.arange(-5, 3):
  reg = Lasso(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = scaler.fit_transform(poly.fit_transform(X_train.values[train_index]))
    X_valid_transformed = scaler.transform(poly.transform(X_train.values[valid_index]))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = np.sqrt(mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred)))
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 61578.833, 62062.232, 64513.222, 62345.406, 62795.761, 62024.176, 64479.724, 65039.333, 63416.590, 68446.528, 
alpha=1.0e-05 | mean RMSE: 63670.2 (1957.2)
	RMSE: 61603.647, 62088.859, 64681.008, 62434.019, 62859.866, 62052.291, 64592.090, 65176.113, 63575.718, 68527.677, 
alpha=1.0e-04 | mean RMSE: 63759.1 (1980.9)
	RMSE: 62138.350, 62425.108, 65929.358, 63307.464, 63448.351, 62740.286, 65577.979, 65820.626, 64374.234, 68893.703, 
alpha=1.0e-03 | mean RMSE: 64465.5 (1994.1)
	RMSE: 65896.566, 66793.441, 69569.845, 68515.557, 67848.082, 66417.451, 71013.650, 69555.015, 68814.859, 73199.994, 
alpha=1.0e-02 | mean RMSE: 68762.4 (2110.2)
	RMSE: 83973.998, 86401.167, 89094.308, 89019.988, 87807.271, 86674.411, 90655.946, 87700.415, 89700.589, 91382.734, 
alpha=1.0e-01 | mean RMSE: 88241.1 (2090.3)
	RMSE: 118129.692, 117029.965, 120651.582, 118474.958, 118086.381, 118395.189, 122211.901, 118638.001, 121252.012, 120866.908, 
alpha=1.0e+00 | mean RMSE: 119373.7 (1626.4)
	RMSE: 118129.692, 117029.965, 120651.582, 118474.958, 118086.381, 118395.189, 122211.901, 118638.001, 121252.012, 120866.908, 
alpha=1.0e+01 | mean RMSE: 119373.7 (1626.4)
	RMSE: 118129.692, 117029.965, 120651.582, 118474.958, 118086.381, 118395.189, 122211.901, 118638.001, 121252.012, 120866.908, 
alpha=1.0e+02 | mean RMSE: 119373.7 (1626.4)


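* Lasso does not seem to work very well here: even its best mean RMSE (63670.2 at alpha=1e-5) is worse than that of unregularized linear regression on the same features (63087.9).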
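
One possible contributing factor, stated here as an assumption that this notebook does not verify, is that Lasso's coordinate descent may not converge within the default `max_iter=1000` when `alpha` is small and the polynomial features are strongly correlated. The sketch below, on synthetic data, shows how `max_iter` and `tol` can be set and how to inspect the number of iterations actually used.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (an assumption for illustration, not the housing data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_poly = StandardScaler().fit_transform(PolynomialFeatures(2).fit_transform(X_demo))

# A larger iteration budget than the default; tol is left at its default value.
lasso = Lasso(alpha=1e-3, max_iter=100_000, tol=1e-4, random_state=42)
lasso.fit(X_poly, y_demo)
print('iterations used:', lasso.n_iter_)
print('non-zero coefficients:', np.count_nonzero(lasso.coef_))
```

Whether a larger budget would change the ranking between Lasso and Ridge on the housing data is an open question that would need to be checked by cross-validation.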

### 5-3) Add terms up to degree 3

* Linear regression without regularization

In [33]:
poly = PolynomialFeatures(3)
reg = LinearRegression()
rmses = []
print(f'\tRMSE:', end=' ')
for train_index, valid_index in kf.split(X_train):
  X_train_transformed = poly.fit_transform(X_train.values[train_index])
  X_valid_transformed = poly.transform(X_train.values[valid_index])
  reg.fit(X_train_transformed, y_train.values[train_index])
  y_valid_pred = reg.predict(X_valid_transformed)
  y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
  rmse = np.sqrt(mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred)))
  rmses.append(rmse)
  print(f'{rmse:.1f}', end=', ')
print()
rmses = np.array(rmses)
print(f'mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 60081.2, 58548.8, 62502.6, 59529.7, 59736.3, 59898.0, 63502.1, 61847.9, 61088.2, 67018.7, 
mean RMSE: 61375.4 (2364.0)


* Ridge regression
 * Also apply scaling.

In [34]:
poly = PolynomialFeatures(3)
scaler = StandardScaler()
for alpha in 10.0 ** np.arange(-5, 3):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = scaler.fit_transform(poly.fit_transform(X_train.values[train_index]))
    X_valid_transformed = scaler.transform(poly.transform(X_train.values[valid_index]))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = np.sqrt(mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred)))
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 59117.665, 58718.155, 62014.510, 59573.106, 59523.598, 59803.955, 62488.216, 61052.259, 60396.813, 64090.581, 
alpha=1.0e-05 | mean RMSE: 60677.9 (1628.5)
	RMSE: 59391.899, 58944.179, 62294.941, 59731.825, 59652.556, 60209.216, 62744.942, 61373.098, 60678.796, 64364.885, 
alpha=1.0e-04 | mean RMSE: 60938.6 (1646.5)
	RMSE: 59542.290, 59032.466, 62516.103, 59520.869, 59474.496, 60165.392, 62563.321, 61585.994, 60969.123, 64417.744, 
alpha=1.0e-03 | mean RMSE: 60978.8 (1672.5)
	RMSE: 59597.253, 59014.167, 62519.356, 59538.009, 59132.972, 60313.693, 62268.157, 61690.047, 61720.942, 64503.774, 
alpha=1.0e-02 | mean RMSE: 61029.8 (1708.4)
	RMSE: 59613.446, 59260.989, 62402.534, 60317.668, 58926.838, 60469.550, 62229.223, 62458.996, 61763.268, 64949.746, 
alpha=1.0e-01 | mean RMSE: 61239.2 (1766.6)
	RMSE: 59670.081, 59547.571, 62266.018, 60633.873, 59276.177, 59921.802, 62296.073, 62794.014, 61808.355, 65694.933, 
alpha=1.0e+00 | mean RMSE: 61390.9 (1892.5)
	RMSE: 60068.227, 60051.473,

In [35]:
poly = PolynomialFeatures(3)
scaler = StandardScaler()
for alpha in 10.0 ** np.arange(-8, -5):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = scaler.fit_transform(poly.fit_transform(X_train.values[train_index]))
    X_valid_transformed = scaler.transform(poly.transform(X_train.values[valid_index]))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = np.sqrt(mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred)))
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 58622.542, 58114.310, 61692.986, 59150.940, 59392.223, 59641.009, 61806.301, 61440.260, 60181.579, 64177.407, 
alpha=1.0e-08 | mean RMSE: 60422.0 (1748.7)
	RMSE: 58671.941, 58194.794, 61732.682, 59186.590, 59341.113, 59638.911, 61867.079, 61311.720, 60155.850, 64088.290, 
alpha=1.0e-07 | mean RMSE: 60418.9 (1715.8)
	RMSE: 58846.539, 58410.612, 61847.116, 59335.691, 59392.042, 59693.899, 62074.553, 60986.873, 60202.157, 64020.162, 
alpha=1.0e-06 | mean RMSE: 60481.0 (1653.5)


### 5-4) Add terms up to degree 4

In [36]:
poly = PolynomialFeatures(4)
reg = LinearRegression()
rmses = []
print(f'\tRMSE:', end=' ')
for train_index, valid_index in kf.split(X_train):
  X_train_transformed = poly.fit_transform(X_train.values[train_index])
  X_valid_transformed = poly.transform(X_train.values[valid_index])
  reg.fit(X_train_transformed, y_train.values[train_index])
  y_valid_pred = reg.predict(X_valid_transformed)
  y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
  rmse = np.sqrt(mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred)))
  rmses.append(rmse)
  print(f'{rmse:.1f}', end=', ')
print()
rmses = np.array(rmses)
print(f'mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 58862.1, 58546.9, 60615.4, 61390.5, 59628.8, 58798.0, 62859.0, 62793.6, 60464.9, 67449.5, 
mean RMSE: 61140.9 (2568.6)


In [37]:
poly = PolynomialFeatures(4)
scaler = StandardScaler()
for alpha in 10.0 ** np.arange(-5, 3):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = scaler.fit_transform(poly.fit_transform(X_train.values[train_index]))
    X_valid_transformed = scaler.transform(poly.transform(X_train.values[valid_index]))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = np.sqrt(mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred)))
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 58483.647, 56802.364, 59234.198, 58081.340, 57027.497, 56448.437, 60008.654, 59149.583, 58581.014, 64619.787, 
alpha=1.0e-05 | mean RMSE: 58843.7 (2212.4)
	RMSE: 58983.078, 57356.093, 60014.980, 58608.315, 57284.058, 56064.510, 60788.856, 59370.442, 59216.002, 64137.432, 
alpha=1.0e-04 | mean RMSE: 59182.4 (2118.8)
	RMSE: 58650.640, 58325.964, 60033.874, 58778.672, 57395.681, 56986.048, 61724.659, 60540.234, 59474.943, 64221.999, 
alpha=1.0e-03 | mean RMSE: 59613.3 (2045.8)
	RMSE: 57892.092, 58297.422, 60317.726, 58744.367, 57687.035, 57318.957, 61264.545, 60351.861, 59277.528, 63706.193, 
alpha=1.0e-02 | mean RMSE: 59485.8 (1868.0)
	RMSE: 57975.487, 58071.251, 60652.358, 59309.654, 57655.585, 57261.152, 61523.918, 60537.660, 60407.681, 64240.223, 
alpha=1.0e-01 | mean RMSE: 59763.5 (2047.3)
	RMSE: 58440.912, 58116.037, 60964.335, 59112.657, 58091.635, 57661.677, 61335.089, 60812.826, 60840.459, 64827.409, 
alpha=1.0e+00 | mean RMSE: 60020.3 (2079.8)
	RMSE: 59399.456, 58846.146,

In [38]:
poly = PolynomialFeatures(4)
scaler = StandardScaler()
for alpha in 10.0 ** np.arange(-8, -5):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = scaler.fit_transform(poly.fit_transform(X_train.values[train_index]))
    X_valid_transformed = scaler.transform(poly.transform(X_train.values[valid_index]))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = np.sqrt(mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred)))
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 56318.975, 56841.687, 58758.484, 58687.287, 56263.355, 56636.375, 59952.618, 60319.169, 57863.412, 64382.163, 
alpha=1.0e-08 | mean RMSE: 58602.4 (2375.0)
	RMSE: 56576.340, 56847.291, 58415.483, 58675.852, 56306.751, 56244.295, 59551.125, 60423.975, 57554.244, 65009.049, 
alpha=1.0e-07 | mean RMSE: 58560.4 (2537.7)
	RMSE: 57437.663, 56650.501, 58587.219, 57825.936, 56439.526, 56038.536, 59818.187, 59668.886, 57913.823, 64848.949, 
alpha=1.0e-06 | mean RMSE: 58522.9 (2430.4)


### 5-5) Add terms up to degree 5

In [39]:
poly = PolynomialFeatures(5)
reg = LinearRegression()
rmses = []
print(f'\tRMSE:', end=' ')
for train_index, valid_index in kf.split(X_train):
  X_train_transformed = poly.fit_transform(X_train.values[train_index])
  X_valid_transformed = poly.transform(X_train.values[valid_index])
  reg.fit(X_train_transformed, y_train.values[train_index])
  y_valid_pred = reg.predict(X_valid_transformed)
  y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
  rmse = np.sqrt(mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred)))
  rmses.append(rmse)
  print(f'{rmse:.1f}', end=', ')
print()
rmses = np.array(rmses)
print(f'mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 64942.5, 63927.7, 71069.9, 68668.5, 62664.2, 63866.5, 69432.4, 67384.3, 61980.1, 70349.8, 
mean RMSE: 66428.6 (3175.8)


In [40]:
poly = PolynomialFeatures(5)
scaler = StandardScaler()
for alpha in 10.0 ** np.arange(-5, 3):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = scaler.fit_transform(poly.fit_transform(X_train.values[train_index]))
    X_valid_transformed = scaler.transform(poly.transform(X_train.values[valid_index]))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = np.sqrt(mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred)))
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 57264.454, 58297.020, 61593.996, 60423.565, 57360.488, 57995.633, 61350.052, 60471.011, 59734.684, 65179.874, 
alpha=1.0e-05 | mean RMSE: 59967.1 (2302.7)
	RMSE: 58677.293, 57568.881, 60654.513, 59520.578, 56501.550, 57289.432, 60662.511, 59431.116, 60161.312, 65200.497, 
alpha=1.0e-04 | mean RMSE: 59566.8 (2320.8)
	RMSE: 58230.278, 57051.971, 60925.122, 59848.991, 57424.528, 56709.768, 60630.102, 59324.167, 59699.388, 63870.238, 
alpha=1.0e-03 | mean RMSE: 59371.5 (2052.6)
	RMSE: 58399.423, 57060.497, 60496.846, 59984.262, 57496.163, 57737.573, 60686.368, 60290.367, 59166.415, 63523.607, 
alpha=1.0e-02 | mean RMSE: 59484.2 (1839.6)
	RMSE: 57581.788, 57871.498, 60287.483, 59019.924, 57493.500, 57180.383, 60306.231, 61023.920, 59453.449, 63836.945, 
alpha=1.0e-01 | mean RMSE: 59405.5 (1956.6)
	RMSE: 57968.084, 58185.132, 60707.375, 59163.248, 57606.109, 57297.950, 60802.384, 60382.171, 60425.748, 64150.925, 
alpha=1.0e+00 | mean RMSE: 59668.9 (1968.8)
	RMSE: 58636.154, 58439.294,

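### 5-6) Conclusion

* The best result came from using `PolynomialFeatures(4)`, scaling with `StandardScaler()`, and predicting with `Ridge(alpha=1e-6)` (mean RMSE 58522.9).

* Run 10-fold cross-validation again with a different `random_state` to check whether similar performance is obtained.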

In [41]:
kf = KFold(n_splits=10, shuffle=True, random_state=2345)

poly = PolynomialFeatures(4)
scaler = StandardScaler()
for alpha in 10.0 ** np.arange(-8, -3):
  reg = Ridge(alpha=alpha, random_state=42)
  rmses = []
  print('\tRMSE:', end=' ')
  for train_index, valid_index in kf.split(X_train):
    X_train_transformed = scaler.fit_transform(poly.fit_transform(X_train.values[train_index]))
    X_valid_transformed = scaler.transform(poly.transform(X_train.values[valid_index]))
    reg.fit(X_train_transformed, y_train.values[train_index])
    y_valid_pred = reg.predict(X_valid_transformed)
    y_valid_pred[y_valid_pred > y_train.values[train_index].max()] = y_train.values[train_index].max()
    rmse = np.sqrt(mean_squared_error(np.exp(y_train.values[valid_index]), np.exp(y_valid_pred)))
    rmses.append(rmse)
    print(f'{rmse:.3f}', end=', ')
  print()
  rmses = np.array(rmses)
  print(f'alpha={alpha:.1e} | mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

	RMSE: 58708.127, 58628.282, 57681.899, 61094.448, 57220.015, 60820.353, 58823.196, 55963.411, 57535.783, 57494.414, 
alpha=1.0e-08 | mean RMSE: 58397.0 (1511.8)
	RMSE: 58523.845, 59174.913, 58294.113, 60249.877, 57548.619, 61176.895, 58971.810, 54806.296, 57985.415, 57762.520, 
alpha=1.0e-07 | mean RMSE: 58449.4 (1621.1)
	RMSE: 58396.561, 59383.276, 57804.356, 60377.019, 57713.359, 61490.963, 58794.931, 54715.286, 58221.332, 57419.356, 
alpha=1.0e-06 | mean RMSE: 58431.6 (1731.3)
	RMSE: 59631.826, 60166.375, 57577.376, 59770.579, 58305.893, 61799.884, 58885.392, 55350.793, 58989.932, 57603.670, 
alpha=1.0e-05 | mean RMSE: 58808.2 (1661.6)
	RMSE: 60161.893, 60787.585, 58028.643, 59678.964, 58130.911, 62057.813, 58738.213, 55710.183, 59732.937, 57964.060, 
alpha=1.0e-04 | mean RMSE: 59099.1 (1687.0)


* For alpha between roughly 1e-8 and 1e-6, performance may not change much.
* For now, choose alpha=1e-6.
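
Instead of re-running with a single different `random_state`, a stability check of this kind can also be sketched with `RepeatedKFold`, which repeats the K-fold split with several shuffles. The example below uses synthetic data and scikit-learn's built-in `neg_root_mean_squared_error` scorer, so unlike the cells in this notebook it reports RMSE on the scale of the target it is given, with no `np.exp()` back-transformation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration, not the housing data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=300)

rkf = RepeatedKFold(n_splits=10, n_repeats=3, random_state=0)  # 30 fits per alpha
for alpha in [1e-8, 1e-6, 1e-4]:
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    # The scorer returns negated RMSE, so flip the sign back.
    scores = -cross_val_score(model, X_demo, y_demo, cv=rkf,
                              scoring='neg_root_mean_squared_error')
    print(f'alpha={alpha:.1e} | mean RMSE: {scores.mean():.3f} ({scores.std():.3f})')
```

If the means for neighbouring `alpha` values differ by much less than their standard deviations, the choice among them is unlikely to matter much.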

## 6) Evaluate the tuned method on the test data

In [42]:
poly = PolynomialFeatures(4)
scaler = StandardScaler()
X_train_transformed = scaler.fit_transform(poly.fit_transform(X_train))
X_test_transformed = scaler.transform(poly.transform(X_test))

In [43]:
reg = Ridge(alpha=1e-6, random_state=42)
reg.fit(X_train_transformed, y_train)
y_test_pred = reg.predict(X_test_transformed)
y_test_pred[y_test_pred > y_train.max()] = y_train.max()
rmse = np.sqrt(mean_squared_error(np.exp(y_test), np.exp(y_test_pred)))
print(f'test RMSE: {rmse:.1f}')

test RMSE: 64047.9
