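# Planner Assignment 6

* Improve predictive performance as measured by RMSE.
* Do not change how the data is split into the test set and the rest.
  * You are free to use the non-test portion however you like.
  * You may merge the training set and validation set and run cross-validation.
* Beyond that, experiment freely.
  * You may use ridge regression and Lasso.
  * You may use higher-degree polynomial features (cf. `sklearn.preprocessing.PolynomialFeatures`).
* Once you have finished tuning the prediction method, evaluate it on the test set by RMSE as the final step.
  * Do not go back to tuning after looking at the test-set results.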
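
For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below is a minimal NumPy version for illustration only; the helper name `rmse` is hypothetical, and the solution itself relies on `sklearn.metrics.root_mean_squared_error`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values: only the last prediction is off, by 2.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Because the errors are squared before averaging, a few large misses dominate the score.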

## Sample solution

In [1]:
from tqdm.auto import tqdm
import numpy as np
from scipy import stats, special
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import PolynomialFeatures

%config InlineBackend.figure_format = 'retina'

np.random.seed(42)

In [2]:
!wget https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.tgz
!tar zxvf housing.tgz

--2025-06-13 21:02:50--  https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.tgz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 409488 (400K) [application/octet-stream]
Saving to: `housing.tgz.4'

2025-06-13 21:02:50 (3.80 MB/s) - `housing.tgz.4' saved [409488/409488]

x housing.csv


In [3]:
df = pd.read_csv("housing.csv")
df_onehot = pd.get_dummies(df, dtype=int)
X = df_onehot.drop('median_house_value', axis=1)
y = df_onehot["median_house_value"].copy()

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1234)
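
The two calls above split off 20% of the rows as the test set and then 25% of the remainder as the validation set. A quick arithmetic check of the resulting proportions (the variable names are illustrative):

```python
from fractions import Fraction

test_frac = Fraction(1, 5)                    # test_size=0.2 of all rows
val_frac = Fraction(1, 4) * (1 - test_frac)   # test_size=0.25 of the remaining rows
train_frac = 1 - test_frac - val_frac

print(train_frac, val_frac, test_frac)  # prints: 3/5 1/5 1/5
```

So the split is roughly 60% training, 20% validation, and 20% test, up to rounding of the row counts.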

### Merge the training and validation sets so that cross-validation can be run.

In [5]:
X_train = pd.concat([X_train, X_val])
y_train = pd.concat([y_train, y_val])

### Take the logarithm of the target variable.

In [6]:
y_train = np.log(y_train)
y_test = np.log(y_test)
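
Because the model is fit on the log of the target, its predictions have to be mapped back with `np.exp` before RMSE is computed in the original units, which is what the cross-validation loops below do. A small sketch with made-up numbers (all values and variable names here are illustrative, not taken from the dataset):

```python
import numpy as np

y_train_demo = np.array([100_000.0, 250_000.0, 500_000.0])
log_y_train = np.log(y_train_demo)

# Hypothetical log-space predictions; the last one overshoots the training maximum.
log_pred = np.array([11.6, 12.4, 13.5])

# Cap predictions at the largest log-target seen in training, then undo the log.
log_pred_clipped = np.minimum(log_pred, log_y_train.max())
pred = np.exp(log_pred_clipped)

print(pred)
```

Clipping in log space before exponentiating keeps a single overshooting prediction from blowing up the RMSE after the back-transform.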

### Run 10-fold cross-validation three times.

In [7]:
kfold = []
for i in range(3):
  kfold.append(KFold(n_splits=10, shuffle=True, random_state=np.random.randint(1, 10000)))
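
Three independently shuffled 10-fold splitters give 3 × 10 = 30 validation scores per hyperparameter setting, which steadies both the mean and the reported standard deviation. The sketch below reproduces the idea with plain NumPy on a toy index range; `repeated_kfold_indices` is a hypothetical helper written for this illustration (scikit-learn offers `RepeatedKFold` for the same purpose).

```python
import numpy as np

def repeated_kfold_indices(n_samples, n_splits, n_repeats, seed=0):
    """Yield (train_idx, val_idx) pairs for repeated shuffled K-fold."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)        # fresh shuffle for every repeat
        folds = np.array_split(perm, n_splits)   # list of near-equal index chunks
        for k in range(n_splits):
            val_idx = folds[k]
            train_idx = np.concatenate(folds[:k] + folds[k + 1:])
            yield train_idx, val_idx

splits = list(repeated_kfold_indices(n_samples=50, n_splits=10, n_repeats=3))
print(len(splits))  # 30
```

Within one repeat every sample is used for validation exactly once; across repeats the fold membership differs because the shuffle is redrawn.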

### Ridge regression + min-max scaling + degree-2 polynomial features

In [8]:
scaler = MinMaxScaler()
poly = PolynomialFeatures(2)

# Grid search over the regularization strength alpha = 1e-6, ..., 1e+1.
for alpha in 10. ** np.arange(-6, 2):
  reg = Ridge(alpha=alpha, random_state=123)
  print(f"---- Ridge regression for alpha={alpha:.2e}")
  rmses = []
  for i in tqdm(range(len(kfold))):
    for train_index, val_index in kfold[i].split(X_train):
      _X_train, _X_val = X_train.iloc[train_index], X_train.iloc[val_index]
      _y_train, _y_val = y_train.iloc[train_index], y_train.iloc[val_index]
      # Impute missing total_bedrooms with the median of the training fold only.
      total_bedrooms_median = _X_train["total_bedrooms"].median()
      _X_train = _X_train.fillna({'total_bedrooms': total_bedrooms_median})
      _X_val = _X_val.fillna({'total_bedrooms': total_bedrooms_median})
      # Fit the scaler and polynomial expansion on the training fold, then apply them to the validation fold.
      _X_train = poly.fit_transform(scaler.fit_transform(_X_train))
      _X_val = poly.transform(scaler.transform(_X_val))
      reg.fit(_X_train, _y_train)
      y_val_pred = reg.predict(_X_val)
      # Cap log-space predictions at the largest training target.
      y_val_pred[y_val_pred > _y_train.max()] = _y_train.max()
      # Undo the log transform so that RMSE is measured in the original units.
      rmse = root_mean_squared_error(np.exp(_y_val), np.exp(y_val_pred))
      rmses.append(rmse)
  rmses = np.array(rmses)
  # Mean and standard deviation over all 30 folds (3 repeats x 10 folds).
  print(f'mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

---- Ridge regression for alpha=1.00e-06
mean RMSE: 61614.2 (1326.8)
---- Ridge regression for alpha=1.00e-05
mean RMSE: 61613.1 (1327.9)
---- Ridge regression for alpha=1.00e-04
mean RMSE: 61609.1 (1332.1)
---- Ridge regression for alpha=1.00e-03
mean RMSE: 61631.0 (1328.2)
---- Ridge regression for alpha=1.00e-02
mean RMSE: 61864.5 (1327.8)
---- Ridge regression for alpha=1.00e-01
mean RMSE: 62443.1 (1339.4)
---- Ridge regression for alpha=1.00e+00
mean RMSE: 64303.8 (1405.3)
---- Ridge regression for alpha=1.00e+01
mean RMSE: 67873.3 (1588.3)
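
The loop above refits the median imputation, the scaler, and the polynomial expansion on each training fold, so nothing from the validation fold leaks into preprocessing. One possible way to get the same guarantee with less code is a scikit-learn pipeline passed to `cross_val_score`. The sketch below runs on synthetic data (`X_demo` and `y_demo` are made up) and differs from the solution in several ways: it imputes every column, skips the log transform, and does not clip predictions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic regression problem with some missing values in the third column.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = X_demo[:, 0] ** 2 + 0.5 * X_demo[:, 1] + rng.normal(scale=0.05, size=200)
X_demo[rng.random(200) < 0.1, 2] = np.nan

# Every step is refit on the training part of each fold by cross_val_score.
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    MinMaxScaler(),
    PolynomialFeatures(2),
    Ridge(alpha=1e-3),
)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
# The scorer returns negated RMSE, so flip the sign.
scores = -cross_val_score(pipe, X_demo, y_demo, cv=cv,
                          scoring="neg_root_mean_squared_error")
print(f"mean RMSE: {scores.mean():.3f} ({scores.std():.3f})")
```

The explicit loop in the solution remains the better fit when per-fold steps such as clipping and the `np.exp` back-transform are needed before scoring.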


### Ridge regression + min-max scaling + degree-3 polynomial features

In [9]:
scaler = MinMaxScaler()
poly = PolynomialFeatures(3)

for alpha in 10. ** np.arange(-8, -2):
  reg = Ridge(alpha=alpha, random_state=123)
  print(f"---- Ridge regression for alpha={alpha:.2e}")
  rmses = []
  for i in tqdm(range(len(kfold))):
    for train_index, val_index in kfold[i].split(X_train):
      _X_train, _X_val = X_train.iloc[train_index], X_train.iloc[val_index]
      _y_train, _y_val = y_train.iloc[train_index], y_train.iloc[val_index]
      total_bedrooms_median = _X_train["total_bedrooms"].median()
      _X_train = _X_train.fillna({'total_bedrooms': total_bedrooms_median})
      _X_val = _X_val.fillna({'total_bedrooms': total_bedrooms_median})
      _X_train = poly.fit_transform(scaler.fit_transform(_X_train))
      _X_val = poly.transform(scaler.transform(_X_val))
      reg.fit(_X_train, _y_train)
      y_val_pred = reg.predict(_X_val)
      y_val_pred[y_val_pred > _y_train.max()] = _y_train.max()
      rmse = root_mean_squared_error(np.exp(_y_val), np.exp(y_val_pred))
      rmses.append(rmse)
  rmses = np.array(rmses)
  print(f'mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

---- Ridge regression for alpha=1.00e-08
mean RMSE: 56920.9 (1558.8)
---- Ridge regression for alpha=1.00e-07
mean RMSE: 56860.7 (1519.2)
---- Ridge regression for alpha=1.00e-06
mean RMSE: 56811.4 (1470.0)
---- Ridge regression for alpha=1.00e-05
mean RMSE: 56883.9 (1447.8)
---- Ridge regression for alpha=1.00e-04
mean RMSE: 57019.3 (1412.1)
---- Ridge regression for alpha=1.00e-03
mean RMSE: 57398.9 (1454.1)
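
Going from degree 2 to degree 3 sharply increases the number of columns that `PolynomialFeatures` generates: with its default settings (bias column included, all interaction terms kept), n input features expand to C(n + d, d) columns at degree d. A quick check, assuming the one-hot encoded housing frame has 13 input features (8 numeric columns plus 5 `ocean_proximity` indicators; this count is an assumption and is not stated in the notebook):

```python
from math import comb

def n_poly_features(n_features, degree):
    """Number of columns in a full polynomial expansion, bias column included."""
    return comb(n_features + degree, degree)

for degree in (2, 3):
    print(degree, n_poly_features(13, degree))
# prints:
# 2 105
# 3 560
```

The larger design matrix may be one reason the cubic model reaches a lower cross-validated RMSE, and why a smaller range of `alpha` values is searched for it.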


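### Ridge regression + min-max scaling + degree-3 polynomial features (rerun)
* Repeat the search with newly drawn cross-validation folds (the test split stays the same).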

In [10]:
kfold = []
for i in range(3):
  kfold.append(KFold(n_splits=10, shuffle=True, random_state=np.random.randint(1, 10000)))

In [11]:
scaler = MinMaxScaler()
poly = PolynomialFeatures(3)

for alpha in 10. ** np.arange(-8, -2):
  reg = Ridge(alpha=alpha, random_state=123)
  print(f"---- Ridge regression for alpha={alpha:.2e}")
  rmses = []
  for i in tqdm(range(len(kfold))):
    for train_index, val_index in kfold[i].split(X_train):
      _X_train, _X_val = X_train.iloc[train_index], X_train.iloc[val_index]
      _y_train, _y_val = y_train.iloc[train_index], y_train.iloc[val_index]
      total_bedrooms_median = _X_train["total_bedrooms"].median()
      _X_train = _X_train.fillna({'total_bedrooms': total_bedrooms_median})
      _X_val = _X_val.fillna({'total_bedrooms': total_bedrooms_median})
      _X_train = poly.fit_transform(scaler.fit_transform(_X_train))
      _X_val = poly.transform(scaler.transform(_X_val))
      reg.fit(_X_train, _y_train)
      y_val_pred = reg.predict(_X_val)
      y_val_pred[y_val_pred > _y_train.max()] = _y_train.max()
      rmse = root_mean_squared_error(np.exp(_y_val), np.exp(y_val_pred))
      rmses.append(rmse)
  rmses = np.array(rmses)
  print(f'mean RMSE: {rmses.mean():.1f} ({rmses.std():.1f})')

---- Ridge regression for alpha=1.00e-08
mean RMSE: 56806.8 (1586.8)
---- Ridge regression for alpha=1.00e-07
mean RMSE: 56738.3 (1572.1)
---- Ridge regression for alpha=1.00e-06
mean RMSE: 56690.8 (1474.0)
---- Ridge regression for alpha=1.00e-05
mean RMSE: 56763.6 (1455.0)
---- Ridge regression for alpha=1.00e-04
mean RMSE: 56980.5 (1511.6)
---- Ridge regression for alpha=1.00e-03
mean RMSE: 57358.2 (1588.9)


* `alpha=1.0e-06` looks like a good choice: it gives the lowest mean RMSE in both degree-3 runs (56811.4 and 56690.8).
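
The selection can also be read off programmatically. The sketch below copies the mean RMSE values from the two degree-3 runs into dictionaries (the names `run1` and `run2` are illustrative) and picks the `alpha` with the smallest mean in each:

```python
# Mean cross-validated RMSE per alpha, copied from the two degree-3 runs.
run1 = {1e-8: 56920.9, 1e-7: 56860.7, 1e-6: 56811.4,
        1e-5: 56883.9, 1e-4: 57019.3, 1e-3: 57398.9}
run2 = {1e-8: 56806.8, 1e-7: 56738.3, 1e-6: 56690.8,
        1e-5: 56763.6, 1e-4: 56980.5, 1e-3: 57358.2}

best1 = min(run1, key=run1.get)
best2 = min(run2, key=run2.get)
print(best1, best2)  # prints: 1e-06 1e-06
```

The gaps between neighboring `alpha` values (a few tens of RMSE units) are much smaller than the fold-to-fold standard deviation (roughly 1500), so the minimum is shallow and nearby values would likely perform about as well.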

## Evaluate the tuned prediction method on the test data

In [12]:
scaler = MinMaxScaler()
poly = PolynomialFeatures(3)

alpha = 10. ** -6
reg = Ridge(alpha=alpha, random_state=123)
total_bedrooms_median = X_train["total_bedrooms"].median()
_X_train = X_train.fillna({'total_bedrooms': total_bedrooms_median})
_X_test = X_test.fillna({'total_bedrooms': total_bedrooms_median})
_X_train = poly.fit_transform(scaler.fit_transform(_X_train))
_X_test = poly.transform(scaler.transform(_X_test))
reg.fit(_X_train, y_train)
y_test_pred = reg.predict(_X_test)
y_test_pred[y_test_pred > y_train.max()] = y_train.max()
rmse = root_mean_squared_error(np.exp(y_test), np.exp(y_test_pred))
print(f'test RMSE: {rmse:.1f}')

test RMSE: 58992.5


### Check the test performance of the baseline (plain linear regression with no extra processing).

In [13]:
# Reload the data so that the target is back on its original (non-log) scale.
df = pd.read_csv("housing.csv")
df_onehot = pd.get_dummies(df, dtype=int)
X = df_onehot.drop('median_house_value', axis=1)
y = df_onehot["median_house_value"].copy()
# Same test split as before (identical test_size and random_state).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

reg = LinearRegression()
# Impute missing total_bedrooms with the training-set median.
total_bedrooms_median = X_train["total_bedrooms"].median()
_X_train = X_train.fillna({'total_bedrooms': total_bedrooms_median})
_X_test = X_test.fillna({'total_bedrooms': total_bedrooms_median})
reg.fit(_X_train, y_train)
y_test_pred = reg.predict(_X_test)
# Cap predictions at the largest training target, as in the tuned model.
y_test_pred[y_test_pred > y_train.max()] = y_train.max()
rmse = root_mean_squared_error(y_test, y_test_pred)
print(f'baseline test RMSE: {rmse:.1f}')

baseline test RMSE: 69077.4


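* Compared with this baseline, the tuned model is considerably better: the test RMSE drops from 69077.4 to 58992.5, a reduction of about 15%.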