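# Random Forest Model

This notebook builds and evaluates a random forest model on the preprocessed data.

- Target variable: `price_actual`
- Model: random forest (`RandomForestRegressor`)
- Evaluation metric: RMSE
- Hyperparameter tuning: Optuna with `TimeSeriesSplit` cross-validation (the results of an earlier grid search are kept in Section 4 for reference)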
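
For reference, RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as `price_actual`. A minimal pure-Python sketch follows; the helper name `rmse` and the sample numbers are illustrative only and do not come from the dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values for illustration only (errors of 3, 4 and 0).
example_rmse = rmse([50.0, 60.0, 70.0], [53.0, 56.0, 70.0])
print(example_rmse)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.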


## 1. Importing Libraries and Loading Data

In [1]:
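# Check that the project files exist
!ls -a ../..
# Activate the virtual environment
# (each `!` command runs in its own subshell, so this does not affect the notebook kernel)
!source ../../.venv/bin/activate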

[34m.[m[m                   [34m.git[m[m                main.py             uv.lock
[34m..[m[m                  .gitignore          pyproject.toml
[34m.cursor[m[m             .python-version     README.md
.DS_Store           [34m.venv[m[m               [34msignate_smbc_202506[m[m


In [2]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
import optuna

# Data directory
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / 'data'
print(DATA_DIR)
# Load the preprocessed data
train = pd.read_csv(DATA_DIR / 'train_processed.csv')
test = pd.read_csv(DATA_DIR / 'test_processed.csv')

print('train shape:', train.shape)
print('test shape:', test.shape)

/Users/m0122wt/Desktop/02.プライベート/01.ノウハウ/07.データ分析/notebook/signate_smbc_202506/data
train shape: (26280, 114)
test shape: (8760, 113)


## 2. Setting the Features and Target Variable

In [3]:
# Target variable
target_col = 'price_actual'

# Explanatory variables (every column except the target and `time`)
drop_cols = ['time', target_col] if target_col in train.columns else ['time']
feature_cols = [col for col in train.columns if col not in drop_cols]

X = train[feature_cols]
y = train[target_col] if target_col in train.columns else train.iloc[:, -1]  # fallback, just in case

print('Features:', feature_cols)
print('Target:', target_col)
print('X shape:', X.shape)
print('y shape:', y.shape)

Features: ['generation_biomass', 'generation_fossil_brown_coal/lignite', 'generation_fossil_gas', 'generation_fossil_hard_coal', 'generation_fossil_oil', 'generation_hydro_pumped_storage_consumption', 'generation_hydro_run_of_river_and_poundage', 'generation_hydro_water_reservoir', 'generation_nuclear', 'generation_other', 'generation_other_renewable', 'generation_solar', 'generation_waste', 'generation_wind_onshore', 'total_load_actual', 'valencia_temp', 'valencia_temp_min', 'valencia_temp_max', 'valencia_pressure', 'valencia_humidity', 'valencia_wind_speed', 'valencia_wind_deg', 'valencia_rain_1h', 'valencia_rain_3h', 'valencia_snow_3h', 'valencia_clouds_all', 'valencia_weather_id', 'valencia_weather_main', 'valencia_weather_description', 'valencia_weather_icon', 'madrid_temp', 'madrid_temp_min', 'madrid_temp_max', 'madrid_pressure', 'madrid_humidity', 'madrid_wind_speed', 'madrid_wind_deg', 'madrid_rain_1h', 'madrid_rain_3h', 'madrid_snow_3h', 'madrid_clouds_all', 'madrid_weather_id

## 3. Train/Validation Split

In [4]:
# Hyperparameter search with Bayesian optimization (Optuna), scored by time-series cross-validation
tscv = TimeSeriesSplit(n_splits=5)
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 300),
        'max_depth': trial.suggest_int('max_depth', 5, 30),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 10),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 5),
        'random_state': 42
    }
    rmses = []
    for train_idx, valid_idx in tscv.split(X):
        X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]
        model = RandomForestRegressor(**params)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_valid)
        rmse = root_mean_squared_error(y_valid, y_pred)
        rmses.append(rmse)
    return np.mean(rmses)
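
`TimeSeriesSplit` always trains on earlier rows and validates on the block that follows, so no fold sees the future. The sketch below reproduces the fold boundaries in plain Python, assuming default settings (no `gap`, no `max_train_size`, no explicit `test_size`); the helper `expanding_window_folds` is hypothetical and is not part of scikit-learn.

```python
def expanding_window_folds(n_samples, n_splits):
    """Return (train_end, valid_end) bounds for each fold.

    Fold k trains on rows [0, train_end) and validates on rows
    [train_end, valid_end). Assumes each validation block holds
    n_samples // (n_splits + 1) rows and that the blocks are taken
    from the end of the data.
    """
    valid_size = n_samples // (n_splits + 1)
    first_valid_start = n_samples - n_splits * valid_size
    folds = []
    for k in range(n_splits):
        train_end = first_valid_start + k * valid_size
        folds.append((train_end, train_end + valid_size))
    return folds

# 26280 training rows and 5 splits, as in this notebook.
folds = expanding_window_folds(26280, 5)
for train_end, valid_end in folds:
    print(f"train rows [0, {train_end}) -> valid rows [{train_end}, {valid_end})")
```

Under these assumptions every validation block has 4380 rows, while the training window grows from 4380 to 21900 rows, so the early folds are fitted on much less history than the later ones.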

## 4. Training the Random Forest Model and Tuning Hyperparameters

In [None]:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=30)
print('Best params:', study.best_params)
print('Best CV RMSE:', study.best_value)
best_params = study.best_params
best_params['random_state'] = 42
best_model = RandomForestRegressor(**best_params)
best_model.fit(X, y)

[I 2025-06-19 22:32:34,246] A new study created in memory with name: no-name-682b73ec-f086-4b92-93ad-2f3b748e6876


Hyperparameter settings  
Random forest: grid search results  
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

Best parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}  
Best score: 19.873329344971545  
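
The grid above contains 2 × 3 × 2 × 2 = 24 parameter combinations. As a quick check, they can be enumerated with `itertools.product`; this sketch only lists the candidates and does not fit any model.

```python
from itertools import product

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}

# Build one dict per combination, keeping the key order of param_grid.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# The best combination reported above, copied verbatim.
reported_best = {'max_depth': None, 'min_samples_leaf': 1,
                 'min_samples_split': 2, 'n_estimators': 200}

print(len(candidates))
print(reported_best in candidates)
```

The reported best combination uses the largest `n_estimators` value in the grid and an unrestricted depth.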

## 5. Prediction and Evaluation with the Best Model

In [None]:
# Drop the bottom 20% of features by importance and retrain
importances = best_model.feature_importances_
threshold = np.percentile(importances, 20)
selected_features = [f for f, imp in zip(feature_cols, importances) if imp > threshold]
best_model.fit(X[selected_features], y)
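
The selection above keeps only features whose importance is strictly greater than the 20th percentile. The following plain-Python sketch mimics that rule with linear interpolation between ranks (the default behavior of `np.percentile`); the helper `percentile_linear`, the feature names, and the importance values are made up for illustration.

```python
def percentile_linear(values, q):
    """q-th percentile (0-100) using linear interpolation between sorted values."""
    ordered = sorted(values)
    rank = (q / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (rank - lower) * (ordered[upper] - ordered[lower])

# Hypothetical importances, not taken from the fitted model.
importances = {'feat_a': 0.40, 'feat_b': 0.30, 'feat_c': 0.15,
               'feat_d': 0.10, 'feat_e': 0.05}

threshold = percentile_linear(importances.values(), 20)
selected = [name for name, imp in importances.items() if imp > threshold]
print(threshold, selected)
```

Because the comparison is strict, features tied exactly at the threshold are dropped too, so more than 20% can be removed when many importances are equal (for example, several zeros).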


## 6. Predicting on the Test Data and Saving

In [None]:
# Predict on the test data and write the submission file (the format must be followed strictly)
X_test = test[selected_features]
test_pred = best_model.predict(X_test)
submission = test[['time']].copy()
submission['price_actual_pred'] = test_pred
assert submission.iloc[0,0] == '2018-01-01 00:00:00+01:00', 'Row 1, column 1 does not meet the requirement'
submission.to_csv(DATA_DIR / 'submission_random_forest_v2.csv', index=False, header=False)
print('Saved: submission_random_forest_v2.csv')
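
The submission is written without a header, and the assertion above pins the first cell to `2018-01-01 00:00:00+01:00`. As a hedged sketch, a file in this layout could be re-read and checked with the standard library alone. The helper `check_submission` is hypothetical, the two sample rows are invented, and an in-memory string stands in for the saved file.

```python
import csv
import io

def check_submission(text, expected_first_time, expected_rows):
    """Validate a headerless two-column CSV: timestamp, numeric prediction."""
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) != expected_rows:
        return False
    for row in rows:
        # Every row needs exactly a timestamp and a value that parses as a float.
        if len(row) != 2:
            return False
        try:
            float(row[1])
        except ValueError:
            return False
    return rows[0][0] == expected_first_time

# Invented two-row sample in the same layout as the submission file.
sample = "2018-01-01 00:00:00+01:00,52.31\n2018-01-01 01:00:00+01:00,49.87\n"
first_time = '2018-01-01 00:00:00+01:00'

ok = check_submission(sample, first_time, expected_rows=2)
with_header = check_submission("time,price_actual_pred\n" + sample, first_time, expected_rows=2)
print(ok, with_header)
```

For the real file, the expected row count would presumably be 8760, matching the test shape printed in Section 1.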


In [None]:
submission