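# Baseline Model (Linear Regression)

This notebook builds a baseline linear regression model on the preprocessed data and evaluates it on a held-out validation set.

- Target variable: `price_actual`
- Model: linear regression (`LinearRegression`)
- Evaluation metric: RMSE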
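
For reference, RMSE is the square root of the mean squared difference between observed and predicted values. The sketch below computes it with NumPy; the helper name `rmse` and the numbers are illustrative assumptions, not values from this dataset.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Illustrative values only (errors of -2, 2, -1 give a mean squared error of 3)
example_rmse = rmse([50.0, 60.0, 70.0], [52.0, 58.0, 71.0])
print(example_rmse)
```

RMSE is expressed in the same units as the target, so here it reads directly as a typical error in `price_actual`.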


## 1. Importing Libraries and Loading Data

In [10]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Data directory
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / 'data'

# Load the preprocessed data
train = pd.read_csv(DATA_DIR / 'train_processed.csv')
test = pd.read_csv(DATA_DIR / 'test_processed.csv')

print('train shape:', train.shape)
print('test shape:', test.shape)

train shape: (26280, 113)
test shape: (8760, 112)


## 2. Setting Features and Target

In [11]:
# Target variable
target_col = 'price_actual'

# Features: every column except the target and `time`
drop_cols = ['time', target_col] if target_col in train.columns else ['time']
feature_cols = [col for col in train.columns if col not in drop_cols]

X = train[feature_cols]
y = train[target_col] if target_col in train.columns else train.iloc[:, -1]  # fall back to the last column, just in case

print('Features:', feature_cols)
print('Target:', target_col)
print('X shape:', X.shape)
print('y shape:', y.shape)

Features: ['generation_biomass', 'generation_fossil_brown_coal/lignite', 'generation_fossil_gas', 'generation_fossil_hard_coal', 'generation_fossil_oil', 'generation_hydro_pumped_storage_consumption', 'generation_hydro_run_of_river_and_poundage', 'generation_hydro_water_reservoir', 'generation_nuclear', 'generation_other', 'generation_other_renewable', 'generation_solar', 'generation_waste', 'generation_wind_onshore', 'total_load_actual', 'valencia_temp', 'valencia_temp_min', 'valencia_temp_max', 'valencia_pressure', 'valencia_humidity', 'valencia_wind_speed', 'valencia_wind_deg', 'valencia_rain_1h', 'valencia_rain_3h', 'valencia_snow_3h', 'valencia_clouds_all', 'valencia_weather_id', 'valencia_weather_main', 'valencia_weather_description', 'valencia_weather_icon', 'madrid_temp', 'madrid_temp_min', 'madrid_temp_max', 'madrid_pressure', 'madrid_humidity', 'madrid_wind_speed', 'madrid_wind_deg', 'madrid_rain_1h', 'madrid_rain_3h', 'madrid_snow_3h', 'madrid_clouds_all', 'madrid_weather_id
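
Several names in the feature list above, such as `valencia_weather_main` and `valencia_weather_description`, look like text columns rather than numeric ones. A quick way to check is to list the non-numeric columns before fitting. The sketch below uses a small toy frame in place of `train`; the column names are borrowed from the list above, but the values are invented.

```python
import pandas as pd

# Toy frame standing in for `train` (values are invented for illustration)
toy = pd.DataFrame({
    'total_load_actual': [25000.0, 26000.0, 24000.0],
    'valencia_temp': [280.1, 281.3, 279.8],
    'valencia_weather_main': ['clear', 'clouds', 'clear'],
})

# Columns whose dtype is not numeric cannot be passed to LinearRegression as-is
non_numeric_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(non_numeric_cols)
```

Running the same `select_dtypes` call on `X` would show which features need encoding or removal.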

## 3. Train/Validation Split

In [12]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
print('X_train:', X_train.shape, 'X_valid:', X_valid.shape)

X_train: (21024, 112) X_valid: (5256, 112)
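
Because the data carries a `time` column, a shuffled split can place validation rows between training rows in time, which may make the validation score look better than performance on future data. One alternative is a chronological split. The sketch below assumes the rows can be sorted by `time`; the toy frame and the names `train_part` and `valid_part` are illustrative, not part of this notebook.

```python
import numpy as np
import pandas as pd

# Toy hourly frame standing in for `train` (values are invented for illustration)
start = pd.to_datetime('2015-01-01')
toy = pd.DataFrame({
    'time': start + pd.to_timedelta(np.arange(10), unit='h'),
    'total_load_actual': np.arange(10, dtype=float),
    'price_actual': np.arange(10, dtype=float) * 2.0,
})

# Sort by time, then hold out the most recent 20% of rows for validation
toy = toy.sort_values('time').reset_index(drop=True)
split_idx = int(len(toy) * 0.8)

train_part = toy.iloc[:split_idx]
valid_part = toy.iloc[split_idx:]
print(train_part.shape, valid_part.shape)
```

With this split, every validation timestamp is later than every training timestamp.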


## 4. Training the Linear Regression Model and Predicting

In [13]:
model = LinearRegression()
model.fit(X_train, y_train)

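# Predict on the validation data
y_pred = model.predict(X_valid)

# Evaluate with RMSE (square root of the MSE; the `squared` argument was deprecated in scikit-learn 1.4)
rmse = np.sqrt(mean_squared_error(y_valid, y_pred))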
print('Validation RMSE:', rmse)

ValueError: could not convert string to float: 'clear'
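
The error indicates that at least one feature column holds strings such as `'clear'`, which `LinearRegression` cannot consume directly. One possible fix is to one-hot encode the text columns inside a scikit-learn pipeline. The sketch below is not the notebook's own method: it runs on toy data, and the names `X_toy`, `y_toy`, `categorical_cols`, and `pipeline` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data standing in for X_train / y_train (values are invented for illustration)
X_toy = pd.DataFrame({
    'total_load_actual': [25000.0, 26000.0, 24000.0, 27000.0, 25500.0, 26500.0],
    'valencia_weather_main': ['clear', 'clouds', 'clear', 'rain', 'clouds', 'clear'],
})
y_toy = np.array([50.0, 55.0, 48.0, 60.0, 53.0, 56.0])

# Treat every non-numeric column as categorical
categorical_cols = X_toy.select_dtypes(exclude='number').columns.tolist()

# One-hot encode the categorical columns and pass numeric columns through unchanged
preprocess = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='passthrough',
)

pipeline = Pipeline([('preprocess', preprocess), ('model', LinearRegression())])
pipeline.fit(X_toy, y_toy)

# A category unseen during fitting ('snow') is encoded as all zeros instead of raising
X_new = pd.DataFrame({'total_load_actual': [25200.0], 'valencia_weather_main': ['snow']})
toy_pred = pipeline.predict(X_new)
print(toy_pred.shape)
```

Applied to this notebook, the same pipeline could be fitted on `X_train` and then used for both `X_valid` and `X_test`, so that the encoding learned from the training rows is reused unchanged. Dropping the text columns is a simpler alternative if they are redundant with numeric ones such as `valencia_weather_id`.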

## 5. Predicting on the Test Data and Saving

In [None]:
# Predict on the test data
X_test = test[feature_cols]
test_pred = model.predict(X_test)

# Save the predictions
submission = test[['time']].copy()
submission['price_actual_pred'] = test_pred
submission.to_csv(DATA_DIR / 'submission_baseline.csv', index=False)
print('Saved: submission_baseline.csv')