# Phase 2: Train / Validation / Test Strategy

## Why time-based splitting?
This is a time-series regression problem, so the data must be split **chronologically**.  
Random shuffling is avoided because it causes **data leakage**: rows from the future end up in the training set and influence the model.
- Hyperparameter tuning with `GridSearchCV` or `RandomizedSearchCV` was intentionally left out. With their default k-fold cross-validation, some folds train on data that comes after the fold being scored, which leaks temporal information. Leaving tuning out also keeps the focus on interpretability and methodological correctness.
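
To make the leakage argument concrete, here is a minimal sketch on a toy timeline. The 100 integer time steps and the helper `trains_on_future` are hypothetical and are not part of this project's code. It only checks whether any training step is later than the earliest held-out step.

```python
import random

# Hypothetical toy timeline: 100 consecutive time steps, 0 = oldest, 99 = newest.
timeline = list(range(100))
cut = int(len(timeline) * 0.7)

# Chronological split: the first 70 steps train, the last 30 are held out.
chrono_train, chrono_holdout = timeline[:cut], timeline[cut:]

# Shuffled split: same sizes, but steps are assigned at random (fixed seed).
shuffled = timeline[:]
random.Random(0).shuffle(shuffled)
random_train, random_holdout = shuffled[:cut], shuffled[cut:]


def trains_on_future(train, holdout):
    """True if any training step is later than the earliest held-out step."""
    return max(train) > min(holdout)


print("chronological trains on future:", trains_on_future(chrono_train, chrono_holdout))
print("shuffled trains on future:", trains_on_future(random_train, random_holdout))
```

A chronological cut can never train on the future by construction, while a shuffled split of the same size almost always does.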


## Split Plan
- **Train:** first 70% of the timeline  
- **Validation:** next 15%  
- **Test:** final 15%  

The test set remains untouched until the final evaluation.
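
The plan above can be sketched as a split by row position after sorting on the time column. The function `time_split_sketch` and its parameters `train_frac` and `val_frac` are assumptions made for illustration. The project's own `time_split` is not shown in this notebook and may work differently.

```python
import pandas as pd


def time_split_sketch(df, time_col, train_frac=0.70, val_frac=0.15):
    """Split df into chronological train/val/test frames by row position.

    Hypothetical stand-in for illustration; not the project's time_split.
    """
    if not 0 < train_frac < 1 or not 0 < val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")

    # Sort oldest to newest; a stable sort keeps the original order of ties.
    ordered = df.sort_values(time_col, kind="mergesort").reset_index(drop=True)

    # Cut points are truncated cumulative fractions of the row count.
    n = len(ordered)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))

    train = ordered.iloc[:train_end]
    val = ordered.iloc[train_end:val_end]
    test = ordered.iloc[val_end:]  # whatever remains is the test set
    return train, val, test


# Toy data: 100 daily rows in reverse order, to show that sorting happens first.
toy = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=100, freq="D")[::-1],
    "value": range(100),
})
train, val, test = time_split_sketch(toy, time_col="Date")
print("train:", len(train), "val:", len(val), "test:", len(test))
```

With these cut points, 51864 rows would give 36304 / 7780 / 7780, which agrees with the counts printed further down. That agreement is only consistent with the sketch and does not confirm how `time_split` is implemented.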


In [2]:
import sys
from pathlib import Path

ROOT = Path.cwd().parent          # notebooks -> repo root
SRC = ROOT / "src"

if str(SRC) not in sys.path:
    sys.path.append(str(SRC))

print("Using SRC:", SRC)


Using SRC: c:\ML\test_project\project-regression\src


In [3]:
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge

from energy_forecast.io import load_data
from energy_forecast.split import time_split
from energy_forecast.evaluate import root_mean_squared_error


In [5]:
df = load_data("../data/Energy Production Dataset.csv", date_col="Date")
train_df, val_df, test_df = time_split(df, time_col="Date")  # defaults to 70/15/15

print("train:", len(train_df), "val:", len(val_df), "test:", len(test_df))



train: 36304 val: 7780 test: 7780


Time ranges for each split:

In [6]:
print("Train max:", train_df["Date"].max())
print("Val min/max:", val_df["Date"].min(), val_df["Date"].max())
print("Test min:", test_df["Date"].min())


Train max: 2024-02-21 00:00:00
Val min/max: 2024-02-21 00:00:00 2025-01-10 00:00:00
Test min: 2025-01-10 00:00:00
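
In the output above, the last training date equals the first validation date, and the last validation date equals the first test date. Rows from those two dates therefore sit on both sides of a cut. Whether that matters depends on whether the raw data has a finer timestamp than the printed `Date`, which is not visible here. The sketch below shows one possible way to report such shared boundaries and to move a cut forward so that it starts on a new date. Both helpers, `boundary_report` and `snap_cut_to_new_date`, are hypothetical and are not part of `energy_forecast`.

```python
import pandas as pd


def boundary_report(train, val, test, time_col):
    """Describe how adjacent chronological splits meet at their edges."""
    report = {}
    for name, earlier, later in [("train/val", train, val), ("val/test", val, test)]:
        last, first = earlier[time_col].max(), later[time_col].min()
        if last < first:
            report[name] = "disjoint"
        elif last == first:
            report[name] = "shared boundary"
        else:
            report[name] = "overlap"
    return report


def snap_cut_to_new_date(ordered, time_col, cut):
    """Move a row-position cut forward until it starts a new time value.

    Assumes `ordered` is already sorted by time_col.
    """
    times = ordered[time_col]
    while 0 < cut < len(times) and times.iloc[cut] == times.iloc[cut - 1]:
        cut += 1
    return cut


# Toy data: 4 rows per day for 5 days, already sorted.
days = pd.date_range("2024-01-01", periods=5, freq="D").repeat(4)
toy = pd.DataFrame({"Date": days, "value": range(len(days))})

# A cut at row 10 lands in the middle of day 3, so both sides share that date.
print(boundary_report(toy.iloc[:10], toy.iloc[10:14], toy.iloc[14:], "Date"))

# Snapping the cuts forward to rows 12 and 16 puts whole days on each side.
c1 = snap_cut_to_new_date(toy, "Date", 10)
c2 = snap_cut_to_new_date(toy, "Date", 14)
print(boundary_report(toy.iloc[:c1], toy.iloc[c1:c2], toy.iloc[c2:], "Date"))
```

Snapping a cut changes the split sizes slightly, so the resulting proportions are only approximately 70/15/15.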


## Phase 2 Summary
- The dataset was split **chronologically** into train (70%), validation (15%), and test (15%) sets.
- This splitting strategy prevents future information from leaking into training.
- The test set is kept untouched for final model evaluation.
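- Date ranges were checked: the splits are in chronological order. Adjacent splits share their boundary date (2024-02-21 and 2025-01-10), so rows from those two dates fall on both sides of a cut.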


---