# Phase 2 — Time-Aware Train / Validation / Test Split

In time series forecasting, how the data is split determines whether the evaluation is realistic or misleading.

Unlike standard ML tasks, where rows can be treated as interchangeable, forecasting requires that we **preserve chronological order** to prevent information leakage.

---

## Why Random Splitting Is Wrong Here

Using scikit-learn's `train_test_split` with `shuffle=True` (its default) would:
- mix past and future data
- allow the model to "see" future patterns during training
- artificially inflate performance

This would invalidate the entire forecasting setup.

Therefore, we apply a strict chronological split.
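
The effect is easy to reproduce on a toy series. The sketch below mimics a shuffled split using only the standard library (it does not call scikit-learn, and the dates, seed, and 70% cut are illustrative assumptions) and compares it with a chronological cut.

```python
import random
from datetime import date, timedelta

# Toy daily series: 100 consecutive dates (illustrative, not the project data).
dates = [date(2024, 1, 1) + timedelta(days=i) for i in range(100)]
cut = int(len(dates) * 0.7)

# Shuffled split: permute the rows, then cut.
rng = random.Random(0)
shuffled = dates[:]
rng.shuffle(shuffled)
shuffled_train, shuffled_test = shuffled[:cut], shuffled[cut:]

# Chronological split: cut the rows in time order.
chrono_train, chrono_test = dates[:cut], dates[cut:]

# Leakage indicator: does training contain dates later than the earliest test date?
shuffled_leaks = max(shuffled_train) > min(shuffled_test)
chrono_leaks = max(chrono_train) > min(chrono_test)

print("shuffled split leaks future dates:", shuffled_leaks)
print("chronological split leaks future dates:", chrono_leaks)
```

In the shuffled case the training rows extend past the earliest test row, which is exactly the situation a forecasting model never faces in production.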

---
## Implementation Note

The chronological splitting logic is implemented in a reusable module:

`src/energy_forecast/split.py`

This ensures:
- clean separation of concerns
- reusable data pipeline
- consistency across experiments

The notebook imports and uses this function rather than re-implementing splitting inline.

---

In [2]:
import sys
from pathlib import Path

ROOT = Path.cwd().parent          # notebooks -> repo root
SRC = ROOT / "src"

if str(SRC) not in sys.path:
    sys.path.append(str(SRC))

print("Using SRC:", SRC)


Using SRC: c:\ML\test_project\project-regression\src


In [3]:
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge

from energy_forecast.io import load_data
from energy_forecast.split import time_split
from energy_forecast.evaluate import root_mean_squared_error


## Splitting Strategy

We split the dataset in time order using:

- **70% → Training**
- **15% → Validation**
- **15% → Test**

### Why 70 / 15 / 15?

- Training set: the largest share, so the model sees enough history to learn long-term patterns
- Validation set: used for model comparison and tuning
- Test set: left untouched until final evaluation

The test set acts as a true simulation of future unseen data.
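
The source of `time_split` is not shown in this notebook, so the following is only a minimal sketch of how a positional 70/15/15 split could work. The name `time_split_sketch`, its parameters, and the toy data are assumptions for illustration, not the contents of `src/energy_forecast/split.py`.

```python
import pandas as pd


def time_split_sketch(df, time_col, train_frac=0.70, val_frac=0.15):
    """Hypothetical stand-in for time_split: sort by time, then cut by row position."""
    if not 0 < train_frac < 1 or not 0 < val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")

    # A stable sort keeps the original order of rows that share a timestamp.
    ordered = df.sort_values(time_col, kind="mergesort").reset_index(drop=True)

    n = len(ordered)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))

    return ordered.iloc[:train_end], ordered.iloc[train_end:val_end], ordered.iloc[val_end:]


# Toy frame with 20 daily rows (illustrative data only).
toy = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=20, freq="D"),
    "y": range(20),
})
train, val, test = time_split_sketch(toy, time_col="Date")
print(len(train), len(val), len(test))
```

Applied to 51,864 rows, this arithmetic gives 36,304 / 7,780 / 7,780, the same counts the notebook reports, although the real module may compute its boundaries differently. Because the cut is by row position, rows that share a timestamp can end up on both sides of a boundary.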

---

In [5]:
df = load_data("../data/Energy Production Dataset.csv", date_col="Date")
train_df, val_df, test_df = time_split(df, time_col="Date")  # defaults to 70/15/15

print("train:", len(train_df), "val:", len(val_df), "test:", len(test_df))



train: 36304 val: 7780 test: 7780


Time ranges for each split:

In [None]:
print("Train max:", train_df["Date"].max())
print("Val min/max:", val_df["Date"].min(), val_df["Date"].max())
print("Test min:", test_df["Date"].min())


Train max: 2024-02-21 00:00:00
Val min/max: 2024-02-21 00:00:00 2025-01-10 00:00:00
Test min: 2025-01-10 00:00:00


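- The boundaries are in chronological order: training ends on 2024-02-21, validation spans 2024-02-21 to 2025-01-10, and test starts on 2025-01-10.
- Each split begins on the same date on which the previous one ends, so rows from a boundary date fall on both sides of the cut. At the resolution of the `Date` column the ordering is non-decreasing rather than strictly increasing.
- No training row is dated later than any validation or test row, so no forward leakage is observed at that resolution.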

---

## Sanity Checks After Splitting
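
We check that:
- Training data ends no later than validation begins
- Validation ends no later than test begins
- Timestamps overlap only at the split boundaries (here the dates 2024-02-21 and 2025-01-10 appear on both sides of a cut)
- Target distribution remains consistent across splits

These checks guard against silent leakage.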
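
The ordering and overlap checks above can be automated. The helper below is a hypothetical sketch (`boundary_report` is not part of the project code) that separates non-strict ordering from strict ordering and lists any timestamps shared between splits. The toy data deliberately reproduces the shared-boundary situation seen in the printed time ranges.

```python
import pandas as pd


def boundary_report(train_df, val_df, test_df, time_col):
    """Hypothetical helper: summarize ordering and timestamp overlap between splits."""
    train_t, val_t, test_t = train_df[time_col], val_df[time_col], test_df[time_col]
    return {
        # Non-strict ordering: no split starts before the previous one ends.
        "ordered": bool(train_t.max() <= val_t.min() and val_t.max() <= test_t.min()),
        # Strict ordering: no timestamp appears on both sides of a cut.
        "strict": bool(train_t.max() < val_t.min() and val_t.max() < test_t.min()),
        # Timestamps present in more than one split.
        "shared": sorted(
            (set(train_t) & set(val_t)) | (set(val_t) & set(test_t)) | (set(train_t) & set(test_t))
        ),
    }


# Toy data: two rows per date, cut so that the splits share boundary dates.
toy = pd.DataFrame({"Date": pd.to_datetime(
    ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03"]
)})
report = boundary_report(toy.iloc[:3], toy.iloc[3:5], toy.iloc[5:], time_col="Date")
print(report)
```

If strict separation is required, one option is to cut on a timestamp value instead of a row position, so that all rows for a given date land in the same split.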


## Why Validation Is Critical in Time Series

Validation is used to:
- Compare baseline vs advanced models
- Tune hyperparameters
- Evaluate feature engineering impact

The test set is used **once** — only at the very end.
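
As a rough illustration of this workflow, the sketch below compares a mean baseline with a Ridge model on the validation rows only. The synthetic data, the row cut-offs, and the `alpha` value are assumptions. RMSE is computed from scikit-learn's `mean_squared_error` because the signature of the project's own `root_mean_squared_error` is not shown here.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic, time-ordered data (illustrative only): a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Chronological 70/15/15 cut by row position; the test rows stay untouched here.
X_train, y_train = X[:140], y[:140]
X_val, y_val = X[140:170], y[140:170]

candidates = {
    "baseline (mean)": DummyRegressor(strategy="mean"),
    "ridge": Ridge(alpha=1.0),
}

val_rmse = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    # RMSE computed as the square root of the MSE.
    val_rmse[name] = float(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))

best = min(val_rmse, key=val_rmse.get)
print(val_rmse, "-> selected:", best)
```

Only the selected model would then be scored on the test rows, and only once.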

---


## Phase 2 — Summary

We implemented a strict chronological split (70/15/15) to simulate real-world forecasting conditions.

By not shuffling and by preserving time order, we prevent future data from leaking into training through the split itself and establish a reliable validation framework.

The model can now be evaluated honestly on unseen future data.


---