# Phase 2 — Train / Validation / Test Strategy

## Why time-based splitting?
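This is a time-series regression problem, so the data must be split **chronologically**.
Random shuffling is avoided because it causes **data leakage**: rows from later dates would land in the training set, and the model would be evaluated on periods it has effectively already seen.
- Hyperparameter tuning via GridSearch or RandomizedSearch was intentionally left out. Their default k-fold cross-validation ignores temporal order, so some folds would train on later rows and validate on earlier ones. Skipping the search avoids that leakage and keeps the focus on interpretability and methodological correctness.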
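
If tuning is revisited later, one leakage-free option is to validate only on folds that lie after their training rows. The sketch below is not part of the original notebook: `expanding_window_folds` and its parameters are hypothetical names used for illustration, and scikit-learn's `TimeSeriesSplit` offers a comparable ready-made splitter.

```python
def expanding_window_folds(n_rows, n_folds):
    """Return (train_positions, valid_positions) range pairs in time order.

    Hypothetical helper: every validation block starts where its
    training block ends, so no fold trains on rows from its own future.
    """
    if n_folds < 1:
        raise ValueError("n_folds must be at least 1")
    block = n_rows // (n_folds + 1)  # rows per block
    if block < 1:
        raise ValueError("not enough rows for the requested number of folds")
    folds = []
    for k in range(1, n_folds + 1):
        train_end = k * block
        # The last fold absorbs any remainder rows.
        valid_end = n_rows if k == n_folds else train_end + block
        folds.append((range(0, train_end), range(train_end, valid_end)))
    return folds


# Toy example with 10 rows and 3 folds.
for train_pos, valid_pos in expanding_window_folds(n_rows=10, n_folds=3):
    print(list(train_pos), list(valid_pos))
```

The positions are plain row offsets, so they could be applied to the training frame with something like `train_df.iloc[list(train_pos)]`.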


## Split Plan
- **Train:** first 70% of the timeline  
- **Validation:** next 15%  
- **Test:** final 15%  

The test set remains untouched until the final evaluation.


In [15]:
# Split the data chronologically into train, validation, and test sets.
# Assumes `data` is already sorted by "Date" in ascending order.
len_data = len(data)

train_split = int(0.70 * len_data)  # end of the first 70% of rows
valid_split = int(0.85 * len_data)  # end of the next 15% of rows

train_df = data.iloc[:train_split].copy()
valid_df = data.iloc[train_split:valid_split].copy()
test_df  = data.iloc[valid_split:].copy()

print("train:", len(train_df), " val:", len(valid_df), " test:", len(test_df))

train: 36304  val: 7780  test: 7780
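
The cut-point arithmetic can be checked without the DataFrame. The following is a minimal sketch, assuming the same 70% and 85% cut fractions; `chronological_cut_points` and its parameter names are hypothetical, and the row count 51864 is simply the sum of the three sizes printed above.

```python
def chronological_cut_points(n_rows, train_frac=0.70, valid_end_frac=0.85):
    """Return (train_end, valid_end) row positions for a chronological split.

    Hypothetical helper: rows [0, train_end) are train, rows
    [train_end, valid_end) are validation, and the rest are test.
    """
    if not 0 < train_frac < valid_end_frac < 1:
        raise ValueError("expected 0 < train_frac < valid_end_frac < 1")
    train_end = int(train_frac * n_rows)
    valid_end = int(valid_end_frac * n_rows)
    return train_end, valid_end


n_rows = 36304 + 7780 + 7780  # total rows, from the printed split sizes
train_end, valid_end = chronological_cut_points(n_rows)
print("train:", train_end,
      "val:", valid_end - train_end,
      "test:", n_rows - valid_end)
```

The three sizes it prints match the output above, which confirms that the splits cover every row exactly once.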


- Time ranges for each split

In [16]:
print("Train:", train_df["Date"].min(), "→", train_df["Date"].max())
print("Val  :", valid_df["Date"].min(),   "→", valid_df["Date"].max())
print("Test :", test_df["Date"].min(),  "→", test_df["Date"].max())


Train: 2020-01-01 00:00:00 → 2024-02-21 00:00:00
Val  : 2024-02-21 00:00:00 → 2025-01-10 00:00:00
Test : 2025-01-10 00:00:00 → 2025-11-30 00:00:00
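
The ranges above share their boundary dates (2024-02-21 ends the training range and starts the validation range), which suggests that several rows carry the same date. If every date should belong to exactly one split, the cut position can be moved forward to the next date change. This is a minimal sketch on plain lists, assuming the dates are sorted in ascending order; `snap_to_next_date_change` is a hypothetical helper, not something the notebook defines.

```python
from datetime import date


def snap_to_next_date_change(dates, cut):
    """Move `cut` forward until dates[cut] differs from dates[cut - 1].

    Hypothetical helper. `dates` must be sorted in ascending order; the
    returned position never separates rows that share one date.
    """
    if not 0 < cut <= len(dates):
        raise ValueError("cut must be between 1 and len(dates)")
    while cut < len(dates) and dates[cut] == dates[cut - 1]:
        cut += 1
    return cut


# Toy data: three rows per day, with a cut that lands in the middle of a day.
toy_dates = (
    [date(2024, 2, 20)] * 3
    + [date(2024, 2, 21)] * 3
    + [date(2024, 2, 22)] * 3
)
cut = snap_to_next_date_change(toy_dates, 4)
print(cut, toy_dates[cut - 1], toy_dates[cut])
```

In the notebook this could be applied as `train_split = snap_to_next_date_change(data["Date"].tolist(), train_split)` before slicing, at the cost of split sizes that deviate slightly from 70/15/15.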


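- Check for row overlaps. The check compares index labels, so it assumes the index increases with row position. It does not compare dates: the boundary dates 2024-02-21 and 2025-01-10 each appear in two adjacent splits.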

In [17]:
assert train_df.index.max() < valid_df.index.min(), "train/val overlap detected!"
assert valid_df.index.max() < test_df.index.min(), "val/test overlap detected!"
print("no overlap detected")


no overlap detected


**Check for target drift:**
- Comparing the mean of `Production` across the splits shows whether production levels change over time.
- A steady increase in mean production from train to validation to test indicates temporal drift, which reinforces the need for time-aware splitting and robust evaluation.


In [18]:
train_mean = train_df["Production"].mean()
valid_mean = valid_df["Production"].mean()
test_mean  = test_df["Production"].mean()

print("mean production -> train:", train_mean)
print("mean production -> val  :", valid_mean)
print("mean production -> test :", test_mean)

mean production -> train: 5949.373429925077
mean production -> val  : 6608.822493573265
mean production -> test : 7061.143316195373
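
To put the drift on a common scale, the means can be expressed as a change relative to the training mean. This is a small sketch using the values printed above; `relative_shift` is a hypothetical helper name.

```python
def relative_shift(baseline, value):
    """Return the fractional change of `value` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline


# Means copied from the output above.
train_mean = 5949.373429925077
valid_mean = 6608.822493573265
test_mean = 7061.143316195373

print(f"val  vs train: {relative_shift(train_mean, valid_mean):+.1%}")
print(f"test vs train: {relative_shift(train_mean, test_mean):+.1%}")
```

Mean production is roughly 11% higher in the validation period and roughly 19% higher in the test period than in the training period, so a model fitted on the training range alone is evaluated at levels it rarely saw.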


## Phase 2 Summary
- The dataset was split **chronologically** into train (70%), validation (15%), and test (15%) sets.
- This splitting strategy prevents future information from leaking into training.
- The test set is kept untouched for final model evaluation.
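- Row indices were verified to ensure no overlap between splits. The date ranges touch at the boundaries: 2024-02-21 and 2025-01-10 each appear in two adjacent splits.
- Mean production rises from about 5,949 (train) to 6,609 (validation) to 7,061 (test), which indicates temporal drift.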


---