# Phase 5 — Forecasting Feature Engineering (Lags & Rolling Statistics)

Baselines and simple pipelines confirm that the dataset has learnable structure.
Now we add forecasting-specific features that allow the model to use **recent history** safely.

This phase introduces:
- lag features (previous values of the target)
- rolling statistics (recent trend summaries)

In time-series regression, these features are often among the most effective additions, because the target's own recent values tend to be strong predictors of its next value.

---

## Why Lags Are Powerful

Energy production has strong short-term dependency:
- the previous hour often influences the next hour
- daily cycles repeat (24-hour periodicity)

Lag features make this explicit by providing the model with controlled memory.

Common choices:
- `lag_1`  → captures short-term continuity
- `lag_24` → captures daily seasonality
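
As a rough illustration of what these two lags look like in pandas, here is a minimal sketch on a toy hourly series. The toy data is made up, and the column names `lag_1` and `lag_24` are assumptions for illustration; the project's `add_lag_features` helper may name or build its columns differently.

```python
import pandas as pd

# Toy hourly series (illustrative values only, not the real dataset).
toy = pd.DataFrame(
    {"Production": range(100, 148)},
    index=pd.date_range("2024-01-01", periods=48, freq="h"),
)

# shift(k) moves values down by k rows, so row t holds the value from t-k.
toy["lag_1"] = toy["Production"].shift(1)    # previous hour
toy["lag_24"] = toy["Production"].shift(24)  # same hour, previous day

# The first rows have no history yet, so their lag values are NaN.
print(toy.iloc[[0, 1, 24, 25]])
```

The first row has no `lag_1`, and the first 24 rows have no `lag_24`, which is why rows with missing lags are dropped before training.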

---


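**⚠️ Important rule:**
- Build lag features either on the full, time-ordered dataset before splitting, or separately within each split.
- We build them on the full dataset and then re-split. This is safe because every feature looks only backward in time, so the first validation and test rows draw on earlier observations, never on future ones.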

In [1]:
import sys
from pathlib import Path
ROOT = Path.cwd().parent
SRC = ROOT / "src"
if str(SRC) not in sys.path:
    sys.path.append(str(SRC))

from energy_forecast.io import load_data
from energy_forecast.split import time_split
from energy_forecast.evaluate import root_mean_squared_error
from energy_forecast.features import add_lag_features, add_rolling_features

from sklearn.linear_model import Ridge


In [2]:
df = load_data("../data/Energy Production Dataset.csv", date_col="Date")
df.shape


(51864, 9)

## Rolling Statistics (Trend / Smoothing)

Rolling features summarize recent behavior, such as:
- rolling mean (e.g., last 24 hours)
- rolling median (optional)
- rolling std (optional, later)

These features help the model handle:
- gradual shifts
- noisy patterns
- changing volatility
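
A minimal sketch of rolling statistics in pandas, assuming a short window of 3 so the numbers are easy to check by hand (the notebook itself uses a 24-hour window). The column names are hypothetical, and `add_rolling_features` may be implemented differently.

```python
import pandas as pd

toy = pd.DataFrame({"Production": [10.0, 20.0, 30.0, 40.0, 50.0]})

# shift(1) first, so the window for row t covers t-3..t-1 and excludes row t itself.
past = toy["Production"].shift(1)
toy["roll_mean_3"] = past.rolling(window=3).mean()
toy["roll_std_3"] = past.rolling(window=3).std()

print(toy)
```

With this construction the first three rows are NaN, and row 3 summarizes the values 10, 20, 30 (mean 20) without touching its own target of 40.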

---


In [3]:
df_feat = add_lag_features(df, lags=(1, 24))
df_feat = add_rolling_features(df_feat, windows=(24,))
df_feat = df_feat.dropna().reset_index(drop=True)

df_feat.shape


(51840, 12)

### **Recreate train / val / test splits**

We must re-split because `dropna()` removed 24 rows (51,864 → 51,840), so the earlier split boundaries no longer line up with the data.
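
The internals of `time_split` are not shown in this notebook. The sketch below is a hypothetical stand-in that assumes a plain chronological cut with 70/15/15 proportions; the function name, parameters, and fractions are assumptions for illustration only.

```python
import pandas as pd

def chronological_split(frame, time_col, train_frac=0.70, val_frac=0.15):
    """Hypothetical stand-in for time_split: sort by time, then cut by position."""
    ordered = frame.sort_values(time_col).reset_index(drop=True)
    n = len(ordered)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return (
        ordered.iloc[:train_end],
        ordered.iloc[train_end:val_end],
        ordered.iloc[val_end:],
    )

toy = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=200, freq="h"),
    "Production": range(200),
})
train, val, test = chronological_split(toy, "Date")

# No shuffling: every split ends strictly before the next one begins.
assert train["Date"].max() < val["Date"].min()
assert val["Date"].max() < test["Date"].min()
print(len(train), len(val), len(test))
```

Cutting by position rather than sampling at random keeps all validation and test rows later in time than the training rows, which is what makes the evaluation resemble real forecasting.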

In [4]:
train_df, val_df, test_df = time_split(df_feat, time_col="Date")
print(len(train_df), len(val_df), len(test_df))


36288 7776 7776


### **Train a Ridge model with lag and rolling features**

In [5]:
TARGET = "Production"

def numeric_X(d):
    # Keep numeric feature columns only; the target is dropped so it cannot leak into X.
    return d.drop(columns=[TARGET]).select_dtypes(include="number")

X_train, y_train = numeric_X(train_df), train_df[TARGET]
X_val, y_val = numeric_X(val_df), val_df[TARGET]
X_test, y_test = numeric_X(test_df), test_df[TARGET]


In [6]:
ridge = Ridge(alpha=1.0, random_state=42)
ridge.fit(X_train, y_train)

val_rmse = root_mean_squared_error(y_val, ridge.predict(X_val))
test_rmse = root_mean_squared_error(y_test, ridge.predict(X_test))

print("Phase 5 RMSE (lag + rolling features)")
print("Val :", val_rmse)
print("Test:", test_rmse)


Phase 5 RMSE (lag + rolling features)
Val : 2855.410911539755
Test: 2920.1614005955084


## Leakage Prevention Rule

All lag/rolling features must be built so that at timestamp **t**,
they use only information from **t-1 and earlier**.

Implementation discipline:
- lags use `.shift(lag)`
- rolling windows must use `.shift(1)` before `.rolling(...)`

This ensures the model never sees the target value of the same hour it is predicting.
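
One way to sanity-check this rule is a perturbation test: change the target at a single row and confirm that the features at that row do not move. The sketch below assumes a hypothetical `build_features` helper that follows the two shifting rules above; it is not the project's actual `energy_forecast.features` code.

```python
import numpy as np
import pandas as pd

def build_features(series):
    """Hypothetical helper: lag and rolling features built with the shifting rules."""
    return pd.DataFrame({
        "lag_1": series.shift(1),
        "lag_24": series.shift(24),
        "roll_mean_24": series.shift(1).rolling(24).mean(),
    })

rng = np.random.default_rng(0)
y = pd.Series(rng.normal(size=100))

t = 60
y_changed = y.copy()
y_changed.iloc[t] = 1e6  # perturb only the target at row t

before = build_features(y).iloc[t]
after = build_features(y_changed).iloc[t]

# Leakage-free features at row t must be unaffected by the target at row t.
assert np.allclose(before, after)

# Counter-example: rolling without shift(1) includes row t, so it fails the same check.
leaky_before = y.rolling(24).mean().iloc[t]
leaky_after = y_changed.rolling(24).mean().iloc[t]
assert leaky_before != leaky_after
```

If the first assertion ever fails for a new feature, that feature is reading the value it is supposed to predict.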

---

## Phase 5 — Summary

We engineered forecasting-specific features using lag and rolling statistics.

Key additions:
- short-term memory via `lag_1`
- daily periodicity via `lag_24`
- trend smoothing via a 24-hour rolling mean

All features were created with strict shifting to prevent data leakage.
With these time-aware signals in place, the feature set is ready for more advanced regressors that can capture non-linear patterns and may further reduce RMSE.


---