# Phase 5 — Feature Engineering (Lag & Rolling Features)

In this phase, time-dependent features are introduced to explicitly capture temporal continuity in energy production.  
Lagged values and rolling statistics allow the model to leverage recent historical behavior, which is critical for time-series forecasting.

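**⚠️ Important rule:**
- Lag features must be created either on the full dataset before splitting, or separately within each split.
- Here they are created on the full dataset, which is then re-split. Lags only look backward in time, so this does not pass future values into earlier rows, and the first rows of the validation and test splits keep a complete lag history.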
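The difference between the two options can be seen on a toy series. This is a minimal sketch using plain pandas, not the project's helpers; the 10-row frame, the `lag_1` column name, and the split point at row 7 are illustrative assumptions.

```python
import pandas as pd

# Toy series; "Production" mirrors the notebook's target column name.
s = pd.DataFrame({"Production": range(10)})

# Option A: compute the lag on the full series, then take the last rows as "validation".
full = s.assign(lag_1=s["Production"].shift(1))
val_from_full = full.iloc[7:]

# Option B: split first, then compute the lag inside the split only.
val_only = s.iloc[7:].copy()
val_only["lag_1"] = val_only["Production"].shift(1)

print(val_from_full["lag_1"].tolist())  # [6.0, 7.0, 8.0]
print(val_only["lag_1"].tolist())       # [nan, 7.0, 8.0]
```

Both options use only past values, but option B loses the first row of each split to a missing lag, while option A keeps it.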

In [1]:
import sys
from pathlib import Path
ROOT = Path.cwd().parent
SRC = ROOT / "src"
if str(SRC) not in sys.path:
    sys.path.append(str(SRC))

from energy_forecast.io import load_data
from energy_forecast.split import time_split
from energy_forecast.evaluate import root_mean_squared_error
from energy_forecast.features import add_lag_features, add_rolling_features

from sklearn.linear_model import Ridge


In [2]:
df = load_data("../data/Energy Production Dataset.csv", date_col="Date")
df.shape


(51864, 9)

### Add lag and rolling features, then drop rows with NaNs introduced by lags

In [3]:
df_feat = add_lag_features(df, lags=(1, 24))
df_feat = add_rolling_features(df_feat, windows=(24,))
df_feat = df_feat.dropna().reset_index(drop=True)

df_feat.shape


(51840, 12)
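The implementations of `add_lag_features` and `add_rolling_features` are not shown in this notebook. The following is a minimal sketch of how such helpers could be written with `shift` and `rolling`; the function names, the generated column names, and the choice to shift by one row before rolling (so the window ends at the previous row and excludes the current target value) are assumptions, not the project's actual code.

```python
import pandas as pd

def add_lags_sketch(df, target="Production", lags=(1, 24)):
    """Hypothetical helper: add one column per lag of the target."""
    out = df.copy()
    for k in lags:
        out[f"{target}_lag_{k}"] = out[target].shift(k)
    return out

def add_rolling_sketch(df, target="Production", windows=(24,)):
    """Hypothetical helper: rolling mean over the previous `w` rows."""
    out = df.copy()
    for w in windows:
        # Shift first so the current row's target is not part of its own feature.
        out[f"{target}_roll_mean_{w}"] = out[target].shift(1).rolling(w).mean()
    return out

# Small demo with short lags/windows so the result is easy to check by hand.
toy = pd.DataFrame({"Production": [float(i) for i in range(30)]})
feat = add_rolling_sketch(add_lags_sketch(toy, lags=(1, 3)), windows=(3,))
clean = feat.dropna().reset_index(drop=True)

print(len(toy), len(clean))  # 30 27
print(clean.iloc[0].to_dict())
```

In the demo, the longest lag (3) determines how many leading rows are dropped, in the same way that the 24-row lag accounts for the 24 rows removed above.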

### **Recreate train / val / test splits**

The splits must be recreated from `df_feat`, because it contains the new feature columns and 24 fewer rows than the original data.

In [5]:
train_df, val_df, test_df = time_split(df_feat, time_col="Date")
print(len(train_df), len(val_df), len(test_df))


36288 7776 7776
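The printed sizes correspond to a 70 / 15 / 15 division of the 51,840 rows. How `time_split` works internally is not shown here; a minimal sketch of a chronological split with those proportions might look as follows, where the function name and the percentage parameters are assumptions.

```python
import pandas as pd

def time_split_sketch(df, time_col, train_pct=70, val_pct=15):
    """Hypothetical chronological split: oldest rows train, newest rows test."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    n = len(ordered)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return (
        ordered.iloc[:train_end],
        ordered.iloc[train_end:val_end],
        ordered.iloc[val_end:],
    )

toy = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=100, freq=pd.Timedelta(hours=1)),
    "Production": range(100),
})
tr, va, te = time_split_sketch(toy, "Date")
print(len(tr), len(va), len(te))  # 70 15 15
```

No shuffling takes place, so every validation timestamp is later than every training timestamp, and every test timestamp is later still.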


### **Train Ridge model with lag and rolling features**

In [6]:
TARGET = "Production"

def numeric_X(d):
    # Drop the target and keep numeric columns only (non-numeric columns such as dates are excluded).
    return d.drop(columns=[TARGET]).select_dtypes(include="number")

X_train, y_train = numeric_X(train_df), train_df[TARGET]
X_val, y_val = numeric_X(val_df), val_df[TARGET]
X_test, y_test = numeric_X(test_df), test_df[TARGET]


In [7]:
ridge = Ridge(alpha=1.0, random_state=42)
ridge.fit(X_train, y_train)

val_rmse = root_mean_squared_error(y_val, ridge.predict(X_val))
test_rmse = root_mean_squared_error(y_test, ridge.predict(X_test))

print("Phase 5 RMSE (lag + rolling features)")
print("Val :", val_rmse)
print("Test:", test_rmse)


Phase 5 RMSE (lag + rolling features)
Val : 2855.410911539755
Test: 2920.1614005955084
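For reference, `root_mean_squared_error` is imported from the project's `evaluate` module, whose code is not shown. Assuming it follows the standard definition, the metric can be sketched with NumPy as below; `rmse_sketch` is an illustrative name.

```python
import numpy as np

def rmse_sketch(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Residuals of -3 and +3 give a mean squared error of 9, so the RMSE is 3.
print(rmse_sketch([0, 0], [3, -3]))  # 3.0
```

RMSE is expressed in the same units as the target, so the values above can be read directly as a typical error in production units.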


## Phase 5 Summary — Feature Engineering

In this phase, lagged and rolling features were introduced to capture short-term temporal dependencies in energy production.

Key outcomes:
- Lag features (1-hour and 24-hour) and a 24-hour rolling mean were added, bringing the feature set from 9 to 12 columns.
- The 24 rows with lag-induced missing values were removed (from 51,864 to 51,840 rows) to maintain data integrity.
- The Ridge Regression model was retrained using the enhanced feature set.
- Validation RMSE reached about 2855.4 (test: about 2920.2), a further improvement over earlier phases that demonstrates the importance of time-dependent feature engineering for this task.

This phase confirms that historical production behavior is a strong predictor of future output and provides a solid foundation for introducing non-linear models and advanced tuning in subsequent phases.


---