# Phase 3 — Baseline Models & Initial Feature Definition

This phase establishes baseline performance for the energy production forecasting task using simple and interpretable models.  
The goal is to create a reference point that all subsequent models and feature engineering efforts must outperform.

## Objectives
- Define feature and target variables explicitly
- Evaluate naive and simple baseline models
- Establish realistic performance expectations
- Avoid complex preprocessing, encoding, or tuning at this stage

**At this stage, only a minimal set of numerical time-based features is used to establish a clean and interpretable baseline. Additional features and preprocessing are introduced in later phases.**


In [1]:
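import sys
from pathlib import Path

# The notebook is assumed to run from a subdirectory of the project, so the
# project root is one level up and the package code lives in <root>/src.
ROOT = Path.cwd().parent
SRC = ROOT / "src"
if str(SRC) not in sys.path:
    sys.path.append(str(SRC))  # makes `energy_forecast` importable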



**Note on feature selection:**  
The `End_Hour` feature is intentionally excluded at this stage because it is highly correlated with `Start_Hour` (typically representing the next hour). Including both would add redundancy without providing meaningful additional signal for baseline models. A minimal and interpretable feature set is preferred in this phase. The inclusion of `End_Hour` is revisited in later phases when using more flexible models and feature importance analysis.
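The redundancy argument can be checked directly. The sketch below uses synthetic hours, not the project data, and assumes that `End_Hour` is always the hour after `Start_Hour`, wrapping from 23 to 0. On the real dataset the same check would be run on the two columns of the training split.

```python
import numpy as np

# Synthetic stand-in for the two columns: 30 days of hourly records.
# Assumption: End_Hour is the hour after Start_Hour, wrapping 23 -> 0.
start_hour = np.tile(np.arange(24), 30)
end_hour = (start_hour + 1) % 24

# Pearson correlation between the two columns.
pearson_r = float(np.corrcoef(start_hour, end_hour)[0, 1])

# Count how many distinct End_Hour values occur for each Start_Hour.
# Exactly one per start hour means End_Hour carries no extra information.
distinct_ends = {
    int(s): len(set(end_hour[start_hour == s].tolist())) for s in range(24)
}
is_redundant = all(count == 1 for count in distinct_ends.values())

print(f"Pearson r = {pearson_r:.2f}, fully determined by Start_Hour = {is_redundant}")
```

Under this assumption the wrap-around at midnight keeps the linear correlation below 1 even though one column is fully determined by the other, so a one-to-one mapping check is a more reliable redundancy test than correlation alone.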



In [2]:
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge

from energy_forecast.io import load_data
from energy_forecast.split import time_split
from energy_forecast.evaluate import root_mean_squared_error


In [3]:
df = load_data("../data/Energy Production Dataset.csv", date_col="Date")
df.shape

(51864, 9)

In [4]:
train_df, val_df, test_df = time_split(df, time_col="Date")  # defaults to 70/15/15
print("train:", len(train_df), "val:", len(val_df), "test:", len(test_df))

train: 36304 val: 7780 test: 7780
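`time_split` is a project helper whose implementation is not shown here. The sketch below is one plausible reading of a chronological 70/15/15 split: the name `chronological_split` is hypothetical, and the cut points are assumed to come from truncating `n * fraction`. Under that assumption it gives the same split sizes as the output above for 51,864 rows.

```python
def chronological_split(rows, train_frac=0.70, val_frac=0.15):
    """Split time-ordered records into train/val/test without shuffling.

    Hypothetical helper: `rows` is any iterable whose items sort by time
    (plain integers stand in for timestamps below).
    """
    ordered = sorted(rows)                       # oldest first
    n = len(ordered)
    train_end = int(n * train_frac)              # truncate, do not round
    val_end = int(n * (train_frac + val_frac))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Integers stand in for 51,864 timestamps.
train, val, test = chronological_split(range(51864))
print("train:", len(train), "val:", len(val), "test:", len(test))
```

Because the rows are sorted before slicing, every validation row is later than every training row and every test row is later than every validation row, which keeps future information out of the training set.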


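### **Feature and Target Definition**

In [5]:
TARGET = "Production"

def numeric_X(d):
    # Keep every numeric column except the target. This does not drop End_Hour
    # by itself: if End_Hour is stored as a number it stays in X unless it is
    # removed explicitly.
    return d.drop(columns=[TARGET]).select_dtypes(include=["number"])

X_train, y_train = numeric_X(train_df), train_df[TARGET]
X_val, y_val = numeric_X(val_df), val_df[TARGET]
X_test, y_test = numeric_X(test_df), test_df[TARGET]


### **Baseline Models: Mean Predictor and Ridge Regression**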

In [6]:
# Mean baseline
mean_model = DummyRegressor(strategy="mean")
mean_model.fit(X_train, y_train)
mean_val_pred = mean_model.predict(X_val)
mean_test_pred = mean_model.predict(X_test)

# Ridge baseline
ridge = Ridge(alpha=1.0, random_state=42)
ridge.fit(X_train, y_train)
ridge_val_pred = ridge.predict(X_val)
ridge_test_pred = ridge.predict(X_test)

print("RMSE (root_mean_squared_error)")
print("Mean  - Val :", root_mean_squared_error(y_val, mean_val_pred))
print("Mean  - Test:", root_mean_squared_error(y_test, mean_test_pred))
print("Ridge - Val :", root_mean_squared_error(y_val, ridge_val_pred))
print("Ridge - Test:", root_mean_squared_error(y_test, ridge_test_pred))


RMSE (root_mean_squared_error)
Mean  - Val : 4474.629669570113
Mean  - Test: 4213.391316725995
Ridge - Val : 4434.160660379344
Ridge - Test: 4192.933525782805
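
The absolute RMSE values are easier to compare as relative differences. The short calculation below copies the four numbers from the output above; `relative_gain` is a hypothetical helper introduced only for this illustration.

```python
# RMSE values copied from the output above.
rmse_values = {
    "val": {"mean": 4474.629669570113, "ridge": 4434.160660379344},
    "test": {"mean": 4213.391316725995, "ridge": 4192.933525782805},
}

def relative_gain(baseline, model):
    """Fractional RMSE reduction of `model` relative to `baseline`."""
    return (baseline - model) / baseline

gains = {
    split: relative_gain(scores["mean"], scores["ridge"])
    for split, scores in rmse_values.items()
}
for split, gain in gains.items():
    print(f"{split}: Ridge RMSE is {gain:.2%} lower than the mean predictor")
```

The reduction is roughly 0.9% on validation and 0.5% on test.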


### **Compare Baselines**

**Interpretation:**  
The Ridge Regression model achieves a lower RMSE than the naive mean predictor on both splits (4434.2 vs. 4474.6 on validation, 4192.9 vs. 4213.4 on test). The gain is small, under 1% in both cases, which suggests that the numeric features carry some predictive signal but that a purely linear model captures little of it. The result still indicates that the problem is learnable, and it fixes the reference values that later models must beat.
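
To show the same comparison on data where the answer is known, the sketch below fits a mean predictor and a closed-form ridge regression on simulated data with a linear relationship. Everything in it is an assumption made for illustration: the data are synthetic, `rmse` and `fit_ridge` are hypothetical helpers, and the closed form (unpenalised intercept, L2 penalty `alpha` on the weights) is the textbook ridge solution rather than the project's code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge fit with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the weights.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

# Simulated data: the target depends linearly on two features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

# Ordered split: first 400 rows for training, last 100 for validation.
X_tr, y_tr, X_va, y_va = X[:400], y[:400], X[400:], y[400:]

mean_pred = np.full(len(y_va), y_tr.mean())  # constant training-mean prediction
w, b = fit_ridge(X_tr, y_tr, alpha=1.0)
ridge_pred = X_va @ w + b

mean_rmse, ridge_rmse = rmse(y_va, mean_pred), rmse(y_va, ridge_pred)
print(f"mean RMSE = {mean_rmse:.3f}, ridge RMSE = {ridge_rmse:.3f}")
```

On this simulated data the gap between the two models is large by construction. The much smaller gap in the notebook's RMSE output is what motivates richer features and more flexible models in later phases.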


## Phase 3 Summary — Baseline Models

In this phase, baseline performance for the energy production forecasting task was established.

Key outcomes:
- Feature and target variables were explicitly defined using basic time-based numerical features.
- A naive mean predictor was evaluated to establish a minimum performance benchmark.
- A Ridge Regression model was trained as a simple, interpretable baseline.
- Ridge Regression outperformed the naive baseline by a small margin (under 1% lower RMSE on both validation and test), indicating some learnable signal in the data.

These baseline results provide a reference point for evaluating more complex models, feature engineering, and preprocessing pipelines in subsequent phases.
