# Phase 0 — Problem Framing (Time Series Regression)

## Context
Energy operators need reliable **hourly production forecasts** to support:
- grid stability and load balancing
- short-term capacity planning
- smoother integration of renewables (where variability is high)

This project treats forecasting as a *real-world ML system problem* — not just a leaderboard exercise.

---

## Problem Statement
Given historical, time-ordered observations of energy production and contextual signals, **predict the energy produced for each hour** while strictly respecting temporal order (no shuffling, no leakage).

**Task type:** Supervised learning → **Regression** (continuous target)  
**Data type:** Tabular **time series** (each row = one timestamp/hour)

---

## Target Variable
- **`Production`** → continuous numeric value representing energy produced during a specific hour.

---

## Inputs (Features)
The dataset contains:
- **Temporal signals:** timestamp-derived variables (hour / day-of-year / month / day name / season)
- **Categorical context:** e.g., **energy source** (Wind, etc.) and seasonal labels

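> Key rule: because observations are ordered in time, every feature for a given hour must be computable from information available before that hour, and evaluation must always score periods later than those used for training.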
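
As an illustration, timestamp-derived variables of the kind listed above can be recomputed from a raw timestamp with pandas, which is also a handy sanity check on the provided columns. This is a minimal sketch: the column name `Timestamp`, the output column names, and the demo rows are assumptions, not taken from the dataset. Season is left out because its definition depends on the dataset's own convention.

```python
import pandas as pd

def add_calendar_features(df, time_col="Timestamp"):
    """Return a copy of df with calendar features derived from time_col.

    Each feature depends only on the row's own timestamp, so none of them
    can leak information from the future.
    """
    out = df.copy()
    ts = pd.to_datetime(out[time_col])
    out["hour"] = ts.dt.hour
    out["day_of_year"] = ts.dt.dayofyear
    out["month"] = ts.dt.month
    out["day_name"] = ts.dt.day_name()
    return out

# Made-up rows, for illustration only.
demo = pd.DataFrame(
    {
        "Timestamp": ["2020-01-01 00:00", "2020-07-15 13:00"],
        "Production": [5.0, 7.5],
    }
)
demo_features = add_calendar_features(demo)
print(demo_features[["hour", "day_of_year", "month", "day_name"]])
```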

---

## Forecasting Setup (What “Good” Looks Like)
A good model should:
- generalize to *future* periods (not memorize past patterns)
- remain stable across different time windows (not just one lucky split)
- avoid hidden leakage through preprocessing or feature engineering

---

## Evaluation Metric (Why RMSE)
Primary metric: **RMSE (Root Mean Squared Error)**  
RMSE is chosen because:
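- it is a standard metric for continuous-valued forecasts
- it squares each error before averaging, so large misses are penalized disproportionately (important in energy planning)
- it is expressed in the same units as `Production`, which makes it easy to compare across models and phases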
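
To make the "large misses" point concrete, here is a small sketch that computes RMSE next to MAE for two made-up forecasts with the same total absolute error. The helper names `rmse` and `mae` and all numbers are illustrative assumptions, not the project's actual code or data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def mae(y_true, y_pred):
    """Mean absolute error, shown only as a contrast to RMSE."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up numbers: both forecasts miss by 8 units in total.
actual = [100.0, 100.0, 100.0, 100.0]
steady_miss = [98.0, 98.0, 98.0, 98.0]      # four small misses of 2
single_spike = [100.0, 100.0, 100.0, 92.0]  # one large miss of 8

print(mae(actual, steady_miss), rmse(actual, steady_miss))    # 2.0 2.0
print(mae(actual, single_spike), rmse(actual, single_spike))  # 2.0 4.0
```

MAE cannot tell the two forecasts apart, while RMSE doubles for the one with a single large miss.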

---

## Validation Strategy (Leakage-Safe)
Because this is a time series problem:
- data must be split **chronologically**
- no random shuffling
- the **test set is used once** at the end (final unbiased check)

Split policy used in this project:
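- **70% train / 15% validation / 15% test**, in time order: the earliest 70% of rows for training, the next 15% for validation, and the most recent 15% for testing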
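
A chronological split of this kind could be sketched as follows. The function name `chronological_split` is hypothetical, and the sketch assumes the rows are already sorted by timestamp; it returns positional slices rather than copying any data.

```python
def chronological_split(n_rows, train_frac=0.70, val_frac=0.15):
    """Return (train, val, test) slices over rows sorted oldest to newest.

    The test slice takes whatever remains after train and validation,
    so the three slices always cover every row exactly once.
    """
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    n_train = round(n_rows * train_frac)
    n_val = round(n_rows * val_frac)
    train = slice(0, n_train)
    val = slice(n_train, n_train + n_val)
    test = slice(n_train + n_val, n_rows)
    return train, val, test

# Stand-in for 1000 hourly rows already sorted by time.
hours = list(range(1000))
train_idx, val_idx, test_idx = chronological_split(len(hours))
train, val, test = hours[train_idx], hours[val_idx], hours[test_idx]

# No shuffling: every training hour precedes every validation hour,
# and every validation hour precedes every test hour.
print(len(train), len(val), len(test))
```

With a time-sorted pandas DataFrame `df`, the same slices can be applied as `df.iloc[train_idx]`, `df.iloc[val_idx]`, and `df.iloc[test_idx]`.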

---

## Project Deliverables
By the end of this notebook, we will produce:
- a leakage-safe ML pipeline (preprocessing + modeling)
- baseline benchmarks (so improvements are meaningful)
- feature engineering designed for forecasting (lags and rolling statistics, shifted so that each row uses only past values)
- model comparisons using validation RMSE
- error analysis explaining *where and why* the model fails (peaks, seasons, hours, sources)
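
As a rough illustration of the shifted lag and rolling features mentioned above, the sketch below builds both from past values only. It assumes a single series sorted by time; with several energy sources in one table, the same operations would need to be applied per source. The function name, the lag and window choices, and the demo values are all assumptions.

```python
import pandas as pd

def add_lag_features(df, target="Production", lags=(1, 24), window=24):
    """Return a copy of df with lag and rolling-mean features of the target."""
    out = df.copy()
    for lag in lags:
        # shift(lag) moves past values forward, so row t sees the value at t - lag.
        out[f"{target}_lag_{lag}"] = out[target].shift(lag)
    # Shift by one step before rolling so the window ends at t - 1
    # and never includes the value being predicted.
    out[f"{target}_rollmean_{window}"] = (
        out[target].shift(1).rolling(window).mean()
    )
    return out

# Made-up hourly values, for illustration only.
demo = pd.DataFrame({"Production": [10.0, 12.0, 11.0, 15.0, 14.0, 13.0]})
features = add_lag_features(demo, lags=(1, 2), window=3)
print(features)
```

The first rows contain missing values because no history exists yet; how to handle them (drop or impute) is a separate decision.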

---

## Methodology Snapshot
The workflow progresses in phases:
1. framing + metric + constraints
2. data sanity + EDA
3. time-aware splitting
4. baselines
5. preprocessing pipelines
6. feature engineering (lags/rolling)
7. advanced models
8. error analysis + robustness
9. conclusions + future work


---