# 04 — Modeling and Forecasting

Goal: train, compare, and evaluate several models for forecasting national
electricity consumption, using the dataset enriched by feature engineering.

The models are evaluated:
- on real data (baseline)
- on reconstructed data (Prophet-filled)
- with a strict temporal split
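
As a rough illustration of how one model's predictions might be scored, here is a minimal sketch. It assumes the metrics of interest are MAE and RMSE, which the `mean_absolute_error` and `mean_squared_error` imports in this notebook suggest but do not confirm. The helper name `evaluate_forecast` and the toy values are hypothetical.

```python
import numpy as np

def evaluate_forecast(y_true, y_pred):
    """Return MAE and RMSE for one set of predictions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    mae = float(np.mean(np.abs(errors)))          # mean absolute error
    rmse = float(np.sqrt(np.mean(errors ** 2)))   # root mean squared error
    return {"mae": mae, "rmse": rmse}

# Toy values for illustration only, not taken from the dataset.
scores = evaluate_forecast([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
```

Both metrics are expressed in the same unit as the target, so they can be compared directly across the baseline and Prophet-filled runs.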


In [5]:
from pathlib import Path
import pandas as pd
import numpy as np

from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge
from xgboost import XGBRegressor


## Loading the feature datasets

Two datasets are used:
- the reference dataset (no interpolation)
- the dataset reconstructed with Prophet (counterfactual)
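
Because the same feature list is later applied to both datasets, it may be worth confirming that they expose the same columns. The sketch below assumes both files are meant to share one schema. `compare_schemas` is a hypothetical helper, and the toy column `is_filled` is invented for the example.

```python
import pandas as pd

def compare_schemas(df_ref, df_other):
    """List columns present in one frame but not the other (hypothetical helper)."""
    ref_cols, other_cols = set(df_ref.columns), set(df_other.columns)
    return {
        "only_in_ref": sorted(ref_cols - other_cols),
        "only_in_other": sorted(other_cols - ref_cols),
    }

# Toy frames for illustration; "is_filled" is an invented column name.
toy_ref = pd.DataFrame({"datetime": [], "y": [], "hour": []})
toy_other = pd.DataFrame({"datetime": [], "y": [], "hour": [], "is_filled": []})
diff = compare_schemas(toy_ref, toy_other)
```

With the real data, the call would be `compare_schemas(df_base, df_prophet)` once both frames are loaded. Two empty lists mean the feature selection applies to both frames unchanged.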


In [7]:
PROJECT_ROOT = Path("/home/onyxia/france-grid-stress-prediction")
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"

BASELINE_PATH = DATA_PROCESSED / "dataset_features.parquet"
PROPHET_PATH  = DATA_PROCESSED / "dataset_features_prophetfilled.parquet"

df_base = pd.read_parquet(BASELINE_PATH)
df_prophet = pd.read_parquet(PROPHET_PATH)

df_base.head()


| | datetime | y | split | temperature_2m | wind_speed_10m | direct_radiation | diffuse_radiation | cloud_cover | hour | dayofweek | ... | doy_sin | doy_cos | load_lag_1h | load_lag_24h | load_lag_48h | load_lag_168h | load_roll_mean_24h | load_roll_std_24h | load_roll_mean_168h | load_roll_std_168h |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-01-08 00:00:00 | 74564.5 | train | -2.365344 | 12.290582 | 0.0 | 0.0 | 67.0625 | 0 | 4 | ... | 0.137185 | 0.990545 | 73921.5 | 73233.0 | 72064.5 | 52685.0 | 82903.416667 | 4661.83838 | 74125.791667 | 10498.593411 |
| 1 | 2010-01-08 01:00:00 | 77065.5 | train | -2.537219 | 12.808883 | 0.0 | 0.0 | 70.78125 | 1 | 4 | ... | 0.137185 | 0.990545 | 74564.5 | 75735.5 | 74674.5 | 52142.5 | 82958.895833 | 4548.289956 | 74256.026786 | 10365.896658 |
| 2 | 2010-01-08 02:00:00 | 82297.0 | train | -2.552844 | 13.657961 | 0.0 | 0.0 | 73.9375 | 2 | 4 | ... | 0.137185 | 0.990545 | 77065.5 | 80790.5 | 79808.5 | 52081.5 | 83014.3125 | 4463.770184 | 74404.377976 | 10224.908115 |
| 3 | 2010-01-08 03:00:00 | 87563.0 | train | -2.551281 | 14.603605 | 0.0 | 0.0 | 77.5625 | 3 | 4 | ... | 0.137185 | 0.990545 | 82297.0 | 85729.0 | 84932.0 | 52331.5 | 83077.083333 | 4441.676382 | 74584.232143 | 10094.816596 |
| 4 | 2010-01-08 04:00:00 | 89394.5 | train | -2.530969 | 15.81296 | 0.0 | 0.0 | 80.96875 | 4 | 4 | ... | 0.137185 | 0.990545 | 87563.0 | 86940.0 | 87177.5 | 52171.0 | 83153.5 | 4504.615445 | 74793.943452 | 9995.227806 |


## Train / validation / test split

The temporal split was defined during feature engineering: each row already
carries its label (`train`, `valid`, or `test`) in the `split` column.
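
Since the labels come from an earlier step, a quick check that the splits do not overlap in time can guard against leakage. The sketch below assumes the frame has `datetime` and `split` columns and that all three labels are present. `check_temporal_order` is a hypothetical helper and the toy frame is invented.

```python
import pandas as pd

def check_temporal_order(df, time_col="datetime", split_col="split",
                         order=("train", "valid", "test")):
    """Return True if each split ends strictly before the next one starts."""
    bounds = df.groupby(split_col)[time_col].agg(["min", "max"])
    for earlier, later in zip(order, order[1:]):
        if bounds.loc[earlier, "max"] >= bounds.loc[later, "min"]:
            return False
    return True

# Toy frame for illustration only.
toy = pd.DataFrame({
    "datetime": pd.date_range("2020-01-01", periods=6, freq="D"),
    "split": ["train", "train", "train", "valid", "valid", "test"],
})
is_ordered = check_temporal_order(toy)

# Relabel the earliest row as "test" to simulate a leaky split.
leaky = toy.copy()
leaky.loc[0, "split"] = "test"
is_leaky_ordered = check_temporal_order(leaky)
```

On the real data, the equivalent calls would be `check_temporal_order(df_base)` and `check_temporal_order(df_prophet)`.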


In [8]:
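TARGET = "y"
META_COLS = ["datetime", "split"]

# Every column that is neither metadata nor the target is used as a model input.
FEATURES = [c for c in df_base.columns if c not in META_COLS + [TARGET]]

def split_data(df):
    """Return X/y pairs for train, valid, and test, using the precomputed `split` column."""
    train = df[df["split"] == "train"]
    valid = df[df["split"] == "valid"]
    test  = df[df["split"] == "test"]

    # Order: X_train, y_train, X_valid, y_valid, X_test, y_test
    return (
        train[FEATURES], train[TARGET],
        valid[FEATURES], valid[TARGET],
        test[FEATURES],  test[TARGET],
    )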
