# 02 - Data cleaning & preparation (consumption + weather)

This notebook prepares the **clean datasets** used in the rest of the *Stress Grid* project:

1. **National consumption**: consolidate the `consommation_YYYY_long.csv` files (30-minute step) → `consommation_clean.parquet`
2. **Multi-city weather**: cleaning + hourly continuity check → national aggregation → `weather_national_hourly.parquet`
3. **Time alignment**: consumption (resampled to hourly) + national weather (hourly) → `dataset_model_hourly.parquet`

> Note: the 2020 “gap” (May→September; the monthly counts in section 4 show November and December missing as well) is an **absence of source data** and is not imputed in the baseline.


## 0. Imports & configuration

The “heavy” code is moved to `src/data/` to keep this notebook readable (orchestration + checks).


In [13]:
import pandas as pd
import sys
from pathlib import Path

# Notebook launched from .../notebooks
PROJECT_ROOT = Path.cwd().resolve().parent

# Put the project root first on sys.path (ahead of site-packages)
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / "data"
DATA_RAW = DATA_DIR / "raw"
DATA_PROCESSED = DATA_DIR / "processed"

print("PROJECT_ROOT:", PROJECT_ROOT)
print("DATA_RAW exists:", DATA_RAW.exists())
print("DATA_PROCESSED exists:", DATA_PROCESSED.exists())





PROJECT_ROOT: /home/onyxia/france-grid-stress-prediction
DATA_RAW exists: True
DATA_PROCESSED exists: True


In [14]:
import sys
from pathlib import Path

# Sanity check: confirm that `src` resolves to this project's package
PROJECT_ROOT = Path.cwd().resolve().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

print("Trying import src ...")
import src
print("src imported from:", getattr(src, "__file__", src))

print("Trying import src.data ...")
import src.data
print("src.data imported from:", getattr(src.data, "__file__", src.data))


Trying import src ...
src imported from: /home/onyxia/france-grid-stress-prediction/src/__init__.py
Trying import src.data ...
src.data imported from: /home/onyxia/france-grid-stress-prediction/src/data/__init__.py


## 1. Electricity consumption: cleaning and consolidation

- Column standardization (`datetime`, `load_mw`)
- Chronological sort + deduplication
- Simple continuity check (expected step = 30 min)
- Export to Parquet
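
The steps listed above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions, not the actual `build_consumption_dataset` implementation (the `src.data.consumption_cleaning` module is not shown in this notebook): the function name `clean_consumption_sketch` and the raw column names `date_heure` / `consommation` are hypothetical.

```python
import pandas as pd

EXPECTED_STEP = pd.Timedelta(minutes=30)


def clean_consumption_sketch(raw: pd.DataFrame, rename: dict) -> tuple:
    """Standardize, sort and deduplicate; return (clean frame, number of irregular steps)."""
    # Standardize column names and keep only the two expected columns.
    df = raw.rename(columns=rename)[["datetime", "load_mw"]].copy()
    df["datetime"] = pd.to_datetime(df["datetime"])
    # Chronological sort, then keep one row per timestamp.
    df = (
        df.sort_values("datetime")
        .drop_duplicates(subset="datetime", keep="first")
        .reset_index(drop=True)
    )
    # Continuity check: count consecutive deltas that differ from the expected step.
    deltas = df["datetime"].diff().dropna()
    n_bad_steps = int((deltas != EXPECTED_STEP).sum())
    return df, n_bad_steps


# Toy input (made-up values): unsorted, one duplicated row, one missing slot (01:30).
raw = pd.DataFrame(
    {
        "date_heure": [
            "2019-01-01 01:00",
            "2019-01-01 00:00",
            "2019-01-01 00:30",
            "2019-01-01 00:30",
            "2019-01-01 02:00",
        ],
        "consommation": [52000, 53000, 52500, 52500, 51000],
    }
)
clean, n_bad = clean_consumption_sketch(
    raw, {"date_heure": "datetime", "consommation": "load_mw"}
)
print(clean)
print("irregular steps:", n_bad)
```

The export step would then be a single `clean.to_parquet(...)` call, which requires a Parquet engine such as `pyarrow` to be installed.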


In [15]:
from src.data.consumption_cleaning import ConsumptionCleanConfig, build_consumption_dataset


cfg_cons = ConsumptionCleanConfig(
    raw_dir=DATA_RAW / "consommation",   
    out_path=DATA_PROCESSED / "consommation_clean.parquet",
    pattern="consommation_*_long.csv",
)

df_cons, report_cons = build_consumption_dataset(cfg_cons)






### 1.1 Quick checks

- Time bounds
- Uniqueness of `datetime`
- Observed time step (distribution of deltas)


In [16]:
df_cons["datetime"].min(), df_cons["datetime"].max(), len(df_cons), df_cons["datetime"].is_unique


(Timestamp('2010-01-01 00:00:00'),
 Timestamp('2024-12-31 23:30:00'),
 252672,
 True)

In [17]:
df_cons = df_cons.sort_values("datetime")
df_cons["datetime"].diff().value_counts().head(10)


datetime
0 days 00:30:00      252669
154 days 00:30:00         1
61 days 00:30:00          1
Name: count, dtype: int64
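
The row count and the two irregular deltas printed above are mutually consistent, which can be verified with plain arithmetic. The sketch below uses only values shown in the two outputs above and the standard library; the one assumption is that a delta of `k` steps between two consecutive rows hides `k - 1` missing half-hourly slots.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=30)

first = datetime(2010, 1, 1, 0, 0)     # df_cons["datetime"].min()
last = datetime(2024, 12, 31, 23, 30)  # df_cons["datetime"].max()

# Number of slots in a gap-free half-hourly index over [first, last].
expected_full = (last - first) // STEP + 1

# The two irregular deltas reported by value_counts().
gaps = [timedelta(days=154, minutes=30), timedelta(days=61, minutes=30)]

# A delta of k steps between consecutive rows hides k - 1 missing slots.
missing = sum(gap // STEP - 1 for gap in gaps)

observed = expected_full - missing
print(expected_full, missing, observed)  # 262992 10320 252672
```

The result, 252672, matches `len(df_cons)`, so the two gaps (154 and 61 days) account for every missing row between the time bounds.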

## 2. Weather: cleaning, checks, national aggregation

- Cleaning: columns, datetime (UTC → naive), duplicates on (`city`, `datetime`)
- Hourly continuity check **per city**
- National aggregation: hourly mean across all cities
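
As a rough illustration of these three steps, here is a minimal pandas sketch. It is not the code behind `build_weather_national_dataset`: the three function names are hypothetical, the input timestamps are assumed to be UTC strings, and the timezone is dropped without converting to local time. The column names (`city`, `datetime`, `temperature_2m`, `n_bad_steps`, ...) are taken from the outputs shown in this notebook.

```python
import pandas as pd

HOUR = pd.Timedelta(hours=1)


def clean_weather_sketch(raw: pd.DataFrame) -> pd.DataFrame:
    """Parse timestamps as UTC, drop the timezone, deduplicate on (city, datetime)."""
    df = raw.copy()
    df["datetime"] = pd.to_datetime(df["datetime"], utc=True).dt.tz_localize(None)
    return (
        df.drop_duplicates(subset=["city", "datetime"])
        .sort_values(["city", "datetime"])
        .reset_index(drop=True)
    )


def continuity_report_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """One row per city: row count, time bounds, number of steps that are not 1 hour."""
    step = df.groupby("city")["datetime"].diff()  # NaT on each city's first row
    bad = step.notna() & (step != HOUR)
    return (
        df.assign(bad=bad)
        .groupby("city")
        .agg(
            rows=("datetime", "count"),
            min_dt=("datetime", "min"),
            max_dt=("datetime", "max"),
            n_bad_steps=("bad", "sum"),
        )
        .reset_index()
    )


def national_mean_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Unweighted mean of every numeric variable across cities, per timestamp."""
    return df.groupby("datetime").mean(numeric_only=True).reset_index()


# Toy data (made-up values): city A has a duplicated hour, city B is missing 01:00.
raw = pd.DataFrame(
    {
        "city": ["A", "A", "A", "A", "B", "B"],
        "datetime": [
            "2019-01-01T00:00:00Z", "2019-01-01T01:00:00Z",
            "2019-01-01T01:00:00Z", "2019-01-01T02:00:00Z",
            "2019-01-01T00:00:00Z", "2019-01-01T02:00:00Z",
        ],
        "temperature_2m": [4.0, 5.0, 5.0, 6.0, 8.0, 10.0],
    }
)
weather = clean_weather_sketch(raw)
report = continuity_report_sketch(weather)
national = national_mean_sketch(weather)
print(report)
print(national)
```

In the toy data, the 01:00 national value is computed from city A alone because city B has no row for that hour. A missing hour silently changes the set of cities in the mean, which is one reason to run the per-city continuity check before aggregating.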


In [18]:
from src.data.weather_cleaning import WeatherCleanConfig, build_weather_national_dataset

# Adjust here if needed (e.g. a single global CSV covering all years)
raw_weather_path = DATA_RAW / "weather" / "weather_32_cities_2019.csv"

cfg_w = WeatherCleanConfig(
    raw_path=raw_weather_path,
    out_path=DATA_PROCESSED / "weather_national_hourly.parquet",
)

df_weather_nat, report_weather = build_weather_national_dataset(cfg_w)

display(report_weather.head(10))
df_weather_nat.head()


| | city | rows | min_dt | max_dt | n_bad_steps |
|---|---|---|---|---|---|
| 0 | Angers | 8760 | 2019-01-01 | 2019-12-31 23:00:00 | 0 |
| 1 | Avignon | 8760 | 2019-01-01 | 2019-12-31 23:00:00 | 0 |
| 2 | Bordeaux | 8760 | 2019-01-01 | 2019-12-31 23:00:00 | 0 |
| 3 | Brest | 8760 | 2019-01-01 | 2019-12-31 23:00:00 | 0 |
| 4 | Béthune | 8760 | 2019-01-01 | 2019-12-31 23:00:00 | 0 |
| 5 | Caen | 8760 | 2019-01-01 | 2019-12-31 23:00:00 | 0 |
| 6 | Clermont-Ferrand | 8760 | 2019-01-01 | 2019-12-31 23:00:00 | 0 |
| 7 | Dijon | 8760 | 2019-01-01 | 2019-12-31 23:00:00 | 0 |
| 8 | Douai - Lens | 8760 | 2019-01-01 | 2019-12-31 23:00:00 | 0 |
| 9 | Genève - Annemasse (partie française) | 8760 | 2019-01-01 | 2019-12-31 23:00:00 | 0 |


| | datetime | temperature_2m | wind_speed_10m | direct_radiation | diffuse_radiation | cloud_cover |
|---|---|---|---|---|---|---|
| 0 | 2019-01-01 00:00:00 | 6.235828 | 8.380237 | 0.0 | 0.0 | 63.96875 |
| 1 | 2019-01-01 01:00:00 | 6.265516 | 8.466251 | 0.0 | 0.0 | 68.5625 |
| 2 | 2019-01-01 02:00:00 | 6.070203 | 8.645449 | 0.0 | 0.0 | 69.59375 |
| 3 | 2019-01-01 03:00:00 | 5.979578 | 9.143577 | 0.0 | 0.0 | 72.53125 |
| 4 | 2019-01-01 04:00:00 | 5.824891 | 9.201295 | 0.0 | 0.0 | 73.84375 |


## 3. Time alignment and merge (baseline)

- Consumption: hourly resample (mean)
- Merge on `datetime` (inner join)
- Checks: duplicates, hourly continuity
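
The resample-then-join logic can be illustrated on toy data. This is a sketch of the general technique, not the body of `build_hourly_dataset`; all values are made up, and the resampling rule is passed as a `pd.Timedelta` so that the example does not depend on which hourly alias string the installed pandas version accepts.

```python
import pandas as pd

HOUR = pd.Timedelta(hours=1)

# Toy half-hourly consumption, 00:00 to 02:30.
cons = pd.DataFrame(
    {
        "datetime": pd.date_range("2019-01-01 00:00", periods=6, freq="30min"),
        "load_mw": [53000.0, 53400.0, 53100.0, 52900.0, 53500.0, 53300.0],
    }
)

# Toy hourly national weather, 01:00 to 03:00 (03:00 has no consumption counterpart).
weather = pd.DataFrame(
    {
        "datetime": pd.date_range("2019-01-01 01:00", periods=3, freq=HOUR),
        "temperature_2m": [6.2, 6.1, 6.0],
    }
)

# Hourly mean of the two half-hourly values falling in each hour.
hourly = cons.set_index("datetime").resample(HOUR).mean().reset_index()

# Inner join keeps only hours present on both sides; validate guards against duplicates.
merged = hourly.merge(weather, on="datetime", how="inner", validate="one_to_one")
print(merged)
```

Only 01:00 and 02:00 survive the inner join here. On a series with gaps, `resample(...).mean()` yields `NaN` rows for empty hours, so those rows would need to be dropped (or handled explicitly) before or after the merge.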


In [19]:
from src.data.merge_datasets import MergeConfig, build_hourly_dataset

cfg_merge = MergeConfig(
    consumption_path=DATA_PROCESSED / "consommation_clean.parquet",
    weather_path=DATA_PROCESSED / "weather_national_hourly.parquet",
    out_path=DATA_PROCESSED / "dataset_model_hourly.parquet",
)

df_model = build_hourly_dataset(cfg_merge)
df_model.head()



| | datetime | load_mw | temperature_2m | wind_speed_10m | direct_radiation | diffuse_radiation | cloud_cover |
|---|---|---|---|---|---|---|---|
| 0 | 2019-01-01 00:00:00 | 53228.0 | 6.235828 | 8.380237 | 0.0 | 0.0 | 63.96875 |
| 1 | 2019-01-01 01:00:00 | 53005.0 | 6.265516 | 8.466251 | 0.0 | 0.0 | 68.5625 |
| 2 | 2019-01-01 02:00:00 | 53438.0 | 6.070203 | 8.645449 | 0.0 | 0.0 | 69.59375 |
| 3 | 2019-01-01 03:00:00 | 53706.0 | 5.979578 | 9.143577 | 0.0 | 0.0 | 72.53125 |
| 4 | 2019-01-01 04:00:00 | 53387.0 | 5.824891 | 9.201295 | 0.0 | 0.0 | 73.84375 |


In [20]:
# Basic checks
df_model["datetime"].duplicated().sum(), df_model["datetime"].diff().value_counts().head(5)


(np.int64(0),
 datetime
 0 days 01:00:00    8759
 Name: count, dtype: int64)
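
The visual check above could also be expressed as an assertion, so that a regression in the pipeline fails loudly instead of relying on someone reading the output. A minimal sketch; the helper name `assert_regular_index` is hypothetical and not part of `src/data/`.

```python
import pandas as pd


def assert_regular_index(df: pd.DataFrame, step: pd.Timedelta, col: str = "datetime") -> None:
    """Raise AssertionError if `col` has duplicates or any consecutive delta differs from `step`."""
    ts = df[col].sort_values()
    assert not ts.duplicated().any(), f"duplicated values in {col!r}"
    deltas = ts.diff().dropna()
    bad = deltas[deltas != step]
    assert bad.empty, f"{len(bad)} irregular step(s), first at index label {bad.index[0]}"


HOUR = pd.Timedelta(hours=1)

# A regular hourly index passes silently.
ok = pd.DataFrame({"datetime": pd.date_range("2019-01-01", periods=4, freq=HOUR)})
assert_regular_index(ok, HOUR)

# Dropping one row creates a 2-hour step, which the helper reports.
broken = ok.drop(index=2)
message = ""
try:
    assert_regular_index(broken, HOUR)
except AssertionError as exc:
    message = str(exc)
print(message)
```

In this notebook the call would be `assert_regular_index(df_model, pd.Timedelta(hours=1))`.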

## 4. Focus: diagnosing the 2020 “gap” (consumption)

Goal: verify that the 2020 discontinuity really comes from the **source files**.

- Filter 2020 in the clean dataset
- Compare the monthly distribution with the raw 2020 file


In [21]:
df_cons_2020 = df_cons[df_cons["datetime"].dt.year == 2020].copy()
df_cons_2020["month"] = df_cons_2020["datetime"].dt.month
df_cons_2020.groupby("month").size()


month
1     1488
2     1392
3     1488
4     1440
10    1440
dtype: int64

In [22]:
raw_2020_path = DATA_RAW / "consommation" / "consommation_2020_long.csv"
df_raw_2020 = pd.read_csv(raw_2020_path)
pd.to_datetime(df_raw_2020["datetime"]).dt.month.value_counts().sort_index()


datetime
1     1488
2     1392
3     1488
4     1440
10    1440
Name: count, dtype: int64
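
Monthly counts show which months are affected; to pin down the exact boundaries, one option is to list the timestamps on either side of each irregular step. The sketch below shows this on toy timestamps; the helper name `find_gaps` is hypothetical and the dates are made up.

```python
import pandas as pd


def find_gaps(ts: pd.Series, step: pd.Timedelta) -> pd.DataFrame:
    """Return one row per gap: last timestamp before, first timestamp after, missing slot count."""
    ts = ts.sort_values().reset_index(drop=True)
    delta = ts.diff()
    is_gap = delta > step  # NaT on the first row compares as False
    return pd.DataFrame(
        {
            "last_before": ts.shift(1)[is_gap],
            "first_after": ts[is_gap],
            "n_missing": delta[is_gap] // step - 1,
        }
    ).reset_index(drop=True)


# Toy half-hourly series with two missing slots (01:00 and 01:30).
ts = pd.Series(
    pd.to_datetime(
        ["2021-03-01 00:00", "2021-03-01 00:30", "2021-03-01 02:00", "2021-03-01 02:30"]
    )
)
gaps = find_gaps(ts, pd.Timedelta(minutes=30))
print(gaps)
```

Called as `find_gaps(df_cons["datetime"], pd.Timedelta(minutes=30))`, this would be expected to return one row for each of the two irregular deltas seen in section 1.1.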

## 5. Outputs

- `data/processed/consommation_clean.parquet`
- `data/processed/weather_national_hourly.parquet`
- `data/processed/dataset_model_hourly.parquet`
