# 02 - Data cleaning & preparation (consumption + weather)

This notebook builds the **clean datasets** used in the rest of the *Stress Grid* project:

1. **National consumption**: consolidation of the `consommation_YYYY_long.csv` files (30-min step) → `consommation_clean.parquet`
2. **Multi-city weather**: cleaning + hourly continuity check → national aggregation → `weather_national_hourly.parquet`
3. **Time alignment**: consumption (resampled to hourly) + national weather (hourly) → `dataset_model_hourly.parquet`

> Note: the 2020 "gap" (May→September) comes from **missing source data** and is not imputed in the baseline.


## 0. Imports & configuration

The "heavy" code lives in `src/data/` so that this notebook stays readable (orchestration + checks).


In [10]:
import sys
from pathlib import Path

# The notebook is run from .../notebooks
PROJECT_ROOT = Path.cwd().resolve().parent

# Put the project root first on sys.path (ahead of site-packages)
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / "data"
DATA_RAW = DATA_DIR / "raw"
DATA_PROCESSED = DATA_DIR / "processed"

print("PROJECT_ROOT:", PROJECT_ROOT)
print("DATA_RAW exists:", DATA_RAW.exists())
print("DATA_PROCESSED exists:", DATA_PROCESSED.exists())





PROJECT_ROOT: /home/onyxia/france-grid-stress-prediction
DATA_RAW exists: False
DATA_PROCESSED exists: True
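
The output above reports that `DATA_RAW` does not exist on this machine. A minimal sketch of a fail-fast check, using only the standard library, could surface that before any loader runs. The helper names `missing_dirs` and `require_dirs` are hypothetical and are not part of the project's `src` package.

```python
from pathlib import Path


def missing_dirs(*dirs: Path) -> list[Path]:
    """Return the directories that do not exist, in the order given."""
    return [d for d in dirs if not d.is_dir()]


def require_dirs(*dirs: Path) -> None:
    """Raise immediately, listing every missing directory at once."""
    missing = missing_dirs(*dirs)
    if missing:
        listing = "\n".join(f"  - {d}" for d in missing)
        raise FileNotFoundError(f"Missing data directories:\n{listing}")
```

Calling `require_dirs(DATA_RAW, DATA_PROCESSED)` right after the configuration cell would stop the notebook at the first cell instead of deep inside a cleaning function.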


## 1. Electricity consumption: cleaning and consolidation

- Column standardization (`datetime`, `load_mw`)
- Chronological sort + deduplication
- Simple continuity check (expected step = 30 min)
- Export to Parquet
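
The steps listed above can be sketched on a toy frame as follows. This is only an illustration, not the code of `src.consumption_cleaning`: the raw column names `Date` and `Conso`, and the helpers `clean_consumption` and `count_irregular_steps`, are assumptions made for the example. The Parquet export is left out.

```python
import pandas as pd

EXPECTED_STEP = pd.Timedelta(minutes=30)


def clean_consumption(df: pd.DataFrame, rename: dict[str, str]) -> pd.DataFrame:
    """Standardize column names, sort by time, and drop duplicate timestamps."""
    out = df.rename(columns=rename)[["datetime", "load_mw"]].copy()
    out["datetime"] = pd.to_datetime(out["datetime"])
    out = out.sort_values("datetime").drop_duplicates(subset="datetime", keep="first")
    return out.reset_index(drop=True)


def count_irregular_steps(df: pd.DataFrame) -> int:
    """Count consecutive rows whose spacing differs from the expected 30 min."""
    deltas = df["datetime"].diff().dropna()
    return int((deltas != EXPECTED_STEP).sum())


# Toy input (hypothetical raw column names): unsorted, one duplicate, one gap.
raw = pd.DataFrame(
    {
        "Date": [
            "2019-01-01 00:30",
            "2019-01-01 00:00",
            "2019-01-01 00:30",
            "2019-01-01 02:00",
        ],
        "Conso": [51000.0, 52000.0, 51000.0, 48000.0],
    }
)
clean = clean_consumption(raw, {"Date": "datetime", "Conso": "load_mw"})
print(clean)
print("irregular steps:", count_irregular_steps(clean))
```

On this toy input the duplicate at 00:30 is dropped and the jump from 00:30 to 02:00 is counted as one irregular step.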


In [None]:
from pathlib import Path
import sys

print("CWD:", Path.cwd())
print("Notebook parent:", Path.cwd().parent)
print("sys.path (first 5):")
for p in sys.path[:5]:
    print("  ", p)


CWD: /home/onyxia/france-grid-stress-prediction/notebooks
Notebook parent: /home/onyxia/france-grid-stress-prediction
sys.path (first 5):
   /opt/python/lib/python313.zip
   /opt/python/lib/python3.13
   /opt/python/lib/python3.13/lib-dynload
   
   /opt/python/lib/python3.13/site-packages


In [7]:
root = Path.cwd().parent

print("\nRoot content:")
for p in root.iterdir():
    print(" ", p)

print("\nNotebooks content:")
for p in (root / "notebooks").iterdir():
    print(" ", p)

print("\nSrc content:")
for p in (root / "src").iterdir():
    print(" ", p)



Root content:
  /home/onyxia/france-grid-stress-prediction/train_xgboost.py
  /home/onyxia/france-grid-stress-prediction/models
  /home/onyxia/france-grid-stress-prediction/meteo_globale_hist.py
  /home/onyxia/france-grid-stress-prediction/meteo_idf.ipynb
  /home/onyxia/france-grid-stress-prediction/.git
  /home/onyxia/france-grid-stress-prediction/agglomerations.xlsx
  /home/onyxia/france-grid-stress-prediction/backtest_jplus12023.py
  /home/onyxia/france-grid-stress-prediction/.gitignore
  /home/onyxia/france-grid-stress-prediction/notebooks
  /home/onyxia/france-grid-stress-prediction/meteo_paris.py
  /home/onyxia/france-grid-stress-prediction/src
  /home/onyxia/france-grid-stress-prediction/.cache.sqlite
  /home/onyxia/france-grid-stress-prediction/meteo_globale.py
  /home/onyxia/france-grid-stress-prediction/data
  /home/onyxia/france-grid-stress-prediction/Découverte de git

Notebooks content:
  /home/onyxia/france-grid-stress-prediction/notebooks/nettoyagerapport
  /home/onyxia

In [8]:
from src import consumption_cleaning
print([name for name in dir(consumption_cleaning) if not name.startswith("_")])


['ConsumptionCleanConfig', 'Dict', 'EXPECTED_FREQ', 'Iterable', 'List', 'Optional', 'Path', 'annotations', 'build_consumption_dataset', 'clean_consumption_file', 'dataclass', 'pd']


In [9]:
print((root / "data" / "raw").exists())
print((root / "data" / "processed").exists())


False
True


In [5]:
from src.consumption_cleaning import ConsumptionCleanConfig, build_consumption_dataset

cfg_cons = ConsumptionCleanConfig(
    raw_dir=DATA_RAW / "consumption",
    out_path=DATA_PROCESSED / "consommation_clean.parquet",
    pattern="consommation_*_long.csv",
)

df_cons, report_cons = build_consumption_dataset(cfg_cons)

display(report_cons)
df_cons.head()


FileNotFoundError: No files matching consommation_*_long.csv in /home/onyxia/france-grid-stress-prediction/data/raw/consumption

### 1.1 Quick checks

- Time bounds
- Uniqueness of `datetime`
- Observed time step (distribution of deltas)


In [None]:
df_cons["datetime"].min(), df_cons["datetime"].max(), len(df_cons), df_cons["datetime"].is_unique


In [None]:
df_cons = df_cons.sort_values("datetime")
df_cons["datetime"].diff().value_counts().head(10)
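
The distribution of deltas shows how many irregular steps there are, but not where they fall. A small sketch of a gap locator is given below. `find_gaps` is a hypothetical helper, not a function of the project, and it assumes the timestamps are already deduplicated.

```python
import pandas as pd


def find_gaps(timestamps: pd.Series, expected: pd.Timedelta) -> pd.DataFrame:
    """Return one row per gap: last timestamp before it, first one after it,
    and the number of expected steps missing in between."""
    ts = timestamps.sort_values().reset_index(drop=True)
    delta = ts.diff()
    is_gap = delta > expected  # NaT on the first row compares as False
    return pd.DataFrame(
        {
            "gap_start": ts.shift(1)[is_gap],
            "gap_end": ts[is_gap],
            "missing_steps": (delta[is_gap] / expected).round().astype(int) - 1,
        }
    ).reset_index(drop=True)


# Toy series: 30-min steps, then a jump from 01:00 to 03:00.
ts = pd.Series(
    pd.to_datetime(
        [
            "2020-01-01 00:00",
            "2020-01-01 00:30",
            "2020-01-01 01:00",
            "2020-01-01 03:00",
        ]
    )
)
gaps = find_gaps(ts, pd.Timedelta(minutes=30))
print(gaps)
```

Applied to the notebook's data, the call would be `find_gaps(df_cons["datetime"], pd.Timedelta(minutes=30))`.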


## 2. Weather: cleaning, checks, national aggregation

- Cleaning: columns, datetime (UTC → naive), duplicates on (`city`, `datetime`)
- Hourly continuity check **per city**
- National aggregation: hourly mean across all cities
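
A possible shape for these three steps is sketched below on toy data. It is not the implementation in `weather_cleaning`: the function names and the `temperature` column are hypothetical. It also assumes that "UTC → naive" means dropping the timezone while keeping the UTC wall-clock time, and that the national mean is unweighted.

```python
import pandas as pd


def clean_weather(df: pd.DataFrame) -> pd.DataFrame:
    """Parse timestamps as UTC, strip the timezone, drop (city, datetime) duplicates."""
    out = df.copy()
    out["datetime"] = pd.to_datetime(out["datetime"], utc=True).dt.tz_localize(None)
    out = out.drop_duplicates(subset=["city", "datetime"], keep="first")
    return out.sort_values(["city", "datetime"]).reset_index(drop=True)


def hourly_gaps_per_city(df: pd.DataFrame) -> pd.Series:
    """Count, for each city, the consecutive steps that are not exactly one hour."""
    deltas = df.groupby("city")["datetime"].diff()
    irregular = deltas.notna() & (deltas != pd.Timedelta(hours=1))
    return irregular.groupby(df["city"]).sum()


def national_hourly_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Average every numeric variable across cities for each hour (unweighted)."""
    return df.groupby("datetime").mean(numeric_only=True).reset_index()


# Toy input: Paris has a duplicated row, Lyon is missing 02:00.
raw = pd.DataFrame(
    {
        "city": ["Paris", "Paris", "Paris", "Lyon", "Lyon", "Lyon"],
        "datetime": [
            "2019-01-01T00:00:00Z",
            "2019-01-01T01:00:00Z",
            "2019-01-01T01:00:00Z",
            "2019-01-01T00:00:00Z",
            "2019-01-01T01:00:00Z",
            "2019-01-01T03:00:00Z",
        ],
        "temperature": [4.0, 5.0, 5.0, 2.0, 3.0, 1.0],
    }
)
weather = clean_weather(raw)
gaps = hourly_gaps_per_city(weather)
national = national_hourly_mean(weather)
print(gaps)
print(national)
```

Note that with an unweighted mean, an hour where only some cities report (03:00 in the toy data) still produces a national value, computed from the cities present.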


In [None]:
from src.data.weather_cleaning import WeatherCleanConfig, build_weather_national_dataset

# Adjust here if needed (e.g. a single global CSV covering all years)
raw_weather_path = DATA_RAW / "weather" / "weather_32_cities_2019.csv"

cfg_w = WeatherCleanConfig(
    raw_path=raw_weather_path,
    out_path=DATA_PROCESSED / "weather_national_hourly.parquet",
)

df_weather_nat, report_weather = build_weather_national_dataset(cfg_w)

display(report_weather.head(10))
df_weather_nat.head()


## 3. Time alignment and merge (baseline)

- Consumption: hourly resample (mean)
- Merge on `datetime` (inner join)
- Checks: duplicates, hourly continuity
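
The alignment described above can be illustrated with plain pandas. This is a sketch under assumptions, not the code of `merge_datasets`: `to_hourly` and `merge_hourly` are hypothetical names, `temperature` is a placeholder weather column, and dropping the empty hours created by the resample is a choice made for the example.

```python
import pandas as pd


def to_hourly(df_cons: pd.DataFrame) -> pd.DataFrame:
    """Average the 30-min load values inside each clock hour (bins labelled by their start)."""
    return (
        df_cons.set_index("datetime")["load_mw"]
        .resample("1h")
        .mean()
        .dropna()  # hours with no source rows are dropped, not imputed
        .reset_index()
    )


def merge_hourly(df_cons_h: pd.DataFrame, df_weather: pd.DataFrame) -> pd.DataFrame:
    """Inner join on datetime; validate= raises if either side has duplicate keys."""
    return df_cons_h.merge(df_weather, on="datetime", how="inner", validate="one_to_one")


cons = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            [
                "2019-01-01 00:00",
                "2019-01-01 00:30",
                "2019-01-01 01:00",
                "2019-01-01 01:30",
                "2019-01-01 03:00",
            ]
        ),
        "load_mw": [100.0, 200.0, 300.0, 500.0, 50.0],
    }
)
weather = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            ["2019-01-01 00:00", "2019-01-01 01:00", "2019-01-01 02:00"]
        ),
        "temperature": [3.0, 4.0, 5.0],
    }
)
hourly = to_hourly(cons)
merged = merge_hourly(hourly, weather)
print(merged)
```

Because the join is an inner join, any hour missing on either side disappears from the result, so the continuity check that follows the merge is what reveals source gaps.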


In [None]:
from src.data.merge_datasets import MergeConfig, build_hourly_dataset

cfg_merge = MergeConfig(
    consumption_path=DATA_PROCESSED / "consommation_clean.parquet",
    weather_path=DATA_PROCESSED / "weather_national_hourly.parquet",
    out_path=DATA_PROCESSED / "dataset_model_hourly.parquet",
)

df_model = build_hourly_dataset(cfg_merge)
df_model.head()


In [None]:
# Basic checks
df_model["datetime"].duplicated().sum(), df_model["datetime"].diff().value_counts().head(5)


## 4. Focus: diagnosing the 2020 "gap" (consumption)

Goal: confirm that the 2020 discontinuity really comes from the **source files**.

- Filter 2020 in the clean dataset
- Compare its monthly distribution with that of the raw 2020 file


In [None]:
df_cons_2020 = df_cons[df_cons["datetime"].dt.year == 2020].copy()
df_cons_2020["month"] = df_cons_2020["datetime"].dt.month
df_cons_2020.groupby("month").size()


In [None]:
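import pandas as pd  # not imported in any earlier cell of this notebook

raw_2020_path = DATA_RAW / "consumption" / "consommation_2020_long.csv"
df_raw_2020 = pd.read_csv(raw_2020_path)
# Row count per month in the raw file, to compare with the clean dataset
pd.to_datetime(df_raw_2020["datetime"]).dt.month.value_counts().sort_index()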
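
To read the two monthly distributions side by side, one option is to put them in a single table. The sketch below runs on toy timestamps. `monthly_counts` and `compare_monthly` are hypothetical helpers, not part of the project.

```python
import pandas as pd


def monthly_counts(timestamps: pd.Series) -> pd.Series:
    """Number of rows per calendar month (1 to 12), with 0 for absent months."""
    counts = pd.to_datetime(timestamps).dt.month.value_counts()
    return counts.reindex(range(1, 13), fill_value=0)


def compare_monthly(clean_ts: pd.Series, raw_ts: pd.Series) -> pd.DataFrame:
    """Side-by-side monthly row counts, flagging months absent from both inputs."""
    table = pd.DataFrame(
        {"clean": monthly_counts(clean_ts), "raw": monthly_counts(raw_ts)}
    )
    table.index.name = "month"
    table["missing_in_both"] = (table["clean"] == 0) & (table["raw"] == 0)
    return table


# Toy data: the raw series has one extra (duplicate) January row.
clean_ts = pd.Series(["2020-01-01 00:00", "2020-01-01 00:30", "2020-10-01 00:00"])
raw_ts = pd.Series(
    ["2020-01-01 00:00", "2020-01-01 00:30", "2020-01-01 00:30", "2020-10-01 00:00"]
)
table = compare_monthly(clean_ts, raw_ts)
print(table)
```

With the notebook's variables, the call would be `compare_monthly(df_cons_2020["datetime"], df_raw_2020["datetime"])`. Months flagged in `missing_in_both` are absent from the raw file as well, so the cleaning step did not remove them.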


## 5. Outputs

- `data/processed/consommation_clean.parquet`
- `data/processed/weather_national_hourly.parquet`
- `data/processed/dataset_model_hourly.parquet`
