# 01 - Data exploration
This notebook is meant to help **understand the structure** of the data (time range, hourly continuity, missing values).

**Expected inputs** (generated by the pipeline):
- `data/processed/consommation_clean.parquet`
- (optional) `data/processed/weather_national_hourly.parquet`


In [None]:
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt

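# The notebook may be launched from notebooks/; in that case the repo root is one level up.
_cwd = Path('.').resolve()
PROJECT_ROOT = _cwd.parent if _cwd.name == 'notebooks' else _cwd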
DATA_PROCESSED = PROJECT_ROOT / 'data' / 'processed'
OUTPUT_FIG = PROJECT_ROOT / 'outputs' / 'figures'
OUTPUT_FIG.mkdir(parents=True, exist_ok=True)

CONS_PATH = DATA_PROCESSED / 'consommation_clean.parquet'
WEATHER_PATH = DATA_PROCESSED / 'weather_national_hourly.parquet'

CONS_PATH.exists(), WEATHER_PATH.exists()

/home/onyxia/work/france-grid-stress-prediction/data/processed/consommation_clean.parquet

(False, False)

## 1) Loading

In [2]:
df_load = pd.read_parquet(CONS_PATH)
df_load['datetime'] = pd.to_datetime(df_load['datetime'])
df_load = df_load.sort_values('datetime').reset_index(drop=True)
df_load.head()

FileNotFoundError: [Errno 2] No such file or directory: '/home/onyxia/work/france-grid-stress-prediction/notebooks/data/processed/consommation_clean.parquet'
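
The path in the error above contains `notebooks/`, which suggests the kernel's working directory was the `notebooks` folder rather than the repository root. One way to make path resolution independent of the launch directory is to walk up from the current directory until a folder containing `data/` is found. The sketch below assumes that layout; `find_project_root` and its `marker` parameter are hypothetical names, not part of the project.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Return the first directory at or above `start` that contains `marker`.

    `marker` is an assumption about the repository layout (a top-level
    `data/` folder); adjust it if the project uses a different anchor.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(
        f"No directory at or above {start} contains a '{marker}' folder"
    )


# Possible usage in the setup cell (not executed here):
# PROJECT_ROOT = find_project_root(Path.cwd())
```

Raising an explicit error when no marker is found makes a misconfigured environment fail in the setup cell rather than later at `pd.read_parquet`.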

In [None]:
df_load.info()

In [None]:
df_load[['datetime','load_mw']].describe(include='all')

## 2) Time coverage and continuity

In [None]:
start, end = df_load['datetime'].min(), df_load['datetime'].max()
freq_counts = df_load['datetime'].diff().value_counts().head(10)
start, end, freq_counts

In [None]:
# Check the gaps between consecutive timestamps (expected step: 1 hour)
gaps = df_load['datetime'].diff().dropna()
largest_gaps = gaps.sort_values(ascending=False).head(20)
largest_gaps
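
The sorted differences above show how large the gaps are but not where they occur, since the result is indexed by row position. As a possible complement, the sketch below builds a small table with the start, end and duration of each gap. The function name `gap_table`, its column names and the default one-hour step are illustrative assumptions.

```python
import pandas as pd


def gap_table(timestamps: pd.Series,
              expected: pd.Timedelta = pd.Timedelta(hours=1)) -> pd.DataFrame:
    """List every interval where consecutive timestamps are more than `expected` apart."""
    ts = timestamps.sort_values().reset_index(drop=True)
    delta = ts.diff()
    is_gap = delta > expected  # the first row has NaT, which compares as False

    gaps = pd.DataFrame({
        "gap_start": ts.shift(1)[is_gap],  # last timestamp before the gap
        "gap_end": ts[is_gap],             # first timestamp after the gap
        "duration": delta[is_gap],
    })
    # Number of timestamps that would be needed to fill the gap at the expected step.
    gaps["missing_steps"] = (gaps["duration"] / expected - 1).round().astype(int)
    return gaps.sort_values("duration", ascending=False).reset_index(drop=True)


# Possible usage in this notebook (not executed here):
# gap_table(df_load['datetime']).head(20)
```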

In [None]:
# Number of missing values in the target
df_load['load_mw'].isna().sum()
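
A single total does not show when the missing values occur. A minimal sketch of a per-year breakdown follows, assuming the `datetime` and `load_mw` columns used above; `missing_share_by_year` is a hypothetical helper. It only counts `NaN` values in rows that exist, so timestamps absent from the file altogether are not reflected.

```python
import pandas as pd


def missing_share_by_year(df: pd.DataFrame,
                          time_col: str = "datetime",
                          value_col: str = "load_mw") -> pd.Series:
    """Fraction of rows per calendar year whose `value_col` is NaN."""
    is_missing = df[value_col].isna()
    # The mean of a boolean series is the share of True values.
    return is_missing.groupby(df[time_col].dt.year).mean()


# Possible usage in this notebook (not executed here):
# missing_share_by_year(df_load)
```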

## 3) Quick visualizations

In [None]:
plt.figure(figsize=(14,4))
plt.plot(df_load['datetime'], df_load['load_mw'])
plt.title('National electricity consumption: full series')
plt.xlabel('Time')
plt.ylabel('MW')
plt.tight_layout()
plt.savefig(OUTPUT_FIG / 'load_timeseries_full.png', dpi=150)
plt.show()

In [None]:
# Zoom on 2019-2020 to visualize the sparse period in 2020
# Exclusive upper bound: '2020-12-31' alone parses as midnight and would drop most of that day.
mask = (df_load['datetime'] >= '2019-01-01') & (df_load['datetime'] < '2021-01-01')
df_zoom = df_load.loc[mask]

plt.figure(figsize=(14,4))
plt.plot(df_zoom['datetime'], df_zoom['load_mw'])
plt.title('Consumption: 2019-2020 zoom (visualizing the gaps)')
plt.xlabel('Time')
plt.ylabel('MW')
plt.tight_layout()
plt.savefig(OUTPUT_FIG / 'load_timeseries_2019_2020.png', dpi=150)
plt.show()
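
If some hours are absent from the file altogether (rather than present with `NaN`), matplotlib draws a straight segment across the hole, which can hide the gap. One option is to reindex the series onto a complete hourly grid first, so that missing hours become `NaN` and the line breaks there. The sketch below assumes the `datetime` and `load_mw` columns used above; `to_hourly_grid` is a hypothetical helper.

```python
import pandas as pd


def to_hourly_grid(df: pd.DataFrame,
                   time_col: str = "datetime",
                   value_col: str = "load_mw") -> pd.Series:
    """Return `value_col` on a complete hourly index; absent hours become NaN."""
    series = (
        df.drop_duplicates(time_col)  # reindex requires unique timestamps
          .set_index(time_col)[value_col]
          .sort_index()
    )
    full_index = pd.date_range(
        series.index.min(), series.index.max(), freq=pd.Timedelta(hours=1)
    )
    return series.reindex(full_index)


# Possible usage in this notebook (not executed here):
# hourly = to_hourly_grid(df_zoom)
# plt.plot(hourly.index, hourly.values)
```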

## 4) Aggregated weather (optional)

In [None]:
if WEATHER_PATH.exists():
    df_w = pd.read_parquet(WEATHER_PATH)
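    if 'datetime' in df_w.columns:
        df_w['datetime'] = pd.to_datetime(df_w['datetime'])
        # Sort only when the column exists; otherwise sort_values would raise a KeyError.
        df_w = df_w.sort_values('datetime').reset_index(drop=True)
    display(df_w.head())
    df_w.info()  # info() prints directly and returns None, so it needs no display() wrapper
else:
    print('weather_national_hourly.parquet not found; section skipped.')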

## Notes
- The detailed descriptive analysis (weekly profiles, monthly boxplots, etc.) has been moved to `05_analysis.ipynb`.
- This notebook should stay lightweight and reproducible.