02 – Time Index Audit, Resampling, Early EDA, Data Quality

This notebook:

Loads scada_wide.parquet and signal_catalog.parquet

Normalizes the wide table to long format (one row per timestamp and signal)

Audits the time index (duplicates, gaps, median dt)

Resamples to a standard timebase (e.g., 5min)

Runs early EDA (missingness, distributions, sanity plots)

Computes data quality (DQ) scores and monitoring confidence

Writes Parquet stage files:

outputs/stages/scada_long.parquet

outputs/stages/scada_rs.parquet

outputs/stages/time_audit.parquet

outputs/stages/dq_report.parquet
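
As a rough illustration of what the time index audit checks (duplicates, gaps, median dt), here is a minimal sketch on toy timestamps. The function name `audit_time_index_sketch` and the rule "a gap is any step larger than the median dt" are assumptions made for this example; they are not taken from the pv_fleet_health package, whose audit may differ.

```python
import pandas as pd


def audit_time_index_sketch(ts: pd.Series) -> dict:
    """Hypothetical audit: count duplicate stamps, estimate dt, count gaps."""
    ts = pd.Series(pd.to_datetime(ts)).sort_values().reset_index(drop=True)

    # Duplicates: timestamps that appear more than once.
    n_duplicates = int(ts.duplicated().sum())

    # Step sizes between consecutive unique timestamps.
    deltas = ts.drop_duplicates().diff().dropna()

    # The median step is a robust estimate of the nominal sampling interval.
    median_dt = deltas.median()

    # Assumption: any step larger than the median counts as a gap.
    n_gaps = int((deltas > median_dt).sum())

    return {"n_duplicates": n_duplicates, "median_dt": median_dt, "n_gaps": n_gaps}


# Toy data: one duplicated stamp (00:30) and one missing stretch (00:45 to 01:30).
toy_ts = pd.Series(
    pd.to_datetime(
        [
            "2025-12-01 00:15",
            "2025-12-01 00:30",
            "2025-12-01 00:30",
            "2025-12-01 00:45",
            "2025-12-01 01:30",
            "2025-12-01 01:45",
        ]
    )
)
report = audit_time_index_sketch(toy_ts)
print(report)
```

Using the median rather than the mean keeps the dt estimate stable when a few large gaps are present.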

In [16]:
import importlib
from pathlib import Path

import pandas as pd

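import pv_fleet_health.scada_reshape as scada_reshape

# Reload so that edits to scada_reshape are picked up without restarting the kernel.
importlib.reload(scada_reshape)
from pv_fleet_health.config import load_config_yaml
from pv_fleet_health.io import save_parquet
from pv_fleet_health.paths import Paths

# Import after the reload so the name binds to the reloaded function.
from pv_fleet_health.scada_reshape import wide_to_long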

ROOT = Path("..").resolve()
paths = Paths(ROOT)
paths.ensure()
cfg = load_config_yaml(str(ROOT / "config.yaml"))

In [13]:
scada_wide = pd.read_parquet(paths.stage_dir / "scada_wide.parquet")
signal_catalog = pd.read_parquet(paths.stage_dir / "signal_catalog.parquet")

print("Loaded scada_wide:", scada_wide.shape)
print("Loaded signal_catalog:", signal_catalog.shape)

# Optional debug mode: when a plant is selected, keep only the timestamp
# column and the raw columns listed in signal_catalog.
if cfg.selected_plant is not None:
    keep_cols = [cfg.timestamp_col] + signal_catalog["raw_column_name"].tolist()
    scada_wide = scada_wide[keep_cols].copy()
    print("Filtered debug plant columns:", scada_wide.shape)

Loaded scada_wide: (2400, 244)
Loaded signal_catalog: (243, 12)


In [17]:
scada_long = wide_to_long(scada_wide, signal_catalog, cfg.timestamp_col)
print("scada_long:", scada_long.shape)
print(scada_long[["ts", "plant_name", "component_type", "canonical_signal", "value"]].head(5))

save_parquet(scada_long, str(paths.stage_dir / "scada_long.parquet"))

scada_long: (583200, 11)
                         ts                      plant_name component_type  \
0 2025-12-01 00:15:00+02:00  Solar Concept 3721 KWp Lexaina          array   
1 2025-12-01 00:30:00+02:00  Solar Concept 3721 KWp Lexaina          array   
2 2025-12-01 00:45:00+02:00  Solar Concept 3721 KWp Lexaina          array   
3 2025-12-01 01:00:00+02:00  Solar Concept 3721 KWp Lexaina          array   
4 2025-12-01 01:15:00+02:00  Solar Concept 3721 KWp Lexaina          array   

        canonical_signal     value  
0  ac_frequency_error_hz -1.890000  
1  ac_frequency_error_hz -0.643333  
2  ac_frequency_error_hz -1.434444  
3  ac_frequency_error_hz  0.094444  
4  ac_frequency_error_hz -2.107778  
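
The row count above is consistent with one long row per timestamp and signal: 2400 timestamps × 243 signal columns = 583200 rows. The internals of `wide_to_long` are not shown in this notebook; the sketch below only illustrates the general wide-to-long technique with `DataFrame.melt` followed by a catalog join. The function `wide_to_long_sketch`, the raw column names, and the toy catalog values are hypothetical and chosen for illustration.

```python
import pandas as pd

# Toy wide frame: one timestamp column plus one column per raw signal.
# Column names here are hypothetical.
wide = pd.DataFrame(
    {
        "ts": pd.to_datetime(["2025-12-01 00:15", "2025-12-01 00:30"]),
        "INV1_P_AC": [0.0, 1.5],
        "INV1_F_ERR": [-1.89, -0.64],
    }
)

# Toy catalog: maps each raw column to descriptive metadata.
catalog = pd.DataFrame(
    {
        "raw_column_name": ["INV1_P_AC", "INV1_F_ERR"],
        "plant_name": ["Plant A", "Plant A"],
        "component_type": ["inverter", "inverter"],
        "canonical_signal": ["ac_power_kw", "ac_frequency_error_hz"],
    }
)


def wide_to_long_sketch(
    wide: pd.DataFrame, catalog: pd.DataFrame, timestamp_col: str
) -> pd.DataFrame:
    """Illustrative only: melt signal columns, then attach catalog metadata."""
    long = wide.melt(
        id_vars=[timestamp_col],
        value_vars=catalog["raw_column_name"].tolist(),
        var_name="raw_column_name",
        value_name="value",
    )
    # Each raw column must match at most one catalog row.
    return long.merge(
        catalog, on="raw_column_name", how="left", validate="many_to_one"
    )


long = wide_to_long_sketch(wide, catalog, "ts")
print(long.shape)  # rows = timestamps x signals
```

The `validate="many_to_one"` argument makes the merge fail loudly if the catalog lists a raw column twice, which would otherwise silently duplicate rows.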
