# UK-DALE - Exploratory Data Analysis

## Overview
This notebook explores the UK Domestic Appliance-Level Electricity (UK-DALE) dataset, which contains ~114M readings from 5 households with appliance-level monitoring.

**Student**: Vatsal Mehta (220408633@aston.ac.uk)
**Supervisor**: Dr. Farzaneh Farhadi
**Project**: Grid Guardian - AZR Energy Forecasting & Anomaly Detection

In [1]:
# Setup and imports
import pandas as pd
import polars as pl
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json
import warnings
warnings.filterwarnings("ignore")

plt.style.use("seaborn-v0_8-whitegrid")
sns.set_palette("husl")
plt.rcParams["figure.dpi"] = 150
plt.rcParams["savefig.dpi"] = 300

PROJECT_ROOT = Path("..").resolve()
DATA_ROOT = PROJECT_ROOT / "data"
UKDALE_PATH = DATA_ROOT / "processed" / "ukdale_data"  # Updated path
FIGURES_DIR = PROJECT_ROOT / "docs" / "figures"
FIGURES_DIR.mkdir(parents=True, exist_ok=True)

print("Setup complete")

Setup complete


## Load and Validate Data

**Purpose**: Load UK-DALE processed data and verify schema

**Expected**: ~114M records at 30-minute intervals with appliance metadata

In [2]:
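# Load sample data lazily; head() takes the first 1M rows, not a random sample,
# so the sample may not cover every household
print("Loading UK-DALE sample (1M records)...")
df_sample = pl.scan_parquet(str(UKDALE_PATH / "*.parquet")).head(1_000_000).collect()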
print(f"Loaded {len(df_sample):,} records")
print(f"Columns: {df_sample.columns}")

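# Convert to pandas and unpack the JSON metadata stored in "extras"
df_pd = df_sample.to_pandas()
df_pd["extras_parsed"] = df_pd["extras"].apply(json.loads)
df_pd["channel"] = df_pd["extras_parsed"].apply(lambda x: x.get("channel", "unknown"))
# NOTE: this keeps only the text before the first "_" in entity_id,
# which yields just the shared prefix in the output below
df_pd["building"] = df_pd["entity_id"].str.split("_").str[0]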

print(f"Buildings: {df_pd['building'].unique()}")
print(f"Unique appliances: {df_pd['channel'].nunique()}")

Loading UK-DALE sample (1M records)...
Loaded 1,000,000 records
Columns: ['dataset', 'entity_id', 'ts_utc', 'interval_mins', 'energy_kwh', 'source', 'extras']
Buildings: ['house']
Unique appliances: 19
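
The stated purpose of this section is to verify the schema, and the `Buildings: ['house']` output suggests the house number is lost when `entity_id` is split. The following is a minimal sketch of both checks, not the notebook's actual method. It assumes `entity_id` values begin with `house_<number>` (a hypothetical pattern inferred from the output, not confirmed by the data); the expected column list is copied from the printed `Columns:` line, and the helper names `missing_columns` and `add_building` are illustrative.

```python
import pandas as pd

# Columns as printed by the load cell above.
EXPECTED_COLUMNS = ["dataset", "entity_id", "ts_utc", "interval_mins",
                    "energy_kwh", "source", "extras"]


def missing_columns(df: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected columns that are absent from df, in order."""
    return [col for col in expected if col not in df.columns]


def add_building(df: pd.DataFrame) -> pd.DataFrame:
    """Add a 'building' column holding the full house identifier.

    ASSUMPTION: entity_id starts with 'house_<number>'; rows that do not
    match this hypothetical pattern get a missing value instead of a guess.
    """
    out = df.copy()
    out["building"] = out["entity_id"].str.extract(r"^(house_\d+)", expand=False)
    return out


# Tiny synthetic frame; the entity_id values are made up for illustration.
demo = pd.DataFrame({
    "entity_id": ["house_1_ch2", "house_2_ch5", "unexpected"],
    "energy_kwh": [0.01, 0.02, 0.03],
})
demo = add_building(demo)
print(missing_columns(demo))
print(demo["building"].tolist())
```

On the notebook's own frame, `missing_columns(df_pd)` should return an empty list if the schema matches, and `add_building(df_pd)["building"].unique()` would show whether the 1M-row sample spans more than one household.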


## Consumption Analysis

**Purpose**: Understand appliance-level consumption patterns

In [3]:
# Consumption statistics
print("=== Energy Consumption Statistics ===")
print(df_pd["energy_kwh"].describe())

print("\n=== Top 15 Appliances by Total Consumption ===")
# Per-channel totals and summary statistics, ranked by total energy
appliance_totals = (
    df_pd.groupby("channel")["energy_kwh"]
    .agg(["sum", "mean", "median", "std", "count"])
    .sort_values("sum", ascending=False)
)
display(appliance_totals.head(15))

# NOTE: the "aggregate" channel is part of both the ranking and total_energy,
# so the "top 5" share below counts it alongside the individual appliances
total_energy = df_pd["energy_kwh"].sum()
appliance_totals["pct_contribution"] = (appliance_totals["sum"] / total_energy) * 100
print(f"\nTop 5 appliances: {appliance_totals.head(5)['pct_contribution'].sum():.1f}% of consumption")

=== Energy Consumption Statistics ===
count    1000000.000000
mean           0.018985
std            0.061387
min            0.000000
25%            0.000000
50%            0.000460
75%            0.005740
max            1.798765
Name: energy_kwh, dtype: float64

=== Top 15 Appliances by Total Consumption ===


| channel | sum | mean | median | std | count |
|---|---|---|---|---|---|
| aggregate | 13740.400391 | 0.180593 | 0.129822 | 0.136848 | 76085 |
| appliance_12 | 1441.157227 | 0.019135 | 0.016153 | 0.014802 | 75316 |
| appliance_25 | 992.785645 | 0.013752 | 0.0 | 0.022788 | 72193 |
| appliance_2 | 663.961792 | 0.008697 | 0.005762 | 0.008455 | 76348 |
| appliance_10 | 587.59552 | 0.007821 | 0.000484 | 0.026077 | 75129 |
| appliance_13 | 303.901947 | 0.004035 | 0.000485 | 0.019809 | 75320 |
| appliance_11 | 302.983521 | 0.004044 | 0.0 | 0.01985 | 74919 |
| appliance_18 | 249.147354 | 0.003362 | 0.002963 | 0.000872 | 74097 |
| appliance_14 | 178.259888 | 0.010615 | 0.003028 | 0.012279 | 16793 |
| appliance_22 | 142.184998 | 0.15388 | 0.104561 | 0.166529 | 924 |



Top 5 appliances: 91.8% of consumption
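
The 91.8% figure includes the `aggregate` row, which tops the ranking and is also part of the denominator. If `aggregate` is the whole-house mains reading (an assumption here; the notebook does not say so explicitly), summing it with the sub-metered channels double-counts energy. The sketch below shows one way to compute shares among sub-metered channels only; the function name `appliance_shares` and the column name `pct_of_submetered` are illustrative, and the demo numbers are synthetic.

```python
import pandas as pd


def appliance_shares(df: pd.DataFrame, aggregate_label: str = "aggregate") -> pd.DataFrame:
    """Per-appliance energy totals and percentage shares, excluding the aggregate channel.

    ASSUMPTION: rows labelled `aggregate_label` are whole-house mains readings
    and would double-count energy if summed with the sub-metered channels.
    """
    sub = df[df["channel"] != aggregate_label]
    totals = sub.groupby("channel")["energy_kwh"].sum().sort_values(ascending=False)
    out = totals.to_frame("sum")
    out["pct_of_submetered"] = 100 * out["sum"] / out["sum"].sum()
    return out


# Synthetic numbers for illustration only (not taken from UK-DALE).
demo = pd.DataFrame({
    "channel": ["aggregate", "appliance_a", "appliance_a", "appliance_b"],
    "energy_kwh": [10.0, 3.0, 3.0, 2.0],
})
shares = appliance_shares(demo)
print(shares)
```

Applied to the notebook's frame, `appliance_shares(df_pd).head(5)["pct_of_submetered"].sum()` would give the top-five share among sub-metered channels, which can then be compared against the aggregate total separately.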


## Key Findings

### Appliance-Level Insights
- **Always-on baseline**: The fridge and freezer provide a near-constant baseline load
- **Scheduled appliances**: The washing machine and dishwasher show time-of-day patterns
- **High-power bursts**: The kettle and oven produce short-duration, high-power events
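
The three patterns above can be checked per channel with an hour-of-day profile and a simple activity measure. This is a rough sketch rather than the analysis behind the findings. It assumes `ts_utc` parses as a timestamp and reuses the `channel` column built in the load cell; `active_threshold_kwh` is a hypothetical parameter, and the demo channel names and values are synthetic.

```python
import pandas as pd


def hourly_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Mean energy per interval for each channel (rows) and hour of day (columns)."""
    hours = pd.to_datetime(df["ts_utc"], utc=True).dt.hour.rename("hour")
    return df.groupby(["channel", hours])["energy_kwh"].mean().unstack("hour")


def active_fraction(df: pd.DataFrame, active_threshold_kwh: float = 0.0) -> pd.Series:
    """Share of intervals in which each channel draws more than the threshold."""
    is_active = df["energy_kwh"] > active_threshold_kwh
    return is_active.groupby(df["channel"]).mean()


# Synthetic readings for illustration only.
demo = pd.DataFrame({
    "channel": ["fridge_like", "fridge_like", "kettle_like", "kettle_like"],
    "ts_utc": ["2013-01-01 00:00:00", "2013-01-01 07:00:00",
               "2013-01-01 00:00:00", "2013-01-01 07:00:00"],
    "energy_kwh": [0.02, 0.02, 0.0, 0.05],
})
profile = hourly_profile(demo)
active = active_fraction(demo)
print(profile)
print(active)
```

A channel with an active fraction near 1 and a flat hourly profile behaves like an always-on load, while a low active fraction with a few pronounced hours points to scheduled or burst usage.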

### Next Steps
1. Complete LCL exploration
2. Implement appliance-specific anomaly detection
3. Design hierarchical forecasting (aggregate + disaggregated)