# Building Energy Anomaly Detection - Final Report

## 1. Executive Summary
This report consolidates the findings from our analysis of the Building Data Genome Project 2 (BDG2) dataset. We implemented an end-to-end machine learning pipeline to detect energy anomalies in commercial buildings.

**Key Points:**
- **Objective**: Detect and quantify abnormal energy consumption patterns.
- **Method**: Ensemble of Isolation Forest, Local Outlier Factor (LOF), and Elliptic Envelope.
- **Impact**: Identification of potential energy waste events to drive cost savings.
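
One way such an ensemble could be combined is by majority vote across the three detectors. The sketch below uses scikit-learn and synthetic data. The function name `ensemble_anomaly_votes`, the `contamination` value, and the two-of-three voting rule are assumptions for illustration, since the report does not show how `train_anomaly_models` merges the individual models.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor


def ensemble_anomaly_votes(X, contamination=0.05, min_votes=2, random_state=42):
    """Hypothetical helper: flag rows that at least `min_votes` detectors mark as outliers."""
    detectors = [
        IsolationForest(contamination=contamination, random_state=random_state),
        LocalOutlierFactor(n_neighbors=20, contamination=contamination),
        EllipticEnvelope(contamination=contamination, random_state=random_state),
    ]
    # scikit-learn outlier detectors return -1 for outliers and +1 for inliers.
    labels = np.column_stack([d.fit_predict(X) for d in detectors])
    votes = (labels == -1).sum(axis=1)
    return votes >= min_votes, votes


# Synthetic demo data (not BDG2): 200 typical points plus 5 obvious outliers.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
extreme = np.array([[8.0, 8.0], [-8.0, 8.0], [8.0, -8.0], [-8.0, -8.0], [0.0, 10.0]])
X_demo = np.vstack([normal, extreme])

is_anomaly, votes = ensemble_anomaly_votes(X_demo)
print(f"Flagged {is_anomaly.sum()} of {len(X_demo)} points")
```

Requiring agreement between at least two detectors tends to reduce false positives from any single model, at the cost of missing anomalies that only one method can see.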

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os

# Setup path
current_dir = os.getcwd()
project_root = os.path.dirname(current_dir)
if project_root not in sys.path:
    sys.path.append(project_root)

from src.data_loader import load_data
from src.preprocessing import preprocess_data
from src.features import engineer_features
from src.models import train_anomaly_models

# Visual settings
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('muted')

## 2. Data Pipeline Execution
Loading data, cleaning, generating features, and running the anomaly detection models.

In [None]:
# 1. Load Data (Limit to 100 buildings for performance in this report)
print("Loading data...")
data_path = os.path.join(project_root, 'data')
df_raw = load_data(data_path, building_limit=100)

# 2. Preprocess
print("Preprocessing...")
df_clean = preprocess_data(df_raw)

# 3. Feature Engineering
print("Engineering features...")
df = engineer_features(df_clean)

# 4. Modeling
features = ['electricity', 'chilled_water', 'steam', 'temperature', 'humidity', 
            'electricity_rolling_mean', 'electricity_deviation', 'hour', 'day_of_week']
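# Keep only the candidate features that actually exist in df
model_features = [c for c in features if c in df.columns]
# Missing values are filled with 0. Note that 0 is also a valid normalized reading,
# so imputed rows can look like very low consumption to the models.
X = df[model_features].fillna(0)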

print("Training models...")
output = train_anomaly_models(X)
results = output['results']
df_final = pd.concat([df, results], axis=1)
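
The feature names in the cell above (`electricity_rolling_mean`, `electricity_deviation`, `hour`, `day_of_week`) suggest a rolling baseline plus calendar features. The following is a minimal sketch of how such columns could be derived with pandas. It is a hypothetical stand-in, not the project's `engineer_features`, and the 24-reading window is an assumption.

```python
import pandas as pd


def add_basic_features(df, window=24):
    """Hypothetical stand-in for engineer_features; not the project's actual code."""
    out = df.sort_values("timestamp").copy()
    out["hour"] = out["timestamp"].dt.hour
    out["day_of_week"] = out["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6
    # Trailing mean over the last `window` readings, including the current one.
    # min_periods=1 avoids NaN values at the start of the series.
    out["electricity_rolling_mean"] = (
        out["electricity"].rolling(window=window, min_periods=1).mean()
    )
    # Positive deviation means the reading is above its recent average.
    out["electricity_deviation"] = out["electricity"] - out["electricity_rolling_mean"]
    return out


# Synthetic single-building demo: 47 flat hourly readings, then one spike.
demo = pd.DataFrame({
    "timestamp": pd.date_range("2017-01-02", periods=48, freq=pd.Timedelta(hours=1)),
    "electricity": [0.2] * 47 + [0.9],
})
demo_features = add_basic_features(demo)
print(demo_features.tail(3))
```

With several buildings in one frame, the rolling statistics would need to be computed per building (for example inside a `groupby`), otherwise the window mixes readings from different sites.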

## 3. Anomaly Analysis
Visualizing the detected anomalies to understand their distribution and characteristics.

In [None]:
# Calculate Anomaly Rate
n_anomalies = df_final['is_anomaly'].sum()
total_points = len(df_final)
rate = (n_anomalies / total_points) * 100
print(f"Total Data Points: {total_points:,}")
print(f"Detected Anomalies: {n_anomalies:,} ({rate:.2f}%)")

In [None]:
# Visualization: anomalies over time (first 2,000 rows of df_final)
plt.figure(figsize=(15, 6))
subset = df_final.iloc[:2000]
# The line shows every reading in the window; red markers overlay the flagged ones
plt.plot(subset['timestamp'], subset['electricity'], label='Electricity Consumption', alpha=0.6, color='tab:blue')
anomalies = subset[subset['is_anomaly'] == 1]
plt.scatter(anomalies['timestamp'], anomalies['electricity'], color='red', label='Anomaly', s=30, zorder=5)
plt.title('Electricity Consumption & Detected Anomalies (Sample Window)', fontsize=14)
plt.xlabel('Date')
plt.ylabel('Normalized Consumption')
plt.legend()
plt.show()

## 4. Temporal Patterns
When do anomalies happen most frequently?

In [None]:
# Anomalies by Hour of Day
anomaly_df = df_final[df_final['is_anomaly'] == 1]
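# Reindex to all 24 hours so hours with no anomalies appear as zero and the
# bar positions line up with the 0-23 tick labels set below
hourly_counts = anomaly_df['hour'].value_counts().reindex(range(24), fill_value=0)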

plt.figure(figsize=(12, 5))
sns.barplot(x=hourly_counts.index, y=hourly_counts.values, color='salmon')
plt.title('Frequency of Anomalies by Hour of Day', fontsize=14)
plt.xlabel('Hour (0-23)')
plt.ylabel('Number of Anomalies')
plt.xticks(range(0, 24))
plt.show()

In [None]:
# Anomalies by Day of Week
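# Reindex to all 7 days (0 = Monday, assuming the pandas dayofweek convention)
# so days with no anomalies still appear as zero
day_counts = anomaly_df['day_of_week'].value_counts().reindex(range(7), fill_value=0)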
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

plt.figure(figsize=(10, 5))
sns.barplot(x=[days[i] for i in day_counts.index], y=day_counts.values, color='skyblue')
plt.title('Frequency of Anomalies by Day of Week', fontsize=14)
plt.ylabel('Number of Anomalies')
plt.show()

## 5. Business Impact & Cost Analysis
Quantifying the financial impact of the detected anomalies.

In [None]:
# Cost Estimation Assumptions
avg_kwh_cost = 0.12 # $0.12 per kWh

# Note: the 'electricity' column is normalized to [0, 1], so the sums below are in
# relative units, not kWh. A real dollar figure requires an inverse transform to kWh.
# The anomalous total is all consumption recorded at flagged timestamps, not the
# excess above an expected baseline, so treat it as a rough indicator of waste.

total_energy = df_final['electricity'].sum()
anomalous_energy = anomaly_df['electricity'].sum()
percent_waste = (anomalous_energy / total_energy) * 100

# Hypothetical cost: a dollar amount only if the summed values were kWh
estimated_waste_cost = anomalous_energy * avg_kwh_cost

print(f"Total Energy Units Processed: {total_energy:,.0f}")
print(f"Total Anomalous Energy Units: {anomalous_energy:,.0f}")
print(f"Potential Waste: {percent_waste:.1f}% of total consumption")
print(f"\nESTIMATED FINANCIAL IMPACT (Normalized Units): ${estimated_waste_cost:,.2f}")
print("(Note: Real dollar value requires inverse-scaling to original kWh values)")
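
The inverse-scaling step mentioned in the note can be sketched as follows, assuming the data was min-max scaled, which the [0, 1] range suggests but the report does not confirm. The helper `to_kwh` and the bounds `kwh_min` and `kwh_max` are hypothetical; real bounds would have to come from the scaler fitted during preprocessing.

```python
import pandas as pd

AVG_KWH_COST = 0.12  # $ per kWh, same assumption as the report


def to_kwh(normalized, kwh_min, kwh_max):
    """Invert min-max scaling, where x_norm = (x - min) / (max - min)."""
    return normalized * (kwh_max - kwh_min) + kwh_min


# Tiny synthetic example (not BDG2 data).
demo = pd.DataFrame({
    "electricity": [0.0, 0.5, 1.0, 0.25],
    "is_anomaly": [0, 0, 1, 0],
})
# Hypothetical scaling bounds, chosen only for illustration.
demo["electricity_kwh"] = to_kwh(demo["electricity"], kwh_min=10.0, kwh_max=110.0)

anomalous_kwh = demo.loc[demo["is_anomaly"] == 1, "electricity_kwh"].sum()
anomalous_cost = anomalous_kwh * AVG_KWH_COST
print(f"Anomalous energy: {anomalous_kwh:.1f} kWh, cost ${anomalous_cost:.2f}")
```

If the scaling was fitted per building, the bounds differ per building and the inverse transform has to be applied group by group before summing.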

## 6. Recommendations

Based on the analysis:
1.  **Investigate Peak Hours**: Focus maintenance teams on the hours that stand out in the "Anomalies by Hour" chart. These are often occupancy transition periods such as 6-8 AM or 6-8 PM, but confirm this against the chart before acting.
2.  **Weekend Audits**: If weekend anomalies are high, check for equipment that fails to shut down, for example HVAC setback schedules that are not being applied.
3.  **Automated Alerts**: Deploy this model to flag anomalies in real time, allowing facility managers to intervene before costs accumulate.
4.  **Hardware Check**: For buildings with persistent anomalies, physically inspect sensors and major equipment (chillers, boilers).
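
To decide which buildings count as having "persistent anomalies", one option is to rank them by the share of flagged readings. The sketch below assumes a `building_id` column, which is hypothetical because the report does not show the schema of `df_final`. The `min_readings` threshold is likewise an illustrative choice.

```python
import pandas as pd


def rank_buildings_by_anomaly_rate(df, id_col="building_id", min_readings=3):
    """Hypothetical helper: per-building anomaly counts and rates, highest rate first."""
    summary = df.groupby(id_col)["is_anomaly"].agg(
        readings="count", anomalies="sum", anomaly_rate="mean"
    )
    # Drop buildings with too few readings for the rate to be meaningful.
    summary = summary[summary["readings"] >= min_readings]
    return summary.sort_values("anomaly_rate", ascending=False)


# Synthetic example (not BDG2 data).
demo = pd.DataFrame({
    "building_id": ["A", "A", "A", "A", "B", "B", "B", "B", "C"],
    "is_anomaly": [0, 0, 1, 1, 0, 0, 0, 0, 1],
})
ranking = rank_buildings_by_anomaly_rate(demo)
print(ranking)
```

Buildings at the top of such a ranking would be the first candidates for a physical inspection.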