# 📊 Notebook 01: Data Collection

**Solar Swarm Intelligence - IEEE PES Energy Utopia Challenge**

This notebook demonstrates:
- Synthetic data generation for 50 households
- 90 days of realistic solar production and consumption data
- Weather data simulation
- Data validation and quality checks

In [None]:
import sys
sys.path.append('..')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from pathlib import Path

# Import our custom generator
from src.data_collection.generate_synthetic import SyntheticDataGenerator

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

print("Libraries imported successfully")

## 1. Initialize Data Generator

We'll generate data for:
- **50 households** (agents)
- **90 days** of operation
- **Hourly resolution** (2,160 data points per house)

In [None]:
# Configuration
NUM_HOUSES = 50
NUM_DAYS = 90
RANDOM_SEED = 42

# Initialize generator
generator = SyntheticDataGenerator(
    num_houses=NUM_HOUSES,
    days=NUM_DAYS,
    seed=RANDOM_SEED
)

print(f"Configured for {NUM_HOUSES} houses")
print(f"Simulation period: {NUM_DAYS} days")
print(f"Total hours: {NUM_DAYS * 24}")
print(f"Expected data points: {NUM_HOUSES * NUM_DAYS * 24:,}")
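
As a cross-check on the arithmetic above, the sketch below builds an hourly index for the simulation period and counts the rows it implies. The start date is a placeholder chosen for illustration; the start date actually used by `SyntheticDataGenerator` is not shown in this notebook.

```python
import pandas as pd

NUM_HOUSES = 50
NUM_DAYS = 90

# Hypothetical start date (assumption); only the length of the index matters here.
hours = pd.date_range(
    start="2024-01-01",
    periods=NUM_DAYS * 24,
    freq=pd.Timedelta(hours=1),
)

hours_per_house = len(hours)
total_rows = NUM_HOUSES * hours_per_house

print(f"Hours per house: {hours_per_house:,}")
print(f"Total rows: {total_rows:,}")
```

With 50 houses and 90 days this gives 2,160 hourly rows per house and 108,000 rows overall.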

## 2. View House Profiles

Each house has its own characteristics (consumption type, panel capacity, battery capacity, EV ownership) that shape its energy patterns.

In [None]:
# Display sample house profiles
profiles_df = pd.DataFrame(generator.house_profiles)
print("\nSample House Profiles:")
print(profiles_df.head(10))

# Statistics
print("\nProfile Statistics:")
print(f"Consumption types: {profiles_df['consumption_type'].value_counts().to_dict()}")
print(f"Average panel capacity: {profiles_df['panel_capacity_kw'].mean():.2f} kW")
print(f"Average battery capacity: {profiles_df['battery_capacity_kwh'].mean():.2f} kWh")
ev_count = int(profiles_df['has_ev'].sum())
print(f"Houses with EV: {ev_count} ({ev_count / len(profiles_df) * 100:.1f}%)")
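
The overall averages above can hide differences between user groups. A minimal sketch of a per-group summary follows, run on a small hand-made table rather than on `generator.house_profiles`. The column names match the ones used in this notebook, but the four rows and their values are invented for illustration only.

```python
import pandas as pd

# Toy profiles (made-up values); real profiles come from generator.house_profiles.
toy_profiles = pd.DataFrame({
    "consumption_type": ["low", "medium", "high", "medium"],
    "panel_capacity_kw": [3.0, 5.0, 8.0, 6.0],
    "battery_capacity_kwh": [5.0, 10.0, 13.5, 10.0],
    "has_ev": [False, True, True, False],
})

# One row per consumption type: house count, mean capacities, and EV share.
summary = toy_profiles.groupby("consumption_type").agg(
    houses=("has_ev", "size"),
    mean_panel_kw=("panel_capacity_kw", "mean"),
    mean_battery_kwh=("battery_capacity_kwh", "mean"),
    ev_share=("has_ev", "mean"),
)

print(summary)
```

The same `groupby(...).agg(...)` call should work on `profiles_df` as long as it contains these four columns.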

## 3. Generate Complete Dataset

This step generates hourly time-series data for every house.

In [None]:
# Generate the dataset
print("Generating synthetic data...\n")
df = generator.generate_dataset()

print("\nData generation complete!")
print(f"\nDataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
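
The internals of `generate_dataset()` are not shown in this notebook. To give a feel for how hourly solar production can be synthesized, here is a minimal sketch under stated assumptions: a half-sine daylight curve between 06:00 and 18:00, scaled by panel capacity and reduced by cloud cover. The function name `sketch_solar_production` and the 75% maximum attenuation are hypothetical and are not taken from the project's generator.

```python
import numpy as np

def sketch_solar_production(hour_of_day, panel_capacity_kw, cloud_cover_pct):
    """Hypothetical hourly production model (kWh); not the project's generator."""
    hour_of_day = np.asarray(hour_of_day, dtype=float)
    cloud_cover_pct = np.asarray(cloud_cover_pct, dtype=float)
    # Half-sine daylight curve: zero at 06:00 and 18:00, peak at noon, zero at night.
    daylight = np.clip(np.sin(np.pi * (hour_of_day - 6) / 12), 0.0, None)
    # Assumed attenuation: 100% cloud cover removes 75% of the output.
    attenuation = 1.0 - 0.75 * cloud_cover_pct / 100.0
    return panel_capacity_kw * daylight * attenuation

hours = np.arange(24)
clear_day = sketch_solar_production(hours, panel_capacity_kw=5.0, cloud_cover_pct=0.0)
cloudy_day = sketch_solar_production(hours, panel_capacity_kw=5.0, cloud_cover_pct=100.0)

print(f"Clear-day total: {clear_day.sum():.1f} kWh")
print(f"Overcast total: {cloudy_day.sum():.1f} kWh")
```

A real generator would typically add seasonal day-length changes and random noise on top of a curve like this.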

## 4. Data Quality Checks

In [None]:
# Check for missing values
print("Missing Values:")
print(df.isnull().sum())

# Data types
print("\nData Types:")
print(df.dtypes)

# Basic statistics
print("\nStatistical Summary:")
print(df.describe())
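
The checks above are inspected by eye. A minimal sketch of explicit pass/fail checks is shown below on a small toy frame. The helper `validate_dataset` is hypothetical, and the rules it applies (one row per house and hour, non-negative energy, cloud cover between 0 and 100) are assumptions based on the column names used in this notebook.

```python
import pandas as pd

def validate_dataset(df, num_houses, num_days):
    """Return named pass/fail checks for an hourly, per-house dataset."""
    energy = df[["production_kwh", "consumption_kwh"]]
    return {
        "row_count": len(df) == num_houses * num_days * 24,
        "no_missing": not df.isnull().any().any(),
        "unique_house_hour": not df.duplicated(["house_id", "timestamp"]).any(),
        "non_negative_energy": bool((energy >= 0).all().all()),
        "cloud_cover_in_range": bool(df["cloud_cover_pct"].between(0, 100).all()),
    }

# Toy data: 2 houses x 1 day with constant values, for illustration only.
timestamps = pd.date_range("2024-01-01", periods=24, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({
    "house_id": [house for house in range(2) for _ in timestamps],
    "timestamp": list(timestamps) * 2,
    "production_kwh": 0.5,
    "consumption_kwh": 1.0,
    "cloud_cover_pct": 40.0,
})

checks = validate_dataset(toy, num_houses=2, num_days=1)
for name, passed in checks.items():
    print(f"{name}: {'OK' if passed else 'FAILED'}")
```

On the full dataset the call would be `validate_dataset(df, NUM_HOUSES, NUM_DAYS)`, provided `df` has the columns referenced here.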

## 5. Visualize Sample Data

In [None]:
# Select one house for detailed view
house_0 = df[df['house_id'] == 0].copy()
house_0['timestamp'] = pd.to_datetime(house_0['timestamp'])
house_0 = house_0.sort_values('timestamp')  # ensure positional slicing follows time order

# Plot first week
first_week = house_0.iloc[:168]  # 7 days * 24 hours

fig, axes = plt.subplots(3, 1, figsize=(15, 10))

# Production vs Consumption
axes[0].plot(first_week['timestamp'], first_week['production_kwh'], label='Production', linewidth=2)
axes[0].plot(first_week['timestamp'], first_week['consumption_kwh'], label='Consumption', linewidth=2)
axes[0].set_title('House 0: Solar Production vs Consumption (First Week)', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Energy (kWh)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Temperature
axes[1].plot(first_week['timestamp'], first_week['temperature_c'], color='red', linewidth=2)
axes[1].set_title('Temperature', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Temperature (°C)')
axes[1].grid(True, alpha=0.3)

# Cloud Cover
axes[2].fill_between(first_week['timestamp'], first_week['cloud_cover_pct'], alpha=0.5, color='gray')
axes[2].set_title('Cloud Cover', fontsize=14, fontweight='bold')
axes[2].set_ylabel('Cloud Cover (%)')
axes[2].set_xlabel('Time')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Visualization complete")

## 6. Community-Level Analysis

In [None]:
# Aggregate by timestamp
community_data = df.groupby('timestamp').agg({
    'production_kwh': 'sum',
    'consumption_kwh': 'sum',
    'temperature_c': 'mean',
    'cloud_cover_pct': 'mean'
}).reset_index()

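# Parse timestamps so later plots get a true datetime axis (no effect if already datetime)
community_data['timestamp'] = pd.to_datetime(community_data['timestamp'])
community_data['net_energy'] = community_data['production_kwh'] - community_data['consumption_kwh']

print("Community-Level Statistics:")
print(f"Total production: {community_data['production_kwh'].sum():,.1f} kWh")
print(f"Total consumption: {community_data['consumption_kwh'].sum():,.1f} kWh")
print(f"Net balance: {community_data['net_energy'].sum():,.1f} kWh")
# Aggregate ratio over the whole period: total production / total consumption (ignores hourly timing)
print(f"Self-sufficiency: {(community_data['production_kwh'].sum() / community_data['consumption_kwh'].sum() * 100):.1f}%")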
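
The percentage printed above divides total production by total consumption over the whole period, so midday surplus counts against night-time demand. A stricter measure only credits energy consumed in the same hour it is produced. The sketch below compares the two on a four-hour toy series. The helper `energy_ratios` is hypothetical, and it assumes no battery storage and no trading between houses.

```python
import numpy as np

def energy_ratios(production_kwh, consumption_kwh):
    """Return (aggregate coverage, hourly-matched self-sufficiency) as fractions."""
    production = np.asarray(production_kwh, dtype=float)
    consumption = np.asarray(consumption_kwh, dtype=float)
    # Aggregate ratio: ignores when the energy is produced.
    coverage = production.sum() / consumption.sum()
    # Hourly matching: each hour can use at most what it produces (no storage assumed).
    matched = np.minimum(production, consumption).sum()
    self_sufficiency = matched / consumption.sum()
    return coverage, self_sufficiency

# Toy series: nothing produced in the first two hours, a surplus in the last two.
coverage, self_sufficiency = energy_ratios([0, 0, 6, 6], [3, 3, 3, 3])

print(f"Coverage ratio: {coverage:.1%}")                        # 100.0%
print(f"Hourly-matched self-sufficiency: {self_sufficiency:.1%}")  # 50.0%
```

The gap between the two numbers is the share of demand that would need storage, trading, or grid imports to be met.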

In [None]:
# Visualize community patterns
first_week_community = community_data.iloc[:168]

fig, ax = plt.subplots(figsize=(15, 6))

ax.plot(first_week_community['timestamp'], first_week_community['production_kwh'], 
        label='Total Production', linewidth=2.5, color='orange')
ax.plot(first_week_community['timestamp'], first_week_community['consumption_kwh'], 
        label='Total Consumption', linewidth=2.5, color='blue')
ax.fill_between(first_week_community['timestamp'], 
                first_week_community['net_energy'], 
                alpha=0.3, label='Net Energy', color='green')

ax.set_title('Community Energy Profile (First Week)', fontsize=16, fontweight='bold')
ax.set_xlabel('Time', fontsize=12)
ax.set_ylabel('Energy (kWh)', fontsize=12)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Save Dataset

In [None]:
# Save to CSV
output_path = '../data/processed/synthetic/community_90days.csv'
Path(output_path).parent.mkdir(parents=True, exist_ok=True)

df.to_csv(output_path, index=False)
print(f"Dataset saved to: {output_path}")

# Save house profiles
profiles_path = '../data/processed/synthetic/house_profiles.csv'
profiles_df.to_csv(profiles_path, index=False)
print(f"House profiles saved to: {profiles_path}")

# Save community aggregated data
community_path = '../data/processed/synthetic/community_aggregated.csv'
community_data.to_csv(community_path, index=False)
print(f"Community data saved to: {community_path}")

print("\nAll data saved successfully!")
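
CSV files do not store column types, so timestamps come back as plain strings unless they are parsed on load. The sketch below shows a round trip through a temporary file with `parse_dates`. The toy frame and file name are made up for illustration; later notebooks would read the paths saved above instead.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the saved dataset (values are illustrative only).
original = pd.DataFrame({
    "house_id": [0, 0, 1, 1],
    "timestamp": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00"] * 2),
    "production_kwh": [0.0, 0.25, 0.0, 0.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "roundtrip.csv"
    original.to_csv(csv_path, index=False)
    # parse_dates restores the datetime dtype that CSV text cannot carry.
    reloaded = pd.read_csv(csv_path, parse_dates=["timestamp"])

print(reloaded.dtypes)
print(f"Round trip preserved values: {(original == reloaded).all().all()}")
```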

## 8. Summary

**Data Collection Complete! ✅**

Generated:
- **108,000** data points (50 houses × 90 days × 24 hours)
- Realistic solar production patterns
- Diverse consumption profiles (low/medium/high users)
- Weather conditions (temperature, cloud cover, humidity, wind)
- House characteristics (panel capacity, battery, EV ownership)

**Next Steps:**
- Notebook 02: Exploratory Data Analysis
- Notebook 03: Solar Forecasting Models
- Notebook 04: Anomaly Detection
- Notebook 05: Swarm Simulation