# 🔋 Milestone 1: Smart Energy Consumption Analysis

## Week 1-2: Data Collection, Understanding & Preprocessing

---

### 📌 Project Scope & Objectives

This notebook implements **Milestone 1** of the AI/ML-Driven Analysis and Forecasting of Device-Level Energy Consumption project.

#### Module 1: Data Collection and Understanding
- ✅ Define project scope and functional objectives for smart energy analysis
- ✅ Collect and structure the SmartHome Energy Monitoring Dataset
- ✅ Verify data integrity, handle missing timestamps, and perform exploratory analysis
- ✅ Organize energy readings by device, room, and timestamp

#### Module 2: Data Cleaning and Preprocessing
- ✅ Handle missing values and outliers in power consumption readings
- ✅ Convert timestamps to datetime format and resample data (hourly/daily)
- ✅ Normalize or scale energy values for model compatibility
- ✅ Split dataset into training, validation, and testing sets

---

**Author:** Suraj Surve  
**Date:** January 2026  
**Infosys Springboard Internship - Project 1**

---

## 1️⃣ Import Libraries & Configuration

In [None]:
# Data Manipulation & Analysis
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 12
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 12

# Preprocessing & ML
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from scipy import stats

# Create images directory if not exists
import os
os.makedirs('images', exist_ok=True)

print("✅ All libraries imported successfully!")
print(f"📦 Pandas version: {pd.__version__}")
print(f"📦 NumPy version: {np.__version__}")

---

## 2️⃣ Module 1: Data Collection and Understanding

### 2.1 Dataset Description

This notebook uses the **Individual Household Electric Power Consumption Dataset**:
- **Source:** UCI Machine Learning Repository
- **Period:** December 2006 - November 2010 (~4 years)
- **Granularity:** 1-minute sampling rate
- **Records:** 2,075,259 measurements (~2 million)

#### Feature Descriptions:
| Feature | Description | Unit |
|---------|-------------|------|
| `Global_active_power` | Total active power consumed | kilowatt (kW) |
| `Global_reactive_power` | Total reactive power consumed | kilowatt (kW) |
| `Voltage` | Minute-averaged voltage | volt (V) |
| `Global_intensity` | Current intensity | ampere (A) |
| `Sub_metering_1` | Kitchen appliances (dishwasher, oven, microwave) | watt-hour (Wh) |
| `Sub_metering_2` | Laundry room (washing machine, dryer, refrigerator) | watt-hour (Wh) |
| `Sub_metering_3` | Climate control (water heater, AC) | watt-hour (Wh) |
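
The units in this table matter when columns are combined: `Global_active_power` is a minute-averaged power in kW, while the sub-metering columns are energy in Wh per minute. The sketch below, on two hypothetical readings (the numbers are illustrative, not taken from the file), converts power to Wh per minute and subtracts the three sub-meters to estimate the energy used by everything they do not cover. The column name `unmetered_wh` is an assumption introduced here.

```python
import pandas as pd

# Two hypothetical one-minute readings in the dataset's column layout.
readings = pd.DataFrame({
    "Global_active_power": [4.216, 1.200],  # kW, minute-averaged
    "Sub_metering_1": [0.0, 1.0],           # Wh in that minute
    "Sub_metering_2": [1.0, 0.0],           # Wh in that minute
    "Sub_metering_3": [17.0, 6.0],          # Wh in that minute
})

# kW sustained for one minute -> Wh: multiply by 1000 W/kW, divide by 60 min/h.
total_wh = readings["Global_active_power"] * 1000 / 60

# Energy covered by the three sub-meters in the same minute.
sub_cols = ["Sub_metering_1", "Sub_metering_2", "Sub_metering_3"]
sub_metered_wh = readings[sub_cols].sum(axis=1)

# Remainder: consumption by equipment outside the three sub-metered circuits.
readings["unmetered_wh"] = total_wh - sub_metered_wh
print(readings.round(2))
```

A negative remainder on real data would point to a unit mix-up or a bad reading, so this also works as a quick sanity check.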

### 2.2 Load Dataset

In [None]:
# Load the dataset
# Dataset uses semicolon separator and '?' for missing values
df = pd.read_csv('../household_power_consumption.txt', 
                 sep=';', 
                 na_values=['?', ''],
                 low_memory=False)

# Display loading summary
print("=" * 60)
print("📊 DATASET LOADED SUCCESSFULLY!")
print("=" * 60)
print(f"\n📏 Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"💾 Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("\n📋 Column Names:")
for i, col in enumerate(df.columns, 1):
    print(f"   {i}. {col}")

print("\n🔍 First 5 rows:")
df.head()

### 2.3 Data Integrity Verification

In [None]:
# Check data types and info
print("📊 Dataset Information:")
print("=" * 60)
df.info()

print("\n" + "=" * 60)
print("📈 Statistical Summary:")
print("=" * 60)
df.describe()

In [None]:
# Check for duplicates
duplicate_count = df.duplicated().sum()
print(f"🔄 Duplicate rows: {duplicate_count:,}")

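# Check date range. 'Date' is still a dd/mm/yyyy string at this point, so parse it first;
# min()/max() on the raw strings would compare them alphabetically, not chronologically.
parsed_dates = pd.to_datetime(df['Date'], format='%d/%m/%Y')
print(f"\n📅 Date Range: {parsed_dates.min().date()} to {parsed_dates.max().date()}")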

### 2.4 Missing Values Analysis

In [None]:
# Comprehensive missing values analysis
print("🔍 Missing Values Analysis:")
print("=" * 60)

missing_data = df.isnull().sum()
missing_percent = (df.isnull().sum() / len(df) * 100).round(2)

missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Percentage (%)': missing_percent
})
print(missing_df)

total_missing = df.isnull().sum().sum()
rows_with_missing = df.isnull().any(axis=1).sum()

print(f"\n📊 Total missing values: {total_missing:,}")
print(f"📊 Rows with missing values: {rows_with_missing:,} ({rows_with_missing/len(df)*100:.2f}%)")

# Store missing values count for before/after comparison
missing_before = df.isnull().sum().sum()

In [None]:
# Visualize missing values
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Missing values bar chart
missing_cols = missing_df[missing_df['Missing Count'] > 0]
ax1 = axes[0]
bars = ax1.bar(range(len(missing_cols)), missing_cols['Missing Count'], color='coral', edgecolor='darkred')
ax1.set_xticks(range(len(missing_cols)))
ax1.set_xticklabels(missing_cols.index, rotation=45, ha='right')
ax1.set_ylabel('Count')
ax1.set_title('Missing Values by Column')
ax1.set_xlabel('Column')
for bar, pct in zip(bars, missing_cols['Percentage (%)']):
    ax1.annotate(f'{pct}%', xy=(bar.get_x() + bar.get_width()/2, bar.get_height()),
                 ha='center', va='bottom', fontsize=9)

# Missing pattern heatmap (sample for visualization)
ax2 = axes[1]
sample_size = min(500, len(df))
sample_df = df.sample(sample_size, random_state=42)
sns.heatmap(sample_df.isnull(), cbar=True, cmap='YlOrRd', ax=ax2)
ax2.set_title(f'Missing Values Pattern (Sample n={sample_size})')
ax2.set_ylabel('Row Index')

plt.tight_layout()
plt.savefig('images/01_missing_values_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
print("✅ Image saved: images/01_missing_values_analysis.png")

### 2.5 Device-Level Energy Organization

The sub-metering columns represent different areas/devices in the household:

In [None]:
# Device mapping
device_mapping = {
    'Sub_metering_1': 'Kitchen (dishwasher, oven, microwave)',
    'Sub_metering_2': 'Laundry (washing machine, dryer, refrigerator)',
    'Sub_metering_3': 'Climate Control (water heater, AC)'
}

print("🏠 Device-Level Energy Organization:")
print("=" * 60)
for col, description in device_mapping.items():
    print(f"   {col}: {description}")

# Convert numeric columns to proper types
numeric_cols = ['Global_active_power', 'Global_reactive_power', 'Voltage', 
                'Global_intensity', 'Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3']

for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Device-wise statistics
print("\n📊 Device-wise Statistics:")
print("=" * 60)
for col, device in device_mapping.items():
    values = df[col].dropna()
    print(f"\n   {device}:")
    print(f"      Mean: {values.mean():.2f} Wh")
    print(f"      Max: {values.max():.2f} Wh")
    print(f"      Min: {values.min():.2f} Wh")
    print(f"      Std Dev: {values.std():.2f} Wh")

In [None]:
# Visualize device-level distribution
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']
sub_cols = ['Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3']
titles = ['Kitchen', 'Laundry', 'Climate Control']

for i, (col, title, color) in enumerate(zip(sub_cols, titles, colors)):
    ax = axes[i]
    data = df[col].dropna()
    ax.hist(data[data > 0], bins=50, color=color, edgecolor='white', alpha=0.8)
    ax.set_title(f'{title}\n({col})')
    ax.set_xlabel('Energy (Wh)')
    ax.set_ylabel('Frequency')
    ax.axvline(data.mean(), color='red', linestyle='--', label=f'Mean: {data.mean():.1f}')
    ax.legend()

plt.suptitle('Device-Level Energy Distribution', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('images/02_device_level_distribution.png', dpi=150, bbox_inches='tight')
plt.show()
print("✅ Image saved: images/02_device_level_distribution.png")

---

## 3️⃣ Module 2: Data Cleaning and Preprocessing

### 3.1 DateTime Processing

In [None]:
# Create DateTime column
df['DateTime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d/%m/%Y %H:%M:%S')

# Set DateTime as index
df.set_index('DateTime', inplace=True)

# Drop original Date and Time columns
df.drop(['Date', 'Time'], axis=1, inplace=True)

# Sort by index
df.sort_index(inplace=True)

print("✅ DateTime index created!")
print(f"📅 Date Range: {df.index.min()} to {df.index.max()}")
print(f"📆 Total Duration: {(df.index.max() - df.index.min()).days} days")
print(f"\n📊 Data Shape: {df.shape}")
df.head()
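
Module 1 lists missing timestamps as something to verify, and once a DateTime index exists this can be checked directly. The sketch below uses a small hypothetical frame (`toy`, with two minutes deliberately left out) rather than the real data; it compares the index against a complete minute-level range and then reindexes so the absent minutes become NaN rows.

```python
import pandas as pd

# Hypothetical minute-level index with two minutes absent (00:02 and 00:03).
stamps = pd.to_datetime([
    "2006-12-16 00:00", "2006-12-16 00:01",
    "2006-12-16 00:04", "2006-12-16 00:05",
])
toy = pd.DataFrame({"Global_active_power": [1.0, 1.1, 1.4, 1.5]}, index=stamps)

# Build the full expected range and list the timestamps that are not present.
expected = pd.date_range(toy.index.min(), toy.index.max(), freq="min")
missing_stamps = expected.difference(toy.index)
print(f"Missing timestamps: {len(missing_stamps)}")

# Reindexing inserts the absent minutes as NaN rows, so the later
# missing-value handling treats them like any other gap.
toy_full = toy.reindex(expected)
print(toy_full)
```

If the count is zero on the real index, every gap in the data is a row with NaN readings rather than a skipped minute, and no reindexing is needed.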

### 3.2 Missing Values Handling (Multiple Strategies)

In [None]:
print("🔧 Handling Missing Values with Multiple Strategies...")
print("=" * 60)
print(f"   Before: {df.isnull().sum().sum():,} missing values")

# Strategy 1: Linear interpolation between neighbouring valid readings.
# limit_direction='both' also fills gaps at the very start and end of the series.
df_cleaned = df.copy()
df_cleaned = df_cleaned.interpolate(method='linear', limit_direction='both')

# Strategy 2: Forward fill as a safety net for anything interpolation left behind
df_cleaned = df_cleaned.ffill()

# Strategy 3: Backward fill as a final safety net for NaNs at the start
df_cleaned = df_cleaned.bfill()

missing_after = df_cleaned.isnull().sum().sum()
print(f"   After: {missing_after:,} missing values")
print("\n✅ Missing values handled successfully!")

# Verify no missing values remain
print("\n📊 Missing Values Check (After Cleaning):")
print(df_cleaned.isnull().sum())
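
Interpolating every gap regardless of length draws a straight line through long outages, which may not be desirable. As a hedged alternative, the sketch below measures each run of consecutive NaNs and only keeps interpolated values for runs up to a threshold. The series `s` and the threshold `max_gap` are hypothetical and chosen for illustration; this is not the strategy used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical minute-level series: one single-sample gap and one four-sample gap.
idx = pd.date_range("2007-01-01", periods=10, freq="min")
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, np.nan, 8.0, 9.0, 10.0],
              index=idx)

# Label each run of consecutive equal NaN/non-NaN flags, then measure NaN run length.
is_nan = s.isna()
run_id = is_nan.ne(is_nan.shift(fill_value=False)).cumsum()
gap_len = is_nan.groupby(run_id).transform("sum")

# Interpolate everything, then blank out values that sit inside long gaps.
max_gap = 2  # hypothetical threshold, in samples
filled = s.interpolate(method="time").mask(is_nan & (gap_len > max_gap))
print(filled)
```

The short gap is filled and the long one stays NaN, leaving it to be dropped or handled separately.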

### 3.3 Outlier Detection & Treatment

In [None]:
print("🔍 Outlier Detection using IQR Method:")
print("=" * 60)

def detect_outliers_iqr(data, column):
    """Detect outliers using IQR method."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return len(outliers), lower_bound, upper_bound

# Detect outliers for each numeric column
outlier_summary = []
for col in numeric_cols:
    count, lower, upper = detect_outliers_iqr(df_cleaned, col)
    pct = (count / len(df_cleaned)) * 100
    outlier_summary.append({
        'Column': col,
        'Outliers': count,
        'Percentage': f'{pct:.2f}%',
        'Lower Bound': f'{lower:.2f}',
        'Upper Bound': f'{upper:.2f}'
    })
    print(f"   {col}: {count:,} outliers ({pct:.2f}%)")

outlier_df = pd.DataFrame(outlier_summary)
outlier_df

In [None]:
# Visualize outliers using box plots
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

for i, col in enumerate(numeric_cols):
    ax = axes[i]
    box = ax.boxplot(df_cleaned[col].dropna(), patch_artist=True)
    box['boxes'][0].set_facecolor('#74b9ff')
    ax.set_title(col, fontsize=10)
    ax.set_ylabel('Value')

# Hide the last empty subplot
axes[-1].axis('off')

plt.suptitle('Outlier Detection - Box Plots', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('images/03_outlier_detection.png', dpi=150, bbox_inches='tight')
plt.show()
print("✅ Image saved: images/03_outlier_detection.png")

In [None]:
# Outlier treatment: Cap to percentile bounds (Winsorization)
print("\n🔧 Treating Outliers (Winsorization at 1st and 99th percentile):")
print("=" * 60)

for col in numeric_cols:
    lower_cap = df_cleaned[col].quantile(0.01)
    upper_cap = df_cleaned[col].quantile(0.99)
    original_outliers = len(df_cleaned[(df_cleaned[col] < lower_cap) | (df_cleaned[col] > upper_cap)])
    df_cleaned[col] = df_cleaned[col].clip(lower=lower_cap, upper=upper_cap)
    print(f"   {col}: Capped {original_outliers:,} values to [{lower_cap:.2f}, {upper_cap:.2f}]")

print("\n✅ Outliers treated successfully!")
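
Detection and treatment in this section use different thresholds: the IQR fences decide what is reported as an outlier, while the 1st/99th percentile caps decide what is actually changed. The toy example below (hypothetical values, not from the dataset) puts the two side by side on one small series.

```python
import pandas as pd

# Hypothetical readings with one obvious extreme value.
values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 50.0])

# IQR fences: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = values[(values < lower_fence) | (values > upper_fence)]

# Percentile caps: clip instead of dropping, so every timestamp is kept.
lower_cap, upper_cap = values.quantile([0.01, 0.99])
capped = values.clip(lower=lower_cap, upper=upper_cap)

print(f"IQR fences: [{lower_fence:.2f}, {upper_fence:.2f}], flagged: {flagged.tolist()}")
print(f"Percentile caps: [{lower_cap:.2f}, {upper_cap:.2f}]")
```

Clipping keeps the row count unchanged, which matters here because the data is resampled on a regular time grid afterwards.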

### 3.4 Resampling Data (Hourly/Daily Aggregation)

In [None]:
# Hourly resampling (mean of the minute-level values within each hour)
df_hourly = df_cleaned.resample('h').mean()
print("📊 Hourly Resampled Data:")
print(f"   Shape: {df_hourly.shape}")
print(f"   Date Range: {df_hourly.index.min()} to {df_hourly.index.max()}")

# Daily resampling (mean of the minute-level values within each day)
df_daily = df_cleaned.resample('D').mean()
print("\n📊 Daily Resampled Data:")
print(f"   Shape: {df_daily.shape}")
print(f"   Date Range: {df_daily.index.min()} to {df_daily.index.max()}")

print("\n✅ Data resampled successfully!")
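
Taking the mean of every column keeps the sub-metering values in Wh per minute. If hourly energy totals are wanted instead, the energy columns can be summed while the power and voltage columns are averaged. The sketch below shows this with a per-column aggregation on a hypothetical two-hour frame of constant values; it is one possible variant, not what the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical two hours of minute-level data with constant readings.
idx = pd.date_range("2007-01-01 00:00", periods=120, freq="min")
minute = pd.DataFrame({
    "Global_active_power": np.full(120, 1.5),  # kW
    "Voltage": np.full(120, 240.0),            # V
    "Sub_metering_3": np.full(120, 2.0),       # Wh per minute
}, index=idx)

# Average the instantaneous quantities, sum the per-minute energy.
hourly = minute.resample("h").agg({
    "Global_active_power": "mean",
    "Voltage": "mean",
    "Sub_metering_3": "sum",
})
print(hourly)
```

With 2 Wh per minute, each hourly row shows 120 Wh for `Sub_metering_3`, while the power column stays at 1.5 kW.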

In [None]:
# Visualize resampled time series
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Hourly data (sample)
sample_hourly = df_hourly['Global_active_power'].iloc[:720]  # First 30 days
axes[0].plot(sample_hourly.index, sample_hourly.values, color='#3498db', linewidth=0.8)
axes[0].set_title('Hourly Global Active Power (First 30 Days)', fontweight='bold')
axes[0].set_xlabel('DateTime')
axes[0].set_ylabel('Power (kW)')
axes[0].fill_between(sample_hourly.index, sample_hourly.values, alpha=0.3, color='#3498db')

# Daily data
axes[1].plot(df_daily.index, df_daily['Global_active_power'].values, color='#e74c3c', linewidth=1.2)
axes[1].set_title('Daily Average Global Active Power (Full Dataset)', fontweight='bold')
axes[1].set_xlabel('DateTime')
axes[1].set_ylabel('Power (kW)')
axes[1].fill_between(df_daily.index, df_daily['Global_active_power'].values, alpha=0.3, color='#e74c3c')

plt.tight_layout()
plt.savefig('images/04_resampled_time_series.png', dpi=150, bbox_inches='tight')
plt.show()
print("✅ Image saved: images/04_resampled_time_series.png")

### 3.5 Feature Engineering (Temporal Features)

In [None]:
# Add temporal features to hourly data
df_hourly['hour'] = df_hourly.index.hour
df_hourly['day'] = df_hourly.index.day
df_hourly['month'] = df_hourly.index.month
df_hourly['year'] = df_hourly.index.year
df_hourly['dayofweek'] = df_hourly.index.dayofweek
df_hourly['is_weekend'] = df_hourly['dayofweek'].isin([5, 6]).astype(int)

# Season mapping
def get_season(month):
    if month in [12, 1, 2]:
        return 0  # Winter
    elif month in [3, 4, 5]:
        return 1  # Spring
    elif month in [6, 7, 8]:
        return 2  # Summer
    else:
        return 3  # Autumn

df_hourly['season'] = df_hourly['month'].apply(get_season)

print("✅ Temporal features added!")
print("\n📊 New Features:")
print(df_hourly[['hour', 'day', 'month', 'year', 'dayofweek', 'is_weekend', 'season']].head(10))
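
The season mapping can also be written without `apply`, which avoids a Python-level call per row on a long hourly series. The sketch below repeats `get_season` so it runs on its own and checks that the arithmetic form gives the same codes for all twelve months.

```python
import pandas as pd

def get_season(month):
    """Same mapping as in the cell above: 0=Winter, 1=Spring, 2=Summer, 3=Autumn."""
    if month in [12, 1, 2]:
        return 0
    elif month in [3, 4, 5]:
        return 1
    elif month in [6, 7, 8]:
        return 2
    else:
        return 3

months = pd.Series(range(1, 13))

# December wraps to 0 via the modulo, then integer division groups months in threes.
season_vectorized = (months % 12) // 3

print(season_vectorized.tolist())
```

On the hourly frame this would read `(df_hourly['month'] % 12) // 3`, assuming the `month` column created above.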

In [None]:
# Visualize hourly patterns
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Hourly consumption pattern
hourly_pattern = df_hourly.groupby('hour')['Global_active_power'].mean()
axes[0].bar(hourly_pattern.index, hourly_pattern.values, color='#2ecc71', edgecolor='white')
axes[0].set_title('Average Consumption by Hour of Day', fontweight='bold')
axes[0].set_xlabel('Hour')
axes[0].set_ylabel('Avg Power (kW)')
axes[0].set_xticks(range(24))

# Day of week pattern
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
daily_pattern = df_hourly.groupby('dayofweek')['Global_active_power'].mean()
colors = ['#3498db'] * 5 + ['#e74c3c'] * 2  # Blue for weekdays, red for weekends
axes[1].bar(daily_pattern.index, daily_pattern.values, color=colors, edgecolor='white')
axes[1].set_title('Average Consumption by Day of Week', fontweight='bold')
axes[1].set_xlabel('Day of Week')
axes[1].set_ylabel('Avg Power (kW)')
axes[1].set_xticks(range(7))
axes[1].set_xticklabels(day_names)

plt.tight_layout()
plt.savefig('images/05_consumption_patterns.png', dpi=150, bbox_inches='tight')
plt.show()
print("✅ Image saved: images/05_consumption_patterns.png")

### 3.6 Correlation Analysis

In [None]:
# Correlation heatmap
correlation_cols = numeric_cols + ['hour', 'dayofweek', 'is_weekend', 'season']
corr_matrix = df_hourly[correlation_cols].corr()

plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='RdBu_r', center=0, 
            fmt='.2f', square=True, linewidths=0.5,
            cbar_kws={'shrink': 0.8})
plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig('images/06_correlation_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()
print("✅ Image saved: images/06_correlation_heatmap.png")
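
A heatmap is easy to scan but awkward to quote from. As a complement, the strongest pairs can be pulled out of the correlation matrix as a ranked list. The sketch below does this on synthetic data (the linear relation between power and intensity is built in on purpose, and the factor 4.2 is arbitrary), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: intensity is constructed as a noisy multiple of power.
rng = np.random.default_rng(0)
power = rng.normal(1.0, 0.3, 500)
toy = pd.DataFrame({
    "Global_active_power": power,
    "Global_intensity": power * 4.2 + rng.normal(0, 0.05, 500),
    "Voltage": rng.normal(240, 3, 500),
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature_a, feature_b) -> r and rank by absolute correlation.
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```

Applied to `corr_matrix` from the cell above, the same three lines would give a ranked list of feature pairs.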

### 3.7 Normalization & Scaling

In [None]:
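# Prepare data for scaling (using hourly data).
# Note: both scalers below are fit on the full hourly series, which is fine for
# comparing distributions; for modeling, fit them on the training split only.
features_to_scale = numeric_cols
df_scaled = df_hourly.copy()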

# MinMax Scaling
minmax_scaler = MinMaxScaler()
minmax_scaled = minmax_scaler.fit_transform(df_hourly[features_to_scale])
df_minmax = pd.DataFrame(minmax_scaled, columns=[f'{col}_minmax' for col in features_to_scale], index=df_hourly.index)

# Standard Scaling (Z-score normalization)
standard_scaler = StandardScaler()
standard_scaled = standard_scaler.fit_transform(df_hourly[features_to_scale])
df_standard = pd.DataFrame(standard_scaled, columns=[f'{col}_standard' for col in features_to_scale], index=df_hourly.index)

print("✅ Data Scaling Complete!")
print("\n📊 MinMax Scaled Data (range [0, 1]):")
print(df_minmax.describe().round(3))
print("\n📊 Standard Scaled Data (mean=0, std=1):")
print(df_standard.describe().round(3))
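
The scalers above see the whole hourly series, including the rows that later become validation and test data in section 3.8. For modeling, a common precaution is to learn the scaling statistics from the training rows only and reuse them on later rows. The sketch below illustrates this on a hypothetical rising series; the 70% cut mirrors the split used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical hourly series that keeps rising, so later rows exceed the training range.
idx = pd.date_range("2007-01-01", periods=100, freq="h")
toy = pd.DataFrame({"Global_active_power": np.linspace(0.5, 3.0, 100)}, index=idx)

# Chronological cut: first 70% for training, the rest held out.
train_end = int(len(toy) * 0.70)
train, later = toy.iloc[:train_end], toy.iloc[train_end:]

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min/max from training rows only
later_scaled = scaler.transform(later)      # reuses those statistics unchanged

print(f"Train range: [{train_scaled.min():.2f}, {train_scaled.max():.2f}]")
print(f"Later max:   {later_scaled.max():.2f}")
```

Held-out values can fall outside [0, 1] with this approach, which is expected: the model is evaluated under the same information it would have at training time.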

In [None]:
# Visualize scaling comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

col_to_show = 'Global_active_power'

# Original
axes[0].hist(df_hourly[col_to_show].dropna(), bins=50, color='#3498db', edgecolor='white', alpha=0.8)
axes[0].set_title(f'Original: {col_to_show}', fontweight='bold')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')

# MinMax Scaled
axes[1].hist(df_minmax[f'{col_to_show}_minmax'].dropna(), bins=50, color='#2ecc71', edgecolor='white', alpha=0.8)
axes[1].set_title(f'MinMax Scaled: {col_to_show}', fontweight='bold')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Frequency')

# Standard Scaled
axes[2].hist(df_standard[f'{col_to_show}_standard'].dropna(), bins=50, color='#e74c3c', edgecolor='white', alpha=0.8)
axes[2].set_title(f'Standard Scaled: {col_to_show}', fontweight='bold')
axes[2].set_xlabel('Value')
axes[2].set_ylabel('Frequency')

plt.suptitle('Scaling Comparison', fontsize=14, fontweight='bold', y=1.05)
plt.tight_layout()
plt.savefig('images/07_scaling_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
print("✅ Image saved: images/07_scaling_comparison.png")

### 3.8 Train/Validation/Test Split

In [None]:
# Time-series aware split (no shuffling to maintain temporal order)
# 70% train, 15% validation, 15% test

n = len(df_hourly)
train_end = int(n * 0.70)
val_end = int(n * 0.85)

df_train = df_hourly.iloc[:train_end]
df_val = df_hourly.iloc[train_end:val_end]
df_test = df_hourly.iloc[val_end:]

print("📊 Dataset Split Summary:")
print("=" * 60)
print(f"   Training Set:   {len(df_train):,} samples ({len(df_train)/n*100:.1f}%)")
print(f"   Validation Set: {len(df_val):,} samples ({len(df_val)/n*100:.1f}%)")
print(f"   Test Set:       {len(df_test):,} samples ({len(df_test)/n*100:.1f}%)")
print(f"\n   Total:          {n:,} samples")

print("\n📅 Date Ranges:")
print(f"   Training:   {df_train.index.min()} to {df_train.index.max()}")
print(f"   Validation: {df_val.index.min()} to {df_val.index.max()}")
print(f"   Test:       {df_test.index.min()} to {df_test.index.max()}")
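
If the same split is needed again in later milestones, it can be wrapped in a small helper. The function name `chronological_split` and its parameters are assumptions introduced here, not part of the notebook; the sketch also checks that the three parts do not overlap in time.

```python
import pandas as pd

def chronological_split(frame, train_frac=0.70, val_frac=0.15):
    """Split a time-sorted frame into train/val/test without shuffling."""
    n = len(frame)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return frame.iloc[:train_end], frame.iloc[train_end:val_end], frame.iloc[val_end:]

# Hypothetical hourly frame standing in for df_hourly.
idx = pd.date_range("2007-01-01", periods=200, freq="h")
toy = pd.DataFrame({"y": range(200)}, index=idx)

train, val, test = chronological_split(toy)

# Every training timestamp precedes validation, which precedes test.
assert train.index.max() < val.index.min()
assert val.index.max() < test.index.min()
print(len(train), len(val), len(test))
```

The helper assumes the frame is already sorted by its DateTime index, as `df_hourly` is after resampling.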

In [None]:
# Visualize data split
fig, ax = plt.subplots(figsize=(14, 5))

ax.plot(df_train.index, df_train['Global_active_power'], label='Training', color='#3498db', alpha=0.8)
ax.plot(df_val.index, df_val['Global_active_power'], label='Validation', color='#f39c12', alpha=0.8)
ax.plot(df_test.index, df_test['Global_active_power'], label='Test', color='#e74c3c', alpha=0.8)

ax.axvline(df_val.index.min(), color='black', linestyle='--', linewidth=1.5, alpha=0.7)
ax.axvline(df_test.index.min(), color='black', linestyle='--', linewidth=1.5, alpha=0.7)

ax.set_title('Train/Validation/Test Split Visualization', fontsize=14, fontweight='bold')
ax.set_xlabel('DateTime')
ax.set_ylabel('Global Active Power (kW)')
ax.legend(loc='upper right')

plt.tight_layout()
plt.savefig('images/08_data_split_visualization.png', dpi=150, bbox_inches='tight')
plt.show()
print("✅ Image saved: images/08_data_split_visualization.png")

---

## 4️⃣ Summary & Conclusions

### Milestone 1 Achievements:

#### Module 1: Data Collection and Understanding
- ✅ Loaded and explored 2,075,259 minute-level energy consumption records
- ✅ Identified 1.25% missing values across all numerical columns
- ✅ Organized device-level energy data (Kitchen, Laundry, Climate Control)
- ✅ Performed comprehensive exploratory data analysis

#### Module 2: Data Cleaning and Preprocessing
- ✅ Handled missing values using interpolation and forward/backward fill
- ✅ Detected and treated outliers using IQR method and Winsorization
- ✅ Created DateTime index and extracted temporal features
- ✅ Resampled data to hourly and daily granularity
- ✅ Applied MinMax and Standard scaling for normalization
- ✅ Split data into train (70%), validation (15%), and test (15%) sets

### Key Observations:
1. **Climate control (`Sub_metering_3`)** has the highest average consumption of the three sub-meters (~6.4 Wh per minute)
2. **Peak hours** are morning (7-9 AM) and evening (6-9 PM)
3. **Weekend patterns** differ from weekdays
4. Strong correlation between `Global_active_power` and `Global_intensity`

### All Saved Visualizations:
1. `images/01_missing_values_analysis.png`
2. `images/02_device_level_distribution.png`
3. `images/03_outlier_detection.png`
4. `images/04_resampled_time_series.png`
5. `images/05_consumption_patterns.png`
6. `images/06_correlation_heatmap.png`
7. `images/07_scaling_comparison.png`
8. `images/08_data_split_visualization.png`

In [None]:
# Final summary statistics
print("📊 MILESTONE 1 COMPLETE!")
print("=" * 60)
print("\n📁 Processed Data Summary:")
print(f"   Original Records: {len(df):,}")
print(f"   Hourly Records: {len(df_hourly):,}")
print(f"   Daily Records: {len(df_daily):,}")
print(f"\n📊 Missing Values: {missing_before:,} → {missing_after:,}")
print("\n🖼️ Visualizations Saved: 8 images in 'images/' folder")
print("\n✅ Ready for Milestone 2: Feature Engineering & Modeling!")