# Milestone 1: Week 1-2 - Smart Energy Analysis
 
## Module 1: Data Collection and Understanding
- **Define project scope and functional objectives for smart energy analysis.**
- **Collect and structure the SmartHome Energy Monitoring Dataset.**
- **Verify data integrity, handle missing timestamps, and perform exploratory analysis.**
- **Organize energy readings by device, room, and timestamp.**
 
## Module 2: Data Cleaning and Preprocessing
- **Handle missing values and outliers in power consumption readings.**
- **Convert timestamps to datetime format and resample data (hourly/daily).**
- **Normalize or scale energy values for model compatibility.**
- **Split dataset into training, validation, and testing sets.**
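
The resampling, scaling, and splitting tasks listed for Module 2 could look roughly like the following. This is a minimal sketch on synthetic minute-level data, not the notebook's own implementation: the 70/15/15 chronological split, the hand-written min-max scaling (a stand-in for `MinMaxScaler`), and the names `demo`, `hourly`, and `min_max` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic minute-level readings standing in for the real data (assumption)
idx = pd.date_range("2007-01-01", periods=10 * 24 * 60, freq="min")
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Global_active_power": rng.uniform(0.1, 5.0, len(idx))}, index=idx)

# Resample minute readings to hourly means
hourly = demo.resample("h").mean()

# Chronological 70/15/15 split (no shuffling, so later data never leaks backwards)
n = len(hourly)
train_end, val_end = n * 70 // 100, n * 85 // 100
train = hourly.iloc[:train_end]
val = hourly.iloc[train_end:val_end]
test = hourly.iloc[val_end:]

# Min-max scaling with statistics taken from the training split only
lo, hi = train.min(), train.max()

def min_max(part):
    """Scale a split using the training minimum and maximum."""
    return (part - lo) / (hi - lo)

train_s, val_s, test_s = min_max(train), min_max(val), min_max(test)
print(len(train_s), len(val_s), len(test_s))
```

Splitting by time rather than at random, and taking the scaling statistics from the training split alone, keeps information from later periods out of the training data. Validation and test values may therefore fall slightly outside the 0 to 1 range.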
 
---
 
This notebook implements these tasks for Milestone 1. All outputs and images are saved for reporting and reproducibility.

## NOTE:
This notebook submission is limited to the following Milestone 1 tasks only:
- Data Collection and Understanding (Module 1)
- Data Cleaning and Preprocessing (Module 2)

All other modules, advanced modeling, and extra analyses are excluded from this milestone.

In [1]:
# Data Manipulation & Analysis
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 12

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Deep Learning (LSTM)
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

print("✅ All libraries imported successfully!")
print(f"📦 TensorFlow version: {tf.__version__}")
print(f"📦 Pandas version: {pd.__version__}")
print(f"📦 NumPy version: {np.__version__}")

✅ All libraries imported successfully!
📦 TensorFlow version: 2.20.0
📦 Pandas version: 2.3.3
📦 NumPy version: 2.2.6


## 2️⃣ Load and Parse Dataset

The dataset uses:
- Semicolon (`;`) as delimiter
- Missing values marked as `?`
- Period (`.`) as the decimal separator, not the European comma format
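
To see why the missing-value marker matters, the delimiter and `?` marker listed above can be tried on a tiny in-memory sample. This is only a sketch: the second data row (with `?` values) is made up for illustration, and `sample`, `parsed`, and `raw` are hypothetical names.

```python
import io
import pandas as pd

# Three rows in the dataset's layout; the middle row is a made-up missing reading
sample = (
    "Date;Time;Global_active_power;Voltage\n"
    "16/12/2006;17:24:00;4.216;234.840\n"
    "16/12/2006;17:25:00;?;?\n"
    "16/12/2006;17:26:00;5.374;233.290\n"
)

# With na_values, '?' becomes NaN and the columns are parsed as floats
parsed = pd.read_csv(io.StringIO(sample), sep=";", na_values=["?"])

# Without na_values, '?' stays a string and the whole column is non-numeric
raw = pd.read_csv(io.StringIO(sample), sep=";")

print(parsed.dtypes)
print(parsed.isna().sum())
```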

In [2]:
# Load the dataset
# The dataset uses semicolon as separator and '?' for missing values
df = pd.read_csv('household_power_consumption.txt', 
                 sep=';', 
                 na_values=['?', ''],
                 low_memory=False)

# Display basic information
print("=" * 60)
print("📊 DATASET LOADED SUCCESSFULLY!")
print("=" * 60)
print(f"\n📏 Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"💾 Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("\n📋 Column Names:")
for i, col in enumerate(df.columns, 1):
    print(f"   {i}. {col}")

# Display first few rows
print("\n🔍 First 5 rows:")
df.head()

📊 DATASET LOADED SUCCESSFULLY!

📏 Shape: 2,075,259 rows × 9 columns
💾 Memory Usage: 338.33 MB

📋 Column Names:
   1. Date
   2. Time
   3. Global_active_power
   4. Global_reactive_power
   5. Voltage
   6. Global_intensity
   7. Sub_metering_1
   8. Sub_metering_2
   9. Sub_metering_3

🔍 First 5 rows:

| | Date | Time | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 16/12/2006 | 17:24:00 | 4.216 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 |
| 1 | 16/12/2006 | 17:25:00 | 5.36 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 |
| 2 | 16/12/2006 | 17:26:00 | 5.374 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 |
| 3 | 16/12/2006 | 17:27:00 | 5.388 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 |
| 4 | 16/12/2006 | 17:28:00 | 3.666 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 |


In [3]:
# Check data types and info
print("📊 Dataset Information:")
print("=" * 60)
df.info()

print("\n" + "=" * 60)
print("📈 Statistical Summary:")
print("=" * 60)
df.describe()

📊 Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075259 entries, 0 to 2075258
Data columns (total 9 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Date                   object 
 1   Time                   object 
 2   Global_active_power    float64
 3   Global_reactive_power  float64
 4   Voltage                float64
 5   Global_intensity       float64
 6   Sub_metering_1         float64
 7   Sub_metering_2         float64
 8   Sub_metering_3         float64
dtypes: float64(7), object(2)
memory usage: 142.5+ MB

📈 Statistical Summary:

| | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|
| count | 2049280.0 | 2049280.0 | 2049280.0 | 2049280.0 | 2049280.0 | 2049280.0 | 2049280.0 |
| mean | 1.091615 | 0.1237145 | 240.8399 | 4.627759 | 1.121923 | 1.29852 | 6.458447 |
| std | 1.057294 | 0.112722 | 3.239987 | 4.444396 | 6.153031 | 5.822026 | 8.437154 |
| min | 0.076 | 0.0 | 223.2 | 0.2 | 0.0 | 0.0 | 0.0 |
| 25% | 0.308 | 0.048 | 238.99 | 1.4 | 0.0 | 0.0 | 0.0 |
| 50% | 0.602 | 0.1 | 241.01 | 2.6 | 0.0 | 0.0 | 1.0 |
| 75% | 1.528 | 0.194 | 242.89 | 6.4 | 0.0 | 1.0 | 17.0 |
| max | 11.122 | 1.39 | 254.15 | 48.4 | 88.0 | 80.0 | 31.0 |


## 3️⃣ Data Cleaning and Missing Value Treatment

### Key Preprocessing Steps:
1. Convert all numeric columns to proper float type
2. Handle missing values using forward-fill, then backward-fill for any gaps at the start
3. Create proper DateTime index

In [4]:
# Check missing values
print("🔍 Missing Values Analysis:")
print("=" * 60)
missing_data = df.isnull().sum()
missing_percent = (df.isnull().sum() / len(df) * 100).round(2)

missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Percentage (%)': missing_percent
})
print(missing_df)
print(f"\n📊 Total missing values: {df.isnull().sum().sum():,}")
print(f"📊 Percentage of data with missing values: {(df.isnull().any(axis=1).sum() / len(df) * 100):.2f}%")

🔍 Missing Values Analysis:
                       Missing Count  Percentage (%)
Date                               0            0.00
Time                               0            0.00
Global_active_power            25979            1.25
Global_reactive_power          25979            1.25
Voltage                        25979            1.25
Global_intensity               25979            1.25
Sub_metering_1                 25979            1.25
Sub_metering_2                 25979            1.25
Sub_metering_3                 25979            1.25

📊 Total missing values: 181,853
📊 Percentage of data with missing values: 1.25%
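
Every numeric column reports the same count (25,979), which suggests that whole rows are missing rather than isolated cells. A sketch of how that could be confirmed, and how the length of each consecutive gap could be measured before choosing a fill strategy, is shown below. It runs on a toy frame, and the names `toy`, `row_missing`, and `gap_lengths` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with one 2-row gap and one 1-row gap (illustrative values)
toy = pd.DataFrame({
    "Global_active_power": [1.0, np.nan, np.nan, 2.0, np.nan, 3.0],
    "Voltage": [240.0, np.nan, np.nan, 241.0, np.nan, 239.0],
})

# True where every column of the row is missing
row_missing = toy.isna().all(axis=1)

# If "any column missing" equals "all columns missing" on every row,
# the gaps are whole-row gaps
same_rows = bool((toy.isna().any(axis=1) == row_missing).all())

# Label consecutive runs: a new run starts whenever the flag changes
run_id = row_missing.ne(row_missing.shift(fill_value=False)).cumsum()

# Number of missing rows per run, keeping only the runs that are gaps
gap_lengths = row_missing.groupby(run_id).sum()
gap_lengths = gap_lengths[gap_lengths > 0]
longest_gap = int(gap_lengths.max())

print(f"Whole-row gaps only: {same_rows}")
print(f"Number of gaps: {len(gap_lengths)}, longest: {longest_gap} rows")
```

Knowing the longest gap helps judge whether a fill is reasonable: carrying a value forward across a few minutes is a mild assumption, while carrying it across hours or days flattens real variation.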


In [5]:
# Convert numeric columns to proper types
numeric_cols = ['Global_active_power', 'Global_reactive_power', 'Voltage', 
                'Global_intensity', 'Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3']

for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Create DateTime column
df['DateTime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d/%m/%Y %H:%M:%S')

# Set DateTime as index
df.set_index('DateTime', inplace=True)

# Drop original Date and Time columns
df.drop(['Date', 'Time'], axis=1, inplace=True)

print("✅ DateTime index created!")
print(f"📅 Date Range: {df.index.min()} to {df.index.max()}")
print(f"📆 Total Duration: {(df.index.max() - df.index.min()).days} days")
df.head()

✅ DateTime index created!
📅 Date Range: 2006-12-16 17:24:00 to 2010-11-26 21:02:00
📆 Total Duration: 1441 days

| DateTime | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|
| 2006-12-16 17:24:00 | 4.216 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 |
| 2006-12-16 17:25:00 | 5.36 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 |
| 2006-12-16 17:26:00 | 5.374 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 |
| 2006-12-16 17:27:00 | 5.388 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 |
| 2006-12-16 17:28:00 | 3.666 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 |
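
Module 1 also calls for handling missing timestamps. The reported row count (2,075,259) equals the number of minutes from 2006-12-16 17:24 to 2010-11-26 21:02 inclusive (1441 × 1440 + 218 + 1), which is consistent with a complete one-minute index, although it does not by itself rule out duplicates offsetting gaps. One way to check directly is sketched below on a toy index. The toy timestamps and values, and the names `expected`, `absent`, and `regular`, are assumptions for illustration.

```python
import pandas as pd

# Toy index with one duplicated minute and two absent minutes (illustrative)
toy_index = pd.to_datetime([
    "2006-12-16 17:24:00",
    "2006-12-16 17:25:00",
    "2006-12-16 17:27:00",
    "2006-12-16 17:27:00",
    "2006-12-16 17:29:00",
])
toy = pd.DataFrame({"Global_active_power": [4.2, 5.4, 5.4, 5.4, 3.5]}, index=toy_index)

# Count repeated timestamps
n_duplicates = int(toy.index.duplicated().sum())

# Build the full minute-by-minute range and find timestamps that never appear
expected = pd.date_range(toy.index.min(), toy.index.max(), freq="min")
absent = expected.difference(toy.index)

# Drop repeats, then reindex so absent minutes become explicit NaN rows
regular = toy[~toy.index.duplicated(keep="first")].reindex(expected)

print(f"Duplicates: {n_duplicates}, absent minutes: {len(absent)}")
print(regular)
```

After reindexing, absent minutes show up as ordinary missing values, so they are handled by the same fill step as the `?` readings.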


In [6]:
# Handle missing values using forward fill, then backward fill
print("🔧 Handling Missing Values...")
print(f"   Before: {df.isnull().sum().sum():,} missing values")

# Forward fill first (for time-series continuity).
# DataFrame.ffill() replaces fillna(method='ffill'), deprecated since pandas 2.1.
df.ffill(inplace=True)

# Then backward fill for any values still missing at the start
df.bfill(inplace=True)

# Verify
print(f"   After: {df.isnull().sum().sum():,} missing values")
print("\n✅ Missing values handled successfully!")

# Display cleaned data info
print("\n📊 Cleaned Dataset Summary:")
df.describe().round(3)

🔧 Handling Missing Values...
   Before: 181,853 missing values
   After: 0 missing values

✅ Missing values handled successfully!

📊 Cleaned Dataset Summary:

| | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|
| count | 2075259.0 | 2075259.0 | 2075259.0 | 2075259.0 | 2075259.0 | 2075259.0 | 2075259.0 |
| mean | 1.086 | 0.123 | 240.842 | 4.604 | 1.111 | 1.288 | 6.417 |
| std | 1.053 | 0.113 | 3.236 | 4.427 | 6.116 | 5.787 | 8.42 |
| min | 0.076 | 0.0 | 223.2 | 0.2 | 0.0 | 0.0 | 0.0 |
| 25% | 0.308 | 0.048 | 239.0 | 1.4 | 0.0 | 0.0 | 0.0 |
| 50% | 0.598 | 0.1 | 241.02 | 2.6 | 0.0 | 0.0 | 1.0 |
| 75% | 1.524 | 0.194 | 242.87 | 6.4 | 0.0 | 1.0 | 17.0 |
| max | 11.122 | 1.39 | 254.15 | 48.4 | 88.0 | 80.0 | 31.0 |
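
Module 2 also lists outlier handling, which the cells shown here do not perform. The summary above has a maximum of 11.122 for `Global_active_power` against a 75th percentile of 1.524, so the distribution is strongly right-skewed. One common option is the interquartile-range (IQR) rule, sketched below on toy values. The 1.5 multiplier, the toy series, and the choice to flag or clip rather than drop are all assumptions, not the notebook's method.

```python
import pandas as pd

# Toy power readings with one extreme value (illustrative)
power = pd.Series([0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.5, 11.0], name="Global_active_power")

# Interquartile range and the conventional 1.5 * IQR fences
q1, q3 = power.quantile(0.25), power.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences instead of deleting them
is_outlier = (power < lower) | (power > upper)

# Optional: cap extreme values at the fences (winsorizing)
clipped = power.clip(lower=lower, upper=upper)

print(f"Fences: [{lower:.3f}, {upper:.3f}], flagged: {int(is_outlier.sum())}")
```

For household power data, large values can be genuine peaks (several appliances running at once), so flagging and inspecting them is usually safer than removing them outright.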


## 4️⃣ DateTime Feature Engineering

Extracting temporal features for pattern analysis:
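- Hour of day (0-23)
- Day of week (0-6, Monday=0)
- Day of month
- Month (1-12)
- Quarter (1-4)
- Year
- Weekend indicator (1 for Saturday and Sunday, else 0)
- Season
- Day period (Morning, Afternoon, Evening, Night)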

In [7]:
# Extract temporal features
df['Hour'] = df.index.hour
df['DayOfWeek'] = df.index.dayofweek
df['Month'] = df.index.month
df['Year'] = df.index.year
df['Day'] = df.index.day
df['Quarter'] = df.index.quarter
df['IsWeekend'] = (df['DayOfWeek'] >= 5).astype(int)

# Create Season feature
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Autumn'

df['Season'] = df['Month'].apply(get_season)

# Day period (Morning, Afternoon, Evening, Night)
def get_day_period(hour):
    if 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'

df['DayPeriod'] = df['Hour'].apply(get_day_period)

print("✅ Temporal features created!")
print("\n📊 Sample of engineered features:")
df[['Global_active_power', 'Hour', 'DayOfWeek', 'Month', 'Year', 'IsWeekend', 'Season', 'DayPeriod']].head(10)

✅ Temporal features created!

📊 Sample of engineered features:

| DateTime | Global_active_power | Hour | DayOfWeek | Month | Year | IsWeekend | Season | DayPeriod |
|---|---|---|---|---|---|---|---|---|
| 2006-12-16 17:24:00 | 4.216 | 17 | 5 | 12 | 2006 | 1 | Winter | Evening |
| 2006-12-16 17:25:00 | 5.36 | 17 | 5 | 12 | 2006 | 1 | Winter | Evening |
| 2006-12-16 17:26:00 | 5.374 | 17 | 5 | 12 | 2006 | 1 | Winter | Evening |
| 2006-12-16 17:27:00 | 5.388 | 17 | 5 | 12 | 2006 | 1 | Winter | Evening |
| 2006-12-16 17:28:00 | 3.666 | 17 | 5 | 12 | 2006 | 1 | Winter | Evening |
| 2006-12-16 17:29:00 | 3.52 | 17 | 5 | 12 | 2006 | 1 | Winter | Evening |
| 2006-12-16 17:30:00 | 3.702 | 17 | 5 | 12 | 2006 | 1 | Winter | Evening |
| 2006-12-16 17:31:00 | 3.7 | 17 | 5 | 12 | 2006 | 1 | Winter | Evening |
| 2006-12-16 17:32:00 | 3.668 | 17 | 5 | 12 | 2006 | 1 | Winter | Evening |
| 2006-12-16 17:33:00 | 3.662 | 17 | 5 | 12 | 2006 | 1 | Winter | Evening |
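
With roughly two million rows, calling a Python function per row through `.apply` can be slow. The same `Season` and `DayPeriod` labels can be produced with vectorized operations. This is a sketch on a four-row toy index that reuses the month and hour boundaries from the cell above. `season_by_month` and `toy` are illustrative names, and no timing is claimed here.

```python
import numpy as np
import pandas as pd

# Toy index covering each season and several day periods (illustrative)
idx = pd.to_datetime([
    "2006-12-16 17:24:00",
    "2007-04-02 05:59:00",
    "2007-07-15 12:00:00",
    "2007-10-31 21:00:00",
])
toy = pd.DataFrame(index=idx)
toy["Month"] = toy.index.month

# Season: dictionary lookup per month, same grouping as get_season
seasons = {"Winter": (12, 1, 2), "Spring": (3, 4, 5),
           "Summer": (6, 7, 8), "Autumn": (9, 10, 11)}
season_by_month = {m: name for name, months in seasons.items() for m in months}
toy["Season"] = toy["Month"].map(season_by_month)

# Day period: boolean masks over the hour array, same bounds as get_day_period
hour = toy.index.hour.to_numpy()
conditions = [
    (hour >= 6) & (hour < 12),
    (hour >= 12) & (hour < 17),
    (hour >= 17) & (hour < 21),
]
toy["DayPeriod"] = np.select(conditions, ["Morning", "Afternoon", "Evening"], default="Night")

print(toy[["Season", "DayPeriod"]])
```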


## End of Milestone 1 Submission
This is the end of the Milestone 1 (Week 1-2) submission. All subsequent content, including advanced modeling, additional feature engineering, and analyses beyond the scope of Modules 1 and 2, has been intentionally removed for this milestone.