<a href="https://colab.research.google.com/github/anandaditya07/Smart-Energy-Consumption-Analysis-and-Prediction-using-Machine-Learning-with-Device-Level-Insights/blob/main/Aditya_Anand.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Milestone 1: Week 1-2**


**Module 1: Data Collection and Understanding**


1. **Define project scope and functional objectives for smart energy analysis.**



This project is about understanding how much electricity different appliances in a smart home use. Instead of only seeing one total electricity bill at the end of the month, we want to see which device uses how much power and when. This will help us know where energy is being wasted.

**Functional Objectives**

*   Track energy usage of each device and each room separately.
*   Show energy use in the form of graphs (hourly, daily, weekly).
*   Find which devices use the most power and at what time.
*   Use machine learning to predict future electricity use.
*   Help save electricity by giving suggestions to reduce unnecessary usage.


2. **Collect and structure the SmartHome Energy Monitoring Dataset**

In [3]:
# Basic libraries
!pip install matplotlib scikit-learn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler






In [6]:
import sys
import subprocess

# Essential packages (no TensorFlow)
packages = ['matplotlib', 'scikit-learn', 'pandas', 'numpy', 'seaborn', 'joblib', 'flask']

print("Installing essential packages...")
for pkg in packages:
    try:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', pkg], 
                            stdout=subprocess.DEVNULL)
        print(f"✓ {pkg}")
    except Exception as e:
        print(f"✗ {pkg} - {str(e)[:50]}")

print("\n✓ Installation complete!")
print("🔄 IMPORTANT: Restart your kernel now!")
print("   Go to: Kernel → Restart Kernel")

Installing essential packages...
✓ matplotlib
✓ scikit-learn
✓ pandas
✓ numpy
✓ seaborn
✓ joblib
✓ flask

✓ Installation complete!
🔄 IMPORTANT: Restart your kernel now!
   Go to: Kernel → Restart Kernel


In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

print("✓ All essential packages working!")
print("You can now run the project!")

✓ All essential packages working!
You can now run the project!


In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

import pandas as pd

path = r"D:\lala\HomeC_augmented.csv"  # raw string so the backslashes are not treated as escape sequences
df = pd.read_csv(path)

In [2]:
# Read the CSV
df_raw = pd.read_csv(path)

print("Original shape:", df_raw.shape)
df_raw.head()

Original shape: (503910, 42)


Unnamed: 0.1,Unnamed: 0,time,Dishwasher,Home office,Fridge,Wine cellar,Garage door,Barn,Well,Microwave,...,use_HO,gen_Sol,Car charger [kW],Water heater [kW],Air conditioning [kW],Home Theater [kW],Outdoor lights [kW],microwave [kW],Laundry [kW],Pool Pump [kW]
0,0,2016-01-01 05:00:00,3.3e-05,0.442633,0.12415,0.006983,0.013083,0.03135,0.001017,0.004067,...,0.932833,0.003483,0.013034,0.000241,0.013796,0.000519,0.0014,0.020147,0.0,0.000746
1,1,2016-01-01 05:01:00,0.0,0.444067,0.124,0.006983,0.013117,0.0315,0.001017,0.004067,...,0.934333,0.003467,0.021769,0.000978,0.014487,0.000543,0.0008,0.030903,0.0,0.002249
2,2,2016-01-01 05:02:00,1.7e-05,0.446067,0.123533,0.006983,0.013083,0.031517,0.001,0.004067,...,0.931817,0.003467,0.028218,0.000642,0.014498,0.000481,0.0012,0.0,0.001883,0.003971
3,3,2016-01-01 05:03:00,1.7e-05,0.446583,0.123133,0.006983,0.013,0.0315,0.001017,0.004067,...,1.02205,0.003483,0.036478,0.000218,0.014181,0.000531,0.0016,0.024038,0.00261,0.003673
4,4,2016-01-01 05:04:00,0.000133,0.446533,0.12285,0.00685,0.012783,0.0315,0.001017,0.004067,...,1.1394,0.003467,0.044295,0.000844,0.014949,0.001052,0.002,0.0,0.002462,0.005006


In [6]:
# Column names
df.columns


Index(['Unnamed: 0', 'time', 'Dishwasher', 'Home office', 'Fridge',
       'Wine cellar', 'Garage door', 'Barn', 'Well', 'Microwave',
       'Living room', 'temperature', 'humidity', 'visibility',
       'apparentTemperature', 'pressure', 'windSpeed', 'cloudCover',
       'windBearing', 'precipIntensity', 'dewPoint', 'precipProbability',
       'Furnace', 'Kitchen', 'year', 'month', 'day', 'weekday', 'weekofyear',
       'hour', 'minute', 'timing', 'use_HO', 'gen_Sol', 'Car charger [kW]',
       'Water heater [kW]', 'Air conditioning [kW]', 'Home Theater [kW]',
       'Outdoor lights [kW]', 'microwave [kW]', 'Laundry [kW]',
       'Pool Pump [kW]'],
      dtype='object')

In [7]:
# Basic info: data types, nulls, etc.
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503910 entries, 0 to 503909
Data columns (total 42 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Unnamed: 0             503910 non-null  int64  
 1   time                   503910 non-null  object 
 2   Dishwasher             503910 non-null  float64
 3   Home office            503910 non-null  float64
 4   Fridge                 503910 non-null  float64
 5   Wine cellar            503910 non-null  float64
 6   Garage door            503910 non-null  float64
 7   Barn                   503910 non-null  float64
 8   Well                   503910 non-null  float64
 9   Microwave              503910 non-null  float64
 10  Living room            503910 non-null  float64
 11  temperature            503910 non-null  float64
 12  humidity               503910 non-null  float64
 13  visibility             503910 non-null  float64
 14  apparentTemperature    503910 non-nu

In [8]:
# Basic statistics for numerical columns
df.describe().T


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,503910.0,251954.5,145466.431411,0.0,125977.25,251954.5,377931.75,503909.0
Dishwasher,503910.0,0.031368,0.190951,0.0,0.0,1.7e-05,0.000233,1.401767
Home office,503910.0,0.081287,0.104466,8.3e-05,0.040383,0.042217,0.068283,0.97175
Fridge,503910.0,0.063556,0.076199,6.7e-05,0.005083,0.005433,0.125417,0.851267
Wine cellar,503910.0,0.042137,0.057967,1.7e-05,0.007133,0.008083,0.053192,1.273933
Garage door,503910.0,0.014139,0.014292,1.7e-05,0.012733,0.012933,0.0131,1.088983
Barn,503910.0,0.05853,0.202706,0.0,0.029833,0.031317,0.032883,7.0279
Well,503910.0,0.015642,0.137841,0.0,0.000983,0.001,0.001017,1.633017
Microwave,503910.0,0.010983,0.098859,0.0,0.003617,0.004,0.004067,1.9298
Living room,503910.0,0.035313,0.096056,0.0,0.001483,0.001617,0.00175,0.465217


In [10]:
import pandas as pd
# Change 'timestamp' to the actual time column name from df_raw.columns
time_col = "time"   # e.g. "date", "time", "Datetime" etc.

# Convert to datetime
df_raw[time_col] = pd.to_datetime(df_raw[time_col], errors='coerce')

# Drop rows where timestamp could not be parsed
df_raw = df_raw.dropna(subset=[time_col])

# Sort by time
df_raw = df_raw.sort_values(time_col)

# Set timestamp as index
df = df_raw.set_index(time_col)

print("After setting time index:", df.shape)
df.head()

After setting time index: (503910, 41)


time,Unnamed: 0,Dishwasher,Home office,Fridge,Wine cellar,Garage door,Barn,Well,Microwave,Living room,...,use_HO,gen_Sol,Car charger [kW],Water heater [kW],Air conditioning [kW],Home Theater [kW],Outdoor lights [kW],microwave [kW],Laundry [kW],Pool Pump [kW]
2016-01-01 05:00:00,0,3.3e-05,0.442633,0.12415,0.006983,0.013083,0.03135,0.001017,0.004067,0.001517,...,0.932833,0.003483,0.013034,0.000241,0.013796,0.000519,0.0014,0.020147,0.0,0.000746
2016-01-01 05:01:00,1,0.0,0.444067,0.124,0.006983,0.013117,0.0315,0.001017,0.004067,0.00165,...,0.934333,0.003467,0.021769,0.000978,0.014487,0.000543,0.0008,0.030903,0.0,0.002249
2016-01-01 05:02:00,2,1.7e-05,0.446067,0.123533,0.006983,0.013083,0.031517,0.001,0.004067,0.00165,...,0.931817,0.003467,0.028218,0.000642,0.014498,0.000481,0.0012,0.0,0.001883,0.003971
2016-01-01 05:03:00,3,1.7e-05,0.446583,0.123133,0.006983,0.013,0.0315,0.001017,0.004067,0.001617,...,1.02205,0.003483,0.036478,0.000218,0.014181,0.000531,0.0016,0.024038,0.00261,0.003673
2016-01-01 05:04:00,4,0.000133,0.446533,0.12285,0.00685,0.012783,0.0315,0.001017,0.004067,0.001583,...,1.1394,0.003467,0.044295,0.000844,0.014949,0.001052,0.002,0.0,0.002462,0.005006


In [11]:
# All devices/measurements (since time is now index)
device_cols = df.columns.tolist()
print("Device / sensor columns:", device_cols[:10])

Device / sensor columns: ['Unnamed: 0', 'Dishwasher', 'Home office', 'Fridge', 'Wine cellar', 'Garage door', 'Barn', 'Well', 'Microwave', 'Living room']




3. **Verify data integrity, handle missing timestamps, and perform exploratory analysis.**



i. Check Data Integrity

  We verify whether the dataset has:

*   Repeated timestamps
*   Empty/Missing data

    If yes, we fix them.
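
The two checks can be sketched on a small toy frame. The column name `power` and the sample values below are assumptions made for illustration; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one repeated timestamp and one missing reading
idx = pd.to_datetime([
    "2016-01-01 05:00", "2016-01-01 05:01",
    "2016-01-01 05:01", "2016-01-01 05:02",
])
toy = pd.DataFrame({"power": [0.10, 0.20, 0.20, np.nan]}, index=idx)

# Integrity report: repeated timestamps and empty cells
n_duplicates = int(toy.index.duplicated().sum())
n_missing = int(toy["power"].isna().sum())
print("Repeated timestamps:", n_duplicates)
print("Missing values:", n_missing)

# Fix: keep the first row per timestamp, then fill gaps from neighbouring readings
toy_clean = toy[~toy.index.duplicated(keep="first")].ffill().bfill()
```

Removing repeated timestamps first matters because `DataFrame.reindex` raises an error when the existing index contains duplicate labels.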





ii. Handle Missing Timestamps

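We rebuild a continuous timeline at the inferred sampling interval so that no timestamps are missing,
and we fill the resulting gaps by carrying the previous reading forward (falling back to the next reading at the very start of the series).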

In [13]:
import pandas as pd
print("\n~~~~~~~ MISSING TIMESTAMP HANDLING ~~~~~~~")
# Try to infer the sampling interval from the first 100 timestamps (e.g. 1 hour, 5 min)
inferred_freq = pd.infer_freq(df.index[:100])
print("Inferred frequency:", inferred_freq)
# If the frequency cannot be detected, fall back to an hourly interval
if inferred_freq is None:
    inferred_freq = 'h'  # lowercase alias, matching resample('h') used later
# Create a new continuous timeline with no gaps
full_range = pd.date_range(start=df.index.min(),
                           end=df.index.max(),
                           freq=inferred_freq)
# Reindex so dataset follows this timeline
df = df.reindex(full_range)
df.index.name = "timestamp"
# Fill empty values created by reindexing
df = df.ffill().bfill()
print("Missing values after filling:")
print(df.isna().sum())


~~~~~~~ MISSING TIMESTAMP HANDLING ~~~~~~~
Inferred frequency: min
Missing values after filling:
Unnamed: 0               0
Dishwasher               0
Home office              0
Fridge                   0
Wine cellar              0
Garage door              0
Barn                     0
Well                     0
Microwave                0
Living room              0
temperature              0
humidity                 0
visibility               0
apparentTemperature      0
pressure                 0
windSpeed                0
cloudCover               0
windBearing              0
precipIntensity          0
dewPoint                 0
precipProbability        0
Furnace                  0
Kitchen                  0
year                     0
month                    0
day                      0
weekday                  0
weekofyear               0
hour                     0
minute                   0
timing                   0
use_HO                   0
gen_Sol                  0
Car charger

iii. Exploratory Data Analysis
* Minimum, maximum, average energy usage per device

* Graph that shows how energy usage changes with time


In [14]:
import matplotlib.pyplot as plt
import numpy as np
print("\n~~~~~~EXPLORATORY ANALYSIS ~~~~~~~")
# Show basic numeric statistics for all device columns
display(df.describe().T)
# Show how a few device values change over time
plt.figure(figsize=(12,4))
# Exclude 'Unnamed: 0' from sample_cols for better visualization of energy consumption
sample_cols = [col for col in df.select_dtypes(include=[np.number]).columns if col != 'Unnamed: 0'][:3]
for col in sample_cols:
    plt.plot(df.index, df[col], label=col)
plt.xlabel("Time")
plt.ylabel("Energy Consumption")
plt.title("Sample Energy Consumption Over Time")
plt.legend()
plt.show()

ModuleNotFoundError: No module named 'matplotlib'

**iv**. **Organize energy readings by device, room, and timestamp.**

In [15]:
# Make a copy to be safe
df_device = df.copy()
# Select all numeric columns (this also picks up 'Unnamed: 0', weather, and calendar columns, not only devices)
device_cols = df_device.select_dtypes(include=['number']).columns.tolist()
print("Device / sensor columns:", device_cols)
# Convert from wide format → long format
df_long = df_device.reset_index().melt(
    id_vars=["timestamp"],       # column that stays fixed (time)
    value_vars=device_cols,      # columns that will become 'device'
    var_name="device",           # new column name for device name
    value_name="energy"          # new column name for energy value
)
print("Long format shape:", df_long.shape)
df_long.head()

Device / sensor columns: ['Unnamed: 0', 'Dishwasher', 'Home office', 'Fridge', 'Wine cellar', 'Garage door', 'Barn', 'Well', 'Microwave', 'Living room', 'temperature', 'humidity', 'visibility', 'apparentTemperature', 'pressure', 'windSpeed', 'cloudCover', 'windBearing', 'precipIntensity', 'dewPoint', 'precipProbability', 'Furnace', 'Kitchen', 'year', 'month', 'day', 'weekofyear', 'hour', 'minute', 'use_HO', 'gen_Sol', 'Car charger [kW]', 'Water heater [kW]', 'Air conditioning [kW]', 'Home Theater [kW]', 'Outdoor lights [kW]', 'microwave [kW]', 'Laundry [kW]', 'Pool Pump [kW]']
Long format shape: (19652490, 3)


Unnamed: 0,timestamp,device,energy
0,2016-01-01 05:00:00,Unnamed: 0,0.0
1,2016-01-01 05:01:00,Unnamed: 0,1.0
2,2016-01-01 05:02:00,Unnamed: 0,2.0
3,2016-01-01 05:03:00,Unnamed: 0,3.0
4,2016-01-01 05:04:00,Unnamed: 0,4.0


In [16]:
# Example device → room mapping
# IMPORTANT: change keys to match your real device names
room_map = {
    "Kitchen_Light": "Kitchen",
    "Fridge": "Kitchen",
    "AC_Bedroom": "Bedroom",
    "TV_LivingRoom": "Living Room",
    # Add more device: room pairs here...
}
# Create 'room' column using the mapping
df_long["room"] = df_long["device"].map(room_map).fillna("Unknown")
# Show first few organized rows
df_long.head()

Unnamed: 0,timestamp,device,energy,room
0,2016-01-01 05:00:00,Unnamed: 0,0.0,Unknown
1,2016-01-01 05:01:00,Unnamed: 0,1.0,Unknown
2,2016-01-01 05:02:00,Unnamed: 0,2.0,Unknown
3,2016-01-01 05:03:00,Unnamed: 0,3.0,Unknown
4,2016-01-01 05:04:00,Unnamed: 0,4.0,Unknown
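
The placeholder keys in `room_map` do not occur in this dataset, so every row ends up as "Unknown". Below is a minimal sketch of a mapping keyed on column names that do appear in `df.columns`. The room assignments and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical device-to-room mapping using column names from df.columns;
# the room assignments themselves are assumptions, not part of the dataset.
room_map_example = {
    "Dishwasher": "Kitchen",
    "Fridge": "Kitchen",
    "Microwave": "Kitchen",
    "Kitchen": "Kitchen",
    "Living room": "Living Room",
    "Home office": "Home Office",
    "Garage door": "Garage",
}

# Small long-format frame standing in for df_long (toy values)
toy_long = pd.DataFrame({
    "device": ["Fridge", "Living room", "temperature"],
    "energy": [0.124, 0.0015, 36.1],
})

# Devices without an entry (here the weather column) fall back to "Unknown"
toy_long["room"] = toy_long["device"].map(room_map_example).fillna("Unknown")
print(toy_long)
```

Rows that remain "Unknown" after mapping are a quick way to spot non-device columns that were melted into the long table.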


**Module 2: Data Cleaning and Preprocessing**

i. Handle missing values and outliers in power consumption readings.



In [17]:
import numpy as np # Ensure numpy is imported for np.number, if not already
# Missing values check
print("Missing values before cleaning:")
print(df.isna().sum())

# Fill missing values using forward & backward fill
df = df.ffill().bfill()
print("Missing values after filling:")
print(df.isna().sum())

# Cap outliers by clipping each numeric column to its 1st-99th percentile range (rows are kept, extreme values are limited)
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
for col in num_cols:
    low, high = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(lower=low, upper=high)

print("Outliers handled successfully.")

Missing values before cleaning:
Unnamed: 0               0
Dishwasher               0
Home office              0
Fridge                   0
Wine cellar              0
Garage door              0
Barn                     0
Well                     0
Microwave                0
Living room              0
temperature              0
humidity                 0
visibility               0
apparentTemperature      0
pressure                 0
windSpeed                0
cloudCover               0
windBearing              0
precipIntensity          0
dewPoint                 0
precipProbability        0
Furnace                  0
Kitchen                  0
year                     0
month                    0
day                      0
weekday                  0
weekofyear               0
hour                     0
minute                   0
timing                   0
use_HO                   0
gen_Sol                  0
Car charger [kW]         0
Water heater [kW]        0
Air conditioning [kW]  
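
The clipping loop in the cell above runs over every numeric column, so it also caps the row counter `Unnamed: 0` and the calendar fields. A hedged sketch of restricting the clipping to power columns follows. The `power_cols` list and the toy values are assumptions, not the notebook's actual selection.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one power column with a spike, one calendar column
toy = pd.DataFrame({
    "Fridge": np.r_[np.full(99, 0.1), 5.0],   # last reading is an outlier
    "hour": np.arange(100) % 24,
})

# Assumed list of power columns; calendar and weather columns are left untouched
power_cols = ["Fridge"]

clipped = toy.copy()
for col in power_cols:
    low, high = clipped[col].quantile([0.01, 0.99])
    clipped[col] = clipped[col].clip(lower=low, upper=high)

print("Max before:", toy["Fridge"].max(), "| max after:", clipped["Fridge"].max())
```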

ii. Convert timestamps to datetime format and resample data (hourly/daily).

In [18]:
import numpy as np
# PART 2: RESAMPLE DATA (HOURLY / DAILY)

# Select only numeric columns for resampling
numeric_df = df.select_dtypes(include=[np.number])

# Hourly average consumption
df_hourly = numeric_df.resample('h').mean()

print("Hourly data shape:", df_hourly.shape)
df_hourly.head()

Hourly data shape: (8399, 39)


timestamp,Unnamed: 0,Dishwasher,Home office,Fridge,Wine cellar,Garage door,Barn,Well,Microwave,Living room,...,use_HO,gen_Sol,Car charger [kW],Water heater [kW],Air conditioning [kW],Home Theater [kW],Outdoor lights [kW],microwave [kW],Laundry [kW],Pool Pump [kW]
2016-01-01 05:00:00,5039.09,6.4e-05,0.241814,0.037861,0.063351,0.013046,0.038881,0.001042,0.004348,0.001505,...,1.04413,0.003307,1.343201,0.0337,0.01452,0.010582,0.117,0.120953,0.017833,0.090792
2016-01-01 06:00:00,5039.09,9.9e-05,0.043294,0.075522,0.112942,0.012836,0.039181,0.001021,0.004216,0.001618,...,0.918167,0.003422,1.343201,0.106164,0.014299,0.012238,0.117,0.411554,0.027082,0.090792
2016-01-01 07:00:00,5039.09,4.3e-05,0.043416,0.059486,0.007184,0.013299,0.034439,0.001014,0.004246,0.001629,...,0.714736,0.003448,1.343201,0.127881,0.014002,0.012473,0.117,1.52346,0.027375,0.097588
2016-01-01 08:00:00,5039.09,0.000138,0.065014,0.060412,0.007045,0.012925,0.034195,0.001016,0.004274,0.001634,...,0.960013,0.003447,1.381755,0.125348,0.013986,0.013061,0.117,0.302304,0.033354,0.13101
2016-01-01 09:00:00,5039.09,6e-05,0.043392,0.035106,0.007143,0.01322,0.03183,0.001014,0.004258,0.00165,...,0.639836,0.003439,1.5462,0.059802,0.014145,0.014597,0.117,0.072612,0.062879,0.175744


iii. Normalize or scale energy values for model compatibility.

In [19]:
from sklearn.preprocessing import MinMaxScaler

# PART 3: NORMALIZATION / SCALING

# Select target and features later
df_scaled = df_hourly.copy()

scaler = MinMaxScaler()
df_scaled[df_hourly.columns] = scaler.fit_transform(df_hourly)

df_scaled.head()

ModuleNotFoundError: No module named 'sklearn'
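
The scaling cell fits `MinMaxScaler` on all of `df_hourly`, which lets the minimum and maximum of later (validation and test) rows influence the training data. One common alternative is to fit on the training portion only. The sketch below assumes a 70% chronological split and uses toy values; `toy_hourly` is a stand-in, not the real frame.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy hourly frame (hypothetical values) standing in for df_hourly
idx = pd.date_range("2016-01-01", periods=10, freq="h")
toy_hourly = pd.DataFrame({"use_HO": np.arange(10, dtype=float)}, index=idx)

# Chronological split: first 70% for training, the rest held out
split = int(len(toy_hourly) * 0.7)
train, holdout = toy_hourly.iloc[:split], toy_hourly.iloc[split:]

# Fit on the training rows only, then apply the same transform to the held-out rows
scaler = MinMaxScaler()
train_scaled = pd.DataFrame(scaler.fit_transform(train),
                            index=train.index, columns=train.columns)
holdout_scaled = pd.DataFrame(scaler.transform(holdout),
                              index=holdout.index, columns=holdout.columns)
```

With this approach, held-out values can fall outside the 0 to 1 range when they exceed what was seen during training; that is expected rather than an error.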

iv. Split dataset into training, validation, and testing sets.

In [21]:
# PART 4: TRAIN / VALIDATION / TEST SPLIT

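# Select the main target column.
# NOTE: columns[0] is 'Unnamed: 0' (a row counter) in this dataset; replace it with
# your main power column (for example "use_HO", if that is the whole-house total).
target_col = df_scaled.columns[0]  # placeholder: first numeric col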
print("Using target:", target_col)

# Create X and y
X = df_scaled.drop(columns=[target_col])
y = df_scaled[target_col]

# Time-based splitting
train_size = int(len(df_scaled) * 0.7)
val_size = int(len(df_scaled) * 0.15)

X_train = X.iloc[:train_size]
y_train = y.iloc[:train_size]

X_val = X.iloc[train_size:train_size + val_size]
y_val = y.iloc[train_size:train_size + val_size]

X_test = X.iloc[train_size + val_size:]
y_test = y.iloc[train_size + val_size:]

print("Train size:", len(X_train))
print("Validation size:", len(X_val))
print("Test size:", len(X_test))


NameError: name 'df_scaled' is not defined

**Milestone 2: Week 3-4**


Module 3: Feature Engineering

i. Extract relevant time-based features (hour, day, week, month trends).

In [22]:
# PART 1: TIME-BASED FEATURES
df_features = df_scaled.copy()

df_features["hour"] = df_features.index.hour
df_features["dayofweek"] = df_features.index.dayofweek   # 0=Monday
df_features["month"] = df_features.index.month

print("Time-based features added.")
df_features.head()

NameError: name 'df_scaled' is not defined

ii. Aggregate device-level consumption statistics.

In [None]:
# PART 2: AGGREGATE DEVICE CONSUMPTION
df_features["total_energy"] = df_features.select_dtypes(include='number').sum(axis=1)
df_features.head()
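
Summing every numeric column mixes power readings with weather and calendar values. A minimal sketch of aggregating over device columns only is shown here. The `device_cols_example` list and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values): two device columns, one weather column, one calendar column
toy = pd.DataFrame({
    "Fridge": [0.12, 0.10],
    "Microwave": [0.00, 0.40],
    "temperature": [36.1, 35.8],
    "hour": [5, 6],
})

# Assumed list of device columns; in the notebook this would be chosen from df.columns
device_cols_example = ["Fridge", "Microwave"]

# Row-wise total across devices only
toy["total_energy"] = toy[device_cols_example].sum(axis=1)

# Per-device summary statistics
device_stats = toy[device_cols_example].agg(["mean", "max", "sum"])
print(device_stats)
```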


iii. Create lag features and moving averages for time series learning.

In [23]:
# PART 3: LAG AND MOVING AVERAGE FEATURES
target_col = df_features.columns[0]  # first column ('Unnamed: 0' in this dataset); change to the main power column if needed
print("Target column:", target_col)

# Lag features (previous values)
for lag in [1, 6, 12, 24]:
    df_features[f"{target_col}_lag_{lag}"] = df_features[target_col].shift(lag)

# Rolling/Moving averages
df_features["rolling_mean_6"] = df_features[target_col].rolling(6).mean()
df_features["rolling_mean_12"] = df_features[target_col].rolling(12).mean()
df_features["rolling_mean_24"] = df_features[target_col].rolling(24).mean()

# Drop rows created with NaN from shifting
df_features = df_features.dropna()

df_features.head()


NameError: name 'df_features' is not defined
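
A rolling mean computed directly on the target includes the current row, so it carries information about the value being predicted. Shifting the series by one step before rolling keeps only past values. The sketch below shows both variants on a toy series; the column names and values are assumptions.

```python
import pandas as pd

# Toy hourly series (hypothetical values)
idx = pd.date_range("2016-01-01", periods=6, freq="h")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx, name="use_HO")

feats = pd.DataFrame({"use_HO": s})
feats["lag_1"] = s.shift(1)                               # value one hour earlier
feats["roll_mean_3"] = s.rolling(3).mean()                # window includes the current hour
feats["roll_mean_3_past"] = s.shift(1).rolling(3).mean()  # window uses past hours only

# Rows at the start have no complete history and are dropped
feats = feats.dropna()
print(feats)
```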

iv. Prepare final feature set for ML model input.

In [None]:

# PART 4: FINAL ML FEATURE MATRIX
X = df_features.drop(columns=[target_col])
y = df_features[target_col]

# Time-based splitting for model training
train_size = int(len(df_features) * 0.7)
val_size = int(len(df_features) * 0.15)

X_train = X.iloc[:train_size]
y_train = y.iloc[:train_size]

X_val = X.iloc[train_size:train_size+val_size]
y_val = y.iloc[train_size:train_size+val_size]

X_test = X.iloc[train_size+val_size:]
y_test = y.iloc[train_size+val_size:]

print("Training set:", X_train.shape)
print("Validation set:", X_val.shape)
print("Testing set:", X_test.shape)


**Module 4: Baseline Model Development**



i. Build the Baseline Model (Linear Regression)

In [None]:
from sklearn.linear_model import LinearRegression
# Initialize Linear Regression model
baseline_model = LinearRegression()
print("Baseline Linear Regression model created.")


ii. Train the Baseline Model.

In [None]:
baseline_model.fit(X_train, y_train)
print("Baseline model training completed.")


iii. Make Predictions.

In [None]:
y_val_pred = baseline_model.predict(X_val)
y_test_pred = baseline_model.predict(X_test)
print("Predictions generated for validation and test sets.")


iv. Evaluate Model Performance (MAE & RMSE)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
# Validation metrics
val_mae = mean_absolute_error(y_val, y_val_pred)
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
# Test metrics
test_mae = mean_absolute_error(y_test, y_test_pred)
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
print("Validation MAE :", val_mae)
print("Validation RMSE:", val_rmse)
print("\nTest MAE :", test_mae)
print("Test RMSE:", test_rmse)

v. Plot Actual vs Predicted Energy Usage

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 4))
plt.plot(y_test.values, label="Actual Energy", linewidth=2)
plt.plot(y_test_pred, label="Predicted Energy", linestyle="--")
plt.xlabel("Time Steps")
plt.ylabel("Energy Consumption")
plt.title("Baseline Linear Regression: Actual vs Predicted")
plt.legend()
plt.show()

vi. Save Baseline Results for Model Comparison.

In [None]:
baseline_results = {
    "Model": "Linear Regression",
    "Validation MAE": val_mae,
    "Validation RMSE": val_rmse,
    "Test MAE": test_mae,
    "Test RMSE": test_rmse
}
baseline_results

**Milestone 3 : Week 5-6**

**Module 5: LSTM Model Development**





i. CREATE SEQUENCES FOR LSTM



In [None]:
import numpy as np

# Target column (same as baseline)
target_col = df_features.columns[0] # Explicitly set target_col to the first column, which was used in previous steps

# Convert dataframe to numpy array
data = df_features[target_col].values

# Number of past time steps used for prediction
SEQ_LEN = 24  # last 24 hours

def create_sequences(data, seq_len):
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i:i + seq_len])
        y.append(data[i + seq_len])
    return np.array(X), np.array(y)

X_seq, y_seq = create_sequences(data, SEQ_LEN)

print("Sequence shape:", X_seq.shape)
print("Target shape:", y_seq.shape)

ii. Train / Validation / Test Split

In [None]:
train_size = int(len(X_seq) * 0.7)
val_size = int(len(X_seq) * 0.15)

X_train_seq = X_seq[:train_size]
y_train_seq = y_seq[:train_size]

X_val_seq = X_seq[train_size:train_size + val_size]
y_val_seq = y_seq[train_size:train_size + val_size]

X_test_seq = X_seq[train_size + val_size:]
y_test_seq = y_seq[train_size + val_size:]

# Reshape for LSTM: (samples, timesteps, features)
X_train_seq = X_train_seq.reshape(-1, SEQ_LEN, 1)
X_val_seq = X_val_seq.reshape(-1, SEQ_LEN, 1)
X_test_seq = X_test_seq.reshape(-1, SEQ_LEN, 1)

print("Train shape:", X_train_seq.shape)
print("Validation shape:", X_val_seq.shape)
print("Test shape:", X_test_seq.shape)


iii. Design and Implement LSTM Architecture

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam

# Hyperparameters
EPOCHS = 30
BATCH_SIZE = 32
LEARNING_RATE = 0.001

model = Sequential([
    LSTM(50, activation='tanh', input_shape=(SEQ_LEN, 1)),
    Dense(1)
])

optimizer = Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer, loss='mse')

model.summary()


iv. Train the LSTM Model

In [None]:
history = model.fit(
    X_train_seq, y_train_seq,
    validation_data=(X_val_seq, y_val_seq),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    verbose=1
)


v. Hyperparameter Tuning





*  Epochs → number of full passes over the training data
*  Learning rate → step size of each weight update
*  Batch size → number of samples processed per weight update





*   Create LSTM model function



In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam

def create_lstm_model(learning_rate):
    model = Sequential()
    model.add(LSTM(50, input_shape=(SEQ_LEN, 1)))
    model.add(Dense(1))

    optimizer = Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss='mse')

    return model




*   Define hyperparameter values


In [None]:
epochs_list = [20, 30]
batch_size_list = [32, 64]
learning_rate_list = [0.001, 0.0005]




*  Train model with different combinations


In [None]:
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd

results = []

for epochs in epochs_list:
    for batch_size in batch_size_list:
        for lr in learning_rate_list:

            print(f"Training with epochs={epochs}, batch_size={batch_size}, learning_rate={lr}")

            model = create_lstm_model(lr)

            model.fit(
                X_train_seq, y_train_seq,
                validation_data=(X_val_seq, y_val_seq),
                epochs=epochs,
                batch_size=batch_size,
                verbose=0
            )

            # Validation prediction
            y_val_pred = model.predict(X_val_seq)
            rmse = np.sqrt(mean_squared_error(y_val_seq, y_val_pred))

            results.append([epochs, batch_size, lr, rmse])


*   View best hyperparameter combination

In [None]:
results_df = pd.DataFrame(
    results,
    columns=["Epochs", "Batch Size", "Learning Rate", "Validation RMSE"]
)

results_df.sort_values("Validation RMSE").head()
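
The tuning loop reassigns `model` on every iteration, so once it finishes `model` holds the last combination tried rather than the best one. A hedged sketch of tracking the best combination while searching is given below. `validation_rmse_stub` is a hypothetical stand-in for "train a model and return its validation RMSE"; its formula is arbitrary.

```python
from itertools import product

epochs_options = [20, 30]
batch_size_options = [32, 64]
learning_rate_options = [0.001, 0.0005]

def validation_rmse_stub(epochs, batch_size, lr):
    """Hypothetical stand-in for training a model and returning its validation RMSE."""
    return abs(epochs - 30) * 0.01 + abs(batch_size - 64) * 0.001 + abs(lr - 0.0005)

# Keep the best score and its parameters as the search proceeds
best_score, best_params = float("inf"), None
for epochs, batch_size, lr in product(epochs_options, batch_size_options, learning_rate_options):
    score = validation_rmse_stub(epochs, batch_size, lr)
    if score < best_score:
        best_score, best_params = score, (epochs, batch_size, lr)

print("Best combination:", best_params)
```

In the notebook, the fitted model could be stored next to `best_params` in the same `if` branch so that later cells evaluate the best model directly.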


In [None]:
# EPOCHS = 50
# BATCH_SIZE = 64
# LEARNING_RATE = 0.0005

vi. Evaluate LSTM Model (MAE & RMSE)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

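# Note: `model` is whichever LSTM was trained most recently (after the tuning loop, the last combination tried)
y_test_pred_lstm = model.predict(X_test_seq)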

lstm_mae = mean_absolute_error(y_test_seq, y_test_pred_lstm)
lstm_rmse = np.sqrt(mean_squared_error(y_test_seq, y_test_pred_lstm))

print("LSTM MAE :", lstm_mae)
print("LSTM RMSE:", lstm_rmse)


vii. Compare LSTM with Baseline Model

In [None]:
comparison = {
    "Model": ["Linear Regression", "LSTM"],
    "MAE": [baseline_results["Test MAE"], lstm_mae],
    "RMSE": [baseline_results["Test RMSE"], lstm_rmse]
}

comparison


**Module 6: Model Evaluation and Integration**





i. Evaluate models using RMSE, MAE, and R² score.

Here we check how good each model is using:

MAE → average absolute error

RMSE → square root of the mean squared error, which penalizes large errors more heavily

R² Score → share of the variance in the target that the model explains
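
For reference, the three metrics can be written as follows. The notation is introduced here and is not part of the original notebook: $y_i$ is the actual value, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the actual values, and $n$ the number of test samples.

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

Lower MAE and RMSE are better; $R^2$ is at most 1 and can be negative when a model performs worse than always predicting $\bar{y}$.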

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# ----- Linear Regression Evaluation -----
y_test_pred_lr = baseline_model.predict(X_test)

lr_mae = mean_absolute_error(y_test, y_test_pred_lr)
lr_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred_lr))
lr_r2 = r2_score(y_test, y_test_pred_lr)

print("Linear Regression:")
print("MAE :", lr_mae)
print("RMSE:", lr_rmse)
print("R2  :", lr_r2)

# ----- LSTM Evaluation -----
y_test_pred_lstm = model.predict(X_test_seq)

lstm_mae = mean_absolute_error(y_test_seq, y_test_pred_lstm)
lstm_rmse = np.sqrt(mean_squared_error(y_test_seq, y_test_pred_lstm))
lstm_r2 = r2_score(y_test_seq, y_test_pred_lstm)

print("\nLSTM Model:")
print("MAE :", lstm_mae)
print("RMSE:", lstm_rmse)
print("R2  :", lstm_r2)


ii. **Select Best-Performing Model**

The model with lower MAE and RMSE and higher R² is selected as the final model. In the code below, the comparison is made on test RMSE.

In [None]:
#  MODEL SELECTION

# Identify best hyperparameters from the results_df
best_hyperparams = results_df.sort_values("Validation RMSE").iloc[0]
best_epochs = int(best_hyperparams["Epochs"])
best_batch_size = int(best_hyperparams["Batch Size"])
best_lr = best_hyperparams["Learning Rate"]

print(f"Retraining LSTM with best hyperparameters: Epochs={best_epochs}, Batch Size={best_batch_size}, Learning Rate={best_lr}")

# Create and train the final LSTM model with the best hyperparameters
final_lstm_model = create_lstm_model(best_lr)
final_lstm_model.fit(
    X_train_seq, y_train_seq,
    validation_data=(X_val_seq, y_val_seq),
    epochs=best_epochs,
    batch_size=best_batch_size,
    verbose=0
)

# Evaluate the final LSTM model for comparison
y_test_pred_final_lstm = final_lstm_model.predict(X_test_seq)
final_lstm_rmse = np.sqrt(mean_squared_error(y_test_seq, y_test_pred_final_lstm))

if final_lstm_rmse < lr_rmse:
    best_model_name = "LSTM"
    best_model = final_lstm_model
else:
    best_model_name = "Linear Regression"
    best_model = baseline_model

print("Best performing model:", best_model_name)
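
The selection step compares RMSE only. If the other two metrics should also count, one possible approach is to rank on RMSE first and fall back to MAE and R² on ties. The sketch below assumes a plain dictionary of metrics; the `pick_best` helper and all numbers in it are hypothetical and only demonstrate the comparison.

```python
# Hypothetical metric values, invented only to demonstrate the comparison
metrics = {
    "Linear Regression": {"MAE": 0.42, "RMSE": 0.61, "R2": 0.71},
    "LSTM": {"MAE": 0.35, "RMSE": 0.52, "R2": 0.79},
}

def pick_best(metrics):
    """Return the model name with the lowest RMSE.

    Ties on RMSE fall back to lower MAE, then to higher R2.
    """
    return min(
        metrics,
        key=lambda name: (
            metrics[name]["RMSE"],
            metrics[name]["MAE"],
            -metrics[name]["R2"],
        ),
    )

best = pick_best(metrics)
print("Best performing model:", best)
```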

iii. **Save Trained Model Weights**

Saving the model allows us to:

*   reuse it later
*   integrate it with Flask
*   avoid retraining every time

In [None]:
# SAVE MODEL

import joblib

if best_model_name == "LSTM":
    best_model.save("best_lstm_model.h5")
    print("LSTM model saved successfully.")
else:
    joblib.dump(best_model, "baseline_lr_model.pkl")
    print("Linear Regression model saved successfully.")
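
A quick way to check that a saved file can really be reused is to load it back and compare predictions. The sketch below does this for a scikit-learn model with `joblib`; the toy data and the file name `toy_lr_model.pkl` are assumptions made for the example, not part of the project.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1, invented for illustration
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0
toy_model = LinearRegression().fit(X, y)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_lr_model.pkl")  # hypothetical file name
    joblib.dump(toy_model, path)
    reloaded = joblib.load(path)

# The reloaded model should give the same predictions as the original
same = np.allclose(toy_model.predict(X), reloaded.predict(X))
print("Reloaded model reproduces predictions:", same)
```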


iv. **Flask-Compatible Prediction Function**

This function is written so Flask can call it easily to predict energy values.

In [None]:
# FLASK-COMPATIBLE FUNCTION

import numpy as np

def predict_energy(input_data):
    """
    input_data:
    - For LSTM: last SEQ_LEN values as a list
    - For LR: feature vector
    """

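    if best_model_name == "LSTM":
        # Keras expects shape (samples, timesteps, features)
        input_array = np.array(input_data, dtype=float).reshape(1, SEQ_LEN, 1)
        prediction = best_model.predict(input_array)
        return float(np.ravel(prediction)[0])

    else:
        # scikit-learn expects shape (samples, features)
        input_array = np.array(input_data, dtype=float).reshape(1, -1)
        prediction = best_model.predict(input_array)
        # np.ravel handles both 1-D and 2-D prediction arrays
        return float(np.ravel(prediction)[0])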
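
Because a web request can carry any payload, it may help to validate the input before reshaping it. The following is a minimal sketch, assuming a window length of 24; the helper name `to_lstm_input` is hypothetical and not part of the notebook.

```python
import numpy as np

SEQ_LEN = 24  # assumed window length; must match the value used in training

def to_lstm_input(values, seq_len=SEQ_LEN):
    """Validate a flat list of readings and reshape it to (1, seq_len, 1)."""
    arr = np.asarray(values, dtype=float)
    if arr.ndim != 1 or arr.size != seq_len:
        raise ValueError(f"expected {seq_len} values, got shape {arr.shape}")
    if not np.all(np.isfinite(arr)):
        raise ValueError("input contains NaN or infinite values")
    return arr.reshape(1, seq_len, 1)

window = to_lstm_input([0.5] * SEQ_LEN)
print(window.shape)
```

A wrong-length or non-finite input raises `ValueError`, which a Flask route could catch and turn into a 400 response.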


v. **Test Model with Sample Input**

We test the function with dummy or real recent data to confirm it works correctly.

In [None]:
# SAMPLE TEST

if best_model_name == "LSTM":
    sample_input = X_test_seq[0].flatten().tolist()
else:
    sample_input = X_test.iloc[0].tolist()

sample_prediction = predict_energy(sample_input)

print("Sample prediction output:", sample_prediction)


**Milestone 4: Week 7-8**

**Module 7: Dashboard and Visualization**

i. **Create Consumption Views (Hourly / Daily / Weekly / Monthly)**

In this part, energy data is grouped at different time levels to understand usage patterns:

* Hourly → short-term behavior
* Daily → day-to-day trend
* Weekly → usage pattern across weeks
* Monthly → long-term consumption trend
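
As a small illustration of how this grouping behaves, the sketch below resamples two weeks of synthetic hourly readings to daily and weekly means. The `power` column and its values are invented for the example; the real dataset will have different columns.

```python
import numpy as np
import pandas as pd

# Two weeks of synthetic hourly readings (Mon 2024-01-01 to Sun 2024-01-14)
index = pd.date_range("2024-01-01", periods=14 * 24, freq=pd.Timedelta(hours=1))

# Hypothetical "power" column: the reading equals the hour of day (0..23)
toy = pd.DataFrame({"power": index.hour.to_numpy(dtype=float)}, index=index)

# Group the hourly rows into daily and weekly averages
toy_daily = toy.resample("D").mean()
toy_weekly = toy.resample("W").mean()

print("Hourly rows:", len(toy))
print("Daily rows :", len(toy_daily))
print("Weekly rows:", len(toy_weekly))
```

Each coarser level has fewer rows, and each row is the mean of the hourly readings that fall inside that day or week.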

In [None]:
import numpy as np

# Select only numeric columns for resampling
numeric_df = df.select_dtypes(include=[np.number])

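# Hourly, Daily, Weekly, Monthly aggregation
# resample() requires df to have a DatetimeIndex.
# The 'ME' (month-end) alias needs pandas 2.2+; older versions use 'M'.
df_hourly = numeric_df.resample('h').mean()
df_daily = numeric_df.resample('D').mean()
df_weekly = numeric_df.resample('W').mean()
df_monthly = numeric_df.resample('ME').mean()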

print("Hourly shape :", df_hourly.shape)
print("Daily shape  :", df_daily.shape)
print("Weekly shape :", df_weekly.shape)
print("Monthly shape:", df_monthly.shape)

ii. **Plot Hourly / Daily / Weekly / Monthly Dashboard**

This dashboard visually shows how energy consumption changes over time at different levels.

In [None]:

import matplotlib.pyplot as plt

# Plot the first numeric column; replace this with the target column name if it differs
target_col = df_hourly.select_dtypes(include='number').columns[0]

plt.figure(figsize=(14,8))

plt.subplot(2,2,1)
plt.plot(df_hourly[target_col])
plt.title("Hourly Energy Consumption")

plt.subplot(2,2,2)
plt.plot(df_daily[target_col])
plt.title("Daily Energy Consumption")

plt.subplot(2,2,3)
plt.plot(df_weekly[target_col])
plt.title("Weekly Energy Consumption")

plt.subplot(2,2,4)
plt.plot(df_monthly[target_col])
plt.title("Monthly Energy Consumption")

plt.tight_layout()
plt.show()


iii. **Device-Wise Energy Usage Charts**

This part shows which devices consume the most energy, helping users identify high-usage appliances.

In [None]:

# Calculate total energy usage per device
# (assumes every numeric column in df is a device reading)
device_energy = df.select_dtypes(include='number').sum().sort_values(ascending=False)

plt.figure(figsize=(10,5))
device_energy.plot(kind='bar')
plt.title("Device-wise Total Energy Consumption")
plt.xlabel("Device")
plt.ylabel("Total Energy")
plt.xticks(rotation=45)
plt.show()
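
Totals are easier to compare when expressed as a share of overall consumption. Here is a minimal sketch of that calculation; the device names and readings are hypothetical and only stand in for the real columns.

```python
import pandas as pd

# Hypothetical device readings, invented for illustration
toy = pd.DataFrame({
    "fridge": [1.0, 1.0, 2.0],
    "heater": [3.0, 2.0, 1.0],
    "tv": [0.0, 0.5, 0.5],
})

# Total per device, highest first
totals = toy.sum().sort_values(ascending=False)

# Each device's share of the overall total, in percent
share_pct = (totals / totals.sum() * 100).round(1)

print(share_pct)
```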


iv. **Smart Suggestions (Energy Efficiency Tips)**

In [None]:


suggestions = []

high_usage_threshold = device_energy.mean()

for device, energy in device_energy.items():
    if energy > high_usage_threshold:
        suggestions.append(
            f"{device} uses high energy. Consider reducing usage or using energy-efficient alternatives."
        )

print("SMART ENERGY SAVING SUGGESTIONS:\n")
for tip in suggestions:
    print("-", tip)


v. **Simple Text-Based Dashboard Summary**

In [None]:


print("DASHBOARD SUMMARY")
print("------------------")
print("Total Devices:", len(device_energy))
print("Highest Energy Device:", device_energy.idxmax())
print("Lowest Energy Device :", device_energy.idxmin())
print("Average Consumption  :", round(device_energy.mean(), 2))


**Module 8: Web Application Deployment and Reporting**

i. **Develop Flask API (Connect ML Model to Backend)**

In [None]:
from flask import Flask, request, jsonify
import numpy as np
import joblib
from tensorflow.keras.models import load_model

app = Flask(__name__)

# Load best model (use one depending on your selection)
# For LSTM
model = load_model("best_lstm_model.h5", compile=False)

SEQ_LEN = 24  # same as training

@app.route("/")
def home():
    return "Smart Energy Prediction API is running"

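@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body of the form {"input": [v1, ..., v24]}
    payload = request.get_json(silent=True)
    data = payload.get("input") if isinstance(payload, dict) else None
    if not isinstance(data, list) or len(data) != SEQ_LEN:
        return jsonify({"error": f"'input' must be a list of {SEQ_LEN} numbers"}), 400

    input_array = np.array(data, dtype=float).reshape(1, SEQ_LEN, 1)

    prediction = model.predict(input_array)
    return jsonify({
        "predicted_energy": float(prediction[0][0])
    })

if __name__ == "__main__":
    # debug=True is meant for local development only
    app.run(debug=True)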
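
The endpoint can be exercised without starting a server by using Flask's built-in test client. The sketch below is self-contained and swaps the LSTM for a hypothetical `stub_predict` function (it just returns the mean of the window), so it shows the request and response format only, not real predictions.

```python
from flask import Flask, jsonify, request

SEQ_LEN = 24  # assumed to match the training window

def stub_predict(values):
    """Hypothetical stand-in for the trained model: mean of the window."""
    return sum(values) / len(values)

def create_app():
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True)
        values = payload.get("input") if isinstance(payload, dict) else None
        if not isinstance(values, list) or len(values) != SEQ_LEN:
            return jsonify({"error": f"'input' must be a list of {SEQ_LEN} numbers"}), 400
        return jsonify({"predicted_energy": float(stub_predict(values))})

    return app

# The test client sends requests directly to the app, with no network involved
client = create_app().test_client()

ok = client.post("/predict", json={"input": [1.0] * SEQ_LEN})
bad = client.post("/predict", json={"input": [1.0, 2.0]})

print(ok.status_code, ok.get_json())
print(bad.status_code, bad.get_json())
```

The first request carries a full window and returns a prediction; the second is too short and is rejected with status 400.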
