### ENERGY REAL-TIME FORECASTING PROJECT

In [None]:
# Install required packages in Colab
!pip install pandas  matplotlib seaborn scikit-learn xgboost requests

#### Data Loading from the Supabase REST API

1. **Imports**  
   - `pandas` for handling tables of data.  
   - `requests` for making HTTP calls to the Supabase API.

2. **Configuration**  
   - `SUPABASE_URL` and `SUPABASE_API_KEY` point to your database and authorize access.  
   - `TABLE` is the name of the table you’ll pull (`smart_meter_readings`).

3. **`load_data()` function**  
   - Builds the URL to fetch **all columns** (`select=*`) from your table, sorted by `timestamp` ascending (`order=timestamp.asc`).  
   - Sends a GET request with your API key.  
   - If successful (`status_code == 200`), prints the number of records retrieved and converts the JSON response into a pandas DataFrame.  
   - If unsuccessful, raises an error with the status code and message.

4. **Usage**  
   - Calls `load_data()` to populate `df` with the full dataset.  
   - Displays the first few rows of `df` using `df.head()`.  

> **Purpose:** Pull the entire smart-meter readings table into your notebook for analysis.

In [None]:
import pandas as pd
import requests

SUPABASE_URL     = "https://qpnzblvhwgmzorcdduuy.supabase.co"
SUPABASE_API_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6InFwbnpibHZod2dtem9yY2RkdXV5Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NTEyNTcyNDEsImV4cCI6MjA2NjgzMzI0MX0._q7_v3XX-_tqKiFRI4KDy4e7IX5GIkDwqPSlU78FQCg"
TABLE            = "smart_meter_readings"

def load_data():
    url = (
        f"{SUPABASE_URL}/rest/v1/{TABLE}"
        "?select=*"               # fetch all columns
        "&order=timestamp.asc"    # order by timestamp ascending
    )
    headers = {
        "apikey": SUPABASE_API_KEY,
        "Authorization": f"Bearer {SUPABASE_API_KEY}"
    }

    res = requests.get(url, headers=headers, timeout=30)  # avoid hanging forever on a stalled connection
    if res.status_code == 200:
        records = res.json()  # parse the JSON body once and reuse it
        print("✅ Data pulled successfully: ", len(records), "records\n")
        return pd.DataFrame(records)
    else:
        raise Exception(f"❌ Error: {res.status_code}\n{res.text}")

# usage
df = load_data()
df.head()
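
`load_data()` issues a single request. PostgREST, which backs the Supabase REST endpoint, accepts `limit` and `offset` query parameters, and a project may cap the number of rows returned per request, so a larger table could need paging. The sketch below shows one possible paging loop. `fetch_all_rows`, `fetch_page`, and `page_size` are hypothetical names, and the page fetcher is injected so the loop can be run without network access.

```python
from typing import Callable, Dict, List


def fetch_all_rows(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 1000,
) -> List[Dict]:
    """Collect rows page by page until a short (or empty) page is returned.

    fetch_page(offset, limit) is a hypothetical callable that returns one
    page of records, e.g. by appending "&limit=...&offset=..." to the URL.
    """
    rows: List[Dict] = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        rows.extend(page)
        if len(page) < page_size:  # a short page means the table is exhausted
            break
        offset += page_size
    return rows


# Offline stand-in for the HTTP call: 2,500 fake records served in slices.
_fake_table = [{"id": i} for i in range(2500)]


def fake_fetch_page(offset: int, limit: int) -> List[Dict]:
    return _fake_table[offset:offset + limit]


all_rows = fetch_all_rows(fake_fetch_page, page_size=1000)
print(len(all_rows))
```

In the notebook, `fetch_page` could wrap the existing `requests.get` call with `&limit={limit}&offset={offset}` appended to the URL, and the combined list could then be passed to `pd.DataFrame`.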


#### Data Inspection

In [None]:
df.shape

There are 668 rows and 13 columns in the dataset.

In [None]:
# checking out the column names

for col in df.columns:
    print(col)

In [None]:
# Let's look at the data type of each feature
df.info()

### Data Cleaning

In [None]:
# Step 1: Convert the 'timestamp' column from Unix seconds to a readable datetime format and store it in a new 'datetime' column

df['datetime'] = pd.to_datetime(df['timestamp'], unit='s')

In [None]:
# Step 2: Extract date and ISO week number and store in new columns

df['date'] = df['datetime'].dt.date
df['week'] = df['datetime'].dt.isocalendar().week

In [None]:
# Step 3: Convert date to datetime, week to integer

df['date'] = pd.to_datetime(df['date'])
df['week'] = df['week'].astype(int)
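
As a quick check of Steps 1 to 3, the sketch below converts a few Unix-second values and derives the same `date` and `week` columns. The sample timestamps are made up for illustration and are not taken from the dataset. It uses `.dt.normalize()` as a one-step alternative to `.dt.date` followed by `pd.to_datetime`.

```python
import pandas as pd

# Hypothetical Unix-second timestamps (not from the dataset)
sample = pd.DataFrame({"timestamp": [1751328000, 1751329800, 1751932800]})

# Step 1: Unix seconds -> datetime
sample["datetime"] = pd.to_datetime(sample["timestamp"], unit="s")

# Steps 2 and 3: midnight of the same day (still datetime64) and integer ISO week
sample["date"] = sample["datetime"].dt.normalize()
sample["week"] = sample["datetime"].dt.isocalendar().week.astype(int)

print(sample[["datetime", "date", "week"]])
```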

In [None]:
df.info()

In [None]:
df.head(5)

In [None]:
# Calculate the date range and the number of days and unique ISO weeks in the dataset
min_date = df['date'].min()
max_date = df['date'].max()
days     = (max_date - min_date).days + 1
weeks    = df['week'].nunique()

print(f"Dataset covers {days} days, from {min_date} to {max_date} → {weeks} ISO weeks\n")


In [None]:
# Count the number of readings for each date and display them sorted by date
daily_counts = df['date'].value_counts().sort_index()
print("Readings per day:\n", daily_counts)

In [None]:
# Step 4: Check for nulls and duplicates
print("Nulls per column:\n", df.isnull().sum(), "\n")
print("Duplicates:", df.duplicated().sum())

There are no null or duplicate values in the dataset.

In [None]:
# Step 5: Drop irrelevant columns
df = df.drop(columns=['id', 'timestamp'])

In [None]:
# Move 'datetime' to the index, then sort by it so the rows are in chronological order for time series analysis
df = df.set_index('datetime').sort_index()

In [None]:
df.head()

In [None]:
# Truncate (floor) all datetime index values to whole seconds
df.index = df.index.floor('s')  # lowercase 's'; the uppercase 'S' alias is deprecated in recent pandas


In [None]:
df.head()

### Data Visualization and Analysis

#### Univariate Distribution

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [None]:
# Create histograms for all numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols].hist(figsize=(15, 20), bins=30, edgecolor='black', layout=(len(numeric_cols)//2 + 1, 2))

# Get the current figure and axes
fig = plt.gcf()
axes = fig.get_axes()

# Loop through each numeric column and corresponding subplot
for ax, col in zip(axes, numeric_cols):
    data = df[col].dropna()
    mean = data.mean()
    median = data.median()
    mode = data.mode().iloc[0] if not data.mode().empty else np.nan
    ax.axvline(mean, color='red', linestyle='--', linewidth=2, label='Mean')
    ax.axvline(median, color='green', linestyle='-', linewidth=2, label='Median')
    ax.axvline(mode, color='orange', linestyle='-', linewidth=2, label='Mode')
    ax.set_xlabel(col, fontsize=10)
    ax.set_ylabel('Count', fontsize=10)
    ax.legend()

plt.tight_layout()
plt.suptitle("Distributions of the Numeric Data", fontsize=16, y=1.02)
plt.show()

#### Observations and Insights

## 1. **meter_id**
- Range: 1000 to 1100, Mean: 1052, Median: 1052, Mode: 1083. All meter_ids have roughly equal counts, with no dominant meter.

- **What this means:**
The dataset covers a well-balanced spread of meter_ids, which reduces the risk of bias from overrepresented meters. This uniformity supports generalizability of any model or insight across all monitored units. In real-world terms, it suggests that each household or site is sampled at similar frequency, preventing location-based skew in energy analytics.

## 2. **power_consumption_kwh**
- Range: 0.1 to 5 kWh per 30 minutes, Mean: ~1.35 kWh, Median: ~0.67 kWh, Mode: ~0.13 kWh. About 60% of readings are below 1 kWh; the distribution is strongly right-skewed.

- **What this means:**
Most intervals have low consumption, but there are occasional intervals or users with much higher usage, which drive up the mean. Most of the time, usage is low or moderate, but there are spikes, possibly due to specific high-demand activities (e.g., EV charging, industrial equipment, HVAC use).

## 3. **voltage**
- Range: 224 V to 231 V, Mean: ~228.6 V, Median: ~229 V, Mode: 229 V.
About 90% of all readings fall between 227 V and 231 V.

- **What this means:**
The voltage distribution is tight and nearly normal, with most readings very close to the nominal supply voltage. This indicates a stable and reliable grid supply. Occasional lower values could reflect brief voltage dips or simulation artifacts, but overall there is little risk of power quality issues affecting energy consumption analysis.

## 4. **current**
- Range: 0.35 A to 22.2 A, Mean: ~5.94 A, Median: ~2.92 A, Mode: ~1.08 A.
Over 70% of readings are below 5 A; distribution is right-skewed.

- **What this means:**
Most of the time, the measured current is low, but there are infrequent but substantial spikes. These spikes likely correspond to the use of high-power appliances or charging devices. This distribution is common in household and commercial environments, where baseline consumption is low but punctuated by occasional heavy draws.

## 5. **temperature_c**
- Range: 6.6°C to 32.2°C, Mean: ~19.4°C, Median: ~19.7°C, Mode: 10.5°C.
~80% of readings are between 10°C and 28°C, with mild bimodality.

- **What this means:**
The temperature readings cover a broad but realistic range, possibly reflecting data from different times of day or simulated regions. The slight bimodal shape may indicate two climatic patterns or artificial variation.

## 6. **humidity_pct**
- Range: 52.3% to 87.7%, Mean: ~69.9%, Median: ~69.8%, Mode: 82.1%.
~85% of readings are between 55% and 85%; broad and slightly multi-modal distribution.

- **What this means:**
The wide range and fairly even spread of humidity values suggest the data includes both humid and relatively dry intervals. This variation can be useful for correlating with cooling or heating demand, though the lack of strong peaks implies that humidity alone is unlikely to drive major changes in energy usage in this sample. The presence of multiple small peaks may reflect different regions or microclimates within your dataset.

## 7. **hour_of_day**
- Range: 0 to 23, Mean: ~11.5. Each hour is almost equally represented, with ~28 readings per hour.

- **What this means:**
The dataset is very well-balanced across the 24-hour cycle, ensuring robust coverage of daily consumption patterns. This allows for reliable analysis of hourly and diurnal trends, with no bias toward particular times of day.

## 8. **week**
- Metrics: Observed weeks: 27, 28, 29.
Most data comes from week 28, followed by week 29 and then week 27.

- **What this means:**
Data collection spans three distinct weeks, but the sample size varies between weeks. Most observations fall in week 28, which could skew weekly analysis unless balanced with resampling or weighted metrics. The presence of multiple weeks supports investigation into week-to-week variation.
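
The figures quoted above can also be tabulated rather than read off the histograms. Below is a minimal sketch that uses synthetic right-skewed data in place of `df`; the column name is reused only for illustration, and `summarize` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for a right-skewed column such as power_consumption_kwh
demo = pd.DataFrame(
    {"power_consumption_kwh": rng.lognormal(mean=-0.4, sigma=0.9, size=500)}
)


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return min, max, mean, median, and skewness for each numeric column."""
    numeric = frame.select_dtypes(include=[np.number])
    return pd.DataFrame({
        "min": numeric.min(),
        "max": numeric.max(),
        "mean": numeric.mean(),
        "median": numeric.median(),
        "skew": numeric.skew(),  # > 0 indicates a right-skewed distribution
    })


summary = summarize(demo)
print(summary.round(2))
```

A mean well above the median together with a positive skew is the numeric counterpart of the long right tail described for `power_consumption_kwh` and `current`.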

In [None]:
# Function to annotate each bar of a count plot with its percentage share.
def bar_perc(plot, feature):
    total = len(feature)  # number of rows in the column
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # share of this category
        x = p.get_x() + p.get_width() / 2 - 0.05  # horizontal position: near the bar's center
        y = p.get_y() + p.get_height()            # vertical position: top of the bar
        plot.annotate(percentage, (x, y), size=12)  # write the percentage at the top of the bar

In [None]:
# Get all categorical (object and boolean) columns
list_col = df.select_dtypes(['object', 'bool']).columns

fig1, axes1 = plt.subplots(2, 2, figsize=(11, 9))
axes1 = axes1.flatten()
for i, col in enumerate(list_col):
    order = df[col].value_counts(ascending=False).index  # most frequent category first
    sns.countplot(x=col, data=df, order=order, ax=axes1[i], palette='coolwarm')
    axes1[i].set_title(col)
    bar_perc(axes1[i], df[col])

# Hide any unused subplots
for j in range(len(list_col), len(axes1)):
    fig1.delaxes(axes1[j])

plt.tight_layout()
plt.show()



### Categorical Variable Distributions

#### 1. Region
**Metrics:** South: 31.9% | East: 24.4% | West: 24.4% | North: 19.3%

**What this means:**  
The data is fairly well-distributed across regions, but the **South** is most represented (about a third of the data), while the **North** is least. This could reflect larger population, more meters, or more active data collection in the South.


#### 2. Property Type
**Metrics:** Commercial: 50.0% | Residential: 50.0%

**What this means:**  
The property types are **perfectly balanced**. This is ideal for comparative analysis—there’s no bias in property type representation.


#### 3. EV Owner
**Metrics:** True: 50.3% | False: 49.7%

**What this means:**  
**EV ownership** is almost perfectly balanced in the dataset, which is excellent for analyzing how EV presence affects power consumption. This prevents the model from being biased toward one group and strengthens statistical power for group comparisons.


#### 4. Solar Installed
**Metrics:** False: 67.4% | True: 32.6%

**What this means:**  
A majority of the sampled properties **do not have solar panels** (about two-thirds), while about a third do. This means that “solar” is a minority and that statistical findings for that group may be less robust, especially for rare events or behaviors.
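
The percentages in this section can be reproduced as a table with `value_counts(normalize=True)`. This is a sketch on a small made-up frame; in the notebook the cleaned `df` would be passed instead, and `category_shares` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be the cleaned `df`
demo = pd.DataFrame({
    "region": ["South", "South", "East", "West", "North", "South", "East", "West"],
    "solar_installed": [False, False, True, False, True, False, False, True],
})


def category_shares(frame, columns):
    """Return the percentage share of each category, per column."""
    return {
        col: frame[col].value_counts(normalize=True).mul(100).round(1)
        for col in columns
    }


shares = category_shares(demo, ["region", "solar_installed"])
for col, pct in shares.items():
    print(f"{col}:\n{pct}\n")
```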


### Bivariate Distribution

In [None]:
# Power Consumption by Region, Property Type, EV Owner and Solar Installed

fig, axes = plt.subplots(2, 2, figsize=(10, 7))

# 1. Region
sns.boxplot(x='region', y='power_consumption_kwh', data=df, ax=axes[0,0])
axes[0,0].set_title("Power Consumption by Region")

# 2. Property Type
sns.boxplot(x='property_type', y='power_consumption_kwh', data=df, ax=axes[0,1])
axes[0,1].set_title("Power Consumption by Property_Type")

# 3. EV Owner
sns.boxplot(x='ev_owner', y='power_consumption_kwh', data=df, ax=axes[1,0])
axes[1,0].set_title("Power Consumption by EV_Owner")

# 4. Solar Installed
sns.boxplot(x='solar_installed', y='power_consumption_kwh', data=df, ax=axes[1,1])
axes[1,1].set_title("Power Consumption by Solar_Installed")

plt.tight_layout()
plt.show()

### Boxplot Observations: Power Consumption



1. **Region**
- **West & East:** Highest medians (~1.6 kWh), largest upper whiskers (peaks to ~5 kWh); likely heavier evening loads or more frequent use of large appliances.
- **South:** Similar pattern, but slightly fewer extreme highs (upper limit ~4.5 kWh).
- **North:** Lower median (~1.3 kWh), tighter IQR—overall lighter, more consistent usage.
- **Takeaway:** West and East draw the most power per interval, North the least.



2. **Property Type**
- **Residential:** Higher median (~1.1 kWh) and wider spread; more diverse and generally higher energy use.
- **Commercial:** Lower median (~0.7 kWh), narrower spread; fewer outliers and less variability.
- **Takeaway:** Residential properties are the main drivers of higher and more variable consumption.



3. **EV Owner**
- **Both groups:** Nearly identical distributions—medians (~0.9 kWh), spreads, and maximum values all similar.
- **Takeaway:** EV ownership has minimal impact on total interval power consumption in this dataset.



4. **Solar Installed**
- **Both groups:** Medians and distributions are almost the same (medians ~0.9 kWh); no noticeable impact from solar installation.
- **Takeaway:** Solar panel presence does not significantly alter recorded power consumption patterns.



**General Remark:**  
Most variation in interval power use comes from property type and region, not EV ownership or solar installation.
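
The medians and spreads read off the boxplots can be computed directly, which avoids eyeballing values from the figure. The sketch below runs on made-up readings; `box_stats` is a hypothetical helper, and in the notebook it would be called on `df` with `'region'`, `'property_type'`, `'ev_owner'`, or `'solar_installed'`.

```python
import pandas as pd

# Hypothetical readings; in the notebook the cleaned `df` would be used instead
demo = pd.DataFrame({
    "property_type": ["residential"] * 5 + ["commercial"] * 5,
    "power_consumption_kwh": [0.5, 1.0, 1.5, 2.0, 4.0, 0.2, 0.4, 0.6, 0.8, 3.0],
})


def box_stats(frame, group_col, value_col="power_consumption_kwh"):
    """Median, quartiles, and IQR of value_col for each level of group_col."""
    stats = frame.groupby(group_col)[value_col].quantile([0.25, 0.5, 0.75]).unstack()
    stats.columns = ["q1", "median", "q3"]
    stats["iqr"] = stats["q3"] - stats["q1"]
    return stats


stats = box_stats(demo, "property_type")
print(stats)
```

By default, seaborn draws whiskers out to 1.5 times the IQR beyond the quartiles, so the `iqr` column also indicates where points start being drawn as outliers.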


In [None]:
# Daily & Weekly Time‐Series Patterns

# Resample to daily mean consumption
daily = df['power_consumption_kwh'].resample('D').mean()
plt.figure(figsize=(10,3))
daily.plot()
plt.title("Daily Average Power Consumption")
plt.ylabel("kWh")
plt.show()

# Hour‐of‐day profile (averaged across all days)
hourly = df.groupby('hour_of_day')['power_consumption_kwh'].mean()
plt.figure(figsize=(8,3))
hourly.plot(marker='o')
plt.title("Average Consumption by Hour of Day")
plt.xlabel("Hour (0–23)")
plt.ylabel("kWh")
plt.xticks(range(0,24,2))
plt.show()

In [None]:
"""
This code produces two insightful time series visualizations of energy usage:

1. Daily Average Power Consumption:
   - Calculates the mean power consumption for each calendar day in the dataset.
   - Plots these daily averages on a line graph, with each day labeled as "Day Abbrev + Date" (e.g., "Mon 05 Jul").
   - Helps to identify trends, peaks, or dips in energy use from day to day.

2. Hourly Average Consumption Pattern:
   - Calculates the average power consumption for each hour of the day (across all days).
   - Plots these hourly averages on a line graph, with each hour labeled as "HH:00".
   - Reveals the typical daily cycle of power use, highlighting which hours are highest or lowest on average.

Together, these plots provide a clear picture of both overall daily trends and the recurring daily pattern of energy consumption in the dataset, which are valuable for understanding seasonality, forecasting, and detecting unusual usage patterns.
"""


import matplotlib.pyplot as plt

# 1. Compute daily mean
daily = df['power_consumption_kwh'].resample('D').mean()

# 2. Create tick labels as “Day Abbrev + DD Mon”
day_labels = daily.index.strftime('%a %d %b')
#    e.g. “Mon 05 Jul”, “Tue 06 Jul”, …

# 3. Plot
plt.figure(figsize=(10,3))
plt.plot(daily.index, daily.values, marker='o')
plt.title("Daily Average Power Consumption")
plt.ylabel("kWh")

# 4. Replace the x-axis ticks with our day labels
plt.xticks(daily.index, day_labels, rotation=45, ha='right')
plt.tight_layout()
plt.show()


# Plot 2: Hourly Average with “HH:MM” Labels

# 1. Compute hourly mean (0–23)
hourly = df.groupby('hour_of_day')['power_consumption_kwh'].mean()
hours  = hourly.index

# 2. Create time‐of‐day labels “00:00”, “01:00”, …, “23:00”
time_labels = [f"{h:02d}:00" for h in hours]

# 3. Plot
plt.figure(figsize=(10,3))
plt.plot(hours, hourly.values, marker='o')
plt.title("Average Consumption by Time of Day")
plt.xlabel("Time of Day")
plt.ylabel("kWh")

# 4. Replace x-ticks with “HH:MM”
plt.xticks(hours, time_labels, rotation=45, ha='right')
plt.tight_layout()
plt.show()

### Observations: Daily Average Power Consumption

- **Fluctuations:** Daily average power consumption shows noticeable fluctuations from one day to the next, with several clear peaks and dips across the observed period.
- **Mid-week Peaks:** Consumption tends to rise early in each week, peaking around Tuesday or Wednesday, and again toward the end of the period (notably on Fridays).
- **Dips:** Lower usage is seen around weekends (e.g., July 5th–6th, July 12th–13th), which may reflect fewer occupants at home, travel, or reduced business activity typical of weekends.
- **Real-life Context:** These patterns align with typical workweek routines, where weekdays (especially midweek) often see higher overall activity—such as office operations, school runs, and more time spent at home in the evenings—leading to increased energy use. Weekend dips could be due to residents being away, less business operation, or generally more outdoor and leisure activities.
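
The weekday versus weekend contrast described above can be quantified by grouping the daily means on a weekend flag. The sketch below uses a synthetic half-hourly series in which weekends are lower by construction, so it only illustrates the mechanics, not the dataset's actual values.

```python
import numpy as np
import pandas as pd

# Synthetic half-hourly series over two weeks (not the real readings)
idx = pd.date_range("2025-06-30", periods=14 * 48, freq="30min")
rng = np.random.default_rng(1)
values = rng.uniform(0.5, 1.5, size=len(idx))
values[idx.dayofweek >= 5] *= 0.5  # make weekends lower by construction
series = pd.Series(values, index=idx, name="power_consumption_kwh")

# Daily means, then a weekday/weekend label for each day
daily = series.resample("D").mean()
is_weekend = daily.index.dayofweek >= 5  # Monday=0, ..., Saturday=5, Sunday=6
labels = np.where(is_weekend, "weekend", "weekday")

comparison = daily.groupby(labels).agg(["mean", "count"])
print(comparison)
```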

### Observations: Average Consumption by Time of Day

- **Morning Ramp-up:** Power consumption is lowest in the early hours (midnight to 6 AM), then starts to rise around 7 AM—likely as people wake up, begin using appliances, and start their daily routines.
- **Daytime Peaks:** The sharpest increase occurs from 9 AM to 3 PM, with the highest average consumption between 2 PM and 4 PM. This midday to afternoon peak may reflect increased use of air conditioning or cooling (especially if these days are warm), cooking, or higher occupancy of homes and commercial spaces.
- **Evening Decline:** After 4 PM, consumption steadily drops through the evening, re


In [None]:
# Daily Average by Property Type

import matplotlib.pyplot as plt

# 1. Compute daily mean consumption for each type
daily_type = (
    df
    .groupby([pd.Grouper(freq='D'), 'property_type'])['power_consumption_kwh']
    .mean()
    .unstack()  # columns = residential, commercial
)

# 2. Generate readable x-labels (weekday + date)
day_labels = daily_type.index.strftime('%a %d %b')

# 3. Plot on two subplots
fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)

for ax, ptype in zip(axes, ['residential', 'commercial']):
    ax.plot(daily_type.index, daily_type[ptype], marker='o')
    ax.set_title(f"Daily Average Power Consumption: {ptype.capitalize()}")
    ax.set_ylabel("kWh")
    ax.grid(True)

# 4. Customize x-ticks on bottom subplot
axes[-1].set_xticks(daily_type.index)
axes[-1].set_xticklabels(day_labels, rotation=45, ha='right')
axes[-1].set_xlabel("Day of Week")

plt.tight_layout()
plt.show()


# Cell: Hourly Average by Property Type

# 1. Compute hourly mean consumption for each type
hourly_type = (
    df
    .groupby(['hour_of_day', 'property_type'])['power_consumption_kwh']
    .mean()
    .unstack()  # columns = residential, commercial
)

# 2. Generate time-of-day labels
time_labels = [f"{h:02d}:00" for h in hourly_type.index]

# 3. Plot both lines on one figure
plt.figure(figsize=(10, 4))
for ptype in ['residential', 'commercial']:
    plt.plot(hourly_type.index, hourly_type[ptype], marker='o', label=ptype.capitalize())

plt.title("Average Consumption by Hour of Day")
plt.xlabel("Time of Day")
plt.ylabel("kWh")
plt.xticks(hourly_type.index, time_labels, rotation=45, ha='right')
plt.legend(title="Property Type")
plt.grid(True)
plt.tight_layout()
plt.show()


### Daily Average Power Consumption: Residential

- Residential daily averages (mean kWh per reading, not total kWh per day) mostly fall between **1.0 and 2.0 kWh**, with a gradual decline over the period.
- The highest usage is at the beginning (above 2.0 kWh), with notable drops around the middle and end (down to 0.9–1.1 kWh).
- **Interpretation:** This mild downward trend could reflect seasonal shifts, changing occupancy, or energy-saving behaviors in homes.

### Daily Average Power Consumption: Commercial

- Commercial properties show **greater day-to-day variability**, ranging from about **0.3 kWh up to 2.2 kWh**.
- Usage is lowest on weekends (0.2–0.3 kWh), with pronounced peaks on weekdays, especially toward the end of the period.
- **Interpretation:** This pattern reflects business activity—commercial buildings use much less power on weekends, ramping up during workdays. Late spikes may coincide with end-of-week operations or special events.

### Average Consumption by Hour of Day (Property Type Comparison)

- **Residential:** Consumption rises sharply after 6 AM, peaking between **7–9 AM** and staying elevated (about **1.5–2.0 kWh**) through the day, with another smaller peak in the evening (7–10 PM). Lowest values occur overnight and early morning (around 0.5 kWh).
- **Commercial:** Nearly flat overnight (0.2–0.3 kWh), then increases rapidly after 8 AM, peaking at **3.0–3.5 kWh** from **1 PM to 4 PM**, and drops sharply after 5 PM.
- **Interpretation:** Residential usage reflects morning and evening routines, while commercial usage is highly concentrated during business hours. The pronounced midday and afternoon peak for commercial is typical of active workday operations.

**Takeaway:**  
Residential and commercial properties display distinct energy patterns. Residential use is more consistent throughout the day, while commercial use is highly focused during business hours and drops to minimal levels outside of them—highlighting how building function and human routines shape energy demand.
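
Peak hours and peak levels like those quoted above can be extracted from the unstacked hourly table with `idxmax` and `max`. The numbers in this sketch are placeholders chosen for illustration; in the notebook `hourly_type` would be used directly.

```python
import pandas as pd

# Hypothetical hourly means; in the notebook `hourly_type` would be used directly
hourly_demo = pd.DataFrame(
    {
        "residential": [0.5, 1.8, 1.6, 1.7],
        "commercial": [0.3, 0.9, 3.2, 0.4],
    },
    index=pd.Index([3, 8, 14, 20], name="hour_of_day"),
)

peaks = pd.DataFrame({
    "peak_hour": hourly_demo.idxmax(),  # hour with the highest mean
    "peak_kwh": hourly_demo.max(),      # mean consumption at that hour
    "peak_to_min_ratio": hourly_demo.max() / hourly_demo.min(),
})
print(peaks)
```

The peak-to-minimum ratio gives a single number for how concentrated each profile is, which is one way to express the "consistent" versus "focused during business hours" contrast.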


In [None]:
# Daily average by EV owner
daily_ev = (
    df
    .groupby([pd.Grouper(freq='D'), 'ev_owner'])['power_consumption_kwh']
    .mean()
    .unstack()  # columns = True, False
)
day_labels = daily_ev.index.strftime('%a %d %b')

fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
for ax, owner in zip(axes, [True, False]):
    ax.plot(daily_ev.index, daily_ev[owner], marker='o')
    label = "EV Owner" if owner else "No EV"
    ax.set_title(f"Daily Average Power Consumption: {label}")
    ax.set_ylabel("kWh")
    ax.grid(True)
axes[-1].set_xticks(daily_ev.index)
axes[-1].set_xticklabels(day_labels, rotation=45, ha='right')
axes[-1].set_xlabel("Day of Week")
plt.tight_layout()
plt.show()

# Hourly average by EV owner
hourly_ev = (
    df
    .groupby(['hour_of_day', 'ev_owner'])['power_consumption_kwh']
    .mean()
    .unstack()  # columns = True, False
)
time_labels = [f"{h:02d}:00" for h in hourly_ev.index]

plt.figure(figsize=(10, 4))
for owner in [True, False]:
    label = "EV Owner" if owner else "No EV"
    plt.plot(hourly_ev.index, hourly_ev[owner], marker='o', label=label)
plt.title("Average Consumption by Hour of Day (EV Owner)")
plt.xlabel("Time of Day")
plt.ylabel("kWh")
plt.xticks(hourly_ev.index, time_labels, rotation=45, ha='right')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


### Daily Average Power Consumption: EV Owner

- **EV Owners:** Daily averages typically range from **0.4 to 1.6 kWh**, with noticeable increases mid-period and higher usage on several days.
- **No EV:** Daily averages generally fluctuate between **0.9 and 2.0 kWh**, with the highest peaks toward the end of the observed period.
- **Interpretation:** Properties without an EV show higher and more variable daily usage, while EV owner usage is steadier and somewhat lower. This may reflect that not all EV owners charge their vehicles daily, or that their overall usage is more stable. Alternatively, it could suggest other non-EV factors are driving energy demand for the "No EV" group.

### Average Consumption by Hour of Day (EV Owner Comparison)

- **Both Groups:** Follow similar daily cycles—rising in the morning, peaking midday to afternoon (**up to 2.8 kWh** for EV owners), then declining into the night.
- **EV Owners:** Show slightly higher peaks in late morning and mid-afternoon (by **0.2–0.3 kWh**), but similar or even lower values in the evening and late night.
- **Interpretation:** The close alignment suggests EV charging does not dominate household energy usage patterns. Small midday differences could reflect opportunistic daytime charging, but overall, both groups' routines drive the main pattern.

**Takeaway:**  
**EV ownership does not result in consistently higher energy use**. In fact, "No EV" homes show higher peaks on some days, possibly due to other lifestyle or property differences. Hourly patterns for both groups are nearly identical, reinforcing that EV charging is either modest, infrequent, or scheduled in a way that blends with regular household usage.
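
Whether a gap between the two groups is larger than chance would produce can be checked with a permutation test on the difference in means. The sketch below is one possible check, not part of the original analysis. It uses synthetic data, `permutation_test` is a hypothetical helper, and it treats readings as independent, which repeated readings from the same meter may not be.

```python
import numpy as np


def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means between a and b."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)  # random reassignment of group labels
        diff = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add-one correction keeps the p-value strictly above zero
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic groups drawn from the same distribution (no real effect)
rng = np.random.default_rng(42)
ev = rng.lognormal(mean=-0.4, sigma=0.9, size=300)
no_ev = rng.lognormal(mean=-0.4, sigma=0.9, size=300)

diff, p = permutation_test(ev, no_ev)
print(f"difference in means: {diff:.3f} kWh, p-value: {p:.3f}")
```

In the notebook, the two inputs would be `df.loc[df['ev_owner'], 'power_consumption_kwh']` and `df.loc[~df['ev_owner'], 'power_consumption_kwh']`.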


In [None]:
# Daily average by solar installed
daily_solar = (
    df
    .groupby([pd.Grouper(freq='D'), 'solar_installed'])['power_consumption_kwh']
    .mean()
    .unstack()  # columns = True, False
)
day_labels = daily_solar.index.strftime('%a %d %b')

fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
for ax, solar in zip(axes, [True, False]):
    ax.plot(daily_solar.index, daily_solar[solar], marker='o')
    label = "Solar Installed" if solar else "No Solar"
    ax.set_title(f"Daily Average Power Consumption: {label}")
    ax.set_ylabel("kWh")
    ax.grid(True)
axes[-1].set_xticks(daily_solar.index)
axes[-1].set_xticklabels(day_labels, rotation=45, ha='right')
axes[-1].set_xlabel("Day of Week")
plt.tight_layout()
plt.show()

# Hourly average by solar installed
hourly_solar = (
    df
    .groupby(['hour_of_day', 'solar_installed'])['power_consumption_kwh']
    .mean()
    .unstack()  # columns = True, False
)
time_labels = [f"{h:02d}:00" for h in hourly_solar.index]

plt.figure(figsize=(10, 4))
for solar in [True, False]:
    label = "Solar Installed" if solar else "No Solar"
    plt.plot(hourly_solar.index, hourly_solar[solar], marker='o', label=label)
plt.title("Average Consumption by Hour of Day (Solar Installed)")
plt.xlabel("Time of Day")
plt.ylabel("kWh")
plt.xticks(hourly_solar.index, time_labels, rotation=45, ha='right')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


### Daily Average Power Consumption: Solar Installed

- **Solar Installed:** Daily consumption ranges from **0.4 up to 2.5 kWh**, with a noticeable spike early in the period and some days of moderate usage thereafter.
- **No Solar:** Energy use is more consistent, generally between **1.0 and 2.0 kWh**, with smaller peaks and a gradual increase toward the end of the period.
- **Interpretation:** Properties with solar panels can still show high daily consumption—possibly on days with less sunlight, more appliance use, or when solar does not offset all grid needs. The steadier pattern in “No Solar” homes suggests more routine electricity use, possibly without the benefit (or variability) of generating some of their own power.

### Average Consumption by Hour of Day (Solar Installed Comparison)

- **Both Groups:** Share similar hourly patterns—consumption climbs through the morning, peaks in the early to late afternoon (**2.5–2.8 kWh**), and declines in the evening.
- **Solar Installed:** Sometimes shows slightly higher midday consumption, but the overall curve closely matches that of homes without solar.
- **Interpretation:** The fact that solar homes don’t always have lower grid energy use at peak sunlight hours could mean either high demand during those hours, solar generation not fully covering needs, or grid consumption being measured before solar offsets are counted.

**Takeaway:**  
**Having solar panels does not automatically guarantee lower average daily or hourly grid energy use** in this dataset. Both groups show similar usage trends, likely reflecting a mix of weather, occupant behavior, and possibly the way energy is measured (for example, total use rather than use net of solar generation). This highlights that simply installing solar does not always lead to major drops in visible power consumption; actual savings depend on timing, usage, and how the data is tracked.


### Correlation Matrix

In [None]:
# Compute correlations
corr = df[['power_consumption_kwh','voltage','current','temperature_c','humidity_pct']].corr()

# Plot
plt.figure(figsize=(6,5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature Correlation Matrix")
plt.show()

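- **Power consumption** is perfectly positively correlated with **current** (**1.00**) and strongly negatively correlated with **voltage** (**-0.92**). A coefficient of 1.00 suggests that current is essentially a rescaling of power (for a near-constant supply voltage, I ≈ P / V), so it adds no independent information and would leak the target if used as a predictor. The negative link with voltage is consistent with voltage sagging slightly under heavier load.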
- **Humidity** shows a mild positive correlation with power consumption (**0.29**), while **temperature** has a slight negative correlation (**-0.26**), indicating that weather factors have only a modest influence on energy use.
- Among other features, **temperature and humidity** have a very strong negative correlation (**-0.89**), reflecting the typical climate relationship where higher temperatures often coincide with lower humidity.
- **Voltage and current** are also strongly negatively correlated (**-0.92**), highlighting their inverse relationship in the dataset.
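
A correlation of 1.00 between power and current is what one would see if current were computed from power and voltage. The sketch below illustrates this with synthetic data, assuming the relation `current = 1000 * power_kw / voltage` (single phase, unity power factor) and a small voltage sag under load. Whether the dataset was generated this way is an assumption, not something the source states.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1000

power_kw = rng.lognormal(mean=-0.4, sigma=0.9, size=n)    # synthetic load
voltage = 230.0 - 1.0 * power_kw + rng.normal(0, 0.3, n)  # assumed sag under load
current = 1000.0 * power_kw / voltage                     # I = P / V (assumption)

demo = pd.DataFrame({"power": power_kw, "voltage": voltage, "current": current})
corr = demo.corr()
print(corr.round(2))
```

If the real data follows a similar relation, `current` carries the same information as the target, and dropping it from the feature set before modelling would be a reasonable precaution.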


### Modelling
#### Experimenting With Time Series Models: PROPHET and SARIMAX

In this section, we explore two powerful time series forecasting models:

- **Prophet**: Developed by Facebook, Prophet is an open-source forecasting tool designed for time series data with strong seasonal effects and historical trends. It's robust to missing data, handles outliers well, and supports custom seasonality and holidays, making it ideal for business and economic forecasting.

- **SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors)**: SARIMAX is a statistical model for forecasting time series that may contain trend and seasonality. Its **AR (AutoRegressive)**, **I (Integrated)**, and **MA (Moving Average)** components come from ARIMA and model dependence on past values, differencing to remove trend, and dependence on past forecast errors. SARIMAX extends ARIMA with:
  - **Seasonality**: Patterns that repeat at a fixed period (e.g., daily or weekly cycles)
  - **Exogenous variables (X)**: External predictors that can improve forecast accuracy
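
For reference, one common way to write the model is as a regression with SARIMA errors. The symbols below ($L$ for the lag operator, $s$ for the seasonal period, $x_t$ for the exogenous regressors, $p, d, q$ and $P, D, Q$ for the non-seasonal and seasonal orders) are notation introduced here, not taken from the notebook:

```latex
\begin{aligned}
y_t &= \beta^\top x_t + u_t,\\
\phi_p(L)\,\Phi_P(L^s)\,(1-L)^d\,(1-L^s)^D\,u_t &= \theta_q(L)\,\Theta_Q(L^s)\,\varepsilon_t,
\qquad \varepsilon_t \sim \mathcal{N}(0,\sigma^2).
\end{aligned}
```

Here $\phi_p$ and $\theta_q$ are the non-seasonal AR and MA polynomials, $\Phi_P$ and $\Theta_Q$ are their seasonal counterparts, and the two difference terms correspond to the "I" component.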

We will use these models to evaluate and compare their performance on our dataset.
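
Before fitting either model, the readings need to be on a regular time grid and split chronologically. The sketch below shows one possible preparation step on synthetic data. The hourly aggregation and the 48-hour holdout are illustrative choices rather than the notebook's settings; the `ds` and `y` column names are what Prophet expects as input.

```python
import numpy as np
import pandas as pd

# Synthetic half-hourly readings indexed by datetime (stand-in for the cleaned `df`)
idx = pd.date_range("2025-06-30", periods=10 * 48, freq="30min")
rng = np.random.default_rng(3)
demo = pd.DataFrame({"power_consumption_kwh": rng.uniform(0.1, 5.0, len(idx))}, index=idx)

# 1. Aggregate to a regular hourly series
hourly = demo["power_consumption_kwh"].resample("60min").mean()

# 2. Prophet expects a two-column frame named `ds` (timestamp) and `y` (target)
prophet_df = hourly.rename("y").rename_axis("ds").reset_index()

# 3. Chronological split: the last 48 hours are held out for evaluation
horizon = 48  # illustrative choice
train, test = prophet_df.iloc[:-horizon], prophet_df.iloc[-horizon:]
print(train.shape, test.shape)
```

`train` could then be passed to `Prophet().fit(train)`, and the `hourly` series could serve as the endogenous input to `statsmodels`' `SARIMAX`. Neither library is in the install cell at the top of the notebook, so both would need to be installed first. Splitting by time rather than at random keeps future observations out of the training set.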