### ENERGY REAL-TIME FORECASTING PROJECT

In [None]:
# Install required packages in Colab
!pip install pandas matplotlib seaborn scikit-learn xgboost requests

#### Data Loading from Supabase

1. **Imports**  
   - `pandas` for handling tables of data.  
   - `requests` for making HTTP calls to the Supabase API.

2. **Configuration**  
   - `SUPABASE_URL` and `SUPABASE_API_KEY` point to your database and authorize access.  
   - `TABLE` is the name of the table you'll pull (`smart_meter_readings_1year`).

3. **`load_data()` function**  
   - Builds the URL to fetch **all columns** (`select=*`) from your table, sorted by `timestamp` ascending (`order=timestamp.asc`).  
   - Sends a GET request with your API key.  
   - If successful (`status_code == 200`), prints the number of records retrieved and converts the JSON response into a pandas DataFrame.  
   - If unsuccessful, raises an error with the status code and message.

4. **Usage**  
   - Calls `load_data()` to populate `df` with the full dataset.  
   - Displays the first few rows of `df` using `df.head()`.  

> **Purpose:** Pull the entire smart-meter readings table into your notebook for analysis.
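
If your Supabase project caps the number of rows returned per request, a single GET may not return the whole table. The sketch below shows one way to page through results. `fetch_all_rows` and `fetch_page` are hypothetical helpers (not part of this notebook), and the in-memory `fake_table` stands in for the HTTP call so the example runs offline. Against the REST endpoint, each page could be requested by appending `limit` and `offset` query parameters to the URL.

```python
from typing import Callable, Dict, List


def fetch_all_rows(fetch_page: Callable[[int, int], List[Dict]],
                   page_size: int = 1000) -> List[Dict]:
    """Collect rows page by page until a short (or empty) page signals the end."""
    rows: List[Dict] = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        rows.extend(page)
        if len(page) < page_size:
            break
        offset += page_size
    return rows


# Hypothetical stand-in for the HTTP request, e.g.
# requests.get(f"{url}&limit={limit}&offset={offset}", headers=headers).json()
fake_table = [{"id": i} for i in range(2500)]


def fake_fetch_page(offset: int, limit: int) -> List[Dict]:
    return fake_table[offset:offset + limit]


all_rows = fetch_all_rows(fake_fetch_page, page_size=1000)
print(len(all_rows), "rows collected")
```

Passing the page fetcher in as a function keeps the loop testable without network access.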

#### Data Inspection

In [None]:
import pandas as pd
import requests
from google.colab import userdata

SUPABASE_URL     = userdata.get('SupabaseURL')
SUPABASE_API_KEY = userdata.get('SupabaseAPI')
TABLE            = "smart_meter_readings_1year"

def load_data():
    url = (
        f"{SUPABASE_URL}/rest/v1/{TABLE}"
        "?select=*"               # fetch all columns
        "&order=timestamp.asc"    # order by timestamp ascending
    )
    headers = {
        "apikey": SUPABASE_API_KEY,
        "Authorization": f"Bearer {SUPABASE_API_KEY}"
    }

    res = requests.get(url, headers=headers, timeout=30)  # fail instead of hanging on a stalled connection
    if res.status_code == 200:
        records = res.json()  # parse the JSON body once
        print("✅ Data pulled successfully: ", len(records), "records\n")
        return pd.DataFrame(records)
    else:
        raise Exception(f"❌ Error: {res.status_code}\n{res.text}")

# usage
df = load_data()
df.head()

In [None]:
df.shape

There are 17,431 rows and 13 columns in the dataset.

In [None]:
# checking out the column names

for col in df.columns:
    print(col)

In [None]:
# Let's look at the data type of each feature
df.info()

### Data Cleaning

In [None]:
# Step 1: Parse the 'timestamp' column into pandas datetimes and store it in a new 'datetime' column
# (if 'timestamp' held numeric Unix seconds, pd.to_datetime would need unit='s')

df['datetime'] = pd.to_datetime(df['timestamp'])
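
How `pd.to_datetime` should be called depends on how `timestamp` is stored. The sketch below uses made-up values (not taken from the dataset): ISO-formatted strings parse directly, while numeric Unix seconds need `unit='s'`, otherwise they are read as nanoseconds since 1970.

```python
import pandas as pd

# Made-up sample values for illustration only
iso_strings = pd.Series(["2025-07-13T10:15:30.250000", "2025-07-13T10:30:30.750000"])
epoch_seconds = pd.Series([1752401730, 1752402630])

parsed_iso = pd.to_datetime(iso_strings)                 # strings: no unit needed
parsed_epoch = pd.to_datetime(epoch_seconds, unit="s")   # integers: say they are seconds
misparsed = pd.to_datetime(epoch_seconds)                # integers read as nanoseconds

print(parsed_iso.iloc[0])
print(parsed_epoch.iloc[0])
print(misparsed.iloc[0])
```

A quick look at `df['timestamp'].dtype` before converting shows which case applies.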

In [None]:
# Step 2: Extract date and ISO week number and store in new columns

df['date'] = df['datetime'].dt.date
df['week'] = df['datetime'].dt.isocalendar().week

In [None]:
# Step 3: Convert date to datetime, week to integer

df['date'] = pd.to_datetime(df['date'])
df['week'] = df['week'].astype(int)

In [None]:
df.info()

In [None]:
df.head(5)

In [None]:
# Calculate the date range and the number of days and unique ISO weeks in the dataset
min_date = df['date'].min()
max_date = df['date'].max()
days     = (max_date - min_date).days + 1
weeks    = df['week'].nunique()

print(f"Dataset covers {days} days, from {min_date} to {max_date} → {weeks} ISO weeks\n")
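
One caveat when the data spans more than twelve months: ISO week numbers repeat from one year to the next, so counting unique week numbers alone can undercount the weeks covered. The sketch below uses an assumed date range chosen for illustration and counts distinct (ISO year, week) pairs instead.

```python
import pandas as pd

# Assumed range for illustration: a little over one year of daily dates
dates = pd.Series(pd.date_range("2025-07-14", "2026-07-19", freq="D"))
iso = dates.dt.isocalendar()  # columns: year, week, day

weeks_by_number = iso["week"].nunique()
weeks_by_year_and_number = iso[["year", "week"]].drop_duplicates().shape[0]

print("unique week numbers:      ", weeks_by_number)
print("unique (year, week) pairs:", weeks_by_year_and_number)
```

The week-number count comes out smaller than the pair count whenever the same week number occurs in two different ISO years.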


In [None]:
# Count the number of readings for each date and display them sorted by date
daily_counts = df['date'].value_counts().sort_index()
print("Readings per day:\n", daily_counts)

In [None]:
# Step 4: Check for nulls and duplicates
print("Nulls per column:\n", df.isnull().sum(), "\n")
print("Duplicates:", df.duplicated().sum())

There are no null or duplicate values in the dataset.

In [None]:
# Step 5: Drop irrelevant columns
df = df.drop(columns=['id', 'timestamp'])

In [None]:
# Move 'datetime' to the index, then sort by it so the rows are in chronological order for time series analysis
df = df.set_index('datetime').sort_index()

In [None]:
df.head()

In [None]:
# Floor all datetime index values to whole seconds (drops any sub-second part)
df.index = df.index.floor('s')


In [None]:
df.head()

### Data Visualization and Analysis

#### Univariate Distribution

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [None]:
# Create histograms for all numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols].hist(figsize=(15, 20), bins=30, edgecolor='black', layout=(len(numeric_cols)//2 + 1, 2))

# Get the current figure and axes
fig = plt.gcf()
axes = fig.get_axes()

# Loop through each numeric column and corresponding subplot
for ax, col in zip(axes, numeric_cols):
    data = df[col].dropna()
    mean = data.mean()
    median = data.median()
    mode = data.mode().iloc[0] if not data.mode().empty else np.nan
    ax.axvline(mean, color='red', linestyle='--', linewidth=2, label='Mean')
    ax.axvline(median, color='green', linestyle='-', linewidth=2, label='Median')
    ax.axvline(mode, color='orange', linestyle='-', linewidth=2, label='Mode')
    ax.set_xlabel(col, fontsize=10)
    ax.set_ylabel('Count', fontsize=10)
    ax.legend()

plt.tight_layout()
plt.suptitle("Distributions of the Numeric Data", fontsize=16, y=1.02)
plt.show()

#### Observations and Insights

### 1. **meter_id**
- **Metrics:** Range: 1000–1100; Mean ≈ 1050; Median ≈ 1050; Mode ≈ 1045. Counts for each meter are nearly uniform.
- **What this means:**  
  The dataset is well-balanced across all meters, indicating even data collection. This minimizes the risk of bias from any single location and ensures that insights or models apply generally across all metered sites.

### 2. **power_consumption_kwh**
- **Metrics:** Range: ~0.1–7.5 kWh per interval; Mean ≈ 1.3 kWh; Median ≈ 0.5 kWh; Mode ≈ 0.1 kWh. Over 60% of intervals are below 1 kWh; distribution is strongly right-skewed.
- **What this means:**  
  Most intervals reflect modest energy use, but occasional high-usage intervals drive the average upward. This is typical for large energy datasets, where normal daily activity is punctuated by spikes (from heavy appliances, EV charging, or industrial equipment).

### 3. **voltage**
- **Metrics:** Range: 224–231 V; Mean ≈ 228.7 V; Median ≈ 229 V; Mode: 229 V. About 90% of readings fall between 227 V and 231 V.
- **What this means:**  
  Voltage is tightly regulated and stable for most of the year, indicating a reliable power grid. The few lower readings likely represent minor dips or simulation noise, not systemic supply issues.

### 4. **current**
- **Metrics:** Range: 0.1–22 A; Mean ≈ 5.2 A; Median ≈ 2.4 A; Mode ≈ 0.7 A. Over 75% of readings are below 5 A; highly right-skewed distribution.
- **What this means:**  
  The vast majority of intervals have low current draw, with infrequent but significant spikes. These higher values likely reflect periods of peak demand or operation of large appliances.

### 5. **temperature_c**
- **Metrics:** Range: 4–35°C; Mean ≈ 20°C; Median ≈ 20°C; Mode ≈ 17°C. Most temperatures are between 12°C and 28°C, with a mild central peak.
- **What this means:**  
  The temperature data captures a full annual cycle, covering both warm and cool periods. The broad, nearly symmetrical distribution supports seasonality analysis.

### 6. **humidity_pct**
- **Metrics:** Range: 52–87%; Mean ≈ 69%; Median ≈ 70%; Mode ≈ 83%. Humidity is broadly distributed with two main peaks (around 58% and 83%).
- **What this means:**  
  Humidity reflects strong seasonality, likely corresponding to dry and rainy periods. This variation enables robust analysis of how weather affects power consumption.

### 7. **hour_of_day**
- **Metrics:** Range: 0–23; Mean ≈ 11.5; all hours are equally represented.
- **What this means:**  
  There is excellent coverage across every hour, ensuring that time-of-day patterns can be analyzed without bias or missing intervals.

### 8. **week**
- **Metrics:** Range: 1–52; all weeks in the year are covered and equally represented.
- **What this means:**  
  The data spans a full year, supporting both seasonal and long-term trend analysis. No weeks are missing, so time-based insights will be robust.
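
The "mean above median" pattern noted for consumption and current can be checked numerically. The sketch below uses synthetic log-normal values (an assumption for illustration, not the notebook's data) and reports the mean, median, skewness, and the share of values below 1.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample; parameters are illustrative assumptions
rng = np.random.default_rng(0)
usage = pd.Series(rng.lognormal(mean=-0.5, sigma=1.0, size=5000),
                  name="power_consumption_kwh")

summary = {
    "mean": usage.mean(),
    "median": usage.median(),
    "skew": usage.skew(),                 # > 0 indicates a right-skewed distribution
    "share_below_1": (usage < 1).mean(),  # fraction of intervals under 1 kWh
}
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

On the real data, the same four numbers could be computed with `df['power_consumption_kwh']` in place of `usage`.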

In [None]:
# Annotate each bar of a count plot with its share of all rows (as a percentage).
def bar_perc(plot, feature):
    total = len(feature) # number of rows in the column
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total) # share of rows in this category
        x = p.get_x() + p.get_width() / 2 - 0.05 # horizontal position: near the bar's center
        y = p.get_y() + p.get_height()           # vertical position: top of the bar
        plot.annotate(percentage, (x, y), size = 12) # write the percentage at the top of the bar

In [None]:
#get all category datatype
list_col=  df.select_dtypes(['object', 'bool']).columns

fig1, axes1 =plt.subplots(2,2,figsize=(10, 8))
axes1 = axes1.flatten()
for i in range(len(list_col)):
    order = df[list_col[i]].value_counts(ascending=False).index # most frequent category first (descending counts)
    axis=sns.countplot(x=list_col[i], data=df , order=order,ax=axes1[i],palette='coolwarm').set(title=list_col[i])
    bar_perc(axes1[i],df[list_col[i]])

# Hide any unused subplots
for j in range(len(list_col), len(axes1)):
    fig1.delaxes(axes1[j])

plt.tight_layout()
plt.suptitle("Distributions of the Categorical Data", fontsize=16, y=1.02)
plt.show()

### Categorical Variable Distributions

#### 1. Region
**Metrics:**  
South: 29.7% | West: 25.3% | East: 25.2% | North: 19.8%

**What this means:**  
The dataset is relatively well-distributed across all regions, but the **South** has the highest representation (nearly 30%), while the **North** has the lowest (under 20%). This may reflect larger population centers, more meters installed, or higher data collection frequency in the South.

---

#### 2. Property Type
**Metrics:**  
Residential: 50.6% | Commercial: 49.4%

**What this means:**  
The split between **residential** and **commercial** properties is almost perfectly balanced. This is ideal for comparative analysis and ensures there is no inherent bias toward one property type.

---

#### 3. EV Owner
**Metrics:**  
False: 50.2% | True: 49.8%

**What this means:**  
**EV ownership** is very evenly distributed, allowing for robust comparisons of power consumption patterns between EV owners and non-owners. This balanced distribution supports fair modeling and analysis.

---

#### 4. Solar Installed
**Metrics:**  
False: 70.0% | True: 30.0%

**What this means:**  
A significant majority of properties **do not have solar panels** (70%), while 30% do. This gives good statistical power for comparing solar versus non-solar homes, though findings for the solar group may be less robust for rare patterns.
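
The percentages read off the bars can also be computed directly. A minimal sketch with toy values (hypothetical, not the notebook's data):

```python
import pandas as pd

# Toy data for illustration only
demo = pd.DataFrame({"region": ["south"] * 3 + ["west"] * 3 + ["east"] * 2 + ["north"] * 2})

# normalize=True returns fractions; multiply by 100 for percentages
shares = demo["region"].value_counts(normalize=True).mul(100).round(1)
print(shares)
```

Applied to the notebook's DataFrame, the same call on `df['region']` (or any other categorical column) would give the figures listed above without reading them off the chart.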

### Bivariate Distribution

In [None]:
# Power Consumption by Region, Property Type, EV Owner and Solar Installed

fig, axes = plt.subplots(2, 2, figsize=(10, 7))

# 1. Region
sns.boxplot(x='region', y='power_consumption_kwh', data=df, ax=axes[0,0])
axes[0,0].set_title("Power Consumption by Region")

# 2. Property Type
sns.boxplot(x='property_type', y='power_consumption_kwh', data=df, ax=axes[0,1])
axes[0,1].set_title("Power Consumption by Property_Type")

# 3. EV Owner
sns.boxplot(x='ev_owner', y='power_consumption_kwh', data=df, ax=axes[1,0])
axes[1,0].set_title("Power Consumption by EV_Owner")

# 4. Solar Installed
sns.boxplot(x='solar_installed', y='power_consumption_kwh', data=df, ax=axes[1,1])
axes[1,1].set_title("Power Consumption by Solar_Installed")

plt.tight_layout()
plt.show()

### Boxplot Observations: Power Consumption

#### 1. Region
- **West & East:** Highest medians (~1.6 kWh) and largest upper whiskers (peaks up to ~5 kWh), pointing to heavier loads or more frequent use of large appliances in these areas.
- **South:** Similar distribution to West/East, but with slightly fewer extreme high intervals (upper limit around 4.5 kWh).
- **North:** Lower median (~1.3 kWh) and a more compact interquartile range, indicating lighter and more consistent usage.
- **What this means:** Power usage per interval is highest in the West and East, and lowest in the North.

#### 2. Property Type
- **Residential:** Higher median (~1.1 kWh) and wider spread, with more outliers—reflects more diverse and generally higher usage.
- **Commercial:** Lower median (~0.7 kWh) and a tighter distribution, with fewer outliers and less variability.
- **What this means:** Residential properties are the primary drivers of higher and more variable interval power consumption.

#### 3. EV Owner
- **Both groups:** Nearly identical boxplots, with medians around 0.9 kWh, similar spreads, and similar maximum values.
- **What this means:** EV ownership does not significantly impact overall interval power consumption in this dataset.

#### 4. Solar Installed
- **Both groups:** Medians and distribution shapes are almost identical (~0.9 kWh), showing no visible impact from solar installation on recorded interval consumption.
- **What this means:** Having solar panels does not noticeably alter overall usage patterns in this sample.

**General Remark:**  
The main drivers of variation in power consumption intervals are property type and region, not EV ownership or solar installation.
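
Medians and interquartile ranges read from boxplots are approximate, and a grouped quantile table gives exact values. The sketch below uses toy numbers (hypothetical, chosen only so the output is easy to follow).

```python
import pandas as pd

# Toy data for illustration only
demo = pd.DataFrame({
    "property_type": ["residential"] * 4 + ["commercial"] * 4,
    "power_consumption_kwh": [0.8, 1.0, 1.2, 3.0, 0.5, 0.6, 0.8, 0.9],
})

# Quartiles per group, one row per property type
stats = (
    demo.groupby("property_type")["power_consumption_kwh"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
stats.columns = ["q25", "median", "q75"]
stats["iqr"] = stats["q75"] - stats["q25"]
print(stats)
```

Swapping `demo` for `df` and `property_type` for `region`, `ev_owner`, or `solar_installed` would produce the numbers behind each boxplot.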

In [None]:
# Daily & Weekly Time‐Series Patterns

# Resample to daily mean consumption
daily = df['power_consumption_kwh'].resample('D').mean()
plt.figure(figsize=(10,3))
daily.plot()
plt.title("Daily Average Power Consumption")
plt.ylabel("kWh")
plt.show()

# Hour‐of‐day profile (averaged across all days)
hourly = df.groupby('hour_of_day')['power_consumption_kwh'].mean()
plt.figure(figsize=(8,3))
hourly.plot(marker='o')
plt.title("Average Consumption by Hour of Day")
plt.xlabel("Hour (0–23)")
plt.ylabel("kWh")
plt.xticks(range(0,24,2))
plt.show()

In [None]:
# Total Power consumption per month
# 1. Resample to monthly total consumption
monthly = df['power_consumption_kwh'].resample('M').sum()

# 2. Create month labels (e.g., 'Jul 2025')
month_labels = monthly.index.strftime('%b %Y')

# 3. Plot
plt.figure(figsize=(10, 4))
plt.plot(month_labels, monthly.values, marker='o')
plt.title("Total Power Consumption by Month")
plt.xlabel("Month")
plt.ylabel("Total kWh")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


### Observations: Total Power Consumption by Month

- **Highest Consumption:** Power usage peaks in **March 2026**, with total consumption just above 1900 kWh, closely followed by October 2025, January 2026, and the months of spring.
- **Lowest Consumption:** The lowest total is seen in **July 2026** (about 600 kWh), which is likely due to incomplete data for that month.
- **Consistent Pattern:** From August 2025 through June 2026, monthly usage remains fairly stable, ranging between 1750 and 1920 kWh.
- **Seasonal Insight:** The consistently high consumption throughout late summer, autumn, winter, and spring suggests year-round demand, possibly driven by both heating (in colder months) and cooling (in warmer months), or steady activity across all months.
- **Note:** Partial data in the first and last July periods can make these months appear artificially low.
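
Partial months can be flagged programmatically rather than by eye. The sketch below is one possible check, using an assumed daily index that starts mid-month: it compares the number of distinct days observed in each month with the number of days that month has.

```python
import pandas as pd

# Assumed index for illustration: starts on the 20th, so the first month is partial
idx = pd.date_range("2025-07-20", "2025-09-30", freq="D")
readings = pd.Series(1.0, index=idx)

# Distinct calendar days observed per month
days = pd.Series(readings.index.normalize(), index=readings.index)
days_seen = days.groupby(days.index.to_period("M")).nunique()

coverage = days_seen.to_frame("days_seen")
coverage["days_in_month"] = coverage.index.days_in_month
coverage["complete"] = coverage["days_seen"] == coverage["days_in_month"]
print(coverage)
```

Months where `complete` is `False` could be dropped or annotated before plotting monthly totals.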


In [None]:
import matplotlib.pyplot as plt

# 1. Create a 'month' column for grouping
df['month'] = df.index.to_period('M').to_timestamp()

# 2. Group by month and property type, then sum power consumption
monthly_by_type = (
    df.groupby(['month', 'property_type'])['power_consumption_kwh']
    .sum()
    .unstack()  # property_type becomes columns
)

# 3. Plot each property type as a separate line
plt.figure(figsize=(12,5))
for col in monthly_by_type.columns:
    plt.plot(monthly_by_type.index, monthly_by_type[col], marker='o', label=col.capitalize())

plt.title("Total Power Consumption by Month and Property Type")
plt.xlabel("Month")
plt.ylabel("Total kWh")
plt.xticks(monthly_by_type.index, [d.strftime('%b %Y') for d in monthly_by_type.index], rotation=45, ha='right')
plt.legend(title='Property Type')
plt.tight_layout()
plt.show()


### Observations: Monthly Total Power Consumption by Property Type

- **Residential properties** consistently consume more electricity each month than commercial properties. Residential usage typically ranges from about **950 to 1,100 kWh** per month, while commercial properties range between **700 and 900 kWh**.
- Both types follow a similar seasonal pattern, with higher usage in **late summer and early autumn** (August to October), and somewhat lower totals during the winter months (especially February).
- **Residential consumption** reaches its peak in **August** and again in **May–June**, likely due to increased heating or cooling needs during these periods.
- The lowest recorded consumption appears in **July 2026** for both categories, which is probably due to incomplete data for that month.
- **Commercial consumption** is slightly more variable from month to month, which may reflect fluctuations in business activity or the influence of holidays.

Overall, residential buildings are the main contributors to annual energy use, but both groups show similar rises and falls throughout the year, suggesting that seasonal factors impact both property types in parallel.


In [None]:
# --- 1. Weekly Average Power Consumption (better for yearly data) ---
# Aggregate to weekly mean
weekly = df['power_consumption_kwh'].resample('W').mean()

plt.figure(figsize=(10, 3))
plt.plot(weekly.index, weekly.values, marker='o')
plt.title("Weekly Average Power Consumption")
plt.ylabel("kWh")
plt.xlabel("Week")
plt.tight_layout()
plt.show()

# --- 2. Average Consumption by Day of Week (cycle) ---
# If your index is a datetime, extract day name
df['weekday'] = df.index.day_name()  # relies on the DatetimeIndex set during cleaning

# Get mean for each day of the week
dow_avg = df.groupby('weekday')['power_consumption_kwh'].mean()
# Ensure proper order: Monday to Sunday
dow_avg = dow_avg.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

plt.figure(figsize=(8, 3))
plt.plot(dow_avg.index, dow_avg.values, marker='o')
plt.title("Average Power Consumption by Day of Week")
plt.ylabel("kWh")
plt.xlabel("Day of Week")
plt.tight_layout()
plt.show()

# --- 3. Hourly Average Consumption ---
hourly = df.groupby('hour_of_day')['power_consumption_kwh'].mean()
hours = hourly.index
time_labels = [f"{h:02d}:00" for h in hours]

plt.figure(figsize=(10, 3))
plt.plot(hours, hourly.values, marker='o')
plt.title("Average Consumption by Time of Day")
plt.xlabel("Time of Day")
plt.ylabel("kWh")
plt.xticks(hours, time_labels, rotation=45, ha='right')
plt.tight_layout()
plt.show()


### Observations: Weekly Average Power Consumption

- **Fluctuations:** Weekly average power consumption varies between about **1.1 kWh and 1.4 kWh**, showing regular rises and dips throughout the year.
- **Short-Term Peaks:** Repeating peaks occur every few weeks, likely linked to operational or environmental factors such as weather changes or holidays.
- **No Major Outliers:** There are no extreme spikes or drops, suggesting overall stable demand.
- **Real-life Context:** These trends reflect a mix of business cycles, weather, and seasonal influences on energy use.

---

### Observations: Average Power Consumption by Day of Week

- **Workweek Higher:** Average consumption per interval is **highest Monday to Friday** (roughly **1.35–1.42 kWh**), peaking on Fridays.
- **Weekend Drop:** There’s a clear decrease on **Saturdays and Sundays** (about **0.95–1.0 kWh**), indicating less activity.
- **What This Means:** This pattern aligns with typical routines—higher weekday demand due to work, school, and business; lower on weekends when more people may be out or businesses closed.

---

### Observations: Average Consumption by Time of Day

- **Nighttime Low:** Power use is lowest between **midnight and 6 AM** (~**0.4 kWh**), reflecting minimal activity.
- **Morning Ramp-up:** Usage climbs rapidly from **7 AM** onward, especially between **8–10 AM**.
- **Afternoon Peak:** The highest consumption occurs between **10 AM and 4 PM** (up to **2.2 kWh**), coinciding with business hours and daytime routines.
- **Evening Decline:** After **4 PM**, power use gradually decreases, stabilizing in the evening and dropping after **10 PM**.
- **What This Means:** This daily pattern matches real-world behavior—low overnight, sharp morning rise, sustained daytime peak, then evening wind-down.


In [None]:
# Weekly consumption by Property Type

# 1. Compute weekly mean consumption for each type
weekly_type = (
    df
    .groupby([pd.Grouper(freq='W'), 'property_type'])['power_consumption_kwh']
    .mean()
    .unstack()  # columns = residential, commercial
)

# 2. Generate readable x-labels (freq='W' labels each bin by its week-ending Sunday, not the week's start)
week_labels = weekly_type.index.strftime('Week of %d %b')

# 3. Plot on two subplots
fig, axes = plt.subplots(2, 1, figsize=(12, 6), sharex=True)

for ax, ptype in zip(axes, ['residential', 'commercial']):
    ax.plot(weekly_type.index, weekly_type[ptype], marker='o')
    ax.set_title(f"Weekly Average Power Consumption: {ptype.capitalize()}")
    ax.set_ylabel("kWh")
    ax.grid(True)

# 4. Customize x-ticks on bottom subplot
axes[-1].set_xticks(weekly_type.index)
axes[-1].set_xticklabels(week_labels, rotation=45, ha='right')
axes[-1].set_xlabel("Week")

plt.tight_layout()
plt.show()

# Hourly Average by Property Type
hourly_type = (
    df
    .groupby(['hour_of_day', 'property_type'])['power_consumption_kwh']
    .mean()
    .unstack()
)

time_labels = [f"{h:02d}:00" for h in hourly_type.index]

plt.figure(figsize=(10, 4))
for ptype in ['residential', 'commercial']:
    plt.plot(hourly_type.index, hourly_type[ptype], marker='o', label=ptype.capitalize())

plt.title("Average Consumption by Hour of Day")
plt.xlabel("Time of Day")
plt.ylabel("kWh")
plt.xticks(hourly_type.index, time_labels, rotation=45, ha='right')
plt.legend(title="Property Type")
plt.grid(True)
plt.tight_layout()
plt.show()
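
A note on the weekly labels used above: with `freq='W'`, pandas labels each weekly bin by the Sunday that ends it. The sketch below (toy daily values, assumed for illustration) shows the default labels and one way to shift them to the Monday that starts each week.

```python
import pandas as pd

# Two full weeks of toy daily values; 2025-07-14 is a Monday
idx = pd.date_range("2025-07-14", periods=14, freq="D")
values = pd.Series(range(14), index=idx, dtype="float64")

weekly = values.resample("W").mean()               # bins end on Sunday
week_start = weekly.index - pd.Timedelta(days=6)   # shift labels to Monday
labels = ["Week of " + d.strftime("%d %b") for d in week_start]

print(weekly)
print(labels)
```

Whether to label by week start or week end is a presentation choice; the point is only that the label text should match the date being shown.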


### Weekly Average Power Consumption: Residential

- **Residential power usage** remains higher than commercial throughout the year, with weekly averages typically between **1.3 and 1.5 kWh per interval**.
- Fluctuations are mild, with most weeks staying within a narrow band, indicating stable household routines.
- **Interpretation:** Residential energy demand is steady, likely due to consistent appliance use and regular living patterns, with occasional variations due to holidays, weather, or other seasonal effects.

---

### Weekly Average Power Consumption: Commercial

- **Commercial usage** is generally lower, with weekly averages of **0.9–1.2 kWh per interval**, but with a few sharper weekly peaks approaching **1.5 kWh**.
- There is slightly more variability, with occasional spikes likely reflecting special business activities or seasonality.
- **Interpretation:** Commercial properties show increased energy use during certain periods, possibly linked to events or operational peaks, while maintaining lower consumption during routine weeks or holidays.

---

### Average Consumption by Hour of Day (Property Type Comparison)

- **Residential:** Power usage increases after **6 AM**, stays elevated between **7 AM and 9 PM** (about **1.5–2.0 kWh**), and drops overnight (to around **0.5 kWh**).
- **Commercial:** Remains low overnight (**0.3 kWh**), rises rapidly after **8 AM**, peaks between **1 PM and 4 PM** (**up to 3.0 kWh**), and then falls sharply after business hours.
- **Interpretation:** Residential consumption follows typical daily routines, while commercial usage is highly concentrated during standard business hours, reflecting operational schedules.



In [None]:
# Weekly consumption by EV Owner
# 1. Compute weekly mean consumption for each EV owner group
weekly_ev = (
    df
    .groupby([pd.Grouper(freq='W'), 'ev_owner'])['power_consumption_kwh']
    .mean()
    .unstack()  # columns = True, False
)

# 2. Create readable week labels
week_labels = ["Week of " + d.strftime('%d %b') for d in weekly_ev.index]

# 3. Plot on two subplots (one for EV owners, one for non-owners)
fig, axes = plt.subplots(2, 1, figsize=(10, 5), sharex=True)
for ax, owner in zip(axes, [True, False]):
    ax.plot(weekly_ev.index, weekly_ev[owner], marker='o')
    label = "EV Owner" if owner else "No EV"
    ax.set_title(f"Weekly Average Power Consumption: {label}")
    ax.set_ylabel("kWh")
    ax.grid(True)

# 4. Set week labels on x-axis
axes[-1].set_xticks(weekly_ev.index)
axes[-1].set_xticklabels(week_labels, rotation=45, ha='right')
axes[-1].set_xlabel("Week")
plt.tight_layout()
plt.show()

# Hourly average by EV owner
hourly_ev = (
    df
    .groupby(['hour_of_day', 'ev_owner'])['power_consumption_kwh']
    .mean()
    .unstack()  # columns = True, False
)
time_labels = [f"{h:02d}:00" for h in hourly_ev.index]

plt.figure(figsize=(10, 4))
for owner in [True, False]:
    label = "EV Owner" if owner else "No EV"
    plt.plot(hourly_ev.index, hourly_ev[owner], marker='o', label=label)
plt.title("Average Consumption by Hour of Day (EV Owner)")
plt.xlabel("Time of Day")
plt.ylabel("kWh")
plt.xticks(hourly_ev.index, time_labels, rotation=45, ha='right')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


### Weekly Average Power Consumption: EV Owner

- **EV Owners:** Weekly average power consumption for EV owners generally ranges from **1.0 to 1.4 kWh**, with a gradual increase over the year and several noticeable peaks. Most values remain above 1.1 kWh, showing a consistently higher usage baseline.
- **No EV:** Non-EV owners have slightly lower weekly averages, mostly between **1.1 and 1.3 kWh**, and their usage is more stable, with fewer sharp peaks compared to EV owners.
- **Interpretation:** EV ownership is associated with a higher and more variable average weekly power consumption, likely due to charging needs that add to regular household usage.

### Average Consumption by Hour of Day (EV Owner Comparison)

- **Both Groups:** Power consumption patterns by hour are very similar for both EV owners and non-EV owners, with overnight lows (0.4–0.5 kWh), a morning rise from 6 AM, and prominent peaks between **10 AM and 4 PM** (about 2.0–2.3 kWh).
- **Subtle Differences:** EV owners have slightly higher afternoon peaks, but the overall daily curves closely match.
- **Interpretation:** Despite the higher weekly average for EV owners, hourly usage patterns remain largely parallel, suggesting that charging is distributed in a way that aligns with general household routines, rather than causing unique new peaks.


In [None]:

# Weekly average by Solar Installed

# 1. Weekly average ('W-MON' bins end on Mondays; the earlier weekly plots use 'W', which ends on Sundays)
weekly_solar = (
    df
    .groupby([pd.Grouper(freq='W-MON'), 'solar_installed'])['power_consumption_kwh']
    .mean()
    .unstack()  # columns = True, False
)

# 2. Generate readable week labels (e.g., "Week of 13 Jul")
week_labels = ["Week of " + d.strftime('%d %b') for d in weekly_solar.index]

# 3. Plot weekly averages in two subplots
fig, axes = plt.subplots(2, 1, figsize=(10, 5), sharex=True)
for ax, solar in zip(axes, [True, False]):
    ax.plot(weekly_solar.index, weekly_solar[solar], marker='o')
    label = "Solar Installed" if solar else "No Solar"
    ax.set_title(f"Weekly Average Power Consumption: {label}")
    ax.set_ylabel("kWh")
    ax.grid(True)

# X-axis labels for bottom subplot
axes[-1].set_xticks(weekly_solar.index)
axes[-1].set_xticklabels(week_labels, rotation=45, ha='right')
axes[-1].set_xlabel("Week")
plt.tight_layout()
plt.show()

# 4. Hourly average by Solar Installed
hourly_solar = (
    df
    .groupby(['hour_of_day', 'solar_installed'])['power_consumption_kwh']
    .mean()
    .unstack()  # columns = True, False
)
time_labels = [f"{h:02d}:00" for h in hourly_solar.index]

plt.figure(figsize=(10, 4))
for solar in [True, False]:
    label = "Solar Installed" if solar else "No Solar"
    plt.plot(hourly_solar.index, hourly_solar[solar], marker='o', label=label)
plt.title("Average Consumption by Hour of Day (Solar Installed)")
plt.xlabel("Time of Day")
plt.ylabel("kWh")
plt.xticks(hourly_solar.index, time_labels, rotation=45, ha='right')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


### Weekly Average Power Consumption: Solar Installed

- **Solar Installed:** Properties with solar panels generally show a higher and more variable weekly power consumption, with peaks above 1.6 kWh and fluctuations throughout the year. The higher average could reflect larger or more energy-active households, or possible feedback to the grid.
- **No Solar:** Properties without solar are more consistent, with weekly averages of around 1.2–1.4 kWh per interval, less dramatic swings, and a slightly lower ceiling than the solar group.
- **Interpretation:** Solar adoption is associated with higher observed power use, but the variation suggests that lifestyle, system size, or local conditions may also play a role.

### Average Consumption by Hour of Day (Solar Installed)

- **Solar Installed:** Average power use rises sharply after 6 AM, peaking between 2–4 PM (up to 2.2 kWh), then declines in the evening. This pattern could align with active daytime usage and possible feedback from solar systems during peak sun hours.
- **No Solar:** Similar daily curve, but with a slightly lower afternoon peak and consistently lower consumption in most hours.
- **Interpretation:** Both groups have similar time-of-day profiles, but the solar group consistently uses (or returns) more energy, especially during midday hours. This highlights the impact of solar adoption on both consumption and potential generation patterns.


### Correlation Matrix

In [None]:
# Compute correlations
corr = df[['power_consumption_kwh','voltage','current','temperature_c','humidity_pct']].corr()

# Plot
plt.figure(figsize=(6,5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature Correlation Matrix")
plt.show()

- **Power consumption** is almost perfectly positively correlated with **current** (1.00 after rounding) and very strongly negatively correlated with **voltage** (-0.91). The first result follows from the basic power relation (power is proportional to voltage × current): with voltage confined to a narrow band, power is nearly proportional to current.
- **Humidity** shows a mild positive correlation with power consumption (0.22), while **temperature** is slightly negatively correlated (-0.19), suggesting that weather factors play only a small role in influencing energy use in this dataset.
- **Temperature and humidity** are very strongly negatively correlated (-0.86), reflecting the common environmental pattern where higher temperatures often coincide with lower humidity.
- **Voltage and current** are also very strongly negatively correlated (-0.91), further highlighting their inverse relationship in household or commercial power distribution.
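
To see why these correlations are so extreme, consider a synthetic example. The sketch below assumes a simple linear voltage drop under load (a hypothetical model chosen for illustration, not a claim about how the dataset was produced) and computes power as voltage × current.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Assumed toy model: voltage sags slightly as current rises, plus small noise
current = rng.uniform(0.1, 22.0, size=n)
voltage = 231.0 - 0.3 * current + rng.normal(0.0, 0.3, size=n)
power_kw = voltage * current / 1000.0

corr = pd.DataFrame({
    "power": power_kw,
    "current": current,
    "voltage": voltage,
}).corr()
print(corr.round(2))
```

Because power is close to a rescaled copy of current here, the two columns carry nearly the same information, which is worth keeping in mind when choosing model features.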


### Modelling - Experimenting With Time Series Models: Prophet and SARIMAX

In this section, we explore two powerful time series forecasting models:

- **Prophet**: Developed by Facebook, Prophet is an open-source forecasting tool designed for time series data with strong seasonal effects and historical trends. It's robust to missing data, handles outliers well, and supports custom seasonality and holidays, making it ideal for business and economic forecasting.

- **SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors)**: SARIMAX is a statistical model used to forecast time series data that may have trend and seasonality components. It extends the ARIMA model by including:
  - **Seasonality**: Repeats over time (e.g., daily, weekly patterns)
  - **Exogenous variables (X)**: External predictors that can improve forecast accuracy
  - **AR (AutoRegressive)**, **I (Integrated)**, and **MA (Moving Average)** components to model different aspects of the time series' structure.

We will use these models to evaluate and compare their performance on our dataset.
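
To make the "I (Integrated)" component concrete: differencing removes a trend, and differencing at the seasonal period removes a repeating pattern. The sketch below uses a synthetic hourly series with an assumed 24-step season. It illustrates differencing only and is not a fitted SARIMAX model.

```python
import numpy as np
import pandas as pd

# Synthetic series: linear trend plus a pattern that repeats every 24 steps
t = np.arange(24 * 10)
y = pd.Series(0.01 * t + np.sin(2 * np.pi * t / 24))

# Seasonal differencing: subtract the value from 24 steps earlier
seasonal_diff = y.diff(24).dropna()

# Ordinary differencing of the result removes the remaining constant drift
regular_diff = seasonal_diff.diff().dropna()

print(seasonal_diff.describe())
print(regular_diff.abs().max())
```

After the seasonal difference only the trend's contribution over 24 steps remains, and the second difference leaves values that are essentially zero.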

#### PROPHET MODEL

In [None]:
# Make a copy to preserve the original
df_prophet = df.copy()

# Prophet needs 'ds' and 'y' columns
df_prophet['ds'] = df_prophet.index
df_prophet['y'] = df_prophet['power_consumption_kwh']

# Convert boolean columns to integers
df_prophet['ev_owner'] = df_prophet['ev_owner'].astype(int)
df_prophet['solar_installed'] = df_prophet['solar_installed'].astype(int)

In [None]:
df_prophet.head()

In [None]:
"""
This next step creates cyclical time features (hour_sin and hour_cos) from hour_of_day.
Uses sine and cosine transformations to reflect the repeating daily cycle.
This helps models recognize that hour 23 and hour 0 are close in time,
but not in numeric value, improving the capture of daily patterns.
"""
# Create sine and cosine features from hour_of_day (0–23)
df_prophet['hour_sin'] = np.sin(2 * np.pi * df_prophet['hour_of_day'] / 24)
df_prophet['hour_cos'] = np.cos(2 * np.pi * df_prophet['hour_of_day'] / 24)
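
The effect of the sine/cosine encoding can be checked on its own. The sketch below is illustrative only: `encode_hour` and `circular_distance` are hypothetical helper names, not part of the notebook. It compares the gap between hours 23 and 0 before and after encoding.

```python
import numpy as np

def encode_hour(hour):
    """Map hours in [0, 24) to (sin, cos) points on the unit circle."""
    angle = 2 * np.pi * np.asarray(hour, dtype=float) / 24
    return np.column_stack([np.sin(angle), np.cos(angle)])

def circular_distance(h1, h2):
    """Euclidean distance between two hours in (sin, cos) space."""
    p1 = encode_hour([h1])[0]
    p2 = encode_hour([h2])[0]
    return float(np.linalg.norm(p1 - p2))

# Raw hour values put 23 and 0 far apart; the encoding puts them one step apart
print(f"raw gap 23 vs 0:     {abs(23 - 0)}")
print(f"encoded gap 23 vs 0: {circular_distance(23, 0):.3f}")
print(f"encoded gap 0 vs 1:  {circular_distance(0, 1):.3f}")
print(f"encoded gap 0 vs 12: {circular_distance(0, 12):.3f}")
```

In the encoded space, 23:00 and 00:00 are exactly as close as 00:00 and 01:00, while midnight and noon sit at opposite sides of the circle.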

In [None]:
# One-hot encode region and property_type
region_dummies = pd.get_dummies(df_prophet['region'], prefix='region')
property_dummies = pd.get_dummies(df_prophet['property_type'], prefix='property')

# Add them to the main DataFrame
df_prophet = pd.concat([df_prophet, region_dummies, property_dummies], axis=1)

In [None]:
selected_columns = [
    'ds', 'y',
    'temperature_c', 'humidity_pct',
    'ev_owner', 'solar_installed',
    'hour_sin', 'hour_cos',
    'region_east', 'region_north', 'region_south', 'region_west',
    'property_commercial', 'property_residential'
]

df_prophet = df_prophet[selected_columns]

In [None]:
# Sort and split the Time Series Data
# Sort by time
df_prophet = df_prophet.sort_values('ds')

# Split by 80%
split_idx = int(len(df_prophet) * 0.8)
train_df = df_prophet.iloc[:split_idx]
test_df = df_prophet.iloc[split_idx:]

In [None]:
from prophet import Prophet

# Initialize the model
model = Prophet(daily_seasonality=True, weekly_seasonality=True, yearly_seasonality=False)

# Add all regressors
for regressor in selected_columns:
    if regressor not in ['ds', 'y']:
        model.add_regressor(regressor)

# Fit the model on the training data
model.fit(train_df)

In [None]:
# Remove target column from test
future_test = test_df.drop(columns=['y'])

# Predict
forecast = model.predict(future_test)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Compute metrics
mae = mean_absolute_error(test_df['y'], forecast['yhat'])
rmse = np.sqrt(mean_squared_error(test_df['y'], forecast['yhat']))

print(f"MAE: {mae:.3f} kWh")
print(f"RMSE: {rmse:.3f} kWh")

#### Comparing with Baseline models

### Preparing True and Predicted Values for Comparison

- `y_true = test_df['y'].reset_index(drop=True)`: Extracts the actual target values from the test set and resets the index to ensure alignment.

- `y_pred_prophet = forecast['yhat'].reset_index(drop=True)`: Extracts the predicted values from Prophet (`'yhat'`) and also resets the index.

Resetting the index ensures that both `y_true` and `y_pred_prophet` are properly aligned by position, which is important for plotting or calculating metrics like correlation or custom error analysis.
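
A small sketch of why the reset matters, using made-up toy values rather than the notebook's data: pandas aligns Series on index labels, so element-wise operations between two Series with different labels produce `NaN` instead of a row-by-row comparison.

```python
import pandas as pd

# Toy "actual" values keep the labels of the slice they came from
actual = pd.Series([1.0, 2.0, 3.0], index=[100, 101, 102])
# Toy predictions carry a fresh 0..n-1 index
predicted = pd.Series([1.5, 2.5, 2.0])

# Arithmetic aligns on labels: no label is shared, so every result is NaN
misaligned_error = actual - predicted
print(misaligned_error.isna().all())

# After resetting both indexes, rows are compared by position
aligned_error = actual.reset_index(drop=True) - predicted.reset_index(drop=True)
print(aligned_error.abs().mean())
```

Functions such as `mean_absolute_error` convert their inputs to arrays and compare by position, so the reset mainly protects pandas-level operations such as subtraction, correlation, and plotting on a shared axis.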

In [None]:
y_true = test_df['y'].reset_index(drop=True)
y_pred_prophet = forecast['yhat'].reset_index(drop=True)

In [None]:
"""
Naïve Forecast Benchmark
This cell creates a simple naïve forecast to serve as a baseline for evaluating Prophet's performance:

y_pred_naive = y_true.shift(1): Assumes that the best prediction for the current time step is the actual value from 30 minutes ago (i.e., a 1-step lag).

mean_absolute_error(y_true[1:], y_pred_naive[1:]): Calculates MAE between the actual and naïve predictions, excluding the first row (which becomes NaN after the shift).

mean_squared_error(...): Computes RMSE the same way.
"""

# Shift the true values by 1 step (30 mins ago)
y_pred_naive = y_true.shift(1)

# Drop the first value to align
naive_mae = mean_absolute_error(y_true[1:], y_pred_naive[1:])
naive_rmse = np.sqrt(mean_squared_error(y_true[1:], y_pred_naive[1:]))

print(f"Naïve MAE: {naive_mae:.3f} kWh")
print(f"Naïve RMSE: {naive_rmse:.3f} kWh")

In [None]:
"""
Seasonal Naïve Forecast Benchmark (1-Day Lag)
This cell sets up a seasonal naïve model that assumes the power consumption at a given time is the same as exactly 24 hours earlier:

y_true.shift(48): Shifts the actual values by 48 steps, assuming the data has 30-minute intervals (48 steps = 1 day).

The forecast assumes today’s consumption will match the same time yesterday.

mean_absolute_error(...) and mean_squared_error(...): Compute MAE and RMSE, skipping the first 48 rows to avoid misalignment due to shifting.

This benchmark captures daily seasonality and helps assess whether Prophet's more advanced forecasting offers an improvement over this simple, seasonal assumption.

"""

# Shift by 48 steps for 1-day seasonal naive
y_pred_seasonal_naive = y_true.shift(48)

seasonal_mae = mean_absolute_error(y_true[48:], y_pred_seasonal_naive[48:])
seasonal_rmse = np.sqrt(mean_squared_error(y_true[48:], y_pred_seasonal_naive[48:]))

print(f"Seasonal Naïve MAE: {seasonal_mae:.3f} kWh")
print(f"Seasonal Naïve RMSE: {seasonal_rmse:.3f} kWh")
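
The two baselines differ only in the lag, so they can be sketched as one function. The example below uses a synthetic, perfectly periodic series (an assumption made purely for illustration), and `seasonal_naive_scores` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def seasonal_naive_scores(y, lag):
    """Score a forecast that repeats the value observed `lag` steps earlier."""
    y = pd.Series(y).reset_index(drop=True)
    y_pred = y.shift(lag)
    # The first `lag` rows have no earlier value to copy, so skip them
    err = y.iloc[lag:] - y_pred.iloc[lag:]
    mae = float(err.abs().mean())
    rmse = float(np.sqrt((err ** 2).mean()))
    return mae, rmse

# Synthetic series: a clean daily cycle sampled every 30 minutes for 4 days
steps_per_day = 48
t = np.arange(4 * steps_per_day)
y = 1.0 + 0.5 * np.sin(2 * np.pi * t / steps_per_day)

mae_1, rmse_1 = seasonal_naive_scores(y, lag=1)
mae_48, rmse_48 = seasonal_naive_scores(y, lag=steps_per_day)
print(f"lag=1  -> MAE {mae_1:.4f}, RMSE {rmse_1:.4f}")
print(f"lag=48 -> MAE {mae_48:.4f}, RMSE {rmse_48:.4f}")
```

On a purely daily pattern the 48-step lag is essentially error-free, which is why it is a demanding benchmark when daily seasonality dominates. Note that both baselines read actual values from inside the test window, whereas the Prophet forecast above is produced for the whole test window at once, so the comparison is not strictly like for like.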

In [None]:
prophet_mae = mean_absolute_error(y_true, y_pred_prophet)
prophet_rmse = np.sqrt(mean_squared_error(y_true, y_pred_prophet))

print("🔍 Forecast Performance Comparison:")
print(f"📈 Prophet         - MAE: {prophet_mae:.3f}, RMSE: {prophet_rmse:.3f}")
print(f"📉 Naïve           - MAE: {naive_mae:.3f}, RMSE: {naive_rmse:.3f}")
print(f"🕒 Seasonal Naïve  - MAE: {seasonal_mae:.3f}, RMSE: {seasonal_rmse:.3f}")

### Experimenting With SARIMAX MODEL

In [None]:
!pip install statsmodels --quiet

In [None]:
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [None]:
import pandas as pd

# Make a fresh copy to work on
df_sarimax = df.copy()

# Ensure datetime index
df_sarimax.index = pd.to_datetime(df_sarimax.index)

# Convert boolean features to int
df_sarimax['ev_owner'] = df_sarimax['ev_owner'].astype(int)
df_sarimax['solar_installed'] = df_sarimax['solar_installed'].astype(int)

# One-hot encode categorical features
df_sarimax = pd.get_dummies(df_sarimax, columns=['region', 'property_type'], drop_first=True)

# Now convert any remaining bools (from one-hot encoding) to int
for col in df_sarimax.columns:
    if df_sarimax[col].dtype == bool:
        df_sarimax[col] = df_sarimax[col].astype(int)

# Drop non-numeric / irrelevant features
df_sarimax = df_sarimax.drop(columns=['date', 'weekday', 'month'])

# Sort by time
df_sarimax = df_sarimax.sort_index()

# ✅ Show updated dtypes to confirm
print(df_sarimax.dtypes)

# Preview the cleaned DataFrame
df_sarimax.head()

### Preparing Data for SARIMAX Modeling

This cell prepares a clean and numeric dataset suitable for use with the **SARIMAX** model:

- `df_sarimax = df.copy()`: Starts with a fresh copy of the original data to avoid altering it.
- `df_sarimax.index = pd.to_datetime(...)`: Ensures the DataFrame index is in datetime format, which is required for time series modeling.

**Feature preprocessing:**
- Boolean features like `ev_owner` and `solar_installed` are converted to integers (0 or 1).
- Categorical variables (`region` and `property_type`) are one-hot encoded using `pd.get_dummies`, with `drop_first=True` to avoid multicollinearity.

**Additional cleaning:**
- Any leftover boolean columns (possibly from encoding) are also converted to integers.
- Drops the `date`, `weekday`, and `month` columns: the timestamp is already captured in the index, and these columns are non-numeric or not needed as regressors.
- Sorts the DataFrame chronologically to preserve time order.

Finally:
- Prints the data types to confirm everything is numeric.
- Shows the top rows of the cleaned dataset to preview the structure before modeling.

This preprocessing ensures that SARIMAX can handle all features correctly during fitting.
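
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The category values are assumptions chosen to match the column names used elsewhere in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["north", "south", "east", "west"],
    "property_type": ["residential", "commercial", "residential", "commercial"],
})

# Full encoding: one indicator per category
full = pd.get_dummies(toy, columns=["region", "property_type"])
# Reduced encoding: the first category of each feature becomes the implicit baseline
reduced = pd.get_dummies(toy, columns=["region", "property_type"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))

# In the full encoding the region indicators always sum to 1 per row,
# which is the redundancy that drop_first removes
print((full.filter(like="region_").sum(axis=1) == 1).all())
```

With the reduced encoding, a row whose region indicators are all zero belongs to the dropped category (`east` in this toy example).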

In [None]:
# Select the target column
target = 'power_consumption_kwh'

# Drop unwanted columns from exogenous features
exog_cols = df_sarimax.columns.drop([target, 'voltage', 'current'], errors='ignore')

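# Define a chronological 80/20 train-test split
split_index = int(len(df_sarimax) * 0.8)
train_end = df_sarimax.index[split_index]  # timestamp at the split point (first test row)

# Slice by position: label-based .loc slicing includes both endpoints,
# which would put the boundary row in both the training and the test set
train_y = df_sarimax[target].iloc[:split_index]
test_y = df_sarimax[target].iloc[split_index:]

train_X = df_sarimax[exog_cols].iloc[:split_index]
test_X = df_sarimax[exog_cols].iloc[split_index:]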

### Defining Target, Exogenous Variables, and Time-Based Train-Test Split for SARIMAX

- `target = 'power_consumption_kwh'`: Specifies the target variable to forecast.
- `exog_cols = df_sarimax.columns.drop([target, 'voltage', 'current'], errors='ignore')`: Uses every remaining column except the target, `voltage`, and `current` as exogenous variables (external predictors), which SARIMAX can use to improve forecasting accuracy.

**Train-test split:**
- `split_index`: Computes the position that corresponds to 80% of the dataset length.
- `train_end = df_sarimax.index[split_index]`: Gets the actual timestamp at the split point.
- `train_y` and `test_y`: Contain the target values for the training and testing periods, respectively.
- `train_X` and `test_X`: Contain the exogenous features for the corresponding time ranges.

This setup ensures that both the target and predictor variables are properly aligned and time-ordered for training and evaluating the SARIMAX model.
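
One detail worth checking in any time-based split is how the boundary row is handled. The toy sketch below (synthetic timestamps, not the notebook's data) contrasts label-based and position-based slicing at the same split point.

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=10, freq="30min")
s = pd.Series(range(10), index=idx)

split_index = int(len(s) * 0.8)      # position 8
boundary = s.index[split_index]      # timestamp at that position

# Label-based slices include both endpoints: the boundary row lands in both halves
train_loc, test_loc = s.loc[:boundary], s.loc[boundary:]
# Position-based slices exclude the stop position: the halves are disjoint
train_iloc, test_iloc = s.iloc[:split_index], s.iloc[split_index:]

print(len(train_loc), len(test_loc))     # 9 2
print(len(train_iloc), len(test_iloc))   # 8 2
print(train_loc.index.intersection(test_loc.index))
```

A shared row means the model is trained on an observation it is later scored on, and it also shifts the forecast window by one step relative to the test targets.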

In [None]:
print(train_y.dtypes)
print(train_X.dtypes)

In [None]:
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Fit SARIMAX model
sarimax_model = SARIMAX(
    train_y,
    exog=train_X,
    order=(1, 0, 0),
    seasonal_order=(1, 0, 0, 48),
    enforce_stationarity=False,
    enforce_invertibility=False,
)
sarimax_results = sarimax_model.fit(disp=False)

print("✅ SARIMAX model fitted.")

### Training the SARIMAX Model

- `SARIMAX(...)`: Initializes a **Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors** model with the following parameters:
  - `order=(1, 0, 0)`: Standard ARIMA order:
    - AR (p=1): One lag of the dependent variable
    - I (d=0): No differencing (assumes stationarity)
    - MA (q=0): No moving average component
  - `seasonal_order=(1, 0, 0, 48)`: Seasonal component:
    - Seasonal AR with 1 lag
    - Seasonal period = 48 (assumes 30-minute data, so 48 steps = 1 day)
  - `exog=train_X`: Includes external features (regressors) during training
  - `enforce_stationarity=False`, `enforce_invertibility=False`: These relax constraints during estimation to allow more flexible model fitting.

- `sarimax_model.fit(disp=False)`: Fits the model to the training target (`train_y`) and its exogenous inputs, suppressing output display.

Once fitted, the model learns both short-term and seasonal patterns in the target series while accounting for influence from external variables.
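
Written out, and assuming statsmodels' default treatment of `exog` as a regression with SARIMA errors, the specification above corresponds to the following model. The symbols are notation introduced here for illustration: $x_t$ is the row of exogenous features at time $t$, $\beta$ the regression coefficients, $L$ the lag operator, $\phi_1$ the AR coefficient, $\Phi_1$ the seasonal AR coefficient, and $\varepsilon_t$ white noise.

$$
y_t = \beta^\top x_t + u_t, \qquad (1 - \phi_1 L)\,(1 - \Phi_1 L^{48})\,u_t = \varepsilon_t
$$

Expanding the product shows that the residual $u_t$ depends on its values 1, 48, and 49 steps back: $u_t = \phi_1 u_{t-1} + \Phi_1 u_{t-48} - \phi_1 \Phi_1 u_{t-49} + \varepsilon_t$.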

In [None]:
# Predict using SARIMAX
sarimax_forecast = sarimax_results.predict(start=len(train_y), end=len(train_y) + len(test_y) - 1, exog=test_X)

### Making Forecasts with the SARIMAX Model

- `sarimax_results.predict(...)`: Generates predictions using the trained SARIMAX model on the **test period**.

Parameters:
- `start=len(train_y)`: Specifies the starting index for forecasting—immediately after the training data ends.
- `end=len(train_y) + len(test_y) - 1`: Ends the forecast at the last observation of the test set, so the forecast has exactly `len(test_y)` steps.
- `exog=test_X`: Supplies the corresponding exogenous features for the forecast period.

This step produces a time-aligned forecast of power consumption, incorporating both past values and external factors like region, hour, and solar installation status.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Calculate metrics
sarimax_mae = mean_absolute_error(test_y, sarimax_forecast)
sarimax_rmse = np.sqrt(mean_squared_error(test_y, sarimax_forecast))

print("📊 SARIMAX Evaluation Results:")
print(f"🔹 MAE:  {sarimax_mae:.3f} kWh")
print(f"🔹 RMSE: {sarimax_rmse:.3f} kWh")

### LSTM MODEL

In [None]:
# Make a copy to preserve the original
df_lstm = df.copy()

# Ensure the timestamp is the index (only needed if 'datetime' is still a column)
if 'datetime' in df_lstm.columns:
    df_lstm = df_lstm.set_index('datetime')

# Encode Categorical and boolean values
# Convert boolean columns to integers
df_lstm['ev_owner'] = df_lstm['ev_owner'].astype(int)
df_lstm['solar_installed'] = df_lstm['solar_installed'].astype(int)

# One-hot encode region and property_type
region_dummies = pd.get_dummies(df_lstm['region'], prefix='region')
property_dummies = pd.get_dummies(df_lstm['property_type'], prefix='property')

# Concatenate with the original DataFrame
df_lstm = pd.concat([df_lstm, region_dummies, property_dummies], axis=1)

In [None]:
# Create cyclical features from the hour of day

# Derive hour_of_day from the datetime index (overwrites any existing column)
df_lstm['hour_of_day'] = df_lstm.index.hour

# Create sine and cosine transforms to capture daily cycles
df_lstm['hour_sin'] = np.sin(2 * np.pi * df_lstm['hour_of_day'] / 24)
df_lstm['hour_cos'] = np.cos(2 * np.pi * df_lstm['hour_of_day'] / 24)

In [None]:
# Select features and target
feature_cols = [
    'temperature_c',
    'ev_owner', 'solar_installed',
    'hour_sin', 'hour_cos',
    'region_east', 'region_north', 'region_south', 'region_west',
    'property_commercial', 'property_residential'
]
target_col = 'power_consumption_kwh'

# Drop NA values (if any)
df_lstm = df_lstm.dropna(subset=feature_cols + [target_col])

In [None]:
# Scaling
from sklearn.preprocessing import StandardScaler

# Features
X = df_lstm[feature_cols].values
# Target
y = df_lstm[target_col].values

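# Fit the scalers on the first 80% of rows only (the training period),
# so that test-period statistics do not leak into the scaling
n_fit = int(len(X) * 0.8)

scaler_X = StandardScaler().fit(X[:n_fit])
X_scaled = scaler_X.transform(X)

scaler_y = StandardScaler().fit(y[:n_fit].reshape(-1, 1))
y_scaled = scaler_y.transform(y.reshape(-1, 1))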

In [None]:
# Create LSTM Sequences
def create_sequences(X, y, seq_length):
    Xs, ys = [], []
    for i in range(len(X) - seq_length):
        Xs.append(X[i:i+seq_length])
        ys.append(y[i+seq_length])
    return np.array(Xs), np.array(ys)

seq_length = 24  # Use the last 24 intervals (12 hours at 30-minute resolution) to predict the next step

X_seq, y_seq = create_sequences(X_scaled, y_scaled, seq_length)

# Train/test split (80/20)
split_idx = int(len(X_seq) * 0.8)
X_train, X_test = X_seq[:split_idx], X_seq[split_idx:]
y_train, y_test = y_seq[:split_idx], y_seq[split_idx:]
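
To see what the windowing produces, the sketch below runs the same `create_sequences` logic on a tiny made-up array. The toy shapes and values are assumptions for illustration only.

```python
import numpy as np

def create_sequences(X, y, seq_length):
    """Pair each window of `seq_length` feature rows with the target that follows it."""
    Xs, ys = [], []
    for i in range(len(X) - seq_length):
        Xs.append(X[i:i + seq_length])   # feature rows i .. i+seq_length-1
        ys.append(y[i + seq_length])     # target at the step right after the window
    return np.array(Xs), np.array(ys)

X_toy = np.arange(20).reshape(10, 2)        # 10 time steps, 2 features
y_toy = np.arange(10).reshape(-1, 1) * 10   # targets 0, 10, ..., 90

X_seq_toy, y_seq_toy = create_sequences(X_toy, y_toy, seq_length=3)
print(X_seq_toy.shape)   # (7, 3, 2) -> (samples, time steps, features)
print(y_seq_toy.shape)   # (7, 1)
print(X_seq_toy[0][:, 0], "->", y_seq_toy[0])
```

Each sample holds only feature rows from before the predicted step. The features of the predicted step itself and past target values are not part of the window.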

In [None]:
# Build and Train LSTM

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

model = Sequential([
    LSTM(64, input_shape=(seq_length, X_train.shape[2]), return_sequences=False),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dense(1)
])


model.compile(optimizer='adam', loss='mse')
history = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.1, verbose=1)

In [None]:
# Evaluate the Model

from sklearn.metrics import mean_absolute_error, mean_squared_error

# Predict
y_pred_scaled = model.predict(X_test)
y_pred = scaler_y.inverse_transform(y_pred_scaled)
y_test_inv = scaler_y.inverse_transform(y_test)

lstm_mae = mean_absolute_error(y_test_inv, y_pred)
lstm_rmse = np.sqrt(mean_squared_error(y_test_inv, y_pred))

print(f"LSTM MAE: {lstm_mae:.3f} kWh")
print(f"LSTM RMSE: {lstm_rmse:.3f} kWh")

In [None]:
# Plot Predictions

import matplotlib.pyplot as plt

plt.figure(figsize=(12,4))
plt.plot(y_test_inv, label='Actual')
plt.plot(y_pred, label='Predicted')
plt.title('LSTM: Actual vs Predicted Power Consumption')
plt.xlabel('Time step')
plt.ylabel('kWh')
plt.legend()
plt.show()

#### USING TRADITIONAL RANDOM FOREST MODEL

In [None]:
# Make a copy (df is the original DataFrame with a datetime index)
df_model = df.copy()

# Confirm the datetime index is set
df_model.index = pd.to_datetime(df_model.index)

In [None]:
df_model['lag_30min'] = df_model['power_consumption_kwh'].shift(1)
df_model['lag_1h'] = df_model['power_consumption_kwh'].shift(2)  # 1 hour = 2 x 30 mins

In [None]:
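# Shift by one step first so each window ends at the previous reading;
# rolling directly on the target would include the value being predicted
df_model['rolling_avg_1h'] = df_model['power_consumption_kwh'].shift(1).rolling(2).mean()
df_model['rolling_avg_2h'] = df_model['power_consumption_kwh'].shift(1).rolling(4).mean()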
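
When rolling features are built from the target itself, the window has to end at the previous step. The toy sketch below (made-up numbers) shows what goes wrong otherwise: a rolling mean that includes the current value, combined with a one-step lag, is enough to rebuild the target exactly.

```python
import pandas as pd

y = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0], name="power_consumption_kwh")

lag_1 = y.shift(1)
leaky_avg = y.rolling(2).mean()           # window = current and previous value
safe_avg = y.shift(1).rolling(2).mean()   # window = the two previous values

# leaky_avg = (y_t + y_{t-1}) / 2, so y_t = 2 * leaky_avg - lag_1
reconstructed = 2 * leaky_avg - lag_1
print((reconstructed.dropna() == y.iloc[1:]).all())

print(pd.DataFrame({"y": y, "leaky_avg": leaky_avg, "safe_avg": safe_avg}))
```

A model given both leaky features can reach near-zero test error without learning anything that would hold at prediction time, when the current reading is not yet known.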

In [None]:
# Ensure 'hour_of_day' and 'date' already exist
df_model['is_weekend'] = df_model.index.weekday >= 5  # Saturday=5, Sunday=6

# Sine and cosine encoding for cyclical hour
df_model['hour_sin'] = np.sin(2 * np.pi * df_model['hour_of_day'] / 24)
df_model['hour_cos'] = np.cos(2 * np.pi * df_model['hour_of_day'] / 24)


In [None]:
df_model = pd.get_dummies(df_model, columns=['property_type', 'region'], drop_first=False)

In [None]:
df_model = df_model.dropna()

In [None]:
target = 'power_consumption_kwh'

features = [
    'lag_30min', 'lag_1h',
    'rolling_avg_1h', 'rolling_avg_2h',
    'hour_of_day', 'is_weekend',
    'hour_sin', 'hour_cos',
    'temperature_c', 'ev_owner', 'solar_installed',
    'property_type_commercial', 'property_type_residential',
    'region_north', 'region_south', 'region_east', 'region_west'
]

X = df_model[features]
y = df_model[target]

In [None]:
split_index = int(len(df_model) * 0.8)

X_train = X.iloc[:split_index]
y_train = y.iloc[:split_index]

X_test = X.iloc[split_index:]
y_test = y.iloc[split_index:]

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

y_pred = model.predict(X_test)

rf_mae = mean_absolute_error(y_test, y_pred)
rf_rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # Manual RMSE

print(f"Random Forest MAE: {rf_mae:.3f} kWh")
print(f"Random Forest RMSE: {rf_rmse:.3f} kWh")

In [None]:
import matplotlib.pyplot as plt

# Reset index to align on the same axis if needed (especially if datetime index)
y_test_aligned = y_test.reset_index(drop=True)
y_pred_aligned = pd.Series(y_pred, index=y_test_aligned.index)

plt.figure(figsize=(14, 5))
plt.plot(y_test_aligned, label="Actual", color='blue', linewidth=2)
plt.plot(y_pred_aligned, label="Predicted", color='tomato')
plt.title("Actual vs Predicted Power Consumption (kWh)")
plt.xlabel("Sample (Time Progression)")
plt.ylabel("Power Consumption (kWh)")
plt.legend()
plt.tight_layout()
plt.show()


### Using XGBOOST Model

In [None]:
import xgboost as xgb

xgb_model = xgb.XGBRegressor(
    n_estimators=200,
    max_depth=5,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

xgb_model.fit(X_train, y_train)


In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_pred = xgb_model.predict(X_test)
xgboost_mae = mean_absolute_error(y_test, y_pred)
xgboost_rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"XGBoost MAE: {xgboost_mae:.3f} kWh")
print(f"XGBoost RMSE: {xgboost_rmse:.3f} kWh")

In [None]:
print("🔍 Forecast Performance Comparison:")
print(f"📈 Prophet         - MAE: {prophet_mae:.3f}, RMSE: {prophet_rmse:.3f}")
print(f"📉 Naïve           - MAE: {naive_mae:.3f}, RMSE: {naive_rmse:.3f}")
print(f"🕒 Seasonal Naïve  - MAE: {seasonal_mae:.3f}, RMSE: {seasonal_rmse:.3f}")
print(f"📊 SARIMAX          - MAE: {sarimax_mae:.3f}, RMSE: {sarimax_rmse:.3f}")
print(f"🌳 LSTM             - MAE: {lstm_mae:.3f}, RMSE: {lstm_rmse:.3f}")
print(f"🌳 Random Forest   - MAE: {rf_mae:.3f}, RMSE: {rf_rmse:.3f}")
print(f"📊 XGBoost          - MAE: {xgboost_mae:.3f}, RMSE: {xgboost_rmse:.3f}")
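
With this many models, it can help to compute both metrics in one place and keep them attached to the model's name, which lowers the risk of printing another model's numbers. The sketch below is one possible way to do that. `score_forecast` is a hypothetical helper, and the toy arrays are made up for illustration.

```python
import numpy as np

def score_forecast(name, y_true, y_pred):
    """Return MAE and RMSE for one model, keyed by the model's name."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    err = y_true - y_pred
    return {
        "model": name,
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
    }

# Toy check: errors are 0, 0 and -2
toy = score_forecast("toy", [1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(f"{toy['model']}: MAE={toy['mae']:.3f}, RMSE={toy['rmse']:.3f}")
```

When reading the comparison, keep in mind that the models in this notebook appear to be scored on slightly different test windows (for example, the LSTM loses rows to sequence creation and the tree models lose rows to `dropna()`), so small differences between scores should be interpreted with care.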
