In your class work, you will be collecting energy consumption data from your school and using AI to gain more insight from the data. 

In this activity, you can compare data collected from households across London to the data you've collected, and look at how a popular time-series modelling method breaks this data down into different time components. 

This notebook is linked to a dataset containing data on a sample of 5,567 London households that took part in the UK Power Networks-led Low Carbon London project between November 2011 and February 2014. 

This analysis is adapted from this notebook: https://www.kaggle.com/ryuheeeei/smart-home-energy-analysis-with-prophet/log.


# Reading in and preparing the data

We will start off by preparing our environment to run the analysis:

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import calendar

import datetime, time
from datetime import timedelta
import matplotlib.dates as mdates
from matplotlib.dates import AutoDateFormatter, AutoDateLocator
sns.set()
%matplotlib inline

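# Prophet is published as 'prophet' from version 1.0; older installs use 'fbprophet'
try:
    from prophet import Prophet
    from prophet.diagnostics import cross_validation, performance_metrics
    from prophet.plot import plot_seasonality, plot_weekly, plot_yearly
except ImportError:
    from fbprophet import Prophet
    from fbprophet.diagnostics import cross_validation, performance_metrics
    from fbprophet.plot import plot_seasonality, plot_weekly, plot_yearly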


Next, we will read in part of the publicly available dataset. The dataset consists of smart meter readings for different households across London. 

In [None]:
# Choose only 1 house by LCLid "MAC000002"
alldata = pd.read_csv("../input/smart-meters-in-london/halfhourly_dataset/halfhourly_dataset/block_0.csv")
print('A sample of household IDs:', alldata.LCLid.unique()[:10])
household = "MAC000002"

# keep only the rows for the chosen household and renumber them from 0
df = alldata[alldata["LCLid"] == household].reset_index(drop=True)
print(df.head())

This dataset has some useful info about homes (LCLid), timestamps (tstp) and energy consumption (energy(kWh/hh)). 

The timestamp data are easy for humans to read and interpret, but they need to be converted into a form that computers can work with. Python has a 'datetime' format for doing this. We will convert each of these timestamps into datetime format, then pull out different time components (day, week, month...) from these to help us visualise and understand the data. 
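
As a small illustration of what this conversion gives us, the sketch below parses a single timestamp with Python's standard datetime module and pulls out a few components. The example timestamp is made up for illustration and is assumed to follow the same layout as the tstp column.

```python
from datetime import datetime

# A made-up timestamp, assumed to follow the same layout as the tstp column
raw = "2012-10-13 18:30:00.0000000"

# Drop the trailing fractional part, then parse the text into a datetime object
parsed = datetime.strptime(raw.replace(".0000000", ""), "%Y-%m-%d %H:%M:%S")

weekday = parsed.strftime("%A")   # full weekday name
month = parsed.strftime("%B")     # full month name
# Seconds elapsed since midnight on the same day
day_seconds = parsed.hour * 3600 + parsed.minute * 60 + parsed.second

print(weekday, month, day_seconds)  # Saturday October 66600
```

Once the text has become a datetime object, questions such as "which weekday was this?" can be answered without any manual counting.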

In [None]:
# process datetime info to pull out different components
# work on an independent copy so that new columns can be added safely
df = df.copy()
# parse the timestamp text in one vectorised step (much faster than looping over rows)
df['datetime'] = pd.to_datetime(df['tstp'].str.replace('.0000000', '', regex=False), format='%Y-%m-%d %H:%M:%S')
df['date'] = df['datetime'].dt.date
df['month'] = df['datetime'].dt.strftime('%B')
df['day_of_month'] = df['datetime'].dt.strftime('%d')
df['time'] = df['datetime'].dt.strftime('%X')
df['weekday'] = df['datetime'].dt.strftime('%A')
# seconds elapsed since midnight on each reading's date
df['day_seconds'] = (df['datetime'] - df['datetime'].dt.normalize()).dt.total_seconds()

In [None]:
# order the weekdays and months correctly
# assign the columns directly so that they take on the ordered categorical type
df['weekday'] = pd.Categorical(df['weekday'], categories=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'], ordered=True)
df['month'] = pd.Categorical(df['month'], categories=calendar.month_name[1:], ordered=True)


What date range do the data cover?

In [None]:
print('Earliest date:', df.date.min())
print('Latest date:', df.date.max())

In [None]:
# set energy consumption data to numeric type
# drop the rows with missing readings; .copy() lets us add columns to the result safely
df = df[df["energy(kWh/hh)"] != "Null"].copy()
df["energy"] = df["energy(kWh/hh)"].astype("float64")

# calculate the cumulative energy use over time for each date
df["cumulative_sum"] = df.groupby('date')["energy"].cumsum()
df = df.set_index("datetime")
df.head()

What does half-hourly energy use look like over a week?

In [None]:
df.plot(y="energy", figsize=(15, 4), xlim=('2012-10-13', '2012-10-20'))

# Modelling the data

When we look at the data, we can see some regular patterns in energy usage over time. These patterns can be modelled mathematically, and used to predict energy usage into the future. 

To explore this possibility, we will be using Prophet, a popular time-series modelling Python package developed by Facebook. We will use it to model this data and predict energy usage by breaking the trends in the data down into different periodic patterns. 
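
To give a feel for what breaking data down into periodic patterns means, here is a minimal sketch that uses made-up data rather than Prophet itself. It builds a synthetic half-hourly series from a slow trend plus a daily cycle plus a weekly cycle, then recovers the daily cycle by averaging over each time of day. The amplitudes and variable names are assumptions chosen for illustration, and Prophet's own fitting procedure is more sophisticated than this.

```python
import numpy as np

# Four weeks of synthetic half-hourly readings (48 per day), made up for illustration
readings_per_day = 48
n_days = 28
t = np.arange(readings_per_day * n_days)           # reading index
day_fraction = (t % readings_per_day) / readings_per_day
week_fraction = (t % (readings_per_day * 7)) / (readings_per_day * 7)

trend = 0.20 + 0.00001 * t                         # slow upward drift
daily = 0.10 * np.sin(2 * np.pi * day_fraction)    # repeats every day
weekly = 0.03 * np.sin(2 * np.pi * week_fraction)  # repeats every week
energy = trend + daily + weekly                    # additive model: the parts simply add up

# Averaging all readings taken at the same time of day cancels the weekly cycle
# (the data cover whole weeks) and leaves the daily shape plus an offset;
# the slow trend leaks into the result only very slightly
by_time_of_day = energy.reshape(n_days, readings_per_day).mean(axis=0)
recovered_daily = by_time_of_day - by_time_of_day.mean()

print(np.round(recovered_daily[:4], 3))
```

The recovered values stay very close to the daily cycle that was put in, which is the basic idea behind separating a series into trend and seasonal parts.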

In [None]:
train_size = int(0.8 * len(df))
X_train, X_test = df[:train_size].index, df[train_size:].index
y_train, y_test = df[:train_size]["energy"].values, df[train_size:]["energy"].values

train_data = pd.concat([pd.Series(X_train), pd.Series(y_train)], axis=1, keys=["ds", "y"])
test_data = pd.concat([pd.Series(X_test), pd.Series([0]*len(y_test))], axis=1, keys=["ds", "y"])
answer_data = pd.concat([pd.Series(X_test), pd.Series(y_test)], axis=1, keys=["ds", "y"])
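
The answer_data table holds the real readings for the test period, so it can be compared with the model's predictions once a forecast exists. A minimal sketch of one such comparison, the mean absolute error, is shown below. The helper name mean_absolute_error and the example numbers are assumptions made for illustration.

```python
import numpy as np

def mean_absolute_error(actual, predicted):
    """Average size of the gap between real and predicted readings (same units as the data)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same length")
    return float(np.mean(np.abs(actual - predicted)))

# Made-up readings in kWh per half hour, for illustration only
real = [0.20, 0.50, 0.10, 0.40]
guess = [0.25, 0.30, 0.15, 0.40]
print(mean_absolute_error(real, guess))
```

Once a Prophet forecast has been made for the test timestamps, the same helper could be called as mean_absolute_error(answer_data["y"], forecast["yhat"]), since Prophet stores its predictions in the yhat column.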

Below, we build the model, including repeating (seasonal) patterns over days, weeks and years:

In [None]:
model = Prophet(daily_seasonality=True, weekly_seasonality=True, yearly_seasonality=True)
model.fit(train_data)

We can then plot some of the energy usage predicted by the model:

In [None]:
forecast = model.predict(test_data)
fig = model.plot(forecast)
forecast_start = test_data.ds[0]
axlim1 = forecast_start - timedelta(days=6)
axlim2 = forecast_start + timedelta(days=6)
plt.xlim(axlim1, axlim2)

The real data is shown here as black dots, and the model predictions are shown as a blue line, with a band of uncertainty around it. The model predicts less variability than we see in the real data, because the causes of the extreme power usage events are not captured in this data. Removing the plt.xlim line from the cell above shows the predictions over a longer time horizon.

# Exploring time components of the model

It seems that the daily trend is accompanied by a predicted drop in energy use at Christmas, and a rise in power use after mid-February. We can start to explain these predictions by pulling apart the components of the forecasting model:

In [None]:
fig = model.plot_components(forecast)

The first component in this model's breakdown of the data shows the 'trend' - the general direction the data are moving in once the seasonal components have been taken out. It looks like over the 2 years of the data collection, overall power usage was going up slightly.

Now, let's look at each of the seasonal components compared to the data. 

The model pulled out a daily pattern of fluctuating power use:

In [None]:
fig, axs = plt.subplots(figsize=(12,7))
plt.scatter(x='time', y='energy', data=df)
plt.gcf().autofmt_xdate(rotation=90)
fig.fmt_xdata = mdates.DateFormatter('%Y-%m-%d')

fig = plot_seasonality(model, 'daily')


This appears to match the real data. The model has correctly identified a dip in power consumption in the early hours of the morning and two peaks at around lunchtime and dinnertime. 
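
One way to check this numerically is to average the readings taken at each time of day and look for the lowest and highest values. The sketch below does this on a tiny made-up table that reuses the time and energy column names from above; the numbers are invented for illustration.

```python
import pandas as pd

# Two days of invented readings at four times of day
sample = pd.DataFrame({
    "time":   ["03:00:00", "12:30:00", "18:30:00", "23:00:00"] * 2,
    "energy": [0.05, 0.40, 0.55, 0.20,
               0.07, 0.36, 0.65, 0.18],
})

# Mean energy use for each time of day, across all days
daily_profile = sample.groupby("time")["energy"].mean()

quietest = daily_profile.idxmin()   # time of day with the lowest average use
busiest = daily_profile.idxmax()    # time of day with the highest average use
print(daily_profile)
print("Quietest:", quietest, "| Busiest:", busiest)
```

On the real data, the same idea would be df.groupby('time')['energy'].mean().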

Next, let's look at the weekly component the model has identified:

In [None]:
# select the energy column before averaging, so that text columns are not included in the mean
df.groupby('weekday')['energy'].mean().plot(figsize=(10,6), title="Real energy use data mean")

plot_weekly(model, weekly_start=1)

This pattern is harder to see in the data, because the differences in power usage across days are so small, but the mean power usage for each weekday fits the differences the model identified, with dips in average power use for Tuesday and Thursday, and highest use on Sunday. 

Below, we can look at energy use across the day, by day of the week. A major reason for the higher power use on Fridays, Saturdays and Sundays seems to be continued power use between lunch and dinner, while on other days this drops off in between meals. 

In [None]:
g = sns.FacetGrid(df, row="weekday", aspect=4.5, height=2, sharey=False)
g.map(sns.scatterplot, 'day_seconds', "energy")
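
The same comparison can also be made as a table of average energy use per weekday and hour. This is a sketch on randomly generated stand-in data, so the numbers themselves carry no meaning; only the datetime index and the energy column name are taken from the cells above.

```python
import numpy as np
import pandas as pd

# Stand-in data: two weeks of half-hourly readings with random values
rng = np.random.default_rng(0)
index = pd.date_range("2012-10-15", periods=48 * 14, freq="30min")
sample = pd.DataFrame({"energy": rng.uniform(0.05, 0.6, size=len(index))}, index=index)

# Rows: weekday name, columns: hour of day, values: mean energy in that slot
profile = (
    sample.groupby([sample.index.day_name(), sample.index.hour])["energy"]
    .mean()
    .unstack()
)

# Put the weekdays in calendar order rather than alphabetical order
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
profile = profile.reindex(weekday_order)
print(profile.round(2))
```

Reading along a row shows how one weekday's use changes through the day, and reading down a column compares the same hour across weekdays.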

Here we will look at the yearly trend:

In [None]:
fig, axs = plt.subplots(figsize=(10,7))
sns.stripplot(x='month', y='cumulative_sum', data=df, color='black')

fig = plot_yearly(model)

# average only the numeric columns for each month
df.groupby('month').mean(numeric_only=True)

The model has correctly identified higher power usage in the winter months and lower usage in summer. The biggest spike in power use is in March. 

# How much data do we need for a good model?

What happens if we can only collect a month's worth of data? Can we still build a good model? Let's see:

In [None]:
X_train = df.loc['2012-10-13':'2012-11-13'].index
y_train = df.loc['2012-10-13':'2012-11-13']['energy'].values
train_data = pd.concat([pd.Series(X_train), pd.Series(y_train)], axis=1, keys=["ds", "y"])

X_test = df.loc['2012-11-14':'2012-11-28'].index
y_test = df.loc['2012-11-14':'2012-11-28']['energy'].values
test_data = pd.concat([pd.Series(X_test), pd.Series([0]*len(y_test))], axis=1, keys=["ds", "y"])

model = Prophet(daily_seasonality=True, weekly_seasonality=True)
model.fit(train_data)

forecast = model.predict(test_data)
fig = model.plot_components(forecast)
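
Notice that yearly_seasonality is left out of this second model: one month of readings does not contain even a single full year, so a yearly pattern cannot be learned from it. As a rule of thumb (an assumption used here for illustration, not a call into Prophet), a repeating pattern needs at least two full cycles of data before it can be estimated. The hypothetical helper below applies that rule to the date range of a dataset; the start and end dates for the full project are approximate.

```python
from datetime import date, timedelta

# Length of one full cycle for each repeating pattern
CYCLE_LENGTHS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
    "yearly": timedelta(days=365),
}

def learnable_patterns(start, end, min_cycles=2):
    """Return the patterns whose cycle fits into the date range at least min_cycles times.

    Hypothetical helper for illustration: the two-cycle threshold is a rule of thumb.
    """
    span = date.fromisoformat(end) - date.fromisoformat(start)
    return [name for name, length in CYCLE_LENGTHS.items() if span >= min_cycles * length]

# The one-month training window used above
print(learnable_patterns("2012-10-13", "2012-11-13"))   # ['daily', 'weekly']
# Roughly the whole project period (approximate dates)
print(learnable_patterns("2011-11-01", "2014-02-28"))   # ['daily', 'weekly', 'yearly']
```

This is worth keeping in mind when collecting data at school: a few weeks of readings can reveal daily and weekly patterns, but not how energy use changes across the seasons.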

# Next steps

Next, try running the same modelling process on a different household - how do the time components change?

How well do you think these models would work to predict power usage at your school?