In this lesson, we will explore the use of 'big data' in our model. We can do this by using smart meter data collected from across London, rather than just data from the school.

In AI, more data is generally better. However, the new data needs to be representative of what you're trying to model; otherwise, you will train a model whose predictions aren't relevant to the situation you're interested in.

# Step 1 - Loading libraries and data

In [None]:
import matplotlib.pyplot as plt #Matplotlib allows us to draw graphs
import numpy as np #Numpy allows us to perform complex mathematical processes quickly
import pandas as pd #Pandas is another useful set of tools for statistics
import datetime
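try:
    from prophet import Prophet # the package has been published as 'prophet' since version 1.0
except ImportError:
    from fbprophet import Prophet # older installations still use the original name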

In [None]:
energy = pd.read_csv('../input/schoolsmartmeterdata/meter-amr-readings-1200050359109 (3).csv')
energy.head()

This time, we will also include data from other London smart meters.

The data has been read in successfully, but we need one column for the energy readings, rather than one per time point. We can use the 'melt' function to do this. 

In [None]:
energy = pd.melt(energy, id_vars=['Reading Date', 'One Day Total kWh', 'Status', 'Substitute Date'], var_name='Time', value_name="Energy")
energy.head()
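To see what `melt` does, here is a minimal sketch on a tiny made-up table. The column names mirror the school data, but the values are invented for illustration only.

```python
import pandas as pd

# A tiny made-up table in the same "wide" shape as the school data:
# one row per day, one column per half-hour reading.
wide = pd.DataFrame({
    'Reading Date': ['2018-01-01', '2018-01-02'],
    '00:00': [1.0, 1.5],
    '00:30': [2.0, 2.5],
})

# melt keeps the id_vars columns and stacks every other column into rows:
# the old column names go into 'Time' and the old cell values into 'Energy'
long = pd.melt(wide, id_vars=['Reading Date'], var_name='Time', value_name='Energy')
print(long)
```

Two days with two readings each become four rows, one per reading.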

Next, we need to combine the date and time columns into a single datetime column for Python to work with.

In [None]:
energy['Timestamp'] = pd.to_datetime(energy['Reading Date'] + " " + energy['Time'], format='%Y-%m-%d %H:%M')
print('Start of data collection: ', energy['Timestamp'].min())
print('End of data collection: ', energy['Timestamp'].max())
energy = energy.sort_values(by=['Timestamp'])

# take one year of data for simplicity (>= keeps the very first reading of 2018)
energy = energy[(energy['Timestamp'] >= '2018') & (energy['Timestamp'] < '2019')]
print('Start of data collection: ', energy['Timestamp'].min())
print('End of data collection: ', energy['Timestamp'].max())


In [None]:
london = pd.read_csv('../input/smart-meters-in-london/halfhourly_dataset/halfhourly_dataset/block_0.csv')
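# strip any trailing '.0000000' from the text so the timestamps match the format below
# (.str.replace edits part of each string; plain .replace only swaps whole values)
london['tstp'] = pd.to_datetime(london['tstp'].str.replace('.0000000', '', regex=False), format='%Y-%m-%d %H:%M:%S')
london['Timestamp'] = london['tstp'] # a second date column, which we will later move to a fake year so we can overlay data from different years
households = ["MAC000246", "MAC004387", "MAC004431"] # choose some household IDs to include in the data
london = london[london['LCLid'].isin(households)].copy() # .copy() lets us safely edit the filtered table
london.reset_index(drop=True, inplace=True)
# missing readings are stored as the text 'Null'; set them to 0 so the column can be converted to numbers
london.loc[london['energy(kWh/hh)'] == 'Null', 'energy(kWh/hh)'] = 0
london['energy(kWh/hh)'] = london['energy(kWh/hh)'].astype(float)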

# print(london.head())
print('Start of data collection: ', london['Timestamp'].min())
print('End of data collection: ', london['Timestamp'].max())

# take one year of the London data for simplicity
london = london[(london['Timestamp'] >= '2012') & (london['Timestamp'] < '2013')]


The two datasets were collected in different years (we kept 2012 for the London homes and 2018 for the school), so their dates don't overlap. To overlay them, we can switch every year to 2021.

In [None]:
london['Timestamp'] = london['Timestamp'].mask(london['Timestamp'].dt.year < 2021, london['Timestamp'] + pd.offsets.DateOffset(year=2021))
energy['Timestamp'] = energy['Timestamp'].mask(energy['Timestamp'].dt.year < 2021, energy['Timestamp'] + pd.offsets.DateOffset(year=2021))

# energy = energy[(energy['Timestamp'] > '2021-04-15') & (energy['Timestamp'] < '2021-05-15')]
# london = london[(london['Timestamp'] > '2021-04-15') & (london['Timestamp'] < '2021-05-15')]


# Step 2 - Comparing the two data sources


Let's look at one week of data from the two data sources. How does the school data compare to the household data? What might this mean for our ability to build a better model using this larger dataset?

**Note:** the two datasets are plotted on different y-axes (households on the left edge of the plot, the school on the right), because energy usage in the school is a lot higher than in an individual household.

In [None]:
energy_week = energy[(energy['Timestamp'] > '2021-05-15') & (energy['Timestamp'] < '2021-05-22')]
print(energy_week.shape)
london_week = london[(london['Timestamp'] > '2021-05-15') & (london['Timestamp'] < '2021-05-22')]
print(london_week.shape)

fig, ax1 = plt.subplots(figsize=(15, 6))
plt.xticks( rotation=25 )
ax2 = ax1.twinx()  

groups = london_week.groupby('LCLid')
for name, group in groups:
    ax1.plot(group['Timestamp'], group['energy(kWh/hh)'], label=name)
ax1.legend(title='Household energy use', loc='upper left')
ax1.set_ylim(-0.1,2.5)

# label the school line directly; 'name' here would be left over from the household loop above
ax2.plot(energy_week['Timestamp'], energy_week['Energy'], color='blue', label='School energy use')
ax2.legend(loc='upper right')
ax2.set_ylim(11,150)


# Step 3 - Building a model

We will build a forecasting model using the larger London dataset and investigate what it has learned, to see if it would be useful for predicting school energy usage. First, we split the data into training and testing sets:

In [None]:
columns = ['Timestamp', 'energy(kWh/hh)', 'LCLid']
# train on 15-29 April, then test on 30 April to 14 May (all households together)
train_data = london.loc[(london['Timestamp'] > "2021-04-15") & (london['Timestamp'] < "2021-04-30"), columns].copy()
test_data = london.loc[(london['Timestamp'] >= "2021-04-30") & (london['Timestamp'] < "2021-05-15"), columns].copy()

train_data.columns = ['ds', 'y', 'LCLid']
test_data.columns = ['ds', 'y', 'LCLid']

train_data

Next, we build the model and look at the forecast it produces. What do you notice about this forecast compared to the one from the model built on the school's data?

In [None]:
model = Prophet(daily_seasonality=True)
# model = Prophet(daily_seasonality=True, weekly_seasonality=True) # this doesn't work for so little data, but may work on the day
model.fit(train_data)

forecast = model.predict(test_data)
fig = model.plot(forecast)

Now, we can look at the different seasonal components of the model - how do these compare to the school model?

In [None]:
fig = model.plot_components(forecast)


We can plot the forecast from the London model on top of the real data from each of the homes.

In [None]:
fig, ax1 = plt.subplots(figsize=(15,8))
# rotate the date labels so they don't overlap
plt.xticks( rotation=25 )

groups = london.groupby('LCLid')
for name, group in groups:
    ax1.plot(group['Timestamp'], group['energy(kWh/hh)'], label=name)
ax1.plot(forecast['ds'], forecast['yhat'], color='red', linewidth=3, label='Forecast')
ax1.legend(title='Household energy use', loc='upper right')
ax1.set_xlim(datetime.datetime(2021, 4, 30), datetime.datetime(2021, 5, 15))
ax1.set_ylim(0,3)

Because it tries to make one general prediction for all homes, the model doesn't predict energy use well for any individual home. A more complex modelling approach is needed to do this.
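One way to put a number on this is the mean absolute error (MAE) for each home. The sketch below uses made-up readings for two hypothetical homes and a made-up shared forecast; it is not the notebook's real data. Applying the same idea to `forecast` would first need one row per timestamp (for example with `drop_duplicates('ds')`), which is an assumption about how you would adapt it.

```python
import pandas as pd

# Made-up readings for two hypothetical homes at the same two timestamps
actual = pd.DataFrame({
    'ds': pd.to_datetime(['2021-04-30 18:00', '2021-04-30 18:30'] * 2),
    'LCLid': ['home_A', 'home_A', 'home_B', 'home_B'],
    'y': [0.2, 0.4, 1.0, 1.2],
})

# A single made-up forecast that is shared by every home
predicted = pd.DataFrame({
    'ds': pd.to_datetime(['2021-04-30 18:00', '2021-04-30 18:30']),
    'yhat': [0.6, 0.8],
})

# Line each reading up with the forecast for the same timestamp
merged = actual.merge(predicted, on='ds')
merged['abs_error'] = (merged['y'] - merged['yhat']).abs()

# Average the size of the error separately for each home
mae_per_home = merged.groupby('LCLid')['abs_error'].mean()
print(mae_per_home)
```

In this toy case the forecast sits halfway between the two homes, so it is equally wrong for both of them.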

Let's see how the model's predictions compare to the school's data. 

In [None]:
fig, ax1 = plt.subplots(figsize=(12,8))
# rotate the date labels so they don't overlap
plt.xticks( rotation=25 )

ax1.plot(energy['Timestamp'], energy['Energy'], color='blue', label='School energy usage')
ax1.set_xlabel('Timestamp')
ax1.set_ylabel('Energy use')

ax1.plot(forecast['ds'], forecast['yhat'], color='red', label='Forecasted energy usage')
ax1.set_xlim(datetime.datetime(2021, 4, 30), datetime.datetime(2021, 5, 15))
ax1.legend(loc='upper right')


The raw predictions are far too low, because the model was trained on household data and a school uses much more energy than a single home. What if we set this difference in scale aside and just look at the usage pattern?

In [None]:
fig, ax1 = plt.subplots(figsize=(15,8))
# rotate the date labels so they don't overlap
plt.xticks( rotation=25 )
# set up the 2nd axis
ax2 = ax1.twinx()  

# energy_single = energy[energy['Reading Date'].str.contains("2019")]

# ax1.plot(energy_single['Timestamp'], energy_single['Energy'], color='blue')
ax1.plot(energy['Timestamp'], energy['Energy'], color='blue', label='School energy usage')
ax1.set_xlabel('Timestamp')
ax1.set_ylabel('Energy use')
ax1.set_xlim(datetime.datetime(2021, 5, 1), datetime.datetime(2021, 5, 15))

# the only difference here is giving the forecast data its own y-axis, so it can lie over the school data
ax2.plot(forecast['ds'], forecast['yhat'], color='red', label='Forecasted energy usage')
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')


The model predicts more consistent energy usage each day, and predicts peak energy usage in the evening instead of during school time. 
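The timing of the peak can also be checked with numbers rather than by eye. This is a minimal sketch on made-up readings (the values are illustrative only): it averages energy use for each hour of the day and reports the hour with the highest average. The same steps could be tried on `energy` or on `forecast`, using the `ds` and `yhat` columns instead.

```python
import pandas as pd

# Made-up readings for one day (values are illustrative only)
toy = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2021-05-01 08:00', '2021-05-01 12:00',
                                 '2021-05-01 12:30', '2021-05-01 19:00']),
    'Energy': [20.0, 90.0, 110.0, 30.0],
})

# Average the readings that fall in the same hour of the day
hourly_mean = toy.groupby(toy['Timestamp'].dt.hour)['Energy'].mean()

# idxmax returns the hour (the index label) with the highest average
peak_hour = hourly_mean.idxmax()
print(peak_hour)
```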