In this lesson, we will use the smart meter data to train a forecasting model and test how well it works.

# Step 1 - Loading libraries and data

In [None]:
import matplotlib.pyplot as plt #Matplotlib allows us to draw graphs
import numpy as np #Numpy allows us to perform complex mathematical processes quickly
import pandas as pd #Pandas is another useful set of tools for statistics
import datetime
from fbprophet import Prophet

In [None]:
energy = pd.read_csv('../input/schoolsmartmeterdata/meter-amr-readings-1200050359109 (3).csv')
energy.head()

The data has been read in successfully, but it has one column per time of day. We need a single column for the energy readings, with one row per reading. We can use the 'melt' function to do this.

In [None]:
energy = pd.melt(energy, id_vars=['Reading Date', 'One Day Total kWh', 'Status', 'Substitute Date'], var_name='Time', value_name="Energy")
energy.head()
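
To make the reshaping step easier to picture, here is a minimal sketch of 'melt' on a tiny made-up table. The column names `00:00` and `00:30` and the values are assumptions for illustration only and are not taken from the smart meter file.

```python
import pandas as pd

# Hypothetical wide table: one row per day, one column per half-hour reading
wide = pd.DataFrame({
    'Reading Date': ['2018-01-01', '2018-01-02'],
    '00:00': [1.0, 1.5],
    '00:30': [2.0, 2.5],
})

# Keep 'Reading Date' as an identifier and stack the time columns into rows
long = pd.melt(wide, id_vars=['Reading Date'], var_name='Time', value_name='Energy')
print(long)
```

Each of the two days now appears once per time column, so the two-row wide table becomes a four-row long table with 'Reading Date', 'Time' and 'Energy' columns.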

Next, we need to combine the date and time columns into a single datetime column for Python to work with.

In [None]:
energy['Timestamp'] = pd.to_datetime(energy['Reading Date'] + " " + energy['Time'], format='%Y-%m-%d %H:%M')
print('Start of data collection: ', energy['Timestamp'].min())
print('End of data collection: ', energy['Timestamp'].max())
energy = energy.sort_values(by=['Timestamp'])

# Step 2 - Preparing the data for modelling

We need to break the data into training and testing, to see how well the forecasting algorithm works.

In [None]:
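# Use half-open date ranges so that a reading stamped exactly at midnight on the first day of a month is not dropped
in_train = (energy['Timestamp'] >= "2018-01-01") & (energy['Timestamp'] < "2018-02-01")
in_test = (energy['Timestamp'] >= "2018-02-01") & (energy['Timestamp'] < "2018-03-01")
# Prophet expects the columns to be named 'ds' (timestamp) and 'y' (value)
train_data = energy.loc[in_train, ['Timestamp', 'Energy']].rename(columns={'Timestamp': 'ds', 'Energy': 'y'})
test_data = energy.loc[in_test, ['Timestamp', 'Energy']].rename(columns={'Timestamp': 'ds', 'Energy': 'y'})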
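
One detail worth checking in a date-based split is how the boundary timestamps are handled. The sketch below uses synthetic half-hourly data and a hypothetical helper, `split_by_date` (not part of pandas or Prophet), to show a split where every reading lands in exactly one of the two sets.

```python
import pandas as pd

# Synthetic half-hourly readings covering January and February 2018
readings = pd.DataFrame({'ds': pd.date_range('2018-01-01', '2018-02-28 23:30', freq='30min')})
readings['y'] = range(len(readings))

def split_by_date(df, start, split, end):
    """Return (train, test) with train in [start, split) and test in [split, end)."""
    train = df[(df['ds'] >= start) & (df['ds'] < split)]
    test = df[(df['ds'] >= split) & (df['ds'] < end)]
    return train, test

train, test = split_by_date(readings, '2018-01-01', '2018-02-01', '2018-03-01')
print(len(train), len(test), len(readings))
```

Because the training range includes its start and excludes its end, and the test range starts where training stops, the two sets together account for every row and never overlap.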

# Step 3 - Training a forecasting model

Here, we will use a forecasting package called Prophet to train a model to predict energy usage. We will leave out yearly seasonality, since one month of training data is not enough to learn how energy use changes over a year.

We can use the model to predict future energy use.

In [None]:
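# yearly_seasonality=False makes the choice to skip yearly modelling explicit
model = Prophet(daily_seasonality=True, weekly_seasonality=True, yearly_seasonality=False)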
model.fit(train_data)

forecast = model.predict(test_data)
fig = model.plot(forecast)


We can also look at the different seasonality patterns in the model.

In [None]:
fig = model.plot_components(forecast)

We can compare the forecasted energy use values to the real energy use for the same time period. First, we can use a scatter plot to view the correlation between forecasted and real values. We can add a diagonal line to show where points would fall if the predictions were perfect.

In [None]:
plt.scatter(x=test_data['y'], y=forecast['yhat'], alpha=0.5)
plt.plot([-20,140],[-20,140], ls="--", c=".3") 
plt.xlim(-20,140)
plt.ylim(-20,140)
plt.xlabel('Real energy use')
plt.ylabel('Predicted energy use')
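
Alongside the scatter plot, the agreement between real and predicted values can be summarised as a single number. The sketch below computes two common error measures, mean absolute error (MAE) and root mean squared error (RMSE), on made-up values. The helper name `forecast_errors` is hypothetical, and neither measure is used elsewhere in this lesson.

```python
import numpy as np

def forecast_errors(actual, predicted):
    """Return (MAE, RMSE) for two equal-length sequences of values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = predicted - actual
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    return mae, rmse

# Made-up values, only to show the calculation
mae, rmse = forecast_errors([20, 40, 100], [30, 40, 80])
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```

In this notebook, the same helper could be called with `test_data['y']` and `forecast['yhat']`, assuming both are in the same time order and contain no missing values.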

The predictions are similar to the real values, but with a few key differences: 
1. Many real energy usage values sit at the minimum usage value (around 20), but have a range of predicted values (0-60).
2. A small number of values are over-estimated by the model (real usage ~40, predicted usage 80-100).
Why do these differences happen?

We can plot real and predicted energy use over time to explore this. 

In [None]:
fig, ax = plt.subplots(figsize=(15, 6))

# Plot real and predicted usage on the same axes so they share one scale
ax.plot(test_data['ds'], test_data['y'], color='blue', label='Real energy usage')
ax.plot(forecast['ds'], forecast['yhat'], color='red', label='Predicted energy usage')
ax.set_xlabel('Timestamp')
ax.set_ylabel('Energy use')
ax.set_ylim(-10, 150)
# rotate the date labels so they don't overlap
ax.tick_params(axis='x', labelrotation=25)
ax.legend(loc='upper left')

Now, we can explain these errors:
1. The model tracks changes in average usage across the different seasonality patterns (daily, weekly, and so on), but it doesn't track changes in the **range** of values over time. In this data, energy usage is lower on weekends because the daily peak is lower: usage spans 20-120 on weekdays but only 20-40 on weekends. The best the model can do is predict the mean energy use at each time with an average range (around 70), so its daily swing is too small for weekdays and too large for weekends. See the plot below for a clearer illustration of this pattern.
2. One week in the data shows lower energy use than other weeks, which the model can't anticipate with the information it has. Having data from previous years could help it guess when holidays will occur.

In [None]:
plt.figure(figsize=(15, 6))
plt.plot(forecast['ds'], forecast['yhat'], color='red')
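
The weekday and weekend ranges described in point 1 can also be checked numerically. The sketch below builds synthetic hourly data that mimics the pattern (a base load of 20, with a daytime peak of 120 on weekdays and 40 on weekends) and summarises the range for each type of day. The data and the 09:00-17:00 peak window are assumptions for illustration, not values from the school's meter.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data for two weeks, starting on a Monday
ds = pd.date_range('2018-02-05', periods=14 * 24, freq='60min')
is_weekend = ds.dayofweek >= 5
in_peak_hours = (ds.hour >= 9) & (ds.hour < 17)

# Base load of 20, plus a daytime peak that is smaller on weekends
peak = np.where(is_weekend, 20, 100)
usage = pd.DataFrame({'ds': ds, 'y': 20 + np.where(in_peak_hours, peak, 0)})

# Label each reading and compare the spread of values per day type
usage['day_type'] = np.where(usage['ds'].dt.dayofweek >= 5, 'weekend', 'weekday')
summary = usage.groupby('day_type')['y'].agg(['min', 'max'])
summary['range'] = summary['max'] - summary['min']
print(summary)
```

Applying the same grouping to `test_data` and to `forecast['yhat']` would show whether the predicted range differs between weekdays and weekends as much as the real one does.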

How about we compare energy usage over the same days in late May (18-26 May) in 2018, 2019, 2020 and 2021?

In [None]:
energy['Year'] = energy['Timestamp'].dt.year

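# Work on a copy so that shifting the timestamps does not change the original 'energy' table
energy_month = energy.copy()
# Move every reading from before 2021 onto the same dates in 2021 so the years can be overlaid
energy_month['Timestamp'] = energy_month['Timestamp'].mask(energy_month['Timestamp'].dt.year < 2021, energy_month['Timestamp'] + pd.offsets.DateOffset(year=2021))
energy_month = energy_month[(energy_month['Timestamp'] > "2021-05-18") & (energy_month['Timestamp'] < "2021-05-26")]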

fig, ax1 = plt.subplots(figsize=(15,8))
plt.xticks( rotation=25 )

groups = energy_month.groupby('Year')
for name, group in groups:
    ax1.plot(group['Timestamp'], group['Energy'], label=name)
ax1.legend(title='Year', loc='upper left')
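
The year alignment above relies on `DateOffset(year=2021)`, which replaces the year rather than adding to it. The sketch below shows this on two made-up timestamps, and also prints the day names before and after the shift.

```python
import pandas as pd

# Two made-up readings taken on the same calendar date in different years
stamps = pd.Series(pd.to_datetime(['2018-05-20 10:30', '2020-05-20 10:30']))

# DateOffset(year=2021) replaces the year; DateOffset(years=3) would add three years
offset = pd.DateOffset(year=2021)
aligned = stamps.map(lambda ts: ts + offset)

print(aligned.tolist())
print(stamps.dt.day_name().tolist(), aligned.dt.day_name().tolist())
```

Replacing the year keeps the calendar date, not the day of the week, so a weekend in one year can be plotted on top of a weekday in another. This is worth keeping in mind when comparing the overlaid lines.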
