# Intro

This is the code portion of my Politecnico di Milano M.Sc. thesis on electric load forecasting. It's a work in progress but will be complete around Dec 2020. 

# To Do

Lit 
- ~Review~

Courses
- ~Udacity - Intro to TensorFlow for Deep Learning~

Books
- Géron 2019 - Hands on Machine Learning

Viz
- ~Daily and weekly plots~
- ~Hist~

Data Prep
- ~Generalized ESD for outliers~ Doesn't seem to work (why?)
- ~Basic stats analysis: autocorrelation, Augmented Dickey-Fuller~
- ~Z-score for outlier detection~
- ~Outlier replacement~ Same day different year
- ~Normalize data~
- ~Diff data for ANN~

Models
- ~ANN, CNN, and LSTM~
- ~Predict [t+1 .. t+24h] data points (4x24=96)~ Too complex
- ~Only predict one horizon data pt~
- Ensemble: lower error, compare stdev of predictions
- Consider nowcast
- Grid/random search on hyperparameters
- Bi-directional LSTM?

Feature Engineering
- ~Empirical mode decomposition~
- ~Feature-timeseries cross correlation~
- ~Triangle time-of-day index~
- Day of wk, holidays, etc
- ~LSTM past 2-3 days~
- Exogenous: temp, irradiance, windspeed (check, don't *need* to include)

Metrics
- ~RMSE to penalize larger errors~
- Benchmark: naïve persistence
- Accuracy?

Training
- ~Set-seed~
- Try training on less data to start (3 months?) 
- Cross-validation

# Dependencies

Note that the `emd` package is not installed by default and is lost whenever the session is refreshed. Install it with `pip install emd` in the cell below.

In [None]:
pip install emd

In [None]:
import sys 
import warnings
import numpy as np
from numpy import log
import pandas as pd
import itertools
import datetime as dt
import calendar

import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
import plotly.figure_factory as ff
import pylab
import seaborn as sns # used for plot interactive graph.

#from numpy.random import seed

from scipy import signal
from scipy import stats
from scipy.stats import randint

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler # for standardization of data
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline # pipeline making
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectFromModel
from sklearn import metrics # for the check the error and accuracy of the model
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.cluster import KMeans

from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import statsmodels.api as sm

import keras
from keras import optimizers
from keras.utils import plot_model
from keras.models import Sequential, Model
from keras.layers.convolutional import Conv1D, MaxPooling1D
from keras.layers import Dense, LSTM, RepeatVector, TimeDistributed, Flatten
from keras.utils import to_categorical
from keras.optimizers import SGD 
from keras.callbacks import EarlyStopping
from keras.utils import np_utils
from keras.layers import LSTM
from keras.layers import Dropout

import tensorflow as tf
#from tensorflow.random import set_seed #as tf_set_random_seed

%matplotlib inline
warnings.filterwarnings("ignore")
init_notebook_mode(connected=True)


#############################################
#
# A Kaggle kernel requires 'pip install emd'
#
#############################################
import emd

## Configuration

In [None]:
plt.style.use('dark_background')
plt.rcParams['axes.prop_cycle']
pd.set_option('display.precision', 2)

np.random.seed(42)
tf.random.set_seed(42)
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

# Import Data

In [None]:
# read csv into dataframe
df = pd.read_csv('../input/hotel-load-and-solar/hotel_load_and_solar_2016-05-19_2020-09-21.csv', parse_dates=['Datetime'], index_col=['Datetime'])
print(df.describe())
df

## First Data Clean

Check for null data

In [None]:
df.isnull().sum()

The solar PV connection is located behind the utility consumption meter, so we will assume all solar PV production is self consumed (the array is small compared to the load) and therefore the estimated native load (without solar) is `Meter (kW) + Solar (kW)`.

In [None]:
df.columns = ['meter','solar'] # convenient renaming (all units in kW)
df[df<.001] = 0
df['load'] = df['meter'].values + df['solar'].values

# Viz

## One Week

In [None]:
plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(df['2016-11-14':'2016-11-20'])
plt.legend(['Meter (kW)','Solar (kW)','Load (kW)'])

## All Data, Resampled Daily

The hotel shut down operations (due to COVID-19) in mid-March 2020. Also notice the bad data in 2017.

In [None]:
dfds = df.resample('D').mean()

plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(dfds['load'],label='Load (kW)')
plt.legend()

## Plot Every Day, Overlaid

Note three areas of outlier-seeming data:
1. A very small amount of data, tightly grouped, between 0 kW and ~400 kW
2. A medium amount of less-tightly grouped data around ~500 kW
3. A barely visible set of daily peaks in the afternoon and evening > 1750 kW 

In [None]:
delta = 4*24
t_begin = 0
t_end = 4*24

L = df.shape[0]
n = int(L/delta) - 1 # keeps from getting too close to the end

d = df['load'][t_begin:t_end].values.reshape(delta,1)

for i in range(n):
    t_begin += delta
    t_end += delta
    d_new = df['load'][t_begin:t_end].values.reshape(delta,1)
    d = np.concatenate([d,d_new],axis=1)


plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(d,alpha=0.04)
plt.xlabel('Timesteps [15 min]')
plt.ylabel('Load (kW)')
plt.title('Daily Data (Beginning at 0:00)')
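
The loop above builds the day-by-day matrix one column at a time. As a rough sketch, the same folding can be done with a single `reshape`. This runs on a synthetic series rather than the hotel data, assumes the series starts at 0:00, and the helper name `fold_into_periods` is made up for illustration.

```python
import numpy as np

def fold_into_periods(values, period):
    """Fold a 1-D series into a (period, n_periods) matrix, one column per period.

    Trailing samples that do not fill a whole period are dropped.
    """
    values = np.asarray(values)
    n_periods = len(values) // period
    return values[:n_periods * period].reshape(n_periods, period).T

# synthetic example: 3 full "days" of 96 quarter-hour samples plus 10 leftovers
series = np.arange(3 * 96 + 10)
daily = fold_into_periods(series, 96)
print(daily.shape)
```

Each column of `daily` can then be passed to `plt.plot` in one call, as in the cell above. With `period=4*24*7` the same helper would fold by week.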

## Plot Every Week, Overlaid

In [None]:
delta = 4*24*7
t_begin = 4*24*6 # first day of data is monday, so begin graphing on a sunday
t_end = t_begin + delta

L = df.shape[0]
n = int(L/delta) - 2 # keeps from getting too close to the end

d = df['load'][t_begin:t_end].values.reshape(delta,1)

for i in range(n):
    t_begin += delta
    t_end += delta
    d_new = df['load'][t_begin:t_end].values.reshape(delta,1)
    d = np.concatenate([d,d_new],axis=1)


plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(d,alpha=0.08)
plt.xlabel('Timesteps [15 min]')
plt.ylabel('Load (kW)')
plt.title('Weekly Data (Beginning on Sunday)')

# Clean and Standardize Data

First, we know we don't want the covid-affected data (hotel closed down in March 2020)

In [None]:
df = df[:'2020-01-31']

Second, an error was identified in the solar data. Note how on 2016-11-12 the solar PV power starts increasing after midday and continues well into the night, which is impossible. The same happens in the previous few weeks, so we will just throw out all the data before 2016-11-14.

In [None]:
plt.figure(num=None, figsize=(10, 5), dpi=80)
plt.plot(df['2016-11-12':'2016-11-13']['solar'])
plt.ylabel('Power (kW)')
plt.title('Solar Bad Data Example')
df = df['2016-11-14':]

Now we don't need the meter and solar data anymore

In [None]:
df.drop('meter',axis=1, inplace=True)
df.drop('solar',axis=1, inplace=True)

## Outlier Detection

Now use the z-score method to find the remaining outlier data

In [None]:
threshold = 3

z = np.abs(stats.zscore(df)) # vector of z-scores (absolute values of..)
zi = np.where(z>threshold)[0] # array of outlier indices
print ('Number of z-score outliers: ',len(zi))
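
For reference, a minimal sketch of what the z-score rule does, written out with NumPy on synthetic numbers (not the hotel data). The function name is hypothetical. The population standard deviation is used, which is what `scipy.stats.zscore` computes by default.

```python
import numpy as np

def zscore_outlier_indices(x, threshold=3.0):
    """Return the positions whose absolute z-score exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    z = np.abs((x - x.mean()) / x.std())  # population std (ddof=0)
    return np.where(z > threshold)[0]

# synthetic example: a noisy signal around 1000 kW with one injected spike
rng = np.random.default_rng(0)
x = rng.normal(loc=1000.0, scale=50.0, size=1000)
x[123] = 5000.0
idx = zscore_outlier_indices(x)
print(idx)
```

One caveat of this rule is that large outliers inflate the standard deviation themselves, so smaller anomalies can hide behind them.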

Visually locate any cluster of outliers (if at all). Indeed we see 7 independent clusters.

In [None]:
plt.figure(num=None, figsize=(10, 5), dpi=80)
plt.hist(zi, bins=range(int(min(zi)),int(max(zi)),10),color = "skyblue", ec="skyblue")
plt.xlabel('Index of outlier data')
plt.ylabel('Occurrences')

Or to automate this we can use a simple clustering algorithm.

In [None]:
km = KMeans(n_clusters=7,random_state=0).fit(zi.reshape(-1,1))
cc = km.cluster_centers_
print('Mean index of the outlier clusters: \n\n', cc,'\n')

for i in range(cc.shape[0]):
    print('Cluster %d mean datetime: %s' % (i,df.index[int(cc[i, 0])]))

## Outlier Replacement

Locate the outlier data and choose data from the same day-of-year of a different year or the same day-of-week from a nearby week.
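
A minimal sketch of the same-window, different-year replacement, on a synthetic two-year series. The helper `replace_from_other_year` is hypothetical. It assumes both windows contain the same number of samples, so a window spanning 29 February would need extra care.

```python
import numpy as np
import pandas as pd

def replace_from_other_year(df, column, start, end, years=1):
    """Overwrite df[column] between start and end with the values from the
    same calendar window shifted by `years` (positive = later year)."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    offset = pd.DateOffset(years=years)
    donor = df.loc[start + offset:end + offset, column].to_numpy()
    # .loc with a (rows, column) pair writes into df itself, not a temporary copy
    df.loc[start:end, column] = donor
    return df

# synthetic 15-minute series: 1.0 throughout 2017, 2.0 throughout 2018
idx = pd.date_range('2017-01-01', '2018-12-31 23:45', freq='15min')
demo = pd.DataFrame({'load': np.where(idx.year == 2017, 1.0, 2.0)}, index=idx)
demo = replace_from_other_year(demo, 'load', '2017-08-07 07:00', '2017-08-14 11:00')
```

Only the selected 2017 window now holds the 2018 values. The rows just before and after it are untouched.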

Cluster 0

In [None]:
plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(df['2017-8-6':'2017-8-15']['load'].values,label='2017')
plt.plot(df['2018-8-6':'2018-8-15']['load'].values,label='2018')
df.loc['2017-8-7 7:00':'2017-8-14 11:00', 'load'] = df.loc['2018-8-7 7:00':'2018-8-14 11:00', 'load'].values # .loc avoids assigning to a temporary copy
plt.plot(df['2017-8-6':'2017-8-15']['load'].values,label='2017 fixed')
plt.legend()

Cluster 6

In [None]:
plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(df['2018-08-23':'2018-08-25'].values,label='2018')
plt.plot(df['2019-08-23':'2019-8-25'].values,label='2019')
df['2018-08-24 7:00':'2018-8-24 21:00'] = df['2019-08-24 7:00':'2019-8-24 21:00'].values
plt.plot(df['2018-08-23':'2018-08-25'].values,label='2018 fixed')
plt.legend()

Cluster 3

In [None]:
plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(df['2019-12-12':'2019-12-14'].values,label='2019')
plt.plot(df['2018-12-12':'2018-12-14'].values,label='2018') 
df['2019-12-13 10:00':'2019-12-13 20:00'] = df['2018-12-13 10:00':'2018-12-13 20:00'].values
plt.plot(df['2019-12-12':'2019-12-14'].values,label='2019 fixed')
plt.legend()

Cluster 4

In [None]:
plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(df['2017-10-22':'2017-10-24'].values,label='2017')
plt.plot(df['2018-10-22':'2018-10-24'].values,label='2018')
df.loc['2017-10-22 15:00':'2017-10-24 17:00', 'load'] = df.loc['2018-10-22 15:00':'2018-10-24 17:00', 'load'].values # .loc avoids assigning to a temporary copy
plt.plot(df['2017-10-22':'2017-10-24'].values,label='2017 fixed')
plt.legend()

Cluster 5

In [None]:
plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(df['2016-12-15':'2016-12-16'].load.values,label='2016')
plt.plot(df['2017-12-15':'2017-12-16'].load.values,label='2017')
df['2016-12-15 12:00':'2016-12-16 12:00'] = df['2017-12-15 12:00':'2017-12-16 12:00'].values
plt.plot(df['2016-12-15':'2016-12-16'].load.values,label='2016 fixed')
plt.legend()

Cluster 1

In [None]:
plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(df.loc['2019-7-8'].load.values,label='2019') # .loc selects rows by date; df['...'] may be read as a column name
plt.plot(df.loc['2018-7-8'].load.values,label='2018')
df['2019-7-8 8:00':'2019-7-8 13:00'] = df['2018-7-8 8:00':'2018-7-8 13:00'].values
plt.plot(df.loc['2019-7-8'].load.values,label='2019 fixed')
plt.legend()

Cluster 2

In [None]:
plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(df['2018-7-29':'2018-7-31'].values,label='2018')
plt.plot(df['2019-7-29':'2019-7-31'].values,label='2019')
df.loc['2018-7-30 1:00':'2018-7-30 11:00', 'load'] = df.loc['2019-7-30 1:00':'2019-7-30 11:00', 'load'].values # .loc avoids assigning to a temporary copy
plt.plot(df['2018-7-29':'2018-7-31'].values,label='2018 fixed')
plt.legend()

Having removed the outliers we can visually inspect the daily data again

In [None]:
delta = 4*24
t_begin = 0
t_end = 4*24

L = df.shape[0]
n = int(L/delta) - 1 # keeps from getting too close to the end

d = df['load'][t_begin:t_end].values.reshape(delta,1)

for i in range(n):
    t_begin += delta
    t_end += delta
    d_new = df['load'][t_begin:t_end].values.reshape(delta,1)
    d = np.concatenate([d,d_new],axis=1)


plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(d,alpha=0.04)
plt.xlabel('Timesteps [15 min]')
plt.ylabel('Load (kW)')
plt.title('Daily Data (Beginning at 0:00)')

## Normalize and Difference

Normalized (min-max scaled) data should help the neural networks learn, for instance when different features have very different magnitudes (e.g. load power up to 1800 kW, day of week up to 7). The cleaned data does not have enormous outliers, so most of the data will fit nicely in a 0-to-1 scale.

In [None]:
scaler = MinMaxScaler()

dfn = df.copy(deep=True)
dfn['load'] = scaler.fit_transform(df['load'].values.reshape(-1,1))
dfn
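
For reference, min-max scaling maps x to (x - min) / (max - min) and is undone by multiplying back and adding the minimum. A minimal NumPy sketch of both directions, on made-up load values (the function names are hypothetical, not scikit-learn's API):

```python
import numpy as np

def minmax_fit(x):
    """Return (min, max) of the data, the only two numbers the scaling needs."""
    x = np.asarray(x, dtype=float)
    return x.min(), x.max()

def minmax_transform(x, lo, hi):
    """Scale values so that lo maps to 0 and hi maps to 1."""
    return (np.asarray(x, dtype=float) - lo) / (hi - lo)

def minmax_inverse(x_scaled, lo, hi):
    """Map scaled values back to the original units."""
    return np.asarray(x_scaled, dtype=float) * (hi - lo) + lo

load_kw = np.array([600.0, 900.0, 1200.0, 1800.0])  # made-up values
lo, hi = minmax_fit(load_kw)
scaled = minmax_transform(load_kw, lo, hi)
restored = minmax_inverse(scaled, lo, hi)
print(scaled, restored)
```

Keeping `lo` and `hi` (or the fitted `scaler`) around is what later allows predictions to be reported in kW.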

Difference the normalized data

In [None]:
ddfn = dfn.diff().bfill() # without back-filling, the first row would be NaN
ddfn
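
Differencing is reversible as long as one starting level is kept. A small NumPy sketch on made-up numbers: note that `np.diff` drops the first element, whereas pandas' `.diff()` leaves a NaN there, which the cell above back-fills.

```python
import numpy as np

x = np.array([0.20, 0.25, 0.40, 0.35, 0.30])  # made-up normalized load

dx = np.diff(x)  # first differences, one element shorter than x

# the cumulative sum of the differences, offset by the first level, restores the series
x_rebuilt = np.concatenate([[x[0]], x[0] + np.cumsum(dx)])
print(x_rebuilt)
```

The same cumulative-sum idea is what turns predicted differences back into a load curve, with the caveat that prediction errors accumulate along the sum.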

# Statistical Analysis

## Normality

See the histogram for basic visual analysis of the distribution. The data is not very Gaussian. This may be an indicator of non-stationarity.

In [None]:
plt.figure(num=None, figsize=(10, 5), dpi=80)
plt.hist(df.values,bins=50)
plt.xlabel('Load (kW)')
plt.ylabel('Occurrences')

The normalized and time differenced data

In [None]:
plt.figure(num=None, figsize=(10, 5), dpi=80)
plt.hist(ddfn.values,bins=50)
plt.xlabel('Normalized, differenced load')
plt.ylabel('Occurrences')

Quantile-Quantile (QQ) Plot

In [None]:
plt.figure(num=None, figsize=(10, 5), dpi=80)

ax1 = plt.subplot(121)
res = stats.probplot(df.values.flatten(), dist="norm", plot=plt)
ax1.set_title('Load QQ')

ax2 = plt.subplot(122)
res = stats.probplot(ddfn.values.flatten(), dist="norm", plot=plt)
ax2.set_title('Load QQ normalized and time-differenced')

## Stationarity

Stationarity is achieved when the statistical properties of a timeseries do not change with time. A very simple test of this is to compute the rolling mean and variance over a large window. Note that if the data is not close to a Gaussian distribution, then mean and variance are less meaningful summary statistics.

In [None]:
df_rm = df.rolling(35041,center=True).mean() # window weights are equal
df_rv = df.rolling(35041,center=True).var() # window weights are equal


plt.figure(num=None, figsize=(20, 10), dpi=80)

plt.subplot(121)
plt.plot(df_rm['load'],label='Load (kW)')
plt.title('Rolling 1 Year Mean')
plt.legend()

plt.subplot(122)
plt.plot(df_rv['load'],label='Load (kW)')
plt.title('Rolling 1 Year Variance')
plt.legend()


Now do the same for the normalized and differenced timeseries. Note that differencing, rather than normalizing, has the larger impact on the statistical properties of the timeseries.

In [None]:
ddfn_rm = ddfn.rolling(35041,center=True).mean() # window weights are equal
ddfn_rv = ddfn.rolling(35041,center=True).var() # window weights are equal


plt.figure(num=None, figsize=(20, 10), dpi=80)

plt.subplot(121)
plt.plot(ddfn_rm['load'],label='Load (kW)')
plt.title('Rolling 1 Year Mean on Normalized and Differenced Timeseries')
plt.legend()

plt.subplot(122)
plt.plot(ddfn_rv['load'],label='Load (kW)')
plt.title('Rolling 1 Year Variance on Normalized and Differenced Timeseries')
plt.legend()


Augmented Dickey-Fuller Test

Based on the very low ADF p-value we can likely say the dataset is stationary (the null hypothesis is rejected). Also the ADF statistic is much lower than the critical values at 10, 5, and 1% significance.

Reminder: null hypothesis == non-stationarity
- p-value <= 0.05: hypothesis rejected, "suggests" stationarity
- p-value > 0.05: hypothesis cannot be rejected, "suggests" non-stationarity

In [None]:
result = adfuller(df.values)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

Because the cleaned data is not very normally distributed, we can also check the ADF test of the time-differenced data, which is closer to Gaussian.

In [None]:
result = adfuller(ddfn.values)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

## Autocorrelation

One week

In [None]:
plot_acf(df.values,lags=4*24*7)
plt.show()

In [None]:
plot_acf(ddfn.values,lags=4*24*7)
plt.show()

One year

In [None]:
plot_acf(df.values,lags=4*24*365)
plt.show()

In [None]:
plot_acf(ddfn.values,lags=4*24*365)
plt.show()
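
To make the plots above easier to read, here is a minimal sketch of the quantity being plotted: the sample autocorrelation at a single lag. It uses a synthetic series with a 96-step (one day) cycle, and `autocorr` is a hypothetical helper using the common estimator that divides by the full-series sum of squares.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at a positive integer lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# synthetic series with a 96-step cycle plus a little noise
rng = np.random.default_rng(0)
t = np.arange(96 * 30)
x = np.sin(2 * np.pi * t / 96) + 0.1 * rng.normal(size=t.size)

r_day = autocorr(x, 96)   # one full cycle later: strongly positive
r_half = autocorr(x, 48)  # half a cycle later: strongly negative
print(r_day, r_half)
```

Peaks at multiples of 96 lags in the plots above would therefore indicate a daily pattern, and peaks at multiples of 672 a weekly one.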

# Test Data Split

Save the test data to calculate the final generalization error

In [None]:
split = 0.9 # 10% of data for test
i_split = int(len(dfn)*split)

dfn_test = dfn[i_split:].copy(deep=True)
dfn = dfn[:i_split]

In [None]:
split = 0.9 # 10% of data for test
i_split = int(len(ddfn)*split)

ddfn_test = ddfn[i_split:].copy(deep=True)
ddfn = ddfn[:i_split]
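
The split above is chronological: the last 10% of the rows are held out, with no shuffling. A minimal sketch of that idea on a stand-in array (the helper name `chrono_split` is hypothetical):

```python
import numpy as np

def chrono_split(data, train_fraction):
    """Split an ordered sequence into (earlier, later) parts without shuffling."""
    i_split = int(len(data) * train_fraction)
    return data[:i_split], data[i_split:]

samples = np.arange(1000)  # stand-in for time-ordered rows
train, test = chrono_split(samples, 0.9)
print(len(train), len(test))
```

Keeping the order matters here because the inputs are lagged windows of the series. A random split would put samples from nearly the same time window on both sides.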

# Feature Engineering

Looking for useful data transformations to reveal the patterns in the load

## Timekeeping Triangle Indices 

Time of Day

Each day the triangle index starts at value 0 at time 0:00, increases at every timestep to its maximum value of 1 at time 12:00, then decreases back to value 0 at time 0:00 the next day.

In [None]:
dfn['time of day'] = (48 - np.abs(dfn.index.hour.values*4 + dfn.index.minute.values/15 - 48))/48

In [None]:
ddfn['time of day'] = (48 - np.abs(ddfn.index.hour.values*4 + ddfn.index.minute.values/15 - 48))/48
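
The same triangle formula, written as a plain function so individual times can be checked by hand. This is a sketch only; the function name and the `steps_per_day` parameter are not part of the notebook.

```python
def time_of_day_triangle(hour, minute, steps_per_day=96):
    """Triangle index: 0 at midnight, rising linearly to 1 at noon, back to 0."""
    half = steps_per_day / 2
    # position of the timestep within the day (0 .. steps_per_day - 1)
    step = hour * (steps_per_day / 24) + minute / (24 * 60 / steps_per_day)
    return (half - abs(step - half)) / half

midnight = time_of_day_triangle(0, 0)
six_am = time_of_day_triangle(6, 0)
noon = time_of_day_triangle(12, 0)
last_step = time_of_day_triangle(23, 45)
print(midnight, six_am, noon, last_step)
```

Because the index is symmetric around noon, 6:00 and 18:00 get the same value, so on its own it cannot tell morning from evening.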

Day of Week

Pandas computes day of week such that Monday=0 and Sunday=6. The problems are that
1. This does not seem to correlate with load (which is maximum on Friday, roughly speaking).
2. The ramp function suggests that 0 is very far away from 6, but actually it's just the day after. For this reason we prefer a triangle function, where subsequent days have similar values.

Therefore we will shift the day of week index such that Friday is the maximum, and then convert it from a ramp to a triangle. Lastly we will normalize it.

In [None]:
# make tuesday=0
dfn['day of week'] = (dfn.index.dayofweek - 1) % 7 # modulo wraps Monday from -1 to 6 without chained assignment

# convert ramp to triangle and normalize
dfn['day of week'] = (3 - np.abs(dfn['day of week'].values - 3))/3

In [None]:
# make tuesday=0
ddfn['day of week'] = (ddfn.index.dayofweek - 1) % 7 # modulo wraps Monday from -1 to 6 without chained assignment

# convert ramp to triangle and normalize
ddfn['day of week'] = (3 - np.abs(ddfn['day of week'].values - 3))/3
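
A plain-Python sketch of the shift-then-triangle mapping, so the value for each weekday can be read off directly (the function name is hypothetical):

```python
def day_of_week_triangle(dayofweek):
    """Map pandas' dayofweek (Monday=0 ... Sunday=6) to a triangle peaking on Friday."""
    shifted = (dayofweek - 1) % 7      # Tuesday=0, ..., Friday=3, ..., Monday=6
    return (3 - abs(shifted - 3)) / 3  # 0 on Monday and Tuesday, 1 on Friday

names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
triangle = {name: day_of_week_triangle(d) for d, name in enumerate(names)}
print(triangle)
```

With seven days and a peak value of 3, Monday and Tuesday both land on 0, and Thursday/Saturday and Wednesday/Sunday form equal pairs.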

Visualize the two triangle indices

In [None]:
plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(dfn['2019-1-1':'2019-1-7']['day of week'],label='day of week')
plt.plot(dfn['2019-1-1':'2019-1-7']['time of day'],label='time of day')
plt.title('Timekeeping Triangle Indices')
plt.legend()

## Holidays

In [None]:
#from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

#dr = pd.date_range(start='2016-01-01', end='2020-12-31')
#dff = pd.DataFrame()
#dff['Date'] = dr

#cal = calendar()
#holidays = cal.holidays(start=dr.min(), end=dr.max())

#dff['Holiday'] = dff['Date'].isin(holidays)*1
#dff

In [None]:
#ddfn['Date'] = ddfn.index.to_frame()
#ddfn['Date'] = ddfn['Date'].apply(lambda x:x.date().strftime('%Y-%m-%d'))

#cal2 = calendar()
#holidays2 = cal2.holidays(start=ddfn.index.min(), end=ddfn.index.max())
#ddfn['Date']

In [None]:
#ddfn['Holidays'] = ddfn['Date'].isin(holidays2)*1
#ddfn['2018-12-25']

In [None]:
dayname={0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}

daily_max = df.resample('D').max()
daily_avg = df.resample('D').mean()
daily_min = df.resample('D').min()
daily_std = df.resample('D').std()

df_daily_max = pd.DataFrame()
df_daily_avg = pd.DataFrame()
df_daily_min = pd.DataFrame()
df_daily_std = pd.DataFrame()

for dow in range(7): # "day of week", 0 = monday  
    df_daily_max[dayname[dow]] = daily_max[daily_max.index.dayofweek==dow].values[:167].flatten()
    df_daily_avg[dayname[dow]] = daily_avg[daily_avg.index.dayofweek==dow].values[:167].flatten()
    df_daily_min[dayname[dow]] = daily_min[daily_min.index.dayofweek==dow].values[:167].flatten()
    df_daily_std[dayname[dow]] = daily_std[daily_std.index.dayofweek==dow].values[:167].flatten()    

    
plt.figure(num=None, figsize=(20, 10), dpi=80)

plt.subplot(221) # integer position code; string codes are rejected by newer matplotlib
plt.title('daily maximums')
plt.ylabel('Load (kW)')
df_daily_max.boxplot()

plt.subplot(222)
plt.title('daily averages')
plt.ylabel('Load (kW)')
df_daily_avg.boxplot()

plt.subplot(223)
plt.title('daily minimums')
plt.ylabel('Load (kW)')
df_daily_min.boxplot()

plt.subplot(224)
plt.title('daily std devs')
plt.ylabel('Load (kW)')
df_daily_std.boxplot()



## Empirical Mode Decomposition (EMD)

Decompose ("sift") into Intrinsic Mode Functions (IMFs)

In [None]:
imf = emd.sift.sift(dfn['load'].values)

print('Number of different IMFs: ',imf.shape[1])

fig = emd.plotting.plot_imfs(imf, scale_y=True, cmap=True)


In [None]:
imf_dif = emd.sift.sift(ddfn['load'].values)

print('Number of different IMFs: ',imf_dif.shape[1])

fig = emd.plotting.plot_imfs(imf_dif, scale_y=True, cmap=True) # plot the IMFs of the differenced series

Select features based on correlation with the original timeseries: IMFs 3, 4, 5, and 10. (The code below also keeps IMF 11.)

In [None]:
dfn_emd = dfn.copy(deep=True)

for i in range(imf.shape[1]):
    dfn_emd['IMF%s'%(i+1)] = imf[:,i]
    

c = dfn_emd.corr()
c.style.background_gradient(cmap='coolwarm').set_precision(2)

In [None]:
ddfn_emd = ddfn.copy(deep=True)

for i in range(imf_dif.shape[1]):
    ddfn_emd['IMF%s'%(i+1)] = imf_dif[:,i]
    

c = ddfn_emd.corr()
c.style.background_gradient(cmap='coolwarm').set_precision(2)
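
In the notebook the IMFs are picked by eye from the correlation tables above. As a rough sketch, the same selection could be automated with a correlation threshold. The threshold of 0.5, the helper name, and the synthetic "IMF" columns below are all assumptions for illustration, not values from the thesis.

```python
import numpy as np

def select_by_correlation(components, target, threshold):
    """Return the column indices of `components` whose absolute Pearson
    correlation with `target` is at least `threshold`."""
    selected = []
    for j in range(components.shape[1]):
        r = np.corrcoef(components[:, j], target)[0, 1]
        if abs(r) >= threshold:
            selected.append(j)
    return selected

# synthetic stand-in for an IMF matrix: noise, a weak fast ripple, a strong daily cycle
rng = np.random.default_rng(0)
t = np.arange(96 * 14)
daily = np.sin(2 * np.pi * t / 96)
ripple = 0.05 * np.sin(2 * np.pi * t / 7)
noise = 0.05 * rng.normal(size=t.size)
components = np.column_stack([noise, ripple, daily])
target = components.sum(axis=1)  # like IMFs, the components add up to the signal

kept = select_by_correlation(components, target, threshold=0.5)
print(kept)
```

Only the daily-cycle column passes the threshold in this toy case. On real IMFs the cut-off would need to be chosen with care.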

Partially reconstruct the original timeseries using only the selected IMFs (3, 4, 5, 10, and 11), which capture most but not all of the variation.

In [None]:
dfn_emd['IMFs'] = imf[:,2] + imf[:,3] + imf[:,4] + imf[:,9] + imf[:,10]

plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(dfn[:'2016-11-20']['load'],'b',label='Original timeseries')
plt.plot(dfn_emd[:'2016-11-20']['IMFs'],'g--',label='IMFs 3, 4, 5, 10, 11')
plt.ylabel('Load (kW)')
plt.legend()

In [None]:
ddfn_emd['IMFs'] = imf_dif[:,0] + imf_dif[:,1] + imf_dif[:,2] + imf_dif[:,3] + imf_dif[:,4]

plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(ddfn[:'2016-11-20']['load'],'b',label='Original timeseries')
plt.plot(ddfn_emd[:'2016-11-20']['IMFs'],'g--',label='IMFs 1-5')
plt.ylabel('Load (kW)')
plt.legend()

# Organize Data for Supervised Learning

In [None]:
# keep only the selected IMFs (3, 4, 5, 10, 11)
dfn_emd.drop(columns=['IMF1', 'IMF2', 'IMF6', 'IMF7', 'IMF8', 'IMF9', 'IMFs'], inplace=True)

In [None]:
# keep only IMFs 1-5 of the differenced series
ddfn_emd.drop(columns=['IMF6', 'IMF7', 'IMF8', 'IMF9', 'IMF10', 'IMF11', 'IMF12', 'IMFs'], inplace=True)

## Build Time Shifted Columns

### Model Input (X)

Choose which input data set (and shorten column names for compactness)

In [None]:
Xdf = dfn.copy(deep=True)
Xdf_emd = dfn_emd

Xdf.rename(columns={'load': 't'},inplace=True)
Xdf.rename(columns={'time of day': 'tod'},inplace=True)
Xdf.rename(columns={'day of week': 'dow'},inplace=True)

In [None]:
Xddf = ddfn.copy(deep=True)
Xddf_emd = ddfn_emd

Xddf.rename(columns={'load': 't'},inplace=True)
Xddf.rename(columns={'time of day': 'tod'},inplace=True)
Xddf.rename(columns={'day of week': 'dow'},inplace=True)

Notice the NaNs that appear. This is because we start at the beginning of the (cleaned) dataset and attempt to go back in time `n_in` data points - which isn't possible. All the NaNs will be cleared out in a final trim.

In [None]:
n_in = 4*24*3 # number of inputs

for i in range(1,n_in):
    Xdf.insert(0, 'IMF3 t-%s'%i,  Xdf_emd['IMF3'].shift(i), True)  
    Xdf.insert(0, 'IMF4 t-%s'%i,  Xdf_emd['IMF4'].shift(i), True)    
    Xdf.insert(0, 'IMF5 t-%s'%i,  Xdf_emd['IMF5'].shift(i), True)    
    Xdf.insert(0, 'IMF10 t-%s'%i, Xdf_emd['IMF10'].shift(i), True)    
    Xdf.insert(0, 'IMF11 t-%s'%i, Xdf_emd['IMF11'].shift(i), True)   
    Xdf.insert(0, 'tod t-%s'%i,   Xdf_emd['time of day'].shift(i), True)   
    Xdf.insert(0, 'dow t-%s'%i,   Xdf_emd['day of week'].shift(i), True)   
    Xdf.insert(0, 't-%s'%i,       Xdf_emd['load'].shift(i),    True)    
    

In [None]:
n_in = 4*24*3 # number of inputs

for i in range(1,n_in):
    Xddf.insert(0, 'IMF1 t-%s'%i,  Xddf_emd['IMF1'].shift(i), True)  
    Xddf.insert(0, 'IMF2 t-%s'%i,  Xddf_emd['IMF2'].shift(i), True)    
    Xddf.insert(0, 'IMF3 t-%s'%i,  Xddf_emd['IMF3'].shift(i), True)    
    Xddf.insert(0, 'IMF4 t-%s'%i,  Xddf_emd['IMF4'].shift(i), True)    
    Xddf.insert(0, 'IMF5 t-%s'%i,  Xddf_emd['IMF5'].shift(i), True)   
    Xddf.insert(0, 'tod t-%s'%i,   Xddf_emd['time of day'].shift(i), True)   
    Xddf.insert(0, 'dow t-%s'%i,   Xddf_emd['day of week'].shift(i), True)   
    Xddf.insert(0, 't-%s'%i,       Xddf_emd['load'].shift(i),    True)    
    
Xddf

### Model Output (Y)

In [None]:
Ydf = pd.DataFrame(dfn['load'])
Ydf.rename(columns={'load': 't'},inplace=True)

In [None]:
Yddf = pd.DataFrame(ddfn['load'])
Yddf.rename(columns={'load': 't'},inplace=True)

In [None]:
h = 96
n_out = 1
Ydf['t+96'] = Ydf['t'].shift(-96)

# use for predicting (say) all t+1h, t+2h.. t+24h
if 0:
    h = 0 # horizon, not incorporated
    n_out = 24  # number of outputs

    for i in range(1,n_out+1):
        Ydf['t+%sh'%i]=Ydf['t'].shift(-i)

Ydf.drop('t',axis=1,inplace=True)

In [None]:
h = 96
n_out = 1
Yddf['t+96'] = Yddf['t'].shift(-96)

# use for predicting (say) all t+1h, t+2h.. t+24h
if 0:
    h = 0 # horizon, not incorporated
    n_out = 24  # number of outputs

    for i in range(1,n_out+1):
        Yddf['t+%sh'%i]=Yddf['t'].shift(-i)

Yddf.drop('t',axis=1,inplace=True)

Yddf

### Trim the edges (NaNs)

In [None]:
Xdf = Xdf[n_in-1 : -(n_out+h-1)]

In [None]:
Xddf = Xddf[n_in-1 : -(n_out+h-1)]
Xddf

In [None]:
Ydf = Ydf[n_in-1 : -(n_out+h-1)]

In [None]:
Yddf = Yddf[n_in-1 : -(n_out+h-1)]
Yddf
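
The lag, horizon, and trim steps above can be sketched on a toy series where each value equals its position, which makes the alignment easy to check. The helper `make_supervised` is hypothetical and handles a single series with a single target at t+h.

```python
import numpy as np
import pandas as pd

def make_supervised(series, n_in, h):
    """Build lagged inputs t, t-1, ..., t-(n_in-1) and a single target at t+h,
    then drop the rows at both edges where shifting left NaNs."""
    cols = {'t': series}
    for i in range(1, n_in):
        cols['t-%d' % i] = series.shift(i)      # value i steps in the past
    X = pd.DataFrame(cols)
    Y = series.shift(-h).to_frame('t+%d' % h)   # value h steps in the future
    keep = slice(n_in - 1, len(series) - h)     # first n_in-1 and last h rows hold NaNs
    return X.iloc[keep], Y.iloc[keep]

s = pd.Series(np.arange(20, dtype=float))       # toy series: value equals position
X, Y = make_supervised(s, n_in=4, h=3)
print(X.head(2))
print(Y.head(2))
```

Building all columns in a dict and creating the frame once is a design choice for the sketch. It avoids inserting hundreds of columns one at a time.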

## Reduce data for hyperparameter tuning

The final training run will use all the data.

In [None]:
months = 6
k = 4*24*30*months

Xdf = Xdf[-k:]

In [None]:
months = 6
k = 4*24*30*months

Xddf = Xddf[-k:]
Xddf

In [None]:
Ydf = Ydf[-k:]

In [None]:
Yddf = Yddf[-k:]
Yddf

## Split train vs validate

In [None]:
split = 0.7
L = Xdf.shape[0]
i_split = int(L*split)


Xdf_train = Xdf[0:i_split]
Xdf_valid = Xdf[i_split:]

Ydf_train = Ydf[0:i_split]
Ydf_valid = Ydf[i_split:]

print('Xdf_train.shape: ',Xdf_train.shape)
print('Xdf_valid.shape: ',Xdf_valid.shape)
print('Ydf_train.shape: ',Ydf_train.shape)
print('Ydf_valid.shape: ',Ydf_valid.shape)

In [None]:
split = 0.7
L = Xddf.shape[0]
i_split = int(L*split)


Xddf_train = Xddf[0:i_split]
Xddf_valid = Xddf[i_split:]

Yddf_train = Yddf[0:i_split]
Yddf_valid = Yddf[i_split:]

print('Xddf_train.shape: ',Xddf_train.shape)
print('Xddf_valid.shape: ',Xddf_valid.shape)
print('Yddf_train.shape: ',Yddf_train.shape)
print('Yddf_valid.shape: ',Yddf_valid.shape)

In [None]:
Xdf_valid

In [None]:
Ydf_valid

## Training Parameters

In [None]:
epochs = 1000
batch = 1024
lr = 0.0001
patience = 20
neurons = 200
adam = optimizers.Adam(lr)

# EMD + ANN

In [None]:
model_ann = Sequential()

model_ann.add(Dense(neurons, activation='relu', input_dim=Xdf_train.shape[1]))
model_ann.add(Dense(Ydf_train.shape[1]))

model_ann.compile(loss='mse', optimizer=adam)
model_ann.summary()

In [None]:
model_dann = Sequential()

model_dann.add(Dense(neurons, activation='relu', input_dim=Xddf_train.shape[1]))
model_dann.add(Dense(Yddf_train.shape[1]))

model_dann.compile(loss='mse', optimizer=adam)
model_dann.summary()

## Train

In [None]:
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=patience)

ann_history =  model_ann.fit(Xdf_train.values, Ydf_train.values, validation_data=(Xdf_valid, Ydf_valid), epochs=epochs, verbose=2, callbacks=[es])

In [None]:
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=patience)

dann_history =  model_dann.fit(Xddf_train.values, Yddf_train.values, validation_data=(Xddf_valid, Yddf_valid), epochs=epochs, verbose=2, callbacks=[es])
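
The `patience` setting can be pictured with a plain-Python loop. This is a simplified illustration of the idea, not Keras's implementation (which also has options such as `min_delta` and `restore_best_weights`), and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None if the
    list of losses ends first. Training stops once `patience` epochs in a row
    fail to improve on the best validation loss seen so far."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0   # improvement: remember it and reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.69, 0.70, 0.70, 0.70]
stop_at = early_stopping_epoch(losses, patience=3)
print(stop_at)
```

With `patience = 20` as set above, a run of 1000 epochs ends early only after 20 consecutive epochs without a new best validation loss.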

Training and validation losses

In [None]:
plt.figure(num=None, figsize=(10, 5), dpi=80)
plt.plot(ann_history.history['loss'][2:]) # first two losses can be orders of magnitudes higher
plt.plot(ann_history.history['val_loss'][2:]) # first two losses can be orders of magnitudes higher
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch - 2')
plt.legend(['training loss', 'validation loss'], loc='upper right')

In [None]:
plt.figure(num=None, figsize=(10, 5), dpi=80)
plt.plot(dann_history.history['loss'][2:]) # first two losses can be orders of magnitudes higher
plt.plot(dann_history.history['val_loss'][2:]) # first two losses can be orders of magnitudes higher
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch - 2')
plt.legend(['training loss', 'validation loss'], loc='upper right')

## Results

Make Predictions

In [None]:
Y_train_hat = model_ann.predict(Xdf_train.values)
Y_valid_hat = model_ann.predict(Xdf_valid.values)

In [None]:
Yd_train_hat = model_dann.predict(Xddf_train.values)
Yd_valid_hat = model_dann.predict(Xddf_valid.values)
print('Train predictions shape ',Yd_train_hat.shape)
print('Valid predictions shape ',Yd_valid_hat.shape)

Root Mean Squared Error

In [None]:
measurements_t = scaler.inverse_transform(Ydf_train.values)
measurements_v = scaler.inverse_transform(Ydf_valid.values)

persistence_t = scaler.inverse_transform(Xdf_train['t'].values.reshape(-1,1))
persistence_v = scaler.inverse_transform(Xdf_valid['t'].values.reshape(-1,1))

predictions_t = scaler.inverse_transform(Y_train_hat)
predictions_v = scaler.inverse_transform(Y_valid_hat)

print('Persist train rmse [kW]: {:.3f}'.format(np.sqrt(mean_squared_error(measurements_t, persistence_t))))
print('Persist valid rmse [kW]: {:.3f}'.format(np.sqrt(mean_squared_error(measurements_v, persistence_v))))

print('Train rmse [kW]: {:.3f}'.format(np.sqrt(mean_squared_error(measurements_t, predictions_t))))
print('Valid rmse [kW]: {:.3f}'.format(np.sqrt(mean_squared_error(measurements_v, predictions_v))))
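
A minimal sketch of the two quantities compared above, RMSE and the naive persistence forecast, on a toy load curve with an exact daily cycle. The numbers are made up and the helper names are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error; squaring penalizes large errors more heavily."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def persistence_forecast(series, h):
    """Naive persistence: the forecast for time t+h is the value observed at t.
    Returns (actual, forecast) arrays aligned on the target times."""
    series = np.asarray(series, dtype=float)
    return series[h:], series[:-h]

# toy load with an exact 96-step daily cycle (made-up numbers, in kW)
t = np.arange(96 * 5)
load = 1000 + 300 * np.sin(2 * np.pi * t / 96)

actual, forecast = persistence_forecast(load, h=96)
rmse_day = rmse(actual, forecast)    # essentially zero: yesterday repeats exactly
actual, forecast = persistence_forecast(load, h=48)
rmse_half = rmse(actual, forecast)   # large: half a cycle out of phase
print(rmse_day, rmse_half)
```

On strongly periodic load a 24 h persistence forecast is therefore a hard benchmark to beat, which is why the models are compared against it.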

Root Mean Squared Error (differenced)

In [None]:
dmeasurements_t = scaler.inverse_transform(Yddf_train.values)
dmeasurements_v = scaler.inverse_transform(Yddf_valid.values)

dpersistence_t = scaler.inverse_transform(Xddf_train['t'].values.reshape(-1,1))
dpersistence_v = scaler.inverse_transform(Xddf_valid['t'].values.reshape(-1,1))

dpredictions_t = scaler.inverse_transform(Yd_train_hat)
dpredictions_v = scaler.inverse_transform(Yd_valid_hat)

print('Persist train rmse [kW]: {:.3f}'.format(np.sqrt(mean_squared_error(dmeasurements_t, dpersistence_t))))
print('Persist valid rmse [kW]: {:.3f}'.format(np.sqrt(mean_squared_error(dmeasurements_v, dpersistence_v))))

print('Train rmse [kW]: {:.3f}'.format(np.sqrt(mean_squared_error(dmeasurements_t, dpredictions_t))))
print('Valid rmse [kW]: {:.3f}'.format(np.sqrt(mean_squared_error(dmeasurements_v, dpredictions_v))))

## Plots

Naïve persistence (24 h)

In [None]:
k = 4*24*7
t=np.arange(0,k)

m = measurements_v[0:k]
p = persistence_v[0:k]

plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(t,m,t,p)
plt.ylabel('kW')
plt.xlabel('timestep')
plt.legend(['Y','Predicted Y'])
plt.title('validation set 24 h persistence')

In [None]:
k = 4*24*7
t=np.arange(0,k)

m = dmeasurements_v[0:k]
p = dpersistence_v[0:k]

plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(t,m,t,p)
plt.ylabel('kW')
plt.xlabel('timestep')
plt.legend(['Y','Predicted Y'])
plt.title('validation set 24 h persistence (differenced)')

### Rolling Horizon Predictions

In [None]:
k = 4*24*7
t=np.arange(0,k)

m = measurements_t[0:k]
p = predictions_t[0:k]

plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(t,m,t,p)
plt.ylabel('kW')
plt.xlabel('timestep')
plt.legend(['Y','Predicted Y'])
plt.title('training set predictions')

In [None]:
k = 4*24*7
t=np.arange(0,k)

m = dmeasurements_t[0:k]
p = dpredictions_t[0:k]

plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(t,m,t,p)
plt.ylabel('kW')
plt.xlabel('timestep')
plt.legend(['Y','Predicted Y'])
plt.title('training set predictions (differenced)')

In [None]:
k = 4*24*7
t=np.arange(0,k)

m = measurements_v[0:k]
p = predictions_v[0:k]

plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(t,m,t,p)
plt.ylabel('kW')
plt.xlabel('timestep')
plt.legend(['Y','Predicted Y'])
plt.title('validation set predictions')

In [None]:
k = 4*24*7
t=np.arange(0,k)

m = dmeasurements_v[0:k]
p = dpredictions_v[0:k]

plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(t,m,t,p)
plt.ylabel('kW')
plt.xlabel('timestep')
plt.legend(['Y','Predicted Y'])
plt.title('validation set predictions (differenced)')

## Rolling Horizon Predictions (integrated)

Integrate (we differenced the original data)

In [None]:
k = 4*24*7
t=np.arange(0,k)

measurements_v_i = scaler.inverse_transform(Ydf_valid.values)

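# integrate the differenced-model predictions; seed the running sum with the first measured (normalized) value
Yd_valid_hat_i = Yd_valid_hat.copy()
Yd_valid_hat_i[0] = Ydf_valid.values[0]
Y_valid_hat_i = np.cumsum(Yd_valid_hat_i).reshape(-1,1)
predictions_v_i = scaler.inverse_transform(Y_valid_hat_i)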

m = measurements_v_i[0:k]
p = predictions_v_i[0:k]

plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(t,m,t,p)
plt.ylabel('kW')
plt.xlabel('timestep')
plt.legend(['Y_valid','Predicted Y_valid'])
plt.title('validation set predictions (integrated)')

In [None]:
raise RuntimeError('Stop here: the CNN and LSTM sections below are not run in this version')

# EMD + CNN (not run in this version)

## Reshape X and Y into 3D array for CNN

In [None]:
# Xdf_* / Ydf_* are the train and validation frames built in the split above
X_train_3d = Xdf_train.values.reshape(Xdf_train.shape[0],Xdf_train.shape[1],1)
X_valid_3d = Xdf_valid.values.reshape(Xdf_valid.shape[0],Xdf_valid.shape[1],1)

Y_train_3d = Ydf_train.values.reshape(Ydf_train.shape[0],Ydf_train.shape[1],1)
Y_valid_3d = Ydf_valid.values.reshape(Ydf_valid.shape[0],Ydf_valid.shape[1],1)

print('X_train_3d shape: ',X_train_3d.shape)
print('X_valid_3d shape: ',X_valid_3d.shape)
print('Y_train_3d shape: ',Y_train_3d.shape)
print('Y_valid_3d shape: ',Y_valid_3d.shape)
print('X_train[:3,:5,0]:\n',X_train_3d[:3,:5,0])
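
Keras `Conv1D` and `LSTM` layers expect 3D input of shape (samples, steps, channels). A small NumPy sketch of the reshape used above, on a toy array: the values are unchanged and only a trailing axis of length 1 is added.

```python
import numpy as np

n_samples, n_features = 5, 8
X_2d = np.arange(n_samples * n_features, dtype=float).reshape(n_samples, n_features)

# add a trailing axis of length 1: (samples, steps, channels)
X_3d = X_2d.reshape(X_2d.shape[0], X_2d.shape[1], 1)
print(X_2d.shape, '->', X_3d.shape)
```

Note that this treats every column as one step of a single-channel sequence. When the columns mix several lagged features, a (samples, lags, features) layout may be worth considering instead.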



## Build

In [None]:
model_cnn = Sequential()
model_cnn.add(Conv1D(filters=4, kernel_size=10, activation='relu', input_shape=(X_train_3d.shape[1], 1)))
model_cnn.add(MaxPooling1D(pool_size=2))
model_cnn.add(Flatten())
model_cnn.add(Dense(20, activation='relu'))
model_cnn.add(Dense(20, activation='relu'))
model_cnn.add(Dense(Y_train_3d.shape[1]))
model_cnn.compile(loss='mse', optimizer=adam)
model_cnn.summary()

## Fit

In [None]:
cnn_history = model_cnn.fit(X_train_3d, Y_train_3d, validation_data=(X_valid_3d, Y_valid_3d), epochs=epochs, verbose=2)

## Results

In [None]:
cnn_train_pred = model_cnn.predict(X_train_3d)
cnn_valid_pred = model_cnn.predict(X_valid_3d)
print('Ydf_train shape',Ydf_train.shape)
print('cnn_train_pred shape',cnn_train_pred.shape)
print('Train rmse: {:.3f}'.format(np.sqrt(mean_squared_error(Ydf_train.values, cnn_train_pred))))
print('Validation rmse: {:.3f}'.format(np.sqrt(mean_squared_error(Ydf_valid.values, cnn_valid_pred))))
Y_valid_hat_cnn = model_cnn.predict(X_valid_3d)
print('Y_valid_hat_cnn.shape:',Y_valid_hat_cnn.shape)

## Plots

### Rolling prediction

In [None]:
k = 4*24*7
t=np.arange(0,k)

plt.figure(num=None, figsize=(20, 10), dpi=80)
plt.plot(t,Y_valid_3d[0:k,0],t,Y_valid_hat_cnn[0:k,0])
plt.ylabel('kW')
plt.xlabel('timestep')
plt.legend(['Y_valid','Predicted Y_valid'])

# EMD + LSTM (not run in this version)

Reshape? (no, already done above)

## Build

In [None]:
model_lstm = Sequential()
model_lstm.add(LSTM(50, activation='relu', input_shape=(X_train_3d.shape[1], X_train_3d.shape[2])))
model_lstm.add(Dense(Y_train_3d.shape[1]))
model_lstm.compile(loss='mse', optimizer=adam)
model_lstm.summary()

## Fit

In [None]:
lstm_history = model_lstm.fit(X_train_3d, Y_train_3d, validation_data=(X_valid_3d, Y_valid_3d), epochs=epochs, verbose=2)

## Results

In [None]:
Y_valid_hat_lstm = model_lstm.predict(X_valid_3d)
Y_valid_hat_lstm.shape

In [None]:
lstm_train_pred = model_lstm.predict(X_train_3d)
lstm_valid_pred = model_lstm.predict(X_valid_3d)
print('Train rmse: {:.3f}'.format(np.sqrt(mean_squared_error(Ydf_train.values, lstm_train_pred))))
print('Validation rmse: {:.3f}'.format(np.sqrt(mean_squared_error(Ydf_valid.values, lstm_valid_pred))))

## Plots

### Rolling 12 hr prediction

In [None]:
k=72
t=np.arange(0,k)

plt.plot(t,Y_valid_3d[0:k,-1],t,Y_valid_hat_lstm[0:k,-1])
plt.ylabel('kWh')
plt.xlabel('hrs')
plt.legend(['Y_valid','Y_valid_hat'])

### Single prediction

In [None]:
t=np.arange(0,n_out)

plt.plot(t,Y_valid_3d[0,:],t,Y_valid_hat_lstm[0,:])
plt.ylabel('kWh')
plt.xlabel('hrs')
plt.legend(['Y_valid','Y_valid_hat'])