### Load Forecasting 
***

In this Kernel, I delve into different aspects of time-series forecasting and the problems encountered while modelling load forecasting with different time-series techniques. 


**Contents:**

[1. Data Understanding](#1)

[2. Preprocessing Data](#2)

[3. Univariate Time-series modelling](#3)  
    [3.1 Holt-Winters exponential smoothing](#3.1)  
    [3.2 SARIMAX](#3.2)  
    [3.3 Auto-ARIMA](#3.3)  
    [3.4 LSTM](#3.4)  
    [3.5 Facebook-Prophet](#3.5)  
   
[4. Multivariate Time-series modelling](#4)

[5. Conclusion](#5)

 ### <div id= '1'>1. Data Understanding</div>

We start by understanding the variable types, the number of records, the variable names, and their spread. 

In [None]:
#Loading packages and Data

from IPython.display import Image
import numpy as np # linear algebra
import pandas as pd# data processing, CSV file I/O (e.g. pd.read_csv)
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from datetime import datetime
from random import random
from math import sqrt
from numpy import concatenate
from numpy import array
import matplotlib.pyplot as plt

from statsmodels.tsa.stattools import grangercausalitytests
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.vector_ar.vecm import coint_johansen
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.seasonal  import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf

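# note: pyramid-arima has since been renamed pmdarima; on current environments
# use `pip install pmdarima` and `from pmdarima import auto_arima`
!pip install pyramid-arima
from pyramid.arima import auto_arima


#!pip install plotly==3.10.0

# note: fbprophet was renamed prophet from version 1.0 (`from prophet import Prophet`)
from fbprophet import Prophet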
#from plotly.plotly import plot_mpl


from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor



from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM


In [None]:

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
pd.plotting.register_matplotlib_converters()

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.


In [None]:
#Loading csv
data=pd.read_csv('../input/smart-home-dataset-with-weather-information/HomeC.csv')


In [None]:
data.head()

In [None]:
data.info()

The time column has one more record than the others. Let's check why.

In [None]:
data.tail()

As we see, the last record is truncated!

In [None]:
# removing the truncated record
data=data[:-1]
data.shape

In [None]:
#given that the time is in UNIX format, let's check 
time = pd.to_datetime(data['time'],unit='s')
time.head()

Here, the timestamps increase in steps of one second, but the data is specified as having one-minute time steps. So we create a new date range in one-minute increments.

In [None]:
#new daterange in increments of minutes
time_index = pd.date_range('2016-01-01 05:00', periods=len(data),  freq='min')  
time_index = pd.DatetimeIndex(time_index)
data['time']=time_index

In [None]:
#changing column names before doing some calculation as they look weird with "[kw]"
data.columns=['time', 'use', 'gen', 'House overall', 'Dishwasher',
       'Furnace 1', 'Furnace 2', 'Home office', 'Fridge',
       'Wine cellar', 'Garage door', 'Kitchen 12',
       'Kitchen 14', 'Kitchen 38', 'Barn', 'Well',
       'Microwave', 'Living room', 'Solar', 'temperature',
       'icon', 'humidity', 'visibility', 'summary', 'apparentTemperature',
       'pressure', 'windSpeed', 'cloudCover', 'windBearing', 'precipIntensity',
       'dewPoint', 'precipProbability']

Let's check power generated from sources other than Solar

In [None]:
data['gen'].head()

In [None]:
data['Solar'].head()

In [None]:
(data['gen']-data['Solar']).value_counts()

 * It seems "Solar" and "gen" are similar columns. So we drop the 'gen' column, as solar is the only source of generated power.

In [None]:
data=data.drop('gen',axis=1)

Let's also check 'House overall' and 'use'.

In [None]:
(data['House overall']-data['use']).value_counts()

'House overall' and 'use' are similar columns, so we drop 'House overall'.

In [None]:
data=data.drop('House overall',axis=1)

### <div id= '2'>2. Preprocessing Data</div>

**Feature Engineering**

In [None]:
#getting  hour, day,week, month from the date column
data['day']= data['time'].dt.day
data['month']= data['time'].dt.month
data['week']= data['time'].dt.week
data['hour']= data['time'].dt.hour

As we can see, some variables have similar names. We first check their energy consumption patterns over the hour, day, week, and month; if they look similar, we merge them into a single variable.

In [None]:
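import seaborn as sns
def visualize(label, cols):
    """Plot the mean of each column in `cols`, grouped by `label` (e.g. 'hour')."""
    fig,ax=plt.subplots(figsize=(14,8))
    colours= ['red','green','blue','yellow']
    # zip stops at the shorter list, so at most four columns are drawn
    for colour,col in zip(colours,cols):
        data.groupby(label)[col].mean().plot(ax=ax,label=col,color=colour)
    plt.legend()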



In [None]:
visualize('hour',['Furnace 1','Furnace 2'])

In [None]:
visualize('day',['Furnace 1','Furnace 2'])

In [None]:
visualize('month',['Furnace 1','Furnace 2'])

Furnace 2's power consumption follows a pattern similar to Furnace 1's, so we combine them into a single variable representing furnace power.
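
To make "looks similar" a little less subjective, one option is to compute a correlation coefficient between the two grouped profiles. The sketch below is a plain-Python illustration on made-up numbers; the `pearson` helper and the sample profiles are hypothetical and not part of the original analysis.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# made-up hourly mean profiles (kW) for two appliances
profile_a = [0.10, 0.12, 0.20, 0.35, 0.30, 0.15]
profile_b = [0.20, 0.25, 0.41, 0.69, 0.61, 0.29]
print(round(pearson(profile_a, profile_b), 3))
```

With the notebook's data, a comparable check could be `data.groupby('hour')['Furnace 1'].mean().corr(data.groupby('hour')['Furnace 2'].mean())`.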

In [None]:
data['Furnace']= data['Furnace 1']+data['Furnace 2']
data=data.drop(['Furnace 1','Furnace 2'], axis =1)

Now we check the kitchens too.

In [None]:
visualize('month',['Kitchen 12','Kitchen 14','Kitchen 38'])

In [None]:
visualize('week',['Kitchen 12','Kitchen 14','Kitchen 38'])

In [None]:
visualize('day',['Kitchen 12','Kitchen 14','Kitchen 38'])

In [None]:
visualize('hour',['Kitchen 12','Kitchen 14','Kitchen 38'])

Let us see what's happening with "Kitchen 38"

In [None]:
data['Kitchen 38'].describe()

In [None]:
fig,ax=plt.subplots(2,2,figsize=(15,10))

data.groupby('hour')['Kitchen 38'].mean().plot(ax=ax[0,0],color='green',label= 'kitchen 38')
data.groupby('day')['Kitchen 38'].mean().plot(ax=ax[0,1],color='green',label= 'kitchen 38')
data.groupby('week')['Kitchen 38'].mean().plot(ax=ax[1,0],color='green',label= 'kitchen 38')
data.groupby('month')['Kitchen 38'].mean().plot(ax=ax[1,1],color='green',label= 'kitchen 38')

                                                     

plt.legend()

Kitchen 38 does consume power, but very little compared to the other kitchens, so we keep the kitchen columns as they are.

Before building models, let us check for columns whose datatype is not int or float.

In [None]:
data['icon'].value_counts()

As these labels are generated by the data acquisition system, we will remove these variables; the actual temperature data is enough for us.


In [None]:
data['summary'].value_counts()

Let us check how much solar energy was produced under different weather conditions.

In [None]:
data.groupby('summary')['Solar'].sum()

As expected, clear, partly cloudy, drizzle, and light-rain days produced a lot more power than other days. Clear days also far outnumber the others, so their total is large compared to the other days'.

In [None]:
data=data.drop(['icon','summary'], axis =1)

Now we will check for 'cloudCover' column

In [None]:
data['cloudCover'].dtypes

In [None]:
data['cloudCover'].head()

In [None]:
data['cloudCover'].value_counts()

As there are a lot of unique values, let us check what they are.

In [None]:
data['cloudCover'].unique()

In [None]:
# some rows hold the literal string 'cloudCover'; treat it as missing and backward-fill
data['cloudCover'] = data['cloudCover'].replace('cloudCover', np.nan).bfill()


Some records hold the string 'cloudCover' instead of a number. Because the records are taken in one-minute steps, we impute these entries with the nearest value, using a backward fill.
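
To illustrate what a backward fill does, here is a minimal plain-Python sketch. The `backward_fill` helper and the sample readings are hypothetical; the notebook itself relies on pandas for this step.

```python
def backward_fill(values, missing=None):
    """Replace each missing entry with the next non-missing value to its right."""
    filled = list(values)
    next_valid = missing
    # walk from the end so the nearest valid value to the right is always known
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] == missing:
            filled[i] = next_valid
        else:
            next_valid = filled[i]
    return filled

readings = [0.75, "cloudCover", "cloudCover", 0.31, 0.31, "cloudCover", 0.0]
print(backward_fill(readings, missing="cloudCover"))
# [0.75, 0.31, 0.31, 0.31, 0.31, 0.0, 0.0]
```

Note that a missing entry at the very end of the series has nothing to its right and therefore stays missing.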

In [None]:
data['cloudCover'].unique()

In [None]:
data['cloudCover']=data['cloudCover'].astype('float')

In [None]:
data.info()

 Now, our dataset has no null values and no categorical variables. Next, we split the dataset for training and testing, build time-series models, and forecast the load. We first resample by day and forecast the daily load. Later, we try to build multivariate time-series models using the other variables.

### **Resampling and Visualization**

Because the data is in minute steps, we first need to resample it into a coarser time series. Resampling by day lets us forecast the daily load.

In [None]:
data.index= data['time']
#daily resampling
dataD=data.resample('D').mean()

In [None]:
dataD.info()

In [None]:
#hourly resampling
dataH=data.resample('H').mean()

In [None]:
weathercols= ['temperature', 'humidity','visibility', 'apparentTemperature', 'pressure', 'windSpeed',
       'cloudCover', 'windBearing', 'precipIntensity', 'dewPoint','precipProbability']
Housecols = ['Dishwasher','Furnace', 'Home office', 'Fridge','Wine cellar', 'Garage door', 'Kitchen 12','Kitchen 14', 
             'Kitchen 38', 'Barn', 'Well','Microwave', 'Living room']
useweather=['use','temperature', 'humidity','visibility', 'apparentTemperature', 'pressure', 'windSpeed',
       'cloudCover', 'windBearing', 'precipIntensity', 'dewPoint','precipProbability']
solarweather=['Solar','temperature', 'humidity','visibility', 'apparentTemperature', 'pressure', 'windSpeed',
       'cloudCover', 'windBearing', 'precipIntensity', 'dewPoint','precipProbability']
usesolar=['use','Solar']

In [None]:

# plot every column of `cols` as its own subplot
def series_visualize(data, cols):
    dataset = data[cols]
    values = dataset.values
    # column positions to plot
    groups = [i for i in range(len(cols))]
    j = 1
    # plot each column
    plt.figure(figsize=(18,13))
    for group in groups:
        plt.subplot(len(groups), 1, j)
        plt.plot(values[:, group])
        plt.title(dataset.columns[group], y=0.5, loc='right')
        j += 1
    plt.show()

In [None]:
#series_visualize(dataH,Housecols)
series_visualize(dataD,Housecols)

In June, July, and August, the power consumption of "Home office", "Wine cellar", and "Fridge" rose. In December, January, and February, the Furnace's power consumption rose.

In [None]:
series_visualize(dataH,usesolar)

Hourly power usage and hourly solar power generation

In [None]:
series_visualize(dataD,usesolar)

Daily power usage and generation

For now, we concentrate on Daily Load forecasting

### <div id= '3'>3. Univariate Models</div>

Load forecasting would be helpful to optimize energy consumption and plan household energy needs accordingly, saving solar energy and utilizing it optimally. Let us build univariate time-series models first!

Understanding the time series tells us whether it has a linear or exponential trend and additive or multiplicative seasonality, which helps us choose techniques that account for these effects.
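
For reference, the two classical decomposition forms behind this choice can be written as follows, where $T_t$ is the trend, $S_t$ the seasonal component, and $R_t$ the remainder (the symbols are introduced here for illustration):

$$
\text{additive: } y_t = T_t + S_t + R_t
\qquad
\text{multiplicative: } y_t = T_t \times S_t \times R_t
$$

`seasonal_decompose` uses the additive form by default (`model='additive'`).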

Decomposing Time-series 

In [None]:
datause=dataD.iloc[:,0].values
#fig,ax=plt.subplots(figsize=(15,10))
plt.rcParams['figure.figsize'] = (14, 9)
seasonal_decompose(dataD[['use']]).plot()
result = adfuller(datause)
plt.show()

Additive seasonality with no trend

#### Testing Stationarity and plotting Trend 

Augmented Dickey-Fuller (ADF) test for stationarity

In [None]:
X= dataD.iloc[:,0].values
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

 The p-value is below 0.05, so we reject the null hypothesis of a unit root and treat the series as stationary.
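
As a reminder of what is being tested: with the default settings of `adfuller` (constant, no trend), the test fits a regression of roughly the following form, where $k$ is the number of lagged differences, chosen automatically. The symbols are introduced here for illustration.

$$
\Delta y_t = \alpha + \gamma\, y_{t-1} + \sum_{i=1}^{k} \delta_i\, \Delta y_{t-i} + \varepsilon_t,
\qquad H_0:\ \gamma = 0 \ \text{(unit root)}, \quad H_1:\ \gamma < 0
$$

A small p-value therefore counts as evidence against a unit root.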

Now we can start building time-series models. As the trend is not linear, we start with Holt-Winters exponential smoothing.

### <div id= '3.1'>3.1 Holt-Winters Exponential Smoothing </div>
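
As a rough sketch of the idea behind additive Holt-Winters smoothing without a trend component, the recursion below updates a level and a set of seasonal terms at every step. This is a simplified illustration with a crude initialisation and hand-picked smoothing parameters; `holt_winters_additive` and the toy series are hypothetical. statsmodels' `ExponentialSmoothing` estimates its parameters by optimisation and will not match this exactly.

```python
def holt_winters_additive(y, m, alpha, gamma, horizon):
    """Additive seasonal exponential smoothing without a trend component."""
    # crude initialisation: level = mean of the first season,
    # seasonal terms = deviations of the first season from that level
    level = sum(y[:m]) / m
    season = [y[i] - level for i in range(m)]
    for t in range(m, len(y)):
        s_prev = season[t % m]  # seasonal term from one period ago
        new_level = alpha * (y[t] - s_prev) + (1 - alpha) * level
        season[t % m] = gamma * (y[t] - level) + (1 - gamma) * s_prev
        level = new_level
    n = len(y)
    # forecast = last level + the matching seasonal term
    return [level + season[(n + h) % m] for h in range(horizon)]

# toy series with a perfect weekly pattern (hypothetical numbers)
week = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(holt_winters_additive(week * 4, m=7, alpha=0.3, gamma=0.2, horizon=7))
```

On a perfectly periodic series like this one, the forecast simply repeats the weekly pattern; on real data, the smoothing parameters control how quickly the level and seasonal terms adapt.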

In [None]:
# split data into train and test sets
train=dataD[dataD['month']<12].iloc[:,0]
test=dataD[dataD['month']>=12].iloc[:,0]
print("train has {} records, test has {} records".format(len(train),len(test)))

In [None]:
fig,ax=plt.subplots(figsize=(18,6))
train.plot(ax=ax)
test.plot(ax=ax)
plt.show()

In [None]:

# fit model with weekly seasonality 
model = ExponentialSmoothing(train.values,seasonal='add',seasonal_periods=7)
model_fit = model.fit()

# make prediction

y = model_fit.forecast(len(test))
y_predicted=pd.DataFrame(y,index=test.index,columns=['Holtwinter'])

plt.figure(figsize=(16,8))
plt.plot(test, label='Test')
plt.plot(y_predicted, label='Holtwinter')
plt.legend(loc='best')
plt.show()

In [None]:
rms = sqrt(mean_squared_error(test,y_predicted))
print(rms)
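
The number printed above is the root mean squared error (RMSE), the metric used to compare the models in this notebook. A minimal plain-Python equivalent of `sqrt(mean_squared_error(...))`, shown on made-up numbers (the `rmse` helper is hypothetical):

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.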

### <div id= '3.2'>3.2 SARIMAX </div>

To find (p,d,q) for ARIMA, we first plot the autocorrelation (ACF) and partial autocorrelation (PACF) functions. Because of the seasonality, we use the seasonal ARIMA model, SARIMAX, from statsmodels.
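
For reference, a seasonal ARIMA model with non-seasonal orders $(p,d,q)$, seasonal orders $(P,D,Q)$, and season length $m$ is commonly written with the backshift operator $B$ (where $B y_t = y_{t-1}$) as:

$$
\phi_p(B)\,\Phi_P(B^m)\,(1-B)^d\,(1-B^m)^D\, y_t = \theta_q(B)\,\Theta_Q(B^m)\,\varepsilon_t
$$

Here $\phi_p$ and $\theta_q$ are the non-seasonal AR and MA polynomials, $\Phi_P$ and $\Theta_Q$ their seasonal counterparts, and $\varepsilon_t$ is white noise. This is standard textbook notation, not taken from the notebook.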

#### ACF and PACF plots

In [None]:
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf

# Draw Plot
fig, axes = plt.subplots(1,2,figsize=(16,3), dpi= 100)
x=plot_acf(train.tolist(), lags=50,ax=axes[0])
y=plot_pacf(train.tolist(), lags=50, ax=axes[1])
plt.show()

In [None]:
# first differencing
fig, axes = plt.subplots(1,2,figsize=(16,3), dpi= 100)
x=plot_acf(train.diff().dropna(), lags=50,ax=axes[0])
y=plot_pacf(train.diff().dropna(), lags=50, ax=axes[1])
plt.show()

We use p = 1, d = 0 and q = 2, to be conservative at first:

1. The ACF plot decays gradually to 0, with a few lags above the confidence band
2. The PACF plot cuts off quickly after lag 1
3. After differencing once, the series shows negative spikes, which indicates over-differencing, so we choose d = 0

For the seasonal terms, (P,D,Q)m:

In [None]:
seasonal=seasonal_decompose(dataD[['use']]).seasonal
#fig.ax=plt.subplots(figsize=(16,5))
seasonal.plot()
seasonal.diff(1).dropna().plot(color='orange')
seasonal.diff(7).dropna().plot(color='green')

After differencing at lag 7, the seasonality is removed completely. As a general rule we set D=1, keep only the seasonal MA term (Q=1), and evaluate with RMSE.
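
To see why differencing at the seasonal lag removes a weekly pattern while first differencing does not, consider this small plain-Python sketch. The `difference` helper and the toy series are hypothetical.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# toy series: a repeating weekly pattern (hypothetical numbers)
weekly = [5, 3, 4, 6, 8, 9, 7] * 3

print(difference(weekly, lag=1)[:7])  # [-2, 1, 2, 2, 1, -2, -2]
print(difference(weekly, lag=7))      # fourteen zeros
```

Each value is compared with the same weekday one week earlier, so a stable weekly pattern cancels out.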

In [None]:
from statsmodels.tsa.statespace.sarimax import SARIMAX
y_hat = test.copy()
fit1 = SARIMAX(train.values, order=(1, 0, 2),seasonal_order=(0,1,1,7)).fit()
y_pred  =  fit1.predict(dynamic=True)
y = fit1.forecast(len(test),dynamic=True)
y_predicted=pd.DataFrame(y,index=y_hat.index,columns=['sarima'])

plt.figure(figsize=(16,8))
plt.plot(test, label='Test')
plt.plot(y_predicted, label='SARIMA')
plt.legend(loc='best')
plt.show()

In [None]:
rms = sqrt(mean_squared_error(test,y_predicted))
print(rms)

### <div id= '3.3'>3.3 Auto ARIMA</div>

In [None]:
#building the model
model = auto_arima(train,start_p=1,d=1,start_q=1,max_p=3,max_d=2,max_q=3,start_P=1,D=1,start_Q=1,
                   max_P=2,max_D=1,max_Q=2,seasonal =True, m=7, max_order=5,stationary=False, trace=True, error_action='ignore', suppress_warnings=True)
model.fit(train)

forecast = model.predict(n_periods=len(test))
forecast = pd.DataFrame(forecast,index = test.index,columns=['Prediction'])

#plot the predictions for validation set
plt.figure(figsize=(16,8))
plt.plot(test, label='Valid')
plt.plot(forecast, label='Prediction')
plt.show()

In [None]:
rms = sqrt(mean_squared_error(test,forecast))
print(rms)

###  <div id= '3.4'>3.4 LSTM</div>

First, let us split the sequence to prepare the data in the form an LSTM requires.

       given sequence [10, 20, 30, 40, 50, 60, 70, 80, 90] into
            X,			y
        10, 20, 30		40
        20, 30, 40		50
        30, 40, 50		60
        ...

In [None]:
def split_sequence(sequence, n_steps_in, n_steps_out):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps_in
        out_end_ix = end_ix + n_steps_out
        # check if we are beyond the sequence
        if out_end_ix > len(sequence):
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix:out_end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)




We need to reshape the dataset into a 3D array for the LSTM, from [samples, timesteps] to [samples, timesteps, features]. Here we have only one feature, 'use', and the timesteps are the sequence of steps. We choose 28 input timesteps and 16 output timesteps to predict, because we need to validate on the test set, which has 16 records.
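
As a quick sanity check on the shapes, the number of training samples produced by a sliding window can be worked out in advance. The sketch below assumes the 307 training observations used in the next cell; `count_windows` and `split_lengths` are hypothetical helpers.

```python
def count_windows(n_obs, n_steps_in, n_steps_out):
    """Number of (input, output) pairs a sliding window yields."""
    return max(0, n_obs - n_steps_in - n_steps_out + 1)

def split_lengths(n_obs, n_steps_in, n_steps_out):
    """Brute-force count that mirrors the loop in split_sequence."""
    count = 0
    for i in range(n_obs):
        if i + n_steps_in + n_steps_out > n_obs:
            break
        count += 1
    return count

print(count_windows(307, 28, 16))  # 264
print(split_lengths(307, 28, 16))  # 264
```

So after reshaping, `X` would have shape (264, 28, 1) and `y` shape (264, 16), which is a small sample for a network of this size.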

In [None]:
# define input sequence
raw_seq = train[:307].values.tolist()
# choose a number of time steps
n_steps_in, n_steps_out = 28, 16
# split into samples
X, y = split_sequence(raw_seq, n_steps_in, n_steps_out)
# reshape from [samples, timesteps] into [samples, timesteps, features]
n_features = 1
X = X.reshape((X.shape[0], X.shape[1], n_features))
# define model

In [None]:
#LSTM model
model = Sequential()
model.add(LSTM(100, activation='relu', return_sequences=True, input_shape=(n_steps_in, n_features)))
model.add(LSTM(100, activation='relu'))
model.add(Dense(n_steps_out))
model.compile(optimizer='adam', loss='mse')
# fit model
model.fit(X, y, epochs=50, verbose=0)


In [None]:
# demonstrate prediction
x_input = train[307:].values
x_input = x_input.reshape((1, n_steps_in, n_features))
yhat = model.predict(x_input, verbose=0)
print(yhat)

In [None]:
yhat=yhat.reshape(16,1)

In [None]:

forecast = pd.DataFrame(yhat,index = test.index,columns=['Prediction'])

#plot the predictions for validation set
plt.figure(figsize=(16,8))
plt.plot(test, label='Valid')
plt.plot(forecast, label='Prediction')
plt.show()

In [None]:
rms = sqrt(mean_squared_error(test,forecast))
print(rms)

The LSTM couldn't produce better predictions than the statistical models. LSTMs need more data to tune their parameters, and in our case we have only about one year of data. LSTMs also tend to be better suited to long-horizon than to short-horizon forecasting.

### <div id= '3.5'>3.5 Facebook-Prophet</div>

The advantages of [Prophet](https://facebook.github.io/prophet/) are that we can model holiday effects, weekly and yearly seasonality, saturating growth, and more. Prophet requires the input columns to be named 'ds' and 'y'. We can also add regressors without much effort!

In [None]:
new_train= pd.DataFrame(train)
new_train['ds']=new_train.index
new_train['y']=new_train['use']
new_train.drop(['use'],axis = 1, inplace = True)
new_train=new_train.reset_index()
new_train.drop(['time'],axis = 1, inplace = True)

In [None]:
#model
m = Prophet(mcmc_samples=300, holidays_prior_scale=0.25, changepoint_prior_scale=0.01, seasonality_mode='additive', \
           seasonality_prior_scale=0.4, weekly_seasonality=True, \
            daily_seasonality=False)

m.fit(new_train)
future = m.make_future_dataframe(periods=16)
#prediction
forecast = m.predict(future)

In [None]:
c=m.plot_components(forecast)
plt.show()

In [None]:
d=m.plot(forecast)
plt.show()

In [None]:
predictions=pd.DataFrame(forecast[335:]['yhat'])
predictions.index=test.index

fig,ax=plt.subplots(figsize=(15,8))
test.plot(ax=ax)
predictions.plot(ax=ax)

In [None]:
len(train)

In [None]:
rms = sqrt(mean_squared_error(test,forecast[['yhat']][335:]))
print(rms)

Next, we add other variables to the model as regressors to capture their effect on the load.

In [None]:
temperature=dataD[dataD['month']<12].loc[:,'temperature']
rain=dataD[dataD['month']<12].loc[:,'precipIntensity']
wind=dataD[dataD['month']<12].loc[:,'windSpeed']

temperature=temperature.reset_index().drop('time',axis=1)
rain=rain.reset_index().drop('time',axis=1)
wind=wind.reset_index().drop('time',axis=1)

train_regressor=pd.concat([new_train,temperature,rain,wind],axis=1)

In [None]:
m = Prophet(mcmc_samples=300, holidays_prior_scale=0.25, changepoint_prior_scale=0.01, seasonality_mode='additive', \
           seasonality_prior_scale=0.4, weekly_seasonality=True, \
            daily_seasonality=False)

In [None]:
m.add_regressor('temperature', prior_scale=0.5, mode='additive')
m.add_regressor('precipIntensity', prior_scale=0.5, mode='additive')
m.add_regressor('windSpeed', prior_scale=0.5, mode='additive')

In [None]:
m.fit(train_regressor)
future = m.make_future_dataframe(periods=16)


In [None]:
testtemp=dataD.loc[:,'temperature']
testrain=dataD.loc[:,'precipIntensity']
testwind=dataD.loc[:,'windSpeed']
testtemp=testtemp.reset_index().drop('time',axis=1)
testrain=testrain.reset_index().drop('time',axis=1)
testwind=testwind.reset_index().drop('time',axis=1)
future['temperature']=testtemp
future['precipIntensity']=testrain
future['windSpeed']=testwind

In [None]:
future.tail()

In [None]:
forecast = m.predict(future)

In [None]:
f = m.plot_components(forecast)


In [None]:
d=m.plot(forecast)
plt.show()

In [None]:
predictions=pd.DataFrame(forecast[335:]['yhat'])
predictions.index=test.index

fig,ax=plt.subplots(figsize=(15,8))
test.plot(ax=ax)
predictions.plot(ax=ax)

In [None]:
rms = sqrt(mean_squared_error(test,forecast[['yhat']][335:]))
print(rms)

### <div id= '4'>4. Multi-Variate models</div>

### Before building multivariate models, we will check for *Granger causality*

In [None]:
maxlag=12
# the test name is passed through the function's `test` argument,
# so the `test` series from the train/test split is not overwritten
def grangers_causation_matrix(data, variables, test='ssr_chi2test', verbose=False):    
    """Check Granger causality for all possible combinations of the time series.
    The rows are the response variables and the columns are the predictors. The values
    in the table are p-values. A p-value below the significance level (0.05) means
    the null hypothesis, that the coefficients of the corresponding past values are
    zero (that is, X does not Granger-cause Y), can be rejected.

    data      : pandas dataframe containing the time series variables
    variables : list containing names of the time series variables.
    """
    df = pd.DataFrame(np.zeros((len(variables), len(variables))), columns=variables, index=variables)
    for c in df.columns:
        for r in df.index:
            test_result = grangercausalitytests(data[[r, c]], maxlag=maxlag, verbose=False)
            p_values = [round(test_result[i+1][0][test][1],4) for i in range(maxlag)]
            if verbose: print(f'Y = {r}, X = {c}, P Values = {p_values}')
            min_p_value = np.min(p_values)
            df.loc[r, c] = min_p_value
    df.columns = [var + '_x' for var in variables]
    df.index = [var + '_y' for var in variables]
    return df

grangers_causation_matrix(dataD[useweather], variables = useweather)  

 Usage is not Granger-caused by any of the weather variables: with p >= 0.05, we cannot reject the null hypothesis.
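
For context, each pairwise test compares two regressions of $y_t$: one on its own lags only, and one that also includes lags of $x_t$. In the notation below (introduced here for illustration, with $p$ the lag order):

$$
y_t = a_0 + \sum_{i=1}^{p} a_i\, y_{t-i} + \sum_{i=1}^{p} b_i\, x_{t-i} + \varepsilon_t,
\qquad H_0:\ b_1 = b_2 = \dots = b_p = 0
$$

Rejecting $H_0$ means past values of $x$ improve the prediction of $y$ beyond what past values of $y$ already provide; it does not establish causation in the everyday sense. Note that the matrix reports the minimum p-value over lags 1 to 12, which makes it more likely to flag a relationship than a single-lag test would.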

**VAR model**

The basis of Vector AutoRegression (VAR) is that each time series in the system influences the others. That is, you can predict a series from its own past values along with those of the other series in the system. Since the variables failed the Granger causality test, we won't be able to model this problem with VAR.
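
For reference, a VAR model of order $p$ for a vector $\mathbf{y}_t$ of $k$ series is usually written as follows (standard notation, not taken from the notebook):

$$
\mathbf{y}_t = \mathbf{c} + A_1 \mathbf{y}_{t-1} + A_2 \mathbf{y}_{t-2} + \dots + A_p \mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t
$$

Here $\mathbf{c}$ is a $k \times 1$ intercept vector, each $A_i$ is a $k \times k$ coefficient matrix, and $\boldsymbol{\varepsilon}_t$ is a vector of error terms. The off-diagonal entries of the $A_i$ carry the cross-series influence that the Granger test looks for.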

### <div id= '5'>5. Conclusion</div>

To conclude, Auto-ARIMA performed better than LSTM and Prophet. Due to the limited amount of data, statistical models outshone neural networks.

In [None]:
Image("/kaggle/input/conclusion/kagglextreme.PNG")

***

Future Work: Understanding [fireTS](https://pypi.org/project/fireTS/) and non-linear modelling of time-series.

***