# Electricity Usage Prediction Using Time Series 
<font size="3">@Cicily Wu</font>

![](https://assets.greentechmedia.com/assets/content/cache/made/assets/content/cache/remote/https_assets.greentechmedia.com/content/images/articles/Urban_Electric_Grid_721_420_80_s_c1.jpg)

# 1. Project Statement

Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values.

The demand for electricity has been increasing continuously over the years. To anticipate future consumption, a reliable forecasting method is needed, and time series analysis provides one.

In this project, I will build two different models to predict electricity usage:
1. ARIMA model: ARIMA, short for 'AutoRegressive Integrated Moving Average', is a class of models that explains a given time series based on its own past values, that is, its own lags and the lagged forecast errors, so that the resulting equation can be used to forecast future values.
2. LSTM neural network model: Long short-term memory (LSTM) is a recurrent neural network architecture used in deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections, so it can process not only single data points but also entire sequences of data.
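
As a minimal sketch of the "own lags and lagged forecast errors" idea, the snippet below computes a one-step forecast for the simplest mixed case, ARMA(1,1). The function name and all coefficients are hypothetical and chosen only for illustration; in a real ARIMA model they are estimated from the data, and the "Integrated" part means the same equation is applied to the differenced series.

```python
def arma11_forecast(y_prev, err_prev, c, phi, theta):
    """One-step-ahead ARMA(1,1) forecast: c + phi * y[t-1] + theta * e[t-1]."""
    return c + phi * y_prev + theta * err_prev


# Hypothetical inputs: last observed value, last forecast error, and made-up coefficients.
forecast = arma11_forecast(y_prev=100.0, err_prev=2.0, c=5.0, phi=0.9, theta=0.5)
print(forecast)
```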

The dataset includes the monthly electricity usage data from 1985-01-01 to 2018-01-01. 

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf 
from statsmodels.tsa.seasonal import seasonal_decompose 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
df=pd.read_csv('../input/time-series-datasets/Electric_Production.csv',parse_dates=[0])
df=df.rename(columns={'IPG2211A2N':'usage','DATE':'date'})
df = df.set_index('date')
df.head()

In [None]:
%matplotlib inline
df.plot(figsize=(8,4))

In [None]:
seasonal = seasonal_decompose(df.usage, model='add')
fig = seasonal.plot()  # plot() creates its own figure, so no plt.figure() is needed
fig.set_size_inches(10, 8)

<font color='purple'>From the plots above, we can see that the usage data shows strong seasonality. Therefore, instead of plain ARIMA, we choose to use SARIMA.</font>
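
To give a feel for what the seasonal panel of an additive decomposition represents, here is a simplified sketch: once the trend has been removed, the seasonal component is roughly the average of all values that share the same position in the cycle. The helper name and the detrended numbers are hypothetical; this is only the core idea, not the full procedure `seasonal_decompose` runs.

```python
def seasonal_means(detrended, period):
    """Average the values that share the same position within the seasonal cycle."""
    means = []
    for pos in range(period):
        same_position = detrended[pos::period]
        means.append(sum(same_position) / len(same_position))
    return means


# Hypothetical detrended series: a cycle of length 4 repeated three times with small noise.
detrended = [5, -1, -3, -1, 6, -2, -3, -1, 4, 0, -3, -1]
seasonal = seasonal_means(detrended, period=4)
print(seasonal)
```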

Seasonal Autoregressive Integrated Moving Average, SARIMA or Seasonal ARIMA, is an extension of ARIMA that explicitly supports univariate time series data with a seasonal component.
Configuring a SARIMA requires selecting hyperparameters for both the trend and seasonal elements of the series.

**Trend Elements:**
There are three trend elements that require configuration.

They are the same as the ARIMA model; specifically:

- p: Trend autoregression order.
- d: Trend difference order.
- q: Trend moving average order.

**Seasonal Elements:**
There are four seasonal elements that are not part of ARIMA that must be configured; they are:

- P: Seasonal autoregressive order.
- D: Seasonal difference order.
- Q: Seasonal moving average order.
- m: The number of time steps for a single seasonal period.

Together, the notation for a SARIMA model is specified as:

**SARIMA(p,d,q)(P,D,Q)m**
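
To make the two difference orders concrete: `d` differences neighbouring observations, while `D` differences observations that lie `m` steps apart. The sketch below uses a hypothetical toy series with `m = 4` (not the electricity data, where `m = 12`), and the helper `difference` is an illustrative name rather than part of any library.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical series: a linear trend plus a pattern that repeats every 4 steps.
m = 4
pattern = [0, 3, 1, -2]
series = [2 * t + pattern[t % m] for t in range(12)]

seasonal_diff = difference(series, lag=m)  # D = 1: the repeating pattern cancels out
both = difference(seasonal_diff, lag=1)    # d = 1 on top: the remaining constant drift cancels too

print(seasonal_diff)
print(both)
```

After the seasonal difference only the constant drift of the trend is left, and the additional first difference reduces it to zero.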

# 2. Seasonal ARIMA Model

<font color='purple'>I will use the augmented Dickey-Fuller test to check whether the data are stationary.</font>
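
The null hypothesis of the test is that the series has a unit root, that is, it is non-stationary. The decision rule can be sketched as follows: if the test statistic falls below the critical value at the chosen level (equivalently, if the p-value is small), the null is rejected and the series is treated as stationary. The helper name and the numbers below are hypothetical and are not taken from this notebook's output.

```python
def reject_unit_root(test_statistic, critical_values, level="5%"):
    """True when the ADF statistic is below the critical value, i.e. evidence of stationarity."""
    return test_statistic < critical_values[level]


# Hypothetical critical values and test statistics, for illustration only.
critical_values = {"1%": -3.45, "5%": -2.87, "10%": -2.57}

trending = reject_unit_root(-1.50, critical_values)     # statistic above the threshold
differenced = reject_unit_root(-4.20, critical_values)  # statistic below the threshold

print(trending, differenced)
```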

In [None]:
from statsmodels.tsa.stattools import adfuller   #Dickey-Fuller test
def test_stationarity(timeseries):

    # Determine rolling statistics
    rolmean = timeseries.rolling(window=20).mean()
    rolstd = timeseries.rolling(window=20).std()

    #Plot rolling statistics:
    fig = plt.figure(figsize=(12, 6))
    orig = plt.plot(timeseries, color='blue',label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show()

    #Perform Dickey-Fuller test:
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')  #autolag : {‘AIC’, ‘BIC’, ‘t-stat’, None}
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)

In [None]:
test_stationarity(df.usage)

<font color='purple'>We can see that the data are not stationary. Thus, I will build a "First Order Difference" column to remove the trend and stabilize the mean.</font>

In [None]:
df['first_difference'] = df.usage - df.usage.shift(1)   
test_stationarity(df.first_difference.dropna(inplace=False))

<font color='purple'>After processing, the data are now stationary. Let's build the model. `pmdarima` is a good package for finding the best parameters for a SARIMA model. Its helper function makes the whole process easier, since I do not have to tune the parameters manually.</font>
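
Automatic order selection of this kind compares candidate models by an information criterion such as AIC and keeps the one with the lowest score. The sketch below shows only that final comparison step; the candidate labels, log-likelihoods, and parameter counts are hypothetical placeholders, not results from this dataset.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical candidates: (label, log-likelihood, number of estimated parameters).
candidates = [
    ("SARIMA(1,0,1)(0,1,1)12", -710.0, 4),
    ("SARIMA(1,0,2)(0,1,1)12", -705.0, 5),
    ("SARIMA(2,0,2)(1,1,1)12", -704.5, 7),
]

scores = {label: aic(ll, k) for label, ll, k in candidates}
best = min(scores, key=scores.get)
print(best, scores[best])
```

The third candidate fits slightly better than the second, but its two extra parameters cost more than the gain in likelihood, so the criterion prefers the smaller model.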

In [None]:
!pip install pmdarima

In [None]:
import pmdarima as pm
from pmdarima.model_selection import train_test_split

In [None]:
df1=df.drop(columns='first_difference')

In [None]:
train, test = train_test_split(df1, train_size=320)

# Fit your model
model = pm.auto_arima(train, seasonal=True, m=12)

In [None]:
model.summary()

In [None]:
from statsmodels.tsa.statespace.sarimax import SARIMAX
pred_model = SARIMAX(train.usage, order=(1,0,2), seasonal_order=(0,1,1,12))
results = pred_model.fit()

In [None]:
# SARIMAX predictions are already on the original scale of the series
test_pred = results.predict(start=len(train), end=len(df) - 1)

In [None]:
test['usage'].plot(figsize = (12,5), label='real usage')
test_pred.plot(label = 'predicted usage')
plt.legend(loc='upper right')

In [None]:
from statsmodels.tools.eval_measures import rmse
arima_rmse_error = rmse(test['usage'], test_pred)
arima_mse_error = arima_rmse_error**2
print(f'MSE Error: {arima_mse_error}\nRMSE Error: {arima_rmse_error}')

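<font color='purple'>From the plot above and the statistics, we can see that the model reaches an MSE of 22.843 on the test set, which corresponds to an RMSE of about 4.78 in the units of the usage series. This is a reasonably good fit.</font>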
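
For reference, the relationship between the two error metrics can be sketched in plain Python: RMSE is the square root of the mean of the squared errors, so squaring it recovers the MSE. The function name and the numbers are hypothetical and unrelated to the results above.

```python
import math


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values, for illustration only.
actual = [100.0, 110.0, 120.0]
predicted = [97.0, 114.0, 120.0]

rmse_value = root_mean_squared_error(actual, predicted)
mse_value = rmse_value ** 2
print(f"MSE: {mse_value:.3f}, RMSE: {rmse_value:.3f}")
```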

# 3. LSTM Neural Network Model

<font color='purple'>First, I want to process the data. A common operation on time-series data is to shift or "lag" the values backward or forward in time, for example to calculate the percentage change from sample to sample. The pandas method for this is `.shift()`, which moves the values by a specified number of periods while leaving the index unchanged. A negative argument pulls later values up, so `shift(-k)` places the value from k months later on each row.</font>
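
The windowing built with negative shifts followed by `dropna()` can be sketched without pandas. The helper `make_windows` and the toy values are hypothetical; the sketch pairs each value with the `width` values that follow it and drops the rows at the end that cannot be completed.

```python
def make_windows(values, width):
    """Pair each value with the `width` values that follow it, dropping incomplete rows."""
    rows = []
    for t in range(len(values) - width):
        first = values[t]                          # plays the role of the 'usage' column
        following = values[t + 1 : t + 1 + width]  # plays the role of 'usage1' ... 'usage<width>'
        rows.append((first, following))
    return rows


# Hypothetical toy series with a window width of 3 (the notebook uses 12).
values = [10, 11, 12, 13, 14, 15]
rows = make_windows(values, width=3)
for row in rows:
    print(row)
```

With six values and a width of 3, only three complete rows remain, which mirrors how the last 12 months are lost when the width is 12.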

In [None]:
# Column k holds the value k months after the row's date, because shift(-k) pulls later values up.
# Rows without 12 following months contain NaN and are dropped.
train1 = pd.concat([train.shift(-k) for k in range(13)], axis=1).dropna()
train1.columns = ['usage'] + [f'usage{k}' for k in range(1, 13)]
train1.head()

In [None]:
# Same windowing for the test set: 'usage' plus the 12 following months per row.
test1 = pd.concat([test.shift(-k) for k in range(13)], axis=1).dropna()
test1.columns = ['usage'] + [f'usage{k}' for k in range(1, 13)]
test1.head()

In [None]:
train1_y=train1.loc[:, train1.columns == 'usage']
train1_x=train1.loc[:, train1.columns != 'usage']

test1_y=test1.loc[:, test1.columns == 'usage']
test1_x=test1.loc[:, test1.columns != 'usage']

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

model = Sequential()

model.add(LSTM(20, activation='relu',input_shape=(12, 1)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse',metrics=['mean_squared_error'])

model.summary()

<font color='purple'>The LSTM layer expects 3-dimensional input of shape (samples, time steps, features). Since our data is 2-dimensional, we need to add a trailing feature axis.</font>
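
What adding a trailing axis does to the data can be sketched with nested lists: every scalar is wrapped in its own one-element list, turning a (samples, time steps) table into (samples, time steps, 1). Both helper functions are hypothetical names used only for this illustration.

```python
def add_feature_axis(rows):
    """Wrap every scalar in its own list: (samples, steps) -> (samples, steps, 1)."""
    return [[[value] for value in row] for row in rows]


def shape(nested):
    """Shape of a regularly nested list, read from the first element at each level."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Hypothetical batch: 2 samples with 3 time steps each.
batch_2d = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
batch_3d = add_feature_axis(batch_2d)
print(shape(batch_2d), shape(batch_3d))
```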

In [None]:
train1_x = np.expand_dims(train1_x, 2)
test1_x = np.expand_dims(test1_x, 2)
print("New train data shape:")
print(train1_x.shape)
print("New test data shape:")
print(test1_x.shape)

In [None]:
run=model.fit(train1_x,train1_y,epochs=40)

In [None]:
plt.plot(run.epoch,run.history.get('loss'))

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
model.evaluate(test1_x,test1_y)

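<font color='purple'>After around 6-10 epochs, the model reached a training MSE of 10.87, while the MSE on the test dataset is 17.224. The test error is lower than the SARIMA model's 22.843, although the two are not computed on exactly the same rows, because the windowing step drops the last 12 months of the test set.</font>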

In [None]:
test1_pred=model.predict(test1_x)
test_pred=pd.DataFrame(test1_pred, columns=['test_pred']) 
#test_true=pd.DataFrame(test1_y, columns=['test_true']) 
test_pred.index=test1_y.index
test_pred=test_pred.merge(test1_y,left_index=True, right_index=True)

In [None]:
plt.figure(figsize=(12,5))
plt.plot( test_pred.index, 'usage', data=test_pred, markerfacecolor='blue', markersize=12, color='skyblue', linewidth=2,label='reality')
plt.plot( test_pred.index, 'test_pred', data=test_pred, color='orange', linewidth=2,label='prediction')
plt.legend(loc='upper right')

<font color='purple'>The plot above compares the real usage with the predicted usage on the test dataset.</font>