# Analysis of Meter Readings per Location

**Why is energy forecasting important?**

Long-term planning of energy supply and demand must satisfy a country's sustainable-development requirements [[1]](https://link.springer.com/article/10.1007%2Fs12667-016-0203-y#:~:text=Long%2Dterm%20planning%20of%20energy,operations%20of%20the%20supply%20system.). Accurate forecasts of the volume and trend of future energy consumption are therefore used to schedule and plan the operations of the supply system. This is especially useful for electricity, which cannot be stored in large amounts and so requires constant matching of supply and demand.

What we did:
* Used cleaned datasets for modelling and predictions (Dataset cleaning can be found in other kernels)
    * Combined dataset can be found here: https://www.kaggle.com/julietian/combined-all-8-meter-datasets
* Created Prediction models for electricity consumption per location 
    * With other meter readings as exogenous variables
* Explored relationships between each type of meter usage per location


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
lamb = pd.read_csv("/kaggle/input/meter-sums/lamb_meter_sums.csv")
crow = pd.read_csv("/kaggle/input/meter-sums/crow_meter_sums.csv")
bear = pd.read_csv("/kaggle/input/meter-sums/bear_meter_sums.csv")
robin = pd.read_csv("/kaggle/input/meter-sums/robin_meter_sums.csv")
panther = pd.read_csv("/kaggle/input/meter-sums/panther_meter_sums.csv")
bobcat = pd.read_csv("/kaggle/input/meter-sums/bobcat_meter_sums.csv")
rat = pd.read_csv("/kaggle/input/meter-sums/rat_meter_sums.csv")
fox = pd.read_csv("/kaggle/input/meter-sums/fox_meter_sums.csv")
shrew = pd.read_csv("/kaggle/input/meter-sums/shrew_meter_sums.csv")
mouse = pd.read_csv("/kaggle/input/meter-sums/mouse_meter_sums.csv")
peacock = pd.read_csv("/kaggle/input/meter-sums/peacock_meter_sums.csv")
hog = pd.read_csv("/kaggle/input/meter-sums/hog_meter_sums.csv")
cockatoo = pd.read_csv("/kaggle/input/meter-sums/cockatoo_meter_sums.csv")
moose = pd.read_csv("/kaggle/input/meter-sums/moose_meter_sums.csv")
gator = pd.read_csv("/kaggle/input/meter-sums/gator_meter_sums.csv")
eagle = pd.read_csv("/kaggle/input/meter-sums/eagle_meter_sums.csv")
wolf = pd.read_csv("/kaggle/input/meter-sums/wolf_meter_sums.csv")
bull = pd.read_csv("/kaggle/input/meter-sums/bull_meter_sums.csv")

In [None]:
!pip install pmdarima

In [None]:
pip install --upgrade pip

In [None]:
from scipy.stats import norm
import statsmodels.api as sm
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
# The older statsmodels.tsa.arima_model module was removed in recent statsmodels releases
from statsmodels.tsa.arima.model import ARIMA

from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

import pmdarima as pm

# Lamb

Lamb corresponds to the buildings at Cardiff - City Buildings located in Cardiff, UK.

**Finding Relationships**

In [None]:
lamb.head()

In [None]:
pp = sns.pairplot(lamb)

In [None]:
ax = plt.subplot()

ax.plot(lamb["timestamp"], lamb["Lamb_electricity_sum"],color = 'limegreen', label = 'electricity')
lamb["Lamb_gas_sum"].plot(figsize=(25,10), color ='red', label = 'gas')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 10))

#display legend
ax.legend()

From the graphs, electricity usage and gas usage appear to move together, though at different magnitudes: as electricity usage increases, gas usage increases as well.
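
A visual impression like this can be backed with a number. The sketch below computes a Pearson correlation coefficient on synthetic weekly totals; the data is invented for illustration and is not the Lamb dataset, so the printed value says nothing about the real site.

```python
import numpy as np

# Synthetic weekly totals, invented for illustration only (not the Lamb data).
rng = np.random.default_rng(0)
electricity = np.linspace(100.0, 200.0, 52) + rng.normal(0.0, 5.0, 52)
gas = 0.3 * electricity + rng.normal(0.0, 2.0, 52)

# Pearson correlation: +1 is a perfect increasing linear relationship,
# -1 a perfect decreasing one, and 0 no linear relationship.
r = np.corrcoef(electricity, gas)[0, 1]
print(f"Pearson r = {r:.2f}")
```

On the notebook's own data, the equivalent check would presumably be `lamb["Lamb_electricity_sum"].corr(lamb["Lamb_gas_sum"])`.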

**Train-test Split: first year for training, second year for testing**
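
The slicing used for this split can be sketched on a plain list. The helper name `chronological_split` and the length of 105 weekly rows are assumptions made for illustration; the point is that the rows stay in time order and that the final row is left out of the test set.

```python
def chronological_split(rows, holdout=53):
    """Split time-ordered rows with the same slicing as the notebook.

    The last `holdout` rows are set aside; of those, the final row is
    dropped, so the test set has `holdout - 1` rows.
    """
    split_at = len(rows) - holdout
    train = rows[:split_at]
    test = rows[split_at:len(rows) - 1]
    return train, test

# 105 weekly rows is an assumed length, chosen so that train has 52 rows.
weeks = list(range(105))
train, test = chronological_split(weeks)
print(len(train), len(test))
```

No shuffling is involved: a time series model must be trained only on observations that precede the ones it is evaluated on.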

In [None]:
lamb = lamb.set_index("timestamp")

In [None]:
lamb_model_data = lamb[["Lamb_electricity_sum", "Lamb_gas_sum"]]
train = lamb_model_data.iloc[0:(len(lamb_model_data)-53)].copy()
test = lamb_model_data.iloc[len(train):(len(lamb_model_data) -1)].copy()

**ACF PACF**

ACF estimates become less reliable as the lag increases. There are 52 observations in train, so only the first 20 lags are considered for the ACF.

In [None]:
plot_acf(train.Lamb_electricity_sum,lags=20)
plt.show()

In [None]:
plot_pacf(train.Lamb_electricity_sum,lags=20)
plt.show()

From the graphs, we see that the electricity series is not strongly autocorrelated beyond the first few lags.
* ACF at lags 0, 1 and 2 is significant. The remaining values lie inside the 95% confidence band, so they cannot be distinguished from zero.
* PACF at lags 0 and 1 is significant. The remaining values lie inside the 95% confidence band, so they cannot be distinguished from zero.
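
For reference, here is a minimal sketch of how a sample autocorrelation and an approximate 95% band can be computed by hand. It runs on synthetic white noise, and it uses the simple constant band of 1.96 divided by the square root of n; the band drawn by `plot_acf` may be computed differently, so treat this as an approximation rather than a reproduction of the plots.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation r_k for k = 0..nlags."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[k:] * centered[:len(x) - k]) / denom
        for k in range(nlags + 1)
    ])

# Synthetic white noise, for illustration only.
rng = np.random.default_rng(1)
noise = rng.normal(size=52)

acf = sample_acf(noise, nlags=20)
band = 1.96 / np.sqrt(len(noise))  # approximate 95% band around zero
significant_lags = np.flatnonzero(np.abs(acf) > band)
print("band: +/-%.3f" % band)
print("lags outside the band:", significant_lags)
```

Lag 0 is always 1 and therefore always outside the band. For the other lags, a value outside the band is treated as significant and a value inside it is not.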

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Lamb_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is significant, so the null hypothesis of a unit root is rejected and the series can be treated as stationary.

* This indicates that the statistical properties of the data, such as the mean and variance, are constant over time
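
The decision rule behind this reading can be written down explicitly. The helper below is hypothetical (it is not part of statsmodels) and the 0.05 threshold is an assumed convention; the p-values passed in are made up.

```python
def adf_conclusion(p_value, alpha=0.05):
    """Translate an ADF p-value into a plain-language conclusion.

    Null hypothesis of the ADF test: the series has a unit root
    (it is non-stationary).
    """
    if p_value < alpha:
        return "reject unit root: treat the series as stationary"
    return "cannot reject unit root: treat the series as non-stationary"

# Example p-values are made up for illustration.
print(adf_conclusion(0.01))
print(adf_conclusion(0.40))
```

In the notebook, the p-value is the second element of the tuple returned by `adfuller`, so `adf_conclusion(t[1])` would apply the rule to the test above.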

**Seasonal Decomposition**

In [None]:
decomp = seasonal_decompose(lamb["Lamb_electricity_sum"], period=12)

#change figure size 
fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there does not seem to be much of a trend throughout each year
* There seems to be a weak seasonality component of the data 
* Looking at the residuals, there appears to be randomness in the data 
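
To make the three panels easier to read, here is a minimal numpy sketch of a classical additive decomposition with an even period: a centered moving average for the trend, per-position averages of the detrended series for the seasonal part, and whatever is left as the residual. The function name `additive_decompose` and the synthetic series are assumptions for illustration, and the sketch is not guaranteed to match `seasonal_decompose` in every detail.

```python
import numpy as np

def additive_decompose(x, period):
    """Classical additive decomposition for an even `period`."""
    x = np.asarray(x, dtype=float)
    half = period // 2
    # Centered moving average: half weight on the two end points.
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    trend = np.full(len(x), np.nan)
    trend[half:len(x) - half] = np.convolve(x, weights, mode="valid")
    detrended = x - trend
    # Average the detrended values at each position in the cycle.
    cycle = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    cycle -= cycle.mean()
    seasonal = np.resize(cycle, len(x))  # repeat the cycle over the series
    resid = x - trend - seasonal
    return trend, seasonal, resid

# Synthetic series: linear trend plus a 12-step sine cycle, no noise.
t = np.arange(48)
series = 0.5 * t + 10.0 * np.sin(2 * np.pi * t / 12)
trend, seasonal, resid = additive_decompose(series, period=12)
print(np.nanmax(np.abs(resid)))
```

The first and last half-period of the trend are undefined because the moving average needs a full window, which is why the trend and residual panels are shorter than the observed series.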

**Train Model**

> start_p = 1: starting value of 1 for p, the order (number of time lags) of the autoregressive model

> start_q = 1: starting value of 1 for q, the order of the moving average model

> m = 52: number of observations per seasonal cycle; the data is weekly, so one cycle is a year

> start_P = 0: starting value of 0 for P, the order of the autoregressive portion of the seasonal model

> seasonal = True: fit a seasonal model, since energy usage can vary between seasons

> trace = True: print status while fitting

> stepwise = True: use the stepwise search, which fits only a subset of the hyper-parameter combinations; it is faster than a full grid search and is less likely to overfit


In [None]:
# Recent pmdarima releases take the exogenous regressors as `X` (formerly `exogenous`)
smodel = pm.auto_arima(train["Lamb_electricity_sum"], X=train[["Lamb_gas_sum"]],start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 1)
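
"Best" here means the candidate with the lowest information criterion in the trace; to my knowledge `auto_arima` uses AIC unless told otherwise. The sketch below shows the selection rule on made-up numbers; the candidate orders, log-likelihoods and parameter counts are hypothetical and are not taken from the trace above.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical candidates: (order, maximised log-likelihood, parameter count).
candidates = [
    ((1, 1, 1), -250.0, 3),
    ((0, 1, 1), -250.4, 2),
    ((0, 1, 0), -255.0, 1),
]

scores = {order: aic(ll, k) for order, ll, k in candidates}
best_order = min(scores, key=scores.get)
print(scores)
print("lowest AIC:", best_order)
```

The penalty term `2 * n_params` is why a simpler model can win even when a larger one fits the training data slightly better.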

**Fit Model**

In [None]:
mod = sm.tsa.statespace.SARIMAX(endog = train["Lamb_electricity_sum"],exog = sm.add_constant(train["Lamb_gas_sum"]), order=(0,1,1), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Lamb_electricity_sum"].plot(figsize=(25,10), color="orange")
mod_fit.fittedvalues.plot()
plt.show()

**Predict Using Model**

In [None]:
# Forecast the test period; without start/end, predict() returns in-sample values for the training period
predict = mod_fit.predict(start = len(train), end = len(train)+len(test)-1, exog = sm.add_constant(test[["Lamb_gas_sum"]]))
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Lamb_electricity_sum"].plot(figsize=(25,10),color = 'limegreen')
test['predicted'].plot()
plt.show()

In [None]:
test['residual'] = abs(test["Lamb_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Lamb_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

The predictions follow the actual trend, but with exaggerated magnitudes.
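
For reference, the two error metrics printed above can be written as small functions. This is a sketch on made-up numbers; the function names are assumptions, and MAPE is only meaningful when the actual values are non-zero.

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error, in the units of the data."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs(actual - predicted))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs(actual - predicted) / np.abs(actual)) * 100

# Made-up values for illustration.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print("MAE:", mae(actual, predicted))
print("MAPE:", mape(actual, predicted))
```

MAE depends on the scale of each site's consumption, so MAPE is the more natural metric when comparing sites of different sizes.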

In [None]:
mod_fit.resid.plot(figsize= (10,5))

# Crow

Crow corresponds to the buildings at Carleton University located in Ottawa, Canada.

**Finding Relationships**

In [None]:
crow.head()

In [None]:
pp = sns.pairplot(crow)

In [None]:
ax = plt.subplot()

ax.plot(crow["timestamp"], crow["Crow_electricity_sum"],color = 'limegreen', label = 'electricity')
ax.plot(crow["timestamp"], crow["Crow_chilled_sum"], color = 'blue', label = 'chilled water')
crow["Crow_hot_sum"].plot(figsize=(25,10), color ='red', label = 'hot water')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 10))

#display legend
ax.legend()

From the graphs, we can see that as electricity usage increases, chilled water usage also increases. Hot water usage has an inverse relationship with electricity and chilled water usage: as hot water usage increases, electricity and chilled water usage decrease.

**Train-test Split: first year for training, second year for testing**

In [None]:
crow = crow.set_index("timestamp")

In [None]:
crow_model_data = crow[["Crow_electricity_sum", "Crow_chilled_sum", "Crow_hot_sum"]]
train = crow_model_data.iloc[0:(len(crow_model_data)-53)].copy()
test = crow_model_data.iloc[len(train):(len(crow_model_data) -1)].copy()

**ACF PACF**

ACF estimates become less reliable as the lag increases. There are 52 observations in train, so only the first 20 lags are considered for the ACF.

In [None]:
plot_acf(train.Crow_electricity_sum,lags=20)
plt.show()

In [None]:
plot_pacf(train.Crow_electricity_sum,lags=20)
plt.show()

From the graphs, we see strong autocorrelation at the first few lags, but the values at later lags may not be significant.
* ACF at lags 0 to 4 is significant. The remaining values lie inside the 95% confidence band, so they cannot be distinguished from zero.
* PACF at lags 0 and 1 is significant. The remaining values lie inside the 95% confidence band, so they cannot be distinguished from zero.

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Crow_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is not significant, so the null hypothesis of a unit root cannot be rejected and the series is treated as non-stationary.

* This suggests that the statistical properties of the data, such as the mean and variance, may not be constant over time
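
A non-stationary series is usually handled by differencing, which is what `d = 1` requests in the model search. The sketch below uses a synthetic series with a pure linear trend (an assumption for illustration) to show that the first difference has a constant level.

```python
import numpy as np

# Synthetic series with a linear trend: the mean changes over time.
t = np.arange(10)
series = 3.0 * t + 5.0

# First difference: y[t] - y[t-1]. This is what d = 1 applies.
diffed = np.diff(series)
print(series)
print(diffed)
```

The differenced series is one observation shorter than the original, and its values no longer drift upward.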

**Seasonal Decomposition**

In [None]:
decomp = seasonal_decompose(crow["Crow_electricity_sum"], period=12)

fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there is a cyclic trend each year.
    * The trend shows increase of electricity usage in the summer months. This may be attributed to longer days in the summer.
* There seems to be a seasonality component of the data 
* Looking at the residuals, there appears to be randomness in the data 

**Train Model**

> start_p = 1: starting value of 1 for p, the order (number of time lags) of the autoregressive model

> start_q = 1: starting value of 1 for q, the order of the moving average model

> m = 52: number of observations per seasonal cycle; the data is weekly, so one cycle is a year

> start_P = 0: starting value of 0 for P, the order of the autoregressive portion of the seasonal model

> seasonal = True: fit a seasonal model, since energy usage can vary between seasons

> trace = True: print status while fitting

> stepwise = True: use the stepwise search, which fits only a subset of the hyper-parameter combinations; it is faster than a full grid search and is less likely to overfit

In [None]:
# Recent pmdarima releases take the exogenous regressors as `X` (formerly `exogenous`)
smodel = pm.auto_arima(train["Crow_electricity_sum"], X=train[["Crow_chilled_sum", "Crow_hot_sum"]],start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 0)
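
An order of (0, 1, 0) has no autoregressive or moving average terms, only one round of differencing, which makes the model a random walk. With a constant term it becomes a random walk with drift, whose forecast is the last observed value plus a fixed step per period. The sketch below illustrates that idea on made-up numbers; it deliberately ignores the exogenous regressors, so it is a simplification and not the fitted model.

```python
import numpy as np

def drift_forecast(history, steps):
    """Forecast a random walk with drift `steps` periods ahead."""
    history = np.asarray(history, dtype=float)
    drift = np.mean(np.diff(history))  # average change per period
    return history[-1] + drift * np.arange(1, steps + 1)

# Made-up history that rises by 2 each period.
forecast = drift_forecast([10, 12, 14, 16], steps=3)
print(forecast)
```

Because such a forecast carries no seasonal or short-term structure of its own, any week-to-week shape in the predictions has to come from the exogenous meter readings.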

**Fit Model**

In [None]:
mod = sm.tsa.statespace.SARIMAX(endog = train["Crow_electricity_sum"],exog = sm.add_constant(train[["Crow_chilled_sum", "Crow_hot_sum"]]), order=(0,1,0), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Crow_electricity_sum"].plot(figsize=(25,10), color="orange")
mod_fit.fittedvalues.plot()
plt.show()

**Predict Using Model**

In [None]:
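# Forecast the test period; without start/end, predict() returns in-sample values for the training period
predict = mod_fit.predict(start = len(train), end = len(train)+len(test)-1, exog = sm.add_constant(test[["Crow_chilled_sum", "Crow_hot_sum"]]))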
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Crow_electricity_sum"].plot(figsize=(25,10),color = 'limegreen')
test['predicted'].plot()
plt.show()

In [None]:
test['residual'] = abs(test["Crow_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Crow_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

In [None]:
mod_fit.resid.plot(figsize= (10,5))

# Bear

Bear corresponds to the buildings at University of California - Berkeley located in Berkeley, USA.

**Finding Relationships**

In [None]:
bear.head()

In [None]:
pp = sns.pairplot(bear)

In [None]:
ax = plt.subplot()

ax.plot(bear["timestamp"], bear["Bear_electricity_sum"],color = 'limegreen', label = 'electricity')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 20))

#display legend
ax.legend()

From the graph, we can see that there is decreased electricity usage at the start and end of each year.

**Train-test Split: first year for training, second year for testing**

In [None]:
bear = bear.set_index("timestamp")

In [None]:
bear_model_data = bear[["Bear_electricity_sum"]]
train = bear_model_data.iloc[0:(len(bear_model_data)-53)].copy()
test = bear_model_data.iloc[len(train):(len(bear_model_data) -1)].copy()

**ACF PACF**

ACF estimates become less reliable as the lag increases. There are 52 observations in train, so only the first 20 lags are considered for the ACF.

In [None]:
plot_acf(train.Bear_electricity_sum,lags=20)
plt.show()

In [None]:
plot_pacf(train.Bear_electricity_sum,lags=20)
plt.show()

From the graphs, we see that the electricity series is not strongly autocorrelated beyond the first lag.

* ACF at lags 0 and 1 is significant. The remaining values lie inside the 95% confidence band, so they cannot be distinguished from zero.
* PACF at lags 0 and 1 is significant. The remaining values lie inside the 95% confidence band, so they cannot be distinguished from zero.

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Bear_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is not significant, so the null hypothesis of a unit root cannot be rejected and the series is treated as non-stationary.

* This suggests that the statistical properties of the data, such as the mean and variance, may not be constant over time

**Seasonal Decomposition**

In [None]:
decomp = seasonal_decompose(bear["Bear_electricity_sum"], period=12)

fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there is a trend each year.
    * The trend shows increase of electricity usage in the fall/winter months. This may be attributed to the colder days during this period.
* There seems to be a seasonality component of the data
* Looking at the residuals, there appears to be randomness in the data

**Train Model**

> start_p = 1: starting value of 1 for p, the order (number of time lags) of the autoregressive model

> start_q = 1: starting value of 1 for q, the order of the moving average model

> m = 52: number of observations per seasonal cycle; the data is weekly, so one cycle is a year

> start_P = 0: starting value of 0 for P, the order of the autoregressive portion of the seasonal model

> seasonal = True: fit a seasonal model, since energy usage can vary between seasons

> trace = True: print status while fitting

> stepwise = True: use the stepwise search, which fits only a subset of the hyper-parameter combinations; it is faster than a full grid search and is less likely to overfit

In [None]:
smodel = pm.auto_arima(train["Bear_electricity_sum"],start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 0)

**Fit Model**

In [None]:
mod = sm.tsa.statespace.SARIMAX(endog = train["Bear_electricity_sum"], order=(0,1,0), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Bear_electricity_sum"].plot(figsize=(25,10), color="orange")
mod_fit.fittedvalues.plot()
plt.show()

**Predict Using Model**

In [None]:
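# Forecast the test period; without start/end, predict() returns in-sample values for the training period
predict = mod_fit.predict(start = len(train), end = len(train)+len(test)-1)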
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Bear_electricity_sum"].plot(figsize=(25,10),color = 'limegreen')
test['predicted'].plot()
plt.show()

In [None]:
test['residual'] = abs(test["Bear_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Bear_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

In [None]:
mod_fit.resid.plot(figsize= (10,5))

# Robin

Robin corresponds to the buildings at University College London located in London, UK.

**Finding Relationships**

In [None]:
robin.head()

In [None]:
pp = sns.pairplot(robin)

In [None]:
ax = plt.subplot()

ax.plot(robin["timestamp"], robin["Robin_electricity_sum"],color = 'limegreen', label = 'electricity')
robin["Robin_hot_sum"].plot(figsize=(25,10), color ='red', label = 'hot water')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 10))

#display legend
ax.legend()

From the graphs, there is no clear relationship between electricity and hot water usage, although each meter does seem to have its own yearly trend.

**Train-test Split: first year for training, second year for testing**

In [None]:
robin = robin.set_index("timestamp")

In [None]:
robin_model_data = robin[["Robin_electricity_sum", "Robin_hot_sum"]]
train = robin_model_data.iloc[0:(len(robin_model_data)-53)].copy()
test = robin_model_data.iloc[len(train):(len(robin_model_data) -1)].copy()

**ACF PACF**

ACF estimates become less reliable as the lag increases. There are 52 observations in train, so only the first 20 lags are considered for the ACF.

In [None]:
plot_acf(train.Robin_electricity_sum,lags=20)
plt.show()

In [None]:
plot_pacf(train.Robin_electricity_sum,lags=20)
plt.show()

From the graphs, we see that the electricity series is not strongly autocorrelated beyond the first lag.
* ACF at lags 0 and 1 is significant. The remaining values lie inside the 95% confidence band, so they cannot be distinguished from zero.
* PACF at lags 0 and 1 is significant. The remaining values lie inside the 95% confidence band, so they cannot be distinguished from zero.

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Robin_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is significant, so the null hypothesis of a unit root is rejected and the series can be treated as stationary.

* This indicates that the statistical properties of the data, such as the mean and variance, are constant over time

**Seasonal Decomposition**

In [None]:
decomp = seasonal_decompose(robin["Robin_electricity_sum"], period=12)

fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there does not seem to be much of a trend throughout each year
* There seems to be a seasonality component of the data
* Looking at the residuals, there appears to be randomness in the data

**Train Model**

> start_p = 1: starting value of 1 for p, the order (number of time lags) of the autoregressive model

> start_q = 1: starting value of 1 for q, the order of the moving average model

> m = 52: number of observations per seasonal cycle; the data is weekly, so one cycle is a year

> start_P = 0: starting value of 0 for P, the order of the autoregressive portion of the seasonal model

> seasonal = True: fit a seasonal model, since energy usage can vary between seasons

> trace = True: print status while fitting

> stepwise = True: use the stepwise search, which fits only a subset of the hyper-parameter combinations; it is faster than a full grid search and is less likely to overfit

In [None]:
# Recent pmdarima releases take the exogenous regressors as `X` (formerly `exogenous`)
smodel = pm.auto_arima(train["Robin_electricity_sum"], X=train[["Robin_hot_sum"]],start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 0)

**Fit Model**

In [None]:
mod = sm.tsa.statespace.SARIMAX(endog = train["Robin_electricity_sum"],exog = sm.add_constant(train["Robin_hot_sum"]), order=(0,1,0), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Robin_electricity_sum"].plot(figsize=(25,10), color="orange")
mod_fit.fittedvalues.plot()
plt.show()

**Predict Using Model**

In [None]:
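# Forecast the test period; without start/end, predict() returns in-sample values for the training period
predict = mod_fit.predict(start = len(train), end = len(train)+len(test)-1, exog = sm.add_constant(test[["Robin_hot_sum"]]))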
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Robin_electricity_sum"].plot(figsize=(25,10),color = 'limegreen')
test['predicted'].plot()
plt.show()

In [None]:
test['residual'] = abs(test["Robin_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Robin_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

In [None]:
mod_fit.resid.plot(figsize= (10,5))

# Panther

Panther corresponds to the buildings at University of Central Florida located in Orlando, USA.

**Finding Relationships**

In [None]:
panther.head()

In [None]:
pp = sns.pairplot(panther)

In [None]:
ax = plt.subplot()

ax.plot(panther["timestamp"], panther["Panther_electricity_sum"],color = 'limegreen', label = 'electricity')
ax.plot(panther["timestamp"], panther["Panther_gas_sum"],color = 'lightpink', label = 'gas')
ax.plot(panther["timestamp"], panther["Panther_irrigation_sum"],color = 'peru', label = 'irrigation')
ax.plot(panther["timestamp"], panther["Panther_chilled_sum"], color = 'blue', label = 'chilled water')
panther["Panther_water_sum"].plot(figsize=(25,10), color ='darkcyan', label = 'water')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 10))

#display legend
ax.legend()

From the graphs, we can see that increases in gas, water and chilled water usage happen together, but at different magnitudes. Irrigation and electricity usage remain relatively steady throughout each year.

**Train-test Split: first year for training, second year for testing**

In [None]:
panther = panther.set_index("timestamp")

In [None]:
panther_model_data = panther[["Panther_electricity_sum", "Panther_gas_sum", "Panther_irrigation_sum", "Panther_water_sum", "Panther_chilled_sum"]]
train = panther_model_data.iloc[0:(len(panther_model_data)-53)].copy()
test = panther_model_data.iloc[len(train):(len(panther_model_data) -1)].copy()

**ACF PACF**

ACF estimates become less reliable as the lag increases. There are 52 observations in train, so only the first 15 lags are considered for the ACF.



In [None]:
plot_acf(train.Panther_electricity_sum,lags=15)
plt.show()

In [None]:
plot_pacf(train.Panther_electricity_sum,lags=15)
plt.show()

From the graphs, we see that the electricity series is not strongly autocorrelated beyond the first few lags.

* ACF at lags 0 to 3 is significant. The remaining values lie inside the 95% confidence band, so they cannot be distinguished from zero.
* PACF is significant at most lags. The remaining values (lags 5 and 14) lie inside the 95% confidence band, so they cannot be distinguished from zero.

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Panther_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is not significant, so the null hypothesis of a unit root cannot be rejected and the series is treated as non-stationary.

* This suggests that the statistical properties of the data, such as the mean and variance, may not be constant over time

**Seasonal Decomposition**

In [None]:
decomp = seasonal_decompose(panther["Panther_electricity_sum"], period=12)

fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there does not seem to be much of a trend throughout the first year, but there is a linear trend in the second year
    * This could indicate unrepresentative data. Improper data filling could have occurred for this site.
* There seems to be a seasonality component of the data
* Looking at the residuals, there appears to be randomness in the data
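
One way to check the suspicion about data filling is to look for long stretches of points that lie exactly on a straight line, which linear interpolation would produce. The helper below is a hypothetical sketch, not part of the original analysis; the tolerance and the sample values are assumptions.

```python
import numpy as np

def longest_linear_run(values, tol=1e-9):
    """Length of the longest run of consecutive points on one straight line.

    Returns 0 when no three consecutive points are collinear.
    """
    second_diff = np.diff(np.asarray(values, dtype=float), n=2)
    longest = current = 0
    for is_flat in np.abs(second_diff) <= tol:
        current = current + 1 if is_flat else 0
        longest = max(longest, current)
    # k consecutive zero second differences span k + 2 points.
    return longest + 2 if longest else 0

# Made-up series: the values 2, 4, 6, 8, 10 form a suspiciously straight stretch.
sample = [1, 5, 2, 4, 6, 8, 10, 3]
print(longest_linear_run(sample))
```

Applied to `panther["Panther_electricity_sum"]`, an unusually long run in the second year would support the filling hypothesis, while a short one would point to a genuine change in usage.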

**Train Model**

> start_p = 1: starting value of 1 for p, the order (number of time lags) of the autoregressive model

> start_q = 1: starting value of 1 for q, the order of the moving average model

> m = 52: number of observations per seasonal cycle; the data is weekly, so one cycle is a year

> start_P = 0: starting value of 0 for P, the order of the autoregressive portion of the seasonal model

> seasonal = True: fit a seasonal model, since energy usage can vary between seasons

> trace = True: print status while fitting

> stepwise = True: use the stepwise search, which fits only a subset of the hyper-parameter combinations; it is faster than a full grid search and is less likely to overfit

In [None]:
# Recent pmdarima releases take the exogenous regressors as `X` (formerly `exogenous`)
smodel = pm.auto_arima(train["Panther_electricity_sum"], X=train[["Panther_gas_sum", "Panther_irrigation_sum", "Panther_water_sum", "Panther_chilled_sum"]],start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 1)

**Fit Model**

In [None]:
mod = sm.tsa.statespace.SARIMAX(endog = train["Panther_electricity_sum"],exog = sm.add_constant(train[["Panther_gas_sum", "Panther_irrigation_sum", "Panther_water_sum", "Panther_chilled_sum"]]), order=(0,1,1), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Panther_electricity_sum"].plot(figsize=(25,10), color="orange")
mod_fit.fittedvalues.plot()
plt.show()

**Predict Using Model**

In [None]:
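# Forecast the test period; without start/end, predict() returns in-sample values for the training period
predict = mod_fit.predict(start = len(train), end = len(train)+len(test)-1, exog = sm.add_constant(test[["Panther_gas_sum", "Panther_irrigation_sum", "Panther_water_sum", "Panther_chilled_sum"]]))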
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Panther_electricity_sum"].plot(figsize=(25,10),color = 'limegreen')
test['predicted'].plot()
plt.show()

In [None]:
test['residual'] = abs(test["Panther_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Panther_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

In [None]:
mod_fit.resid.plot(figsize= (10,5))

# Bobcat

Bobcat building location remains anonymous.

**Finding Relationships**

In [None]:
bobcat.head()

In [None]:
pp = sns.pairplot(bobcat)

In [None]:
ax = plt.subplot()

ax.plot(bobcat["timestamp"], bobcat["Bobcat_electricity_sum"],color = 'limegreen', label = 'electricity')
ax.plot(bobcat["timestamp"], bobcat["Bobcat_gas_sum"],color = 'lightpink', label = 'gas')
ax.plot(bobcat["timestamp"], bobcat["Bobcat_solar_sum"],color = 'goldenrod', label = 'solar')
ax.plot(bobcat["timestamp"], bobcat["Bobcat_chilled_sum"], color = 'blue', label = 'chilled water')
ax.plot(bobcat["timestamp"], bobcat["Bobcat_hot_sum"], color = 'red', label = 'hot water')
bobcat["Bobcat_water_sum"].plot(figsize=(25,10), color ='darkcyan', label = 'water')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 10))

#display legend
ax.legend()

From the graphs, there do not seem to be any clear relationships between the usage types.

**Train-test Split: first year for training, second year for testing**

In [None]:
bobcat = bobcat.set_index("timestamp")

In [None]:
bobcat_model_data = bobcat[["Bobcat_electricity_sum", "Bobcat_gas_sum", "Bobcat_solar_sum", "Bobcat_water_sum", "Bobcat_chilled_sum", "Bobcat_hot_sum"]]
train = bobcat_model_data.iloc[0:(len(bobcat_model_data)-53)].copy()
test = bobcat_model_data.iloc[len(train):(len(bobcat_model_data) -1)].copy()

**ACF PACF**

ACF estimates become less reliable as the lag increases. There are 52 observations in train, so only the first 20 lags are considered for the ACF.

In [None]:
plot_acf(train.Bobcat_electricity_sum,lags=20)
plt.show()

In [None]:
plot_pacf(train.Bobcat_electricity_sum,lags=20)
plt.show()

From the graphs, we see that the electricity series is not strongly autocorrelated beyond the first few lags.

* ACF at lags 0, 1 and 2 is significant. The remaining values lie inside the 95% confidence band, so they cannot be distinguished from zero.
* PACF at lags 0 and 1 is significant. The remaining values lie inside the 95% confidence band, so they cannot be distinguished from zero.

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Bobcat_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is significant, so the null hypothesis of a unit root is rejected and the series can be treated as stationary.

* This indicates that the statistical properties of the data, such as the mean and variance, are constant over time

**Seasonal Decomposition**

In [None]:
decomp = seasonal_decompose(bobcat["Bobcat_electricity_sum"], period=12)

fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there does not seem to be much of a trend throughout each year
* There seems to be a seasonality component of the data
* Looking at the residuals, there appears to be randomness in the data

**Train Model**

> start_p = 1: starting value of 1 for p, the order (number of time lags) of the autoregressive model

> start_q = 1: starting value of 1 for q, the order of the moving average model

> m = 52: number of observations per seasonal cycle; the data is weekly, so one cycle is a year

> start_P = 0: starting value of 0 for P, the order of the autoregressive portion of the seasonal model

> seasonal = True: fit a seasonal model, since energy usage can vary between seasons

> trace = True: print status while fitting

> stepwise = True: use the stepwise search, which fits only a subset of the hyper-parameter combinations; it is faster than a full grid search and is less likely to overfit

In [None]:
# Recent pmdarima releases take the exogenous regressors as `X` (formerly `exogenous`)
smodel = pm.auto_arima(train["Bobcat_electricity_sum"], X=train[["Bobcat_gas_sum", "Bobcat_solar_sum", "Bobcat_water_sum", "Bobcat_chilled_sum", "Bobcat_hot_sum"]],start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 0)

**Fit Model**

In [None]:
mod = sm.tsa.statespace.SARIMAX(endog = train["Bobcat_electricity_sum"],exog = sm.add_constant(train[["Bobcat_gas_sum", "Bobcat_solar_sum", "Bobcat_water_sum", "Bobcat_chilled_sum", "Bobcat_hot_sum"]]), order=(0,1,0), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Bobcat_electricity_sum"].plot(figsize=(25,10), color="orange")
mod_fit.fittedvalues.plot()
plt.show()

**Predict Using Model**

In [None]:
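# Forecast the test period; without start/end, predict() returns in-sample values for the training period
predict = mod_fit.predict(start = len(train), end = len(train)+len(test)-1, exog = sm.add_constant(test[["Bobcat_gas_sum", "Bobcat_solar_sum", "Bobcat_water_sum", "Bobcat_chilled_sum", "Bobcat_hot_sum"]]))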
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Bobcat_electricity_sum"].plot(figsize=(25,10),color = 'limegreen')
test['predicted'].plot()
plt.show()

In [None]:
test['residual'] = abs(test["Bobcat_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Bobcat_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

In [None]:
mod_fit.resid.plot(figsize= (10,5))

# Rat

Rat corresponds to the buildings at Washington DC - City Buildings located in Washington DC, USA.

**Finding Relationships**

In [None]:
rat.head()

In [None]:
pp = sns.pairplot(rat)

In [None]:
ax = plt.subplot()

ax.plot(rat["timestamp"], rat["Rat_electricity_sum"],color = 'limegreen', label = 'electricity')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 20))

#display legend
ax.legend()

From the graphs, we can see that there is no clear trend throughout the two years.

**Train-test Split: first year for training, second year for testing**

In [None]:
rat = rat.set_index("timestamp")

In [None]:
rat_model_data = rat[["Rat_electricity_sum"]]
train = rat_model_data.iloc[0:(len(rat_model_data)-53)].copy()
test = rat_model_data.iloc[len(train):(len(rat_model_data) -1)].copy()
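
The slice bounds used for the split can be checked in isolation. The sketch below assumes the frame holds 105 weekly rows (an assumption, chosen because it reproduces the 52 training observations mentioned in the next section); `split_bounds` is a hypothetical helper name.

```python
def split_bounds(n_rows, holdout=53):
    """Return (train_slice, test_slice) mirroring the iloc bounds used above."""
    train_end = n_rows - holdout
    # The test slice stops at n_rows - 1, so the final row is left out of both sets.
    return slice(0, train_end), slice(train_end, n_rows - 1)

# Assumed length: 105 weekly rows.
train_idx, test_idx = split_bounds(105)
rows = list(range(105))
print(len(rows[train_idx]), len(rows[test_idx]))
```

The same slices can be passed to `DataFrame.iloc`, which makes it explicit that the last row is excluded from the test set.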

**ACF PACF**

ACF becomes more unreliable as the lag increases. There are a total of 52 observations in train, so only the first 20 lags will be considered for the ACF.

In [None]:
plot_acf(train.Rat_electricity_sum,lags=20)
plt.show()

In [None]:
plot_pacf(train.Rat_electricity_sum,lags=20)
plt.show()

From the graphs, we see that there is not a strong autocorrelation in the series.

* ACF is significant at lags 0 to 3. The remaining lags fall inside the 95% confidence band, so they are not statistically significant.
* PACF is significant at lags 0, 1, 5, 6, 17 and 18. The remaining lags fall inside the 95% confidence band, so they are not statistically significant.
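
To make the significance reading concrete, the sketch below computes a sample ACF by hand and compares each lag against an approximate 95% band of ±1.96/√n. This is an illustration on synthetic data, not the notebook's series, and the band assumes white noise; the band drawn by `plot_acf` may differ slightly because statsmodels can widen it with the lag.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)])

rng = np.random.default_rng(0)
# Synthetic stand-in for 52 weekly readings: a random walk.
series = np.cumsum(rng.normal(size=52))

acf = sample_acf(series, nlags=20)
band = 1.96 / np.sqrt(len(series))  # approximate 95% band under a white-noise assumption
# Lag 0 is always 1, so only lags 1..20 are tested against the band.
significant_lags = [k for k in range(1, 21) if abs(acf[k]) > band]
print(significant_lags)
```

A lag counts as significant only when its bar extends beyond the band; bars inside the band are compatible with zero autocorrelation.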

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Rat_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is not significant, so the ADF test fails to reject the null hypothesis of a unit root and the series is treated as non-stationary.

* This indicates that statistical properties of the data, such as the mean, variance and standard deviation, are not constant over time
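
The decision rule behind this reading can be written as a small function. The sketch below is illustrative: `interpret_adf` is a hypothetical helper and the p-values passed to it are made up, not results from this notebook.

```python
def interpret_adf(p_value, alpha=0.05):
    """Map an ADF p-value to a plain-language conclusion.

    The null hypothesis of the ADF test is that the series has a unit root,
    i.e. that it is non-stationary.
    """
    if p_value < alpha:
        return "reject unit root: treat series as stationary"
    return "fail to reject unit root: treat series as non-stationary"

# Hypothetical p-values, for illustration only.
print(interpret_adf(0.30))
print(interpret_adf(0.01))
```

In the notebook the p-value is the second element of the tuple returned by `sm.tsa.adfuller`, so the call would be `interpret_adf(t[1])`.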

**Seasonal Decomposition**

In [None]:
decomp = seasonal_decompose(rat["Rat_electricity_sum"], period=12)

fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there is a trend each year but of different strengths
    * The trend shows increase of electricity usage in the summer months. This may be attributed to the warmer days during this period.
    * This trend is stronger in the first year compared to the second year
* There seems to be a seasonality component of the data
* Looking at the residuals, there appears to be randomness in the data
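
For readers who want to see what the three panels represent, here is a minimal sketch of a classical additive decomposition: a centered moving average for the trend, per-phase means for the seasonal part, and whatever is left as the residual. It is a simplified stand-in written with NumPy on synthetic data, not the statsmodels implementation, and `additive_decompose` is a hypothetical name.

```python
import numpy as np

def additive_decompose(x, period):
    """Naive additive decomposition: moving-average trend, per-phase seasonal means."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Centered moving average; for an even period the two end weights are halved.
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(n, np.nan)
    trend[half : n - half] = np.convolve(x, weights, mode="valid")
    detrended = x - trend
    # Average the detrended values at each position within the cycle.
    phase_means = np.array([np.nanmean(detrended[p::period]) for p in range(period)])
    phase_means -= phase_means.mean()
    seasonal = np.tile(phase_means, n // period + 1)[:n]
    resid = x - trend - seasonal
    return trend, seasonal, resid

t = np.arange(104)
# Synthetic series: linear trend plus a cycle of length 12 (matching period=12 above).
x = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12)
trend, seasonal, resid = additive_decompose(x, period=12)
print(np.nanmax(np.abs(resid)))
```

Because the synthetic series is exactly trend plus cycle, the residual is numerically zero away from the edges; on real meter data the residual panel shows what the trend and seasonal parts fail to explain.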

**Train Model**

> start_p = 1: Starting value of 1 for the order (number of time lags) of the autoregressive model

> start_q = 1: Starting value of 1 for the order of the moving average model

> m = 52: The seasonal period; the data is weekly, so one seasonal cycle spans 52 observations

> start_P = 0: Starting value of 0 for the order of the autoregressive portion of the seasonal model

> seasonal = True: Fit a seasonal model, since energy use can vary between seasons

> trace = True: Print status while fitting

> stepwise = True: Use the stepwise search, which is faster than fitting all hyper-parameter combinations and is less likely to overfit

In [None]:
smodel = pm.auto_arima(train["Rat_electricity_sum"],start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 0)
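
An order of (0, 1, 0) with a constant is conceptually a random walk with drift: each forecast is the last observed value plus a fixed step per period. The sketch below illustrates that idea with NumPy under simplifying assumptions (no exogenous regressors, drift estimated as the mean first difference); it is not how statsmodels estimates the model, and the numbers are hypothetical.

```python
import numpy as np

def random_walk_drift_forecast(y, steps):
    """Forecast a random walk with drift: last value plus h times the mean difference."""
    y = np.asarray(y, dtype=float)
    drift = np.diff(y).mean()  # average week-to-week change
    horizon = np.arange(1, steps + 1)
    return y[-1] + drift * horizon

history = [10.0, 12.0, 11.0, 14.0, 16.0]  # hypothetical weekly totals
forecast = random_walk_drift_forecast(history, steps=3)
print(forecast)
```

The forecast is a straight line starting from the last observation, which is why such a model cannot reproduce seasonal swings on its own.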

**Fit Model**

In [None]:
mod = sm.tsa.statespace.SARIMAX(endog = train["Rat_electricity_sum"], order=(0,1,0), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Rat_electricity_sum"].plot(figsize=(25,10), color="orange")
mod_fit.fittedvalues.plot()
plt.show()

**Predict Using Model**

In [None]:
# Out-of-sample forecast over the test window; predict() without start/end only returns in-sample fitted values
predict = mod_fit.forecast(steps=len(test))
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Rat_electricity_sum"].plot(figsize=(25,10),color = 'limegreen')
test['predicted'].plot()
plt.show()

In [None]:
test['residual'] = abs(test["Rat_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Rat_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

In [None]:
mod_fit.resid.plot(figsize= (10,5))

# Fox

Fox corresponds to the buildings at Arizona State University located in Tempe, USA.

**Finding Relationships**

In [None]:
fox.head()

In [None]:
pp = sns.pairplot(fox)

In [None]:
ax = plt.subplot()

ax.plot(fox["timestamp"], fox["Fox_electricity_sum"],color = 'limegreen', label = 'electricity')
ax.plot(fox["timestamp"], fox["Fox_chilled_sum"], color = 'blue', label = 'chilled water')
fox["Fox_hot_sum"].plot(figsize=(25,10), color ='red', label = 'hot water')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 10))

#display legend
ax.legend()

From the graphs, we can see an inverse relationship between chilled water and hot water usage; as chilled water usage increases, hot water usage decreases. Electricity usage remains constant. 
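
A visual impression of an inverse relationship can be backed by a correlation coefficient. The sketch below uses synthetic heating and cooling series (assumed shapes, not the Fox data) to show the calculation with `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(1)
weeks = np.arange(104)
# Synthetic stand-ins: cooling demand follows a yearly cycle, heating demand mirrors it.
cooling = 50 + 30 * np.sin(2 * np.pi * weeks / 52) + rng.normal(scale=2, size=104)
heating = 50 - 30 * np.sin(2 * np.pi * weeks / 52) + rng.normal(scale=2, size=104)

r = np.corrcoef(cooling, heating)[0, 1]
print(round(r, 3))
```

A value close to -1 indicates a strong inverse relationship. On the notebook's frame the equivalent check would be `fox[["Fox_chilled_sum", "Fox_hot_sum"]].corr()`.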

**Train-test Split: first year for training, second year for testing**

In [None]:
fox = fox.set_index("timestamp")

In [None]:
fox_model_data = fox[["Fox_electricity_sum", "Fox_chilled_sum", "Fox_hot_sum"]]
train = fox_model_data.iloc[0:(len(fox_model_data)-53)].copy()
test = fox_model_data.iloc[len(train):(len(fox_model_data) -1)].copy()

**ACF PACF**

ACF becomes more unreliable as the lag increases. There are a total of 52 observations in train, so only the first 20 lags will be considered for the ACF.

In [None]:
plot_acf(train.Fox_electricity_sum,lags=20)
plt.show()

In [None]:
plot_pacf(train.Fox_electricity_sum,lags=20)
plt.show()

From the graphs, we see some autocorrelation in the series, but it may not be significant.

* ACF is significant at lags 0 to 2. The remaining lags fall inside the 95% confidence band, so they are not statistically significant.
* PACF is significant at lags 0, 1, 11 and 13. The remaining lags fall inside the 95% confidence band, so they are not statistically significant.

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Fox_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is significant, so the ADF test rejects the null hypothesis of a unit root and the series is treated as stationary.

* This indicates that statistical properties of the data, such as the mean, variance and standard deviation, are constant over time

**Seasonal Decomposition**

In [None]:
decomp = seasonal_decompose(fox["Fox_electricity_sum"], period=12)

fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there is a cyclic trend each year.
    * The trend shows increase of electricity usage in the summer months. This may be attributed to longer days in the summer.
* There seems to be a seasonality component of the data
* Looking at the residuals, there appears to be randomness in the data

**Train Model**

> start_p = 1: Starting value of 1 for the order (number of time lags) of the autoregressive model

> start_q = 1: Starting value of 1 for the order of the moving average model

> m = 52: The seasonal period; the data is weekly, so one seasonal cycle spans 52 observations

> start_P = 0: Starting value of 0 for the order of the autoregressive portion of the seasonal model

> seasonal = True: Fit a seasonal model, since energy use can vary between seasons

> trace = True: Print status while fitting

> stepwise = True: Use the stepwise search, which is faster than fitting all hyper-parameter combinations and is less likely to overfit

In [None]:
smodel = pm.auto_arima(train["Fox_electricity_sum"], exogenous=train[["Fox_chilled_sum", "Fox_hot_sum"]],start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 0)

**Fit Model**

In [None]:
mod = sm.tsa.statespace.SARIMAX(endog = train["Fox_electricity_sum"],exog = sm.add_constant(train[["Fox_chilled_sum", "Fox_hot_sum"]]), order=(0,1,0), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Fox_electricity_sum"].plot(figsize=(25,10), color="orange")
mod_fit.fittedvalues.plot()
plt.show()

**Predict Using Model**

In [None]:
# Out-of-sample forecast over the test window; predict() without start/end only returns in-sample fitted values
predict = mod_fit.forecast(steps=len(test), exog=sm.add_constant(test[["Fox_chilled_sum", "Fox_hot_sum"]]))
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Fox_electricity_sum"].plot(figsize=(25,10),color = 'limegreen')
test['predicted'].plot()
plt.show()

In [None]:
test['residual'] = abs(test["Fox_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Fox_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

In [None]:
mod_fit.resid.plot(figsize= (10,5))

# Shrew

Shrew corresponds to the buildings at UK Parliament located in London, UK.

**Finding Relationships**

In [None]:
shrew.head()

In [None]:
pp = sns.pairplot(shrew)

In [None]:
ax = plt.subplot()

ax.plot(shrew["timestamp"], shrew["Shrew_electricity_sum"],color = 'limegreen', label = 'electricity')
shrew["Shrew_gas_sum"].plot(figsize=(25,10), color ='lightpink', label = 'gas')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 10))

#display legend
ax.legend()

From the graphs, there is no clear relationship between electricity and gas usage.

**Train-test Split: first year for training, second year for testing**

In [None]:
shrew = shrew.set_index("timestamp")

In [None]:
shrew_model_data = shrew[["Shrew_electricity_sum", "Shrew_gas_sum"]]
train = shrew_model_data.iloc[0:(len(shrew_model_data)-53)].copy()
test = shrew_model_data.iloc[len(train):(len(shrew_model_data) -1)].copy()

**ACF PACF**

ACF becomes more unreliable as the lag increases. There are a total of 52 observations in train, so only the first 20 lags will be considered for the ACF.

In [None]:
plot_acf(train.Shrew_electricity_sum,lags=20)
plt.show()

In [None]:
plot_pacf(train.Shrew_electricity_sum,lags=20)
plt.show()

From the graphs, we see that there is not a strong autocorrelation in the series.

* ACF is significant at lag 0 only. The remaining lags fall inside the 95% confidence band, so they are not statistically significant.
* PACF is significant at lags 0, 15, 16 and 19. The remaining lags fall inside the 95% confidence band, so they are not statistically significant.

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Shrew_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is not significant, so the ADF test fails to reject the null hypothesis of a unit root and the series is treated as non-stationary.

This indicates that statistical properties of the data, such as the mean, variance and standard deviation, are not constant over time

**Seasonal Decomposition**

In [None]:
decomp = seasonal_decompose(shrew["Shrew_electricity_sum"], period=12)

fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there does not seem to be much of a trend throughout the first year, but there is a linear trend in the second year.
    * This could indicate faulty data.
* There seems to be a seasonality component in the data
* Looking at the residuals, there appears to be randomness only in the first half of the data.

**Train Model**

> start_p = 1: Starting value of 1 for the order (number of time lags) of the autoregressive model

> start_q = 1: Starting value of 1 for the order of the moving average model

> m = 52: The seasonal period; the data is weekly, so one seasonal cycle spans 52 observations

> start_P = 0: Starting value of 0 for the order of the autoregressive portion of the seasonal model

> seasonal = True: Fit a seasonal model, since energy use can vary between seasons

> trace = True: Print status while fitting

> stepwise = True: Use the stepwise search, which is faster than fitting all hyper-parameter combinations and is less likely to overfit

In [None]:
smodel = pm.auto_arima(train["Shrew_electricity_sum"], exogenous=train[["Shrew_gas_sum"]],start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 1)

**Fit Model**

In [None]:
mod = sm.tsa.statespace.SARIMAX(endog = train["Shrew_electricity_sum"],exog = sm.add_constant(train["Shrew_gas_sum"]), order=(0,1,1), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Shrew_electricity_sum"].plot(figsize=(25,10), color="orange")
mod_fit.fittedvalues.plot()
plt.show()

**Predict Using Model**

In [None]:
# Out-of-sample forecast over the test window; predict() without start/end only returns in-sample fitted values
predict = mod_fit.forecast(steps=len(test), exog=sm.add_constant(test[["Shrew_gas_sum"]]))
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Shrew_electricity_sum"].plot(figsize=(25,10),color = 'limegreen')
test['predicted'].plot()
plt.show()

In [None]:
test['residual'] = abs(test["Shrew_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Shrew_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

In [None]:
mod_fit.resid.plot(figsize= (10,5))

# Mouse

Mouse corresponds to buildings at Ormand Street Hospital located in London, UK. 

**Finding Relationships**

In [None]:
mouse.head(2)

Mouse has no meter readings other than electricity.

In [None]:
pp = sns.pairplot(mouse)

In [None]:
ax = plt.subplot()

ax.plot(mouse["timestamp"], mouse["Mouse_electricity_sum"],color = 'limegreen', label = 'electricity')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 10))

#display legend
ax.legend()

From the graph, we see that there is decreasing electricity usage at the start and end of each year.

**Train-test Split: first year for training, second year for testing**

In [None]:
mouse = mouse.set_index("timestamp")

mouse_model_data = mouse[["Mouse_electricity_sum"]]
train = mouse_model_data.iloc[0:(len(mouse_model_data)-53)].copy()
test = mouse_model_data.iloc[len(train):(len(mouse_model_data) -1)].copy()

**ACF PACF**

ACF becomes more unreliable as the lag increases. There are a total of 52 observations in train, so only the first 20 lags will be considered for the ACF.

In [None]:
plot_acf(train.Mouse_electricity_sum,lags=20)
plt.show()

In [None]:
plot_pacf(train.Mouse_electricity_sum,lags=20)
plt.show()

From the graphs, we see that there is not a strong autocorrelation in the series.

* ACF is significant at lags 0 to 3. The remaining lags fall inside the 95% confidence band, so they are not statistically significant.
* PACF is significant at lags 0, 1, 13, 18 and 19. The remaining lags fall inside the 95% confidence band, so they are not statistically significant.

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Mouse_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is not significant, so the ADF test fails to reject the null hypothesis of a unit root and the series is treated as non-stationary.

* This indicates that statistical properties of the data, such as the mean, variance and standard deviation, are not constant over time

**Seasonal Decomposition**

In [None]:
decomp = seasonal_decompose(mouse["Mouse_electricity_sum"], period=6)

fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there is no strong trend each year.
    * The graph shows slightly higher usage in the summer months, which may be attributed to the longer days during this period.
* There seems to be a seasonality component of the data
* Looking at the residuals, there appears to be randomness in the data

**Train Model**

> start_p = 1: Starting value of 1 for the order (number of time lags) of the autoregressive model

> start_q = 1: Starting value of 1 for the order of the moving average model

> m = 52: The seasonal period; the data is weekly, so one seasonal cycle spans 52 observations

> start_P = 0: Starting value of 0 for the order of the autoregressive portion of the seasonal model

> seasonal = True: Fit a seasonal model, since energy use can vary between seasons

> trace = True: Print status while fitting

> stepwise = True: Use the stepwise search, which is faster than fitting all hyper-parameter combinations and is less likely to overfit

In [None]:
smodel = pm.auto_arima(train["Mouse_electricity_sum"], start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 1)
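
An order of (0, 1, 1) is closely related to simple exponential smoothing: without a constant, its one-step forecast matches an exponentially weighted level with smoothing factor alpha = 1 + theta, where theta is the MA coefficient in the sign convention statsmodels reports. The sketch below shows only that smoothing recursion; the value of `alpha`, the starting level and the data are assumptions for illustration, and the constant term used in the fit below is ignored.

```python
def ses_forecast(y, alpha):
    """One-step-ahead forecast from simple exponential smoothing."""
    level = y[0]  # assumed initialisation: start from the first observation
    for obs in y[1:]:
        # New level is a weighted average of the latest observation and the old level.
        level = alpha * obs + (1 - alpha) * level
    return level

history = [10.0, 12.0, 11.0, 14.0]  # hypothetical weekly totals
next_value = ses_forecast(history, alpha=0.5)
print(next_value)
```

A smaller alpha gives a smoother forecast that reacts slowly to recent weeks; an alpha of 1 reduces to the plain random-walk forecast.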

In [None]:
mod = sm.tsa.statespace.SARIMAX(endog = train["Mouse_electricity_sum"], order=(0,1,1), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Mouse_electricity_sum"].plot(figsize=(25,10))
mod_fit.fittedvalues.plot(color = "red")
plt.show()

**Predict Using Model**

In [None]:
# Out-of-sample forecast over the test window; predict() without start/end only returns in-sample fitted values
predict = mod_fit.forecast(steps=len(test))
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Mouse_electricity_sum"].plot(figsize=(25,10))
test['predicted'].plot(color = "red")
plt.show()

In [None]:
test['residual'] = abs(test["Mouse_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Mouse_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

In [None]:
mod_fit.resid.plot(figsize= (20,5))

# Peacock

Peacock corresponds to the buildings at Princeton University located in Princeton, USA.

**Finding Relationships**

In [None]:
peacock.head(2)

In [None]:
pp = sns.pairplot(peacock)

In [None]:
ax = plt.subplot()

ax.plot(peacock["timestamp"], peacock["Peacock_electricity_sum"],color = 'limegreen', label = 'electricity')
ax.plot(peacock["timestamp"], peacock["Peacock_steam_sum"],color = 'orchid', label = 'steam')
peacock["Peacock_chilled_sum"].plot(figsize=(25,10), color ='blue', label = 'chilled water')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 10))

#display legend
ax.legend()

From the graphs, we see that as electricity usage increases, steam usage increases as well. Chilled water usage does not seem to have a relationship with the other meter types.

Additionally, we see major inconsistent peaks in chilled water usage. This can indicate outliers that were not removed during EDA. 
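
One simple way to check whether such peaks are outliers is the interquartile-range rule. The sketch below is an assumption about how one might screen the series, not the cleaning method used for this dataset; `iqr_outliers` and the readings are hypothetical.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

readings = np.array([20, 22, 21, 23, 19, 95, 22, 20, 21, 24])  # hypothetical, one spike
flagged = np.flatnonzero(iqr_outliers(readings))
print(flagged)
```

On a strongly seasonal series a global rule like this can also flag ordinary seasonal peaks, so applying it within a rolling window or per season may be more appropriate.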

**Train-test Split: first year for training, second year for testing**

In [None]:
peacock = peacock.set_index("timestamp")

In [None]:
peacock_model_data = peacock[["Peacock_electricity_sum", "Peacock_chilled_sum", "Peacock_steam_sum"]]
train = peacock_model_data.iloc[0:(len(peacock_model_data)-53)].copy()
test = peacock_model_data.iloc[len(train):(len(peacock_model_data) -1)].copy()

**ACF PACF**

ACF becomes more unreliable as the lag increases. There are a total of 52 observations in train, so only the first 20 lags will be considered for the ACF.

In [None]:
plot_acf(train.Peacock_electricity_sum,lags=20)
plt.show()

In [None]:
plot_pacf(train.Peacock_electricity_sum,lags=20)
plt.show()

From the graphs, we see that there is not a strong autocorrelation in the series.

* ACF is significant at lags 0 to 4. The remaining lags fall inside the 95% confidence band, so they are not statistically significant.
* PACF is significant at lags 0, 1 and 14. The remaining lags fall inside the 95% confidence band, so they are not statistically significant.

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Peacock_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is significant, so the ADF test rejects the null hypothesis of a unit root and the series is treated as stationary.

* This indicates that statistical properties of the data, such as the mean, variance and standard deviation, are constant over time

**Seasonal Decomposition**

In [None]:
decomp = seasonal_decompose(peacock["Peacock_electricity_sum"], period=12)

fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there is a cyclic trend each year.
    * There is increased usage in the winter months. This could be attributed to the colder days during this period. 
* There seems to be a seasonality component of the data
* Looking at the residuals, there appears to be randomness in the data

**Train Model**

> start_p = 1: Starting value of 1 for the order (number of time lags) of the autoregressive model

> start_q = 1: Starting value of 1 for the order of the moving average model

> m = 52: The seasonal period; the data is weekly, so one seasonal cycle spans 52 observations

> start_P = 0: Starting value of 0 for the order of the autoregressive portion of the seasonal model

> seasonal = True: Fit a seasonal model, since energy use can vary between seasons

> trace = True: Print status while fitting

> stepwise = True: Use the stepwise search, which is faster than fitting all hyper-parameter combinations and is less likely to overfit

In [None]:
smodel = pm.auto_arima(train["Peacock_electricity_sum"], exogenous=train[["Peacock_chilled_sum", "Peacock_steam_sum"]],start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 0)

**Fit Model**

In [None]:
endog = train["Peacock_electricity_sum"]
y = train.drop("Peacock_electricity_sum", axis = 1)
exog = sm.add_constant(y)

mod = sm.tsa.statespace.SARIMAX(endog = endog ,exog = exog, order=(0,1,0), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Peacock_electricity_sum"].plot(figsize=(25,10))
mod_fit.fittedvalues.plot()
plt.show()

**Predict Using Model**

In [None]:
# Out-of-sample forecast over the test window; predict() without start/end only returns in-sample fitted values
predict = mod_fit.forecast(steps=len(test), exog=sm.add_constant(test[["Peacock_chilled_sum", "Peacock_steam_sum"]]))
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Peacock_electricity_sum"].plot(figsize=(25,10))
test['predicted'].plot(color = "red")
plt.show()

In [None]:
test['residual'] = abs(test["Peacock_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Peacock_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

In [None]:
mod_fit.resid.plot(figsize= (20,5))

# Hog

Hog building location remains anonymous.

**Finding Relationships**

In [None]:
hog.head(2)

In [None]:
pp = sns.pairplot(hog)

In [None]:
ax = plt.subplot()

ax.plot(hog["timestamp"], hog["Hog_electricity_sum"],color = 'limegreen', label = 'electricity')
ax.plot(hog["timestamp"], hog["Hog_chilled_sum"], color = 'blue', label = 'chilled water')
hog["Hog_steam_sum"].plot(figsize=(25,10), color ='violet', label = 'steam')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 10))

#display legend
ax.legend()

From the graphs, there does not seem to be any clear relationships between the types of usages.

**Train-test Split: first year for training, second year for testing**

In [None]:
hog = hog.set_index("timestamp")

In [None]:
hog_model_data = hog[["Hog_electricity_sum", "Hog_chilled_sum", "Hog_steam_sum"]]
train = hog_model_data.iloc[0:(len(hog_model_data)-53)].copy()
test = hog_model_data.iloc[len(train):(len(hog_model_data) -1)].copy()

endog = train["Hog_electricity_sum"]
y = train.drop("Hog_electricity_sum", axis = 1)
exog = sm.add_constant(y)

**ACF PACF**

ACF becomes more unreliable as the lag increases. There are a total of 52 observations in train, so only the first 15 lags will be considered for the ACF.

In [None]:
plot_acf(train.Hog_electricity_sum,lags=15)
plt.show()

In [None]:
plot_pacf(train.Hog_electricity_sum,lags=15)
plt.show()

From the graphs, we see that there is not a strong autocorrelation in the series.

* ACF is significant at lags 0 to 4. The remaining lags fall inside the 95% confidence band, so they are not statistically significant.
* PACF is significant at lags 0, 1, 2, 9, 10, 12 and 13. The remaining lags fall inside the 95% confidence band, so they are not statistically significant.

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Hog_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is not significant, so the ADF test fails to reject the null hypothesis of a unit root and the series is treated as non-stationary.

* This indicates that statistical properties of the data, such as the mean, variance and standard deviation, are not constant over time

**Seasonal Decomposition**

In [None]:
decomp = seasonal_decompose(hog["Hog_electricity_sum"], period=12)

fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there is a cyclic trend each year 
    * There is increased usage in the summer months. This could be attributed to the longer days during this period.
* There seems to be a seasonality component of the data
* Looking at the residuals, there appears to be randomness in the data

**Train Model**

> start_p = 1: Starting value of 1 for the order (number of time lags) of the autoregressive model

> start_q = 1: Starting value of 1 for the order of the moving average model

> m = 52: The seasonal period; the data is weekly, so one seasonal cycle spans 52 observations

> start_P = 0: Starting value of 0 for the order of the autoregressive portion of the seasonal model

> seasonal = True: Fit a seasonal model, since energy use can vary between seasons

> trace = True: Print status while fitting

> stepwise = True: Use the stepwise search, which is faster than fitting all hyper-parameter combinations and is less likely to overfit

In [None]:
smodel = pm.auto_arima(train["Hog_electricity_sum"], exogenous= exog ,start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 0)

**Fit Model**

In [None]:
mod = sm.tsa.statespace.SARIMAX(endog = endog ,exog = exog, order=(0,1,0), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Hog_electricity_sum"].plot(figsize=(25,10))
mod_fit.fittedvalues.plot()
plt.show()

**Predict Using Model**

In [None]:
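# Out-of-sample forecast over the test window; the original call reused the training exog and returned in-sample fitted values
predict = mod_fit.forecast(steps=len(test), exog=sm.add_constant(test[["Hog_chilled_sum", "Hog_steam_sum"]]))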
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Hog_electricity_sum"].plot(figsize=(25,10))
test['predicted'].plot(color = "red")
plt.show()

In [None]:
test['residual'] = abs(test["Hog_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Hog_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

In [None]:
mod_fit.resid.plot(figsize= (20,5))

# Cockatoo

Cockatoo corresponds to the buildings at Cornell University located in Ithaca, USA.

**Finding Relationships**

In [None]:
cockatoo.head(2)

In [None]:
pp = sns.pairplot(cockatoo)

In [None]:
ax = plt.subplot()

ax.plot(cockatoo["timestamp"], cockatoo["Cockatoo_electricity_sum"],color = 'limegreen', label = 'electricity')
ax.plot(cockatoo["timestamp"], cockatoo["Cockatoo_steam_sum"],color = 'orchid', label = 'steam')
ax.plot(cockatoo["timestamp"], cockatoo["Cockatoo_chilled_sum"], color = 'blue', label = 'chilled water')
cockatoo["Cockatoo_hot_sum"].plot(figsize=(25,10), color ='red', label = 'hot water')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 10))

#display legend
ax.legend()

From the graphs, we see a relationship between hot water and steam usage; as hot water usage increases, steam usage increases as well. Additionally, there is an inverse relationship between chilled water usage and hot and steam usage. We see that as hot water and steam usage increases, chilled water usage decreases. Electricity usage stays relatively constant. 

**Train-test Split: first year for training, second year for testing**

In [None]:
cockatoo = cockatoo.set_index('timestamp')

In [None]:
cockatoo_model_data = cockatoo[["Cockatoo_electricity_sum", "Cockatoo_chilled_sum", "Cockatoo_steam_sum", "Cockatoo_hot_sum"]]
train = cockatoo_model_data.iloc[0:(len(cockatoo_model_data)-53)].copy()
test = cockatoo_model_data.iloc[len(train):(len(cockatoo_model_data) -1)].copy()

endog = train["Cockatoo_electricity_sum"]
y = train.drop("Cockatoo_electricity_sum", axis = 1)
exog = sm.add_constant(y)

**ACF PACF**

ACF becomes more unreliable as the lag increases. There are a total of 52 observations in train, so only the first 20 lags will be considered for the ACF.

In [None]:
plot_acf(train.Cockatoo_electricity_sum,lags=20)
plt.show()

In [None]:
plot_pacf(train.Cockatoo_electricity_sum,lags=20)
plt.show()

From the graphs, we see that there is not a strong autocorrelation in the series.

* ACF is significant at lags 0 to 2. The remaining lags fall inside the 95% confidence band, so they are not statistically significant.
* PACF is significant at lags 0 and 1. The remaining lags fall inside the 95% confidence band, so they are not statistically significant.

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Cockatoo_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is significant, so the ADF test rejects the null hypothesis of a unit root and the series is treated as stationary.

* This indicates that statistical properties of the data, such as the mean, variance and standard deviation, are constant over time

**Seasonal Decomposition**

In [None]:
decomp = seasonal_decompose(cockatoo["Cockatoo_electricity_sum"], period=12)

fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there seems to be a weak cyclic trend throughout the year 
    * There is a decrease in usage in the summer months. This could be attributed to the warmer days during this period, which suggests that this location does not use electricity for cooling or may not have any cooling.
* There seems to be a seasonality component of the data
* Looking at the residuals, there appears to be randomness in the data

**Train Model**

> start_p = 1: Starting value of 1 for the order (number of time lags) of the autoregressive model

> start_q = 1: Starting value of 1 for the order of the moving average model

> m = 52: The seasonal period; the data is weekly, so one seasonal cycle spans 52 observations

> start_P = 0: Starting value of 0 for the order of the autoregressive portion of the seasonal model

> seasonal = True: Fit a seasonal model, since energy use can vary between seasons

> trace = True: Print status while fitting

> stepwise = True: Use the stepwise search, which is faster than fitting all hyper-parameter combinations and is less likely to overfit

In [None]:
smodel = pm.auto_arima(endog, exogenous=exog,start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 0)

**Fit Model**

In [None]:
mod = sm.tsa.statespace.SARIMAX(endog = endog ,exog = exog, order=(0,1,0), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Cockatoo_electricity_sum"].plot(figsize=(25,10))
mod_fit.fittedvalues.plot()
plt.show()

**Predict Using Model**

In [None]:
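# Out-of-sample forecast over the test window; the original call reused the training exog and returned in-sample fitted values
predict = mod_fit.forecast(steps=len(test), exog=sm.add_constant(test[["Cockatoo_chilled_sum", "Cockatoo_steam_sum", "Cockatoo_hot_sum"]]))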
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Cockatoo_electricity_sum"].plot(figsize=(25,10))
test['predicted'].plot(color = "red")
plt.show()

In [None]:
test['residual'] = abs(test["Cockatoo_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Cockatoo_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

In [None]:
mod_fit.resid.plot(figsize= (20,5))

# Moose

Moose corresponds to buildings at Ottawa - City Buildings located in Ottawa, Canada. 

**Finding Relationships**

In [None]:
moose.head(2)

In [None]:
pp = sns.pairplot(moose)

In [None]:
ax = plt.subplot()

ax.plot(moose["timestamp"], moose["Moose_electricity_sum"],color = 'limegreen', label = 'electricity')
ax.plot(moose["timestamp"], moose["Moose_chilled_sum"], color = 'blue', label = 'chilled water')
ax.plot(moose["timestamp"], moose["Moose_hot_sum"], color = 'red', label = 'hot water')
moose["Moose_steam_sum"].plot(figsize=(25,10), color ='orchid', label = 'steam')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 10))

#display legend
ax.legend()

From the graphs, we see a relationship between steam and hot water usage; as steam usage increases, hot water usage also increases. There is also an inverse relationship between these two meters and chilled water usage: as steam and hot water usage increase, chilled water usage decreases. 
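
The relationships read off the plots can be checked numerically with a correlation matrix. This is a minimal sketch on synthetic data: the column names mirror the Moose columns, but the values are invented, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Moose meters (values are made up)
weeks = np.arange(104)
heating = 1 + np.cos(2 * np.pi * weeks / 52)   # peaks in winter
demo = pd.DataFrame({
    "Moose_steam_sum": 50 * heating,
    "Moose_hot_sum": 30 * heating + 5,
    "Moose_chilled_sum": 40 * (2 - heating),    # peaks in summer
})

# Pearson correlation: +1 means two meters rise together,
# -1 means one falls as the other rises
corr = demo.corr()
print(corr.round(2))
```

On the real data, the same `.corr()` call on the meter columns of `moose` would give a number to go with each panel of the pairplot.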

**Train-test Split: first year for training, second year for testing**

In [None]:
moose = moose.set_index("timestamp")

In [None]:
moose_model_data = moose[["Moose_electricity_sum", "Moose_chilled_sum", "Moose_steam_sum"]]
train = moose_model_data.iloc[0:(len(moose_model_data)-53)].copy()
test = moose_model_data.iloc[len(train):(len(moose_model_data) -1)].copy()

endog = train["Moose_electricity_sum"]
y = train.drop("Moose_electricity_sum", axis = 1)
exog = sm.add_constant(y)
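
The same slicing is repeated for every location, so it could be factored into a helper. The sketch below assumes weekly rows and a 52-week test window; the function name `split_last_weeks` and the demo frame are hypothetical. It mirrors the slicing above, including the fact that the final row is left out.

```python
import numpy as np
import pandas as pd

def split_last_weeks(df, test_weeks=52, drop_last=1):
    """Split a weekly frame into train and test, holding out the final weeks.

    `drop_last` rows are discarded from the end first, mirroring the
    slicing used in this notebook (which leaves out the final row).
    """
    usable = df.iloc[: len(df) - drop_last] if drop_last else df
    train = usable.iloc[: len(usable) - test_weeks].copy()
    test = usable.iloc[len(usable) - test_weeks:].copy()
    return train, test

# Made-up frame with 105 weekly rows
demo = pd.DataFrame({"electricity": np.arange(105.0)})
train_demo, test_demo = split_last_weeks(demo)
print(len(train_demo), len(test_demo))
```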

**ACF PACF**

ACF estimates become less reliable as the lag increases. There are a total of 52 observations in train, so only the first 20 lags will be considered for the ACF.

In [None]:
plot_acf(train.Moose_electricity_sum,lags=20)
plt.show()

In [None]:
plot_pacf(train.Moose_electricity_sum,lags=20)
plt.show()

From the graphs, we see that the series is not strongly correlated with its own lagged values.

* ACF at lags 0, 1 and 2 is significant. The rest of the values lie inside the 95% confidence band, so they are not significantly different from zero.
* PACF at lags 0 and 1 is significant. The rest of the values lie inside the 95% confidence band, so they are not significantly different from zero.
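
Reading significant lags off the plots can be cross-checked numerically. The sketch below computes sample autocorrelations with NumPy and flags the lags whose value falls outside the approximate white-noise band of ±1.96/√n. This is a simplification: by default `plot_acf` draws a band that widens with the lag, so the two need not agree exactly. Lag 0 is skipped because it always equals 1. The function name and the demo series are illustrative.

```python
import numpy as np

def significant_lags(series, nlags=20):
    """Return lags (1..nlags) whose sample autocorrelation exceeds 1.96/sqrt(n) in size."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    # r_k = sum over t of x_t * x_(t-k), divided by the sum of squares
    acf = [np.dot(x[k:], x[:-k]) / denom for k in range(1, nlags + 1)]
    band = 1.96 / np.sqrt(n)
    return [k for k, r in enumerate(acf, start=1) if abs(r) > band]

# Made-up series that flips sign every step, so every short lag is strongly correlated
demo_series = np.array([1.0, -1.0] * 26)
print(significant_lags(demo_series, nlags=5))
```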

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Moose_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is not significant, so the test cannot reject the null hypothesis of a unit root and the series is treated as non-stationary.

* This indicates that statistical properties of the data, such as the mean, variance and standard deviation, are not constant over time
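
The null hypothesis of the ADF test is that the series has a unit root (is non-stationary), so a small p-value is evidence of stationarity. A minimal sketch of that decision rule follows; the function name, the 0.05 threshold and the example p-values are assumptions for illustration.

```python
def interpret_adf(p_value, alpha=0.05):
    """Translate an ADF p-value into a plain-language stationarity verdict."""
    if p_value < alpha:
        # Reject the unit-root null hypothesis
        return "stationary"
    # Fail to reject the unit-root null hypothesis
    return "non-stationary"

# With the tuple `t` returned by sm.tsa.adfuller above, the p-value is t[1]:
#     interpret_adf(t[1])
print(interpret_adf(0.32))   # hypothetical p-value
print(interpret_adf(0.01))   # hypothetical p-value
```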

**Seasonal Decomposition**

In [None]:
decomp = seasonal_decompose(moose["Moose_electricity_sum"], period=12)

fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there does not seem to be much of a trend throughout each year
* There seems to be a seasonality component of the data
* Looking at the residuals, there appears to be randomness in the data

**Train Model**

> start_p = 1: Starting value of 1 for the order (number of time lags) of the autoregressive model

> start_q = 1: Starting value of 1 for the order of the moving average model

> m = 52: the number of observations per seasonal cycle; the data is weekly, so one year spans 52 periods

> start_P = 0: Starting value of 0 for the order of the autoregressive portion of the seasonal model

> seasonal = True: fit a seasonal model, since energy use can vary between seasons

> trace = True: print status on fitting

> stepwise = True: use a stepwise search over the hyper-parameters instead of fitting every combination; this is faster and less likely to overfit

In [None]:
smodel = pm.auto_arima(endog, exogenous=exog,start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 0)

**Fit Model**

In [None]:
mod = sm.tsa.statespace.SARIMAX(endog = endog ,exog = exog, order=(0,1,0), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Moose_electricity_sum"].plot(figsize=(25,10))
mod_fit.fittedvalues.plot()
plt.show()

**Predict Using Model**

In [None]:
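# Forecast the test period out-of-sample; calling predict() with the training data only returns in-sample fitted values
test_exog = sm.add_constant(test[y.columns], has_constant="add")
predict = mod_fit.forecast(steps=len(test), exog=test_exog)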
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Moose_electricity_sum"].plot(figsize=(25,10))
test['predicted'].plot(color = "red")
plt.show()

In [None]:
test['residual'] = abs(test["Moose_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Moose_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

In [None]:
mod_fit.resid.plot(figsize= (20,5))

# Gator

The Gator building locations remain anonymous.

**Finding Relationships**

In [None]:
gator.head(2)

In [None]:
pp = sns.pairplot(gator)

In [None]:
ax = plt.subplot()

ax.plot(gator["timestamp"], gator["Gator_electricity_sum"],color = 'limegreen', label = 'electricity')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 20))

#display legend
ax.legend()

From the graph, there do not seem to be any patterns or trends.

**Train-test Split: first year for training, second year for testing**

In [None]:
gator = gator.set_index("timestamp")

In [None]:
gator_model_data = gator[["Gator_electricity_sum"]]
train = gator_model_data.iloc[0:(len(gator_model_data)-53)].copy()
test = gator_model_data.iloc[len(train):(len(gator_model_data) -1)].copy()

endog = train["Gator_electricity_sum"]

**ACF PACF**

ACF estimates become less reliable as the lag increases. There are a total of 52 observations in train, so only the first 20 lags will be considered for the ACF.

In [None]:
plot_acf(train.Gator_electricity_sum,lags=20)
plt.show()

In [None]:
plot_pacf(train.Gator_electricity_sum,lags=20)
plt.show()

From the graphs, we see that the series is not strongly correlated with its own lagged values.

* ACF at lags 0, 1 and 6 is significant. The rest of the values lie inside the 95% confidence band, so they are not significantly different from zero.
* PACF at lags 0, 1, 6 and 17 is significant. The rest of the values lie inside the 95% confidence band, so they are not significantly different from zero.

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Gator_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is significant, so the test rejects the null hypothesis of a unit root and the series is treated as stationary.

* This indicates that statistical properties of the data, such as the mean, variance and standard deviation, are constant over time

**Seasonal Decomposition**

In [None]:
decomp = seasonal_decompose(gator["Gator_electricity_sum"], period=12)

fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there is not much of a pattern or trend each year
* There seems to be a seasonality component of the data
* Looking at the residuals, there appears to be randomness in the data
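
Whether there "seems to be" a seasonal component can be made more concrete with a strength-of-seasonality score, max(0, 1 − Var(R) / Var(S + R)), where S and R are the seasonal and residual components of a decomposition. Values near 1 mean the seasonal swing dominates the residual noise. The sketch below applies the score to made-up components; with the decomposition above, `decomp.seasonal` and `decomp.resid` would be the inputs. The function name is hypothetical.

```python
import numpy as np

def seasonal_strength(seasonal, resid):
    """Score in [0, 1]: near 1 means the seasonal swing dominates the residual noise."""
    seasonal = np.asarray(seasonal, dtype=float)
    resid = np.asarray(resid, dtype=float)
    # seasonal_decompose leaves NaNs at the edges of resid, so drop those rows
    keep = ~np.isnan(seasonal) & ~np.isnan(resid)
    s, r = seasonal[keep], resid[keep]
    return max(0.0, 1.0 - np.var(r) / np.var(s + r))

# Made-up components: a clean 12-step cycle plus small alternating noise
steps = np.arange(48)
demo_seasonal = 10 * np.sin(2 * np.pi * steps / 12)
demo_resid = np.where(steps % 2 == 0, 0.5, -0.5)
demo_resid[:6] = np.nan  # mimic the missing edges
strength = seasonal_strength(demo_seasonal, demo_resid)
print(round(strength, 3))
```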

**Train Model**

> start_p = 1: Starting value of 1 for the order (number of time lags) of the autoregressive model

> start_q = 1: Starting value of 1 for the order of the moving average model

> m = 52: the number of observations per seasonal cycle; the data is weekly, so one year spans 52 periods

> start_P = 0: Starting value of 0 for the order of the autoregressive portion of the seasonal model

> seasonal = True: fit a seasonal model, since energy use can vary between seasons

> trace = True: print status on fitting

> stepwise = True: use a stepwise search over the hyper-parameters instead of fitting every combination; this is faster and less likely to overfit

In [None]:
smodel = pm.auto_arima(endog,start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 2) x (1, 0, [], 52); the empty list means there are no seasonal moving average terms.

**Fit Model**

In [None]:
# Seasonal order (P, D, Q, m): Q = 0 is the integer form of the empty list reported by auto_arima
mod = sm.tsa.statespace.SARIMAX(endog = endog, order=(0,1,2), seasonal_order = (1,0,0,52), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Gator_electricity_sum"].plot(figsize=(25,10))
mod_fit.fittedvalues.plot()
plt.show()

**Predict Using Model**

In [None]:
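# Forecast the test period out-of-sample; calling predict() with the training data only returns in-sample fitted values
predict = mod_fit.forecast(steps=len(test))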
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Gator_electricity_sum"].plot(figsize=(25,10))
test['predicted'].plot(color = "red")
plt.show()

In [None]:
test['residual'] = abs(test["Gator_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Gator_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

In [None]:
mod_fit.resid.plot(figsize= (20,5))

# Eagle

The Eagle building locations remain anonymous.

**Finding Relationships**

In [None]:
eagle.head()

In [None]:
pp = sns.pairplot(eagle)

In [None]:
ax = plt.subplot()

ax.plot(eagle["timestamp"], eagle["Eagle_electricity_sum"],color = 'limegreen', label = 'electricity')
ax.plot(eagle["timestamp"], eagle["Eagle_steam_sum"],color = 'orchid', label = 'steam')
ax.plot(eagle["timestamp"], eagle["Eagle_chilled_sum"], color = 'blue', label = 'chilled water')
eagle["Eagle_hot_sum"].plot(figsize=(25,10), color ='red', label = 'hot water')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 10))

#display legend
ax.legend()

From the graphs, we can see that hot water and chilled water usage have an inverse relationship; as hot water usage increases, chilled water usage decreases. The pairplot also indicates that steam usage and chilled water usage have an inverse relationship; as steam usage increases, chilled water usage decreases. There is also a positive linear relationship between hot water usage and steam usage: higher hot water usage coincides with higher steam usage.  

**Train-test Split: first year for training, second year for testing**

In [None]:
eagle = eagle.set_index("timestamp")

In [None]:
eagle_model_data = eagle[["Eagle_electricity_sum", "Eagle_chilled_sum", "Eagle_steam_sum", "Eagle_hot_sum"]]
train = eagle_model_data.iloc[0:(len(eagle_model_data)-53)].copy()
test = eagle_model_data.iloc[len(train):(len(eagle_model_data) -1)].copy()

endog = train["Eagle_electricity_sum"]
y = train.drop("Eagle_electricity_sum", axis = 1)
exog = sm.add_constant(y)

**ACF PACF**

ACF estimates become less reliable as the lag increases. There are a total of 52 observations in train, so only the first 20 lags will be considered for the ACF.

In [None]:
plot_acf(train.Eagle_electricity_sum,lags=20)
plt.show()

In [None]:
plot_pacf(train.Eagle_electricity_sum,lags=20)
plt.show()

From the graphs, we see that the series is not strongly correlated with its own lagged values.

* ACF at lags 0 and 1 is significant. The rest of the values lie inside the 95% confidence band, so they are not significantly different from zero.
* PACF at lags 0, 1 and 14 is significant. The rest of the values lie inside the 95% confidence band, so they are not significantly different from zero.

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Eagle_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is not significant, so the test cannot reject the null hypothesis of a unit root and the series is treated as non-stationary.

* This indicates that statistical properties of the data, such as the mean, variance and standard deviation, are not constant over time

**Seasonal Decomposition**

In [None]:
decomp = seasonal_decompose(eagle["Eagle_electricity_sum"], period=12)

fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there seems to be a weak cyclic trend 
    * There seems to be a decrease in usage during the spring/summer months. This could be attributed to the warmer days during this period. 
* There does not seem to be much of a seasonality component of the data
* Looking at the residuals, there appears to be randomness in the data

**Train Model**

> start_p = 1: Starting value of 1 for the order (number of time lags) of the autoregressive model

> start_q = 1: Starting value of 1 for the order of the moving average model

> m = 52: the number of observations per seasonal cycle; the data is weekly, so one year spans 52 periods

> start_P = 0: Starting value of 0 for the order of the autoregressive portion of the seasonal model

> seasonal = True: fit a seasonal model, since energy use can vary between seasons

> trace = True: print status on fitting

> stepwise = True: use a stepwise search over the hyper-parameters instead of fitting every combination; this is faster and less likely to overfit

In [None]:
smodel = pm.auto_arima(endog, exogenous=exog,start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 1)

**Fit Model**

In [None]:
# Order matches the SARIMAX(0, 1, 1) model reported by auto_arima above
mod = sm.tsa.statespace.SARIMAX(endog = endog ,exog = exog, order=(0,1,1), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Eagle_electricity_sum"].plot(figsize=(25,10))
mod_fit.fittedvalues.plot()
plt.show()

**Predict Using Model**

In [None]:
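# Forecast the test period out-of-sample; calling predict() with the training data only returns in-sample fitted values
test_exog = sm.add_constant(test[y.columns], has_constant="add")
predict = mod_fit.forecast(steps=len(test), exog=test_exog)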
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Eagle_electricity_sum"].plot(figsize=(25,10))
test['predicted'].plot(color = "red")
plt.show()

In [None]:
test['residual'] = abs(test["Eagle_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Eagle_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

In [None]:
mod_fit.resid.plot(figsize= (10,5))

# Wolf

Wolf corresponds to the buildings at University College Dublin located in Dublin, Ireland. 

**Finding Relationships**

In [None]:
wolf.head()

In [None]:
pp = sns.pairplot(wolf)

In [None]:
ax = plt.subplot()

ax.plot(wolf["timestamp"], wolf["Wolf_electricity_sum"],color = 'limegreen', label = 'electricity')
ax.plot(wolf["timestamp"], wolf["Wolf_gas_sum"],color = 'lightpink', label = 'gas')
wolf["Wolf_water_sum"].plot(figsize=(25,10), color ='darkcyan', label = 'water')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 10))

#display legend
ax.legend()

From the graphs, there does not seem to be a strong relationship between any of the variables. However, slight increases in water usage coincide with slight increases in electricity usage. 

**Train-test Split: first year for training, second year for testing**

In [None]:
wolf = wolf.set_index("timestamp")

In [None]:
wolf_model_data = wolf[["Wolf_electricity_sum", "Wolf_gas_sum", "Wolf_water_sum"]]
train = wolf_model_data.iloc[0:(len(wolf_model_data)-53)].copy()
test = wolf_model_data.iloc[len(train):(len(wolf_model_data) -1)].copy()

endog = train["Wolf_electricity_sum"]
y = train.drop("Wolf_electricity_sum", axis = 1)
exog = sm.add_constant(y)

**ACF PACF**

ACF estimates become less reliable as the lag increases. There are a total of 52 observations in train, so only the first 20 lags will be considered for the ACF.

In [None]:
plot_acf(train.Wolf_electricity_sum,lags=20)
plt.show()

In [None]:
plot_pacf(train.Wolf_electricity_sum,lags=20)
plt.show()

From the graphs, we see that the series is not strongly correlated with its own lagged values.

* ACF at lags 0 and 4 is significant. The rest of the values lie inside the 95% confidence band, so they are not significantly different from zero.
* PACF at lags 0 and 4 is significant. The rest of the values lie inside the 95% confidence band, so they are not significantly different from zero.

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Wolf_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is not significant, so the test cannot reject the null hypothesis of a unit root and the series is treated as non-stationary.

* This indicates that statistical properties of the data, such as the mean, variance and standard deviation, are not constant over time

**Seasonal Decomposition**

In [None]:
# Decompose the Wolf series (this cell previously pointed at the Peacock data)
decomp = seasonal_decompose(wolf["Wolf_electricity_sum"], period=12)

fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there seems to be a cyclic trend each year 
    * There is decreased usage of electricity in the spring/summer months. This could be attributed to the warmer days during this period. 
* There seems to be a seasonality component of the data
* Looking at the residuals, there appears to be randomness in the data

**Train Model**

> start_p = 1: Starting value of 1 for the order (number of time lags) of the autoregressive model

> start_q = 1: Starting value of 1 for the order of the moving average model

> m = 52: the number of observations per seasonal cycle; the data is weekly, so one year spans 52 periods

> start_P = 0: Starting value of 0 for the order of the autoregressive portion of the seasonal model

> seasonal = True: fit a seasonal model, since energy use can vary between seasons

> trace = True: print status on fitting

> stepwise = True: use a stepwise search over the hyper-parameters instead of fitting every combination; this is faster and less likely to overfit

In [None]:
smodel = pm.auto_arima(endog, exogenous= exog,start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 1)

**Fit Model**

In [None]:
mod = sm.tsa.statespace.SARIMAX(endog = endog ,exog = exog, order=(0,1,1), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Wolf_electricity_sum"].plot(figsize=(25,10))
mod_fit.fittedvalues.plot()
plt.show()

**Predict Using Model**

In [None]:
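# Forecast the test period out-of-sample; calling predict() with the training data only returns in-sample fitted values
test_exog = sm.add_constant(test[y.columns], has_constant="add")
predict = mod_fit.forecast(steps=len(test), exog=test_exog)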
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Wolf_electricity_sum"].plot(figsize=(20,8))
test['predicted'].plot(color = "red")
plt.show()

In [None]:
test['residual'] = abs(test["Wolf_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Wolf_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

# Bull

Bull corresponds to the buildings at University of Texas - Austin located in Austin, USA.

**Finding Relationships**

In [None]:
bull.head()

In [None]:
pp = sns.pairplot(bull)

In [None]:
ax = plt.subplot()

ax.plot(bull["timestamp"], bull["Bull_electricity_sum"],color = 'limegreen', label = 'electricity')
ax.plot(bull["timestamp"], bull["Bull_chilled_sum"], color = 'blue', label = 'chilled water')
bull["Bull_steam_sum"].plot(figsize=(25,10), color ='orchid', label = 'steam')

#change x axis ticks
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, 10))

#display legend
ax.legend()

There does not seem to be any strong relationships between variables. However, the pairplot shows a slight inverse relationship between chilled water usage and steam usage; as chilled water usage increases, steam usage decreases.

**Train-test Split: first year for training, second year for testing**

In [None]:
bull = bull.set_index("timestamp")

In [None]:
bull_model_data = bull[["Bull_electricity_sum", "Bull_chilled_sum", "Bull_steam_sum"]]
train = bull_model_data.iloc[0:(len(bull_model_data)-53)].copy()
test = bull_model_data.iloc[len(train):(len(bull_model_data) -1)].copy()

endog = train["Bull_electricity_sum"]
y = train.drop("Bull_electricity_sum", axis = 1)
exog = sm.add_constant(y)

**ACF PACF**

ACF estimates become less reliable as the lag increases. There are a total of 52 observations in train, so only the first 20 lags will be considered for the ACF.

In [None]:
plot_acf(train.Bull_electricity_sum,lags=20)
plt.show()

In [None]:
plot_pacf(train.Bull_electricity_sum,lags=20)
plt.show()

From the graphs, we see that the series is not strongly correlated with its own lagged values.

* ACF at lags 0 and 1 is significant. The rest of the values lie inside the 95% confidence band, so they are not significantly different from zero.
* PACF at lags 0, 1 and 16 is significant. The rest of the values lie inside the 95% confidence band, so they are not significantly different from zero.

**Dickey Fuller's Test**

In [None]:
t = sm.tsa.adfuller(train.Bull_electricity_sum, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])

The p-value is significant, so the test rejects the null hypothesis of a unit root and the series is treated as stationary.

* This indicates that statistical properties of the data, such as the mean, variance and standard deviation, are constant over time

**Seasonal Decomposition**

In [None]:
decomp = seasonal_decompose(bull["Bull_electricity_sum"], period=12)

fig = plt.figure()
fig = decomp.plot()
fig.set_size_inches(15,10)

* Looking at the trend graph, there does not seem to be much of a trend throughout each year
* There seems to be a seasonality component of the data
* Looking at the residuals, the data does not appear to be fully random. This might be caused by the filling of missing values. 
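
A visual impression that residuals are not fully random can be backed by a Ljung-Box test, which checks whether the first h residual autocorrelations are jointly zero. The sketch below implements the statistic Q = n(n+2) Σ r_k² / (n − k) in NumPy and compares it with 18.307, the 5% critical value of a chi-squared distribution with 10 degrees of freedom. It is therefore only valid for h = 10, and it ignores the degrees-of-freedom correction normally applied to residuals of a fitted ARMA model. The function name and demo series are illustrative; statsmodels also provides a ready-made `acorr_ljungbox` function.

```python
import numpy as np

CHI2_95_DF10 = 18.307  # 5% critical value, chi-squared with 10 degrees of freedom

def ljung_box_q(resid, h=10):
    """Ljung-Box Q statistic over the first h autocorrelations."""
    x = np.asarray(resid, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    q = 0.0
    for k in range(1, h + 1):
        r_k = np.dot(x[k:], x[:-k]) / denom
        q += r_k ** 2 / (n - k)
    return n * (n + 2) * q

# Made-up residuals with an obvious leftover pattern (a steady ramp)
demo_resid = np.arange(60, dtype=float)
q_stat = ljung_box_q(demo_resid)
verdict = "autocorrelated" if q_stat > CHI2_95_DF10 else "looks random"
print("Q:", round(q_stat, 1), "->", verdict)
```

In this notebook the input would be `mod_fit.resid` from the fitted model.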

**Train Model**

> start_p = 1: Starting value of 1 for the order (number of time lags) of the autoregressive model

> start_q = 1: Starting value of 1 for the order of the moving average model

> m = 52: the number of observations per seasonal cycle; the data is weekly, so one year spans 52 periods

> start_P = 0: Starting value of 0 for the order of the autoregressive portion of the seasonal model

> seasonal = True: fit a seasonal model, since energy use can vary between seasons

> trace = True: print status on fitting

> stepwise = True: use a stepwise search over the hyper-parameters instead of fitting every combination; this is faster and less likely to overfit

In [None]:
smodel = pm.auto_arima(endog, exogenous=exog,start_p=1, start_q=1, d = 1, D = 0, test="adf", 
                       max_p=12, max_q=12, m=52, start_P=0, seasonal = True, trace=True, suppress_warnings=True, stepwise=True
                      )

smodel.summary()

From the auto arima function, the best model is SARIMAX(0, 1, 1)

**Fit Model**

In [None]:
mod = sm.tsa.statespace.SARIMAX(endog = endog ,exog = exog, order=(0,1,1), trend='c')
mod_fit = mod.fit()
mod_fit.summary()

In [None]:
train["Bull_electricity_sum"].plot(figsize=(20,8))
mod_fit.fittedvalues.plot()
plt.show()

**Predict Using Model**

In [None]:
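# Forecast the test period out-of-sample; calling predict() with the training data only returns in-sample fitted values
test_exog = sm.add_constant(test[y.columns], has_constant="add")
predict = mod_fit.forecast(steps=len(test), exog=test_exog)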
test['predicted'] = predict.values
test.head(5)

In [None]:
test["Bull_electricity_sum"].plot(figsize=(20,8))
test['predicted'].plot(color = "red")
plt.show()

In [None]:
test['residual'] = abs(test["Bull_electricity_sum"]-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test["Bull_electricity_sum"]).sum()*100/len(test)
print("MAE:", MAE)
print("MAPE:", MAPE)

In [None]:
mod_fit.resid.plot(figsize= (10,5))

# Future Directions 

Considerations: 
* Building characteristics can greatly influence energy consumption [[2]](https://www.hydroquebec.com/residential/customer-space/electricity-use/electricity-consumption-factors.html)
    * Thermal envelope: poorly insulated walls, foundations and roof spaces can cause a loss of up to 40% of a building's heat
    * Air leaks: up to 25% of a building's heat may be escaping through leaks
    * Doors and windows: the number, size and quality of doors and windows can account for up to 25% of heat loss

    
Ideas: 
* What is the most common type of energy at a location, and how does that correspond to our buildings
* How does each location compare to others in the region
* Find the main energy type of the location and remodel
    * We assumed that every location uses electricity as its primary energy source, which may not be correct.
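
The last idea could start from a simple ranking of meters by total consumption. The sketch below is a hypothetical starting point: the column names follow the `<Site>_<meter>_sum` pattern used in this notebook, the values are invented, and summing raw readings ignores that different meters may be recorded in different units.

```python
import pandas as pd

def dominant_meter(df, suffix="_sum"):
    """Return the name of the meter column with the largest total reading."""
    meter_cols = [c for c in df.columns if c.endswith(suffix)]
    totals = df[meter_cols].sum()
    return totals.idxmax()

# Invented readings for illustration only
demo = pd.DataFrame({
    "timestamp": ["2016-01-03", "2016-01-10", "2016-01-17"],
    "Eagle_electricity_sum": [120.0, 130.0, 125.0],
    "Eagle_steam_sum": [300.0, 280.0, 310.0],
    "Eagle_chilled_sum": [40.0, 35.0, 50.0],
})
main_meter = dominant_meter(demo)
print(main_meter)
```

The returned column could then replace the electricity column as the endogenous series when remodelling that location.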