# Index

<a href= "#Heading">Heading</a>

<a href= "#Data-Import">Data Import</a>

<a href= "#Date-Format-Update">Date Format Update</a>

<a href= "#Data-Visuals">Data Visuals</a>

<a href= "#Exponential-Smoothing">Exponential Smoothing</a>

<a href= "#Relationship-tests">Relationship tests</a>

<a href= "#Seasonal-Decomposition">Seasonal Decomposition</a>

<a href= "#Functions">Functions</a>

<a href= "#Stationarity-check">Stationarity check</a>

<a href= "#Model-Selection">Model Selection</a>

<a href= "#Train/-Test-split">Train/ Test split</a>

<a href= "#Model-with-Temp-&-Dew">Model with Temp & Dew</a>

<a href= "#Model-with-only-Temp">Model with only Temp</a>

<a href= "#Conclusions">Conclusions</a>

<a href= "#References">References</a>
    

[<a href='#Index'>Back to top</a>]

## Heading

In [None]:
from IPython.display import Image, HTML  # IPython.core.display is deprecated for these imports
url = 'https://5vtj648dfk323byvjb7k1e9w-wpengine.netdna-ssl.com/wp-content/uploads/2018/05/shutterstock_170867918-e1525266245642.jpg'
Image(url= url, width=600, height=600, unconfined=True)

Image source: www.fleetcarma.com

The objective of this notebook is to explore the features that are critical for forecasting power usage over a given period. Along the way, we look for the most effective ways to build that forecast.

Details of Data: https://www.kaggle.com/srinuti/residential-power-usage-3years-data-timeseries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
!pip install pmdarima 

[<a href='#Index'>Back to top</a>]

## Data Import

In [None]:
# Import files
import pandas as pd
import numpy as np
import seaborn as sns
import datetime as dt
import matplotlib.pyplot as plt


# Load specific forecasting tools
# statsmodels.tsa.arima_model (ARMA/ARIMA) was removed in statsmodels 0.13; use the replacement module
from statsmodels.tsa.arima.model import ARIMA, ARIMAResults
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf # for determining (p,q) orders
from statsmodels.graphics.tsaplots import month_plot, quarter_plot
from pmdarima import auto_arima # for determining ARIMA orders
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.filters.hp_filter import hpfilter
from statsmodels.tsa.stattools import adfuller,kpss,coint,bds,q_stat,grangercausalitytests,levinson_durbin

import sys

# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")

if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")

from sklearn.metrics import mean_squared_error, mean_absolute_error, explained_variance_score, r2_score, max_error,median_absolute_error, mean_squared_log_error

In [None]:
df_usage = pd.read_csv('../input/residential-power-usage-3years-data-timeseries/power_usage_2016_to_2020.csv')
df_weather = pd.read_csv('../input/residential-power-usage-3years-data-timeseries/weather_2016_2020_daily.csv')

In [None]:
df_usage.head()

In [None]:
df_weather.head()

[<a href='#Index'>Back to top</a>]

## Date Format Update

In [None]:
# Date column update for 'df_usage'

n = df_usage.shape[0]
p1 = pd.Series(range(n), pd.period_range('2016-06-01 00:00:00', freq = '1H', periods = n))
df_usage['StartDate'] = p1.to_frame().index

# Date column update for 'df_weather'
m = df_weather.shape[0]
p2 = pd.Series(range(m), pd.period_range('2016-06-01', freq = '1D', periods = m))
df_weather['Date'] = p2.to_frame().index

# convert the period date into timestamp
df_usage['StartDate'] = df_usage['StartDate'].apply (lambda x: x.to_timestamp())
df_usage['Date'] = pd.DatetimeIndex(df_usage['StartDate']).date

# convert the period date into timestamp
df_weather['Date'] = df_weather['Date'].apply (lambda x: x.to_timestamp())
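
The period-to-timestamp round trip above can also be written more directly. The following is a minimal sketch, assuming (as the cell above does) that the hourly readings start at 2016-06-01 00:00 with no gaps. `toy_usage` is a hypothetical stand-in for `df_usage`, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for df_usage: 48 hourly readings (two days)
toy_usage = pd.DataFrame({"Value (kWh)": [1.0] * 48})

# Build hourly timestamps directly instead of converting from periods
toy_usage["StartDate"] = pd.date_range(
    "2016-06-01 00:00:00", periods=len(toy_usage), freq=pd.Timedelta(hours=1)
)

# Calendar day of each reading, kept as a midnight Timestamp
toy_usage["Date"] = toy_usage["StartDate"].dt.normalize()
```

Keeping `Date` as a Timestamp (rather than a Python `date` object) means a later index-based merge with a datetime-indexed weather frame compares like with like.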

In [None]:
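# Sum only numeric columns; recent pandas versions raise on datetime/text columns otherwise
df_usage_daily = df_usage.groupby('Date').sum(numeric_only=True)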

df_usage_daily['day_of_week'] = df_usage_daily['day_of_week'].apply(lambda x: x/24)

notes_col = df_usage.groupby('Date').first()['notes'].values
df_usage_daily['notes'] = notes_col
df_usage_daily.head()
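
The hourly-to-daily aggregation can be illustrated on a small example. This is a sketch on made-up values, assuming the frame has a `DatetimeIndex`; it uses `resample` with a per-column rule instead of the `groupby` plus manual `notes` assignment above.

```python
import pandas as pd

# Hypothetical hourly data covering two days (values are illustrative only)
hourly = pd.DataFrame(
    {
        "Value (kWh)": [0.5] * 24 + [1.0] * 24,
        "notes": ["weekday"] * 24 + ["vacation"] * 24,
    },
    index=pd.date_range("2016-06-01", periods=48, freq=pd.Timedelta(hours=1)),
)

# Sum the energy per day and keep the first note recorded on each day
daily = hourly.resample("D").agg({"Value (kWh)": "sum", "notes": "first"})
```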

In [None]:
#filter the weather data to match with power usage dataframe. 

k = df_usage_daily.shape[0]
df_weather = df_weather[0:k]
df_weather.set_index('Date', inplace=True)
df_weather.head()

In [None]:
df_weather.shape

In [None]:
comb_df = pd.merge(df_weather,df_usage_daily,left_index=True, right_index=True)
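
A short sketch of what this index-based merge does, using made-up numbers. `pd.merge` defaults to an inner join, so only dates present in both frames survive. The frame names and values below are hypothetical.

```python
import pandas as pd

idx = pd.date_range("2016-06-01", periods=4, freq="D")

# Hypothetical weather and usage frames; usage is missing the last day
toy_weather = pd.DataFrame({"Temp_max": [85, 88, 90, 87]}, index=idx)
toy_usage_daily = pd.DataFrame({"Value (kWh)": [29.7, 43.2, 33.1]}, index=idx[:3])

# Inner join on the index: rows are matched by date, unmatched dates are dropped
toy_comb = pd.merge(toy_weather, toy_usage_daily, left_index=True, right_index=True)
```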

In [None]:
comb_df.columns

In [None]:
comb_df.drop(columns= ['Temp_avg', 'Temp_min','Dew_avg',
       'Dew_min', 'Hum_avg', 'Hum_min', 'Wind_avg',
       'Wind_min','Press_avg', 'Press_min', 'Precipit','day_of_week_x', 'day_of_week_y'], inplace=True)
comb_df.index.freq= 'D'

In [None]:
comb_df.head()

In [None]:
comb_df['Value (kWh)'].loc['2017-01-01':'2019-12-31'].plot(figsize= (16,9), legend= True, ylabel='Power in kWh')

Power usage over three years. Peak daily usage was around 50 kWh in 2017, about 68 kWh in 2018, and about 55 kWh in 2019.

[<a href='#Index'>Back to top</a>]

## Data Visuals

In [None]:
comb_df[['Temp_max','Value (kWh)', 'Dew_max' ]].loc['2017-01-01':'2019-12-31'].plot(figsize= (16,9))

From the above graph, power usage moves with Temperature and Dew. The data fluctuates considerably, so smoothing and filtering are needed.

In [None]:
comb_df.head()

In [None]:
# .copy() so that columns added later do not trigger SettingWithCopyWarning
df_short = comb_df.loc['2017-01-01':'2019-12-31'].copy()

In [None]:
# numeric_only skips the text 'notes' column when averaging
df_short.resample(rule= 'M').mean(numeric_only=True).plot(figsize= (16,9))

The monthly averages over the 3 years show minimal fluctuations. Again, only two curves (Temp and Dew) show correlation with power, so the remaining data columns are removed from the analysis.



In [None]:
df_short = df_short[['Temp_max', 'Dew_max', 'Value (kWh)','notes']].copy()

# numeric_only skips the text 'notes' column when averaging
df_short.resample(rule= 'W').mean(numeric_only=True).plot(figsize= (16,9))

The weekly averages show significant fluctuations, so filters need to be applied.

In [None]:
df_short['Value (kWh)'].loc['2017-01-01': '2018-01-01'].resample(rule= 'W').mean().plot(figsize= (16,9), legend=True)

From the graphs above, only two features, Temp and Dew, appear to contribute to predicting power.

[<a href='#Index'>Back to top</a>]

## Exponential Smoothing

In [None]:
# Note: the columns are named 'EWMA12', but the smoothing span used here is 30 days
df_short['EWMA12'] = df_short['Value (kWh)'].ewm(span=30,adjust=True).mean()
df_short['EWMA12_Temp'] = df_short['Temp_max'].ewm(span=30,adjust=True).mean()
df_short['EWMA12_Dew'] = df_short['Dew_max'].ewm(span=30,adjust=True).mean()

In [None]:
df_short[['Value (kWh)','EWMA12', 'EWMA12_Temp', 'EWMA12_Dew']].plot(figsize= (16,9))

Comparing the raw power data with the exponentially smoothed series, fluctuations remain and the trend line is still not smooth.
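
To make the smoothing concrete, here is a small sketch of what `ewm(span=30, adjust=True).mean()` computes. With `span` given, pandas uses a smoothing factor of 2 / (span + 1), and with `adjust=True` each value is a weighted average of all earlier points with weights (1 - alpha)^i. The helper `ewma_adjusted` and the sample values are hypothetical and for illustration only.

```python
import numpy as np
import pandas as pd

def ewma_adjusted(values, span):
    """EWMA with pandas' adjust=True weighting, written out explicitly."""
    alpha = 2.0 / (span + 1.0)
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    for t in range(len(values)):
        # Weights (1 - alpha)^i for i = 0 (newest point) .. t (oldest point)
        w = (1.0 - alpha) ** np.arange(t + 1)
        # values[t::-1] walks from the newest point back to the first one
        out[t] = np.dot(w, values[t::-1]) / w.sum()
    return out

x = [10.0, 12.0, 9.0, 15.0, 11.0]  # made-up readings
manual = ewma_adjusted(x, span=30)
reference = pd.Series(x).ewm(span=30, adjust=True).mean().to_numpy()
```

With a span of 30, alpha is roughly 0.065, so each new day moves the average only slightly. This is one reason the smoothed curve lags behind sharp changes in the raw data.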

In [None]:
# Compare the monthly and quarterly seasonal patterns.
# Plot all four graphs together: power (top row) and temperature (bottom row).


fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize= (16,9),  squeeze=False)
#fig = plt.figure(8,5)


dfm = df_short['Value (kWh)'].resample(rule='M').mean()
month_plot(dfm, ylabel= 'Power in kWh', ax =ax1);
dfq = df_short['Value (kWh)'].resample(rule='Q').mean()
quarter_plot(dfq, ylabel = 'Power kWh', ax=ax2);

dftm = df_short['Temp_max'].resample(rule='M').mean()
month_plot(dftm, ylabel= 'Temp in Fdeg', ax=ax3);
dftq = df_short['Temp_max'].resample(rule='Q').mean()
quarter_plot(dftq, ylabel = 'Temp in Fdeg', ax=ax4);

fig.tight_layout(pad=1.2)

# for ax in fig.get_axes():
#     ax.label_outer()


The data clearly shows seasonality: during the summer months, May to October, the power bill is higher. The thermostat is set to 66F for heating and 70F for cooling.
1. During January, February, March, April, November, and December the AC is off most of the time and the heater runs only occasionally, hence the lower power bill in these months.
2. In Q2 and Q3 of each year, the power bill is higher because of summer.
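
The per-month pattern that `month_plot` visualizes can also be summarized numerically. The sketch below uses a synthetic daily series that peaks in mid-summer (illustrative only, not the real data) and averages it by calendar month across all years.

```python
import numpy as np
import pandas as pd

# Synthetic daily series, lowest in mid-January and highest in mid-July
days = pd.date_range("2017-01-01", "2019-12-31", freq="D")
day_of_year = days.dayofyear.to_numpy()
toy_power = pd.Series(
    30 - 15 * np.cos(2 * np.pi * (day_of_year - 15) / 365.25), index=days
)

# Average usage per calendar month, pooled over the three years
monthly_profile = toy_power.groupby(toy_power.index.month).mean()
peak_month = int(monthly_profile.idxmax())
```

On the real data, the same `groupby(index.month).mean()` call on `df_short['Value (kWh)']` would give a 12-row profile to compare against the plots above.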

[<a href='#Index'>Back to top</a>]

## Relationship tests

In [None]:
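# grangercausalitytests is one-directional: it tests whether the second column Granger-causes the first.
# A p-value < 0.05 rejects the null hypothesis of no Granger causality.
# Add a semicolon at the end to avoid duplicate output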
grangercausalitytests(df_short[['Temp_max','Dew_max']],maxlag=3);

The interaction between temperature and Dew seems minimal at lag1 & lag2. 

In [None]:
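# grangercausalitytests is one-directional: it tests whether the second column Granger-causes the first.
# A p-value < 0.05 rejects the null hypothesis of no Granger causality.
# Add a semicolon at the end to avoid duplicate output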
grangercausalitytests(df_short[['Temp_max','Value (kWh)']],maxlag=3);

Temperature and power are strongly correlated with each other.
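
A Granger test is not the same thing as a correlation, so a simple complementary check is the Pearson correlation between one series and lagged copies of the other. The sketch below is not the Granger test. It is a plainer technique swapped in for illustration, run on synthetic series where "power" is constructed to follow "temp" two steps later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp = pd.Series(rng.normal(size=300))

# Synthetic "power" that copies "temp" from two steps earlier, plus a little noise
power = temp.shift(2) + 0.1 * pd.Series(rng.normal(size=300))

# Correlation between power at time t and temp at time t - lag
lag_corr = {lag: power.corr(temp.shift(lag)) for lag in range(0, 5)}
best_lag = max(lag_corr, key=lag_corr.get)
```

On the notebook's data, the same dictionary comprehension could be applied to `df_short['Value (kWh)']` and `df_short['Temp_max']` to see at which lag the two series line up best.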

[<a href='#Index'>Back to top</a>]

## Seasonal Decomposition

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
result_pwr= seasonal_decompose(df_short['EWMA12'], model='additive')
result_pwr.plot();

In [None]:
result_pwr.trend.plot(figsize=(16,9), legend= True, ylabel= 'Power in kWh trend')

The trend still shows fluctuations, so further smoothing is needed for the trend curves.

In [None]:
## HP Filter
# Filtering the seasonality out of the data and converting this into only trend lines.
pwr_cycle, pwr_trend = hpfilter(df_short['EWMA12'],lamb=129600)
df_short['hpfilt_trend'] = pwr_trend.values

# Filtering the seasonality out of the data and converting this into only trend lines.
temp_cycle, temp_trend = hpfilter(df_short['EWMA12_Temp'],lamb=129600)
df_short['hpfilt_temp']= temp_trend.values

# Filtering the seasonality out of the data and converting this into only trend lines.
dew_cycle, dew_trend = hpfilter(df_short['EWMA12_Dew'],lamb=129600)
df_short['hpfilt_dew']= dew_trend.values
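
For readers unfamiliar with the Hodrick-Prescott filter, the sketch below shows the underlying computation: the trend minimizes the squared distance to the data plus `lamb` times the squared second differences of the trend, which reduces to one linear solve. The function `hp_filter_dense` is a hypothetical dense-matrix illustration for short series, not the statsmodels implementation, and the input series is synthetic.

```python
import numpy as np

def hp_filter_dense(y, lamb):
    """Return (cycle, trend) by solving (I + lamb * D'D) trend = y."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # D is the (n-2) x n second-difference operator with rows [1, -2, 1]
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    trend = np.linalg.solve(np.eye(n) + lamb * (D.T @ D), y)
    return y - trend, trend

t = np.arange(60, dtype=float)
y = 0.5 * t + 3.0 * np.sin(2 * np.pi * t / 12)  # straight line plus a 12-step wave
cycle, trend = hp_filter_dense(y, lamb=129600)

# A pure straight line has zero second differences, so it passes through unchanged
line_cycle, line_trend = hp_filter_dense(0.5 * t, lamb=129600)
```

The larger `lamb` is, the more strongly curvature in the trend is penalized and the closer the trend gets to a straight line.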

In [None]:
df_short.head()

In [None]:

ax = df_short['EWMA12'].plot(figsize=(16,9), legend=True)


df_short['hpfilt_trend'].plot(figsize=(16,9), legend=True)
pwr_cycle.plot(legend=True)


# Mark vacation days with red vertical lines
for day in df_short[df_short['notes'] =='vacation'].index:
    ax.axvline(x=day, color= 'red', alpha= .25);

# Dashed guide lines at 60% of the minimum and maximum of the cycle component
ax.axhline(y=(pwr_cycle.values.min())*.6, xmin=0, xmax=1, color= 'black', alpha= .25, ls= '--' )
ax.axhline(y=(pwr_cycle.values.max())*.6, xmin=0, xmax=1, color= 'black', alpha= .25, ls= '--' )

The red lines mark vacation periods during the year. In these periods the AC or heater is turned off, so power consumption generally dips compared with regular days.
The green line represents the seasonal cycle of power consumption over the 3 years.


In [None]:
title = 'Autocorrelation: Power usage'
lags = 10
plot_acf(df_short['EWMA12'],title=title,lags=lags);
#plot_pacf(df_short['EWMA12'],title=title,lags=lags);

The autocorrelation of the smoothed power series decreases gradually over the first 10 lags, so consecutive days remain closely related.
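
The quantity plotted by an autocorrelation plot can be computed by hand. The sketch below uses the usual sample estimator (the lag-k covariance divided by the lag-0 sum of squares) on a synthetic, slowly varying series. The helper name `acf` is hypothetical and this is an illustration rather than the statsmodels code.

```python
import numpy as np

def acf(x, nlags):
    """Sample autocorrelation r_0 .. r_nlags (each lag divided by the lag-0 sum)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    denom = np.dot(d, d)
    return np.array(
        [np.dot(d[: len(d) - k], d[k:]) / denom for k in range(nlags + 1)]
    )

smooth = np.sin(np.linspace(0, 4 * np.pi, 200))  # slowly varying synthetic series
r = acf(smooth, nlags=10)
```

A heavily smoothed series such as an exponentially weighted average behaves like this example: neighbouring values are nearly identical, so the autocorrelation starts at 1 and falls off slowly.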

[<a href='#Index'>Back to top</a>]

## Functions

In [None]:
from statsmodels.tsa.stattools import adfuller

def adf_test(series,title=''):
    """
    Pass in a time series and an optional title, returns an ADF report
    """
    print(f'Augmented Dickey-Fuller Test: {title}')
    result = adfuller(series.dropna(),autolag='AIC') # .dropna() handles differenced data
    
    labels = ['ADF test statistic','p-value','# lags used','# observations']
    out = pd.Series(result[0:4],index=labels)

    for key,val in result[4].items():
        out[f'critical value ({key})']=val
        
    print(out.to_string())          # .to_string() removes the line "dtype: float64"
    
    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis")
        print("Reject the null hypothesis")
        print("Data has no unit root and is stationary")
    else:
        print("Weak evidence against the null hypothesis")
        print("Fail to reject the null hypothesis")
        print("Data has a unit root and is non-stationary")

#Code from Jose Portilla 'Python for Time Series Data Analysis'

In [None]:
# Define a function to evaluate the performance of each time series model.

from sklearn.metrics import explained_variance_score,mean_squared_error, r2_score,max_error, mean_absolute_error

def model_evaluate(model, y_test, y_pred):
    exp_var_score = explained_variance_score(y_test, y_pred)
    max_err= max_error(y_test, y_pred)
    r2= r2_score(y_test, y_pred)
    mae= mean_absolute_error(y_test, y_pred)
    mse= mean_squared_error(y_test, y_pred)
    rmse= np.sqrt(mse)
    
    row_label = [model]
    
    data_score = { 'exp_varne': exp_var_score, 'max_error':max_err, 
                 'r2': r2, 'mae':mae, 'mse':mse, 'rmse':rmse,}
    
    df_data = pd.DataFrame(data= data_score, index= row_label)
    
    return df_data
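
For reference, most of the scores returned by `model_evaluate` can be written out in a few lines of NumPy. This is a sketch on made-up numbers, intended only to show what each metric measures.

```python
import numpy as np

# Made-up actual and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

resid = y_true - y_hat
mae = np.mean(np.abs(resid))      # mean absolute error
mse = np.mean(resid ** 2)         # mean squared error
rmse = np.sqrt(mse)               # same units as the target (kWh in this notebook)
max_err = np.max(np.abs(resid))   # worst single miss
# R2: share of the variance around the mean that the predictions explain
r2 = 1.0 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)
```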

[<a href='#Index'>Back to top</a>]

## Stationarity check

In [None]:
adf_test(df_short['Temp_max'])

In [None]:
adf_test(df_short['hpfilt_trend'])

In [None]:
adf_test(df_short['hpfilt_temp'])

In [None]:
adf_test(df_short['hpfilt_dew'])
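
When a series fails the stationarity check, the usual remedy is differencing, which is what the `d` term of an ARIMA order controls (the models later in this notebook use `order=(2,2,1)`, so `d = 2`). The sketch below uses a synthetic quadratic trend to show that one difference leaves a linear trend and a second difference removes it.

```python
import numpy as np
import pandas as pd

t = np.arange(10, dtype=float)
quadratic = pd.Series(2.0 * t ** 2 + 3.0 * t + 1.0)  # synthetic trend with curvature

d1 = quadratic.diff().dropna()          # first difference: still trending (linear in t)
d2 = quadratic.diff().diff().dropna()   # second difference: constant
```

On the notebook's data, the differenced series can be passed straight to `adf_test`, which already drops the leading missing values.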

[<a href='#Index'>Back to top</a>]

## Model Selection

In [None]:
stepwise_fit= auto_arima(df_short['hpfilt_trend'],max_order= 20,n_jobs=-1, stepwise=True)
stepwise_fit.summary()

[<a href='#Index'>Back to top</a>]

## Train/ Test split

In [None]:
# Negative index: the last 10% of rows (most recent dates) are held out as the test set
size = int(len(df_short)*(-.1))
train, test = df_short[:size],df_short[size:]
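
The same chronological split can be written with an explicit cut point. The helper `chronological_split` below is a hypothetical sketch, not part of any library. It guards against the case where the computed size is 0, in which `df[:0]` would be empty and `df[0:]` would contain every row.

```python
import pandas as pd

def chronological_split(frame, test_fraction=0.1):
    """Split a time-ordered frame into (train, test) without shuffling."""
    n_test = int(len(frame) * test_fraction)
    if n_test < 1:
        raise ValueError("test_fraction leaves no rows for the test set")
    cut = len(frame) - n_test
    return frame.iloc[:cut], frame.iloc[cut:]

# Toy frame with one row per day for 2017 through 2019 (1095 rows)
toy = pd.DataFrame(
    {"y": range(1095)},
    index=pd.date_range("2017-01-01", periods=1095, freq="D"),
)
train_toy, test_toy = chronological_split(toy, test_fraction=0.1)
```

Time series data should not be shuffled before splitting, because the test set has to lie strictly after the training set for the forecast evaluation to be meaningful.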

[<a href='#Index'>Back to top</a>]

## Model with Temp & Dew

In [None]:
model = SARIMAX(train['hpfilt_trend'],exog= train[['hpfilt_temp', 'hpfilt_dew']], order=(2,2,1),seasonal_order=(0,0,0,0),enforce_invertibility=False)
results = model.fit()
results.summary()

In [None]:
# Obtain predicted values
start=len(train)
end=len(train)+len(test)-1
predictions = results.predict(start=start, end=end, exog= test[['hpfilt_temp', 'hpfilt_dew']], dynamic=False).rename('SARIMA(2,2,1) Predictions')

In [None]:
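# Use a dedicated title; the earlier 'title' variable still holds the autocorrelation plot title
ax = train['hpfilt_trend'].plot(legend=True,figsize=(12,6),title='Power usage trend: train, test and predictions (Temp & Dew)')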
test['hpfilt_trend'].plot(legend=True)
predictions.plot(legend=True)

In [None]:
model_evaluate('SARIMA(2,2,1)',test['hpfilt_trend'], predictions)

[<a href='#Index'>Back to top</a>]

## Model with only Temp

In [None]:
model_nodew = SARIMAX(train['hpfilt_trend'],exog= train[['hpfilt_temp']], order=(2,2,1),seasonal_order=(0,0,0,0),enforce_invertibility=False)
results_nodew = model_nodew.fit()
results_nodew.summary()

In [None]:
predictions_nodew = results_nodew.predict(start=start, end=end, exog= test[['hpfilt_temp']], dynamic=False).rename('SARIMA(2,2,1)nodew')

In [None]:
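# Use a dedicated title; the earlier 'title' variable still holds the autocorrelation plot title
ax = train['hpfilt_trend'].plot(legend=True,figsize=(12,6),title='Power usage trend: train, test and predictions (Temp only)')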
test['hpfilt_trend'].plot(legend=True)
predictions_nodew.plot(legend=True,ls = '--', color= 'black')

In [None]:
model_evaluate('SARIMA(2,2,1)_nodew', test['hpfilt_trend'], predictions_nodew)

[<a href='#Index'>Back to top</a>]

## Conclusions

1. The ARIMA models were able to predict a known path to a certain extent (R2 = 0.94). When the train/test split is changed to 80/20, the algorithm struggles to predict sudden changes in the path. The model with temperature as the only exogenous feature has better r2, mae, and mse values than the model with two exogenous features.
2. Another important point is that the trend lines need to be smoothed and filtered in order to achieve stationarity and improve predictability.
3. Only two full yearly cycles are not sufficient for a good forecast analysis, so more data is needed.

[<a href='#Index'>Back to top</a>]

## References

1. Jose Portilla, 'Python for Time Series Data Analysis' (Udemy course)
2. B V Vishwas and Ashish Patel, 'Hands-on Time Series Analysis with Python'
3. Jonathan D. Cryer and Kung-Sik Chan, 'Time Series Analysis: With Applications in R'