# GEFCom2017 Data Exploration Notebook

### Set up an environment

To run this notebook, create the conda environment and download the GEFCom2017 dataset by executing these commands from the root folder of TSPerf:
    
    conda env create --file ./common/conda_dependencies.yml
    source activate tsperf
    python energy_load/GEFCom2017_D_Prob_MT_hourly/common/download_data.py
    python energy_load/GEFCom2017_D_Prob_MT_hourly/common/extract_data.py

Install dependencies

In [1]:
!pip install patsy
!pip install statsmodels



### Load training data

In [None]:
import os
import warnings
warnings.simplefilter("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import pacf
%matplotlib inline
plt.rcParams['figure.figsize'] = [15, 5]

train_data_dir = '../data/train'

Combine train_base.csv and train_round_6.csv to get the entire training dataset. 

In [None]:
train_base = pd.read_csv(os.path.join(train_data_dir, 'train_base.csv'), parse_dates=['Datetime'])
train_round_6 = pd.read_csv(os.path.join(train_data_dir, 'train_round_6.csv'), parse_dates=['Datetime'])
train_all = pd.concat([train_base, train_round_6]).reset_index(drop=True)
train_all

### Check missing values and feature ranges

Check whether any of the columns contain missing values.

In [None]:
print("Number of missing values: {}".format(train_all.isna().sum().sum()))

Summary of the distribution of values of numeric columns

In [None]:
train_all.describe()

Show all distinct zones and their timespans

In [None]:
# String aggregation names give stable column names ('min', 'max') across pandas versions
train_all.groupby('Zone')['Datetime'].agg(['min', 'max']).reset_index().\
          rename(columns={'min':'min time', 'max':'max time'})

Show summary of the distribution of DEMAND values across zones

In [None]:
# String aggregation names give stable column names ('mean', 'min', 'max') across pandas versions
train_all.groupby('Zone')['DEMAND'].agg(['mean', 'min', 'max']).\
          rename(columns={'mean':'mean demand', 'min':'min demand', 'max':'max demand'}).\
          sort_values(by='mean demand').reset_index()

### Compute correlations between different features

In [None]:
train_all[['DEMAND','DewPnt','DryBulb','Holiday']].corr()

This table shows that the DewPnt and DryBulb features are highly correlated. Note that these temperature features cannot be used directly in forecasting, because they are not available at forecasting time. However, lagged temperatures from the available training data can be used.
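
As a rough illustration of the lagged-temperature idea, the sketch below shifts DryBulb within each zone on a small synthetic frame. The toy data, the zone names, the `lag_hours` value, and the `DryBulb_lag` column name are all assumptions made for this example. It also assumes that each zone has one row per hour with no gaps, so that shifting by rows is the same as shifting by hours.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_all: two hypothetical zones, six hourly rows each.
hours = pd.date_range("2011-01-01", periods=6, freq="h")
toy = pd.DataFrame({
    "Zone": ["ZoneA"] * 6 + ["ZoneB"] * 6,
    "Datetime": list(hours) * 2,
    "DryBulb": np.arange(12, dtype=float),  # synthetic temperatures
})

# Hypothetical lag, kept tiny so the toy output is readable. In practice the
# lag must be at least as long as the forecast horizon.
lag_hours = 2

# Sort so that a row shift within each zone corresponds to a time shift.
toy = toy.sort_values(["Zone", "Datetime"])
toy["DryBulb_lag"] = toy.groupby("Zone")["DryBulb"].shift(lag_hours)

print(toy)
```

The first `lag_hours` rows of each zone have no earlier observation to draw from, so their lagged value is NaN.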

### Visualize seasonalities in energy demand

In this section we show that the DEMAND data has multiple seasonalities: daily, weekly, and annual.

In [None]:
mean_demand = train_all.groupby('Datetime')['DEMAND'].mean()
mean_demand.plot(title="Mean demand over 6.5 years")

#### Daily seasonality

The following graph shows that mean energy consumption has daily seasonality. Energy consumption peaks around noon and again around 6pm, and drops significantly at night.

In [None]:
mean_demand[:24*3].plot(title="Mean demand over 3 days")
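
The daily pattern can also be summarized numerically by averaging demand per hour of day. The sketch below does this on a synthetic series with a single evening peak. The series `toy_demand` and its shape are assumptions for illustration only, and the same `groupby` could be applied to `mean_demand`.

```python
import numpy as np
import pandas as pd

# One week of synthetic hourly demand with a daily cycle peaking at 18:00.
idx = pd.date_range("2011-01-01", periods=24 * 7, freq="h")
hour_of_day = np.asarray(idx.hour)
toy_demand = pd.Series(
    100 + 10 * np.cos(2 * np.pi * (hour_of_day - 18) / 24),
    index=idx,
)

# Average demand for each hour of the day (0..23) across all days.
hourly_profile = toy_demand.groupby(toy_demand.index.hour).mean()

print("Peak hour:", hourly_profile.idxmax())
print("Lowest hour:", hourly_profile.idxmin())
```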

#### Weekly seasonality

The following graph shows that mean energy consumption has weekly seasonality. Energy consumption is higher on weekdays (January 3-7, January 10-14, January 17-21) and lower on weekends (January 1-2, January 8-9, January 15-16).

In [None]:
mean_demand[:24*21].plot(title="Mean demand over 21 days")

In [None]:
mean_total_daily_demand = mean_demand.resample('24h').sum()
weekday_mean_total_demand = mean_total_daily_demand[mean_total_daily_demand.index.dayofweek<5].mean()
weekend_mean_total_demand = mean_total_daily_demand[mean_total_daily_demand.index.dayofweek>=5].mean()
print('Total demand during weekday: {0:.2f} (averaged over all zones and weekdays)'.format(weekday_mean_total_demand))
print('Total demand during weekend day: {0:.2f} (averaged over all zones and weekend days)'.format(weekend_mean_total_demand))

#### Annual seasonality

The following graph shows that mean energy consumption has annual seasonality. Energy consumption increases in winter and summer and decreases in spring and fall.

In [None]:
# 'MS' groups by calendar month (labelled by month start) and is accepted across pandas versions
mean_demand.resample('MS').sum().plot(title="Total monthly demand (averaged over all zones)")

### Compute partial autocorrelation

The following plot shows the partial autocorrelation for lags of up to 24 hours * 14 days = 336 hours (2 weeks).

In [None]:
plot_pacf(mean_demand, lags=24*14)
plt.show()

This graph shows that most of the lags have very small partial autocorrelation. In the next cell we find the 20 lags with the largest absolute partial autocorrelation.

In [None]:
pacf_values, pacf_conf_intervals = pacf(mean_demand, nlags=24*14, alpha=0.05)
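# Sort lags by |PACF| in ascending order, skip the largest (lag 0, whose PACF is 1 by definition),
# then walk backwards to take the next 20 in descending order
top20_lags = np.argsort(np.abs(pacf_values))[-2::-1][:20]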
print(top20_lags)
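
The slicing in the cell above can be hard to read, so here is a small sketch of the same pattern on a made-up array. The values in `toy_values` are hypothetical, and position 0 plays the role of lag 0.

```python
import numpy as np

# Hypothetical values; index 0 mimics lag 0, which always has the largest magnitude.
toy_values = np.array([1.0, 0.2, -0.9, 0.5, -0.1])

# argsort returns indices in ascending order of |value|.
ascending = np.argsort(np.abs(toy_values))

# [-2::-1] starts at the second-largest entry (skipping index 0) and walks backwards,
# giving the remaining indices in descending order of |value|; [:2] keeps the top two.
top2 = ascending[-2::-1][:2]

print(ascending)
print(top2)
```

Here the largest remaining magnitudes belong to indices 2 and 3, so those are the two selected.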

In [None]:
pacf_values[top20_lags]

The lags with the highest partial autocorrelation are from within the last day (lags 1, 2, 13-19), from about a day ago (lags 22, 24, 25, 27), from 3 days ago (lag 73), from 6 days ago (lags 144, 145, 147), and from 7 days ago (lags 168, 169).

95% confidence intervals of the 20 lags with the largest partial autocorrelation:

In [None]:
pacf_conf_intervals[top20_lags]

The 95% confidence intervals of the partial autocorrelations of these lags do not contain zero. Hence all these lags have statistically significant partial autocorrelation.
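
Rather than inspecting the intervals by eye, the same check can be done programmatically. The sketch below uses a hypothetical array `toy_conf_intervals` with one `[lower, upper]` row per lag, which is assumed to match the layout of `pacf_conf_intervals[top20_lags]`.

```python
import numpy as np

# Hypothetical confidence intervals, one row per lag: [lower bound, upper bound].
toy_conf_intervals = np.array([
    [0.10, 0.20],    # entirely above zero
    [-0.05, 0.03],   # straddles zero
    [-0.30, -0.22],  # entirely below zero
])

# An interval excludes zero when both bounds lie on the same side of zero.
excludes_zero = (toy_conf_intervals[:, 0] > 0) | (toy_conf_intervals[:, 1] < 0)

print(excludes_zero)
print("All significant:", bool(excludes_zero.all()))
```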

This analysis suggests using these lags when developing feature sets for energy demand forecasting models. However, in this benchmark the forecast horizon is 1 to 2 months ahead, so the most recent lags cannot be used as features. Features derived from the same hour, same day of week, and same week of year could still be useful.
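
A minimal sketch of such calendar features is shown below, assuming an hourly `DatetimeIndex`. The column names `hour`, `day_of_week`, and `week_of_year` are illustrative choices and are not taken from the benchmark code.

```python
import pandas as pd

# A few synthetic hourly timestamps standing in for the Datetime column.
idx = pd.date_range("2017-03-01", periods=4, freq="h")

features = pd.DataFrame(index=idx)
features["hour"] = idx.hour
features["day_of_week"] = idx.dayofweek  # Monday = 0, Sunday = 6
# ISO week number taken from the standard datetime API of each timestamp.
features["week_of_year"] = [ts.isocalendar()[1] for ts in idx]

print(features)
```

Because these features depend only on the timestamp, they are known for any future date and remain usable at a 1 to 2 month horizon.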