# Multi Level Electricity Load Forecasting

Competition: https://www.kaggle.com/c/multi-level-electricity-load-forecasting/overview
- better structured competition for the same data - https://www.kaggle.com/jeanmidev/smart-meters-in-london

**Abstract**

You have a set of power consumption data at half-hour time steps for the year 2013. More precisely, this is the average consumption of groups of consumers of sizes 10, 100 and 1000. For each size of group you have 10 consumption series:
- size 10 aggregates: X1, X2,…, X10
- size 100 aggregates: X11, X12,…, X20
- aggregates of size 1000: X21, X22,…, X30

The objective is to forecast the consumption of these groups of customers from January 1 to February 27, 2014. In addition to consumption data for 2013, you have the temperature observed in 2013 and 2014 at **half-hourly intervals**.

Data is from UK Power Networks-led Low Carbon London project: https://data.london.gov.uk/dataset/smartmeter-energy-use-data-in-london-households (dead link)
+ backup link ? - https://old.datahub.io/dataset/smartmeter-energy-use-data-in-london-households


**Evaluation**
Performance is evaluated with RMSE (root mean squared error).

The public evaluation uses a randomly drawn 30% of the test dataset; the private evaluation is carried out on the remaining 70% of the data.
Submissions must be a text file with 2 columns named Id and Prediction.
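
As a reminder of what the metric measures, here is a minimal, dependency-free sketch of RMSE. The function name `rmse` and the toy values are illustrative assumptions, not part of the competition kit.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values, not taken from the competition data.
actual = [0.30, 0.25, 0.40, 0.35]
predicted = [0.28, 0.27, 0.45, 0.30]
score = rmse(actual, predicted)
print(f"RMSE: {score:.4f}")
```

Because errors are squared before averaging, a few large misses (for example around consumption peaks) weigh more than many small ones.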

---

## Hours invested:
+ 6h - initial structure, EDA plots tweaks, correlations
+ 2h - Prophet model tweaking, checklist
+ 1h - outlier detection, signal smoothing
+ 2h - notes, checklists, diagrams
---

<span id='toc'></span>

**Table of Contents**:
1. [Imports](#toc-imports)
  1. [Import deps](#toc-deps)
  2. [Inserts for Jupyter](#toc-jupins)
  3. [Import data](#toc-data)
2. [EDA](#toc-eda) 
  1. [Energy Consumption per X1 (matplotlib)](#toc-eda-mpl)
    1. [Checking Stationarity](#toc-eda-stationarity)
  2. [Energy Consumption per X1 (plotly)](#toc-eda-ply)
  3. [Energy Consumption vs Temperature (plotly)](#toc-eda-link)
  4. [Checking rolling correlation](#toc-eda-corr-roll)
  5. [Checking global correlation (weekly resampling)](#toc-eda-corr-global)
3. [Modelling](#toc-modelling)
  1. [Outlier Detection](#toc-outlier)
  2. [Signal Smoothing](#toc-signal)
  3. [Prophet Model](#toc-prophet)
4. [Bibliography](#toc-bibliography)


Notes:
+ ...



---

# Technical TODO

## **1.C.** Data Checklist
⬜️ summarize alternate datasets  
+ ⬜️ temperature - depending on the local climate, extreme heat or cold create outliers, mild shifts create trends due to heating/cooling systems

## **3.A.** Outlier Detection Checklist
✅ check PyCaret - used IsolationForest
+ ⬜️ thresholding for outlier elimination

⬜️ extreme outliers -> normalize to normal outliers  
or  
⬜️ extreme outliers -> eliminate and roll data
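
The two open options above can be sketched without any ML library. This is a rough illustration under stated assumptions: outliers are flagged with Tukey's IQR fences rather than with the IsolationForest scores used later, "eliminate and roll" is read as "replace with the mean of the nearest kept neighbours", and all function names are hypothetical.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Lower/upper fences from the interquartile range (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_outliers(values, k=1.5):
    """Option 1: pull extreme points back to the nearest fence."""
    low, high = iqr_bounds(values, k)
    return [min(max(v, low), high) for v in values]


def replace_outliers_with_neighbours(values, k=1.5):
    """Option 2: drop extreme points and fill each gap with the mean of the
    nearest kept values on either side (or the single available one)."""
    low, high = iqr_bounds(values, k)
    keep = [low <= v <= high for v in values]
    result = list(values)
    for i, ok in enumerate(keep):
        if ok:
            continue
        left = next((values[j] for j in range(i - 1, -1, -1) if keep[j]), None)
        right = next((values[j] for j in range(i + 1, len(values)) if keep[j]), None)
        neighbours = [v for v in (left, right) if v is not None]
        result[i] = sum(neighbours) / len(neighbours)
    return result


# Toy series with one obvious spike, not taken from the dataset.
series = [0.30, 0.32, 0.31, 5.00, 0.29, 0.33, 0.30, 0.31]
clipped = clip_outliers(series)
filled = replace_outliers_with_neighbours(series)
```

Clipping keeps the information that something unusual happened at that timestamp, while replacement removes it entirely; which one suits the forecast is still an open item on this checklist.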

## **3.C.** Prophet Checklist
  
**Model Tweaking / Feature Engineering**:  

⬜️ consider the stationarity tests results in preprocessing the data - difference / detrend  
⬜️ consider cleaning the input for the model:
- **generative** - smoothen the input to the model (rolling window, signal smoothing, downsampling via numerical calculus interpolations)
- **discriminative** - take into account all the raw data, minus statistical or model-detected outliers
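
For the smoothing route, a rolling window is the simplest candidate. The sketch below is a plain centered moving average written from scratch; the function name and window size are assumptions, and it is not the Savitzky-Golay filter applied later in the notebook.

```python
def centered_moving_average(values, window=5):
    """Smooth a sequence with a centered window that shrinks at the edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Toy zig-zag signal, not taken from the dataset.
noisy = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0, 1.0]
smooth = centered_moving_average(noisy, window=3)
```

A wider window removes more noise but also flattens genuine daily peaks, so the window size would need to be validated against forecast error rather than chosen by eye.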

✅ make the Prophet linear model a little bit more flexible  
✅ take into account holidays for each country (you may use more electricity during holidays)  
+ ⬜️ consider holiday prior ([reference](https://facebook.github.io/prophet/docs/seasonality,_holiday_effects,_and_regressors.html#prior-scale-for-holidays-and-seasonality))  
+ ⬜️ are there other major events to be considered in 2013-2014 in London (that may stimulate people to stay more at home)

⬜️ take into account seasons vs electricity usage - electricity-based heating can play a big factor, though summer also has AC - winter vs transition vs summer - ([reference](https://facebook.github.io/prophet/docs/seasonality,_holiday_effects,_and_regressors.html#seasonalities-that-depend-on-other-factors))  
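
Conditional seasonality, as described in the referenced Prophet page, needs a boolean column that says which regime each timestamp belongs to. A minimal sketch of building such flags follows; the month set is an assumption taken from the assumptions list (roughly November to March), and the column names `cold_season` / `warm_season` are hypothetical.

```python
from datetime import datetime

# Assumed heating season for London: November through March.
COLD_MONTHS = frozenset({11, 12, 1, 2, 3})


def is_cold_season(ds):
    """True when the timestamp falls in the assumed heating season."""
    return ds.month in COLD_MONTHS


def add_season_flags(timestamps):
    """Build one record per timestamp with complementary season flags."""
    return [
        {"ds": ts, "cold_season": is_cold_season(ts), "warm_season": not is_cold_season(ts)}
        for ts in timestamps
    ]


rows = add_season_flags([
    datetime(2013, 1, 15),
    datetime(2013, 7, 15),
    datetime(2013, 11, 1),
])
```

The same flags would have to be present in both the training frame and the future frame, since the model needs to know the regime of every timestamp it predicts.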

⬜️ weekends can be modelled separately - people tend to stay at home more ([reference](https://facebook.github.io/prophet/docs/seasonality,_holiday_effects,_and_regressors.html#additional-regressors))  

⬜️ what if seasonality is not additive - eg: due to urban development ([reference](https://facebook.github.io/prophet/docs/multiplicative_seasonality.html))  

⬜️ consider uncertainty  
+ ⬜️ trend uncertainty ([reference](https://facebook.github.io/prophet/docs/uncertainty_intervals.html#uncertainty-in-the-trend))  
+ ⬜️ seasonality uncertainty ([reference](https://facebook.github.io/prophet/docs/uncertainty_intervals.html#uncertainty-in-seasonality))  

⬜️ model drift

⬜️ time series split - crossvalidation ([reference](https://facebook.github.io/prophet/docs/diagnostics.html))  
⬜️ granularity of the models - models for aggregates of 10, 100, 1000, ensembles of the aforementioned
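
For the cross-validation item, the key constraint is that every training window ends before its test window starts. The generator below is a hand-rolled sketch of an expanding-window split; its name and parameters are assumptions chosen to mirror the initial / period / horizon idea in the referenced diagnostics page, not Prophet's own implementation.

```python
def expanding_window_splits(n_samples, initial, horizon, step):
    """Yield (train_indices, test_indices) pairs where training data always
    precedes the test window, so no future values leak into training."""
    start = initial
    while start + horizon <= n_samples:
        yield range(0, start), range(start, start + horizon)
        start += step


# One year of half-hourly data, assuming no gaps: 365 * 48 = 17520 rows.
# Start with 180 days of training data, then test on successive 30-day blocks.
splits = list(expanding_window_splits(
    n_samples=365 * 48,
    initial=180 * 48,
    horizon=30 * 48,
    step=30 * 48,
))

for train_idx, test_idx in splits:
    print(f"train: 0..{train_idx.stop - 1}  test: {test_idx.start}..{test_idx.stop - 1}")
```

Running the same folds for the size-10, size-100 and size-1000 aggregates would make the granularity comparison in the last checklist item like for like.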

**Tooling / MLOps**:  
⬜️ serializing models ([reference](https://facebook.github.io/prophet/docs/additional_topics.html))

---

**Assumptions list**:
- Higher energy consumption during winter, smaller consumption during summer.
- Given the climate of London, summer temperatures don't affect the model as much, the same applying to transitory months. As such, only ~November-March might be relevant for the seasonality component.
- Weekends would generally have a higher energy consumption (people staying at home, etc.)
- High correlation between aggregates of the same order.
- Higher variance in correlation between aggregates of different sizes.
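
The weekend assumption is cheap to check before it is built into a model. Below is a small sketch that compares weekday and weekend means; the function name is hypothetical and the two-week series is synthetic, built so that weekends are higher.

```python
from datetime import datetime, timedelta
from statistics import mean


def weekday_weekend_means(timestamps, values):
    """Return (mean over Mon-Fri, mean over Sat-Sun)."""
    weekday = [v for t, v in zip(timestamps, values) if t.weekday() < 5]
    weekend = [v for t, v in zip(timestamps, values) if t.weekday() >= 5]
    return mean(weekday), mean(weekend)


# Synthetic daily values over two weeks, starting on a Monday.
start = datetime(2013, 1, 7)
timestamps = [start + timedelta(days=d) for d in range(14)]
values = [1.2 if t.weekday() >= 5 else 1.0 for t in timestamps]

weekday_mean, weekend_mean = weekday_weekend_means(timestamps, values)
print(f"weekday mean: {weekday_mean:.2f}, weekend mean: {weekend_mean:.2f}")
```

On the real data the same comparison could be run per aggregate size, since small groups of 10 consumers may behave less uniformly than groups of 1000.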


In [None]:
import graphviz
# Raw string so that Graphviz's \l (left-justified line break) reaches dot unchanged.
source = r'''
digraph G {
    rankdir=LR;
    
    subgraph cluster_data {
        rank=same;
        # style="rounded";
        node [shape=record,style=filled];
        data [
            label="<f0> Better data via cleaning \l |<f1> Better data via feature engineering \l |<f2> Better data via outlier pruning \l 
            |<f3> More data sources/features \l |<f4> Statistical Validation \l |<f5> Auto Feature Engineering \l"; 
            fillcolor=white;
        ];
        color=white;
        fontcolor=red;
        label="Data";
    };
    
    subgraph cluster_data_product {
        rank=same;
        node [shape=record,style=filled];
        product_data [
            label="<f0> Data Governance via Github for public / small datasets \l |<f1> Features centralized in-repo \l |<f2> Dataset Versioning \l"; 
            fillcolor=white;
        ];
        color=white;
        fontcolor=blue;
        label="Product (Data)";
    }
    
    subgraph cluster_eda {
        rank=same;
        node [shape=record,style=filled];
        eda [
            label="<f0> Plotly to reporting \l |<f1> Subtrends (not currently modelled) \l"; 
            fillcolor=white;
        ];
        color=white;
        fontcolor=red;
        label="EDA";
    };
    
    subgraph cluster_eda_product {
        rank=same;
        node [shape=record,style=filled];
        product_eda [
            label="<f0> Streamlit / Dash \l"; 
            fillcolor=white;
        ];
        color=white;
        fontcolor=blue;
        label="Product (EDA)";
    };
    
    subgraph cluster_models {
        rank=same;
        node [shape=record,style=filled];
        models [
            label="<f0> Prophet parameter optimization \l |<f1> Model granularity \l |<f2> Time Series Cross-Validation \l |<f3> Overfit/Underfit metrics based on prod perf \l"; 
            fillcolor=white;
        ];
        color=white;
        fontcolor=red;
        label="Models";
    };
    
    subgraph cluster_models_product {
        rank=same;
        node [shape=record,style=filled];
        product_models [
            label="<f0> Models master page \l |<f1> Models taxonomy \l"; 
            fillcolor=white;
        ];
        color=white;
        fontcolor=blue;
        label="Product (Models)";
    };
    
    subgraph cluster_mlops {
        rank=same;
        node [shape=record,style=filled];
        mlops [
            label="<f0> Model Serialization \l |<f1> Model Versioning \l |<f2> Automation / Github Actions \l"; 
            fillcolor=white;
        ];
        color=white;
        fontcolor=red;
        label="MLOps";
    };
    
    subgraph cluster_mlops_product {
        rank=same;
        node [shape=record,style=filled];
        product_mlops [
            label="<f0> Release docs \l |<f1> Model rankings \l |<f2> Model KPIs (business) \l"; 
            fillcolor=white;
        ];
        color=white;
        fontcolor=blue;
        label="Product (MLOps)"; 
    };
    
    data -> eda -> models -> mlops;
    product_data -> product_eda -> product_models -> product_mlops;
    
}
'''
display(graphviz.Source(source))

---

# 1. Imports
<span id="toc-imports"></span>

In [None]:
%%time
!pip install pycaret -q
!pip install prophet -q
!conda install -y graphviz pygraphviz -q

## 1.A. Import deps
<span id="toc-deps"></span>

In [None]:
# BASE ------------------------------------
from datetime import datetime as dt
nb_start = dt.now()

# Be mindful when you have this activated.
# import warnings
# warnings.filterwarnings('ignore')

import json
from pathlib import Path

from time import sleep

# Display libs
from IPython.display import display, HTML

from tqdm import tqdm, tqdm_notebook
tqdm.pandas()

SEED = 24

In [None]:
%%time

# ETL ------------------------------------
import numpy as np
import pandas as pd

# VIZ ------------------------------------
import matplotlib.cm as cm
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

import plotly.express as px
import plotly.io as pio
from plotly.tools import mpl_to_plotly

In [None]:
%%time

# Modelling ------------------------------------
from statsmodels.tsa.seasonal import seasonal_decompose


from pycaret.anomaly import *

import scipy.stats as stats
from scipy.signal import savgol_filter

from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly, plot_yearly, add_changepoints_to_plot

## 1.B. Inserts for Jupyter
<span id="toc-jupins"></span>

In [None]:
# class RenderJSON(object):
# https://github.com/xR86/core/blob/master/template_notebook.ipynb

In [None]:
%%javascript
/*Increase timeout to load properly*/
var rto = 120;
console.log('[Custom]: Increase require() timeout to', rto, 'seconds.');
window.requirejs.config({waitSeconds: rto});

In [None]:
%%html

<style>
    /* font for TODO */
    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');
    
    .hl {
        padding: 0.25rem 0.3rem;
        border-radius: 5px;
    }
    /* used: https://www.color-hex.com/color-palette/87453 */
    .hl.hl-yellow  { background-color: rgba(204,246,43,0.5); /*#fdef41;*/ }
    .hl.hl-orange  { background-color: rgba(255,150,42,0.5); }
    .hl.hl-magenta { background-color: rgba(244,73,211,0.5); }
    .hl.hl-blue    { background-color: rgba(80,127,255,0.5); }
    .hl.hl-violet  { background-color: rgba(149,47,255,0.5); }
                    
    kbd.cr {
        padding: 2px 3px;
        background-color: red;
        color: #FFF;
        border-radius: 5px;
    }

    kbd.xmltag {
        background-color: #ff8c8c;
        color: #FFF;
    }
    kbd.xmltag.xmltag--subnode {
        background-color: #9f8cff;
        color: #FFF;
    }
    kbd.xmltag.xmltag--subsubnode {
        background-color: #de8cff;
        color: #FFF;
    }
</style>

<!-- ========================================== -->
<h3 style="margin-top:1rem; margin-bottom:2rem"> Examples: </h3>
    
<div>Highlighted text in:
    <span class="hl hl-yellow">yellow</span>,
    <span class="hl hl-orange">orange</span>,
    <span class="hl hl-magenta">magenta</span>,
    <span class="hl hl-blue">blue</span>,
    <span class="hl hl-violet">violet</span>,
</div>

<br/>

<br/><br/>

Tags: <kbd class="cr">CR</kbd> (CR for Camera-Ready, graphs/sections that are important)

## 1.C. Import Data
<span id="toc-data"></span>

Files:
+ `Data0.csv` - training sample
+ `Data1.csv` - test sample
+ `sampleSubmission.csv` - sample submission in the correct format

Data fields:
+ `Date` - date in Y-M-D HH:MM:SS format
+ `X1`, ..., `X10` - power consumption of size 10 units
+ `X11`, ..., `X20` - power consumption of size 100 units
+ `X21`, ..., `X30` - power consumption of size 1000 units
+ `Temperature` - temperature observed in London

In [None]:
!ls ~/.kaggle

In [None]:
%%bash
if [ ! -f ~/.kaggle/kaggle.json ]; then
    # Credentials come from the environment; never hard-code an API key in the notebook.
    mkdir -p ~/.kaggle
    echo "{\"username\":\"${KAGGLE_USERNAME}\",\"key\":\"${KAGGLE_KEY}\"}" > ~/.kaggle/kaggle.json
    chmod 600 ~/.kaggle/kaggle.json
else
    echo 'kaggle.json already exists'
fi

In [None]:
!rm -rf data/raw/

In [None]:
%%bash
FORCE='False'
if [ ! -f data/raw/Data0.csv ] || [ $FORCE == 'True' ];then
    kaggle competitions download -c multi-level-electricity-load-forecasting
    # 1.11 GB download
    # kaggle datasets download -d jeanmidev/smart-meters-in-london
    
    mkdir -p data/raw/
    unzip multi-level-electricity-load-forecasting.zip -d data/raw/
    # unzip smart-meters-in-london.zip -d data/raw/
    # or
    cp ../input/smart-meters-in-london/weather_* data/raw/
    
    rm -f *.zip
else
    echo 'Data exists'
fi

In [None]:
%%bash
SOURCE="data/raw/"
ls -l $SOURCE

In [None]:
%%time
df_train = pd.read_csv('data/raw/Data0.csv', parse_dates=['Date'])
df_train.info(verbose=False)
print()

df_test = pd.read_csv('data/raw/Data1.csv', parse_dates=['Date'])
df_test.info(verbose=False)
print()

df_weather = pd.read_csv('data/raw/weather_hourly_darksky.csv', parse_dates=['time'])
df_weather.info(verbose=False)
print()

In [None]:
train_split = round(df_train.shape[0] * 100 / (df_train.shape[0] + df_test.shape[0]),2)
test_split = round(100 - train_split, 2)  # avoid float noise such as 13.709999999999994

print(f'Train-Test split: {train_split}% / {test_split}%')

**Note**: Strange split, would've expected 60-80% train.
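
The split follows from the calendar rather than from a chosen ratio. The sketch below works through the arithmetic, assuming both files are complete half-hourly series with no missing rows and that the test period covers February 27 in full; the helper name `count_steps` is illustrative.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=30)


def count_steps(start, end_exclusive, step=STEP):
    """Number of half-hour timestamps in [start, end_exclusive)."""
    return int((end_exclusive - start) / step)


# Training period: all of 2013. Test period: January 1 to February 27, 2014.
train_rows = count_steps(datetime(2013, 1, 1), datetime(2014, 1, 1))
test_rows = count_steps(datetime(2014, 1, 1), datetime(2014, 2, 28))
train_share = round(100 * train_rows / (train_rows + test_rows), 2)

print(f"train rows: {train_rows}, test rows: {test_rows}")
print(f"expected split: {train_share}% / {round(100 - train_share, 2)}%")
```

Under these assumptions a full year is set against 58 days, so a training share well above 80% is expected; comparing these counts with `df_train.shape[0]` and `df_test.shape[0]` also reveals any missing timestamps.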

In [None]:
df_train.head()

In [None]:
df_train.tail()

[⬆️ Back to top](#toc)

---

# 2. EDA
<span id="toc-eda"></span>

Power consumption data at half-hour time steps for the year 2013. More precisely, this is the average consumption of groups of consumers of sizes 10, 100 and 1000.
For each size of group you have 10 consumption series:
- size 10 aggregates: X1, X2,…, X10
- size 100 aggregates: X11, X12,…, X20
- aggregates of size 1000: X21, X22,…, X30

The objective is to forecast the consumption of these groups of customers from January 1 to February 27, 2014.

In [None]:
# set_index returns a new frame, which avoids SettingWithCopyWarning on the column slices
df_train_10   = df_train[['Date'] + [f'X{i}' for i in range(1, 11)]  + ['Temperature']].set_index('Date')
df_train_100  = df_train[['Date'] + [f'X{i}' for i in range(11, 21)] + ['Temperature']].set_index('Date')
df_train_1000 = df_train[['Date'] + [f'X{i}' for i in range(21, 31)] + ['Temperature']].set_index('Date')

df_train_10.shape[0]

In [None]:
df_train_10.head()

## 2.A. Energy Consumption per X1 (matplotlib)
<span id="toc-eda-mpl"></span>

In [None]:
%%time
plt.figure(figsize=(20, 7))

x = df_train_10.index
y1 = df_train_10.X1
time_intervals = {
    'Last Day Consumption'  : [x[-1] - pd.DateOffset(days=1)  ,x[-1]],
    'Last Week Consumption' : [x[-1] - pd.DateOffset(weeks=1) ,x[-1]],
    'Last Month Consumption': [x[-1] - pd.DateOffset(months=1),x[-1]],
    'Last Year Consumption' : [x[-1] - pd.DateOffset(years=1) ,x[-1]],
}

# clipping values within range
time_intervals = {
    name: [
        interval[0] if interval[0] in x else x[0],
        interval[1] if interval[1] in x else x[-1]
    ] for name, interval in time_intervals.items()
}

# timestamps to iloc-type index
time_intervals_ind = {name: [x.get_loc(interval[0]), x.get_loc(interval[1])] for name, interval in time_intervals.items()}
# print(y1.between(interval[0], interval[1]))

x = list(x)
y1 = list(y1)

#for i, interval in zip(range(len(time_intervals_ind)), time_intervals_ind):
i = 0
for name, interval in time_intervals_ind.items():
    ind_sta = interval[0]
    ind_end = interval[1]
    ax = plt.subplot(len(time_intervals_ind), 1, i + 1)
    
    ax.plot(x[ind_sta:ind_end], y1[ind_sta:ind_end], c='b')
    # ax.title(name, fontweight='bold')
    ax.title.set_text(r'$\bf{%s}$' % name)
    
    # plt.xlabel('')
    i += 1

plt.subplots_adjust(wspace = 0.35, hspace = len(time_intervals_ind) * 0.4)
# plt.title('Energy consumption per X1 (aggregate of 10)')
plt.show()

Note: By virtue of the type of dataset and visual inspection, the time series seems stationary. This also has to be confirmed statistically.

[⬆️ Back to top](#toc)

### 2.A.a. Checking Stationarity
<span id="toc-eda-stationarity"></span>

In [None]:
from statsmodels.tsa.stattools import adfuller

# Run the test on the consumption values, not on the Date index.
x = df_train_10['X1'].dropna().values

result = adfuller(x)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

**Note:** A p-value above the 0.05 mark means the null hypothesis of the time series having a unit root can't be rejected, so there is a good possibility that the data is non-stationary. This is the same verdict as an [ADF statistic](https://en.wikipedia.org/wiki/Augmented_Dickey%E2%80%93Fuller_test) that is larger (less negative) than the critical values.

A non-stationary reading might also be an experiment setup error, due to improper usage of [ADF parameters](https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.adfuller.html): the test has to receive the `X1` values rather than the `Date` index.

In [None]:
from statsmodels.tsa.stattools import kpss


def kpss_test(timeseries):
    print("Results of KPSS Test:")
    kpsstest = kpss(timeseries, regression="c", nlags="auto")
    kpss_output = pd.Series(
        kpsstest[0:3], index=["Test Statistic", "p-value", "Lags Used"]
    )
    for key, value in kpsstest[3].items():
        kpss_output["Critical Value (%s)" % key] = value
    print(kpss_output)

kpss_test(x)

**Note**: A p-value below the 0.05 significance level is good evidence for rejecting the null hypothesis of stationarity in favor of the alternative. In that case, the [KPSS test](https://en.wikipedia.org/wiki/KPSS_test) concludes that the time series is non-stationary.

**Note**: When both tests conclude that the time series is non-stationary, **detrending is recommended**, either by differencing or trend removal (model fitting). 
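
Differencing is the lighter of the two detrending options. The sketch below shows first-order and seasonal differencing on synthetic sequences; the function name is an assumption, and for the half-hourly series here a daily seasonal lag would be 48 steps.

```python
def difference(values, lag=1):
    """Return values[t] - values[t - lag] for every t >= lag."""
    if lag < 1 or lag >= len(values):
        raise ValueError("lag must be between 1 and len(values) - 1")
    return [values[t] - values[t - lag] for t in range(lag, len(values))]


# A pure linear trend becomes constant after one difference.
trend = [2 * t for t in range(10)]
first_diff = difference(trend)

# A period-4 pattern on top of a drift becomes constant after a lag-4 difference.
pattern = [0, 1, 3, 1]
seasonal = [pattern[t % 4] + t for t in range(12)]
seasonal_diff = difference(seasonal, lag=4)
```

Forecasts made on a differenced series have to be integrated back (cumulatively summed from the last observed values) before they can be scored with RMSE on the original scale.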

[⬆️ Back to top](#toc)

## 2.B. Energy Consumption per X1 (plotly) 
<span id="toc-eda-ply"></span>

In [None]:
%%time
# x = df_train_10.Date
x = list(df_train_10.index)
y1 = list(df_train_10.X1)

trace = [
    go.Scattergl(
        x = x,
        y = y1,
        name = 'X1',
        mode = 'lines+markers',
        # hoverinfo = 'text',
        # text = ['x: %s<br>y: %s<br>cluster %i' % (x_i, y_i, c_i) for x_i, y_i, c_i in zip(x, y, c)]
    )
]

layout = go.Layout(
    title='Energy consumption per X1 (aggregate of 10)',
    hovermode='closest',
    
    xaxis=dict(
        #autorange = False,
        #fixedrange= True,
        #constrain = 'range',
        range=[x[0] - pd.DateOffset(days=1),x[-1] + pd.DateOffset(hours=3)],
        type='date',
        
        rangeslider=dict(
            visible=True,
        ),
        rangeselector=dict(
            buttons=list([
                dict(count=1, label='1D', step='day', stepmode='backward'),
                dict(count=7, label='7D', step='day', stepmode='backward'),
                dict(count=1, label='1M', step='month', stepmode='backward'),
                dict(count=6, label='6M', step='month', stepmode='backward'),
                dict(count=1, label='1Y', step='year', stepmode='backward'),
                dict(count=1, label='YTD', step='year', stepmode='todate'),
                dict(step='all')
            ])
        ),
        # range=[x[-1000],x[-1]]
        # range=[x[-1] - pd.DateOffset(months=1),x[-1]]
    ),
    yaxis=dict(
        # title='Energy consumed (kWh) ?'
    )
)

fig = go.Figure(data=trace, layout=layout)
iplot(fig)

[⬆️ Back to top](#toc)

## 2.C. Energy Consumption vs Temperature (plotly) 
<span id="toc-eda-link"></span>

In [None]:
df_train_10.describe()

In [None]:
%%time
df = df_train_10['Temperature']
df = df.resample('W').mean()

x = list(df.index)
y = list(df.values)

trace = [
    go.Scattergl(
        x = x,
        y = y,
        name = 'Temperature',
        mode = 'lines+markers',
        
        marker=dict(
            size=16,
            cmax=40,
            cmin=-10,
            color=y,
            colorbar=dict(
                title="Temperature"
            ),
            colorscale="Spectral",
            reversescale=True
        ),
    ),
]

layout = go.Layout(
    title='Temperature in London',
    hovermode='closest',
    
    xaxis=dict(
        range=[x[0] - pd.DateOffset(days=1),x[-1] + pd.DateOffset(hours=3)],
        type='date',
        
        rangeslider=dict(
            visible=True,
        ),
        rangeselector=dict(
            buttons=list([
                dict(count=1, label='1D', step='day', stepmode='backward'),
                dict(count=7, label='7D', step='day', stepmode='backward'),
                dict(count=1, label='1M', step='month', stepmode='backward'),
                dict(count=6, label='6M', step='month', stepmode='backward'),
                dict(count=1, label='1Y', step='year', stepmode='backward'),
                dict(count=1, label='YTD', step='year', stepmode='todate'),
                dict(step='all')
            ])
        ),
    ),
    yaxis=dict(
        # title='Energy consumed (kWh) ?'
        # type='log'
    )
)

fig = go.Figure(data=trace, layout=layout)
iplot(fig)

**Note:** Given the mild climate of London, seasonality due to the cold season is the more relevant one (summer temperatures are mild and shouldn't create any trend increase in energy consumption due to cooling solutions).

In [None]:
%%time
# df = df_train_10.resample('3T').sum()
df = df_train_10[['X1', 'Temperature']]
agg_rules = { 'X1': 'sum', 'Temperature': 'mean'}
df = df.resample('W').agg(agg_rules)

# x = df_train_10.Date
x = list(df.index)
y1 = list(df.X1)
y2 = list(df.Temperature * 5)

trace = [
    go.Scattergl(
        x = x,
        y = y1,
        name = 'X1',
        mode = 'lines+markers',
        # hoverinfo = 'text',
        # text = ['x: %s<br>y: %s<br>cluster %i' % (x_i, y_i, c_i) for x_i, y_i, c_i in zip(x, y, c)]
    ),
    go.Scattergl(
        x = x,
        y = y2,
        name = 'Temperature (scaled x5)',
        mode = 'lines+markers',
    ),
]

layout = go.Layout(
    title='Possible link between temperature and energy consumption',
    hovermode='closest',
    
    xaxis=dict(
        #autorange = False,
        #fixedrange= True,
        #constrain = 'range',
        range=[x[0] - pd.DateOffset(days=1),x[-1] + pd.DateOffset(hours=3)],
        type='date',
        
        rangeslider=dict(
            visible=True,
        ),
        rangeselector=dict(
            buttons=list([
                dict(count=1, label='1D', step='day', stepmode='backward'),
                dict(count=7, label='7D', step='day', stepmode='backward'),
                dict(count=1, label='1M', step='month', stepmode='backward'),
                dict(count=6, label='6M', step='month', stepmode='backward'),
                dict(count=1, label='1Y', step='year', stepmode='backward'),
                dict(count=1, label='YTD', step='year', stepmode='todate'),
                dict(step='all')
            ])
        ),
        # range=[x[-1000],x[-1]]
        # range=[x[-1] - pd.DateOffset(months=1),x[-1]]
    ),
    yaxis=dict(
        # title='Energy consumed (kWh) ?'
        # type='log'
    )
)

fig = go.Figure(data=trace, layout=layout)
iplot(fig)

## 2.D. Checking rolling correlation
<span id="toc-eda-corr-roll"></span>

In [None]:
%%time
df = df_train_10[['X1', 'Temperature']]
agg_rules = { 'X1': 'sum', 'Temperature': 'mean'}
df = df.resample('W').agg(agg_rules)

overall_pearson_r = df.corr().iloc[0,1]
print(f"Pandas computed Pearson r: {overall_pearson_r}")

r, p = stats.pearsonr(df.dropna()['X1'], df.dropna()['Temperature'])
print(f"Scipy computed Pearson r: {r} and p-value: {p}")

f,ax = plt.subplots(figsize=(7,3))
df.rolling(window=30,center=True).median().plot(ax=ax)
ax.set(xlabel='Time',ylabel='Rolling median')  # the plot shows the rolling medians, not r
ax.set(title=f"Overall Pearson r = {np.round(overall_pearson_r,2)}");

In [None]:
r_window_size = 10
df_interpolated = df.interpolate()

rolling_r = df_interpolated['X1'].rolling(window=r_window_size, center=True).corr(df_interpolated['Temperature'])

f,ax=plt.subplots(2,1,figsize=(14,6),sharex=True)
df.rolling(window=30,center=True).median().plot(ax=ax[0])
ax[0].set(xlabel='Time',ylabel='Rolling median')

rolling_r.plot(ax=ax[1])
ax[1].set(xlabel='Time',ylabel='Pearson r')
plt.suptitle('Rolling window correlation')

## 2.E. Checking global correlation (weekly resampling)
<span id="toc-eda-corr-global"></span>

In [None]:
df = df_train_10[[f'X{i}' for i in range(1, 11)] + ['Temperature']]
agg_rules = { f'X{i}': 'sum' for i in range(1,11)}
agg_rules['Temperature'] = 'mean'
df = df.resample('W').agg(agg_rules)

# method : {‘pearson’, ‘kendall’, ‘spearman’}
corr = df.corr(method="pearson")

# np.bool was removed in NumPy 1.24; np.tril keeps the lower triangle (below the diagonal)
bool_lower_matrix = np.tril(np.ones(corr.shape), k=-1).astype(bool)
corr = corr.where(bool_lower_matrix)
display(corr)

In [None]:
data = corr.reset_index(drop=True).values.tolist()
fig = px.imshow(
    data,
    labels=dict(
        x='Feature', y='Feature', color='Pearson Correlation'
    ),
    x=[f'X{i}' for i in range(1, 11)] + ['Temp'],
    y=[f'X{i}' for i in range(1, 11)] + ['Temp']
)
fig.update_xaxes(side='top')
fig.update_layout(
    title='Aggregates of size 10 correlations',
    autosize=False,
    width=1300,
    height=600
)
fig.show()

In [None]:
x1 = corr.drop(['Temperature'], axis=1).drop(['Temperature']).values.reshape(-1)
x2 = corr.iloc[-1].values.reshape(-1)
x1 = x1[~np.isnan(x1)]
x2 = x2[~np.isnan(x2)]
# print(np.sort(np.concatenate((x1, x2))))

data = [
    go.Histogram(x=x1, nbinsx=15, name='Pairwise X[1-10] Pearson correlations'),
    go.Histogram(x=x2, nbinsx=15, name='Temperature vs X[1-10] correlations')
]

layout = go.Layout(title='Histogram of correlation scores')

fig = go.Figure(data=data, layout=layout)
iplot(fig)

Note:
+ Pairwise correlations between aggregates of 10 (`[X1-X10]`) indicate a strong positive correlation. This observation will be considered in the ensemble model.
+ Temperature correlation with each of the aggregates indicates a strong negative correlation. Informally, this partially validates the assumption that higher consumption of energy is consistent with lower temperatures.
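
To make the two quantities used in this section concrete, here is a from-scratch sketch of the overall and rolling Pearson coefficient. Function names and the toy values are assumptions; the numbers are invented so that consumption falls as temperature rises.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rolling_pearson(xs, ys, window):
    """Pearson r over every full window of consecutive observations."""
    return [
        pearson_r(xs[i:i + window], ys[i:i + window])
        for i in range(len(xs) - window + 1)
    ]


# Toy weekly values: higher temperature, lower consumption.
temperature = [2.0, 4.0, 7.0, 11.0, 15.0, 18.0, 16.0, 12.0]
consumption = [9.5, 9.0, 8.1, 7.0, 6.2, 5.5, 6.0, 6.9]

overall_r = pearson_r(temperature, consumption)
windowed_r = rolling_pearson(temperature, consumption, window=4)
```

The overall coefficient summarizes the whole year in one number, while the rolling version shows whether the relationship weakens in particular periods, which is what the assumption about mild summers would predict.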

[⬆️ Back to top](#toc)

---

# 3. Modelling
<span id="toc-modelling"></span>

In [None]:
df = df_train_10[['X1']].reset_index()
df = df.rename({'Date': 'ds', 'X1': 'y'}, axis=1)
#df = df.loc[:10000]

In [None]:
df.head()

## 3.A. Outlier Detection
<span id="toc-outlier"></span>

In [None]:
# plt.rc('figure',figsize=(12,8))
# plt.rc('font',size=15)
# result = seasonal_decompose(df.set_index('ds').y,model='additive')
# fig = result.plot()

In [None]:
s = setup(df, session_id = 123)

In [None]:
models()

In [None]:
iforest = create_model('iforest', fraction = 0.1)
iforest_results = assign_model(iforest)
iforest_results.head()

In [None]:
iforest_results[iforest_results['Anomaly'] == 1].head()

In [None]:
print(iforest_results.shape[0])
print(iforest_results[iforest_results['Anomaly'] == 1].shape[0])

In [None]:
x = iforest_results[iforest_results['Anomaly'] == 1].Anomaly_Score

data = [
    go.Histogram(x=x, nbinsx=50, name=''),
]

layout = go.Layout(title='Histogram of anomaly scores (IsolationForest)')

fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [None]:
%%time
x1 = list(df.ds)
y1 = list(df.y)
x2 = list(iforest_results[iforest_results['Anomaly'] == 1].ds)
y2 = list(iforest_results[iforest_results['Anomaly'] == 1].y)

trace = [
    go.Scattergl(
        x = x1,
        y = y1,
        name = 'X1',
        mode = 'lines+markers',
        # hoverinfo = 'text',
        # text = ['x: %s<br>y: %s<br>cluster %i' % (x_i, y_i, c_i) for x_i, y_i, c_i in zip(x, y, c)]
    ),
    go.Scattergl(
        x = x2,
        y = y2,
        name = 'Anomaly',
        mode = 'markers',
        marker=dict(color='red',size=10)
    )
]

layout = go.Layout(
    title='Anomaly Detection per X1 (aggregate of 10)',
    hovermode='closest',
    
    xaxis=dict(
        #range=[x1[0] - pd.DateOffset(days=1),x1[-1] + pd.DateOffset(hours=3)],
        type='date',
        
        rangeslider=dict(
            visible=True,
        ),
        rangeselector=dict(
            buttons=list([
                dict(count=1, label='1D', step='day', stepmode='backward'),
                dict(count=7, label='7D', step='day', stepmode='backward'),
                dict(count=1, label='1M', step='month', stepmode='backward'),
                dict(count=6, label='6M', step='month', stepmode='backward'),
                dict(count=1, label='1Y', step='year', stepmode='backward'),
                dict(count=1, label='YTD', step='year', stepmode='todate'),
                dict(step='all')
            ])
        ),
    ),
)

fig = go.Figure(data=trace, layout=layout)
iplot(fig)

[⬆️ Back to top](#toc)

## 3.B. Signal Smoothing
<span id="toc-signal"></span>

In [None]:
df = df_train_10[['X1']].reset_index()
df = df.rename({'Date': 'ds', 'X1': 'y'}, axis=1)
df = df.loc[:10000]

df.tail()

In [None]:
%%time
yhat = savgol_filter(df.y, 51, 3) # window size 51, polynomial order 3

In [None]:
%%time
x = list(df.ds)
y1 = list(df.y)

trace = [
    go.Scatter(
        x = x,
        y = y1,
        name = 'X1',
        mode = 'lines+markers',
    ),
    go.Scatter(
        x = x,
        y = yhat,
        name = 'Savgol Filter(X1,51,3)',
        mode = 'lines+markers',
    )
]

layout = go.Layout(
    title='Energy consumption per X1 (aggregate of 10)',
    hovermode='closest',
    
    xaxis=dict(
        range=[x[0] - pd.DateOffset(days=1),x[-1] + pd.DateOffset(hours=3)],
        type='date',
        
        rangeslider=dict(
            visible=True,
        ),
        rangeselector=dict(
            buttons=list([
                dict(count=1, label='1D', step='day', stepmode='backward'),
                dict(count=7, label='7D', step='day', stepmode='backward'),
                dict(count=1, label='1M', step='month', stepmode='backward'),
                dict(count=6, label='6M', step='month', stepmode='backward'),
                dict(count=1, label='1Y', step='year', stepmode='backward'),
                dict(count=1, label='YTD', step='year', stepmode='todate'),
                dict(step='all')
            ])
        ),
    ),
    yaxis=dict(
        # title='Energy consumed (kWh) ?'
    )
)

fig = go.Figure(data=trace, layout=layout)
iplot(fig)

[⬆️ Back to top](#toc)

## 3.C. Prophet Model
<span id="toc-prophet"></span>

In [None]:
%%time
# default: changepoint_prior_scale=0.05; a larger value makes the trend more flexible
m = Prophet(changepoint_prior_scale=5)
m.add_country_holidays(country_name='UK')
m.fit(df)

In [None]:
m.train_holiday_names

In [None]:
%%time
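# The data is half-hourly, so the 50-day horizon is expressed in 30-minute steps.
future = m.make_future_dataframe(periods=50 * 48, freq='30min')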
print(future.tail())

In [None]:
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

In [None]:
fig1 = m.plot(forecast[:2000])

In [None]:
fig1 = m.plot(forecast)

In [None]:
fig1 = m.plot(forecast)
a = add_changepoints_to_plot(fig1.gca(), m, forecast)

In [None]:
fig2 = m.plot_components(forecast)

In [None]:
plot_plotly(m, forecast)

In [None]:
# plot_components_plotly(m, forecast)

Note:
+ 

[⬆️ Back to top](#toc)

---

In [None]:
nb_end = dt.now()
print('Time elapsed: %s' % (nb_end - nb_start))

In [None]:
'Time elapsed: %.2f minutes' % (
    (nb_end - nb_start).total_seconds() / 60
)

---

# 4. Bibliography
<span id="toc-bibliography"></span>

Resources:
+ [Statsmodels - Stationarity and detrending (ADF/KPSS)](https://www.statsmodels.org/dev/examples/notebooks/generated/stationarity_detrending_adf_kpss.html)
+ ...

[⬆️ Back to top](#toc)

---