# Task 1:  Day-ahead energy forecasting


## <u>General Setting</u>

Electricity, till date, cannot be stored in large amounts, therefore supply and demand need to always be balanced by the energy providers. The accurate short term forecast of energy demand is critical for the operations and control of productive capacity, with significant consequences. However, as with any forecast, there is typically some uncertainty involved. This uncertainty is especially heightened in the case of energy forecasting today, where alternate sources of energy such as solar panels are ubiquitous. Also, with the transition to e-mobility additional non-traditional consumer patterns contribute to the forecasting uncertainty. Therefore, understanding electricty consumption behaviour either for individual households or for regional groups of households becomes key for the future electricity market.

## <u>Task</u>

This data science challenge task entails estimating day-ahead-forecasts for upto a week, for 61 groups of dwellings in the UK energy market, based on geographical similarity. The challenge has two sub-tasks- the first where only one value for the single day ahead is required to be estimated, in other words-the aggregated day-ahead demand. In the second sub task, the demand for each hour in the day-ahead is to be estimated (24 per day).

You are provided with historical half-hourly energy readings for the 61 anonymised groups between 1 January, 2017 and 04 September 2019. A week is sliced off from each 45 day window and reserved for testing purposes. You are required to estimate these missing periods in the two frequencies. 

Every group consists of a different number of dwellings, which energy consumption profile has been summed up for two reasons: data privacy and forecasting accuracy. 

All data is provided in csv format and described below. We also provide code snippets for loading the data and creating submission files.


## <u> Data </u>

`train.csv`: contains the data values in KWh at a half hourly frequency for the 61 different groups.

<b>Column Description</b>:

`pseudo_id`: Anonymised IDs for dwelling groups (string).

$a_{ij}$: Energy consumption for household $i$ between timestep $j$ and $j+1$ (float64). </br>
For e.g.: </br>
`2017-01-01 00:00:00` indicates electricity consumption in KWh between 2017-01-01 00:00:00 and 2017-01-01 00:30:00. 

### Data snapshot

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score

In [2]:
### Load csv in pandas, index_col = 'pseduo_id'
train_df = pd.read_csv('../clustered/train.csv', index_col = 'pseudo_id')

In [3]:
### Print first few lines
train_df.head()

Unnamed: 0_level_0,2017-01-01 00:00:00,2017-01-01 00:30:00,2017-01-01 01:00:00,2017-01-01 01:30:00,2017-01-01 02:00:00,2017-01-01 02:30:00,2017-01-01 03:00:00,2017-01-01 03:30:00,2017-01-01 04:00:00,2017-01-01 04:30:00,...,2019-08-28 19:00:00,2019-08-28 19:30:00,2019-08-28 20:00:00,2019-08-28 20:30:00,2019-08-28 21:00:00,2019-08-28 21:30:00,2019-08-28 22:00:00,2019-08-28 22:30:00,2019-08-28 23:00:00,2019-08-28 23:30:00
pseudo_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0x16cb02173ebf3059efdc97fd1819f14a2,45.023,39.985,36.5695,34.748,35.972,38.439,36.591,36.3155,32.6605,0.142,...,24.288,23.994,26.1995,25.027,23.0665,26.093,23.4295,25.4715,26.246,22.602
0x1c9d08cd16fce04790ef900695861e786,2.931,1.641,2.26,2.273,2.651,3.137,2.532,3.142,2.528,0.0,...,2.57,1.446,1.523,1.563,2.588,2.19,1.486,2.527,2.288,1.794
0x1612e4cbe3b1b85c3dbcaeaa504ee8424,11.014,12.6525,10.824,13.7485,12.383,12.342,13.413,11.484,11.5105,0.0455,...,6.3565,5.766,5.4955,5.0885,6.814,7.492,5.7705,6.824,6.072,6.7205
0x20158d36236a640cf0524dba149459169,55.813,49.04,49.095,41.133,45.66,48.477,50.539,45.737,42.68,0.0,...,32.646,30.439,30.247,31.266,34.339,33.076,33.108,33.726,30.009,34.84
0xc305005dcb1ed6128d816954c5ab9e7e,26.925,28.118,25.6,28.091,26.53,23.858,26.556,27.714,23.174,0.0,...,13.398,13.28,13.734,13.606,14.7,16.29,15.124,15.365,14.36,13.935


## <u>Evaluation</u>


The evaluation metric for this competition is Mean Absolute Percentage Error, computed as:

$\text{MAPE} = \frac{1}{T*G} \sum_{g} (\sum_{t} |\frac{a_{t,g}-f_{t,g}}{a_{t,g}}|)  \times 100$ where,

$T$: number of timesteps (days for sub-task 1 and hours for sub-task 2)</br>
$G$: number of groups (days for sub-task 1 and hours for sub-task 2)</br>
$a_{i,j}$: actual value for time point t of group g </br>
$f_{i,j}$: forecasted value for time point t of group g</br>

We will weight the the MAPE for both sub-tasks equally. 

The script below may be used to calculate the final score. MAPE averaged over all the timesteps over all the days will be the final score.

In [12]:
def evaluate(y, yhat, perc=True):
    y = y.drop('pseudo_id', axis = 1).values
    yhat = yhat.drop('pseudo_id', axis = 1).values
    n = len(yhat.index) if type(yhat) == pd.Series else len(yhat)
    for i in range(n):
        error = []
        for a, f in zip(y[i], yhat[i]):
            # avoid division by 0
            if a > 0:
                error.append(np.abs((a - f)/(a)))
        mape = np.mean(np.array(error))
    return mape * 100. if perc else mape

### Submission of Results

For each `pseudo_id` given in the train set, you are required to predict the energy consumption for each missing day in the follwing two formats:

1. Subtask 1: Aggregated usage values for each day, i.e., 1 value per day, and 
2. Subtask 2: Hourly usage values for each day, i.e., 24 values per day.

<b> Example Submission </b>

Subtask 1:


<code> ```pseudo_id, 2017-02-08 00:00:00 , 2017-02-09 00:00:00, 2017-02-10 00:00:00, ........, 2019-02-28 00:00:00
0xd05, 23.4, 11.3, 23.2, ......., 32.4 
0xd06, 21.4, 21.3, 13,2, ......., 42.4```</code>



Subtask 2:


<code>```pseudo_id, 2017-02-08 00:00:00 , 2017-02-08 01:00:00, 2017-02-08 02:00:00, ........, 2019-02-28 23:00:00
0xd05, 3.4, 1.3, 2.2, ......., 3.4
0xd06, 1.4, 1.3, 3.2, ......., 1.4```</code>

In [13]:
submission_1 = pd.read_csv('../clustered/sample_submission_daily.csv')
submission_1_GT = pd.read_csv('../clustered/test_daily.csv')

<b> Example submission 1 file </b>


In [14]:
submission_1.head()

Unnamed: 0,pseudo_id,2017-02-08,2017-02-09,2017-02-10,2017-02-11,2017-02-12,2017-02-13,2017-02-14,2017-03-25,2017-03-26,...,2019-08-01,2019-08-02,2019-08-03,2019-08-29,2019-08-30,2019-08-31,2019-09-01,2019-09-02,2019-09-03,2019-09-04
0,0x16cb02173ebf3059efdc97fd1819f14a2,13.987208,29.553732,47.6192,143.206933,58.411633,43.56297,64.988182,26.708951,20.661111,...,29.835391,36.48033,23.979775,143.74715,18.88134,138.550731,18.482187,43.408845,51.734609,14.747854
1,0x1c9d08cd16fce04790ef900695861e786,1.08491,2.081561,3.002167,8.366533,3.4908,2.360909,3.558864,1.423885,1.132635,...,2.493652,3.29322,2.189275,12.4148,1.488588,10.490077,1.468819,3.977207,3.92587,1.214549
2,0x1612e4cbe3b1b85c3dbcaeaa504ee8424,4.007826,7.90778,13.61035,43.0345,19.164833,15.053394,20.003545,11.056025,7.523532,...,6.00337,7.9228,6.300308,33.49185,4.994289,34.775462,4.569518,10.389448,11.236304,3.112341
3,0x20158d36236a640cf0524dba149459169,35.230236,67.194805,98.574933,264.427533,113.1225,72.965606,100.897636,36.520836,24.231635,...,37.844283,44.18342,27.868945,185.029,24.310608,181.893923,23.39559,56.930517,65.395217,18.410768
4,0xc305005dcb1ed6128d816954c5ab9e7e,6.789202,14.484,22.991033,66.257067,27.345,20.335576,31.606682,13.573951,10.192111,...,21.208717,24.36026,17.34844,99.9749,12.834557,102.790385,11.396711,24.625069,28.853609,8.209098


<b> Compute MAPE on Sub-task 1: </b>

In [15]:
evaluate(submission_1_GT, submission_1)

94.9937354787529

In [16]:
submission_2 = pd.read_csv('../clustered/sample_submission_hourly.csv')
submission_2_GT = pd.read_csv('../clustered/test_hourly.csv')

<b> Example submission 2 file </b>


In [20]:
submission_2.head()

Unnamed: 0,pseudo_id,2017-02-08 00:00:00,2017-02-08 01:00:00,2017-02-08 02:00:00,2017-02-08 03:00:00,2017-02-08 04:00:00,2017-02-08 05:00:00,2017-02-08 06:00:00,2017-02-08 07:00:00,2017-02-08 08:00:00,...,2019-09-04 14:00:00,2019-09-04 15:00:00,2019-09-04 16:00:00,2019-09-04 17:00:00,2019-09-04 18:00:00,2019-09-04 19:00:00,2019-09-04 20:00:00,2019-09-04 21:00:00,2019-09-04 22:00:00,2019-09-04 23:00:00
0,0x16cb02173ebf3059efdc97fd1819f14a2,1.617657,0.678305,0.606452,1.203138,0.59481,1.381811,0.975096,0.73,0.739507,...,1.773145,1.274789,0.915474,0.646434,0.985317,1.273444,0.545165,0.501576,2.264024,0.532512
1,0x1c9d08cd16fce04790ef900695861e786,0.125943,0.041286,0.049639,0.102936,0.039607,0.137216,0.083053,0.061795,0.06938,...,0.157032,0.093868,0.057105,0.046566,0.088583,0.096244,0.039468,0.042182,0.171524,0.052268
2,0x1612e4cbe3b1b85c3dbcaeaa504ee8424,0.453614,0.225649,0.188331,0.357681,0.171113,0.507932,0.291158,0.21459,0.193479,...,0.37929,0.318158,0.199579,0.12903,0.176175,0.246567,0.124622,0.124051,0.477667,0.155732
3,0x20158d36236a640cf0524dba149459169,4.409086,1.918156,1.767253,3.284383,1.821131,4.220216,2.831474,1.949667,1.792296,...,2.157065,1.586842,1.105526,0.792771,1.146467,1.532578,0.674117,0.646576,3.095619,0.739439
4,0xc305005dcb1ed6128d816954c5ab9e7e,0.772486,0.319701,0.311663,0.615936,0.323869,0.697946,0.513684,0.319782,0.372324,...,0.887677,0.719263,0.501447,0.339012,0.522633,0.695178,0.293479,0.290869,1.163095,0.319951


<b> Compute MAPE on Sub-task 2: </b>

In [21]:
evaluate(submission_2_GT, submission_2)

94.70371431190887

### The lower the final score, the better the forecast

### Challenge rules


1. The submission file must be a single text-file as csv in the format:
</i><b> N×[′pseudo_id′,timestamp_1(float), timestamp_2(float),...,timestamp_24(float) ] </i></b>, where pseudo_id is the predicted group id.
    
2. Participants are not permitted to use external data for system development. Any combination of feature engineering and modelling techniques given the data is permitted. Creating high nunmber of decision ensembles is highly encouraged.

