A note up front: *I was a bit surprised that only a few notebooks here on Kaggle dare to predict the M5 with uncertainty. Most of the notebooks simply translate the results of the accuracy competition.*

I just recently started writing in Python and tried applying a Neural Network to the problem. I am just a beginner in this field, therefore please be patient :)

# M5 Competition - Uncertainty - LSTM Neural Network

1. [Introduction and sources](#sources)


2. [Preparing to start](#prepare)
    * [Loading packages](#packages)
    * [Loading data](#data)
    * [Looking at the hierarchy](#hierarchy_ts)
    
    
3. [The submission format](#submission)
    * [Intro](#intro)
    * [Prediction intervals and quantiles](#PIs)
    * [Outlier](#outlier)
    * [Aggregation levels](#sub_aggregation_levels)
    
    
4. [Model Preparation](#Feature_Creation)
    * [Limited Features](#limfeat)
    * [More Features](#morefeat)
    * [Pricing Feature](#pricefeat)
    * [Feature Scaling](#featscale)
    * [Train and Test Data Creation](#traintest)
    
    
5. [LSTM Modeling](#Modeling)
    * [Loss Function](#lossfct)
    * [Running the Model](#runmodel)
    * [Creating the submission file](#submission)

The M5 competition ran from 2 March to 30 June 2020. The goal of the competition is to forecast sales for Walmart stores. For that, we use hierarchical sales data, generously made available by Walmart, starting at the item level and aggregating to departments, product categories, and stores in three geographical areas of the US: California, Texas, and Wisconsin.

Each row contains an id that is a concatenation of an item_id and a store_id, which is either validation (corresponding to the Public leaderboard), or evaluation (corresponding to the Private leaderboard). 

We are predicting 28 forecast days (F1-F28) of items sold for each row. For the **validation rows**, this corresponds to d_1914 - d_1941, and for the **evaluation rows**, this corresponds to d_1942 - d_1969. (Note: a month before the competition close, the ground truth for the validation rows will be provided.)

Detailed information can be found [at the university website](https://mofc.unic.ac.cy/m5-competition/) and [the competition website on Kaggle](https://www.kaggle.com/c/m5-forecasting-accuracy/data).

An overview of the given data can be seen here:

Data exists in three files:
1. File 1: “calendar.csv”: Contains information about the dates the products are sold.
    * date: The date in a “y-m-d” format.
    * wm_yr_wk: The id of the week the date belongs to.
    * weekday: The type of the day (Saturday, Sunday, …, Friday).
    * wday: The id of the weekday, starting from Saturday.
    * month: The month of the date.
    * year: The year of the date.
    * event_name_1: If the date includes an event, the name of this event.
    * event_type_1: If the date includes an event, the type of this event.
    * event_name_2: If the date includes a second event, the name of this event.
    * event_type_2: If the date includes a second event, the type of this event.
    * snap_CA, snap_TX, and snap_WI: A binary variable (0 or 1) indicating whether the stores of CA, TX or WI allow SNAP  purchases on the examined date. 1 indicates that SNAP purchases are allowed.


2. File 2: “sell_prices.csv”: Contains information about the price of the products sold per store and date.
    * store_id: The id of the store where the product is sold. 
    * item_id: The id of the product.
    * wm_yr_wk: The id of the week.
    * sell_price: The price of the product for the given week/store. The price is provided per week (average across seven days). If not available, this means that the product was not sold during the examined week. Note that although prices are constant at weekly basis, they may change through time (both training and test set).  


3. File 3: “sales_train.csv”: Contains the historical daily unit sales data per product and store.
    * item_id: The id of the product.
    * dept_id: The id of the department the product belongs to.
    * cat_id: The id of the category the product belongs to.
    * store_id: The id of the store where the product is sold.
    * state_id: The State where the store is located.
    * d_1, d_2, …, d_i, … d_1941: The number of units sold at day i, starting from 2011-01-29. 
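
Since `d_1` corresponds to 2011-01-29, a day label can be turned into a calendar date with plain date arithmetic. The snippet below is only a small sketch of that idea; the helper name `day_label_to_date` is made up for illustration, and in the notebook itself the same mapping is available through the `d` and `date` columns of calendar.csv.

```python
from datetime import date, timedelta

FIRST_DAY = date(2011, 1, 29)  # d_1 according to the data description above

def day_label_to_date(label):
    """Convert a label such as 'd_1914' to a calendar date (hypothetical helper)."""
    day_number = int(label.split("_")[1])
    return FIRST_DAY + timedelta(days=day_number - 1)

print(day_label_to_date("d_1"))     # first day of the training data
print(day_label_to_date("d_1914"))  # first day of the validation horizon
```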


![grafik.png](attachment:grafik.png)

# Sources and guidelines <a class="anchor" id="sources"></a>

Acknowledgements
As a starting point I mainly used these notebooks:  
* [baseline LSTM of Accuracy Prediction](https://www.kaggle.com/bountyhunters/baseline-lstm-with-keras-0-8#Future-Improvements)    
* [Quantile regression, from linear models to trees to deep learning](https://towardsdatascience.com/quantile-regression-from-linear-models-to-trees-to-deep-learning-af3738b527c3)
* [Deep Quantile Regression](https://towardsdatascience.com/deep-quantile-regression-c85481548b5a)
* [M5 Uncertainty Notebook by Allunia](https://www.kaggle.com/allunia/m5-uncertainty)

My other M5 notebook for Accuracy can be seen here: (will be uploaded soon)

# Preparing to start <a class="anchor" id="prepare"></a>

## Loading packages <a class="anchor" id="packages"></a>

In [None]:
import pandas as pd
import numpy as np
import sklearn as skl
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sc
from sklearn.metrics import roc_auc_score
import gc #importing garbage collector
import time
from scipy import signal



import warnings
warnings.filterwarnings('ignore')

%matplotlib inline  

SEED = 42
#Pandas - Displaying more rows and columns
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int8','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
timesteps = 14
startDay = 0

## Loading data <a class="anchor" id="data"></a>

In [None]:
df_train = pd.read_csv('/kaggle/input/m5-forecasting-uncertainty/sales_train_validation.csv')
df_prices = pd.read_csv('/kaggle/input/m5-forecasting-uncertainty/sell_prices.csv')
df_days = pd.read_csv('/kaggle/input/m5-forecasting-uncertainty/calendar.csv')

df_train = reduce_mem_usage(df_train)
df_prices = reduce_mem_usage(df_prices)
df_days = reduce_mem_usage(df_days)

## Looking at the hierarchy <a class="anchor" id="hierarchy_ts"></a>

In the competition guideline we can find that the hierarchy consists of 12 levels. Let's try to reconstruct some of them:

1. The top is given by the unit sales of all products, aggregated for all stores/states. 
2. Unit sales of all products, aggregated for each state.
3. Unit sales of all products, aggregated for each store.
4. Unit sales of all products, aggregated for each category.
5. Unit sales of all products, aggregated for each department.  
...

In [None]:
series_cols = df_train.columns[df_train.columns.str.contains("d_")].values
level_cols = df_train.columns[df_train.columns.str.contains("d_")==False].values

In [None]:
df_train.head(1)

In [None]:
sns.set_palette("colorblind")

fig, ax = plt.subplots(5,1,figsize=(20,28))
df_train[series_cols].sum().plot(ax=ax[0])
ax[0].set_title("Top-Level-1: Summed product sales of all stores and states")
ax[0].set_ylabel("Unit sales of all products");
df_train.groupby("state_id")[series_cols].sum().transpose().plot(ax=ax[1])
ax[1].set_title("Level-2: Summed product sales of all stores per state");
ax[1].set_ylabel("Unit sales of all products");
df_train.groupby("store_id")[series_cols].sum().transpose().plot(ax=ax[2])
ax[2].set_title("Level-3: Summed product sales per store")
ax[2].set_ylabel("Unit sales of all products");
df_train.groupby("cat_id")[series_cols].sum().transpose().plot(ax=ax[3])
ax[3].set_title("Level-4: Summed product sales per category")
ax[3].set_ylabel("Unit sales of all products");
df_train.groupby("dept_id")[series_cols].sum().transpose().plot(ax=ax[4])
ax[4].set_title("Level-5: Summed product sales per product department")
ax[4].set_ylabel("Unit sales of all products");

### Insights

* Performing groupby and summing up the sales makes it much clearer how these levels are aggregated.
* We can already observe nice periodic patterns. 

# The submission format <a class="anchor" id="submission"></a>

## Intro <a class="anchor" id="intro"></a>

* We have 28 F-columns as we are predicting daily sales for the next 28 days. 
* We are asked to make uncertainty estimates for these days.

In [None]:
submission_sample =pd.read_csv('/kaggle/input/m5-forecasting-uncertainty/sample_submission.csv')
submission_sample.head(10)

* In the first submission row we are asked to make predictions for the top level 1 (unit sales of all products, aggregated for all stores/states).
* The next 3 rows represent level 2.
* They are followed by level 3 and so on.
* Some rows contain aggregations at different levels. An X indicates the absence of a second aggregation level.
* Each row refers either to the validation period (related to the public leaderboard) or to the evaluation period (related to the private leaderboard).

An overview of the different levels is given in the Competitors Guide as follows:

| Level id|	Aggregation Level|	Number of series|
|:----|:----|:----|
|1|Unit sales of all products, aggregated for all stores/states|	1|
|2|Unit sales of all products, aggregated for each State|	3|
|3|Unit sales of all products, aggregated for each store| 	10|
|4|Unit sales of all products, aggregated for each category|	3|
|5|Unit sales of all products, aggregated for each department|	7|
|6|Unit sales of all products, aggregated for each State and category|	9|
|7|Unit sales of all products, aggregated for each State and department|	21|
|8|Unit sales of all products, aggregated for each store and category|	30|
|9|Unit sales of all products, aggregated for each store and department|	70|
|10|Unit sales of product x, aggregated for all stores/states|	3,049|
|11|Unit sales of product x, aggregated for each State|	9,147|
|12|Unit sales of product x, aggregated for each store|	30,490|
| |**Total**|**42,840**|
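
The numbers in the table follow directly from the sizes of the hierarchy (3 states, 10 stores, 3 categories, 7 departments, 3,049 products). As a quick sanity check, here is a small sketch that recomputes a few of them and the total; the variable names are chosen only for this illustration.

```python
# Number of series per aggregation level, copied from the table above
series_per_level = {1: 1, 2: 3, 3: 10, 4: 3, 5: 7, 6: 9, 7: 21,
                    8: 30, 9: 70, 10: 3049, 11: 9147, 12: 30490}

n_states, n_stores, n_categories, n_departments, n_items = 3, 10, 3, 7, 3049

# Combined levels are simply products of the individual sizes
assert series_per_level[6] == n_states * n_categories
assert series_per_level[9] == n_stores * n_departments
assert series_per_level[11] == n_items * n_states
assert series_per_level[12] == n_items * n_stores

total_series = sum(series_per_level.values())
print(total_series)
```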

## Prediction intervals and quantiles <a class="anchor" id="PIs"></a>

Given that forecasters will be asked to provide the median, and the 50%, 67%, 95%, and 99% PIs, u is set to u1=0.005, u2=0.025, u3=0.165, u4=0.25, u5=0.5, u6=0.75, u7=0.835, u8=0.975, and u9=0.995, therefore leading to the following quantiles:

* 99% PI - $u_{1} = 0.005$ and $u_{9} = 0.995$
* 95% PI - $u_{2} = 0.025$ and $u_{8} = 0.975$
* 67% PI - $u_{3} = 0.165$ and $u_{7} = 0.835$
* 50% PI - $u_{4} = 0.25$ and $u_{6} = 0.75$
* median - $u_{5} = 0.5$
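
Each central prediction interval with coverage c is bounded by the quantiles (1 - c)/2 and 1 - (1 - c)/2. The following sketch derives the nine quantiles from the four coverages plus the median; the function name `interval_quantiles` is just an illustrative choice.

```python
coverages = [0.50, 0.67, 0.95, 0.99]

def interval_quantiles(coverage):
    """Lower and upper quantile of a central prediction interval."""
    tail = (1 - coverage) / 2
    # rounding only removes floating point noise such as 0.16499999999999998
    return round(tail, 3), round(1 - tail, 3)

quantiles = sorted({0.5} | {q for c in coverages for q in interval_quantiles(c)})
print(quantiles)
```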

In [None]:
# total number of series * number of quantiles * 2 (validation & evaluation)
42840*9*2

In [None]:
submission_sample.shape

## Outlier <a class="anchor" id="outlier"></a>

On certain days, sales dropped significantly (e.g. Christmas).

Here, we take a look at peak days (i.e. peaks in terms of zero sales) on an overall level:

In [None]:
temp_series = df_train
plt.figure(figsize=(12,8))
peak_days = []
x = np.count_nonzero(temp_series==0, axis=0)
peaks, _ = sc.signal.find_peaks(x, height=np.quantile(x,0.75), threshold=max(x)/25)
peak_d = temp_series.columns[peaks]
peak_days=peak_d
plt.plot(x)
plt.plot(peaks, x[peaks], "x", color='red')
    
plt.title('Number of Zero Sales per Day')
plt.ylabel('Number of Zero Sales')
plt.xlabel('Days')

In [None]:
peak_days

In [None]:
df_days[df_days['d'].isin(peak_days)]

On almost every outlier day, there is a special event like Thanksgiving or Christmas.

In [None]:
peak_days_before=[]
peak_days_after=[]

for day in peak_days:
    # np.int was removed in newer NumPy versions, the built-in int does the same job
    day_number = int(day[2:])
    peak_days_before.append('d_'+str(day_number-1))
    peak_days_after.append('d_'+str(day_number+1))

In [None]:
df_train_no_outlier = df_train.copy().T[1:]
df_train_no_outlier.columns = df_train.T.iloc[0]

for x,y,z in zip(peak_days,peak_days_before,peak_days_after):
        df_train_no_outlier[df_train_no_outlier.index==x] = np.reshape([pd.concat([df_train_no_outlier[df_train_no_outlier.index==y],df_train_no_outlier[df_train_no_outlier.index==z]],axis=0).mean()],(1,30490))

df_train_no_outlier = df_train_no_outlier.T.reset_index()

In [None]:
df_train_no_outlier = pd.concat([df_train_no_outlier[level_cols],df_train_no_outlier[series_cols].apply(pd.to_numeric,downcast='float')],axis=1)
df_train_no_outlier = reduce_mem_usage(df_train_no_outlier)

In [None]:
df_train_no_outlier.info()

Let's check whether this worked.

In [None]:
temp_series = df_train_no_outlier
plt.figure(figsize=(12,8))
x = np.count_nonzero(temp_series==0, axis=0)
plt.plot(x)
    
plt.title('Number of Zero Sales per Day')
plt.ylabel('Number of Zero Sales')
plt.xlabel('Days')
plt.ylim(0,30000)

In [None]:
del temp_series, peak_days_before, peak_days_after, peak_d, peak_days, peaks

In [None]:
del df_train

## Aggregation levels <a class="anchor" id="sub_aggregation_levels"></a>

In [None]:
df_train_no_outlier.head()

As seen in the table above, we are going to create the 12 levels one after another through grouping statements.

In [None]:
series_cols = df_train_no_outlier.columns[df_train_no_outlier.columns.str.contains("d_")].values
level_cols = df_train_no_outlier.columns[df_train_no_outlier.columns.str.contains("d_")==False].values

In [None]:
Level1 = pd.DataFrame(df_train_no_outlier[series_cols].sum(),columns=['Total']).T  # a list, since newer pandas versions reject a set here
Level2 = df_train_no_outlier.groupby("state_id")[series_cols].sum()
Level3 = df_train_no_outlier.groupby("store_id")[series_cols].sum()
Level4 = df_train_no_outlier.groupby("cat_id")[series_cols].sum()
Level5 = df_train_no_outlier.groupby("dept_id")[series_cols].sum()

def combined_level(first_key, second_key):
    # Sum the daily sales per key pair and label each row "<first>_<second>".
    # This avoids the slow row-by-row chained assignment, which newer pandas versions no longer support.
    level = df_train_no_outlier.groupby([first_key, second_key])[series_cols].sum()
    level.index = [str(first) + '_' + str(second) for first, second in level.index]
    return level

Level6 = combined_level("state_id", "cat_id")
Level7 = combined_level("state_id", "dept_id")
Level8 = combined_level("store_id", "cat_id")
Level9 = combined_level("store_id", "dept_id")

Level10= df_train_no_outlier.groupby(["item_id"])[series_cols].sum()

Level11 = combined_level("item_id", "state_id")


Level12= df_train_no_outlier.copy()
Level12.set_index(Level12['id'],inplace=True, drop =True)
Level12.drop(level_cols,axis=1,inplace=True)

df=pd.concat([Level1,Level2,Level3,Level4,Level5,Level6,Level7,Level8,Level9,Level10,Level11,Level12])

del Level1,Level2,Level3,Level4,Level5,Level6,Level7,Level8,Level9,Level10,Level11,Level12

Now we need to test whether the levels are combined correctly and the rows are in the same order as in the sample submission. We are going to do this by comparing the row names.

In [None]:
test = pd.concat([df.reset_index()['index'],submission_sample.reset_index().id[:42840]],axis=1)
# assign the result instead of replacing in place on a column selection
test['index'] = test['index'].replace('_validation','',regex=True)

In [None]:
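# True where the row name of df is contained in the corresponding submission id
test['proof'] = [idx in sub_id for idx, sub_id in zip(test['index'], test['id'])]
# show the rows that do not match (an empty table means that every row is fine)
test[~test['proof']]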

If the table above is empty, every combination is fine and in the right order.

In [None]:
del test

# Feature Creation  <a class="anchor" id="Feature_Creation"></a>

In the next part we have to decide how many features we want to take for test and training datasets. 

## Limited Features <a class="anchor" id="limfeat"></a>

In the first part we are taking only one extra feature (i.e. limited features).

In [None]:
df_days["date"] = pd.to_datetime(df_days['date'])
df_days.set_index('date', inplace=True)

# 1 if the day has an event, 0 otherwise
df_days['is_event_day'] = df_days['event_name_1'].notnull().astype(np.int8)

# mark the day before each event (assigned in one step instead of chained indexing)
day_before_event = df_days[df_days['is_event_day']==1].index.shift(-1,freq='D')
df_days['is_event_day_before'] = df_days.index.isin(day_before_event).astype(np.int8)

del day_before_event

daysBeforeEventTest = df_days['is_event_day_before'][1913:1941]
daysBeforeEvent = df_days['is_event_day_before'][startDay:1913]
daysBeforeEvent.index = df_train_no_outlier.index[startDay:1913]

In [None]:
df_final = pd.concat([df.T.reset_index(drop=True), daysBeforeEvent.reset_index(drop=True)], axis = 1)
df_final = df_final[startDay:]

## More Features <a class="anchor" id="morefeat"></a>
Next, we want to increase our number of features a little.

In [None]:
df_days

In [None]:
features = df_days[['is_event_day_before','wday','snap_CA','snap_TX','snap_WI']]
features.head()

### Pricing Feature <a class="anchor" id="pricefeat"></a>

What if we add prices for the products? Though we only have them on a weekly level, they could improve the model.

Even if we take prices per product per week, we have to take into account 3049 products * 282 weeks, leading to 859818 additional columns.

However, if we group by store and category, we get 10 (stores) * 3 (categories) and therefore only 30 additional columns.

In [None]:
# adding 'id' column as well as 'cat_id', 'dept_id' and 'state_id', then changing the type to 'categorical'
df_prices.loc[:, "id"] = df_prices.loc[:, "item_id"] + "_" + df_prices.loc[:, "store_id"] + "_validation"
df_prices['state_id'] = df_prices['store_id'].str.split('_',expand=True)[0]
df_prices = pd.concat([df_prices, df_prices["item_id"].str.split("_", expand=True)], axis=1)
df_prices = df_prices.rename(columns={0:"cat_id", 1:"dept_id"})
df_prices[["store_id", "item_id", "cat_id", "dept_id", 'state_id']] = df_prices[["store_id","item_id", "cat_id", "dept_id", 'state_id']].astype("category")
df_prices = df_prices.drop(columns=2)

In [None]:
price_features = pd.DataFrame(df_prices.groupby(['wm_yr_wk','store_id','cat_id'])['sell_price'].mean().reset_index())
price_features['sell_price'] = price_features['sell_price'].astype('float32')

In [None]:
# build the "<store>_<category>" label for all rows at once instead of looping with chained assignment
price_features['store_cat'] = price_features['store_id'].astype(str)+'_'+price_features['cat_id'].astype(str)

In [None]:
price_features= price_features.pivot(index='store_cat',columns='wm_yr_wk',values='sell_price').T
price_features.head()

In [None]:
features = df_days[['wm_yr_wk','is_event_day_before','wday','snap_CA','snap_TX','snap_WI']]
features.head()
features = pd.merge(features.reset_index(),price_features,how='left', left_on='wm_yr_wk', right_on='wm_yr_wk').set_index('date')
features.drop('wm_yr_wk', axis=1, inplace=True)
features.head()

In [None]:
features_test = features.iloc[1913:1941,:]
features_train = features.iloc[startDay:1913,:]
df_final_more = pd.concat([df.T.reset_index(drop=True), features_train.reset_index(drop=True)], axis = 1)
df_final = df_final_more.copy()

In [None]:
del df_final_more, features

## Feature Scaling <a class="anchor" id="featscale"></a>
For better modeling, we scale all columns to the range 0-1 using a min-max scaler.
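
As a reminder of what min-max scaling to the range 0-1 does per column, here is a small NumPy sketch on toy numbers (the array `toy` is made up). It also shows the inverse step, which is what brings scaled predictions back to unit sales later on.

```python
import numpy as np

# toy data: 3 days (rows) and 2 columns with very different ranges
toy = np.array([[0.0, 10.0],
                [5.0, 20.0],
                [10.0, 40.0]])

col_min = toy.min(axis=0)
col_max = toy.max(axis=0)

scaled = (toy - col_min) / (col_max - col_min)    # every column now lies in [0, 1]
restored = scaled * (col_max - col_min) + col_min  # inverse transform

print(scaled)
```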

In [None]:
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))  # note: from here on sc is the scaler and no longer the scipy alias
dt_scaled = sc.fit_transform(df_final)

In [None]:
gc.collect()

## Generating Train and Test Data <a class="anchor" id="traintest"></a>

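In the next step, let's create X_train and y_train by sliding a window of 14 days (timesteps) over the scaled data: each X_train sample holds 14 consecutive days, and the matching y_train sample holds the day that follows. As we only predict sales, y_train keeps only the first 42840 columns (0:42840), i.e. the sales series without the feature columns.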

![grafik.png](attachment:grafik.png)
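
To make the resulting shapes easier to follow, here is the same windowing idea on a tiny toy array. This is only a sketch; the function name `make_windows` and the toy sizes (20 days, 2 sales series, 1 feature) are assumptions for illustration.

```python
import numpy as np

def make_windows(data, timesteps, n_targets):
    """Cut a (days, columns) array into input windows and next-day targets."""
    X, y = [], []
    for i in range(timesteps, len(data)):
        X.append(data[i - timesteps:i])  # the previous `timesteps` days, all columns
        y.append(data[i, :n_targets])    # the following day, sales columns only
    return np.array(X), np.array(y)

toy = np.arange(20 * 3, dtype=float).reshape(20, 3)  # 20 days, 2 sales series + 1 feature
X_toy, y_toy = make_windows(toy, timesteps=14, n_targets=2)
print(X_toy.shape, y_toy.shape)
```

With 20 days and a window of 14 days there are 6 training samples, just as the real data gives 1913 - 14 samples.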

In [None]:
X_train = []
y_train = []
for i in range(timesteps, 1913 - startDay):
    X_train.append(dt_scaled[i-timesteps:i])
    y_train.append(dt_scaled[i][0:42840]) 
    
X_train = np.array(X_train)
y_train = np.array(y_train)
print('Shape of X_train :'+str(X_train.shape))
print('Shape of y_train :'+str(y_train.shape))

In [None]:
inputs = df_final[-timesteps:]
inputs = sc.transform(inputs)

In [None]:
%who

In [None]:
del df_train_no_outlier, df_prices, df_days, df, df_final, dt_scaled, price_features

In [None]:
gc.collect()

# LSTM Modeling <a class="anchor" id="Modeling"></a>

Next, we start our modeling. We will use an LSTM neural network with several layers.

In general, the training of a neural network can be described by the following steps (see the picture below):
1. The neural network model starts with random weights and tries to find the best weights for the different layers by predicting outcomes and comparing them with the true target outcomes. For this it uses the loss function.
2. The loss function measures the quality of the network's output.
3. Then, the loss score is used as a feedback signal to adjust the weights.

![grafik.png](attachment:grafik.png)
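
The three steps can be tried out on a very small example. In the sketch below a model with a single weight stands in for the network and the mean squared error stands in for the loss; both are simplifying assumptions for illustration and not the LSTM used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)  # the true weight is 3

w = rng.normal()       # 1. start with a random weight
learning_rate = 0.1
losses = []

for step in range(100):
    prediction = w * x
    loss = np.mean((prediction - y) ** 2)         # 2. the loss measures the quality of the output
    gradient = np.mean(2 * (prediction - y) * x)  # slope of the loss with respect to w
    w -= learning_rate * gradient                 # 3. the loss is the feedback to adjust the weight
    losses.append(loss)

print(w, losses[0], losses[-1])
```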

## Loss Function <a class="anchor" id="lossfct"></a>

In the M5 uncertainty competition we have to predict the different aggregation levels at certain quantiles. All that changes in comparison to the [baseline lstm](https://www.kaggle.com/bountyhunters/baseline-lstm-with-keras-0-7#Future-Improvements) is the loss function. The following few lines define the tilted (quantile) loss used for this: an under-prediction is weighted with q and an over-prediction with 1 - q.

In [None]:
def tilted_loss(q, y, f):
    e = (y - f)
    return keras.backend.mean(keras.backend.maximum(q * e, (q - 1) * e), 
                              axis=-1)
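
To see why this loss leads to quantile predictions, here is a NumPy version of the same formula (a sketch, independent of Keras). It searches for the constant prediction with the lowest loss on a made-up sample and compares it with the empirical quantile; the names `pinball_loss` and `best_by_q` are chosen only for this example.

```python
import numpy as np

def pinball_loss(q, y_true, y_pred):
    """Same formula as tilted_loss above, written with NumPy."""
    error = y_true - y_pred
    return np.mean(np.maximum(q * error, (q - 1) * error))

rng = np.random.default_rng(0)
sample = rng.exponential(scale=10.0, size=10_000)  # skewed toy "sales"
candidates = np.linspace(0, 60, 601)               # constant predictions to try

best_by_q = {}
for q in (0.25, 0.5, 0.975):
    losses = [pinball_loss(q, sample, c) for c in candidates]
    best_by_q[q] = candidates[int(np.argmin(losses))]
    print(q, best_by_q[q], np.quantile(sample, q))
```

The candidate with the lowest loss lands next to the empirical quantile of the sample, which is the behavior we want from the network for each value of q.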

## Running the Model <a class="anchor" id="runmodel"></a>

When creating X_test, we use the last 14 days of the training data (d_1900 to d_1913) to predict the sales of day 1914. To predict day 1915, 13 days from our input data and 1 day from our predictions are used. After that we keep sliding the window one day at a time, i.e.:

* 12 days from input data + 2 days from our predictions to predict day 1916
* 11 days from input data + 3 days from our predictions to predict day 1917
* .....
* 14 days from our predictions to predict day 1928 and every later day up to the last one, day 1941.



![grafik.png](attachment:grafik.png)
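
The sliding window can be sketched without any neural network. In the example below a stand-in "model" that simply returns the mean of its window replaces the LSTM; this is purely an assumption for illustration, as is the function name `recursive_forecast`.

```python
import numpy as np

def recursive_forecast(predict_next, history, timesteps, horizon):
    """Forecast `horizon` steps, feeding each prediction back into the window."""
    window = list(history[-timesteps:])
    forecasts = []
    for _ in range(horizon):
        next_value = predict_next(np.array(window[-timesteps:]))
        forecasts.append(next_value)
        window.append(next_value)  # the prediction becomes an input for the next step
    return np.array(forecasts)

history = np.arange(1.0, 31.0)  # 30 days of toy sales: 1, 2, ..., 30
forecasts = recursive_forecast(lambda w: w.mean(), history, timesteps=14, horizon=28)
print(forecasts[:3])
```

From the 15th forecast on, the window contains only earlier predictions, so errors can build up over the 28 days.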

In [None]:
QUANTILES = [0.005, 0.025, 0.165, 0.25, 0.5, 0.75, 0.835, 0.975, 0.995]

In [None]:
EPOCHS = 32 # going through the dataset 32 times
BATCH_SIZE = 32 # with each training step the model sees 32 examples

In [None]:
# Importing the Keras libraries and packages
import tensorflow_probability as tfp
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
import tensorflow as tf
import keras

def run_model(X_train, y_train, q):

    model = Sequential()

    # Adding the first LSTM layer and some Dropout regularisation
    layer_1_units=40
    model.add(LSTM(units = layer_1_units, return_sequences = True, input_shape = (X_train.shape[1], X_train.shape[2])))
    model.add(Dropout(0.2))

    # Adding a second LSTM layer and some Dropout regularisation
    layer_2_units=300
    model.add(LSTM(units = layer_2_units, return_sequences = True))
    model.add(Dropout(0.2))

    # Adding a third LSTM layer and some Dropout regularisation
    layer_3_units=300
    model.add(LSTM(units = layer_3_units))
    model.add(Dropout(0.2))

    # Adding the output layer
    model.add(Dense(units = y_train.shape[1]))

    # Compiling the RNN
    model.compile(optimizer = 'adam',loss=lambda y, f: tilted_loss(q, y, f))
    
    # To follow at which quantile we are predicting right now  
    print('Running the model for quantile: '+str(q)+':')

    # Fitting the RNN to the Training set
    fit = model.fit(X_train, y_train, epochs = EPOCHS, batch_size = BATCH_SIZE, verbose=2)
    
    X_test = []
    X_test.append(inputs[0:timesteps])
    X_test = np.array(X_test)
    prediction = []
     
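    n_series = y_train.shape[1]  # number of sales series (42840)
    n_inputs = X_train.shape[2]  # sales series plus the 35 calendar and price features (42875)

    for j in range(timesteps,timesteps + 28):
        predicted_volume = model.predict(X_test[0,j - timesteps:j].reshape(1, timesteps, n_inputs)) #incl. features
        # features_test holds raw values, so scale them like the training features
        # (MinMaxScaler works column-wise: X_scaled = X * scale_ + min_)
        day_features = features_test.iloc[j-timesteps,:].to_numpy(dtype=float) * sc.scale_[n_series:] + sc.min_[n_series:]
        testInput = np.column_stack((np.array(predicted_volume), day_features.reshape(1,-1)))
        X_test = np.append(X_test, testInput).reshape(1,j + 1,n_inputs) #incl. features
        predicted_volume = sc.inverse_transform(testInput)[:,0:n_series] #without features
        prediction.append(predicted_volume)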
    
    prediction = pd.DataFrame(data=np.array(prediction).reshape(28,42840)).T
    return prediction

In [None]:
# We run the model for all the quantiles mentioned above. 
# Combining all quantile predictions one after another to a large dataset.
predictions = pd.concat(
    [run_model(X_train, y_train, q) 
     for q in QUANTILES]) 

In [None]:
gc.collect()

In [None]:
predictions.shape

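We can see that our shape matches the requested outcome: 9 quantiles * 42840 series = 385560 rows with 28 forecast days each. Multiplying by two (for the validation and evaluation data) gives the number of rows of the sample submission.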

In [None]:
predictions.shape[0]*2

## Creating the Submission File <a class="anchor" id="submission"></a>

Finally, let's create a submission file, using the ids of the sample submission. As we have the validation and evaluation data, we need to stack the submission file on top of itself.

In [None]:
predictions.to_pickle('Uncertainty_Predictions.pkl')

In [None]:
submission = pd.concat((predictions, predictions), ignore_index=True)
idColumn = submission_sample[["id"]]    
submission[["id"]] = idColumn  

#re-arranging columns
cols = list(submission.columns)
cols = cols[-1:] + cols[:-1]
submission = submission[cols]
#
colsname = ["id"] + [f"F{i}" for i in range (1,29)]
submission.columns = colsname

submission.to_csv("submission.csv", index=False)

Let's take a look at one of the predicted datasets (here we take the median, i.e. quantile = 0.5).

We want to test whether the sums of our daily predictions on the different levels are equal to level 1 (the total).
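
The idea behind this check can be shown on a tiny hierarchy with only two levels (a total and three states). In this sketch the rows are coherent by construction; the names `split_levels` and `is_coherent` are made up. The levels in this notebook are predicted by one network without any constraint that forces them to add up, so for the real predictions only an approximate match can be expected.

```python
import numpy as np

def split_levels(rows, sizes):
    """Split stacked rows into consecutive blocks of the given sizes."""
    ends = np.cumsum(sizes)
    return [rows[end - size:end] for size, end in zip(sizes, ends)]

rng = np.random.default_rng(1)
state_sales = rng.integers(0, 100, size=(3, 28))      # 3 states, 28 forecast days
total_sales = state_sales.sum(axis=0, keepdims=True)  # level 1 is the sum of the states
stacked = np.vstack([total_sales, state_sales])       # same stacking order as in the submission

level_1, level_2 = split_levels(stacked, sizes=[1, 3])
is_coherent = np.array_equal(level_1.sum(axis=0), level_2.sum(axis=0))
print(is_coherent)
```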

In [None]:
temp_series = submission[171360:171360+42840]

border = [1,3,10,3,7,9,21,30,70,3049,9147,30490]
sumi = 0
levels =[]
for i in border:
    sumi += i
    levels.append(pd.DataFrame(temp_series[sumi-i:sumi]))


for i,level in enumerate(levels):
    levels[i] = levels[i].sum()

Let's take a look at it graphically:

In [None]:
plt.figure(figsize=(20, 8))

for i in range(12):
    plt.plot(levels[i][1:],label='level'+str(i+1))

plt.legend()

Levels 12, 11 and 10 (the most detailed ones) have the lowest total sums. Level 12 is approx. 2/3 of the sum of level 1.

Let's plot them on separate scales to see if the curves move similarly.

In [None]:
fig,ax = plt.subplots(figsize=(20, 8))

for i in range(6):
    ax = ax.twinx()
    ax.plot(levels[i][1:],label='level'+str(i+1))
    plt.yticks([])

At least they all have similar ups and downs!

Please feel free to share any ideas for improvement as a comment, and we can discuss them in more detail.