#### Pre-processing of Kaggle M5 training dataset:


The purpose of this notebook is to format the training data so that I can pre-train my creme model before going into production. For a product in a shop, I will extract the date, product ID, calendar events and product price. I can safely say that this information is available in real time in a production context. I also store the ground truth because we are going to train the model.

For each product, each day and each store, I want to obtain the following information:

```
x = {
    'date': '2018-07-15', 
    'id': 'HOBBIES_1_001_CA_1_validation',
    'y': 200, # Ground truth
}
```

In [25]:
import pandas as pd

In [26]:
import pickle

#### Reading training dataset

In [27]:
data = pd.read_csv('./sales_train_validation.csv', dtype = {'id': 'category',
 'item_id': 'category', 'dept_id': 'category', 'cat_id': 'category', 'store_id': 'category',
 'state_id': 'category'})

In [28]:
data.head()

Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,d_1,d_2,d_3,d_4,...,d_1904,d_1905,d_1906,d_1907,d_1908,d_1909,d_1910,d_1911,d_1912,d_1913
0,HOBBIES_1_001_CA_1_validation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,1,3,0,1,1,1,3,0,1,1
1,HOBBIES_1_002_CA_1_validation,HOBBIES_1_002,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,HOBBIES_1_003_CA_1_validation,HOBBIES_1_003,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,2,1,2,1,1,1,0,1,1,1
3,HOBBIES_1_004_CA_1_validation,HOBBIES_1_004,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,1,0,5,4,1,0,1,3,7,2
4,HOBBIES_1_005_CA_1_validation,HOBBIES_1_005,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,2,1,1,0,1,1,2,2,2,4


#### Constructing test dataset:

In [29]:
test = pd.DataFrame(data[['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']])
for i in range(1914, 1970):
    test[f'd_{i}'] = 0
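The placeholder columns above can also be built in a single step instead of being inserted one at a time. The following is only a sketch on a toy metadata table: the two product rows and the names `meta`, `horizon` and `placeholders` are assumptions made for the illustration, not part of the notebook.

```python
import pandas as pd

# Toy metadata table standing in for the six metadata columns (hypothetical rows).
meta = pd.DataFrame({
    'id': ['A_validation', 'B_validation'],
    'store_id': ['CA_1', 'CA_1'],
})

# Days to forecast: d_1914 to d_1969 inclusive, i.e. 56 days.
horizon = range(1914, 1970)

# One zero-filled frame holding every placeholder column at once.
placeholders = pd.DataFrame(0, index=meta.index, columns=[f'd_{i}' for i in horizon])

# Glue the metadata and the placeholders side by side.
test_toy = pd.concat([meta, placeholders], axis='columns')
print(test_toy.shape)
```

With the real 30490 products and six metadata columns, the same construction would give the (30490, 62) shape reported in the notebook.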

In [30]:
test.head()

Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,d_1914,d_1915,d_1916,d_1917,...,d_1960,d_1961,d_1962,d_1963,d_1964,d_1965,d_1966,d_1967,d_1968,d_1969
0,HOBBIES_1_001_CA_1_validation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,HOBBIES_1_002_CA_1_validation,HOBBIES_1_002,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,HOBBIES_1_003_CA_1_validation,HOBBIES_1_003,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,HOBBIES_1_004_CA_1_validation,HOBBIES_1_004,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,HOBBIES_1_005_CA_1_validation,HOBBIES_1_005,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
test.shape

(30490, 62)

#### Reshape training and testing datasets

In [32]:
metadata = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']
# Day columns (d_1 ... d_1913) that hold the daily sales counts
sales    = [column for column in data.columns if column not in metadata]

In [33]:
data = data.set_index(metadata).stack().reset_index()

In [34]:
test = test.set_index(metadata).stack().reset_index()
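To make the reshape concrete, here is a minimal sketch of the same `set_index(...).stack().reset_index()` pattern on a toy table. The two products, the two day columns and the names `wide`, `long_df` and `melted` are assumptions made for the illustration. `melt` is shown only as an equivalent alternative that names the new columns directly.

```python
import pandas as pd

# Toy wide table: one row per product, one column per day (hypothetical values).
wide = pd.DataFrame({
    'id': ['A_validation', 'B_validation'],
    'store_id': ['CA_1', 'CA_1'],
    'd_1': [0, 2],
    'd_2': [1, 0],
})

# Same pattern as the notebook: index by the metadata, then stack the day columns.
# The result has one row per (product, day) pair.
long_df = wide.set_index(['id', 'store_id']).stack().reset_index()
long_df.columns = ['id', 'store_id', 'd', 'y']

# melt produces the same rows, ordered by day column instead of by product.
melted = wide.melt(id_vars=['id', 'store_id'], var_name='d', value_name='y')

print(long_df)
```

After `reset_index()` the day labels and the sales values get default column names, which is why the notebook renames them in a later cell.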

#### Distinguish training set from testing set

In [35]:
data['test'] = False
test['test'] = True

In [36]:
data = pd.concat([data, test], sort = False, axis = 'rows')

Here I rename the stacked columns and convert column d from a label such as d_1913 to an integer to save memory.

In [37]:
data = data.rename(columns = {'level_6': 'd', 0: 'y'})
data['d'] = data['d'].str.split('_').str[1]
data['d'] = data['d'].astype('int16')
data['y'] = data['y'].astype('int32')
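The effect of this conversion can be checked on a handful of labels. This is a small sketch on made-up inputs; the names `labels` and `days` are assumptions for the example. An `int16` holds values up to 32767, which is more than enough for day numbers up to 1969.

```python
import pandas as pd

# A few day labels in the same format as the stacked column (toy sample).
labels = pd.Series(['d_1', 'd_1913', 'd_1969'])

# Keep the part after the underscore and store it as a 16-bit integer.
days = labels.str.split('_').str[1].astype('int16')

# Compare the memory taken by the values themselves (index excluded).
bytes_as_text = labels.memory_usage(index=False, deep=True)
bytes_as_int = days.memory_usage(index=False, deep=True)

print(days.tolist(), bytes_as_text, bytes_as_int)
```

Each integer takes two bytes, whereas each label is stored as a full string, so the saving grows with the several million rows of the stacked table.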

In [38]:
# Keep the last 60 days of training data (d_1854 to d_1913) and the 56 test days
data = data[data['d'] > 1913 - 60]

In [39]:
data.shape

(3536840, 9)
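The row count above follows directly from the filter. As a quick arithmetic check, using only numbers that appear in this notebook (the variable names are mine):

```python
n_series = 30490          # products x stores, from test.shape
last_train_day = 1913     # last day column of the training file
history_days = 60         # days of history kept by the filter
horizon_days = 1969 - 1914 + 1   # placeholder days d_1914 to d_1969

# The filter keeps d > 1913 - 60, so the first kept day is d_1854.
first_kept_day = last_train_day - history_days + 1

# One row per series and per kept day.
expected_rows = n_series * (history_days + horizon_days)

print(first_kept_day, horizon_days, expected_rows)
```

So each of the 30490 series contributes 60 training rows and 56 test rows.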

In [None]:
data.tail()

#### Extract metadata about the date 

In [None]:
calendar = pd.read_csv('./calendar.csv')

In [None]:
calendar.head()

Here I convert column d of the calendar to an integer as well, so that it matches the sales data and can be used as the merge key.

In [None]:
calendar['d'] = calendar['d'].str.split('_').str[1]
calendar['d'] = calendar['d'].astype('int16')

In [None]:
data = pd.merge(left=data, right=calendar, how='left', on=['d'])
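A left merge on `d` keeps every sales row and attaches the calendar fields of the matching day. The sketch below reproduces this on toy tables; the rows and the name `merged` are assumptions for the illustration. The `validate='many_to_one'` argument is an optional safeguard that I add here, not something the notebook uses: it makes pandas raise an error if the calendar ever contains the same day twice, which would otherwise silently duplicate sales rows.

```python
import pandas as pd

# Toy long-format sales: several products share the same day number.
sales_toy = pd.DataFrame({
    'id': ['A', 'A', 'B'],
    'd': [1, 2, 1],
    'y': [0, 3, 1],
})

# Toy calendar: exactly one row per day number (illustrative dates).
calendar_toy = pd.DataFrame({
    'd': [1, 2],
    'date': ['2011-01-29', '2011-01-30'],
})

# Left merge: every sales row is kept, calendar columns are appended.
merged = pd.merge(
    left=sales_toy,
    right=calendar_toy,
    how='left',
    on='d',
    validate='many_to_one',  # optional check: each day appears once in the calendar
)

print(merged)
```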

In [None]:
train = data[~data['test']].copy(deep = True)
train.head()

In [None]:
test = data[data['test']].copy(deep = True)
test.head()

#### Sorting the data by day is essential because the model must see the sales records in chronological order while it is being trained.

In [None]:
train = train.sort_values('d')

In [None]:
test = test.sort_values('d')
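One detail worth noting: many rows share the same day, and the default sort algorithm of `sort_values` is not guaranteed to keep tied rows in their original order. If a reproducible order within each day matters, two options are sketched below on toy rows. Both are suggestions of mine rather than what the notebook does, and the names `rows`, `by_day` and `by_day_then_id` are hypothetical.

```python
import pandas as pd

# Toy rows, deliberately out of chronological order.
rows = pd.DataFrame({
    'id': ['B', 'A', 'B', 'A'],
    'd': [2, 2, 1, 1],
    'y': [5, 6, 7, 8],
})

# Option 1: a stable sort keeps rows of the same day in their original order.
by_day = rows.sort_values('d', kind='stable')

# Option 2: break ties explicitly with a second key.
by_day_then_id = rows.sort_values(['d', 'id']).reset_index(drop=True)

print(by_day)
print(by_day_then_id)
```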

In [None]:
train.head()

#### We keep only the columns needed to train the model

In [None]:
columns_to_keep = [
    'date', 
    'id', 
    'y',
]

In [None]:
train = train[columns_to_keep]

In [None]:
test = test[columns_to_keep]

In [None]:
train.to_csv('./train_preprocessed.csv', index=False)

In [None]:
test.to_csv('./test_preprocessed.csv', index=False)


To train my model with the creme library, I will iterate over the observations of the dataset one at a time. Each observation of the training and testing datasets has the fields of the dictionary below. I will not use any additional information from the training and testing datasets I have built.

```
x = {
    'date': '2018-07-15', 
    'id': 'HOBBIES_1_001_CA_1_validation',
    'y': 200, # Ground truth
}
```
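As a rough sketch of that iteration, the preprocessed file can be streamed row by row with the standard library alone, without loading it into memory. The helper `iter_observations` is a hypothetical name of mine, not a creme function, and the two sample rows are made up so that the example runs without the real `train_preprocessed.csv`.

```python
import csv
import datetime as dt
import io


def iter_observations(handle):
    """Yield (features, target) pairs, one per CSV row, in file order."""
    for row in csv.DictReader(handle):
        # Separate the ground truth from the features.
        y = int(row.pop('y'))
        # Parse the date string into a date object.
        row['date'] = dt.datetime.strptime(row['date'], '%Y-%m-%d').date()
        yield row, y


# In-memory stand-in for train_preprocessed.csv (toy rows).
sample = io.StringIO(
    'date,id,y\n'
    '2016-04-24,HOBBIES_1_001_CA_1_validation,1\n'
    '2016-04-24,HOBBIES_1_002_CA_1_validation,0\n'
)

observations = list(iter_observations(sample))
for x, y in observations:
    print(x, y)
```

With the real file, the same generator could be fed with `open('./train_preprocessed.csv', newline='')`, and each pair passed to the model as it arrives.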