# <b>1. Introduction</b>
This project is about sales forecasting. We were given weekly aggregated data for different products (_SKU_) over a three-year window, from December 2016 to December 2019. We are asked to forecast the weekly sales of 12 of the 43 products in the dataset. <br>
The metric we use to evaluate our predictions is the Mean Absolute Percentage Error (_MAPE_).
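
For reference, MAPE averages the absolute error of each prediction relative to the true value. The following is a minimal sketch, assuming that predictions are compared on the original sales scale and that no true value is zero. The `mape` helper is illustrative and is not part of the project code.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, expressed in percent.

    Hypothetical helper: assumes y_true contains no zeros.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)


# Two forecasts, each off by 10% of the true value
print(mape([100.0, 200.0], [110.0, 180.0]))
```

Because each error is divided by the true value, a miss of 100 units weighs far more on a low-volume product than on a high-volume one.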

Here is a quick look at the dataset and the time series:

In [6]:
import sys

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

# Make the project packages (e.g. preprocessing) importable from the notebook folder
sys.path.append("../")
from preprocessing.preprocessing import convert_date

csv_train = pd.read_csv("../dataset/original/train.csv")
csv_test = pd.read_csv("../dataset/original/x_test.csv")

# Preview the first rows of a date-converted copy, leaving csv_train untouched
convert_date(csv_train.copy()).head()

Unnamed: 0,Date,sku,pack,size (GM),brand,price,POS_exposed w-1,volume_on_promo w-1,sales w-1,scope,target
0,2016-12-10,2689,SINGLE,395.41,BRAND1,1.16,,,,0,24175.0
1,2016-12-17,2689,SINGLE,395.41,BRAND1,1.15,1.0,17.676112,24175.0,0,23521.0
2,2016-12-24,2689,SINGLE,395.41,BRAND1,1.16,1.0,24.482803,23521.0,0,22075.0
3,2016-12-31,2689,SINGLE,395.41,BRAND1,1.16,0.0,19.410646,22075.0,0,16492.0
4,2017-01-07,2689,SINGLE,395.41,BRAND1,1.16,0.0,29.81203,16492.0,0,25971.0


---
# <b>2. Data Preprocessing</b>
First, we had to decide how to handle the NaNs in the training set, which appear in the first week of data, where the lagged (_w-1_) features have no previous week to draw on. We chose to impute them. <br>
We also found it useful to take the logarithm of **_sales w-1_** and **_target_**, which compresses the range of the values to be predicted. We made this choice because it improved the results of the _decision tree_ models, which are among the models we are going to present.  
Finally, we added a **_real_target_** column to the whole dataframe, holding the target on its original scale that we have to predict for that specific week.
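
The log transform and its inverse can be sketched as follows. This is only an illustration, not the actual `preprocessing` code: the values shown in the output below appear consistent with `log(1 + x)` (for example, a `real_target` of 51320 next to a `target` of about 10.845855), so the sketch assumes `np.log1p` and its inverse `np.expm1`. The `toy` dataframe is a hypothetical stand-in that reuses three target values from the table.

```python
import numpy as np
import pandas as pd

# Toy frame with three raw weekly sales values taken from the table below
toy = pd.DataFrame({"target": [51320.0, 66431.0, 57001.0]})

# Keep the raw-scale target before transforming (the role played by real_target)
toy["real_target"] = toy["target"]

# Assumption: log(1 + x), which matches the displayed values
toy["target"] = np.log1p(toy["target"])

# Predictions made on the log scale must be mapped back before computing MAPE
recovered = np.expm1(toy["target"])

print(toy.round(6))
print(recovered.round(1).tolist())
```

Keeping **_real_target_** alongside the transformed column means the error can always be measured on the original scale, whatever scale a model is trained on.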

In [7]:
from preprocessing.preprocessing import preprocessing, inverse_interpolation
df = preprocessing(csv_train, csv_test)
df

Unnamed: 0,Date,sku,pack,size (GM),brand,price,POS_exposed w-1,volume_on_promo w-1,sales w-1,scope,target,real_target
0,2016-12-10,144,0,114.23,1,2.18,73.0,100.000000,10.497091,1,10.845855,51320.0
1,2016-12-17,144,0,114.23,1,2.00,45.0,100.000000,10.845855,1,11.103934,66431.0
2,2016-12-24,144,0,114.23,1,2.05,17.0,100.000000,11.103934,1,10.950842,57001.0
3,2016-12-31,144,0,114.23,1,3.00,2.0,100.000000,10.950842,1,9.619333,15052.0
4,2017-01-07,144,0,114.23,1,2.99,2.0,28.534193,9.619333,1,9.999570,22016.0
...,...,...,...,...,...,...,...,...,...,...,...,...
6014,2019-05-25,2718,1,395.41,0,1.11,0.0,26.050480,10.430462,0,10.414183,33328.0
6015,2019-06-01,2718,1,395.41,0,1.30,1.0,43.099496,10.414183,0,10.021848,22512.0
6016,2019-06-08,2718,1,395.41,0,1.55,0.0,0.000000,10.021848,0,9.767782,17461.0
6017,2019-06-15,2718,1,395.41,0,1.55,0.0,0.000000,9.767782,0,9.747185,17105.0


Another important decision was to increase the amount of data. Decision trees benefit from a large number of samples, and we saw some improvement with data augmentation. We therefore took the weeks of 2017 and 2018, averaged them and added some noise to create a realistic 2016.<br>
This will be useful in the next notebooks, where we plan to try a stacking approach.
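
One way such an augmentation could work is sketched below. It is a guess at the idea, not the implementation behind `dataAugmentation=True`: the function name `synthesize_earlier_year`, the 52- and 104-week shifts, the multiplicative Gaussian noise and the `noise_std` value are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def synthesize_earlier_year(df, value_cols, noise_std=0.05, seed=0):
    """Prepend synthetic weeks built from the 2017/2018 average plus noise.

    Hypothetical sketch: expects a datetime 'Date' column and a 'sku' column.
    """
    rng = np.random.default_rng(seed)
    first_real = df["Date"].min()

    # Shift 2017 back by 52 weeks and 2018 by 104 weeks so the weeks line up
    shifted = []
    for year, weeks_back in ((2017, 52), (2018, 104)):
        part = df[df["Date"].dt.year == year].copy()
        part["Date"] = part["Date"] - pd.Timedelta(weeks=weeks_back)
        shifted.append(part)
    both = pd.concat(shifted, ignore_index=True)

    # Keep only weeks that precede the first real observation
    both = both[both["Date"] < first_real]

    # Average the two source years per product and week
    synth = both.groupby(["sku", "Date"], as_index=False)[value_cols].mean()

    # Assumption: multiplicative Gaussian noise centred on 1
    noise = rng.normal(loc=1.0, scale=noise_std, size=(len(synth), len(value_cols)))
    synth[value_cols] = synth[value_cols].to_numpy() * noise

    out = pd.concat([synth, df], ignore_index=True)
    return out.sort_values(["sku", "Date"]).reset_index(drop=True)


# Toy example: one product, constant values, weekly dates from December 2016
dates = pd.date_range("2016-12-10", "2018-12-29", freq="7D")
toy = pd.DataFrame({"sku": 144, "Date": dates, "price": 2.0, "target": 1000.0})
augmented = synthesize_earlier_year(toy, ["price", "target"])
print(augmented["Date"].min(), len(toy), len(augmented))
```

Shifting by whole weeks keeps the synthetic rows on the same weekday as the real ones, so the earliest synthetic date in this toy example is 2016-01-09, the same first date that appears in the augmented output below.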

In [8]:
df_augmented = preprocessing(csv_train, csv_test, useTest=False, dataAugmentation=True)
df_augmented

Unnamed: 0,sku,pack,size (GM),brand,scope,price,POS_exposed w-1,volume_on_promo w-1,sales w-1,target,Date,real_target
0,144,0,114.23,1,1,2.563188,0.958994,13.682060,,10.330254,2016-01-09,30644.898835
1,144,0,114.23,1,1,2.356713,21.228482,66.558755,10.330254,10.730537,2016-01-16,45730.242944
2,144,0,114.23,1,1,2.285770,14.365462,68.406960,10.730537,10.603303,2016-01-23,40266.627971
3,144,0,114.23,1,1,2.251557,11.570699,100.000000,10.603303,10.912508,2016-01-30,54857.253481
4,144,0,114.23,1,1,2.777469,11.551469,100.000000,10.912508,10.476862,2016-02-06,35483.879428
...,...,...,...,...,...,...,...,...,...,...,...,...
8078,2718,1,395.41,0,0,1.110000,0.000000,26.050480,10.430462,10.414183,2019-05-25,33328.000000
8079,2718,1,395.41,0,0,1.300000,1.000000,43.099496,10.414183,10.021848,2019-06-01,22512.000000
8080,2718,1,395.41,0,0,1.550000,0.000000,0.000000,10.021848,9.767782,2019-06-08,17461.000000
8081,2718,1,395.41,0,0,1.550000,0.000000,0.000000,9.767782,9.747185,2019-06-15,17105.000000


In [9]:
df_augmented[6138:6145]

Unnamed: 0,sku,pack,size (GM),brand,scope,price,POS_exposed w-1,volume_on_promo w-1,sales w-1,target,Date,real_target
6138,2682,0,105.44,0,0,1.680616,3.990522,35.013759,9.239858,9.62105,2016-11-26,15077.872842
6139,2682,0,105.44,0,0,1.746707,0.847055,21.145042,9.62105,9.559643,2016-12-03,14179.787708
6140,2682,0,105.44,0,0,1.48,0.423527,11.459094,9.559643,9.11548,2016-12-10,9094.0
6141,2682,0,105.44,0,0,1.5,0.0,1.773147,9.11548,9.102644,2016-12-17,8978.0
6142,2682,0,105.44,0,0,1.49,0.0,0.103865,9.102644,9.05614,2016-12-24,8570.0
6143,2682,0,105.44,0,0,1.48,0.0,0.0,9.05614,8.749574,2016-12-31,6307.0
6144,2682,0,105.44,0,0,1.5,0.0,2.526954,8.749574,9.265113,2017-01-07,10562.0


In [10]:
# Rows for the earliest (synthetic) week, one per product
df_augmented[df_augmented.Date == df_augmented.Date.min()]

Unnamed: 0,sku,pack,size (GM),brand,scope,price,POS_exposed w-1,volume_on_promo w-1,sales w-1,target,Date,real_target
0,144,0,114.23,1,1,2.563188,0.958994,13.68206,,10.330254,2016-01-09,30644.898835
206,546,1,114.23,1,1,0.530205,0.0,25.81888,,11.075675,2016-01-09,64579.97458
412,549,1,114.23,1,1,0.524162,0.0,27.737568,,10.506589,2016-01-09,36554.57106
618,554,1,114.23,1,1,0.549452,2.945194,31.573748,,11.597453,2016-01-09,108819.321492
824,686,0,125.65,3,1,2.679173,5.065106,23.415497,,10.359165,2016-01-09,31543.817013
1030,688,1,125.65,3,1,0.53291,0.0,28.767869,,10.53193,2016-01-09,37492.782822
1236,1027,1,114.23,1,1,0.523025,0.201699,22.739267,,10.889747,2016-01-09,53622.716804
1442,1035,1,114.23,1,1,0.529415,0.0,23.509649,,10.618643,2016-01-09,40889.103616
1648,1051,0,125.65,3,1,2.338666,1.455807,12.496512,,10.070386,2016-01-09,23631.679099
1854,1058,1,125.65,3,1,0.516235,0.311769,26.511083,,10.415411,2016-01-09,33368.95427
