# Dataset exploration

This notebook explores the BIP dataset to become familiar with its features. The project is based on a sales forecasting dataset with aggregated information for different products over a three-year time window. The full dataset is split chronologically into train and test sets as follows:

- 2 years and 6 months in the **train set**
- 6 months in the **test set**

Data is aggregated weekly from December 2016 to December 2019. The train set contains data for 43 SKUs (unique product identifiers), but the prediction target is restricted to 12 SKUs.
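
As a rough illustration of the chronological split described above, the sketch below builds a toy weekly calendar and cuts it at a fixed date. The cutoff `2019-06-22` is the last training week reported by `df.info()` later in this notebook; the calendar end date `2019-12-21` and the names `toy`, `cutoff`, `train` and `test` are assumptions made for the example, not part of the original data pipeline.

```python
import pandas as pd

# Toy weekly calendar of week-ending Saturdays.
# The end date is an assumption (the text only says "December 2019").
weeks = pd.date_range("2016-12-10", "2019-12-21", freq="W-SAT")
toy = pd.DataFrame({"target": range(len(weeks))}, index=weeks)

# Last week of the train set, taken from the index range shown by df.info().
cutoff = pd.Timestamp("2019-06-22")

# Everything up to and including the cutoff is train, the rest is test.
train = toy.loc[toy.index <= cutoff]
test = toy.loc[toy.index > cutoff]

print(len(train), "train weeks,", len(test), "test weeks")
```

Splitting on a date rather than on a random sample keeps every test week strictly after every train week, which is the usual precaution against leaking future information in forecasting tasks.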

In [77]:
import pandas as pd
from datetime import datetime

In [92]:
raw = pd.read_csv('data/train.csv')
raw = raw.rename(columns={ 'Unnamed: 0': 'date' })
raw.head()

| | date | sku | pack | size (GM) | brand | price | POS_exposed w-1 | volume_on_promo w-1 | sales w-1 | scope | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WE 10 December 2016 | 2689 | SINGLE | 395.41 | BRAND1 | 1.16 | NaN | NaN | NaN | 0 | 24175.0 |
| 1 | WE 17 December 2016 | 2689 | SINGLE | 395.41 | BRAND1 | 1.15 | 1.0 | 17.676112 | 24175.0 | 0 | 23521.0 |
| 2 | WE 24 December 2016 | 2689 | SINGLE | 395.41 | BRAND1 | 1.16 | 1.0 | 24.482803 | 23521.0 | 0 | 22075.0 |
| 3 | WE 31 December 2016 | 2689 | SINGLE | 395.41 | BRAND1 | 1.16 | 0.0 | 19.410646 | 22075.0 | 0 | 16492.0 |
| 4 | WE 07 January 2017 | 2689 | SINGLE | 395.41 | BRAND1 | 1.16 | 0.0 | 29.81203 | 16492.0 | 0 | 25971.0 |
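
The introduction states that the train set covers 43 SKUs, of which 12 are prediction targets. One way to check such counts is sketched below on a toy frame. It assumes that `scope == 1` marks the SKUs to be predicted, which the notebook does not confirm; the SKU ids other than 2689 and all `scope` values in the toy data are made up.

```python
import pandas as pd

# Toy frame: SKU ids other than 2689 and all scope values are hypothetical.
toy = pd.DataFrame({
    "sku":   [2689, 2689, 1111, 1111, 2222, 2222],
    "scope": [0, 0, 1, 1, 1, 1],
})

# Total number of distinct products.
n_skus = toy["sku"].nunique()

# Distinct products per scope value.
# Assumption: scope == 1 flags the SKUs whose sales must be predicted.
skus_per_scope = toy.groupby("scope")["sku"].nunique()

print(n_skus)
print(skus_per_scope)
```

Running the same two lines on `raw` would show whether the counts match the 43 and 12 quoted in the introduction.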


In [95]:
# Parse the "WE <day> <month> <year>" strings into datetimes and use them as the index
df = raw.copy()
df['date'] = pd.to_datetime(df['date'], format='WE %d %B %Y')
df = df.set_index('date')
df.head()

| date | sku | pack | size (GM) | brand | price | POS_exposed w-1 | volume_on_promo w-1 | sales w-1 | scope | target |
|---|---|---|---|---|---|---|---|---|---|---|
| 2016-12-10 | 2689 | SINGLE | 395.41 | BRAND1 | 1.16 | NaN | NaN | NaN | 0 | 24175.0 |
| 2016-12-17 | 2689 | SINGLE | 395.41 | BRAND1 | 1.15 | 1.0 | 17.676112 | 24175.0 | 0 | 23521.0 |
| 2016-12-24 | 2689 | SINGLE | 395.41 | BRAND1 | 1.16 | 1.0 | 24.482803 | 23521.0 | 0 | 22075.0 |
| 2016-12-31 | 2689 | SINGLE | 395.41 | BRAND1 | 1.16 | 0.0 | 19.410646 | 22075.0 | 0 | 16492.0 |
| 2017-01-07 | 2689 | SINGLE | 395.41 | BRAND1 | 1.16 | 0.0 | 29.81203 | 16492.0 | 0 | 25971.0 |
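
After parsing, it can be worth confirming that the dates really form a weekly series. The sketch below does this on three of the date strings shown earlier. It assumes that the `WE` prefix stands for "week ending" and that weeks end on Saturday; both are inferred from the sample rows rather than stated in the dataset documentation.

```python
import pandas as pd

# Three of the raw date strings shown in the first table.
dates = pd.Series([
    "WE 10 December 2016",
    "WE 17 December 2016",
    "WE 24 December 2016",
])
parsed = pd.to_datetime(dates, format="WE %d %B %Y")

# Assumption: weeks end on Saturday (dayofweek == 5, with Monday == 0).
all_saturday = (parsed.dt.dayofweek == 5).all()

# Consecutive observations should be exactly seven days apart.
weekly = (parsed.diff().dropna() == pd.Timedelta(days=7)).all()

print(all_saturday, weekly)
```

On the full frame the index repeats once per SKU, so the seven-day check would need to be applied within each `sku` group rather than on the index as a whole.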


## Dataset exploration

In [100]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5719 entries, 2016-12-10 to 2019-06-22
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   sku                  5719 non-null   int64  
 1   pack                 5719 non-null   object 
 2   size (GM)            5719 non-null   float64
 3   brand                5719 non-null   object 
 4   price                5719 non-null   float64
 5   POS_exposed w-1      5676 non-null   float64
 6   volume_on_promo w-1  5676 non-null   float64
 7   sales w-1            5676 non-null   float64
 8   scope                5719 non-null   int64  
 9   target               5719 non-null   float64
dtypes: float64(6), int64(2), object(2)
memory usage: 491.5+ KB
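
The three lagged columns each have 5676 non-null values out of 5719, that is 43 missing entries, which equals the number of SKUs. In addition, 5719 = 43 × 133, and 133 is the number of weeks from 2016-12-10 to 2019-06-22. This suggests, without proving it, that every SKU covers the full period and that only the first week of each SKU lacks lagged values, as in row 0 of the sample above. The sketch below shows how these hypotheses could be checked; the toy frame, SKU id 1027 and its sales figures are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with two SKUs and three weeks each (date kept as a column,
# i.e. the equivalent of df.reset_index() on the real data).
weeks = list(pd.date_range("2016-12-10", periods=3, freq="W-SAT"))
flat = pd.DataFrame({
    "date": weeks + weeks,
    "sku": [2689] * 3 + [1027] * 3,  # 1027 is a hypothetical SKU
    "sales w-1": [np.nan, 24175.0, 23521.0, np.nan, 100.0, 110.0],
    "target": [24175.0, 23521.0, 22075.0, 100.0, 110.0, 90.0],
})

# Number of missing lagged values per SKU.
missing_per_sku = (
    flat.assign(missing=flat["sales w-1"].isna())
        .groupby("sku")["missing"]
        .sum()
)

# Do the missing values fall only on the first observed week of each SKU?
first_week = flat.groupby("sku")["date"].min()
missing_rows = flat[flat["sales w-1"].isna()]
only_first_week = (
    missing_rows["date"] == missing_rows["sku"].map(first_week)
).all()

# Within a SKU, "sales w-1" should equal the previous week's target
# (rows are assumed to be sorted by date within each SKU).
previous_target = flat.groupby("sku")["target"].shift(1)
lag_matches = np.allclose(
    flat["sales w-1"].dropna(), previous_target.dropna()
)

print(missing_per_sku)
print(only_first_week, lag_matches)
```

If the same checks pass on the real frame, the missing values are a structural consequence of the one-week lag rather than a data quality problem, and dropping the first week of each SKU would be one reasonable way to handle them.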
