# Introduction

This notebook does some exploratory data analysis (EDA) on the M5 Forecasting - Accuracy competition data, downloaded from https://www.kaggle.com/c/m5-forecasting-accuracy/data.

In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
from fastai.tabular import *

In [5]:
data_path = Path('./data/')
data_path.ls()

[PosixPath('data/sell_prices.csv'),
 PosixPath('data/calendar.csv'),
 PosixPath('data/sales_train_validation.csv'),
 PosixPath('data/sample_submission.csv')]

## Data

The competition includes four csv files:

* calendar.csv - Contains information about the dates on which the products are sold.

* sales_train_validation.csv - Contains historical daily unit sales data per product and store (d_1 - d_1913)

* sell_prices.csv - Contains information about the prices of the products sold per store and date.

* sample_submission.csv - Correct format for submission.

### Calendar.csv

The data columns are defined below:

**date**: The date in a “y-m-d” format. 

**wm_yr_wk**: The id of the week the date belongs to. 

**weekday**: The type of the day (Saturday, Sunday, …, Friday). 

**wday**: The id of the weekday, starting from Saturday. 

**month**: The month of the date. 

**year**: The year of the date. 

**event_name_1**: If the date includes an event, the name of this
    event. 

**event_type_1**: If the date includes an event, the type of 
    this event. 

**event_name_2**: If the date includes a second event, the name of 
    this event. 

**event_type_2**: If the date includes a second event, the type 
    of this event. 

**snap_CA, snap_TX, and snap_WI**: Binary variables (0 or 1) 
    indicating whether the stores of CA, TX, or WI allow 
    SNAP purchases on the examined date. 1 indicates that 
    SNAP purchases are allowed. 

In [9]:
calendar_df = pd.read_csv(data_path/'calendar.csv')
calendar_df

Unnamed: 0,date,wm_yr_wk,weekday,wday,month,year,d,event_name_1,event_type_1,event_name_2,event_type_2,snap_CA,snap_TX,snap_WI
0,2011-01-29,11101,Saturday,1,1,2011,d_1,,,,,0,0,0
1,2011-01-30,11101,Sunday,2,1,2011,d_2,,,,,0,0,0
2,2011-01-31,11101,Monday,3,1,2011,d_3,,,,,0,0,0
3,2011-02-01,11101,Tuesday,4,2,2011,d_4,,,,,1,1,0
4,2011-02-02,11101,Wednesday,5,2,2011,d_5,,,,,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1964,2016-06-15,11620,Wednesday,5,6,2016,d_1965,,,,,0,1,1
1965,2016-06-16,11620,Thursday,6,6,2016,d_1966,,,,,0,0,0
1966,2016-06-17,11620,Friday,7,6,2016,d_1967,,,,,0,0,0
1967,2016-06-18,11621,Saturday,1,6,2016,d_1968,,,,,0,0,0


In [18]:
calendar_df.loc[calendar_df.event_name_1.notnull(), 'event_type_1'].unique()

array(['Sporting', 'Cultural', 'National', 'Religious'], dtype=object)

**Takeaways**: Categorize the relevant columns, add date parts, and add elapsed info.
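The takeaways above could be sketched in plain pandas as follows. This is a minimal sketch on a three-row toy frame, not the notebook's actual preprocessing: the derived column names (`day_of_month`, `day_of_year`, `is_month_end`, `days_elapsed`) are made up for illustration, and "elapsed" is read here as days since the first date, which is only one possible interpretation.

```python
import pandas as pd

# Toy stand-in for calendar.csv (same column names, only three rows).
calendar_df = pd.DataFrame({
    'date': ['2011-01-29', '2011-01-30', '2011-02-06'],
    'weekday': ['Saturday', 'Sunday', 'Sunday'],
    'd': ['d_1', 'd_2', 'd_9'],
    'event_name_1': [None, None, 'SuperBowl'],
    'event_type_1': [None, None, 'Sporting'],
})

# Parse the date strings so the .dt accessor is available.
calendar_df['date'] = pd.to_datetime(calendar_df['date'])

# Date parts (a hand-picked subset; the column names are illustrative).
calendar_df['day_of_month'] = calendar_df['date'].dt.day
calendar_df['day_of_year'] = calendar_df['date'].dt.dayofyear
calendar_df['is_month_end'] = calendar_df['date'].dt.is_month_end

# Elapsed info, assumed here to mean days since the first date in the table.
calendar_df['days_elapsed'] = (calendar_df['date'] - calendar_df['date'].min()).dt.days

# Categorize the text columns; dates without an event stay missing (NaN).
for col in ['weekday', 'event_name_1', 'event_type_1']:
    calendar_df[col] = calendar_df[col].astype('category')
```

Missing events are left as NaN rather than given their own category; whether that is the right choice depends on the model used later.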

### sales_train_validation.csv

The data columns are described below:

**item_id**: The id of the product. 

**dept_id**: The id of the department the product belongs to. 

**cat_id**: The id of the category the product belongs to. 

**store_id**: The id of the store where the product is sold. 

**state_id**: The State where the store is located. 

**d_1, d_2, …, d_i, … d_1913**: The number of units sold at day i, starting from 2011-01-29.  

In [52]:
sales_df = pd.read_csv(data_path/'sales_train_validation.csv')
sales_df

Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,d_1,d_2,d_3,d_4,...,d_1904,d_1905,d_1906,d_1907,d_1908,d_1909,d_1910,d_1911,d_1912,d_1913
0,HOBBIES_1_001_CA_1_validation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,1,3,0,1,1,1,3,0,1,1
1,HOBBIES_1_002_CA_1_validation,HOBBIES_1_002,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,HOBBIES_1_003_CA_1_validation,HOBBIES_1_003,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,2,1,2,1,1,1,0,1,1,1
3,HOBBIES_1_004_CA_1_validation,HOBBIES_1_004,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,1,0,5,4,1,0,1,3,7,2
4,HOBBIES_1_005_CA_1_validation,HOBBIES_1_005,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,2,1,1,0,1,1,2,2,2,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30485,FOODS_3_823_WI_3_validation,FOODS_3_823,FOODS_3,FOODS,WI_3,WI,0,0,2,2,...,2,0,0,0,0,0,1,0,0,1
30486,FOODS_3_824_WI_3_validation,FOODS_3_824,FOODS_3,FOODS,WI_3,WI,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
30487,FOODS_3_825_WI_3_validation,FOODS_3_825,FOODS_3,FOODS,WI_3,WI,0,6,0,2,...,2,1,0,2,0,1,0,0,1,0
30488,FOODS_3_826_WI_3_validation,FOODS_3_826,FOODS_3,FOODS,WI_3,WI,0,0,0,0,...,0,0,1,0,0,1,0,3,1,3


**Takeaways**: The units sold on each day are stored in separate columns (d_1 to d_1913). This will have to be reshaped into a new table with one row per item, store, and day. A few columns will also have to be categorized. Look into pandas' melt.

### sell_prices.csv

The data columns are described below:

**store_id**: The id of the store where the product is sold.  

**item_id**: The id of the product. 

**wm_yr_wk**: The id of the week. 

**sell_price**: The price of the product for the given week/store. The price is provided per week (average across seven days). If not available, this means that the product was not sold during the examined week. Note that although prices are constant on a weekly basis, they may change over time (in both the training and test sets).   

In [26]:
prices_df = pd.read_csv(data_path/'sell_prices.csv')
prices_df

Unnamed: 0,store_id,item_id,wm_yr_wk,sell_price
0,CA_1,HOBBIES_1_001,11325,9.58
1,CA_1,HOBBIES_1_001,11326,9.58
2,CA_1,HOBBIES_1_001,11327,8.26
3,CA_1,HOBBIES_1_001,11328,8.26
4,CA_1,HOBBIES_1_001,11329,8.26
...,...,...,...,...
6841116,WI_3,FOODS_3_827,11617,1.00
6841117,WI_3,FOODS_3_827,11618,1.00
6841118,WI_3,FOODS_3_827,11619,1.00
6841119,WI_3,FOODS_3_827,11620,1.00


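**Takeaways**: This data can be joined to the sales df on store_id, item_id, and wm_yr_wk. The sales df has no wm_yr_wk column, so the week id has to come from the calendar first.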
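One way to attach the prices to the sales data is a pair of left merges: first bring in `wm_yr_wk` from the calendar via the day label, then join the prices on store, item, and week. The sketch below uses tiny toy frames with made-up values, and it assumes the sales data is already in long format with a day column named `d` and a hypothetical `units_sold` column.

```python
import pandas as pd

# Toy long-format sales rows: one row per item, store, and day.
sales_long = pd.DataFrame({
    'item_id': ['HOBBIES_1_001', 'HOBBIES_1_001'],
    'store_id': ['CA_1', 'CA_1'],
    'd': ['d_1', 'd_8'],
    'units_sold': [0, 1],
})

# Toy calendar slice mapping each day label to its week id.
calendar_df = pd.DataFrame({'d': ['d_1', 'd_8'], 'wm_yr_wk': [11101, 11102]})

# Toy price rows (made-up price); week 11101 is deliberately missing.
prices_df = pd.DataFrame({
    'store_id': ['CA_1'],
    'item_id': ['HOBBIES_1_001'],
    'wm_yr_wk': [11102],
    'sell_price': [2.5],
})

# Step 1: attach the week id to every sales row.
merged = sales_long.merge(calendar_df, on='d', how='left')

# Step 2: attach the weekly price; rows without a price get NaN.
merged = merged.merge(prices_df, on=['store_id', 'item_id', 'wm_yr_wk'], how='left')
```

A left merge keeps every sales row, so weeks without a price show up as NaN in `sell_price`, which matches the column description above (no price means the product was not sold that week).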

### sample_submission.csv

**Takeaways**: The id column is a concatenation of item_id and store_id, followed by a `_validation` or `_evaluation` suffix. Forecasts for F1-F28 will have to be generated and written to a csv formatted like this.

In [29]:
sub_df = pd.read_csv(data_path/'sample_submission.csv')
sub_df

Unnamed: 0,id,F1,F2,F3,F4,F5,F6,F7,F8,F9,...,F19,F20,F21,F22,F23,F24,F25,F26,F27,F28
0,HOBBIES_1_001_CA_1_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,HOBBIES_1_002_CA_1_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,HOBBIES_1_003_CA_1_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,HOBBIES_1_004_CA_1_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,HOBBIES_1_005_CA_1_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60975,FOODS_3_823_WI_3_evaluation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60976,FOODS_3_824_WI_3_evaluation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60977,FOODS_3_825_WI_3_evaluation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60978,FOODS_3_826_WI_3_evaluation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
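The submission layout shown above can be reproduced with a few lines of pandas. This is a rough sketch on two toy item/store pairs, with all forecasts set to zero as placeholders; the `items` frame and the output file name are assumptions made for illustration.

```python
import pandas as pd

# Toy item/store pairs standing in for the rows of the sales table.
items = pd.DataFrame({
    'item_id': ['HOBBIES_1_001', 'FOODS_3_823'],
    'store_id': ['CA_1', 'WI_3'],
})

forecast_cols = [f'F{i}' for i in range(1, 29)]  # F1 .. F28

parts = []
for suffix in ['validation', 'evaluation']:
    # id = item_id + '_' + store_id + '_' + suffix
    ids = items['item_id'] + '_' + items['store_id'] + '_' + suffix
    # Zero forecasts are placeholders, as in sample_submission.csv.
    parts.append(pd.DataFrame({'id': ids, **{col: 0 for col in forecast_cols}}))

# Validation rows first, then evaluation rows, as in the sample file.
sub_df = pd.concat(parts, ignore_index=True)

# Writing the file is left commented out; the file name is an assumption.
# sub_df.to_csv('submission.csv', index=False)
```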


## Data Prep/Feature Engineering

The three data files (calendar, sales, and prices) can be combined into one table. This new csv will have the following columns:

* id 	
* item_id
* dept_id
* cat_id
* store_id
* state_id
* date
* all of the date parts
* elapsed columns
* wm_yr_wk
* units sold

In [53]:
df_test = sales_df[:500]
df_test.head()

Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,d_1,d_2,d_3,d_4,...,d_1904,d_1905,d_1906,d_1907,d_1908,d_1909,d_1910,d_1911,d_1912,d_1913
0,HOBBIES_1_001_CA_1_validation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,1,3,0,1,1,1,3,0,1,1
1,HOBBIES_1_002_CA_1_validation,HOBBIES_1_002,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,HOBBIES_1_003_CA_1_validation,HOBBIES_1_003,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,2,1,2,1,1,1,0,1,1,1
3,HOBBIES_1_004_CA_1_validation,HOBBIES_1_004,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,1,0,5,4,1,0,1,3,7,2
4,HOBBIES_1_005_CA_1_validation,HOBBIES_1_005,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,2,1,1,0,1,1,2,2,2,4


In [42]:
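# Day columns d_1 .. d_1913 become rows; the id columns are repeated for each day.
id_cols = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']
cols_to_melt = [f'd_{i}' for i in range(1, 1914)]
df_melt = df_test.melt(id_vars=id_cols, value_vars=cols_to_melt)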

In [44]:
df_melt

Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,variable,value
0,HOBBIES_1_001_CA_1_validation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0
1,HOBBIES_1_002_CA_1_validation,HOBBIES_1_002,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0
2,HOBBIES_1_003_CA_1_validation,HOBBIES_1_003,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0
3,HOBBIES_1_004_CA_1_validation,HOBBIES_1_004,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0
4,HOBBIES_1_005_CA_1_validation,HOBBIES_1_005,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0
...,...,...,...,...,...,...,...,...
956495,HOBBIES_2_080_CA_1_validation,HOBBIES_2_080,HOBBIES_2,HOBBIES,CA_1,CA,d_1913,0
956496,HOBBIES_2_081_CA_1_validation,HOBBIES_2_081,HOBBIES_2,HOBBIES,CA_1,CA,d_1913,0
956497,HOBBIES_2_082_CA_1_validation,HOBBIES_2_082,HOBBIES_2,HOBBIES,CA_1,CA,d_1913,0
956498,HOBBIES_2_083_CA_1_validation,HOBBIES_2_083,HOBBIES_2,HOBBIES,CA_1,CA,d_1913,0


In [72]:
# Rename the 'variable' column to 'd' so it matches the calendar's day column
df_melt = df_melt.rename(columns={'variable': 'd'})
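The melt and rename steps above can also be folded into a single call, since `DataFrame.melt` accepts `var_name` and `value_name`. The sketch below shows this on a small toy frame; the `units_sold` and `day_num` column names are assumptions chosen for illustration, not names used elsewhere in this notebook.

```python
import pandas as pd

# Toy wide frame laid out like sales_train_validation.csv (three day columns).
wide = pd.DataFrame({
    'id': ['HOBBIES_1_001_CA_1_validation', 'HOBBIES_1_002_CA_1_validation'],
    'item_id': ['HOBBIES_1_001', 'HOBBIES_1_002'],
    'd_1': [0, 0],
    'd_2': [1, 0],
    'd_3': [3, 2],
})

# Name the output columns directly instead of renaming afterwards.
long = wide.melt(id_vars=['id', 'item_id'], var_name='d', value_name='units_sold')

# Numeric day index, useful for sorting because 'd_10' < 'd_2' as strings.
long['day_num'] = long['d'].str.replace('d_', '', regex=False).astype(int)
```

When `value_vars` is omitted, every column not listed in `id_vars` is melted, so the toy frame yields two rows per day column.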