# M5: First Look at the Data

Not the world's prettiest notebook, but good enough to get a feel for the data in this competition and how to use it.

***

### Loading packages

In [None]:
import os
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Input files
***
### File Paths

In [None]:
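# Map the first four characters of each file name to its full path.
# Note: files whose names share the same four-character prefix overwrite each other.
files = {}
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        files[filename[:4]] = os.path.join(dirname, filename)
files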

### Loading files

The data is not too big: everything can be loaded in one go in around 12 seconds.

In [None]:
%%time
train_df, cal_df, prc_df, sub_df = [pd.read_csv(files[f]) for f in ['sale', 'cale', 'sell', 'samp']]

## Understanding the data

***

### Training data

* The primary key is the `id`, which is a combination of all remaining categorical columns:
  * `item_id` = `dept_id` + `_{nnn}`; `dept_id` = `cat_id` + `_{n}` 
  * `store_id` = `state_id` + `_{n}`
* Item sales are stored in wide format in columns `d_1`, ... `d_1913`
  * These are the **target (y)** values for training!

In [None]:
train_df.sample(5)
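
The key structure described above can be sanity-checked with a small sketch. The two toy rows below are illustrative values, not read from the competition files, and the helper `is_prefixed` is a hypothetical name introduced only for this example. The `_validation` suffix follows the way the `id` is rebuilt in the price data section.

```python
import pandas as pd

# Toy rows mimicking the categorical columns of the sales file (illustrative values).
toy = pd.DataFrame({
    "id": ["HOBBIES_1_001_CA_1_validation", "FOODS_3_090_TX_2_validation"],
    "item_id": ["HOBBIES_1_001", "FOODS_3_090"],
    "dept_id": ["HOBBIES_1", "FOODS_3"],
    "cat_id": ["HOBBIES", "FOODS"],
    "store_id": ["CA_1", "TX_2"],
    "state_id": ["CA", "TX"],
})

# Rebuild the id from item_id and store_id and compare it with the stored id.
rebuilt = toy["item_id"] + "_" + toy["store_id"] + "_validation"
id_ok = bool((rebuilt == toy["id"]).all())

def is_prefixed(df, child, parent):
    """Return True if every child value starts with its parent value plus '_'."""
    return all(c.startswith(p + "_") for c, p in zip(df[child], df[parent]))

# Check the nesting: item -> department -> category, and store -> state.
nesting_ok = (
    is_prefixed(toy, "item_id", "dept_id")
    and is_prefixed(toy, "dept_id", "cat_id")
    and is_prefixed(toy, "store_id", "state_id")
)
print(id_ok, nesting_ok)
```

Running the same checks on `train_df` instead of `toy` would confirm whether the pattern holds for every row.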

In [None]:
print("There are %d unique item ids to forecast!"%train_df.shape[0])

One thing that should be noted is that there are **a lot** of items with days that have no sales, as shown in the figure below (where blue represents days with sales).
* It is quite possible that some items weren't on sale from the very beginning, which seems to be confirmed by the price data (see the section on the price data).

In [None]:
n = 100 # number of items to sample
sales = train_df[[c for c in train_df.columns if c.startswith('d_')]].sample(n)
fig, ax = plt.subplots(1, 1, facecolor='w', figsize=(15,10))
ax = sns.heatmap(sales>0, cbar=False, xticklabels=False, yticklabels=False, cmap="GnBu")
plt.title("Heatmap of >0 sales indicator for %d randomly selected items"%n, fontsize=16)
plt.ylabel("Items")
plt.xlabel("Time")
plt.show()

### Submission file

The submission file requires two types of predictions:

* `'validation'`: this is for the initial training stage and covers `d_1914` to `d_1941`
* `'evaluation'`: this is for the final submission stage and covers `d_1942` to `d_1969`

In [None]:
sub_df['type'] = sub_df['id'].apply(lambda x: x.split('_')[-1]).astype('category')
sub_df['type'].value_counts()

Since we currently only have `'validation'` stage data, we should extract only `id`s that have the `'validation'` suffix, which can be matched against the `id` column from `train_df`:

In [None]:
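# Use the `columns` keyword: a positional axis argument is not accepted by recent pandas versions
val_df = sub_df[sub_df['type']=='validation'].drop(columns='type')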
val_df.sample(5)

### Calendar Data

The calendar data contains day-specific information for **all** 1969 days in the training, validation and evaluation time periods. These columns can be merged onto the training set via the `d` column and used as features. Since they are known ex ante, they are also available at forecast time.

In [None]:
print("First 3 rows of the calendar data:")
display(cal_df.head(3))
print("Last 3 rows of the calendar data:")
display(cal_df.tail(3))
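
As a rough sketch of the merge on `d` described above, the wide sales columns can first be melted into long format and then joined to the calendar. The toy frames below are made up for illustration, and the calendar columns `wm_yr_wk` and `wday` are assumptions about the calendar layout rather than something shown in this notebook.

```python
import pandas as pd

# Toy wide-format sales: one row per id, one column per day (illustrative values).
toy_sales = pd.DataFrame({
    "id": ["A_validation", "B_validation"],
    "d_1": [0, 2],
    "d_2": [1, 0],
    "d_3": [3, 0],
})

# Toy calendar: one row per day, keyed by `d` (other columns are assumed names).
toy_cal = pd.DataFrame({
    "d": ["d_1", "d_2", "d_3", "d_4"],
    "wm_yr_wk": [11101, 11101, 11102, 11102],
    "wday": [1, 2, 3, 4],
})

# Wide -> long: one row per (id, day) pair.
long_sales = toy_sales.melt(id_vars="id", var_name="d", value_name="sales")

# Attach the day-specific calendar columns to every sales row.
merged = long_sales.merge(toy_cal, on="d", how="left")
print(merged)
```

On the full data the long frame has one row per item and day, so memory use is worth watching before applying this to `train_df` as a whole.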

### Price data

The price data contains weekly prices for every `id`, spanning the training, validation and evaluation periods.

* Importantly, this means we have price data available ex ante, allowing us to use future prices when making predictions!

In [None]:
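# Build the same `id` as in the sales data so the two tables can be matched
prc_df.loc[:,'id'] = prc_df['item_id'] + '_' + prc_df['store_id'] + '_validation'
# Use the `columns` keyword: a positional axis argument is not accepted by recent pandas versions
prc_df = prc_df.drop(columns=['store_id', 'item_id'])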
display(prc_df.sample(5))
print("Earliest week: %d"%prc_df['wm_yr_wk'].min())
print("Latest week: %d"%prc_df['wm_yr_wk'].max())
print("Number of unique id: %d"%len(prc_df['id'].unique()))

The price data can also help pinpoint when certain items may not have been available:

* First, we can see that **all** items are available and have price data up to the latest period in the data, i.e. week 11621 (there may still be gaps in between, which should be checked later...)
* Second, we can see that some items are not available from the very beginning (though most are)
  * This needs to be accounted for, as some 0 sale counts in the training data set should in fact be NaN, because the item was not yet on sale!
  * Also there's some seasonality to when new products get introduced - not sure if this matters...
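
The possible gaps mentioned in the first bullet could be checked along the following lines. This is only a sketch on toy data: `all_weeks` stands in for the ordered list of week codes from the calendar (an assumption about where that list would come from), and week codes are converted to positions first because `wm_yr_wk` values are not guaranteed to be consecutive integers.

```python
import pandas as pd

# Ordered week codes, standing in for the unique weeks of the calendar (toy values).
all_weeks = pd.Series([11101, 11102, 11103, 11104, 11105])

# Toy price rows: item A skips week 11104, item B has no gap.
toy_prices = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B", "B"],
    "wm_yr_wk": [11102, 11103, 11105, 11101, 11102, 11103, 11104],
})

# Position of each week code in the ordered calendar.
week_pos = pd.Series(range(len(all_weeks)), index=all_weeks.values)
pos = toy_prices["wm_yr_wk"].map(week_pos)

# Weeks expected between first and last price vs. weeks actually observed.
grp = pos.groupby(toy_prices["id"])
expected = grp.max() - grp.min() + 1
observed = grp.count()
missing = expected - observed
print(missing)
```

Any `id` with a positive value in `missing` has at least one week without a price between its first and last priced week.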

In [None]:
id_grp = prc_df.groupby('id')
min_wk = id_grp['wm_yr_wk'].min()
max_wk = id_grp['wm_yr_wk'].max()
print("Summary statistics for latest week of prices for each item shows that ALL items have prices available in the last week:")
display(max_wk.describe())
print("New items appear to be rolled out in a staggered, seasonal fashion:")
fig, ax = plt.subplots(1, 1, facecolor='w', figsize=(12,6))
ax.hist(min_wk, bins=25)
plt.title('Minimum week across different item IDs', fontsize=16)
plt.show()
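
One way to act on the point about zeros that should be NaN is to blank out sales recorded before an item's first priced week. The sketch below uses toy long-format data and assumes each sales row already carries its `wm_yr_wk` (for example after a merge with the calendar). The `sell_price` column name is an assumption about the price file.

```python
import pandas as pd

# Toy long-format sales with a week code per row (illustrative values).
toy_long = pd.DataFrame({
    "id": ["A"] * 4 + ["B"] * 4,
    "wm_yr_wk": [11101, 11102, 11103, 11104] * 2,
    "sales": [0, 0, 3, 0, 1, 0, 2, 0],
})

# Toy prices: item A only has prices from week 11103 onwards.
toy_prices = pd.DataFrame({
    "id": ["A", "A", "B", "B", "B", "B"],
    "wm_yr_wk": [11103, 11104, 11101, 11102, 11103, 11104],
    "sell_price": [2.5, 2.5, 1.0, 1.0, 1.0, 1.0],
})

# First week with a price for each id, taken as a proxy for the launch week.
first_wk = toy_prices.groupby("id")["wm_yr_wk"].min()

# Keep sales from the launch week onwards; earlier rows become NaN.
launched = toy_long["wm_yr_wk"] >= toy_long["id"].map(first_wk)
masked = toy_long.assign(sales=toy_long["sales"].where(launched))
print(masked)
```

Zeros after the launch week are kept as genuine zero-sales days, and only the pre-launch rows are turned into NaN.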

*** 