# M5 Challenge - Accuracy 
<img src="https://asset.barrons.com/public/resources/images/ON-CJ684_quizgr_B620_20171229134559.jpg" width="500" height="300" />

This notebook is a simple EDA for the M5 challenge accuracy competition, updated regularly.

Things to take into consideration:
- There are two parallel competitions, **Accuracy** and **Uncertainty**, each evaluated with a different metric.
- The data cover stores in three US states (California, Texas, and Wisconsin).
- The data are split across the following three CSV files:

# File 1: “calendar.csv” 
Contains information about the dates the products are sold.
* date: The date in a “y-m-d” format.
* wm_yr_wk: The id of the week the date belongs to.
* weekday: The type of the day (Saturday, Sunday, …, Friday).
* wday: The id of the weekday, starting from Saturday.
* month: The month of the date.
* year: The year of the date.
* event_name_1: If the date includes an event, the name of this event.
* event_type_1: If the date includes an event, the type of this event.
* event_name_2: If the date includes a second event, the name of this event.
* event_type_2: If the date includes a second event, the type of this event.
* snap_CA, snap_TX, and snap_WI: A binary variable (0 or 1) indicating whether the stores of CA, TX or WI allow SNAP purchases on the examined date. 1 indicates that SNAP purchases are allowed.
    
# File 2: “sell_prices.csv”
Contains information about the price of the products sold per store and date.
* store_id: The id of the store where the product is sold. 
* item_id: The id of the product.
* wm_yr_wk: The id of the week.
* sell_price: The price of the product for the given week/store. The price is provided per week (average across seven days). If not available, this means that the product was not sold during the examined week. Note that although prices are constant on a weekly basis, they may change over time (in both the training and test sets).
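
Because prices are stored per week while the calendar is per date, a daily price has to be recovered through the shared `wm_yr_wk` key. The following is a minimal sketch of that join, assuming toy stand-ins for the two files (`calendar_toy`, `sell_prices_toy`, and all of their values are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for calendar.csv and sell_prices.csv (all values are invented)
calendar_toy = pd.DataFrame({
    'date': ['2011-01-29', '2011-01-30', '2011-02-05'],
    'wm_yr_wk': [11101, 11101, 11102],
})
sell_prices_toy = pd.DataFrame({
    'store_id': ['CA_1', 'CA_1'],
    'item_id': ['HOBBIES_1_004', 'HOBBIES_1_004'],
    'wm_yr_wk': [11101, 11102],
    'sell_price': [4.34, 4.64],
})

# A left join on the week id repeats each weekly price for every date in that week
daily_prices = calendar_toy.merge(sell_prices_toy, on='wm_yr_wk', how='left')
print(daily_prices[['date', 'store_id', 'item_id', 'sell_price']])
```

With the real files, a week with no price row for a given product would simply have no match, which is consistent with the "not sold during the examined week" note above.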

# File 3: “sales_train.csv” 
Contains the historical daily unit sales data per product and store. In this notebook it is loaded from `sales_train_validation.csv`.
* item_id: The id of the product.
* dept_id: The id of the department the product belongs to.
* cat_id: The id of the category the product belongs to.
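* store_id: The id of the store where the product is sold.
* state_id: The state where the store is located.
* id: The id of the series, for example `HOBBIES_1_004_CA_1_validation`.
* d_1, d_2, …: The number of units sold on each day, one column per day.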
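
The sales file is wide: the code later in this notebook selects its daily columns by the `d_` prefix and lines them up with the calendar by position. A long layout (one row per series and day) is often easier to work with, so here is a rough sketch of the reshaping. The toy frames and their values are made up, and the mapping "`d_k` is the k-th calendar row" is an assumption carried over from how the plots below align the two tables:

```python
import pandas as pd

# Toy stand-in for the sales file (values are invented)
sales_toy = pd.DataFrame({
    'id': ['HOBBIES_1_004_CA_1_validation', 'FOODS_1_001_TX_1_validation'],
    'item_id': ['HOBBIES_1_004', 'FOODS_1_001'],
    'dept_id': ['HOBBIES_1', 'FOODS_1'],
    'cat_id': ['HOBBIES', 'FOODS'],
    'store_id': ['CA_1', 'TX_1'],
    'state_id': ['CA', 'TX'],
    'd_1': [0, 3],
    'd_2': [1, 0],
    'd_3': [2, 5],
})
calendar_toy = pd.DataFrame({'date': ['2011-01-29', '2011-01-30', '2011-01-31']})

# Split the columns into identifiers and daily sales
day_cols = [c for c in sales_toy.columns if c.startswith('d_')]
id_cols = [c for c in sales_toy.columns if c not in day_cols]

# Wide -> long: one row per (series, day)
long_sales = sales_toy.melt(id_vars=id_cols, value_vars=day_cols,
                            var_name='d', value_name='sales')

# Assumption: d_k corresponds to the k-th row of the calendar (positional alignment)
day_to_date = dict(zip(day_cols, pd.to_datetime(calendar_toy['date'])))
long_sales['date'] = long_sales['d'].map(day_to_date)

print(long_sales[['id', 'd', 'date', 'sales']])
```

Note that on the full dataset the long frame has one row per series per day, so it is much larger in memory than the wide one.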

To read more about the competition, see the competitors' guide: https://mofc.unic.ac.cy/m5-competition/

Let's load our packages

In [None]:
#Data manipulation
import pandas as pd
import numpy as np

#Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import cycle

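# Use the full option name; the short 'max_columns' pattern is ambiguous in recent pandas versions
pd.set_option('display.max_columns', 50)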

Get our data

In [None]:
# Read in the data
INPUT_DIR = '../input/m5-forecasting-accuracy'
calendar = pd.read_csv(f'{INPUT_DIR}/calendar.csv')
sales_train_validation = pd.read_csv(f'{INPUT_DIR}/sales_train_validation.csv')
sample_sub = pd.read_csv(f'{INPUT_DIR}/sample_submission.csv')
sell_prices = pd.read_csv(f'{INPUT_DIR}/sell_prices.csv')

First of all, let's take a look at what we are handling here

In [None]:
# Printing shapes

print('The sell prices size is:',sell_prices.shape)
print('The calendar size is:',calendar.shape)
print('The sales_train_validation size is:',sales_train_validation.shape)

In [None]:
# Head of the data

sell_prices.head()

In [None]:
calendar.head()

In [None]:
sales_train_validation.head()

In [None]:
# fig, ax = plt.subplots()
# ax.pie(pd.DataFrame(sales_train_validation.groupby('cat_id').id.count()).reset_index(drop=True))


# Time series visualizations

In [None]:

# First, let's gather all the daily sales columns (d_1, d_2, ...)
time = [column for column in sales_train_validation.columns if column.startswith('d_')]

# Let's plot the daily sales of a single item
sns.set(rc={'figure.figsize':(22.7,12.27)})
sns.set_style('whitegrid')
sns.set_context('talk')
plt.xticks(np.arange(min(calendar['date'].index), max(calendar['date'].index)+1, 150.0))
# The selected row has index 3, so its column is named 3 after transposing; rename it to 'sales'
sns.lineplot(data = pd.concat([pd.DataFrame(sales_train_validation.loc[sales_train_validation['id'] == 'HOBBIES_1_004_CA_1_validation',
                                time].T.reset_index()).rename(columns={3: 'sales'}), calendar['date']], axis=1),
             x='date', y='sales').set(title='Time series from id HOBBIES_1_004_CA_1_validation')
plt.xticks(rotation=45)

This behaviour looks much like what we usually see in daily data. Let's look at aggregated data.

In [None]:
# Aggregate daily sales per category once: one row per day, one column per category
category_sales = pd.DataFrame(sales_train_validation.groupby('cat_id')[time].sum().T.reset_index())
categories = category_sales.columns[1:]
plot_data = pd.concat([category_sales, calendar['date']], axis=1)

# Let's plot everything on the same axes
sns.set(rc={'figure.figsize':(22.7,12.27)})
sns.set_style('whitegrid')
sns.set_context('talk')
plt.xticks(np.arange(min(calendar['date'].index), max(calendar['date'].index)+1, 150.0))
for i in categories:
    sns.lineplot(data=plot_data, x='date', y=i).set(title='Categories time series')
plt.xticks(rotation=45)
plt.legend(labels=categories.values)

The behaviour is different here: there appears to be a recurring, seasonal drop in sales, which is strange.
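
One way to dig into such a drop is to list the dates with the lowest total sales and check whether they fall on the same day every year. The sketch below shows the idea on an invented series; the dates, the values, and the pattern of one zero-sales day per year are all made up for illustration, and the real data may show something different:

```python
import pandas as pd

# Toy daily totals over two years (values are invented)
dates = pd.date_range('2011-01-01', '2012-12-31', freq='D')
total_sales = pd.Series(100, index=dates)

# Invented pattern: sales collapse to zero on one fixed date each year
drop_days = (dates.month == 12) & (dates.day == 25)
total_sales.loc[drop_days] = 0

# The smallest totals point straight at the dates behind the drop
lowest = total_sales.nsmallest(2)
print(lowest.index.strftime('%Y-%m-%d').tolist())
```

In this notebook, the analogous series would be the per-day sum of the daily sales columns, aligned with `calendar['date']` in the same positional way the plots do.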

In [None]:
# Let's take a look at the same plot, but separated by state!

categories = pd.DataFrame(sales_train_validation.groupby(['cat_id', 'state_id'])[time].sum().T.reset_index()).columns[1:]

for i in range(len(categories)):
    # Let's plot everything
    sns.set(rc={'figure.figsize':(22.7,16.27)})
    sns.set_style('whitegrid')
    sns.set_context('talk')
    plt.xticks(np.arange(min(calendar['date'].index), max(calendar['date'].index)+1, 150.0))
    sns.lineplot(data = pd.concat([pd.DataFrame(sales_train_validation.groupby(['cat_id', 'state_id'])[time].sum().T.reset_index()), calendar['date']], axis=1),
                 x='date', y=categories[i]).set(title='Categories time series')
    plt.xticks(rotation=45)
    plt.legend(labels=categories.values)


This turns out to be difficult to read. The differences between states are large, and CA seems to have the highest sales volume.
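
When the lines are hard to compare visually, a small table of totals can make the gap between states explicit. The following is a hedged sketch on toy data (the frame and its numbers are invented); it also divides by the number of stores, since a state with more stores would naturally show a higher raw volume:

```python
import pandas as pd

# Toy stand-in for the sales file, one row per store for brevity (values are invented)
sales_toy = pd.DataFrame({
    'store_id': ['CA_1', 'CA_2', 'TX_1', 'WI_1'],
    'state_id': ['CA', 'CA', 'TX', 'WI'],
    'd_1': [10, 20, 5, 4],
    'd_2': [30, 40, 15, 6],
})
day_cols = [c for c in sales_toy.columns if c.startswith('d_')]

# Total units per row, then aggregate per state
sales_toy['total'] = sales_toy[day_cols].sum(axis=1)
by_state = sales_toy.groupby('state_id').agg(
    total_units=('total', 'sum'),
    n_stores=('store_id', 'nunique'),
)

# Normalise by store count and express each state as a share of all sales
by_state['units_per_store'] = by_state['total_units'] / by_state['n_stores']
by_state['share'] = by_state['total_units'] / by_state['total_units'].sum()
print(by_state)
```

The same aggregation should carry over to `sales_train_validation`, using the `time` column list in place of `day_cols`.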

In [None]:
# #Can we see how stores are performing in each state?
# #Let's take a look at the same graphic but separated by states!

# categories = pd.DataFrame(sales_train_validation.groupby(['cat_id'])[time].sum().T.reset_index()).columns[1:]
# states = pd.DataFrame(sales_train_validation.groupby(['state_id'])[time].sum().T.reset_index()).columns[1:]
# stores = pd.DataFrame(sales_train_validation.groupby(['store_id'])[time].sum().T.reset_index()).columns[1:]


# for i in range(len(categories)):
#     # Lets plot everything
#     sns.set(rc={'figure.figsize':(22.7,16.27)})
#     sns.set_style('whitegrid')
#     sns.set_context('talk')
#     plt.subplot(3,1,)
#     plt.xticks(np.arange(min(calendar['date'].index), max(calendar['date'].index)+1, 150.0))
#     sns.lineplot(data = pd.concat([pd.DataFrame(sales_train_validation.groupby(['cat_id', 'state_id'])[time].sum().T.reset_index()), calendar['date']], axis=1),
#                  x='date', y=categories[i]).set(title='Categories time series')
#     plt.xticks(rotation=45)
#     plt.legend(labels=categories.values)
