# Some opening remarks

I consider myself a competent R coder; however, this is my first proper foray into Python, so please forgive any bad practices (and feel free to draw my attention to them).

Full disclosure: I have copied a few imgur links from the [It is time for M5. Going step by step](https://www.kaggle.com/artgor/it-is-time-for-m5-going-step-by-step) notebook as they're really useful!


# Introduction

In this competition, the fifth iteration, you will use hierarchical sales data from Walmart, the world's largest company by revenue, to forecast daily sales for the next 28 days. The data covers stores in three US states (California, Texas, and Wisconsin) and includes item level, department, product categories, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. Together, this robust dataset can be used to improve forecasting accuracy.

*Note that this is the **point estimate** competition (there is also a complementary competition running which is concerned with predicting the uncertainty distribution).*

### Evaluation
This competition uses a Weighted Root Mean Squared Scaled Error (WRMSSE):
![](https://i.imgur.com/uqhsf3d.png)
![](https://i.imgur.com/B1hglCf.png)
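
For anyone who prefers code to formula images, here is a minimal sketch of the RMSSE for a single series and the weighted sum across series. The function names and toy numbers are my own, not from the competition. As I understand it, the official metric also starts the scaling term at the first non-zero sale of each series and derives the weights from recent dollar sales; neither detail is handled here.

```python
import math

def rmsse(train, actual, forecast):
    """RMSSE for one series: the forecast MSE scaled by the in-sample MSE
    of the naive 'same as yesterday' forecast.
    Note: a constant training series gives a zero scale (division by zero)."""
    h = len(actual)
    n = len(train)
    mse_forecast = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / h
    mse_naive = sum((train[t] - train[t - 1]) ** 2 for t in range(1, n)) / (n - 1)
    return math.sqrt(mse_forecast / mse_naive)

def wrmsse(scores, weights):
    """Weighted sum of per-series RMSSE scores (weights assumed to sum to 1)."""
    return sum(w * s for w, s in zip(weights, scores))

# Toy data (hypothetical): a short training history and two forecast days
train = [1, 2, 3, 4]
perfect = rmsse(train, actual=[5, 6], forecast=[5, 6])
off = rmsse(train, actual=[5, 6], forecast=[4, 4])
combined = wrmsse([perfect, off], weights=[0.5, 0.5])
```

Lower is better, and a score of 1 means the forecast is about as wrong as the naive previous-day forecast was on the training period.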

### Submission
Each row contains an id that is a concatenation of an item_id, a store_id, and a suffix that is either validation (corresponding to the Public leaderboard) or evaluation (corresponding to the Private leaderboard). You are predicting 28 forecast days (F1-F28) of items sold for each row. For the validation rows, this corresponds to d_1914 - d_1941, and for the evaluation rows, this corresponds to d_1942 - d_1969. (Note: a month before the competition closes, the ground truth for the validation rows will be provided.)
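
To make the id and column layout concrete, here is a small sketch. The helper names `submission_id` and `forecast_day` are hypothetical (my own), and the item and store values are just example inputs; only the day ranges come from the description above.

```python
def submission_id(item_id, store_id, split):
    """Build a submission row id from its three parts."""
    if split not in ("validation", "evaluation"):
        raise ValueError("split must be 'validation' or 'evaluation'")
    return f"{item_id}_{store_id}_{split}"

def forecast_day(f_index, split):
    """Map forecast column F1..F28 onto the matching d_ day label."""
    first_day = {"validation": 1914, "evaluation": 1942}[split]
    if not 1 <= f_index <= 28:
        raise ValueError("f_index must be between 1 and 28")
    return f"d_{first_day + f_index - 1}"

# The 28 forecast columns expected in each submission row
forecast_columns = [f"F{i}" for i in range(1, 29)]

# Example inputs only
example_id = submission_id("HOBBIES_1_001", "CA_1", "validation")
```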

# Understanding the Data

The data covers stores in three US states (California, Texas, and Wisconsin) and includes item level, department, product categories, and store details split as follows:

![](https://i.imgur.com/C5hASXe.png)

Let's take a deeper dive!

# Exploratory Analysis

## Setup

In [None]:
# libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import scipy
import os
import plotly.express as px
from scipy import stats

# read in data provided by kaggle for the competition
source_data_path = '/kaggle/input/m5-forecasting-accuracy'
source_filesinfolder = os.listdir(source_data_path)
# keep only the CSV files and strip the extension to get one name per dataset
source_filenames = [file[:-4] for file in source_filesinfolder if file.endswith('.csv')]
# create one DataFrame variable per file (e.g. calendar, sell_prices)
for file in source_filenames:
    globals()[file] = pd.read_csv(f'{source_data_path}/{file}.csv')
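
Writing into `globals()` works, but it hides where the variables come from. A possible alternative, sketched here under the assumption that a dictionary lookup is acceptable, is to keep the DataFrames in a dict keyed by file name. The function name `load_csvs` is my own; the rest of this notebook still uses the global variables.

```python
from pathlib import Path

import pandas as pd

def load_csvs(folder):
    """Read every .csv file in `folder` into a dict keyed by file name (without extension)."""
    return {
        path.stem: pd.read_csv(path)
        for path in sorted(Path(folder).glob("*.csv"))
    }

# Usage on Kaggle (not run here):
# data = load_csvs('/kaggle/input/m5-forecasting-accuracy')
# data['sell_prices'].head()
```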

## Quick Peek at the Datasets

In [None]:
print(source_filenames)
for file in source_filenames:
    print(globals()[file])

# High level price and sale info

We've got over 3k items, so we can't easily look at each individually, but let's look at some high level info on sales, prices within categories, and prices across time & stores.

We'll calculate some item level summaries:


In [None]:
# Get a dataset with info on each item
item_info = sales_train_validation[["dept_id","cat_id","item_id"]].drop_duplicates()

# Join category onto sell_prices
sell_prices_for_eda = (
    sell_prices
    .merge(
        item_info
        , how = "left"
        , on = "item_id"
    )
)

# Summarise by item & store
item_store_summaries = (
    sell_prices_for_eda
    .groupby(['item_id', 'store_id' ,'dept_id', 'cat_id'])['sell_price']
    .agg(
        price_mode = lambda x: x.mode().iloc[0]  # most common price; stats.mode(x)[0][0] fails on newer SciPy, where mode returns a scalar
        , price_mean = 'mean'
        , price_min = 'min'
        , price_max = 'max'
        , price_sd = 'std'  # sample standard deviation (pandas default, ddof=1)
        , price_sum = 'sum'
        , price_size = 'size'
    )
    .reset_index()
    .assign(max_discount_from_peak = lambda x: 1 - x['price_min']/x['price_max'])
)

# Summarise by item
item_summaries = (
    item_store_summaries
    .groupby(['item_id','dept_id'])['price_mode']
    .agg(
        price_mode__nationwide_mean = 'mean'
        , min = 'min'
        , max = 'max'
    )
    .reset_index()
    .assign(price_mode__nationwide_range = lambda x: x['max'] - x['min'])
)
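
To sanity-check what these summaries produce, here is the same pattern on a tiny made-up table (the items, stores, and prices are invented for illustration). It uses the pandas `Series.mode` method for the most common price.

```python
import pandas as pd

# Made-up prices: item A sells in two stores, with one markdown and one markup in S1
toy_prices = pd.DataFrame({
    "item_id": ["A", "A", "A", "A", "A", "A"],
    "store_id": ["S1", "S1", "S1", "S1", "S2", "S2"],
    "sell_price": [2.0, 2.0, 1.5, 2.5, 3.0, 3.0],
})

toy_summary = (
    toy_prices
    .groupby(["item_id", "store_id"])["sell_price"]
    .agg(
        price_mode=lambda x: x.mode().iloc[0],  # most common price
        price_min="min",
        price_max="max",
    )
    .reset_index()
    .assign(max_discount_from_peak=lambda x: 1 - x["price_min"] / x["price_max"])
)
```

In store S1 the standard price is 2.0 and the deepest discount from the peak price is 1 - 1.5/2.5 = 0.4; in S2 the price never moves, so the discount is 0.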



And summaries of prices over time:

In [None]:
# Summarise by time
time_dept_summaries = (
    sell_prices_for_eda
    .groupby(['wm_yr_wk','dept_id'])['sell_price']
    .agg(
        perct_25th = lambda x: np.percentile(x, q = 25)
        , perct_75th = lambda x: np.percentile(x, q = 75)
        , perct_95th = lambda x: np.percentile(x, q = 95)
        , median = 'median'
        , mean = 'mean'
    )
    .reset_index()
    .merge(
        calendar.groupby('wm_yr_wk').first().reset_index()[['wm_yr_wk','date']].rename(columns={'date':'start_of_week'})
        , how = "left"
        , on = "wm_yr_wk"
    )
)

And now let's visualise some of this...

### Number of items per department

In [None]:
px.histogram(
    item_info
    , x = "dept_id"
    , title = 'Count of unique items per department'
).show()

### Distributions of standard (mode) item prices across dept and store


In [None]:
px.box(
    item_store_summaries
    , x = "dept_id"
    , y = "price_mode"
    , color = "store_id"
    , title = 'Distributions of standard (mode) item prices across dept and store'
).show()

### Distributions of item prices (nationwide mean of the store mode prices) across departments

That's a mouthful, but this should be the fairest perception of a price distribution within a department

In [None]:
px.box(
    item_summaries
    , x = 'dept_id'
    , y = 'price_mode__nationwide_mean'
    #, color = "store_id"
    , title = 'Distributions of item prices across dept'
).show()

### Mean price per dept over time

In [None]:
px.line(
    time_dept_summaries
    , x = "start_of_week"
    , y = "mean"
    , color = "dept_id"
    , title = 'Mean price per department over time'
).show()

Some strange fluctuations here - probably due to additions and removal of items (would need to verify).
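
One way to check that hunch would be to count the distinct items priced in each week next to the mean price. The sketch below uses a made-up table to show the effect; on the real data the same groupby could be run on `sell_prices_for_eda`, though I haven't done that here.

```python
import pandas as pd

# Made-up data: items A and B never change price, item C appears in week 2
toy = pd.DataFrame({
    "wm_yr_wk": [1, 1, 2, 2, 2],
    "dept_id": ["D1"] * 5,
    "item_id": ["A", "B", "A", "B", "C"],
    "sell_price": [1.0, 3.0, 1.0, 3.0, 11.0],
})

weekly = (
    toy
    .groupby(["wm_yr_wk", "dept_id"])
    .agg(
        n_items=("item_id", "nunique"),    # how many items were on sale that week
        mean_price=("sell_price", "mean"),
    )
    .reset_index()
)
```

Here the mean price jumps from 2.0 to 5.0 purely because an expensive item was added, so a step in the line chart that coincides with a change in `n_items` would support the explanation.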

### ... *to be continued* ...

We've touched on the pricing, but should also look to gain some perspective on sales, events & SNAP^ windows.

> ^The United States federal government provides a nutrition assistance benefit called the Supplemental Nutrition Assistance Program (SNAP).  SNAP provides low income families and individuals with an Electronic Benefits Transfer debit card to purchase food products.  In many states, the monetary benefits are disbursed to people across 10 days of the month and on each of these days 1/10 of the people will receive the benefit on their card.  More information about the SNAP program can be found here: https://www.fns.usda.gov/snap/supplemental-nutrition-assistance-program