# Overview of Problem

## Problem statement:
Given historical data of Walmart sales, predict sales for the next 28 days.

[source](https://www.kaggle.com/c/m5-forecasting-accuracy/)

## Description: 

`Forecast`: predicting future events or trends.

The task at hand is forecasting: using historical data, we need to produce point estimates of future sales for the various products sold by Walmart.

Accurate forecasts help the company prepare for future risks by answering questions such as:
- What can be done to tackle upcoming losses?
- What can be done to maximise profits?
- How can operations in a company be optimized?

Based on the predicted trend, Walmart can increase or decrease the number of items it holds in inventory.

e.g.: Ahead of a holiday season in winter, people are more likely to plan ski trips and buy skiing equipment. Walmart needs to predict this beforehand so that it has its inventory ready for customers and earns more profit.

# Business Constraints

1. Cost of not predicting sales correctly can be very high
1. No strict latency constraints
1. Interpretability is partially important
1. We want to predict future sales as correctly as possible

# Machine Learning Problem

## Data

### Data Overview

**Data**:

1. Data comes from 3 US states: California, Texas, and Wisconsin
1. Data is hierarchical in nature; it can be aggregated at different levels, such as
    -  Item level data
    -  Department level data
    -  Product category level data
    -  Store level data
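
As a rough illustration of the hierarchy, the sketch below aggregates a tiny item-level table to the department, category, and store levels with pandas. The column names follow `sales_train_validation.csv`; the numbers and the `aggregate` helper are made up for the example.

```python
import pandas as pd

# Toy item-level sales: column names follow the dataset, values are invented.
sales = pd.DataFrame({
    "item_id":  ["HOBBIES_1_001", "HOBBIES_1_002", "FOODS_1_001", "FOODS_1_001"],
    "dept_id":  ["HOBBIES_1", "HOBBIES_1", "FOODS_1", "FOODS_1"],
    "cat_id":   ["HOBBIES", "HOBBIES", "FOODS", "FOODS"],
    "store_id": ["CA_1", "CA_1", "CA_1", "TX_1"],
    "d_1": [3, 0, 5, 2],
    "d_2": [1, 1, 4, 0],
})
day_cols = ["d_1", "d_2"]


def aggregate(df, level):
    """Sum daily unit sales over all series that share the given level (hypothetical helper)."""
    return df.groupby(level)[day_cols].sum()


by_dept = aggregate(sales, "dept_id")    # one row per department
by_cat = aggregate(sales, "cat_id")      # one row per product category
by_store = aggregate(sales, "store_id")  # one row per store
total = sales[day_cols].sum()            # top of the hierarchy: all series together
```

Each aggregated row is itself a time series, so the same forecasting approach can in principle be applied at any level of the hierarchy.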

**Additional Data**:

1. Price
1. Promotions
1. Day of week
1. Special events

A brief look at the data suggests that Walmart's sales depend on several factors.
Overall sales can drop or spike around a special event.
Sales of particular products may also rise or fall depending on the event.
Sales will also show seasonality.
This seasonality depends on several factors, and the *Additional Data* will help us account for it.

Using the additional data, we can build a more robust forecasting system.


### Understanding Fields in data

**`calendar.csv`**:
- date : Actual date on calendar
- wm_yr_wk : Unique identifier given to a week of year
- weekday : Name of the day of the week (Monday, Tuesday, ..., Sunday)
- wday : Number given to the weekday; in this data Saturday is 1
- month : Number given to Month of the year
- year : Actual year on calendar
- d : Unique identifier given to a day in data
- event_name_1 : Name of special event 1
- event_type_1 : Type of special event 1
- event_name_2 : Name of special event 2
- event_type_2 : Type of special event 2
- snap_CA, snap_TX, snap_WI : Whether SNAP purchases are allowed in that state on that date (1) or not (0)

    *snap_XX* : 

    - The United States federal government provides a nutrition assistance benefit called the Supplemental Nutrition Assistance Program (SNAP). SNAP provides low-income families and individuals with an Electronic Benefits Transfer debit card to purchase food products.
    [reference](https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/133614)
    - XX : CA, TX, WI refers to California, Texas, Wisconsin respectively
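
To show how the calendar fields might be used, here is a minimal sketch that joins calendar rows to sales rows on `d` and picks the SNAP flag matching each row's state. The column names follow `calendar.csv`; the rows, the long-format `sales` frame, and the `units` and `snap` columns are assumptions made up for the example.

```python
import pandas as pd

# Toy calendar rows: column names follow calendar.csv, values are invented.
calendar = pd.DataFrame({
    "date": ["2011-01-29", "2011-01-30", "2011-01-31", "2011-02-01"],
    "d": ["d_1", "d_2", "d_3", "d_4"],
    "snap_CA": [0, 0, 0, 1],
    "snap_TX": [0, 0, 0, 1],
    "snap_WI": [0, 0, 0, 0],
})
calendar["date"] = pd.to_datetime(calendar["date"])

# Toy long-format sales rows (one row per series per day); "units" is a made-up name.
sales = pd.DataFrame({
    "state_id": ["CA", "CA", "WI", "WI"],
    "d": ["d_3", "d_4", "d_3", "d_4"],
    "units": [2, 5, 1, 1],
})

# Attach the calendar information to every sales row via the day identifier.
merged = sales.merge(calendar, on="d", how="left")

# Collapse the three snap_XX columns into one flag for the row's own state.
merged["snap"] = 0
for state in merged["state_id"].unique():
    mask = merged["state_id"] == state
    merged.loc[mask, "snap"] = merged.loc[mask, f"snap_{state}"]
```

A single `snap` column like this is one possible way to expose the state-specific flag to a model as a feature.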

<hr>

**`sales_train_validation.csv`**:
- id: Unique ID given to each entry, in the format \<item_id\>_\<store_id\>_\<type_of_data\>
    - e.g: HOBBIES_1_001_CA_1_validation
- item_id: Unique ID given to each item 
- dept_id: Unique ID given to each department
- cat_id: Unique ID given to category of product
- store_id: Unique ID given to each store
- state_id: Unique ID given to each state
- **d_1**, d_2, ... **d_1913**: Number of units sold on each day for a particular *id*
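
Because each day is a separate column, the table is in wide format. A common preprocessing step is to reshape it to long format, one row per series per day. The sketch below does this on an invented three-day table; the `units` and `day_num` column names are assumptions, not part of the dataset.

```python
import pandas as pd

# Toy wide table: column names follow the file, values are invented.
wide = pd.DataFrame({
    "id": ["HOBBIES_1_001_CA_1_validation", "FOODS_1_001_TX_1_validation"],
    "item_id": ["HOBBIES_1_001", "FOODS_1_001"],
    "store_id": ["CA_1", "TX_1"],
    "d_1": [0, 2],
    "d_2": [1, 0],
    "d_3": [3, 4],
})

# Keep the identifier columns and turn every d_* column into a row.
id_cols = ["id", "item_id", "store_id"]
long = wide.melt(id_vars=id_cols, var_name="d", value_name="units")

# "d_3" -> 3, which is handy for sorting and for building lag features.
long["day_num"] = long["d"].str.replace("d_", "", regex=False).astype(int)
long = long.sort_values(["id", "day_num"]).reset_index(drop=True)
```

The long table keeps the `d` column, so it can later be joined with `calendar.csv` on that key.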

<hr>

**`sales_train_evaluation.csv`**: 
- id: Unique ID given to each entry, in the format \<item_id\>_\<store_id\>_\<type_of_data\>
    - e.g: HOBBIES_1_001_CA_1_evaluation
- item_id: Unique ID given to each item 
- dept_id: Unique ID given to each department
- cat_id: Unique ID given to category of product
- store_id: Unique ID given to each store
- state_id: Unique ID given to each state
- **d_1**, d_2, ... **d_1941**: Number of units sold on each day for a particular *id*
    
<hr>

**`sell_prices.csv`**:
- store_id: Unique ID given to each store
- item_id: Unique ID given to each item 
- wm_yr_wk : Unique identifier given to a week of year
- sell_price: Selling price
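
Prices are stored per week, while sales are stored per day, so attaching a price to a daily sales row takes two joins: day to week through the calendar, then week to price. The sketch below shows one way to do this with pandas; all values are invented, and the long-format `sales` frame with its `units` and `revenue` columns is an assumption.

```python
import pandas as pd

# Toy frames: column names follow the files, values are invented.
calendar = pd.DataFrame({
    "d": ["d_1", "d_2", "d_8"],
    "wm_yr_wk": [11101, 11101, 11102],
})
prices = pd.DataFrame({
    "store_id": ["CA_1", "CA_1"],
    "item_id": ["HOBBIES_1_001", "HOBBIES_1_001"],
    "wm_yr_wk": [11101, 11102],
    "sell_price": [9.58, 8.26],
})
sales = pd.DataFrame({
    "store_id": ["CA_1"] * 3,
    "item_id": ["HOBBIES_1_001"] * 3,
    "d": ["d_1", "d_2", "d_8"],
    "units": [1, 0, 2],
})

# Step 1: map each day to its week identifier.
with_week = sales.merge(calendar, on="d", how="left")

# Step 2: look up the weekly price for that store and item.
with_price = with_week.merge(
    prices, on=["store_id", "item_id", "wm_yr_wk"], how="left"
)

# Revenue in dollars, one possible derived feature.
with_price["revenue"] = with_price["units"] * with_price["sell_price"]
```

With a left join, any store, item, and week combination that has no row in the price table ends up with a missing `sell_price`, so such rows may need separate handling.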

<hr>

### Summary of data 

- The M5 competition used hierarchical sales data, generously made available by Walmart, starting at the item level and aggregating to that of departments, product categories and stores in three geographical areas of the US: California, Texas, and Wisconsin.
- Besides the time series data, it also included explanatory variables such as price, promotions, day of the week, and special events (e.g. Super Bowl, Valentine’s Day, and Orthodox Easter) that affect sales and were used to improve forecasting accuracy.
- The distribution of uncertainty was assessed by asking participants to provide information on four indicative prediction intervals and the median.
- The majority of the more than 42,840 time series display intermittency, i.e., sporadic demand including zeros.

[source](https://mofc.unic.ac.cy/m5-competition/)

## Mapping the real world problem to Machine Learning Problem

### Type of problem
This is a time series forecasting problem: given a set of time-dependent features, predict future values of the series.

### Performance Metric
Weighted Root Mean Squared Scaled Error (WRMSSE).
![image.png](attachment:image.png)

[Kaggle Evaluation metric](https://www.kaggle.com/c/m5-forecasting-accuracy/overview/evaluation)


[THE M5 COMPETITION Competitors’ Guide](https://mofc.unic.ac.cy/wp-content/uploads/2020/03/M5-Competitors-Guide-Final-10-March-2020.docx)
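
As described in the Competitors' Guide, the RMSSE of one series with $n$ training observations and a forecast horizon of $h$ days is

$$\mathrm{RMSSE} = \sqrt{\frac{\frac{1}{h}\sum_{t=n+1}^{n+h}\left(Y_t - \hat{Y}_t\right)^2}{\frac{1}{n-1}\sum_{t=2}^{n}\left(Y_t - Y_{t-1}\right)^2}}$$

and the overall score is the weighted sum $\mathrm{WRMSSE} = \sum_i w_i \cdot \mathrm{RMSSE}_i$ over all series. Below is a minimal NumPy sketch of these two formulas. The function names and the toy numbers are made up for illustration. It is a simplification: it does not trim the period before a series' first non-zero sale, and it takes the weights as given instead of deriving them from dollar sales.

```python
import numpy as np


def rmsse(train, actual, forecast):
    """RMSSE for a single series (simplified sketch, see the assumptions above)."""
    train = np.asarray(train, dtype=float)
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    # Denominator: mean squared error of the one-step naive forecast on the training data.
    scale = np.mean(np.diff(train) ** 2)
    # Numerator: mean squared error of the forecast over the horizon.
    mse = np.mean((actual - forecast) ** 2)
    return float(np.sqrt(mse / scale))


def wrmsse(scores, weights):
    """Weighted sum of per-series RMSSE values; weights are normalised to sum to 1."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights / weights.sum() * scores))


# Two toy series with invented values.
score_a = rmsse(train=[0, 2, 0, 2, 0], actual=[2, 0], forecast=[1, 1])
score_b = rmsse(train=[1, 2, 3, 4], actual=[5, 6], forecast=[5, 4])
overall = wrmsse([score_a, score_b], weights=[3, 1])
```

Because the error is scaled by the in-sample naive forecast error, series with very different sales volumes can be compared on the same footing.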