#  Introduction

## Problem Statement
The Makridakis Open Forecasting Center (MOFC) at the University of Nicosia conducts cutting-edge forecasting research and provides business forecast training. It helps companies achieve accurate predictions, estimate levels of uncertainty, avoid costly mistakes, and apply best forecasting practices. The MOFC is well known for its Makridakis Competitions, the first of which ran in the 1980s.

In this competition, the fifth iteration, hierarchical sales data from Walmart, the world's largest company by revenue, is given **to forecast daily sales for the next 28 days.** The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product category, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. Together, this robust dataset can be used to improve forecasting accuracy.

![Walmart](https://entrackr.com/wp-content/uploads/2020/01/Walmart-cash-and-carry.jpg)

## Data Provided
In the challenge, you are predicting item sales at stores in various locations for two 28-day time periods. Information about the data is found in the M5 Participants Guide.

Files:
* **calendar.csv** - Contains information about the dates on which the products are sold.
* **sales_train_validation.csv** - Contains the historical daily unit sales data per product and store (d_1 - d_1913).
* **sample_submission.csv** - The correct format for submissions. Reference the Evaluation tab for more info.
* **sell_prices.csv** - Contains information about the price of the products sold per store and date.
* **sales_train_evaluation.csv** - Available one month before the competition deadline. Will include sales (d_1 - d_1941).

We will take a sneak peek at the dataset below.

## Evaluation Metric
This competition uses the Weighted Root Mean Squared Scaled Error (WRMSSE). The RMSSE is a variant of the original MASE (mean absolute scaled error) metric. The scaling makes the error scale-free regardless of the data. An example from this competition would be a product that sells hundreds of units per day versus a product that only sells a few times a week or month (the intermittent demand we see in these series).
Because the measure is scale-free, we can compare many different types of time series. Another reason for using RMSSE is that, unlike MAPE, it does not break down when actual values are zero.
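
For a single series, the RMSSE can be sketched as below. This is a minimal illustration rather than the official competition scorer: the function name `rmsse` and the toy numbers are assumptions, and the competition metric additionally starts the scaling term from the first non-zero sale and combines the per-series scores with weights.

```python
import math

def rmsse(train, actual, forecast):
    """Root Mean Squared Scaled Error for one series (illustrative sketch)."""
    # Scale: mean squared error of the one-step naive forecast on the history
    naive_sq = [(train[t] - train[t - 1]) ** 2 for t in range(1, len(train))]
    scale = sum(naive_sq) / len(naive_sq)
    # Mean squared error of the forecast over the horizon
    mse = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
    return math.sqrt(mse / scale)

history = [1, 2, 3, 4]          # made-up training sales
score = rmsse(history, actual=[5, 6], forecast=[4, 5])
print(score)
```

Because the error is divided by the in-sample naive error of the same series, a high-volume item and a slow-moving item end up on a comparable scale.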

# Content:
1. Loading necessary libraries
2. Reading the dataset
3. Summary Statistics
4. Time Series Views
5. Impact of Events and SNAP days on sales
6. Analysis of price changes

# 1. Loading necessary libraries

In [None]:
!pip install cufflinks

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os #using operating system dependent functionality
import datetime #datetime module supplies classes for manipulating dates and times.
import math # provides access to the mathematical functions
from IPython.display import display, HTML

#For Plotting
# Using plotly + cufflinks in offline mode
import plotly as py
import plotly.graph_objs as go
import plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
cf.set_config_file(offline=True)
init_notebook_mode(connected=True)

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

#For time series decomposition
from matplotlib import pyplot
from statsmodels.tsa.seasonal import seasonal_decompose

#Pandas option
pd.options.display.float_format = '{:.2f}'.format

# 2. Reading the dataset

Listing the available files

In [None]:
# Input data files are available in the "../input/" directory.
# Listing the available files 
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
sales_data = pd.read_csv("/kaggle/input/m5-forecasting-accuracy/sales_train_validation.csv")
price_data = pd.read_csv("/kaggle/input/m5-forecasting-accuracy/sell_prices.csv")
calender_data = pd.read_csv("/kaggle/input/m5-forecasting-accuracy/calendar.csv")
submission_data = pd.read_csv("/kaggle/input/m5-forecasting-accuracy/sample_submission.csv")

# 3. Summary Statistics

### 1.1 Sales Data

In [None]:
print("The sales data has '{}' rows and '{}' columns".format(sales_data.shape[0],sales_data.shape[1]))

In [None]:
#Let's have a look at the data
sales_data.head()

The unique identifier seems to be a concatenation of item_id and store_id. There are 3049 unique items and 10 unique stores (Total number of rows = 30490)

The columns d_1 to d_1913 give the unit sales of the given item in that store on the nth day, for 1913 days.
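
To make the wide layout concrete, here is a small sketch that reshapes a toy frame with the same structure into long format (one row per id and day). The frame `wide`, its values, and the column names `d`, `day` and `sales` are made up for illustration; only the `d_` prefix convention comes from the dataset.

```python
import pandas as pd

# Toy stand-in for the wide sales table (values are made up)
wide = pd.DataFrame({
    "id": ["FOODS_1_001_CA_1_validation", "HOBBIES_1_001_TX_1_validation"],
    "d_1": [3, 0],
    "d_2": [1, 2],
    "d_3": [0, 5],
})

day_cols = [c for c in wide.columns if c.startswith("d_")]
# One row per (id, day) instead of one column per day
long = wide.melt(id_vars="id", value_vars=day_cols, var_name="d", value_name="sales")
# Numeric day index, e.g. "d_3" -> 3
long["day"] = long["d"].str.replace("d_", "", regex=False).astype(int)
print(long.sort_values(["id", "day"]))
```

The long format is usually easier to join with calendar and price information, at the cost of many more rows.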

In [None]:
#Let's make a list of date columns date_col = [d1,d2,d3,d4...]
date_col = [col for col in sales_data if col.startswith('d_')]

Let's look at the unique states in the sales dataset

In [None]:
#Let's look at the unique states in the sales dataset
sales_data.state_id.unique()

We have information about 3 US states (California, Texas and Wisconsin)

Let's look at the number of rows for each state

In [None]:
#Let's look at the number of rows for each state. value_counts gives you that
sales_data.state_id.value_counts()

In [None]:
#Let's have a look at the ratio of the number of rows. Normalize = True gives you the ratio
sales_data.state_id.value_counts(normalize =True) 

40% of the rows are about California; Texas and Wisconsin account for 30% each

Total sales from each state

In [None]:
#Calculating total sales for each row/id by adding the sales of each of the 1913 days
sales_data['total_sales'] = sales_data[date_col].sum(axis=1)
#Adding all the sales for each state
sales_data.groupby('state_id').agg({"total_sales":"sum"}).reset_index()

### Plotting Sales Ratio across the 3 states

In [None]:
#Calculating the sales ratio
state_wise_sales_data = sales_data.groupby('state_id').agg({"total_sales":"sum"})/sales_data.total_sales.sum() * 100
state_wise_sales_data = state_wise_sales_data.reset_index()
#Plotting the sales ratio
fig1, ax1 = plt.subplots()
ax1.pie(state_wise_sales_data['total_sales'],labels= state_wise_sales_data['state_id'] , autopct='%1.1f%%',
        shadow=True, startangle=90)# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')  
plt.tight_layout()
plt.title("State Wise total sales percentage",fontweight = "bold")
plt.show()

We see that 43.6% of sales come from California, while Texas and Wisconsin have comparable shares (27.6% and 28.8%)

In [None]:
#Let's have a look at the unique stores
print("Let's have a look at the unique stores - ",sales_data.store_id.unique())

We have information from 10 stores: 4 stores in California, 3 stores in Texas and 3 stores in Wisconsin

### Plotting Sales Ratio across the 10 stores

In [None]:
#Calculating the sales ratio for the 10 stores
store_wise_sales_data=sales_data.groupby('store_id').agg({"total_sales":"sum"})/sales_data.total_sales.sum() * 100
#Plotting the sales ratio for the 10 stores
store_wise_sales_data = store_wise_sales_data.reset_index()
fig1, ax1 = plt.subplots()
ax1.pie(store_wise_sales_data['total_sales'],labels= store_wise_sales_data['store_id'] , autopct='%1.1f%%',
        shadow=True, startangle=90)# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')  
plt.tight_layout()
plt.title("Store Wise total sales percentage",fontweight = "bold")
plt.show()

Store CA_3 has the highest share of sales (~17%), almost double that of any other store, while CA_4 has the lowest share (~6.2%)

In [None]:
# Let's have a look at the unique categories 
print("Let's have a look at the unique categories -",sales_data.cat_id.unique())

In [None]:
#Let's have a look at the total sales from each of the 3 categories
print("Total Sales from each category")
sales_data.groupby('cat_id').agg({"total_sales":"sum"}).reset_index()

### Plotting Sales Ratio across the 3 categories

In [None]:
#Calculating the sales ratio for the 3 categories
cat_wise_sales_data = sales_data.groupby('cat_id').agg({"total_sales":"sum"})/sales_data.total_sales.sum() * 100
cat_wise_sales_data = cat_wise_sales_data.reset_index()
#Plotting the sales ratio for the 3 categories
fig1, ax1 = plt.subplots()
ax1.pie(cat_wise_sales_data['total_sales'],labels= cat_wise_sales_data['cat_id'] , autopct='%1.1f%%',
        shadow=True, startangle=90)# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')  
plt.tight_layout()
plt.title("Category Wise total sales percentage",fontweight = "bold")
plt.show()

Almost 70% of sales come from the FOODS category alone, about 20% from HOUSEHOLD, and a minor 10% from HOBBIES

### Plotting Sales of each category across the 3 states

In [None]:
cat_state_sales =sales_data.groupby(['cat_id','state_id']).agg({"total_sales":"sum"}).groupby(level=0).apply(lambda x: 100 * x / float(x.sum())).unstack()
cat_state_sales.columns = [f'{i}_{j}' if j != '' else f'{i}' for i,j in cat_state_sales.columns]
cat_state_sales.plot(kind='bar', stacked=True)
plt.title("Sales Distribution for each category across states",fontweight = "bold")

What we see:
* California contributes almost 40% of the sales in the foods and household categories, but about 50% of the sales in the hobbies category
* Wisconsin contributes about 25% in both the hobbies and household categories, but about 30% of the sales in the food category
* Texas contributes almost 30% of the sales in the foods and household categories, but about 25% of the sales in the hobbies category
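
The within-group percentages behind this kind of chart can also be computed with `transform`, which some readers find easier to follow than a nested `apply`. The sketch below uses a made-up toy frame; `toy`, `totals` and `share_table` are illustrative names, not objects from this notebook.

```python
import pandas as pd

# Toy totals per category and state (values are made up)
toy = pd.DataFrame({
    "cat_id": ["FOODS", "FOODS", "HOBBIES", "HOBBIES"],
    "state_id": ["CA", "TX", "CA", "TX"],
    "total_sales": [60, 40, 30, 10],
})

totals = toy.groupby(["cat_id", "state_id"])["total_sales"].sum()
# Divide each (category, state) total by the total of its category
share = 100 * totals / totals.groupby(level="cat_id").transform("sum")
share_table = share.unstack("state_id")
print(share_table)
```

Each row of `share_table` sums to 100, so the rows can be drawn directly as stacked bars.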

### Plotting sales distribution for each state across categories

In [None]:
#Calculating sales distribution for each state 
state_cat_sales = sales_data.groupby(['state_id','cat_id']).agg({"total_sales":"sum"}).groupby(level=0).apply(lambda x: 100 * x / float(x.sum())).unstack()
#Plotting the sales distribution for each state
state_cat_sales.columns = [f'{i}_{j}' if j != '' else f'{i}' for i,j in state_cat_sales.columns]
state_cat_sales.plot(kind='bar', stacked=True)
plt.title("Sales Distribution for each state across categories",fontweight = "bold")

What we see:
* Wisconsin has the highest food share: 72% food, 8% hobbies and 20% household
* California: 67% food, 11% hobbies and 22% household
* Texas: 69% food, 8% hobbies and 23% household

So the 3 states differ slightly in how their sales are split across categories

In [None]:
#Let's look at the unique departments
print("Let's look at the unique departments - ",sales_data.dept_id.unique())

We have 7 departments in total (2 hobbies ,2 household and 3 food departments)

### Plotting sales distribution across departments

In [None]:
#Calculating sales distribution across departments
dept_sales = sales_data.groupby('dept_id').agg({"total_sales":"sum"})/sales_data.total_sales.sum() * 100
#Plotting
dept_sales = dept_sales.reset_index()
fig1, ax1 = plt.subplots()
ax1.pie(dept_sales['total_sales'],labels= dept_sales['dept_id'] , autopct='%1.1f%%',
        shadow=True, startangle=90)# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')  
plt.tight_layout()
plt.title("Department Wise total sales percentage",fontweight = "bold")
plt.show()

What we see:
* Almost 50% of total sales come from the FOODS_3 department, which is the highest (out of the ~70% of sales coming from foods)
* In household, HOUSEHOLD_1 contributes 17.5% of total sales (out of the 22% coming from household)
* Of the 9.3% of sales coming from hobbies, 8.5% comes from HOBBIES_1

### Plotting sales distribution of stores across departments

In [None]:
# Calculating the sales distribution of stores
store_dept_sales = sales_data.groupby(['store_id','dept_id']).agg({"total_sales":"sum"}).groupby(level=0).apply(lambda x: 100 * x / float(x.sum())).unstack()
store_dept_sales.columns = [f'{i}_{j}' if j != '' else f'{i}' for i,j in store_dept_sales.columns]
#Plotting the sales distribution
store_dept_sales.plot(kind='bar', stacked=True)
plt.title("Sales Distribution for each store across departments",fontweight = "bold")
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.2), shadow=True, ncol=2)

In [None]:
print("Sales distribution (in %) in each store across different departments")
sales_data.groupby(['store_id','dept_id']).agg({"total_sales":"sum"}).groupby(level=0).apply(lambda x: 100 * x / float(x.sum())).unstack()

What we see:
* Store WI_3 has the most skewed distribution of sales: 54% of its sales come from FOODS_3 and ~7% from hobbies
* Store CA_2 has the lowest FOODS_3 share of all stores (42%) and the highest shares of FOODS_1 and HOUSEHOLD_2 (12% and 8%), departments whose average shares are otherwise quite low

In [None]:
dept_store_sales = sales_data.groupby(['dept_id','store_id']).agg({"total_sales":"sum"}).groupby(level=0).apply(lambda x: 100 * x / float(x.sum())).unstack()
dept_store_sales.columns = [f'{i}_{j}' if j != '' else f'{i}' for i,j in dept_store_sales.columns]
dept_store_sales.plot(kind='bar', stacked=True)
plt.title("Sales Distribution for each department across stores",fontweight = "bold")
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.5), shadow=True, ncol=3)

### 1.2 Price Data 

In [None]:
#Let's have a look at the price data
price_data.head()

Let's first look at the distribution of prices across Categories

In [None]:
price_data["Category"] = price_data["item_id"].str.split("_",expand = True)[0]
plt.figure(figsize=(12,6))
p1=sns.kdeplot(price_data[price_data['Category']=='HOBBIES']['sell_price'], shade=True, color="b")
p2=sns.kdeplot(price_data[price_data['Category']=='FOODS']['sell_price'], shade=True, color="r")
p3=sns.kdeplot(price_data[price_data['Category']=='HOUSEHOLD']['sell_price'], shade=True, color="g")
plt.legend(labels=['HOBBIES','FOODS',"HOUSEHOLD"])
plt.xscale("log")
plt.xlabel("Log of Prices")
plt.ylabel("Density")
plt.title("Density plot of log of prices across Categories")

What we see:
* Most food prices lie between 1 and 10 dollars, as shown by the high peak between 10^0 and 10^1
* Hobbies show a pretty wide range of prices
* Household items are costlier than food

In [None]:
#Let's look at items with the maximum price change and minimum price change over the years
item_store_prices = price_data.groupby(["item_id","store_id"]).agg({"sell_price":["max","min"]})
item_store_prices.columns = [f'{i}_{j}' if j != '' else f'{i}' for i,j in item_store_prices.columns]                                               
item_store_prices["price_change"] = item_store_prices["sell_price_max"] - item_store_prices["sell_price_min"]
item_store_prices_sorted = item_store_prices.sort_values(["price_change","item_id"],ascending=False).reset_index()
item_store_prices_sorted["category"] = item_store_prices_sorted["item_id"].str.split("_",expand = True)[0]


In [None]:
print("Items sorted by maximum price change over the years (top 10)")
item_store_prices_sorted.head(10)

What we get:
* Household items, especially in the HOUSEHOLD_2 department, show the largest price changes, especially in Wisconsin.
* The price changed the most for item HOUSEHOLD_2_406 in store WI_3, where the minimum price was just \$3.26 and the maximum was \$107, roughly 32 times the minimum
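
The same min/max summary can be written with named aggregation, which avoids flattening a two-level column index afterwards. This is a sketch on made-up prices; `toy_prices`, `summary` and the `fold_change` column are illustrative assumptions and not part of the notebook.

```python
import pandas as pd

# Toy price history (values are made up)
toy_prices = pd.DataFrame({
    "item_id": ["A", "A", "A", "B", "B"],
    "store_id": ["S1", "S1", "S1", "S1", "S1"],
    "sell_price": [2.0, 4.0, 8.0, 5.0, 5.0],
})

summary = (
    toy_prices.groupby(["item_id", "store_id"])["sell_price"]
    .agg(price_min="min", price_max="max")   # named aggregation
    .reset_index()
)
# Absolute and relative spread between the lowest and highest observed price
summary["price_change"] = summary["price_max"] - summary["price_min"]
summary["fold_change"] = summary["price_max"] / summary["price_min"]
print(summary)
```

Note that a max-minus-min spread says nothing about direction: a large value can come from a price rise, a price cut, or a temporary promotion.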

In [None]:
print("Items sorted by least price changes over the years (bottom 10)")
item_store_prices_sorted.tail(10)

What we get:
* The price of item FOODS_1_014 hasn't changed over the years. Also, the price is the same in all stores

### Plotting boxplot for price changes

In [None]:
#Plotting boxplot
sns.boxplot(x="price_change", y="category", data=item_store_prices_sorted)
title = plt.title("Boxplot for maximum price change for each item over the years across all categories")

What we get:
* Most products don't change prices at all, and most changes are limited to \$10-15
* Household items have the largest price changes over the years

### 1.3 Calendar

In [None]:
#Let's look at the calendar data
calender_data.head()

In [None]:
print("The calendar dataset has {} rows and {} columns".format(calender_data.shape[0],calender_data.shape[1]))

What we see:
* The calendar data covers all 1913 days in the sales data (it actually contains 1969 days, which also covers the two 28-day forecast periods)
* There are at most 2 events per day, for which the event names and event types are given
* We also have SNAP day flags for each state separately, i.e. SNAP days can differ between states.

It helps to know a bit about SNAP: [SNAP](https://www.feedingamerica.org/take-action/advocate/federal-hunger-relief-programs/snap)

**What is SNAP?**

SNAP stands for the Supplemental Nutrition Assistance Program. SNAP is a federal program that helps millions of low-income Americans put food on the table. Across the United States there are 9.5 million families with children on SNAP. It is the largest program working to fight hunger in America.

**What kinds of groceries can be purchased with SNAP?**

Households can use SNAP to buy nutritious foods such as breads and cereals, fruits and vegetables, meat and fish and dairy products. SNAP benefits cannot be used to buy any kind of alcohol or tobacco products or any nonfood items like household supplies and vitamins and medicines.

N.B. We can therefore expect SNAP days to boost sales of food items.

In [None]:
# Event names for each event type
events1 = calender_data[['event_type_1','event_name_1',]]
events2 = calender_data[['event_type_2','event_name_2',]]
events2.columns = ["event_type_1","event_name_1"]
events = pd.concat([events1,events2],ignore_index = True)
events = events.dropna().drop_duplicates()
events
events_dict = {k: g["event_name_1"].tolist() for k,g in events.groupby("event_type_1")}
print("Event Names across different Event Types")
pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in events_dict.items()]))

We have 10 National and Religious events, 6 Cultural Events and 3 Sporting events (30 events in total) in a year

In [None]:
# Select several columns with a list (tuple-style selection fails in recent pandas)
snap_days = calender_data.groupby(['year','month'])[['snap_CA','snap_TX','snap_WI']].sum().reset_index()
print("SNAP days for each month across the years for all the states")
snap_days.pivot(index="month",columns = "year",values = ["snap_CA","snap_TX","snap_WI"])

So every month has 10 SNAP days in each of the 3 states, and this has been consistent throughout the years, although the days may differ between states

Let's look at the days of the month on which these 10 SNAP days fall over the years

In [None]:
#Setting the start date
base = datetime.datetime(2011,1,29)
#Calculating the total sales in a day
sales_sum = pd.DataFrame(sales_data[date_col].sum(axis =0),columns = ["sales"])
#Adding the date column
sales_sum['datum'] = [base + datetime.timedelta(days=x) for x in range(1913)]
sales_sum.set_index('datum', drop=True, inplace=True)
sales_sum.sort_index(inplace=True)
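
As an alternative sketch for the day-to-date mapping, `pd.date_range` can build the same dates in one call. The start date 2011-01-29 is taken from the cell above; the dictionary name `day_to_date` is an illustrative assumption.

```python
import pandas as pd

# Assumption carried over from the notebook: d_1 corresponds to 2011-01-29
dates = pd.date_range(start="2011-01-29", periods=1913, freq="D")
# Map each day column name to its calendar date, e.g. "d_1" -> 2011-01-29
day_to_date = {f"d_{i + 1}": date for i, date in enumerate(dates)}
print(day_to_date["d_1"], day_to_date["d_1913"])
```

A `DatetimeIndex` built this way carries an explicit daily frequency, which can help time series tools that need to know the spacing of the observations.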

In [None]:
#Joining the calender data with the sales data to see the impact of events
calender_data['date'] = pd.to_datetime(calender_data['date'])
overall_sales_special = pd.merge(calender_data,sales_sum, left_on = "date", right_on = "datum",how = "right")

For CA:

In [None]:
overall_sales_special.loc[overall_sales_special.snap_CA==1,"date"].dt.day.unique()

We can see 10 unique values. So across all years SNAP days fall on the same 10 days, i.e. the 1st to the 10th of every month in CA

For TX:

In [None]:
overall_sales_special.loc[overall_sales_special.snap_TX==1,"date"].dt.day.unique()

We also see 10 unique values, which means that in TX SNAP also falls on the same days every month. These days are different from CA

For WI:

In [None]:
overall_sales_special.loc[overall_sales_special.snap_WI==1,"date"].dt.day.unique()

We also see 10 unique values, which means that in WI SNAP also falls on the same days every month. These days are different from CA and TX
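
The check used for each state can be sketched on a toy calendar as follows. The frame `cal`, the column `snap_flag` and the pattern (days 1 to 10 of every month) are made up for illustration and do not reproduce any state's actual schedule.

```python
import pandas as pd

# Toy calendar: the flag is set on days 1-10 of each month (made-up pattern)
cal = pd.DataFrame({"date": pd.date_range("2011-02-01", "2011-04-30", freq="D")})
cal["snap_flag"] = (cal["date"].dt.day <= 10).astype(int)

# Distinct days of the month on which the flag is set
snap_days = sorted(int(d) for d in cal.loc[cal["snap_flag"] == 1, "date"].dt.day.unique())
# Number of flagged days in each calendar month
per_month = cal.groupby(cal["date"].dt.to_period("M"))["snap_flag"].sum()
print(snap_days)
print(per_month)
```

If the list of distinct days has exactly as many entries as the monthly count, the flagged days fall on the same days of the month every time.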

# 4. Time Series Views

### 4.1 Across Years

### Plotting daily sales time series

In [None]:
#Plotting daily sales
sales_sum.iplot(title = "Daily Overall Sales")

What we see:
* We see an increasing overall trend
* We also see a monthly seasonality peaking in August.

Now these things will be even clearer when we look at the time series decomposition.

Before that a little context:

A given time series is thought to consist of three systematic components (level, trend and seasonality) and one non-systematic component called noise.

These components are defined as follows:

    Level: The average value in the series.
    Trend: The increasing or decreasing value in the series.
    Seasonality: The repeating short-term cycle in the series.
    Noise: The random variation in the series.
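
To show how these components add up, here is a hand-rolled additive decomposition on a synthetic daily series. It is a simplified sketch, not the algorithm used by `seasonal_decompose`: the data, the 7-day window and all variable names are assumptions, and the level is absorbed into the trend.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: linear trend + weekly pattern + small noise
rng = np.random.default_rng(0)
idx = pd.date_range("2011-01-29", periods=140, freq="D")
dow = idx.dayofweek.to_numpy()                               # Monday=0 ... Sunday=6
weekly = np.array([0, -2, -3, -3, -1, 4, 5], dtype=float)    # sums to zero
y = pd.Series(100 + 0.1 * np.arange(140) + weekly[dow] + rng.normal(0, 0.5, 140), index=idx)

# Trend: centered 7-day moving average (one full weekly cycle)
trend = y.rolling(window=7, center=True).mean()
# Seasonality: average detrended value per weekday, centered at zero
detrended = y - trend
seasonal_means = detrended.groupby(detrended.index.dayofweek).mean()
seasonal_means = seasonal_means - seasonal_means.mean()
seasonal = pd.Series(seasonal_means.to_numpy()[dow], index=idx)
# Noise: whatever trend and seasonality do not explain
resid = y - trend - seasonal
print(seasonal_means.round(2))
```

By construction, `trend + seasonal + resid` reproduces the observed series wherever the moving average is defined (the first and last three days are lost to the window).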

In [None]:
result = seasonal_decompose(sales_sum, model='additive')
result.plot()
pyplot.show()

What we see:
* The first graph shows the actual time series
* The second graph shows the trend. The overall trend seems to be increasing over the years
* The third graph shows the seasonality. We see a strong weekly seasonality
* We also see lots of residuals in the last graph

### Plotting monthly sales time series across the 3 states

In [None]:
state_level = sales_data.groupby("state_id")[date_col].sum().reset_index().set_index('state_id').T
state_level['datum'] = [base + datetime.timedelta(days=x) for x in range(1913)]
state_level.set_index('datum', drop=True, inplace=True)
state_level.sort_index(inplace=True)
state_level.head()
state_month_level = state_level.groupby(pd.Grouper(freq='1M')).sum()
state_month_level.iplot(title = "Monthly Sales accross States")

What we see:
* CA sales have always been the highest. The August peaks are most evident in CA compared with the other states, so seasonality impacts CA sales the most
* WI has shown the highest increase in sales over the years. Its sales were lower than TX before 2013, similar to TX from 2013 to August 2015, and higher than TX after August 2015. It will be interesting to look at what has caused WI sales to increase significantly over the years.
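
Monthly totals like these can also be computed by grouping on the calendar month of the index. The sketch below uses a made-up daily frame (`daily`) and groups with `to_period("M")`, chosen here as an assumption to avoid depending on frequency aliases that differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Toy daily sales for two states (values are made up)
idx = pd.date_range("2011-01-29", periods=90, freq="D")
daily = pd.DataFrame({"CA": np.ones(90), "TX": 2 * np.ones(90)}, index=idx)

# Sum all days that fall in the same calendar month
monthly = daily.groupby(daily.index.to_period("M")).sum()
print(monthly)
```

In this toy example the first month contains only 3 days, so its total is much lower than the others. Partial months at the start or end of a series can look like dips in a monthly plot even when daily sales are unchanged.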

In [None]:
#Plotting the sales time series decomposition for each state
res1 = seasonal_decompose(state_month_level["CA"], model='additive')
res2 = seasonal_decompose(state_month_level["TX"], model='additive')
res3 = seasonal_decompose(state_month_level["WI"], model='additive')
def plotseasonal(res, axes ):
    res.observed.plot(ax=axes[0], legend=False)
    axes[0].set_ylabel('Observed')
    res.trend.plot(ax=axes[1], legend=False)
    axes[1].set_ylabel('Trend')
    res.seasonal.plot(ax=axes[2], legend=False)
    axes[2].set_ylabel('Seasonal')
    res.resid.plot(ax=axes[3], legend=False)
    axes[3].set_ylabel('Residual')

fig, axes = plt.subplots(ncols=3, nrows=4, sharex=True, figsize=(12,5))

plotseasonal(res1, axes[:,0])
axes[0,0].set_title("CA")
plotseasonal(res2, axes[:,1])
axes[0,1].set_title("TX")
plotseasonal(res3, axes[:,2])
axes[0,2].set_title("WI")
plt.tight_layout()
plt.show()

What we see:
* CA and WI show a gradually increasing trend, while TX grew quite fast and then became somewhat stagnant
* CA and TX show a similar seasonality, peaking in July and August and dipping in December and January
* WI shows a different seasonality, peaking in March, dipping in April and then peaking again in August

In [None]:
store_level = sales_data.groupby("store_id")[date_col].sum().reset_index().set_index('store_id').T
store_level['datum'] = [base + datetime.timedelta(days=x) for x in range(1913)]
store_level.set_index('datum', drop=True, inplace=True)
store_level.sort_index(inplace=True)
store_level.head()
store_month_level = store_level.groupby(pd.Grouper(freq='1M')).sum()
store_month_level.head()

### Plotting monthly sales time series across different stores

In [None]:
cf.Figure(cf.subplots([store_month_level[['CA_1','CA_2','CA_3','CA_4']].figure(),store_month_level[['TX_1','TX_2','TX_3']].figure(),store_month_level[['WI_1','WI_2','WI_3']].figure()],shape=(1,3),subplot_titles=('CA', 'TX', 'WI'))).iplot()

What we see:
* In CA, CA_1 and CA_3 show the highest seasonality in sales and have an increasing trend overall. CA_2 shows a stark decrease in 2015 but picked up again in 2016. CA_4 shows a similar increasing trend. Overall, CA stores show the same trend
* In TX, TX_2 performed really well until 2014 but has dipped drastically since then. TX_1 shows a moderately increasing trend. The good news in TX is that TX_3 consistently shows an increasing trend
* In WI, WI_1 and WI_2 show a drastic increase in 2012 and 2013, from 50k to 100k sales. WI_3, which was the best store in 2012, dipped a lot during 2014 and 2015 but increased a bit in 2016

### Plotting sales time series across categories

In [None]:
cat_level = sales_data.groupby("cat_id")[date_col].sum().reset_index().set_index('cat_id').T
cat_level['datum'] = [base + datetime.timedelta(days=x) for x in range(1913)]
cat_level.set_index('datum', drop=True, inplace=True)
cat_level.sort_index(inplace=True)
cat_level.head()
cat_level_level = cat_level.groupby(pd.Grouper(freq='1M')).sum()
cat_level_level.iplot(title = "Monthly Sales across Categories")

What we see:
* Food sales have always remained much higher than household and hobbies
* Hobbies show a pretty much flat trend and less seasonality
* Food shows the highest seasonality, and household tends to show the highest increase in sales overall

Let's see the time series decomposition for each of the categories as well to be more clear

In [None]:
#Plotting the sales time series decomposition for each category
res1 = seasonal_decompose(cat_level_level["FOODS"], model='additive')
res2 = seasonal_decompose(cat_level_level["HOBBIES"], model='additive')
res3 = seasonal_decompose(cat_level_level["HOUSEHOLD"], model='additive')
def plotseasonal(res, axes ):
    res.observed.plot(ax=axes[0], legend=False)
    axes[0].set_ylabel('Observed')
    res.trend.plot(ax=axes[1], legend=False)
    axes[1].set_ylabel('Trend')
    res.seasonal.plot(ax=axes[2], legend=False)
    axes[2].set_ylabel('Seasonal')
    res.resid.plot(ax=axes[3], legend=False)
    axes[3].set_ylabel('Residual')

fig, axes = plt.subplots(ncols=3, nrows=4, sharex=True, figsize=(12,5))

plotseasonal(res1, axes[:,0])
axes[0,0].set_title("FOODS")
plotseasonal(res2, axes[:,1])
axes[0,1].set_title("HOBBIES")
plotseasonal(res3, axes[:,2])
axes[0,2].set_title("HOUSEHOLD")
plt.tight_layout()
plt.show()

What we see:
* FOODS sales increased considerably until 2012 and were then somewhat stagnant from 2012 to 2016. They show a small spike in March and a huge spike in August
* HOBBIES sales increased in alternate years, from August 2012 to August 2013 and then from August 2014 to August 2015, and remained pretty stagnant in the other two years. Here, the spike in March is much more evident.
* HOUSEHOLD sales have a clearer increasing trend and clean seasonality in March and August

### Plotting monthly sales across departments

In [None]:
dept_level = sales_data.groupby("dept_id")[date_col].sum().reset_index().set_index('dept_id').T
dept_level['datum'] = [base + datetime.timedelta(days=x) for x in range(1913)]
dept_level.set_index('datum', drop=True, inplace=True)
dept_level.sort_index(inplace=True)
dept_level.head()
dept_monthly_level = dept_level.groupby(pd.Grouper(freq='1M')).sum()

In [None]:
cf.Figure(cf.subplots([dept_monthly_level[['FOODS_1','FOODS_2','FOODS_3']].figure(),dept_monthly_level[['HOBBIES_1','HOBBIES_2']].figure(),dept_monthly_level[['HOUSEHOLD_1','HOUSEHOLD_2']].figure()],shape=(1,3),subplot_titles=('FOODS', 'HOBBIES', 'HOUSEHOLD'))).iplot()

What we see:
* In foods, FOODS_3 has the highest sales but also a high seasonality, swinging between 500k and 600k within the same year. FOODS_2 has also increased over the years. FOODS_1 has pretty much remained stagnant over the years
* HOUSEHOLD_1 has shown the best increase in sales over the years compared to the other departments

### Plotting sales for each category in each of the state

In [None]:
dept_cat_level = sales_data.groupby(["state_id","cat_id"])[date_col].sum().reset_index().set_index(["state_id","cat_id"]).T
dept_cat_level['datum'] = [base + datetime.timedelta(days=x) for x in range(1913)]
dept_cat_level.set_index('datum', drop=True, inplace=True)
dept_cat_level.sort_index(inplace=True)
dept_cat_level.columns = [f'{i}_{j}' if j != '' else f'{i}' for i,j in dept_cat_level.columns]
dept_cat_monthly_level = dept_cat_level.groupby(pd.Grouper(freq='1M')).sum()
cf.Figure(cf.subplots([dept_cat_monthly_level[['CA_FOODS','TX_FOODS','WI_FOODS']].figure(),dept_cat_monthly_level[['CA_HOBBIES','TX_HOBBIES','WI_HOBBIES']].figure(),dept_cat_monthly_level[['CA_HOUSEHOLD','TX_HOUSEHOLD','WI_HOUSEHOLD']].figure()],shape=(1,3),subplot_titles=('FOODS', 'HOBBIES', 'HOUSEHOLD'))).iplot()

What we see:
* Sales of Food items in CA have shown very little increase over the years, although they have a high monthly seasonality, peaking in August each year and dipping around the year end and start (Nov, Dec, Jan and Feb)
* Sales of Food items in TX show a similar pattern at a lower scale (high monthly seasonality peaking in August and dipping in Jan and Feb, with the overall trend remaining roughly the same)
* Sales of Food items in WI surpassed those of TX in 2015 and 2016. It's the only state showing an increase in Food sales over the years. We also don't see an obvious monthly seasonality in Food items in WI, where the peaks fall in March or July
* Sales of Hobbies items don't show any monthly seasonality as such
* Rather, in CA, sales of Hobbies items tend to increase and dip in alternate years (i.e. cyclicity), with the dip in Nov 2012 being the worst, after which sales recovered quite well. Sales of Hobbies items in CA also show an increasing trend
* Sales of Hobbies items in TX and WI both show an increasing trend and are quite similar, with a slightly better increase in TX
* Household sales show the best increasing trend in all the states.
* CA shows the best increase in Household items over the years and also has a monthly seasonality, again peaking in August and dipping in December and January
* TX shows slightly better Household sales than WI throughout the years, but both show a similar increasing trend. Both show an increase in Household sales in August, but it is not evident enough to indicate a monthly seasonality. In fact, both also show a peak in Mar 2013

### 4.2 Across Week

### Plotting sales over the week

In [None]:
days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday', 'Sunday']
# day_name() replaces the weekday_name attribute, which no longer exists in current pandas
sales_sum_weekday = sales_sum.groupby(sales_sum.index.day_name()).mean().reindex(days)
sns.set(rc={'figure.figsize':(15,5)})
sns.barplot(x= sales_sum_weekday.index, y='sales', data=sales_sum_weekday)


Saturday sees the highest overall sales, probably because it is the first day of the weekend and people rush to buy groceries, followed by Sunday, which is also part of the weekend

### Plotting sales over the week for the 3 categories

In [None]:
cat_level = sales_data.groupby("cat_id")[date_col].sum().reset_index().set_index('cat_id').T
cat_level['datum'] = [base + datetime.timedelta(days=x) for x in range(1913)]
cat_level.set_index('datum', drop=True, inplace=True)
cat_level.sort_index(inplace=True)
cat_level.head()
days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday', 'Sunday']
# day_name() replaces the weekday_name attribute, which no longer exists in current pandas
sales_cat_weekday = cat_level.groupby(cat_level.index.day_name()).mean().reindex(days)
sales_cat_weekday.iplot( kind="bar",title = "Avg. Sales across day of week")

What we see:
* Sunday (~28.3K units) followed by Saturday (~28K units) are the dominant days for sales of food items
* Saturday has more sales than Sunday for Household and Hobbies, although the differences are quite small
* We do see weekly seasonality, with the highest sales on Saturday and Sunday and the lowest on Wednesday and Thursday
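
Weekday averages of this kind can be sketched with `DatetimeIndex.day_name()`, which works in current pandas releases (the older `weekday_name` attribute was removed in pandas 1.0). The series `sales` below is made up; 2011-01-29, the first day of the dataset, is a Saturday.

```python
import pandas as pd

# Two weeks of toy daily sales starting on Saturday 2011-01-29 (values are made up)
idx = pd.date_range("2011-01-29", periods=14, freq="D")
sales = pd.Series(range(14), index=idx, name="sales", dtype=float)

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
# Average per weekday name, then reorder from alphabetical to calendar order
by_weekday = sales.groupby(sales.index.day_name()).mean().reindex(days)
print(by_weekday)
```

The `reindex(days)` step matters because grouping by name sorts the weekdays alphabetically, which would scramble the bars in a plot.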

### 4.3 Across Months

### Plotting sales over the months

In [None]:
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul','Aug', 'Sep', 'Oct', 'Nov', 'Dec'] 
monthly_sales = sales_sum.groupby(sales_sum.index.strftime('%b')).mean().reindex(months)
monthly_sales.iplot( kind="bar",title = "Avg. Sales across months")

August sees the highest overall sales of the year (~36K units per day on average)

### Plotting sales over the months across the 3 categories

In [None]:
cat_level = sales_data.groupby("cat_id")[date_col].sum().reset_index().set_index('cat_id').T
cat_level['datum'] = [base + datetime.timedelta(days=x) for x in range(1913)]
cat_level.set_index('datum', drop=True, inplace=True)
cat_level.sort_index(inplace=True)
cat_level.head()
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul','Aug', 'Sep', 'Oct', 'Nov', 'Dec'] 
monthly_sales = cat_level.groupby(cat_level.index.strftime('%b')).mean().reindex(months)
monthly_sales.iplot( kind="bar",title = "Avg. Sales across months")

What we see:
* August sees the highest sales in Foods, which might be because of holidays or weather conditions such as high temperature and high precipitation
* August also sees the highest sales in Household items
* April and June see the highest sales in Hobbies

### 4.4 Across Days of Month

### Plotting sales over the days of month

In [None]:
#Group by the day of month taken from sales_sum's own index
monthly_sales = sales_sum.groupby(sales_sum.index.strftime('%d')).mean()
sales_list = np.array(monthly_sales.values.tolist())
sales_list = np.append(sales_list, np.repeat(np.nan, 4)).reshape(5,7)
labels = range(1,32)
labels = np.append(labels, np.repeat(np.nan, 4)).reshape(5,7)
heat_map= sns.heatmap(sales_list,cmap = "YlGnBu",annot = labels, yticklabels = ("Week 1","Week 2","Week 3","Week 4","Week 5"))
plt.title("Avg. Sales on day of month over the years")
plt.show()

Overall sales are generally higher in the first two weeks of a month
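
The heatmap relies on padding the 31 day-of-month averages up to 35 cells so they fit a 5x7 grid. A minimal sketch of that step, with a stand-in array `day_means` in place of the real averages:

```python
import numpy as np

# Stand-in for the 31 day-of-month averages (here simply 1..31)
day_means = np.arange(1, 32, dtype=float)

# A 5x7 grid has 35 cells, so 4 trailing cells have to stay empty
pad = 5 * 7 - day_means.size
grid = np.append(day_means, np.repeat(np.nan, pad)).reshape(5, 7)
print(grid)
```

Note that the rows labelled "Week 1" to "Week 5" are blocks of seven consecutive days of the month (1-7, 8-14, and so on), not calendar weeks, so columns do not correspond to weekdays.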

### Plotting sales over the month for the 3 categories

In [None]:
cat_monthly_sales = cat_level.groupby(cat_level.index.strftime('%d')).mean()
foods_list = np.array(cat_monthly_sales['FOODS'].tolist())
foods_list = np.append(foods_list, np.repeat(np.nan, 4)).reshape(5,7)
hobbies_list = np.array(cat_monthly_sales['HOBBIES'].tolist())
hobbies_list = np.append(hobbies_list, np.repeat(np.nan, 4)).reshape(5,7)
household_list = np.array(cat_monthly_sales['HOUSEHOLD'].tolist())
household_list = np.append(household_list, np.repeat(np.nan, 4)).reshape(5,7)
labels = range(1,32)
labels = np.append(labels, np.repeat(np.nan, 4)).reshape(5,7)


fig, (ax1, ax2 , ax3) = plt.subplots(1,3)
foods_map= sns.heatmap(foods_list,cmap = "YlGnBu",annot = labels, yticklabels = ("Week 1","Week 2","Week 3","Week 4","Week 5"), ax =ax1)
hobbies_map= sns.heatmap(hobbies_list,cmap = "YlGnBu",annot = labels, yticklabels = ("Week 1","Week 2","Week 3","Week 4","Week 5"), ax =ax2)
household_map= sns.heatmap(household_list,cmap = "YlGnBu",annot = labels, yticklabels = ("Week 1","Week 2","Week 3","Week 4","Week 5"), ax =ax3)
ax1.set_title('FOODS')
ax2.set_title('HOBBIES')
ax3.set_title('HOUSEHOLD')
plt.suptitle("Avg. Sales on day of month over all years across different categories ")
plt.show()

What we get:
* Food items are bought throughout the first and second weeks of a month
* Hobbies and Household items are bought primarily in the first 3 days of the month

# 5. Impact of Events and SNAP days on sales

In [None]:
overall_sales_special.head()

## Gauging impact of events

### Plotting daily sales for the year 2012 to see the pattern of impact of events

In [None]:
overall_sales_special[overall_sales_special.year == 2012].groupby("date")["sales"].sum().iplot(title = "Daily Overall Sales")

In [None]:
print("Event days in 2012")
overall_sales_special[(overall_sales_special.year == 2012) & ((overall_sales_special.event_name_1.notnull()) | (overall_sales_special.event_name_2.notnull()))]

If we look closely at the sales for 2012, we see that people **prefer buying on the weekends (preferably Saturdays) before the events rather than on the days of the events**. So we don't see an increase in sales on the event days themselves, but the increase in sales on the preceding weekend can be attributed to the event. There are some exceptions, though: Labor Day falls on a Monday and we still see a peak on that day, and Thanksgiving falls on a Thursday but we see a peak on the day before, a Wednesday.

In [None]:
#Function for tagging events to the preceding weekend 
event_days_sales = overall_sales_special[((overall_sales_special.event_name_1.notnull()) | (overall_sales_special.event_name_2.notnull()))]
overall_sales_special["weekend_precede_event"] = np.nan

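def update_weekend_precede_event(week_e,wday,e1,e2):
    #Build the tag from both event names; event_name_2 is NaN (not a str) when absent
    e2 = '_' + e2 if isinstance(e2, str) else ''
    drift = e1 + e2
    if wday == 1:
        #Event falls on Saturday (wday 1): tag only that day of the event's week
        overall_sales_special.loc[(overall_sales_special['wm_yr_wk']==week_e)&(overall_sales_special['wday']==1),"weekend_precede_event"] = drift
    else:
        #Otherwise tag Saturday and Sunday (wday 1 and 2) of the event's week
        overall_sales_special.loc[(overall_sales_special['wm_yr_wk']==week_e)&((overall_sales_special['wday']==1)|(overall_sales_special['wday']==2)),"weekend_precede_event"] = drift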
        
_ = event_days_sales.apply(lambda row : update_weekend_precede_event(row['wm_yr_wk'],row['wday'],row['event_name_1'], row['event_name_2']),axis = 1)

In [None]:
print("Events data with added weekend_precede_event column, which marks the weekend before each event along with the event name")
overall_sales_special.head(10)

We can see that the Super Bowl was on a Sunday, and we have mapped it to the Saturday and Sunday of the same week (the day before and the event day itself) via the column weekend_precede_event
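
The row-by-row tagging above can also be written as a merge. This is only a simplified sketch on a toy calendar, not the notebook's function. It assumes the M5 convention that `wday` 1 is Saturday and 2 is Sunday, assumes at most one event per week, and always tags both weekend days, whereas the function above tags only Saturday for Saturday events. The frame `cal` and its rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy calendar (hypothetical rows): one Walmart week, wday 1 = Saturday ... 7 = Friday
cal = pd.DataFrame({
    "wm_yr_wk": [11102] * 7,
    "wday": [1, 2, 3, 4, 5, 6, 7],
    "event_name_1": [np.nan, "SuperBowl", np.nan, np.nan, np.nan, np.nan, np.nan],
})

# One row per event, keyed by the week it falls in
events_by_week = cal.dropna(subset=["event_name_1"])[["wm_yr_wk", "event_name_1"]]
events_by_week = events_by_week.rename(columns={"event_name_1": "weekend_precede_event"})

# Attach the event name to every day of that week, then keep it only on Saturday/Sunday
tagged = cal.merge(events_by_week, on="wm_yr_wk", how="left")
tagged.loc[~tagged["wday"].isin([1, 2]), "weekend_precede_event"] = np.nan
print(tagged)
```

If a week contained two events, the merge would duplicate that week's rows, so the event names would need to be combined per week first.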

### First let's look at impact of different types of events and then we will look at specific events

In [None]:
#adding event type column
events.columns = ["weekend_precede_event_type","weekend_precede_event"]
overall_sales_special = pd.merge(overall_sales_special,events,how= "left",on="weekend_precede_event")

### Plotting sales of weekends before different event types

In [None]:
#Calculating sales impact of each event on preceding weekend
event_type_impact = overall_sales_special.groupby(['weekend_precede_event_type'])['sales'].mean().reset_index()
event_type_impact = event_type_impact.sort_values("sales",ascending = False)
event_type_impact.columns = ["event_type","avg_sales_preceding_weekend"]
#Plotting a bar graph of avg. sales on the weekend days before the event to see the impact
chart = sns.barplot(y= "event_type", x='avg_sales_preceding_weekend', data=event_type_impact)
chart.axvline(sales_sum.sales.mean(),label = "Avg. sales in a day",c='red', linestyle='dashed')
plt.title("Avg. sales on preceding weekend of each type of event", fontweight ="bold")
leg = plt.legend()

What we see:
* Weekends before all the event types have much higher sales than the avg. sales per day
* Religious events have the highest impact on sales of the preceding weekend
* National events have the lowest impact on sales of the preceding weekend

### Plotting sales of weekends preceding each event

In [None]:
#Calculating sales impact of each event on preceding weekend
event_impact = overall_sales_special.groupby(['weekend_precede_event'])['sales'].mean().reset_index()
event_impact = event_impact.sort_values("sales",ascending = False)
event_impact.columns = ["events","avg_sales_preceding_weekend"]
# Plotting a bar graph of avg. sales on the weekend days before the event to see the impact
sns.set(rc={'figure.figsize':(15,3)})
chart = sns.barplot(x= "events", y='avg_sales_preceding_weekend', data=event_impact)
chart.axhline(sales_sum.sales.mean(),label = "Avg. sales in a day",c='red', linestyle='dashed')
var = chart.set_xticklabels(chart.get_xticklabels(), rotation=90)
plt.title("Avg. sales on preceding weekend of each event",fontweight = "bold")
leg = plt.legend()

What we see:
* Almost all pre-event weekends have higher sales than the overall daily average (~34K), so events do impact sales
* The highest sales are on the weekend before Easter, about 44K per day as expected, followed by EidAlAdha. Both are religious events, which matches the event type finding that religious events have the highest impact
* The lowest sales are on the weekend before New Year
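
A caveat on the comparison above: weekends sell more than weekdays anyway, so measuring pre-event weekends against the all-days average mixes the event effect with the ordinary weekend effect. The sketch below contrasts the two baselines on invented numbers. The frame `toy`, the `is_weekend` flag and all values are hypothetical.

```python
import pandas as pd

# Hypothetical daily rows: weekend flag, optional event tag, units sold
toy = pd.DataFrame({
    "is_weekend": [True, True, True, True, False, False, False, False],
    "weekend_precede_event": ["Easter", "Easter", None, None, None, None, None, None],
    "sales": [44.0, 44.0, 40.0, 40.0, 30.0, 30.0, 30.0, 30.0],
})

tagged_days = toy["weekend_precede_event"].notna()
event_mean = toy.loc[tagged_days, "sales"].mean()

# Baseline 1: every day, as in the charts above
all_days_baseline = toy["sales"].mean()
# Baseline 2: only weekend days that do not precede an event
plain_weekend_baseline = toy.loc[toy["is_weekend"] & ~tagged_days, "sales"].mean()

lift_vs_all_days = event_mean / all_days_baseline - 1
lift_vs_weekends = event_mean / plain_weekend_baseline - 1
print(round(lift_vs_all_days, 3), round(lift_vs_weekends, 3))
```

In this toy data the apparent lift shrinks from about 22% to 10% once ordinary weekends are used as the baseline, which is the kind of gap worth checking on the real data before attributing the whole increase to events.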

## Impact of SNAP days

Different states have different SNAP dates, so we will have to view them separately

In [None]:
#Joining the state wise sales with the events table
overall_sales_special = pd.merge(overall_sales_special,state_level.reset_index(),how = "left",left_on="date",right_on="datum")
overall_sales_special.drop("datum",axis = 1,inplace =True)

In [None]:
#Comparing the days with and w/o SNAP for all 3 states
ca_snap = overall_sales_special.groupby("snap_CA")["CA"].mean().reset_index()
tx_snap = overall_sales_special.groupby("snap_TX")["TX"].mean().reset_index()
wi_snap = overall_sales_special.groupby("snap_WI")["WI"].mean().reset_index()
ca_snap.columns = ["Snap","CA"]
tx_snap.columns = ["Snap","TX"]
wi_snap.columns = ["Snap","WI"]
snap_impact = pd.merge(ca_snap,tx_snap,on = "Snap")
snap_impact = pd.merge(snap_impact,wi_snap,on = "Snap")
snap_impact = pd.melt(snap_impact, id_vars=['Snap'], value_vars=['CA','TX','WI'],var_name='State', value_name='Avg Sales')
#Plotting bar plots for sales comparison
sns.set(rc={'figure.figsize':(10,7)})
chart=sns.barplot(x= "State", y='Avg Sales',hue = 'Snap' ,data=snap_impact)
chart.axhline(overall_sales_special.CA.mean(),label = "Avg. sales CA",c='red', linestyle='dashed')
chart.axhline(overall_sales_special.TX.mean(),label = "Avg. sales TX",c='blue', linestyle='dashed')
chart.axhline(overall_sales_special.WI.mean(),label = "Avg. sales WI",c='black', linestyle='dashed')
var = chart.set_xticklabels(chart.get_xticklabels(), rotation=90)
plt.title("Avg. Sales in each state on SNAP days(1) and on other days(0)",fontweight="bold")
leg = plt.legend()

What we see:
* All the states have higher sales on SNAP days. The largest increase is in WI (~2K more units per day), while CA and TX sell ~1K more units on each SNAP day
* We have seen higher sales during the first 10-15 days of a month. We have also seen that SNAP days fall in the first 15 days, so there might be a combined effect
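
To separate the SNAP effect from the early-month effect, one option is to compare SNAP and non-SNAP days only within the first half of the month. The sketch below uses a synthetic month in which both effects are built in by construction. The SNAP window (days 1-10), the effect sizes and the frame `toy` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical month of daily sales for one state; SNAP assumed on days 1-10
toy = pd.DataFrame({"day": range(1, 31)})
toy["snap"] = (toy["day"] <= 10).astype(int)
toy["early_month"] = toy["day"] <= 15

# Assumed units: 100 baseline, +20 in the first half of the month, +10 on SNAP days
toy["sales"] = 100.0 + 20.0 * toy["early_month"] + 10.0 * toy["snap"]

# Naive comparison over the whole month vs. comparison restricted to days 1-15
naive = toy.groupby("snap")["sales"].mean()
within_early = toy[toy["early_month"]].groupby("snap")["sales"].mean()

naive_gap = naive.loc[1] - naive.loc[0]
adjusted_gap = within_early.loc[1] - within_early.loc[0]
print(naive_gap, adjusted_gap)
```

Here the naive gap (25 units) overstates the built-in SNAP effect (10 units) because SNAP days are also early-month days. The same restriction could be applied per state on the real table.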

# 6. Analysis of price changes

Sales data of each item at a weekly level

In [None]:
item_sales = sales_data.groupby("item_id")[date_col].sum().reset_index().set_index('item_id').T
#Setting the start date
base = datetime.datetime(2011,1,29)
#Adding the date column
item_sales['date'] = [base + datetime.timedelta(days=x) for x in range(1913)]
item_sales = pd.merge(item_sales,overall_sales_special[["date","wm_yr_wk"]],on = "date")
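#Dropping the datetime column first, since recent pandas versions cannot sum datetimes
item_sales = item_sales.drop(columns="date").groupby(["wm_yr_wk"]).sum().reset_index()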
item_sales['date'] = [base + datetime.timedelta(days=x) for x in range(0,1913,7)]
item_sales = item_sales.melt(id_vars=['wm_yr_wk',"date"], value_vars=item_sales.columns.drop(["wm_yr_wk","date"]), var_name='item_id', value_name='sales')
item_sales.head()
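
The reshaping in the cell above (daily wide table, summed per Walmart week, then melted to one row per week and item) can be followed on a tiny example. The frame `daily` and its item columns are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one row per day, one column per item
daily = pd.DataFrame({
    "wm_yr_wk": [11101, 11101, 11102, 11102],
    "ITEM_A": [1, 2, 3, 4],
    "ITEM_B": [0, 5, 0, 1],
})

# Sum the daily units within each week
weekly = daily.groupby("wm_yr_wk").sum().reset_index()

# Wide to long: one row per (week, item) pair
long_sales = weekly.melt(id_vars="wm_yr_wk", var_name="item_id", value_name="sales")
print(long_sales)
```

The long layout is what makes the later join on `wm_yr_wk` and `item_id` with the price table possible.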

Distribution of median prices of items over the years

In [None]:
item_mean_prices = price_data.groupby("item_id")["sell_price"].median().reset_index()
item_mean_prices.describe()

Let's use the median, 3.42, as the threshold to bucket each item as Cheap or Costly

In [None]:
labels = ["Cheap","Costly"]
#Bucketing each item as cheap or costly
item_mean_prices["item_price_bucket"] =  pd.cut(item_mean_prices.sell_price, [0,3.42,np.inf], include_lowest=True,labels = labels)
item_mean_prices.head()
#Joining with the actual table
price_data_bucketed = pd.merge(price_data,item_mean_prices[["item_id","item_price_bucket"]], on = "item_id",how = "left")
#Joining with Sales data
price_data_bucketed = pd.merge(price_data_bucketed,item_sales,on = ["wm_yr_wk","item_id"],how = "left")

#Previewing the joined table
price_data_bucketed.head()
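
The threshold above is typed in by hand from the `describe()` output. As a hedged alternative, the sketch below derives it from the data, so it stays in sync if the prices change. The frame `toy_prices` and its values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical per-item median prices
toy_prices = pd.DataFrame({
    "item_id": ["A", "B", "C", "D"],
    "sell_price": [0.98, 3.42, 3.43, 12.50],
})

# Derive the cut point from the data instead of hard-coding it
threshold = toy_prices["sell_price"].median()

# Bins are [0, threshold] and (threshold, inf); include_lowest closes the left edge at 0
toy_prices["item_price_bucket"] = pd.cut(
    toy_prices["sell_price"],
    [0, threshold, np.inf],
    include_lowest=True,
    labels=["Cheap", "Costly"],
)
print(toy_prices)
```

Because the bins are closed on the right, an item priced exactly at the threshold lands in the Cheap bucket.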

Let's look at the changes in prices over the weeks for each Category and Price Bucket level

In [None]:
#Creating table at category, price bucket level
mean_table = price_data_bucketed.groupby(["date","Category","item_price_bucket"]).agg({"sell_price":"mean","sales":"sum"}).reset_index()
mean_table["cat-bucket"] = mean_table["Category"].astype(str) + '-'+mean_table["item_price_bucket"].astype(str)


#Plotting the graph
sns.set(rc={'figure.figsize':(20,5)})
fig, (ax1, ax2) = plt.subplots(1,2)
flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]
sns.set_palette(flatui)
prices_plot = sns.lineplot(x = "date",y = "sell_price",hue = "cat-bucket",data = mean_table,ax =ax1)
sales_plot = sns.lineplot(x = "date",y = "sales",hue = "cat-bucket",data = mean_table,ax=ax2)
ax1.title.set_text("Change in avg. prices over the years for cheap and costly items of each category")
ax2.title.set_text("Change in total sales over the years for cheap and costly items of each category")
fig.suptitle('Price and sales changes over the years',fontweight="bold")
ax2.set_yscale('log')
leg = ax2.legend(loc='upper center', bbox_to_anchor=(0.5, -0.2), shadow=True, ncol=3)
leg = ax1.legend(loc='upper center', bbox_to_anchor=(0.5, -0.2), shadow=True, ncol=3)

What we see:
* Prices of products in the cheaper bucket have increased very little over the years for all 3 categories
* In the costlier bucket, we see an increase in avg. prices, especially for HOBBIES in 2012 and 2013. FOODS also shows some increase in prices
* The price increase for Costly Hobbies products didn't hurt their sales in 2012 and 2013; sales actually grew quite well in those years. What we do see is a dip in sales of Cheap Hobbies products between 2012 and 2013
* The gap in selling price between cheaper and costlier products is lowest for Foods, but the gap in sales between them (i.e. between Cheap Foods and Costly Foods) is the highest
* The gap in selling price between cheaper and costlier products is highest for Hobbies, but the gap in sales between them (i.e. between Cheap Hobbies and Costly Hobbies) is quite low
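
The last two observations compare a price gap with a sales gap. One way to make such gaps explicit is to pivot the category and bucket table and take ratios. The numbers below are invented to mimic the pattern described, and the names `wide`, `price_ratio` and `sales_ratio` are hypothetical.

```python
import pandas as pd

# Hypothetical category/bucket summary (not real M5 figures)
toy = pd.DataFrame({
    "Category": ["FOODS", "FOODS", "HOBBIES", "HOBBIES"],
    "item_price_bucket": ["Cheap", "Costly", "Cheap", "Costly"],
    "sell_price": [2.0, 5.0, 1.5, 9.0],
    "sales": [9000.0, 1000.0, 500.0, 400.0],
})

# One row per category, with Cheap/Costly side by side for each measure
wide = toy.pivot(index="Category", columns="item_price_bucket", values=["sell_price", "sales"])

# How many times pricier the Costly bucket is, and how many times more the Cheap bucket sells
price_ratio = wide["sell_price"]["Costly"] / wide["sell_price"]["Cheap"]
sales_ratio = wide["sales"]["Cheap"] / wide["sales"]["Costly"]
print(pd.DataFrame({"price_ratio": price_ratio, "sales_ratio": sales_ratio}))
```

Ratios are used rather than differences because the sales plot is on a log scale, where equal vertical distances correspond to equal ratios.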

Highly influenced by:
https://www.kaggle.com/headsortails/back-to-predict-the-future-interactive-m5-eda

**Please upvote my notebook if you like it** 🙏