# Introduction
Welcome to the "M5 Forecasting - Accuracy" competition! In this competition, contestants are challenged to forecast future sales at Walmart based on hierarchical sales data from the states of California, Texas, and Wisconsin.

# Task at hand
In this competition, we need to forecast the sales for [d_1942 - d_1969]. These rows form the test set.

The rows [d_1914 - d_1941] form the validation set.

The remaining rows [d_1 - d_1913] form the training set.
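The split described above can be written down as lists of day-column names. This is only a sketch: `HORIZON`, `LAST_KNOWN_DAY` and `day_cols` are hypothetical names introduced here, and it assumes the day columns follow the `d_<n>` naming used in the competition files.

```python
HORIZON = 28           # number of days in each of the validation and test windows
LAST_KNOWN_DAY = 1941  # last day before the test window starts

def day_cols(first, last):
    """Return the column names d_<first> ... d_<last>, inclusive."""
    return [f"d_{i}" for i in range(first, last + 1)]

train_days = day_cols(1, LAST_KNOWN_DAY - HORIZON)                   # d_1 .. d_1913
valid_days = day_cols(LAST_KNOWN_DAY - HORIZON + 1, LAST_KNOWN_DAY)  # d_1914 .. d_1941
test_days = day_cols(LAST_KNOWN_DAY + 1, LAST_KNOWN_DAY + HORIZON)   # d_1942 .. d_1969

print(len(train_days), len(valid_days), len(test_days))  # 1913 28 28
```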

# Appeal to fellow Kagglers:)
This is my first attempt at a time series problem, so please upvote this kernel. Your upvote will be like a reward for my work.

# This notebook will cover only EDA

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from statsmodels.tsa.arima.model import ARIMA  # the old statsmodels.tsa.arima_model module was removed in recent statsmodels releases
from statsmodels.tsa.api import ExponentialSmoothing, SimpleExpSmoothing, Holt
from tqdm.notebook import tqdm as tqdm
import statsmodels.api as sm
import gc
plt.style.use('fivethirtyeight')
from pylab import rcParams
import random
from lightgbm import LGBMRegressor


In [None]:
# to display all the columns in the dataset
pd.set_option('display.max_columns', None)

# Let's check the datasets

In [None]:
train_sales = pd.read_csv("../input/m5-forecasting-accuracy/sales_train_evaluation.csv")
calendar = pd.read_csv("../input/m5-forecasting-accuracy/calendar.csv")
sell_prices = pd.read_csv("../input/m5-forecasting-accuracy/sell_prices.csv")

In [None]:
train_sales.shape, calendar.shape,sell_prices.shape

In [None]:
train_sales.info()

In [None]:
calendar.info()

In [None]:
sell_prices.info()

In [None]:
train_sales.head()

There are lots of zeros in the "d_x" columns of the dataset. These columns are nothing but the sales values on a given day; a zero signifies that the item was either not available on that day or was not sold because there was no demand.
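To put a number on "lots of zeros", one can compute the share of zero entries across the day columns. The sketch below uses a tiny made-up frame (`toy` and its values are purely illustrative); the same two expressions should work on `train_sales` with its real `d_` columns.

```python
import pandas as pd

# Toy stand-in for train_sales: two items, four days (values are invented)
toy = pd.DataFrame({
    "id": ["item_a", "item_b"],
    "d_1": [0, 2],
    "d_2": [0, 0],
    "d_3": [1, 0],
    "d_4": [0, 3],
})

day_columns = [c for c in toy.columns if c.startswith("d_")]

# Overall fraction of zero-sale entries
zero_share = (toy[day_columns] == 0).to_numpy().mean()

# Fraction of zero-sale days per item
zero_share_per_item = (toy.set_index("id")[day_columns] == 0).mean(axis=1)

print(zero_share)
print(zero_share_per_item)
```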

In [None]:
calendar.head()

In [None]:
sell_prices.head()

# Let's check for null values

In [None]:
train_sales.isnull().sum().sort_values(ascending = False)

In [None]:
sell_prices.isnull().sum().sort_values(ascending = False)

In [None]:
calendar.isnull().sum().sort_values(ascending = False)

# Memory Reduction

We have a huge dataset to work with. Before feeding it into a model, we are going to "melt" it, which converts the dataframe from wide format to long format. I have kept id, item_id, dept_id, cat_id, store_id and state_id as the id variables; combined, they identify 30490 unique series. The competition covers 1969 days in total, so the fully melted dataframe would have 30490 x 1969 = 60,034,810 rows. The file loaded here, sales_train_evaluation.csv, contains the first 1941 of those days (the last 28 are the days to forecast), so melting it gives 30490 x 1941 = 59,181,090 rows.
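As a small illustration of the wide-to-long conversion and the row-count arithmetic, here is a sketch on an invented three-item, four-day frame (the names `wide` and `long_df` and all values are made up for this example):

```python
import pandas as pd

# Wide format: one row per item, one column per day
wide = pd.DataFrame({
    "id": ["a", "b", "c"],
    "store_id": ["CA_1", "CA_1", "TX_1"],
    "d_1": [0, 1, 2],
    "d_2": [3, 0, 0],
    "d_3": [0, 0, 1],
    "d_4": [5, 2, 0],
})

# Long format: one row per (item, day) pair
long_df = pd.melt(wide,
                  id_vars=["id", "store_id"],
                  var_name="d",
                  value_name="sales")

n_days = 4
print(long_df.shape)                       # (12, 4)
print(len(long_df) == len(wide) * n_days)  # True
```

The number of rows after melting is always the number of id rows times the number of day columns, which is where the row counts quoted above come from.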

In order to process such a huge dataset, we need to reduce the memory usage.

In [None]:
# memory usage reduction
def downcast(df):
    cols = df.dtypes.index.tolist()
    types = df.dtypes.values.tolist()
    for i,t in enumerate(types):
        if 'int' in str(t):
            if df[cols[i]].min() > np.iinfo(np.int8).min and df[cols[i]].max() < np.iinfo(np.int8).max:
                df[cols[i]] = df[cols[i]].astype(np.int8)
            elif df[cols[i]].min() > np.iinfo(np.int16).min and df[cols[i]].max() < np.iinfo(np.int16).max:
                df[cols[i]] = df[cols[i]].astype(np.int16)
            elif df[cols[i]].min() > np.iinfo(np.int32).min and df[cols[i]].max() < np.iinfo(np.int32).max:
                df[cols[i]] = df[cols[i]].astype(np.int32)
            else:
                df[cols[i]] = df[cols[i]].astype(np.int64)
        elif 'float' in str(t):
            if df[cols[i]].min() > np.finfo(np.float16).min and df[cols[i]].max() < np.finfo(np.float16).max:
                df[cols[i]] = df[cols[i]].astype(np.float16)
            elif df[cols[i]].min() > np.finfo(np.float32).min and df[cols[i]].max() < np.finfo(np.float32).max:
                df[cols[i]] = df[cols[i]].astype(np.float32)
            else:
                df[cols[i]] = df[cols[i]].astype(np.float64)
        elif t == object:  # np.object was removed from NumPy; the builtin object works the same way here
            if cols[i] == 'date':
                df[cols[i]] = pd.to_datetime(df[cols[i]], format='%Y-%m-%d')
            else:
                df[cols[i]] = df[cols[i]].astype('category')
    return df  

In [None]:
# calling memory reduction function for each data set
train_sales = downcast(train_sales)
sell_prices = downcast(sell_prices)
calendar = downcast(calendar)
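To check how much a downcast actually saves, memory can be measured before and after the conversion. The following is a sketch on a small invented frame, not a measurement of the competition data; `toy` and `slim` are hypothetical names, and the conversions are applied by hand rather than through `downcast`.

```python
import numpy as np
import pandas as pd

# Toy frame: small integers stored as int64, and a repetitive string column
toy = pd.DataFrame({
    "sales": np.arange(100, dtype=np.int64) % 50,
    "store_id": ["CA_1", "TX_1"] * 50,
})

before = toy.memory_usage(deep=True).sum()

slim = toy.copy()
slim["sales"] = slim["sales"].astype(np.int8)            # values 0..49 fit in int8
slim["store_id"] = slim["store_id"].astype("category")   # two distinct labels

after = slim.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```

`memory_usage(deep=True)` is used so that the size of the Python strings in the object column is included in the comparison.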

# Exploratory Data Analysis

In [None]:
# let's save the day columns (d_1, d_2, ...) to a list
d_cols = [c for c in train_sales.columns if c.startswith('d_')]

In [None]:
# let's save the ids of the top 3 selling items to be analysed later
top3 = train_sales.set_index("id")[d_cols].sum(1).sort_values(ascending  = False)[:3].index

# Melting the dataframe

In [None]:
grid_df = pd.melt(train_sales, 
                  id_vars = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'], 
                  var_name = 'd', 
                  value_name = "sales")

In [None]:
group = grid_df.groupby(['state_id','store_id','cat_id','dept_id'],as_index=False)['sales'].sum().dropna()
group['USA'] = 'United States of America'
group.rename(columns={'state_id':'State','store_id':'Store','cat_id':'Category','dept_id':'Department'},inplace=True)
fig = px.treemap(group, path=['USA','State', 'Store', 'Category', 'Department'], values='sales',
                  color='sales',
                  title='Sum of sales across whole USA/different States/Stores/Categories/Departments')
fig.update_layout(template='seaborn')
fig.show()

In [None]:
del train_sales
gc.collect()

In [None]:
# let's drop the columns we are not going to use for EDA
calendar.drop(['wm_yr_wk','weekday','wday','month','year','event_name_1','event_type_1', 'event_name_2','event_type_2'],axis=1,inplace=True)

# Create a master dataset by merging melted dataset and the calendar dataset

In [None]:
master = pd.merge(grid_df,calendar, on = "d")
master.head()

In [None]:
del grid_df
gc.collect()

# Helper Functions
    1. sales: Returns the total daily sales for one value of a given column (a state, category, store or item), indexed by date
    2. decompose: Decomposes the given time series into "trend", "seasonal" and "residual" components and plots them together with the observed series
    3. random_color: Picks a random color for the graph calling this function.

In [None]:
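def sales(feat,param):
    # .copy() avoids pandas' SettingWithCopyWarning when a column is assigned below
    sales_df = master.loc[master[feat] == param].copy()
    sales_df['date'] = pd.to_datetime(sales_df['date'])
    # total sales per day for the selected rows
    sales_df = sales_df.groupby('date')['sales'].sum().reset_index()
    sales_df = sales_df.set_index('date')
    return sales_df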

In [None]:
from itertools import cycle, islice
def decompose(y):
    rcParams['figure.figsize'] = 18, 8
    decomposition = sm.tsa.seasonal_decompose(y, model='additive')
    fig = decomposition.plot()
    plt.show()

In [None]:
def random_color():
    colors = ["blue","black","brown","red","yellow","green","orange","turquoise","magenta","cyan"]
    return random.choice(colors)

# STATE WISE SALES

Let's take a look at the state wise sales

We will preprocess our data a little before moving forward. Daily data can be tricky to work with because it is noisy at such a fine granularity, so let's use monthly averages instead. We'll make the conversion with the resample function.
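As a quick illustration of what `resample('MS').mean()` does, here is a sketch on an invented daily series (the start date and values are only for illustration). `'MS'` groups the days by calendar month and labels each group with the first day of that month.

```python
import numpy as np
import pandas as pd

# 90 consecutive days with made-up sales values 0, 1, ..., 89
dates = pd.date_range("2011-01-29", periods=90, freq="D")
daily = pd.Series(np.arange(90, dtype=float), index=dates, name="sales")

# One value per calendar month, labelled with the month start
monthly = daily.resample("MS").mean()
print(monthly)
```

Note that the first and last months in this toy series are incomplete, so their averages are based on fewer days; the same caveat applies to any real series that starts or ends mid-month.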

In [None]:
# list of unique states
master.state_id.unique()

In [None]:
CA = sales("state_id","CA") # create a dataframe for the state CA
y_ca = CA['sales'].resample('MS').mean() # taking monthly average
colour = random_color()
ax = y_ca.plot(figsize=(15, 6),color = colour,title = ("Sales for the state of CA"))
ax.set_ylabel("Sales") # label the y-axis (assigning to plt.ylabel would overwrite the function)
plt.show()

Some distinguishable patterns appear when we plot the data. 
The time series shows a seasonal pattern: sales tend to be low at the beginning of the year and high around the middle of the year. 
There also appears to be an upward trend within each year.
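One simple way to check a claim like "low at the beginning of the year, high in the middle" is to average the monthly series by calendar month. The sketch below does this on a synthetic series built to have exactly that shape; `pattern`, `y` and `profile` are hypothetical names, and the numbers are invented rather than taken from the sales data.

```python
import numpy as np
import pandas as pd

# Four years of synthetic monthly values: lowest in January, highest in July
idx = pd.date_range("2011-01-01", periods=48, freq="MS")
pattern = np.array([-3, -2, -1, 0, 1, 2, 3, 2, 1, 0, -1, -2], dtype=float)
y = pd.Series(100 + np.tile(pattern, 4), index=idx)

# Average value for each calendar month (1 = January, ..., 12 = December)
profile = y.groupby(y.index.month).mean()

lowest_month = profile.idxmin()
highest_month = profile.idxmax()
print(lowest_month, highest_month)  # 1 7
```

Applied to a series such as `y_ca`, the same `groupby(y.index.month).mean()` profile would show whether the low-start, high-middle pattern holds on average.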

We can also visualize our data using a method called time-series decomposition that allows us to decompose our time series into three distinct components: trend, seasonality, and noise.
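To make the idea concrete, here is a rough sketch of a classical additive decomposition written with pandas only. It is not the statsmodels implementation, which may differ in details; `additive_decompose` is a hypothetical helper, it assumes an even period such as 12, and the input series is synthetic.

```python
import numpy as np
import pandas as pd

def additive_decompose(y, period=12):
    """Split y into trend + seasonal + resid (classical additive model, even period)."""
    # Trend: centred 2 x period moving average
    trend = y.rolling(period).mean().rolling(2).mean().shift(-(period // 2))
    detrended = y - trend
    # Seasonal: average detrended value at each position in the cycle, centred at zero
    position = np.arange(len(y)) % period
    seasonal_means = detrended.groupby(position).mean()
    seasonal_means = seasonal_means - seasonal_means.mean()
    seasonal = pd.Series(seasonal_means.to_numpy()[position], index=y.index)
    # Residual ("noise"): whatever trend and seasonality do not explain
    resid = y - trend - seasonal
    return trend, seasonal, resid

# Synthetic monthly series: linear trend plus a repeating yearly pattern, no noise
idx = pd.date_range("2011-01-01", periods=48, freq="MS")
pattern = np.array([-3, -2, -1, 0, 1, 2, 3, 2, 1, 0, -1, -2], dtype=float)
y = pd.Series(100 + 0.5 * np.arange(48) + np.tile(pattern, 4), index=idx)

trend, seasonal, resid = additive_decompose(y)
print(resid.dropna().abs().max())  # close to zero, since the toy series has no noise
```

The first and last six months of the trend (and therefore of the residual) are missing because the centred moving average needs half a period of data on each side.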

In [None]:
decompose(y_ca)

The decomposition above suggests that sales in CA are unstable (the trend changes over time) and that there is a clear seasonal component.

In [None]:
WI = sales("state_id","WI")
y_wi = WI['sales'].resample('MS').mean()
colour = random_color()
ax = y_wi.plot(figsize=(15, 6),color = colour,title = ("Sales for the state of WI"))
ax.set_ylabel("Sales") # label the y-axis (assigning to plt.ylabel would overwrite the function)
plt.show()

As with CA, some distinguishable patterns appear when we plot the data. The time series shows a seasonal pattern: sales tend to be low at the beginning of the year and high around the middle of the year. There also appears to be an upward trend within each year.

In [None]:
decompose(y_wi)

The decomposition above suggests that sales in WI are unstable (the trend changes over time) and that there is a clear seasonal component.

In [None]:
TX = sales("state_id","TX")
y_tx = TX['sales'].resample('MS').mean()
colour = random_color()
y_tx.plot(figsize=(15, 6),color = colour,title = ("Sales for the state of TX"))
plt.show()

Sales are not very different from the other states.

In [None]:
decompose(y_tx)

In [None]:
del CA,WI,TX
gc.collect()

# CATEGORY WISE SALES

Let's take a look at the category wise sales

In [None]:
# list of unique categories
master.cat_id.unique()

In [None]:
foods = sales("cat_id","FOODS")
y_f = foods['sales'].resample('MS').mean()
colour = random_color()
y_f.plot(figsize=(15, 6),color = colour,title = ("Sales for the category:FOODS"))
plt.show()

In [None]:
decompose(y_f)

In [None]:
hobbies = sales("cat_id","HOBBIES")
y_hb = hobbies['sales'].resample('MS').mean()
colour = random_color()
ax = y_hb.plot(figsize=(15, 6),color = colour,title = ("Sales for the category:HOBBIES"))
ax.set_ylabel("Sales") # label the y-axis (assigning to plt.ylabel would overwrite the function)
plt.show()

In [None]:
decompose(y_hb)

In [None]:
household = sales("cat_id","HOUSEHOLD")
y_hh = household['sales'].resample('MS').mean()
colour = random_color()
y_hh.plot(figsize=(15, 6),color = colour,title = ("Sales for the category:HOUSEHOLD"))
plt.show()

In [None]:
decompose(y_hh)

In [None]:
del foods,hobbies,household,y_f,y_hb,y_hh
gc.collect()

# STORE WISE SALES

In [None]:
# list of unique stores
master.store_id.unique()

In [None]:
CA_1 = sales("store_id","CA_1")
y_CA1 = CA_1['sales'].resample('MS').mean()
colour = random_color()
y_CA1.plot(figsize=(15, 6),color = colour,title = ("Sales for the store:CA_1"))
plt.show()

In [None]:
decompose(y_CA1)

In [None]:
CA_2 = sales("store_id","CA_2")
y_CA2 = CA_2['sales'].resample('MS').mean()
colour = random_color()
y_CA2.plot(figsize=(15, 6),color = colour,title = ("Sales for the store:CA_2"))
plt.show()

In [None]:
decompose(y_CA2)

In [None]:
CA_3 = sales("store_id","CA_3")
y_CA3 = CA_3['sales'].resample('MS').mean()
colour = random_color()
y_CA3.plot(figsize=(15, 6),color = colour,title = "Sales for the store:CA_3")
plt.show()

In [None]:
decompose(y_CA3)

In [None]:
CA_4 = sales("store_id","CA_4")
y_CA4 = CA_4['sales'].resample('MS').mean()
colour = random_color()
y_CA4.plot(figsize=(15, 6),color = colour,title = ("Sales for the store:CA_4"))
plt.show()

In [None]:
decompose(y_CA4)

In [None]:
TX_1 = sales("store_id","TX_1")
y_TX1 = TX_1['sales'].resample('MS').mean()
colour = random_color()
y_TX1.plot(figsize=(15, 6),color = colour,title = ("Sales for the store:TX_1"))
plt.show()

In [None]:
decompose(y_TX1)

In [None]:
TX_2 = sales("store_id","TX_2")
y_TX2 = TX_2['sales'].resample('MS').mean()
colour = random_color()
ax = y_TX2.plot(figsize=(15, 6),color = colour,title = ("Sales for the store:TX_2"))
ax.set_ylabel("Sales") # label the y-axis (assigning to plt.ylabel would overwrite the function)
plt.show()

In [None]:
decompose(y_TX2)

In [None]:
TX_3 = sales("store_id","TX_3")
y_TX3 = TX_3['sales'].resample('MS').mean()
colour = random_color()
ax = y_TX3.plot(figsize=(15, 6),color = colour,title = ("Sales for the store:TX_3"))
ax.set_ylabel("Sales") # label the y-axis (assigning to plt.ylabel would overwrite the function)
plt.show()

In [None]:
decompose(y_TX3)

In [None]:
WI_1 = sales("store_id","WI_1")
y_WI1 = WI_1['sales'].resample('MS').mean()
colour = random_color()
ax = y_WI1.plot(figsize=(15, 6),color = colour,title = ("Sales for the store:WI_1"))
ax.set_ylabel("Sales") # label the y-axis (assigning to plt.ylabel would overwrite the function)
plt.show()

In [None]:
decompose(y_WI1)

In [None]:
WI_2= sales("store_id","WI_2")
y_WI2 = WI_2['sales'].resample('MS').mean()
colour = random_color()
ax = y_WI2.plot(figsize=(15, 6),color = colour,title = ("Sales for the store:WI_2"))
ax.set_ylabel("Sales") # label the y-axis (assigning to plt.ylabel would overwrite the function)
plt.show()

In [None]:
decompose(y_WI2)

In [None]:
WI_3= sales("store_id","WI_3")
y_WI3 = WI_3['sales'].resample('MS').mean()
colour = random_color()
ax = y_WI3.plot(figsize=(15, 6),color = colour,title = ("Sales for the store:WI_3"))
ax.set_ylabel("Sales") # label the y-axis (assigning to plt.ylabel would overwrite the function)
plt.show()

In [None]:
decompose(y_WI3)

In [None]:
del CA_1,CA_2,CA_3,CA_4,TX_1,TX_2,TX_3,WI_1,WI_2,WI_3
gc.collect()

# TOP SELLING PRODUCTS

In [None]:
top = sales("id",top3[0])
y_top = top['sales'].resample('MS').mean()
colour = random_color()
y_top.plot(figsize=(15, 6),color = colour,title = ("Sales for the Product:" + top3[0]))
plt.show()

In [None]:
top = sales("id",top3[1])
y_top = top['sales'].resample('MS').mean()
colour = random_color()
y_top.plot(figsize=(15, 6),color = colour,title = ("Sales for the Product:" + top3[1]))
plt.show()

In [None]:
top = sales("id",top3[2])
y_top = top['sales'].resample('MS').mean()
colour = random_color()
y_top.plot(figsize=(15, 6),color = colour,title = ("Sales for the Product:" + top3[2]))
plt.show()

In [None]:
del top3,y_top
gc.collect()

# PRICE DISTRIBUTION

In [None]:
colour = random_color()
sns.distplot(sell_prices["sell_price"],color = colour).set_title("Price Distribution")

In [None]:
colour = random_color()
CA_1= sell_prices[sell_prices["store_id"] == "CA_1"]
sns.distplot(CA_1["sell_price"],color = colour).set_title("Price Distribution for CA_1")

In [None]:
colour = random_color()
CA_2= sell_prices[sell_prices["store_id"] == "CA_2"]
sns.distplot(CA_2["sell_price"],color = colour).set_title("Price Distribution for CA_2")

In [None]:
colour = random_color()
CA_3= sell_prices[sell_prices["store_id"] == "CA_3"]
sns.distplot(CA_3["sell_price"],color = colour).set_title("Price Distribution for CA_3")

In [None]:
colour = random_color()
CA_4= sell_prices[sell_prices["store_id"] == "CA_4"]
sns.distplot(CA_4["sell_price"],color = colour).set_title("Price Distribution for CA_4")

In [None]:
colour = random_color()
TX_1= sell_prices[sell_prices["store_id"] == "TX_1"]
sns.distplot(TX_1["sell_price"],color = colour).set_title("Price Distribution for TX_1")

In [None]:
colour = random_color()
TX_2= sell_prices[sell_prices["store_id"] == "TX_2"]
sns.distplot(TX_2["sell_price"],color = colour).set_title("Price Distribution for TX_2")

In [None]:
colour = random_color()
TX_3= sell_prices[sell_prices["store_id"] == "TX_3"]
sns.distplot(TX_3["sell_price"],color = colour).set_title("Price Distribution for TX_3")

In [None]:
colour = random_color()
WI_1= sell_prices[sell_prices["store_id"] == "WI_1"]
sns.distplot(WI_1["sell_price"],color = colour).set_title("Price Distribution for WI_1")

In [None]:
colour = random_color()
WI_2= sell_prices[sell_prices["store_id"] == "WI_2"]
sns.distplot(WI_2["sell_price"],color = colour).set_title("Price Distribution for WI_2")

In [None]:
colour = random_color()
WI_3= sell_prices[sell_prices["store_id"] == "WI_3"]
sns.distplot(WI_3["sell_price"],color = colour).set_title("Price Distribution for WI_3")

In [None]:
del sell_prices,CA_1,CA_2,CA_3,CA_4,TX_1,TX_2,TX_3,WI_1,WI_2,WI_3
gc.collect()

In [None]:
import gc
gc.collect()

Link to model building notebook: https://www.kaggle.com/jagdmir/m5-forecasting-part-two-lgbm-regressor