# Introduction
M5 is the fifth iteration of the M Competitions. It consists of two complementary competitions: one on point forecasts and one on estimating the uncertainty of those forecasts.

### Data
- The data provided in this competition is hierarchical: it starts at the item level and aggregates to departments, product categories, and stores in three geographical areas of the US: California, Texas, and Wisconsin. The data is made available by Walmart Labs.
- Besides the time series data, it also includes explanatory variables such as price, promotions, day of the week, and special events (e.g. Super Bowl, Valentine's Day, and Orthodox Easter) that affect sales and can be used to improve forecasting accuracy.
- The majority of the more than 42,840 time series display intermittency (sporadic sales including zeros).

The following datasets are available for this competition:

|S. no. |Dataset| Description |
|-------|-------|-------------|
|1| calendar.csv | Contains information about the dates on which the products are sold |
|2| sales_train_validation.csv | Contains the historical daily unit sales data per product and store (d_1 - d_1913) |
|3| sample_submission.csv | Sample submission file |
|4| sell_prices.csv | Contains information about the price of the products sold per store and date |
|5| sales_train_evaluation.csv | Available one month before the competition deadline. Will include sales for d_1 - d_1941 |

In [None]:
# importing the libraries

import os
import math
import numpy as np                                 # linear algebra
import pandas as pd                                # dataframes
import matplotlib.pyplot as plt                    # visualizations
import seaborn as sns
import ipywidgets as widgets                       # interactive jupyter widgets
from IPython.display import clear_output

from scipy import stats                            # statistics
from datetime import datetime, date, timedelta     # time
from dateutil.relativedelta import relativedelta
from statsmodels.tsa.seasonal import STL           # time-series decomposition

from matplotlib.patches import Polygon

In [None]:
# setting up the notebook parameters

root_dir = "/kaggle/input/m5-forecasting-accuracy"
print("root directory = {}".format(root_dir))

plt.rcParams["figure.figsize"] = (16, 8)
sns.set_style("darkgrid")
pd.set_option("display.max_rows", 20, "display.max_columns", None)

In [None]:
######################################
######## Helper Functions ############
######################################


def info_df(df):
    """
    returns the dataframe describing nulls and unique counts
    inp: dataframe
    returns: dataframe with unique and null counts
    """
    return pd.DataFrame({
        "uniques": df.nunique(),
        "nulls": df.isnull().sum(),
        "nulls (%)": 100 * df.isnull().mean()    # share of nulls, expressed as a percentage
    }).T


def reduce_mem_usage(df, verbose=True):
    """
    reduces the mem usage by performing certain coercion operations
    inp: dataframe,
         verbose (whether to print the info regarding mem reduction or Not)
    returns: dataframe
    """
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

---
# 01. Importing the Datasets

In [None]:
# defining the datasets
path_calendar_df = os.path.join(root_dir, "calendar.csv")
path_sales_train_validation_df = os.path.join(root_dir, "sales_train_validation.csv")
path_sell_prices_df = os.path.join(root_dir, "sell_prices.csv")
path_sample_submission_df = os.path.join(root_dir, "sample_submission.csv")

# importing the dataset
calendar_df = pd.read_csv(path_calendar_df, parse_dates = ["date"])
sales_sample_df = pd.read_csv(path_sales_train_validation_df)
sell_prices_df = pd.read_csv(path_sell_prices_df)
sample_submission_df = pd.read_csv(path_sample_submission_df)

# optimising the mem usage
calendar_df = reduce_mem_usage(calendar_df, verbose = True)
sales_sample_df = reduce_mem_usage(sales_sample_df, verbose = True)
sell_prices_df = reduce_mem_usage(sell_prices_df, verbose = True)
sample_submission_df = reduce_mem_usage(sample_submission_df, verbose = True)

## 1.1. Calendar

The calendar dataset contains information about the dates on which the products are sold. For each day it records which event, if any, fell on that day, and a second event when two fall on the same day. It also flags SNAP days in each of the three states. SNAP stands for *Supplemental Nutrition Assistance Program*, which provides a monthly supplement for purchasing nutritious food. This information can be very useful, as SNAP days might turn out to be a driving factor for sales.

In [None]:
calendar_df.head(5)

In [None]:
info_df(calendar_df)

### Column Descriptions
The following table lists the attributes found in the dataset along with their descriptions:

|S.no.|Column Name| Descriptions|
|-----|-----------|-------------|
|01. | date | Date |
|02. | wm_yr_wk | Week identifier; appears to combine the year and the week number |
|03. | weekday | Day of the week |
|04. | wday| Weekday encoded as an integer |
|05. | month | Month of the year |
|06. | year | Year |
|07. | d | Day index in absolute terms (`d_1`, `d_2`, ...). All values are unique |
|08. | event_name_1 | Name of the primary event, e.g. Super Bowl etc. 29 events and 1 null |
|09. | event_type_1 | Type of event: Sporting, Cultural, National or Religious |
|10. | event_name_2 | Second event, if any. Only 5 values, the rest are null |
|11. | event_type_2 | Type of the second event |
|12. | snap_CA | Whether it is a SNAP day in California |
|13. | snap_TX | Whether it is a SNAP day in Texas |
|14. | snap_WI | Whether it is a SNAP day in Wisconsin |

## 1.2. Sales Train Validation
The sales train validation dataset contains the historical daily unit sales per product and per store. It lists unit sales for 1913 days across the stores. The other levels of the product hierarchy provided are department, category and state. One more attribute, `id`, seems to combine `item_id` (which already embeds the category and department), `store_id` (which embeds the state) and a validation flag. The data is provided in wide format and should be converted to long format for further analysis.
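
To make the apparent structure of `id` concrete, the sketch below splits one example identifier into its parts. The helper name `split_id` and the example string are assumptions made for illustration, and the layout `<item_id>_<store_id>_<flag>` is inferred from the columns described here rather than taken from official documentation.

```python
def split_id(row_id):
    """Split an id such as 'HOBBIES_1_001_CA_1_validation' into its parts.

    Assumes the layout <cat>_<dept no>_<item no>_<state>_<store no>_<flag>.
    """
    cat, dept_no, item_no, state, store_no, flag = row_id.split("_")
    return {
        "item_id": f"{cat}_{dept_no}_{item_no}",
        "dept_id": f"{cat}_{dept_no}",
        "cat_id": cat,
        "store_id": f"{state}_{store_no}",
        "state_id": state,
        "flag": flag,
    }


parts = split_id("HOBBIES_1_001_CA_1_validation")
print(parts)
```

If the real identifiers follow a different layout, the tuple unpacking raises a `ValueError`, which makes a wrong assumption visible early.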

In [None]:
sales_sample_df.head()

In [None]:
info_df(sales_sample_df)

### Column Description
The following table lists the attributes found in the dataset along with their descriptions:

| S.no. | Column Name | Description |
|-------|-------------|-------------|
| 01. | id | Combination of the IDs below and a validation flag |
| 02. | item_id | Item ID |
| 03. | dept_id | Department ID |
| 04. | cat_id | Category ID |
| 05. | store_id | Store ID |
| 06. | state_id | State ID |
| 07. | d_1 to d_1913 | Day-wise units sold |

## 1.3. Sell Prices
This dataset contains the price of each product. We are also provided with `store_id` and `item_id` as hierarchy levels. The time column in this dataset is `wm_yr_wk`, which identifies the week along with the year.

In [None]:
sell_prices_df.head()

In [None]:
info_df(sell_prices_df)

### Column Description
The following table lists the attributes found in the dataset along with their descriptions:

|S.no.| Column Name | Description|
|-----|-------------|------------|
| 01. | store_id | Maps to store_id of the Sales Train Validation table |
| 02. | item_id | Item ID |
| 03. | wm_yr_wk | Described above |
| 04. | sell_price | Price during that particular wm_yr_wk |
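
Because prices are keyed by week while sales are daily, one way to attach a price to every day is to join through the calendar on `wm_yr_wk`. The following is a minimal sketch on made-up toy frames (`calendar_toy`, `prices_toy` and all values in them are assumptions for illustration), not the merge used in this notebook.

```python
import pandas as pd

# toy stand-ins for calendar_df and sell_prices_df (values are made up)
calendar_toy = pd.DataFrame({
    "date": pd.date_range("2011-01-29", periods=9, freq="D"),
    "wm_yr_wk": [11101] * 7 + [11102] * 2,
})
prices_toy = pd.DataFrame({
    "store_id": ["CA_1", "CA_1"],
    "item_id": ["FOODS_1_001", "FOODS_1_001"],
    "wm_yr_wk": [11101, 11102],
    "sell_price": [2.0, 2.24],
})

# every day inherits the price of the week it belongs to
daily_prices = calendar_toy.merge(prices_toy, on="wm_yr_wk", how="left")
print(daily_prices[["date", "wm_yr_wk", "sell_price"]])
```

A left join keeps every calendar day, so days without a recorded price show up as `NaN` instead of silently disappearing.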

## 1.4. Sample Submission File

In [None]:
sample_submission_df.head()

---
# 02. Exploratory Data Analysis - Univariate

Because we want to do time series analysis on this data, the long format is a better representation than the current wide format. To convert the dataset from wide to long, we can use the `pd.melt()` function. Before converting, we need to divide the attributes into two sets: identifier variables and value variables. For the reverse conversion, from long to wide, we can use the `pd.pivot()` function.
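
The round trip between the two layouts can be sketched on a tiny made-up frame. The frame `wide` and its values are assumptions for illustration; only `pd.melt()` and `pd.pivot()` come from the text above.

```python
import pandas as pd

# toy wide table: one row per item, one column per day (values are made up)
wide = pd.DataFrame({
    "item_id": ["A", "B"],
    "d_1": [0, 3],
    "d_2": [1, 0],
    "d_3": [2, 5],
})

# wide -> long: identifier columns stay, day columns are stacked into rows
long_df = pd.melt(wide,
                  id_vars=["item_id"],
                  value_vars=["d_1", "d_2", "d_3"],
                  var_name="day_number",
                  value_name="units_sold")

# long -> wide: the inverse operation
back = pd.pivot(long_df, index="item_id", columns="day_number", values="units_sold")
print(long_df)
print(back)
```

Passing `var_name` and `value_name` directly avoids a separate rename step after melting.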

In [None]:
# # create a smaller sample of sales_train_validation_df, to comply with memory demands
# sales_sample_df = sales_sample_df.sample(100)    

In [None]:
print("Final len of the Dataset after unpivoting would be = {}".format(sales_sample_df.shape[0] * 1913))

# for unpivoting, we need to define the variables into two sets, id_vars and value_vars
value_variables = [col for col in sales_sample_df.columns if col.startswith("d_")]
identifier_variables = [col for col in sales_sample_df.columns if col not in value_variables] 

# converting the df from wide to long
sales_sample_df = pd.melt(sales_sample_df, 
                          id_vars = identifier_variables, 
                          value_vars = value_variables)

print("Actual Shape after unpivoting = {}".format(sales_sample_df.shape))

# changing the variable name to apt names
sales_sample_df = sales_sample_df.rename(columns = {"variable": "day_number", "value": "units_sold"})

In [None]:
# creating a date column
earliest_date = date(2011, 1, 29)
date_dict = {}                    # a dictionary to map the day_number values to real dates
for i in list(sales_sample_df["day_number"].unique()):
    dn_int = int(i[2:]) - 1                                   # indexing the string value to delete "d_" from the day_number and converting it to int
                                                              # subtracting 1 because "d_1" would be our zeroth day. 
    date_ = earliest_date + timedelta(days = dn_int)
    date_dict[i] = date_

# mapping the dictionary to dataframe
sales_sample_df["date"] = sales_sample_df["day_number"].map(date_dict)
sales_sample_df["date"] = pd.to_datetime(sales_sample_df["date"])

In [None]:
sales_sample_df.head()

### 2.1. Item_ID level Time series aggregation

In this section, we look at the time series aggregated at the item ID level. This helps us understand how sales trend at the item level. The code snippet below uses a dropdown widget to display the visualizations cleanly.

In [None]:
ALL = "ALL"
def unique_sorted_value_fn(array):
    """
    returns unique sorted values
    inp: array
    return array with unique values
    """
    unique_arr = array.unique().tolist()
    unique_arr.sort()
    unique_arr.insert(0, ALL)   # if all values are to be selected
    return unique_arr

# initialize the dropdown
dropdown_item_id = widgets.Dropdown(options = unique_sorted_value_fn(sales_sample_df["item_id"]))

item_id_plot = widgets.Output()

def dropdown_item_id_eventhandler(change):
    item_id_plot.clear_output()
    with item_id_plot:
        if (change.new == ALL):
            plot_df = sales_sample_df
        else:
            plot_df = sales_sample_df[sales_sample_df["item_id"] == change.new]
        sns.lineplot(x = "date", y = "units_sold", hue = "item_id", data = plot_df)
        plt.show()    # render the figure inside the output widget in both cases
            
dropdown_item_id.observe(dropdown_item_id_eventhandler, names='value')

In [None]:
display(dropdown_item_id)

In [None]:
display(item_id_plot)

### Insights:-
1. The scale of most of the time series is small, with units sold being less than 10 on most of the days.
2. Many time series are dominated by zero sales.
3. Not every time series covers the whole period. For example, `item_id == HOBBIES_1_114` has zeros for the whole of 2011 and the start of 2012, which might signify that the product was launched in early 2012, so no data exists before that.
4. So, we have to deal with two challenges in this problem:
    - The data has a lot of zeros and sudden spikes, hence we need a model that is robust to noise and can learn from intermittent data
    - Deal with long gaps at the start of time series
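
One possible way to handle the long gaps at the start of a series is to drop everything before the first non-zero sale. The sketch below does this on a made-up series; treating the first sale as the launch date is an assumption, not something the data confirms.

```python
import numpy as np
import pandas as pd

# toy daily sales with a run of zeros before the (assumed) launch
sales = pd.Series([0, 0, 0, 2, 0, 1, 0, 0, 4])

# position of the first non-zero sale; note that argmax returns 0
# for an all-zero series, so that case needs separate handling
first_sale_pos = int(np.argmax(sales.to_numpy() > 0))
active = sales.iloc[first_sale_pos:]

# share of zero days before and after trimming the leading gap
zero_share_all = (sales == 0).mean()
zero_share_active = (active == 0).mean()
print(first_sale_pos, zero_share_all, zero_share_active)
```

The remaining zeros after trimming reflect genuine intermittency rather than the product not yet being on sale.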

### 2.2 Aggregated Data

Checking the trends and patterns on the aggregated data. Because the data at `item_id` level doesn't give us much info, we might find some useful information and statistics at the aggregated level.

In [None]:
sns.lineplot(x = "date", y = "units_sold", data = sales_sample_df)
plt.title("M5 - aggregated data")
plt.show()

### Insights:
1. There is a clear upward trend in the aggregated sales.
2. Year on year, the time series goes through similar crests and troughs, which points to seasonal patterns.
3. Every year, sales on Christmas Day are zero, because the stores are closed on that day.

### 2.3. Sales by State

In [None]:
sns.lineplot(x = "date", y = "units_sold", hue = "state_id", data = sales_sample_df)
plt.title("State wise aggregated data")
plt.show()

The above graph doesn't provide a very clear picture. The data is easier to read when it is aggregated to the monthly level.
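
As a compact illustration of daily-to-monthly aggregation, the sketch below sums made-up daily sales per state and month. The frame `daily`, the column `month_start` and all values are assumptions for illustration; it is an alternative formulation, not the code used in this notebook.

```python
import pandas as pd

# toy daily sales for two states (values are made up)
daily = pd.DataFrame({
    "state_id": ["CA"] * 4 + ["TX"] * 4,
    "date": pd.to_datetime(["2011-01-30", "2011-01-31",
                            "2011-02-01", "2011-02-02"] * 2),
    "units_sold": [1, 2, 3, 4, 10, 20, 30, 40],
})

# map every date to the first day of its month, then sum per state and month
monthly = (
    daily
    .assign(month_start=daily["date"].dt.to_period("M").dt.to_timestamp())
    .groupby(["state_id", "month_start"], as_index=False)["units_sold"]
    .sum()
)
print(monthly)
```

Anchoring each month at its first day gives every state the same x-axis positions when the result is plotted.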

In [None]:
# creating a new dataframe for statewise aggregation
statewise_df = sales_sample_df.groupby(["state_id", "date"]).agg({
    "units_sold": "sum"
}).reset_index()

# extracting month and year from date for group by purposes
statewise_df["day"] = statewise_df["date"].dt.day
statewise_df["month"] = statewise_df["date"].dt.month
statewise_df["year"] = statewise_df["date"].dt.year

# aggregating on month level for each state
statewise_df = statewise_df.groupby(["month", "year", "state_id"]).agg({
    "units_sold": "sum", 
    "day": "first"
}).reset_index()

statewise_df["date"] = pd.to_datetime(statewise_df["year"].astype("str") + "-" + \
                                      statewise_df["month"].astype("str") + "-" + \
                                      statewise_df["day"].astype("str"))

In [None]:
sns.lineplot(x = "date", y = "units_sold", hue = "state_id", data = statewise_df)
plt.title("Statewise sales trend")
plt.show()

### Insights:
1. Sales in California clearly lead the other two states, while Wisconsin has been doing better than Texas in recent times.
2. Barring the peak in the first half of 2014, sales across CA remained more or less the same.

In [None]:
del statewise_df

### 2.4. Store-wise Aggregation

As shown below, there are a total of 10 stores: 4 in CA and 3 each in TX and WI.

In [None]:
sales_sample_df.groupby("state_id").agg({"store_id": "nunique"})

In [None]:
# creating a new dataframe for storewise aggregation
storewise_df = sales_sample_df.groupby(["state_id", "store_id", "date"]).agg({
    "units_sold": "sum"
}).reset_index()

# extracting month and year from date for group by purposes
storewise_df["day"] = storewise_df["date"].dt.day
storewise_df["month"] = storewise_df["date"].dt.month
storewise_df["year"] = storewise_df["date"].dt.year

# aggregating on month level for each state and store
storewise_df = storewise_df.groupby(["month", "year", "state_id", "store_id"]).agg({
    "units_sold": "sum", 
    "day": "first"
}).reset_index()

storewise_df["date"] = pd.to_datetime(storewise_df["year"].astype("str") + "-" + \
                                      storewise_df["month"].astype("str") + "-" + \
                                      storewise_df["day"].astype("str"))

In [None]:
state_list = list(storewise_df["state_id"].unique())
for i in range(1, 4):
    plt.subplot(3, 1, i)
    sns.lineplot(x = "date", 
                 y = "units_sold", 
                 hue = "store_id", 
                 data = storewise_df[storewise_df["state_id"] == state_list[i - 1]])
    plt.title("Store wise trend in {}".format(state_list[i - 1]))
plt.tight_layout()
plt.show()    # show once, after all three subplots are drawn

### Insights:-
1. The sudden peak in California sales was mostly contributed by CA_1. This could be due to some offer or discount running at the time.
2. TX_3 is clearly ahead of the other stores in the state.
3. WI_2 peaks around 2013 and 2014 but levels off with the others by 2016.

In [None]:
del storewise_df

### 2.5. Category-wise Aggregation

In the dataset, we have 3 categories, Households, Foods and Hobbies. 

In [None]:
# creating a new dataframe for category-wise aggregation
catwise_df = sales_sample_df.groupby(["state_id", "cat_id", "date"]).agg({
    "units_sold": "sum"
}).reset_index()

# extracting month and year from date for group by purposes
catwise_df["day"] = catwise_df["date"].dt.day
catwise_df["month"] = catwise_df["date"].dt.month
catwise_df["year"] = catwise_df["date"].dt.year

# aggregating on month level for each state and category
catwise_df = catwise_df.groupby(["month", "year", "state_id", "cat_id"]).agg({
    "units_sold": "sum", 
    "day": "first"
}).reset_index()

catwise_df["date"] = pd.to_datetime(catwise_df["year"].astype("str") + "-" + \
                                    catwise_df["month"].astype("str") + "-" + \
                                    catwise_df["day"].astype("str"))

In [None]:
sns.lineplot(x = "date", y = "units_sold", hue = "cat_id", data = catwise_df)
plt.title("Category-wise Sales")
plt.show()

In [None]:
for i in range(1, 4):
    plt.subplot(3, 1, i)
    sns.lineplot(x = "date", 
                 y = "units_sold", 
                 hue = "cat_id", 
                 data = catwise_df[catwise_df["state_id"] == state_list[i - 1]])
    plt.title("Category wise trend in {}".format(state_list[i - 1]))
plt.tight_layout()
plt.show()    # show once, after all three subplots are drawn

### Insights
1. "Foods" leads sales in every state overall. The difference is considerable in WI and TX, but in CA the sales of "Foods" and "Households" are quite close. CA is the only state where "Households" was ever ahead, which happened before 2013.
2. "Hobbies" has a similar trend in almost every state.

### 2.6. Department-wise Aggregation
In the dataset, we have 7 different departments: three belong to "Foods", while "Hobbies" and "Households" have two each.

In [None]:
sales_sample_df.groupby("cat_id").agg({"dept_id": "nunique"})

In [None]:
# creating a new dataframe for department-wise aggregation
deptwise_df = sales_sample_df.groupby(["cat_id", "dept_id", "date"]).agg({
    "units_sold": "sum"
}).reset_index()

# extracting month and year from date for group by purposes
deptwise_df["day"] = deptwise_df["date"].dt.day
deptwise_df["month"] = deptwise_df["date"].dt.month
deptwise_df["year"] = deptwise_df["date"].dt.year

# aggregating on month level for each category and department
deptwise_df = deptwise_df.groupby(["month", "year", "cat_id", "dept_id"]).agg({
    "units_sold": "sum", 
    "day": "first"
}).reset_index()

deptwise_df["date"] = pd.to_datetime(deptwise_df["year"].astype("str") + "-" + \
                                     deptwise_df["month"].astype("str") + "-" + \
                                     deptwise_df["day"].astype("str"))

In [None]:
cat_list = list(sales_sample_df["cat_id"].unique())
for i in range(1, 4):
    plt.subplot(3, 1, i)
    sns.lineplot(x = "date", 
                 y = "units_sold", 
                 hue = "dept_id", 
                 data = deptwise_df[deptwise_df["cat_id"] == cat_list[i - 1]])
    plt.title("Department wise trend in {}".format(cat_list[i - 1]))
plt.tight_layout()
plt.show()    # show once, after all three subplots are drawn

### Insights:-
1. `FOODS_3` forms the major chunk of sales in the Foods category. Similarly, `HOUSEHOLD_1` and `HOBBIES_1` dominate their respective categories.
2. `HOBBIES_2` is very close to zero, meaning that sales might be zero on the majority of days.

---
# 03. Exploratory Data Analysis - Time Series Decomposition

This section illustrates the use of `STL` to decompose a time series into three components: *trend*, *seasonal* and *residual*. STL uses **LOESS (locally estimated scatterplot smoothing)** to extract smooth estimates of the three components. The key inputs into STL are:
- `seasonal` - The length of the seasonal smoother. Must be odd.
- `trend` - The length of the trend smoother, usually around 150% of `seasonal`. Must be odd and larger than `seasonal`.
- `low_pass` - The length of the low-pass estimation window, usually the smallest odd number larger than the periodicity of the data.

In [None]:
# creating a new dataframe with aggregated sales.
stl_df = sales_sample_df[["date", "units_sold"]].set_index("date")
stl_df = stl_df.resample("D").sum()
stl_df.head()

In [None]:
stl = STL(stl_df, seasonal = 7)
res = stl.fit()
fig = res.plot()

In [None]:
del stl_df

### Finding more evidence for weekly seasonality
We assumed a seasonality of 7, i.e. weekly. We can examine this assumption further by checking how sales vary by `day_of_week` and whether that pattern is consistent over time.
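
One more optional check for weekly seasonality is the autocorrelation at lag 7, which should be high when the same weekday pattern repeats. The sketch below uses a made-up, perfectly periodic series, so the numbers only illustrate the idea and say nothing about the M5 data itself.

```python
import numpy as np
import pandas as pd

# toy series repeating the same weekly pattern for ten weeks
pattern = [1, 1, 1, 1, 2, 4, 3]
series = pd.Series(np.tile(pattern, 10), dtype="float64")

# correlation of the series with itself shifted by 7 and by 3 days
acf_lag7 = series.autocorr(lag=7)
acf_lag3 = series.autocorr(lag=3)
print(acf_lag7, acf_lag3)
```

On real sales data the lag-7 value will be below one, but a clear peak at multiples of seven would support the weekly assumption.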

In [None]:
sales_sample_df["day_of_week"] = sales_sample_df["date"].dt.weekday
sales_sample_df["month"] = sales_sample_df["date"].dt.month
sales_sample_df["year"] = sales_sample_df["date"].dt.year

In [None]:
week_month_pivot = sales_sample_df.pivot_table(index = "day_of_week", 
                                               columns = "month", 
                                               values = "units_sold", 
                                               aggfunc = "sum")

week_month_pivot.columns = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
week_month_pivot.index = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

sns.heatmap(week_month_pivot, linewidth = 0.2, cmap="YlGnBu")
plt.title("Performance of sales for Day of Week aggregated on monthly basis")
plt.show()

#### Insights:-
1. Sales on weekends are considerably and consistently larger than on weekdays.
2. The months of Jan, May, and July show a clear dip in sales for every day of the week.

In [None]:
year_month_pivot = sales_sample_df.pivot_table(index = "month", 
                                               columns = "year", 
                                               values = "units_sold", 
                                               aggfunc = "sum")
year_month_pivot.index = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

sns.heatmap(year_month_pivot, linewidth = 0.2, cmap="YlGnBu")
plt.title("Performance of sales for Month aggregated on yearly basis")
plt.show()

#### Insights:-
1. The data we have runs until Apr 2016, hence the heatmap is empty after that.
2. August and December deliver consistently high numbers.
3. Sales usually peak around the later months of the year. This could be attributed to festivities and various promotion events falling in that part of the year.

In [None]:
del week_month_pivot, year_month_pivot

In [None]:
temp_list = ["day_of_week", "month"]
for i in range(0, 2):
    plt.subplot(2, 1, i + 1)
    tempdf = sales_sample_df.groupby([temp_list[i], "state_id"]).agg({
        "units_sold": "mean"
    }).reset_index()
    sns.lineplot(x = temp_list[i], y = "units_sold", hue = "state_id", data = tempdf)
    plt.title("units sold trend - {}".format(temp_list[i]))
plt.tight_layout()
plt.show()    # show once, after both subplots are drawn

#### Insights:
1. The weekly pattern of sales is similar for all the states.
2. Texas usually shows an increasing trend in the first half of the year and a decreasing trend in the second half. For the others, sales remain more or less constant throughout the year.
---

# 04. Exploring Price and Calendar Events
This section explores the additional datasets provided to us, i.e. the calendar and prices datasets. The prices dataset contains the price of the products sold per store and date, and the calendar dataset contains information about the dates on which the products are sold.

In [None]:
calendar_df.head()

In [None]:
# string flag ("True"/"False") capturing whether an event exists on that day
event1_bool = calendar_df["event_name_1"].notna().astype(str)

# inserting the above flag in calendar_df
calendar_df.insert(loc = 9, column = "event_bool_1", value = event1_bool)

In [None]:
# plot distribution of event days
plt.subplot(1, 2, 1)
sns.countplot(x = calendar_df["event_bool_1"], palette = "Set2")    # pass the column as x explicitly
plt.title("Frequency of events and non-events")
plt.xlabel("Whether event is there or Not")

plt.subplot(1, 2, 2)
sns.countplot(y = calendar_df["event_type_1"], palette = "Set2")
plt.title("Frequency of the types of events")
plt.ylabel("Event Type")

plt.show()

In [None]:
# plot distribution of snap days across three states
plt.subplot(1, 3, 1)
sns.countplot(x = calendar_df["snap_CA"], palette = "RdBu")
plt.title("Frequency plot - Snap days across California")

plt.subplot(1, 3, 2)
sns.countplot(x = calendar_df["snap_TX"], palette = "RdBu")
plt.title("Frequency plot - Snap days across Texas")

plt.subplot(1, 3, 3)
sns.countplot(x = calendar_df["snap_WI"], palette = "RdBu")
plt.title("Frequency plot - Snap days across Wisconsin")

plt.show()

#### Insights:-
1. There are a total of 162 event days in the dataset, which form 8.2% of all days.
2. The share of SNAP days is similar across the three states.

In [None]:
def generate_data(df, date_col, data_col):
    """
    converts the pd dataframe into numpy arrays of data and respective dates
    
    inp: df (dataframe)
         date_col (column which contains the dates)
         data_col (data to be mapped)
    returns: data_arr (array of data)
             dates (array of dates)
    """
    data_arr = np.array(df[data_col])
    data_len = len(data_arr)
    start_date = df[date_col].iloc[0]
    dates = [start_date + timedelta(days = i) for i in range(data_len)]
    return data_arr, dates


def calendar_array_fn(date_arr, data_arr):
    """
    returns an array of shape (-1, 7)
    
    inp: date_arr (array of dates)
         data_arr (array of data)
    returns: i, j (indices)
             calendar (array of data of shape (-1, 7))
    """
    
    # return the date as an ISO calendar (year, week, day)
    i, j = zip(*[date.isocalendar()[1:] for date in date_arr])
    i = np.array(i) - min(i) 
    j = np.array(j) - 1
    max_i = max(i) + 1
    calendar = np.nan * np.zeros((max_i, 7))  # creating empty arrays
    calendar[i, j] = data_arr                 # creating a data matrix
    
    return i, j, calendar


def label_days(ax, date_arr, i, j, calendar):
    """
    creates label for days
    
    inp: ax,
         date_arr (array of dates),
         i, j (indices),
         calendar (calendar array, returned by calendar_arr_fn())
    returns: nothing
    """
    ni, nj = calendar.shape                          # len and width of the matrix
    day_of_month_arr = np.nan * np.zeros((ni, 7))    # initializing day_of_month array
    day_of_month_arr[i, j] = [date.day for date in date_arr]
    
    # ndenumerate - multi index iterator
    for (i, j), day in np.ndenumerate(day_of_month_arr):
        # following condition checks if the thing is not NaN
        if np.isfinite(day):
            ax.text(j, i, 
                    int(day), 
                    ha = "center", 
                    va = "center")
    # defining x-axis labels
    ax.set(xticks = np.arange(7), 
           xticklabels = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
    ax.xaxis.tick_top()
    
    
def label_months(ax, date_arr, i, j, calendar):
    """
    creates labels for months
    
    inp: ax,
         date_arr (array of dates),
         i, j (indices),
         calendar (calendar array, returned by calendar_arr_fn())
    returns: nothing
    """
    months_labels = np.array([
        "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
    ])                                                        # month names
    months_arr = np.array([date.month for date in date_arr])     # extracting months from dates
    unique_months = sorted(set(months_arr))                   # get unique months                  
    yticks = [i[months_arr == m].mean() for m in unique_months]   
    labels = [months_labels[m - 1] for m in unique_months]
    ax.set(yticks = yticks)
    ax.set_yticklabels(labels, rotation = 90)
    
    
def calendar_heatmap(ax, date_arr, data_arr):
    i, j, calendar = calendar_array_fn(date_arr, data_arr)
    im = ax.imshow(calendar, interpolation = "none", cmap = "summer")
    label_days(ax, date_arr, i, j, calendar)
    label_months(ax, date_arr, i, j, calendar)
    # uncomment following line if you want colorbars
    #ax.figure.colorbar(im)         
    
def plot_calmap(ax, year, data_col):
    """
    main function for plotting calendar heatmaps
    
    inp: year (which year to be plotted)
         data_col (data column)
    returns nothing
    """
    data_arr, date_arr = generate_data(df = calendar_df[calendar_df["year"] == year], 
                                       date_col = "date", 
                                       data_col = data_col)
    calendar_heatmap(ax, date_arr, data_arr)
    ax.set_title("{} distribution in the year {}".format(data_col, year))

In [None]:
fig, ax = plt.subplots(nrows = 1, ncols = 3, figsize = (19, 20))
plot_calmap(ax[0], year = 2011, data_col = "snap_CA")
plot_calmap(ax[1], year = 2011, data_col = "snap_TX")
plot_calmap(ax[2], year = 2011, data_col = "snap_WI")

#### Insights:-
1. The above plot shows that SNAP days fall on the same dates of the month throughout the year. For California, SNAP days fall on the first 10 days of the month. For Texas, they fall on the 1st, 3rd, 5th, 6th, 7th, 9th, 11th, 12th, 13th, and 15th of the month. Similarly, for Wisconsin, SNAP days are the 2nd, 3rd, 5th, 6th, 8th, 9th, 11th, 12th, 14th and 15th of every month.
2. In every state, SNAP days fall on or before the 15th of the month.
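
The day-of-month pattern can also be read off numerically instead of from the heatmap. The sketch below builds a made-up calendar whose `snap_CA` flag follows the California pattern described above (days 1 to 10) and lists the distinct days of the month that are flagged. The frame `cal` is an assumption for illustration; it is not `calendar_df`.

```python
import pandas as pd

# toy calendar covering two months
cal = pd.DataFrame({"date": pd.date_range("2011-02-01", "2011-03-31", freq="D")})

# made-up flag following the California pattern: first 10 days of each month
cal["snap_CA"] = (cal["date"].dt.day <= 10).astype(int)

# distinct days of the month on which the flag is set
snap_days = sorted(cal.loc[cal["snap_CA"] == 1, "date"].dt.day.unique().tolist())
print(snap_days)
```

Running the same two lines against each `snap_*` column would give the per-state lists without reading them off a plot.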

## 4.2. sell_prices_df

In [None]:
sell_prices_df.head()

In [None]:
# creating a few additional columns to aid in analysis below

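sell_prices_df["state"] = sell_prices_df["store_id"].str[:2]
# note: dropping the last four characters of item_id (e.g. "_001") leaves the
# department id (e.g. "HOBBIES_1"); the column is still named cat_id because
# the plots below filter on it under that name
sell_prices_df["cat_id"] = sell_prices_df["item_id"].str[:-4]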

In [None]:
# plotting the distribution of various stores in a state

plt.figure(figsize = (20, 16))
plt.subplots_adjust(hspace = 0.5)

plt.subplot(4, 3, 1)
for i in ["CA_1", "CA_2", "CA_3", "CA_4"]:
    sns.distplot(sell_prices_df[sell_prices_df["store_id"] == i]["sell_price"], label = i)
    plt.legend()
    plt.xlabel("Sell Price")
    plt.title("Store-wise price distribution - California")

plt.subplot(4, 3, 2)
for j in ["TX_1", "TX_2", "TX_3"]:
    sns.distplot(sell_prices_df[sell_prices_df["store_id"] == j]["sell_price"], label = j)
    plt.legend()
    plt.xlabel("Sell Price")
    plt.title("Store-wise price distribution - Texas")

plt.subplot(4, 3, 3)
for k in ["WI_1", "WI_2", "WI_3"]:
    sns.distplot(sell_prices_df[sell_prices_df["store_id"] == k]["sell_price"], label = k)
    plt.legend()
    plt.xlabel("Sell Price")
    plt.title("Store-wise price distribution - Wisconsin")
    
plt.subplot(4, 3, 4)
for i in ["HOBBIES_1", "HOBBIES_2"]:
    sns.distplot(sell_prices_df[(sell_prices_df["cat_id"] == i) & (sell_prices_df["state"] == "CA")]["sell_price"], label = i)
    plt.legend()
    plt.xlabel("Sell Price")
    plt.title("Category-wise price distribution in California- Hobbies")
    
plt.subplot(4, 3, 5)
for i in ["HOBBIES_1", "HOBBIES_2"]:
    sns.distplot(sell_prices_df[(sell_prices_df["cat_id"] == i) & (sell_prices_df["state"] == "TX")]["sell_price"], label = i)
    plt.legend()
    plt.xlabel("Sell Price")
    plt.title("Category-wise price distribution in Texas- Hobbies")
    
plt.subplot(4, 3, 6)
for i in ["HOBBIES_1", "HOBBIES_2"]:
    sns.distplot(sell_prices_df[(sell_prices_df["cat_id"] == i) & (sell_prices_df["state"] == "WI")]["sell_price"], label = i)
    plt.legend()
    plt.xlabel("Sell Price")
    plt.title("Category-wise price distribution in Wisconsin- Hobbies")

plt.subplot(4, 3, 7)
for j in ["HOUSEHOLD_1", "HOUSEHOLD_2"]:
    sns.distplot(sell_prices_df[(sell_prices_df["cat_id"] == j) & (sell_prices_df["state"] == "CA")]["sell_price"], label = j)
    plt.legend()
    plt.xlabel("Sell Price")
    plt.title("Category-wise price distribution in California - Households")

plt.subplot(4, 3, 8)
for j in ["HOUSEHOLD_1", "HOUSEHOLD_2"]:
    sns.distplot(sell_prices_df[(sell_prices_df["cat_id"] == j) & (sell_prices_df["state"] == "TX")]["sell_price"], label = j)
    plt.legend()
    plt.xlabel("Sell Price")
    plt.title("Category-wise price distribution in Texas - Households")  
    
plt.subplot(4, 3, 9)
for j in ["HOUSEHOLD_1", "HOUSEHOLD_2"]:
    sns.distplot(sell_prices_df[(sell_prices_df["cat_id"] == j) & (sell_prices_df["state"] == "WI")]["sell_price"], label = j)
    plt.legend()
    plt.xlabel("Sell Price")
    plt.title("Category-wise price distribution in Wisconsin - Households")

plt.subplot(4, 3, 10)
for k in ["FOODS_1", "FOODS_2", "FOODS_3"]:
    sns.distplot(sell_prices_df[(sell_prices_df["cat_id"] == k) & (sell_prices_df["state"] == "CA")]["sell_price"], label = k)
    plt.legend()
    plt.xlabel("Sell Price")
    plt.title("Category-wise price distribution in California - Foods")
    
plt.subplot(4, 3, 11)
for k in ["FOODS_1", "FOODS_2", "FOODS_3"]:
    sns.distplot(sell_prices_df[(sell_prices_df["cat_id"] == k) & (sell_prices_df["state"] == "TX")]["sell_price"], label = k)
    plt.legend()
    plt.xlabel("Sell Price")
    plt.title("Category-wise price distribution in Texas - Foods")
    
plt.subplot(4, 3, 12)
for k in ["FOODS_1", "FOODS_2", "FOODS_3"]:
    sns.distplot(sell_prices_df[(sell_prices_df["cat_id"] == k) & (sell_prices_df["state"] == "WI")]["sell_price"], label = k)
    plt.legend()
    plt.xlabel("Sell Price")
    plt.title("Category-wise price distribution in Wisconsin - Foods")

#### Insights:-
1. The probability distribution of `sell_price` is almost identical in all three states. The difference is that the tail of the distribution is longer in Wisconsin. This might be due to the retailing strategy of selling many unique items in relatively small quantities, usually in addition to selling fewer popular items in large quantities. This [Wired article](https://www.wired.com/2004/10/tail/) is a good introduction to the long tail phenomenon.
2. Food prices have strikingly similar distributions across states, with both the range and the peaks occurring at the same places. This might be because food prices are generally similar across states and somewhat regulated.
3. Households have the largest variation in prices. They might be the ones contributing to the long tails in the PDFs of prices.
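
A simple way to put a number on a longer tail is to compare a central quantile with an upper quantile per state. The sketch below does this on made-up prices in which one state has a single expensive item. The frame `prices` and its values are assumptions for illustration and do not reflect the actual M5 prices.

```python
import pandas as pd

# toy prices: identical bulk, but WI has one much more expensive item
prices = pd.DataFrame({
    "state": ["CA"] * 5 + ["WI"] * 5,
    "sell_price": [1.0, 2.0, 3.0, 4.0, 5.0,
                   1.0, 2.0, 3.0, 4.0, 50.0],
})

# median and 99th percentile per state, one row per state
tails = prices.groupby("state")["sell_price"].quantile([0.5, 0.99]).unstack()
print(tails)
```

Equal medians with a much larger upper quantile is the numeric counterpart of two distributions that look alike except for the tail.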