# Walmart Recruiting: Store Sales Forecasting
Technical exercise for Zé Delivery selection process. <img style="padding-left:20px; width:75px; display: inline;" src="https://courier-images-web.imgix.net/static/img/small-logo.png?auto=compress,format&fit=max&w=undefined&h=undefined&dpr=2&fm=png">

***Otávio Vasques*** | September 2020

[Github](https://github.com/otaviocv) | [LinkedIn](https://www.linkedin.com/in/otaviocv/) | [Email](mailto:otaviocv.deluqui@gmail.com)



## Introduction

The problem we want to solve is forecasting sales values for Walmart stores and departments, which is a *regression* problem. We want to forecast future sales based on the store configuration and on environmental variables for each store, not just on previous historical data.

My solution will follow a very standard outline. It starts with a basic exploratory data analysis to understand the data, capture particular characteristics of the features, identify which feature engineering techniques will be required, and understand what information will be available in the prediction environment. The next step is a dummy baseline to establish the worst-case scenario. Then I will build a benchmark with a very simple regularized linear model to define the reference I will build upon. The last modeling step will be a more powerful and complex model that is widely adopted among machine learning practitioners thanks to its features: XGBoost.

These three "models" will be then compared in a results section and a final conclusions section will end my analysis and summarize all my thoughts.


### Challenges

**Stores and Departments**

Predictions are made for subcategories within the stores: their departments. This is a problem because all feature values are tied to a store as a whole, not to an individual department. It will be necessary to include this distinction in the prediction process.

**Time Axis**

The data is naturally indexed in time, which leaves space for multiple strategies based on time series feature extraction and time series models such as the AR, MA and ARIMA models.
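As a minimal sketch of what time series feature extraction could look like here (an illustration on toy data, not a step used later in this notebook), lagged and rolling sales can be computed per store and department with pandas. The column names `lag_1` and `rolling_mean_2` are hypothetical.

```python
import pandas as pd

# Toy data: two store/department series with four weeks each (made-up values).
toy = pd.DataFrame({
    "Store": [1, 1, 1, 1, 2, 2, 2, 2],
    "Dept": [1, 1, 1, 1, 1, 1, 1, 1],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-19", "2010-02-26"] * 2),
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0, 100.0, 200.0, 300.0, 400.0],
})

toy = toy.sort_values(["Store", "Dept", "Date"])
grouped = toy.groupby(["Store", "Dept"])["Weekly_Sales"]

# Sales of the previous week within the same store/department series.
toy["lag_1"] = grouped.shift(1)

# Mean of the two previous weeks; shifting first avoids leaking the current week.
toy["rolling_mean_2"] = grouped.transform(lambda s: s.shift(1).rolling(2).mean())
```

The first weeks of each series have no history, so these features start as NaN and would need the same kind of missing-value handling as the MarkDown columns.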


### Approaches and solution candidates

My solution will follow three basic steps with minor feature engineering. As my first step I will make a basic, very grounded benchmark based on mean and standard values to establish our starting point. In a real-world scenario, where the team would first be concerned with integrations and data collection, this could be a dummy model to fill the gaps while a more proper solution is being developed.

My second step will be a linear model, more specifically an Elastic Net: a linear model with two regularization terms, L1 and L2. This is a well-established and well-known solution that meets very little resistance even within the most conservative teams. It will serve as our real baseline model, which we will improve upon.
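For reference, the objective minimized by an Elastic Net can be written as follows. This uses scikit-learn's parameterization, where $\alpha$ is the overall regularization strength and $\rho$ (the `l1_ratio` parameter) balances the L1 and L2 terms; the intercept is omitted for brevity.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

With $\rho = 1$ this reduces to the Lasso penalty, and with $\rho = 0$ to a Ridge-style penalty.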

My third step will be the Kaggle standard for data without a particular information structure (such as images or audio data): tree-based gradient boosting. My choice is the XGBoost implementation, which provides multiple features for controlling overfitting and tree growth and has a great implementation for high-performance prediction scenarios.

I will make use of an explanation technique called [SHAP](https://github.com/slundberg/shap), based on game theory. It is very handy for understanding the local contributions of individual features, and it works really well for linear models and tree-based models.

### Metrics and performance measures

The competition metric is a *Weighted Mean Absolute Error* where errors on holiday weeks weigh 5 times as much as errors on regular weeks. This means that predicting sales for holidays is very important. I will also use the classic *Mean Absolute Error* for a general comparison, to understand the overall behavior of the models without focusing too much on the holidays.
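Written out, with $y_i$ the actual sales, $\hat{y}_i$ the predicted sales, and $n$ the number of rows, the metric described above is:

$$
\mathrm{WMAE} = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \, \lvert y_i - \hat{y}_i \rvert,
\qquad
w_i =
\begin{cases}
5 & \text{if week } i \text{ is a holiday week} \\
1 & \text{otherwise}
\end{cases}
$$

Setting every $w_i = 1$ recovers the plain Mean Absolute Error.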

### Data

Data description ([link](https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data)). **This is just a copy and paste from the data description page of the original problem for easier reference**.

**stores**

This file contains anonymized information about the 45 stores, indicating the type and size of store.

**features**

This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

    Store - the store number
    Date - the week
    Temperature - average temperature in the region
    Fuel_Price - cost of fuel in the region
    MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
    CPI - the consumer price index
    Unemployment - the unemployment rate
    IsHoliday - whether the week is a special holiday week

For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):

* Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
* Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
* Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
* Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

**train**

This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

    Store - the store number
    Dept - the department number
    Date - the week
    Weekly_Sales -  sales for the given department in the given store
    IsHoliday - whether the week is a special holiday week

**test**

This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.

### Imports

In [None]:
import os
import itertools
from tqdm.notebook import tqdm
import pandas as pd
import numpy as np

from sklearn.metrics import mean_absolute_error

from sklearn.linear_model import ElasticNet
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import KFold, GridSearchCV, ParameterGrid
import xgboost as xgb
import catboost as cat

import shap

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### Parameters

In [None]:
data_path = '/kaggle/input/walmart-recruiting-store-sales-forecasting/'

### Understanding Kaggle environment

In [None]:
!ls {data_path}

In [None]:
for dirname, _, filenames in os.walk(data_path):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
!pwd

### Data

In [None]:
stores = pd.read_csv(data_path + "stores.csv")
features = pd.read_csv(data_path + "features.csv.zip")
train_data = pd.read_csv(data_path + "train.csv.zip")
test_data = pd.read_csv(data_path + "test.csv.zip")
sample_submission = pd.read_csv(data_path + "sampleSubmission.csv.zip")

In [None]:
features.head()

In [None]:
train_data.head()

In [None]:
test_data.head()

All data has been loaded correctly from the zip files.

#### Holidays

* Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
* Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
* Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
* Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13


In [None]:
holidays_dict = {
    "2010-02-12": "Super Bowl",
    "2011-02-11": "Super Bowl",
    "2012-02-10": "Super Bowl",
    "2013-02-08": "Super Bowl",
    "2010-09-10": "Labor Day",
    "2011-09-09": "Labor Day",
    "2012-09-07": "Labor Day",
    "2013-09-06": "Labor Day",  # 6-Sep-13, as in the holiday list above
    "2010-11-26": "Thanksgiving",
    "2011-11-25": "Thanksgiving",
    "2012-11-23": "Thanksgiving",
    "2013-11-29": "Thanksgiving",
    "2010-12-31": "Christmas",
    "2011-12-30": "Christmas",
    "2012-12-28": "Christmas",
    "2013-12-27": "Christmas"
}
# Build the datetime index from the dictionary keys so both always stay in sync.
holidays = pd.to_datetime(list(holidays_dict))

### Weighted Mean Absolute Error Metric function

Just checking how to use scikit-learn's mean absolute error with weights.

In [None]:
y_true = np.array([1, 2, 3, 4, 5, 6])
y_pred = np.array([2, 3, 0, 5, 6, 1])   # absolute errors: 1 + 1 + 3 + 1 + 1 + 5 = 12
                                        # weighted errors: 1 + 1 + 15 + 1 + 1 + 25 = 44
weights = np.array([1, 1, 5, 1, 1, 5])  # sum of weights: 1 + 1 + 5 + 1 + 1 + 5 = 14

# np.isclose avoids relying on exact floating-point equality
assert np.isclose(mean_absolute_error(y_true, y_pred), 2)
assert np.isclose(mean_absolute_error(y_true, y_pred, sample_weight=weights), 44 / 14)
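As a cross-check that does not depend on scikit-learn, the same number can be obtained with plain NumPy. This is only a sketch: the helper name `weighted_mae` and its `holiday_weight` parameter are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np

def weighted_mae(y_true, y_pred, is_holiday, holiday_weight=5):
    """Weighted MAE where holiday rows count `holiday_weight` times."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Holiday rows get the larger weight, every other row gets weight 1.
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1)
    return np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)

# Same toy values as the scikit-learn check: rows 3 and 6 are "holidays".
y_true = [1, 2, 3, 4, 5, 6]
y_pred = [2, 3, 0, 5, 6, 1]
is_holiday = [False, False, True, False, False, True]
wmae = weighted_mae(y_true, y_pred, is_holiday)
```

Here `wmae` equals 44/14, the same value asserted against scikit-learn in the cell above.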

### Plot Holidays Function

In [None]:
f, ax = plt.subplots(figsize=(5,1))

def plot_holidays(holidays, ax=plt, color="red", linestyle="--", linewidth=0.5, **kwargs):
    for hday in holidays:
        ax.axvline(x=hday, color=color, linestyle=linestyle, linewidth=linewidth, **kwargs)
        
plot_holidays(holidays, ax=ax)

## Exploratory Data Analysis

In this section, I will make my best effort to understand and gain intuition about data. This will help my future modeling decisions and will provide material to understand the models' behavior.

### Stores

In [None]:
stores.head()

In [None]:
stores.dtypes

In [None]:
stores.shape

In [None]:
stores.Type.value_counts(normalize=True, dropna=False)

In [None]:
f, ax = plt.subplots(figsize=(8, 3), dpi=140)
sns.distplot(stores.Size);

In [None]:
stores.Size.describe()

There are 45 stores, and each one has a *type* and a *size*. The meaning of these variables isn't provided. We can only say that the type takes one of three values (A, B, or C) and that the size ranges from ~63,825 to 220,000.

### NANS Viz

In [None]:
f, ax = plt.subplots(1, 2, figsize=(10, 5), dpi=150)
sns.heatmap(features.sort_values("Store").isna(), ax=ax[0], cmap="plasma", cbar=None, yticklabels=False);
ax[0].set_title("NANs sorted by store")
sns.heatmap(features.sort_values("Date").isna(), ax=ax[1], cmap="plasma", yticklabels=False);
ax[1].set_title("NANs sorted by date");

In [None]:
features.isna().mean()

Almost half of the MarkDown values are missing. What does this mean? The data description says that this data is only available after Nov. 2011. Since 2011 falls relatively in the middle of our time window, almost half of these values are missing. A dedicated approach for these values will be necessary for the linear model.
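One way to make this concrete is to compare the share of missing values before and after the availability date. The sketch below runs on a small made-up table; the cutoff `2011-11-01` is an assumption based on the "only available after Nov 2011" note in the data description, and the `period` labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the features table (values are made up).
toy_features = pd.DataFrame({
    "Date": pd.to_datetime(["2011-10-28", "2011-11-04", "2011-11-11", "2011-11-18"]),
    "MarkDown1": [np.nan, np.nan, 1200.0, 800.0],
    "MarkDown2": [np.nan, np.nan, np.nan, 50.0],
})
markdown_cols = ["MarkDown1", "MarkDown2"]

# Assumed cutoff, taken from the data description.
cutoff = pd.Timestamp("2011-11-01")
period = np.where(toy_features["Date"] < cutoff, "before", "after")

# Fraction of missing values per MarkDown column in each period.
missing_share = toy_features[markdown_cols].isna().groupby(period).mean()
```

On the real table, a share close to 1 in the "before" rows would confirm that most of the missing values come from the period without MarkDown data rather than from random gaps.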

### Features

This is the main table. It has two main indexes that will match the prediction dataset: the week of prediction and the store number. All values in this table are the same for a given store in a given week, while the train data is further broken down by department. I will have to figure out how to include this department information in our final models.

All feature values, for both the train and test datasets, are mixed in a single table. In a future section, we will need to split them into train features, train target, and test features.

In [None]:
features.head()

#### Time Axis

In [None]:
train_data.Date = pd.to_datetime(train_data.Date)
test_data.Date = pd.to_datetime(test_data.Date)
features.Date = pd.to_datetime(features.Date)
features_dates = features.groupby("Date", as_index=False).agg({"Store": "nunique"})
#features_dates.Date = pd.to_datetime(features_dates.Date)
features_dates.plot(x="Date", y="Store");

In [None]:
print("Min: ", features_dates.Date.min(), "Max: ", features_dates.Date.max())

#### Temperature

In [None]:
f, ax = plt.subplots(figsize=(12, 4), dpi=150)

for year in range(features_dates.Date.min().year, features_dates.Date.max().year+1):
    plot_data = features[features.Date.dt.year == year].groupby("Date", as_index=False).agg({"Temperature": "mean"})
    ax.plot(plot_data.Date.dt.dayofyear, plot_data.Temperature, label=year)

ax.legend()
ax.set_title("Temperature over Time")
ax.set_ylabel("Temperature [°F]")
ax.set_xlabel("Day of year");

The problem description doesn't state whether the stores are restricted to the US. Assuming that the Walmart stores are located in the US, or at least in the Northern Hemisphere, the graph shows the expected temperature curve for the Northern Hemisphere.

#### Unemployment

In [None]:
f, ax = plt.subplots(figsize=(12, 4), dpi=150)

for year in range(features_dates.Date.min().year, features_dates.Date.max().year +1):
    plot_data = features[features.Date.dt.year == year].groupby("Date", as_index=False).agg({"Unemployment": "mean"})
    ax.plot(plot_data.Date.dt.dayofyear, plot_data.Unemployment, label=year)

ax.legend()
ax.set_title("Unemployment over Time")
ax.set_ylabel("Unemployment")
ax.set_xlabel("Day of year");

The graph suggests that the unemployment rate is updated roughly every 3 months and stays constant in between. The check below confirms that each store has a single unemployment value per date.
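A simple way to test the update frequency is to look at the weeks in which the value changes within a store. The following is a sketch on toy data (the numbers are made up, and `changed` and `change_dates` are hypothetical names):

```python
import pandas as pd

# Toy series for a single store: the rate changes once, on the fourth week.
toy = pd.DataFrame({
    "Store": [1] * 6,
    "Date": pd.date_range("2010-02-05", periods=6, freq="W-FRI"),
    "Unemployment": [8.1, 8.1, 8.1, 7.8, 7.8, 7.8],
})
toy = toy.sort_values(["Store", "Date"])

# True on weeks where the value differs from the previous week of the same store.
changed = toy.groupby("Store")["Unemployment"].diff().fillna(0).ne(0)
change_dates = toy.loc[changed, "Date"]
```

On the real table, the spacing between consecutive change dates of a store (for example via `change_dates.diff()`) would indicate how often the rate is updated.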

In [None]:
features.groupby(["Date", "Store"], as_index=False).agg({"Unemployment": "nunique"}).Unemployment.value_counts()

#### Fuel Price

In [None]:
f, ax = plt.subplots(figsize=(12, 4), dpi=150)

for year in range(features_dates.Date.min().year, features_dates.Date.max().year+1):
    plot_data = features[features.Date.dt.year == year].groupby("Date", as_index=False).agg({"Fuel_Price": "mean"})
    ax.plot(plot_data.Date.dt.dayofyear, plot_data.Fuel_Price, label=year)

ax.legend()
ax.set_title("Fuel Price over Time")
ax.set_ylabel("Fuel Price [USD]") # I'm assuming US Dollars
ax.set_xlabel("Day of year");

#### Consumer Price Index

> The Consumer Price Index (CPI) is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services. Indexes are available for the U.S. and various geographic areas. Average price data for select utility, automotive fuel, and food items are also available.

From https://www.bls.gov/cpi/

In [None]:
f, ax = plt.subplots(figsize=(12, 4), dpi=150)

for year in range(features_dates.Date.min().year, features_dates.Date.max().year+1):
    plot_data = features[features.Date.dt.year == year].groupby("Date", as_index=False).agg({"CPI": "mean"})
    ax.plot(plot_data.Date.dt.dayofyear, plot_data.CPI, label=year)
    ax.axhline(y=plot_data.CPI.max(), color="grey", linestyle="--", linewidth=0.2)

ax.legend()
ax.set_title("Consumer Price Index over Time")
ax.set_ylabel("Consumer Price Index")
ax.set_xlabel("Day of year");

All the plots above show values averaged across stores for each date. I plotted the years on the same axis to look for seasonality, but apart from temperature, none of the other features shows a clear seasonal pattern.

#### Markdowns

After a quick search on the internet on "What is a Marketing Markdown?" I'm not sure exactly what kind of Markdowns Walmart refers to, and therefore it is unclear what the numbers mean for each case. Probably they refer to the amount of money spent on the MarkDown type for the given store but there isn't a way to be sure about this.

In [None]:
f, ax = plt.subplots(figsize=(12, 4), dpi=150)

for i in range(1, 6):
    plot_data = features.groupby("Date", as_index=False).agg({"MarkDown" + str(i): "mean"})
    ax.plot(plot_data.Date, plot_data["MarkDown" + str(i)], label="MarkDown" + str(i))
    #ax.axhline(y=plot_data.CPI.max(), color="grey", linestyle="--", linewidth=0.2)

ax.legend()
ax.set_title("Markdowns over Time")
ax.set_ylabel("Markdowns")
ax.set_xlabel("Date");
plot_holidays(holidays[holidays > pd.to_datetime("2011-11-01")], ax)

It is clear that some kinds of MarkDowns are used only in specific circumstances: MarkDown3 (green), for instance, stays flat for the entire year of 2012. It is also clear that MarkDowns were used heavily in the week before or during holiday weeks to improve sales. One way to check their relationship is to plot the correlation matrix between MarkDowns to understand a bit more about how they are used in combination.

In [None]:
markdown_corr = np.zeros((5,5))
markdown_cols = ["MarkDown" + str(i) for i in range(1, 6)]
nan_mask = features.loc[:, markdown_cols].isnull().sum(axis=1) > 0
filtered_markdowns = features.loc[~nan_mask, markdown_cols]

In [None]:
sns.heatmap(filtered_markdowns.corr());

MarkDown1 and MarkDown4 are used together a bit more often than other pairs.

### Target(s) Observation

So far I have analyzed the feature values by themselves, without considering their relationship with sales values. In this next section we explore the sales values themselves and some of their relations with the feature values.

In [None]:
f, ax = plt.subplots(figsize=(12, 4), dpi=130)


scatter_data = train_data.groupby(["Date", "Dept"], as_index=False).Weekly_Sales.agg("median")
palette = sns.color_palette("hls", scatter_data.Dept.nunique())
color_pallete = {d: palette[i] for i, d in enumerate(scatter_data.Dept.unique())}
xjitter = np.random.randint(0, 5, size=len(scatter_data)).astype("timedelta64[D]")  # points fall on the same spot, so give them a little jitter (fewer than 5 days)
color = scatter_data.Dept.apply(lambda x: color_pallete.get(x))
ax.scatter(scatter_data.Date + xjitter, scatter_data.Weekly_Sales, s=0.3, color=color);
ax.set_ylim(0,80000);
ax.set_facecolor('black')

In [None]:
sns.palplot(palette)

Not much of a useful plot, there are a lot of departments and it is hard to identify each of them.

#### Homogeneity between stores and departments

There are a bunch of stores and departments. We know that there are 45 stores and that they have different sizes, so some questions can be posed:
* How many departments are there?
* Are store sales affected by the store size?
    * Do bigger stores sell more?
* Do departments sell about the same in all stores? Are some departments preferred in some stores more than in others?
* Do the departments follow any seasonality pattern?


*Let's answer all these questions!*

**Departments**

In [None]:
train_data.groupby("Store").agg({"Dept": "nunique"}).sort_values("Dept", ascending=False).describe()

In [None]:
f ,ax = plt.subplots(figsize=(12, 2), dpi=180)
train_data.groupby("Store").agg({"Dept": "nunique"}).sort_values("Dept", ascending=False).plot.bar(ax=ax);

The largest stores carry 79 different departments, and every store has at least 61. At least 75% of the stores have 74 departments or more. Do they all have the same departments? Which stores have which departments?
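Note that the number of departments per store and the total number of distinct departments are different quantities: the maximum per-store count is only a lower bound on the total. A minimal sketch on toy data (values are made up) shows the distinction:

```python
import pandas as pd

# Toy stand-in for the train table: three stores with overlapping departments.
toy_train = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 3],
    "Dept": [1, 2, 3, 1, 4, 2],
})

# Number of distinct departments inside each store.
depts_per_store = toy_train.groupby("Store")["Dept"].nunique()

# Number of distinct departments across all stores.
total_depts = toy_train["Dept"].nunique()
```

In this toy table the largest store has 3 departments while 4 distinct departments exist overall. The same two expressions applied to `train_data` would separate the two numbers for the real data.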

In [None]:
dept_fill_matrix = train_data.pivot_table(index="Store", columns="Dept", values="Date", aggfunc="count")
dept_fill_matrix_mask = dept_fill_matrix > 0

In [None]:
f, ax = plt.subplots(figsize=(12, 6), dpi=160)
sns.heatmap(dept_fill_matrix_mask);

The stores have pretty much the same departments. It is clear that all stores have all departments from 1 to 18 and that there are some *rare* departments such as 43 and 61.

**Are store sales affected by the store size?**

Let's take a look at their total yearly sales and compare them to their sizes.

In [None]:
year_sales = train_data.groupby(["Store", train_data.Date.dt.year]).agg({"Weekly_Sales": "sum"}).reset_index()
year_sales = year_sales.merge(stores, on="Store", how="left")

In [None]:
f, axs = plt.subplots(3, 1, figsize=(10, 15), dpi=120)
color_pallete = sns.color_palette("plasma", year_sales.Store.nunique())

for i, year in enumerate(range(features_dates.Date.min().year, features_dates.Date.max().year)):
    ax = axs[i]
    data = year_sales[year_sales.Date == year]
    colors = data.Store.apply(lambda s: color_pallete[s-1])
    ax.scatter(data.Size, data.Weekly_Sales, color=colors)
    ax.set_title(year)

The store size affects sales and, in general, a bigger store sells more.

**Department Stores Sales**

Let's remake the Stores Department Matrix but this time let's put the average sales value in color.

In [None]:
dept_sales_matrix = train_data.pivot_table(index="Store", columns="Dept", values="Weekly_Sales", aggfunc="mean")

In [None]:
columns_sort = dept_sales_matrix.sum(axis=0).sort_values(ascending=False).index.values
rows_sort = dept_sales_matrix.sum(axis=1).sort_values(ascending=False).index.values

In [None]:
f, ax = plt.subplots(figsize=(12, 6), dpi=160)
sns.heatmap(dept_sales_matrix.loc[rows_sort, columns_sort], ax=ax, cmap="plasma");


We can see that the upper left corner gets a lot brighter after sorting values. There are some departments much bigger than others and there are stores much bigger than others. This will guide us when interpreting the department's impact on the models' behavior. Let's see how departments behave with time. More than just the variation with time I want to identify which departments oscillate the most on holidays.

In [None]:
dept_sales_over_time = train_data.groupby(["Date", "Dept"]).agg({"Weekly_Sales": "sum"}).reset_index()
dept_sales_avg = dept_sales_over_time.groupby("Dept", as_index=False).agg(dept_mean=pd.NamedAgg(column='Weekly_Sales', aggfunc='mean'))
dept_sales_over_time = dept_sales_over_time.merge(dept_sales_avg, on="Dept")
# Note: "variance" here is the signed deviation from the department mean, not a statistical variance.
dept_sales_over_time["variance"] = dept_sales_over_time["Weekly_Sales"] - dept_sales_over_time["dept_mean"]
dept_sales_over_time["proportional_variance"] = dept_sales_over_time["variance"] / dept_sales_over_time["dept_mean"]

In [None]:
dept_sales_over_time

In [None]:
f, ax = plt.subplots(3, 1, figsize=(12, 9), dpi=160)
dept_sales_over_time.groupby("Dept").plot(x="Date", y="Weekly_Sales", ax=ax[0]);
ax[0].set_title("Absolute Values")
ax[0].legend([]);
plot_holidays(holidays, ax[0])

dept_sales_over_time.groupby("Dept").plot(x="Date", y="variance", ax=ax[1]);
ax[1].set_title("Absolute Variance")
ax[1].legend([]);
plot_holidays(holidays, ax[1])

dept_sales_over_time.groupby("Dept").plot(x="Date", y="proportional_variance", ax=ax[2]);
ax[2].set_title("Proportional Variance")
ax[2].legend([]);
ax[2].set_ylim([-20, 20])
plot_holidays(holidays, ax[2])

**Dispersion Analysis**

So far all my plots aggregate sales values in some way, either by summing or by taking averages. This procedure masks the natural dispersion of sales values. By taking a sneak peek at some other Kaggle kernels I found [this](https://www.kaggle.com/simonstochholm/walmart-sales-forecast-gammel) very interesting box plot view of sales values for stores and departments. I will reproduce these plots below.

In [None]:
plt.figure(figsize=(14,5), dpi=120)
sns.boxplot(x=train_data["Store"], y=train_data["Weekly_Sales"], showfliers=False);  # keyword arguments, since recent seaborn versions reject positional x/y

In [None]:
plt.figure(figsize=(14,5), dpi=120)
sns.boxplot(x=train_data["Dept"], y=train_data["Weekly_Sales"], showfliers=False);  # keyword arguments, since recent seaborn versions reject positional x/y

We can see from the Store box plots that the sales values have an asymmetric distribution, with values packed closely below the median and peak values stretching up to the upper whisker. This is less pronounced in the Departments plot, where we see more balanced distributions and greater differences between departments.

### EDA Conclusions and general thoughts

* Environmental features such as Temperature, CPI, Unemployment Rate, and Fuel Price are common to all departments inside a single store, so they will not help much in distinguishing departments. They will help to build a base notion of sales values.
* Almost half of the MarkDown values are missing. The MarkDown values will require special attention for filling NaNs.
* Store Size matters
* Department type matters
* Holidays play a huge role in increasing sales
* Almost all stores have almost all departments. The departments with lower frequency will have their generalization compromised because there is less data for them.

Characteristics related to the stores and departments seem to be more important than environmental features.

## Datasets and splits

To proceed with the models, I will build a canonical dataset that will be used by all of them. Individual models may introduce their own feature engineering techniques, but the train, test, optimization, and validation datasets will be the same for all models. Since the names train and test are already taken by the Kaggle competition and we still need data for the discussions in this notebook, I will split the training dataset into two subsets: the optimization dataset, which will be used for cross-validation and hyperparameter optimization, and the validation dataset, which will be used to compare all the models and serve as the common ground for all of our results.

My final set of feature attributes will be:
* The Store attributes
    * Store ID
    * Store Type
    * Store Size
* The features attributes
    * Temperature
    * Fuel_Price
    * MarkDowns
    * CPI
    * Unemployment
    * Holiday Flag (it is always possible to know beforehand if a given week will contain a Holiday)
* Holiday Type (This is an implicit feature that may be helpful)
* Date features (basic decomposition of the date as an attempt of extracting time-related behavior)
    * Year
    * Month
    * Day
    * Weekday (Monday, Tuesday, Wednesday, etc.)
    * Week of Year
    * Day of year

Since this is a time-based problem, it makes the most sense to split our training dataset based on time. I will take about the first 80% (the stores and departments will make this an uneven split) as our optimization dataset and the last 20% as the validation dataset.

A final submission with our best model will be made just as a sanity check to confirm that our conclusions and discussions in the notebook are correct.

In [None]:
all_features = features.merge(stores, on="Store")
all_features["HolidayType"] = all_features.Date.apply(lambda x: holidays_dict.get(x.date().isoformat(), "Not Holiday"))
all_features["Year"] = all_features.Date.dt.year
all_features["Month"] = all_features.Date.dt.month
all_features["Day"] = all_features.Date.dt.day
all_features["DayOfWeek"] = all_features.Date.dt.dayofweek
all_features["WeekOfYear"] = all_features.Date.dt.isocalendar().week
all_features["DayOfYear"] = all_features.Date.dt.dayofyear

In [None]:
full_train_data = train_data.drop("IsHoliday", axis=1).merge(all_features, on=["Store", "Date"])
full_test_data = test_data.drop("IsHoliday", axis=1).merge(all_features, on=["Store", "Date"])

In [None]:
test_ids = full_test_data.Store.astype(str) + '_' + full_test_data.Dept.astype(str) + '_' + full_test_data.Date.astype(str)

In [None]:
full_train_data.head()

In [None]:
full_test_data.head()

In [None]:
print("Min: ", full_train_data.Date.min(), "Max: ", full_train_data.Date.max())

In [None]:
target = "Weekly_Sales"
continuous_features = ["Temperature", "Fuel_Price", "MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5",
                       "CPI", "Unemployment", "Size", "Year", "Day", "Month", "WeekOfYear", "DayOfYear"]
categorical_features = ["Store", "Dept", "IsHoliday", "Type", "HolidayType", "DayOfWeek"]
features = continuous_features + categorical_features
drop_features = ["Date"]

In [None]:
split_date = (full_train_data.Date.min() + (full_train_data.Date.max() - full_train_data.Date.min()) * 0.8)
split_date

In [None]:
optimization_data = full_train_data[full_train_data.Date < split_date]
optimization_target = optimization_data[target]
optimization_features = optimization_data[features]
validation_data = full_train_data[full_train_data.Date >= split_date]
validation_target = validation_data[target]
validation_features = validation_data[features]

In [None]:
print(f"Optimization proportion: {len(optimization_data)/len(full_train_data)}")
print(f"Validation proportion: {len(validation_data)/len(full_train_data)}")

A tiny bit more for validation but it is ok.
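The split above takes 80% of the calendar span, so the resulting row proportions depend on how many store and department rows exist in each week. As an alternative sketch (not the method used in this notebook), the cutoff can be placed at 80% of the distinct weeks instead. The toy data and the names `optimization` and `validation` below are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: 10 weekly dates with two rows each (made-up values).
dates = pd.date_range("2010-02-05", periods=10, freq="W-FRI")
toy = pd.DataFrame({
    "Date": dates.repeat(2),
    "Weekly_Sales": np.arange(20, dtype=float),
})

# Sorted distinct weeks; the split date is the first week of the last 20%.
unique_dates = toy["Date"].drop_duplicates().sort_values().reset_index(drop=True)
split_date = unique_dates.iloc[int(len(unique_dates) * 0.8)]

optimization = toy[toy["Date"] < split_date]
validation = toy[toy["Date"] >= split_date]
```

Either way, every validation week comes strictly after every optimization week, which is the property that matters for a time-based evaluation.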

In [None]:
print("Optimization shapes: ", optimization_data.shape, optimization_features.shape, optimization_target.shape)
print("Validation shapes: ", validation_data.shape, validation_features.shape, validation_target.shape)

Two columns missing in the features: Date and Weekly Sales.

*Let's build some models!*

## Dummy Benchmark (Mean and Standard deviation)

This section will serve as our very basic baseline, it will serve to guide us with basic boundaries and ensure that we are not making any weird errors in the true models. The Kaggle leaderboard states that the All Zeros benchmark in the test set is **22265.71813**. I will make two basic dummy models:
* Make a general Mean prediction.
* Make a more specialized Mean prediction by using averages for stores and departments.

### General Mean Dummy Model

In [None]:
f, ax = plt.subplots(1, 1, figsize=(10, 3), dpi=120)
sns.distplot(optimization_target, label="Optimization")
sns.distplot(validation_target, label="Validation")
ax.legend()

In [None]:
optimization_target.describe()

In [None]:
validation_target.describe()

I am not sure whether these negative sales values are errors or just operations at a loss. Since I have found neither an explicit disclaimer in the data description nor any discussion on the Kaggle discussion board, I will keep these values in my solution.

In [None]:
pred_mean = optimization_target.mean()
pred_mean

In [None]:
# Constant prediction: every row gets the mean of the optimization target
y_mean_pred_train = np.full(len(optimization_data), pred_mean)

y_mean_pred = np.full(len(validation_data), pred_mean)

In [None]:
optimization_weights = optimization_data["IsHoliday"].apply(lambda x: 5 if x else 1).values
weights = validation_data["IsHoliday"].apply(lambda x: 5 if x else 1).values
weights

In [None]:
optimization_performance_dummy_1 = mean_absolute_error(optimization_target, y_mean_pred_train, sample_weight=optimization_weights)
optimization_performance_dummy_1

In [None]:
validation_performance_dummy_1 = mean_absolute_error(validation_target, y_mean_pred, sample_weight=weights)
validation_performance_dummy_1

### Stores and Departments Mean

In [None]:
stores_and_deps_mean = optimization_data.groupby(["Store", "Dept"], as_index=False).agg({"Weekly_Sales": "mean"})
stores_and_deps_mean

In [None]:
stores_and_dept_preds_train = optimization_data.loc[:, ["Store", "Dept"]].merge(stores_and_deps_mean, on=["Store", "Dept"], how="left").Weekly_Sales
# Store/department pairs without a mean fall back to the global mean
stores_and_dept_preds_train = stores_and_dept_preds_train.fillna(pred_mean)

stores_and_dept_preds = validation_data.loc[:, ["Store", "Dept"]].merge(stores_and_deps_mean, on=["Store", "Dept"], how="left").Weekly_Sales
# Pairs unseen in the optimization data fall back to the global mean
stores_and_dept_preds = stores_and_dept_preds.fillna(pred_mean)

In [None]:
optimization_performance_dummy_2 = mean_absolute_error(optimization_target, stores_and_dept_preds_train, sample_weight=optimization_weights)
optimization_performance_dummy_2

In [None]:
validation_performance_dummy_2 = mean_absolute_error(validation_target, stores_and_dept_preds, sample_weight=weights)
validation_performance_dummy_2

In [None]:
mean_absolute_error(validation_target, stores_and_dept_preds)

### Submission

In [None]:
stores_and_dept_preds_test = test_data.loc[:, ["Store", "Dept", "Date"]].merge(stores_and_deps_mean, on=["Store", "Dept"], how="left")
stores_and_dept_preds_test.loc[stores_and_dept_preds_test.Weekly_Sales.isnull(), "Weekly_Sales"] = pred_mean
test_ids = stores_and_dept_preds_test.Store.astype(str) + '_' + stores_and_dept_preds_test.Dept.astype(str) + '_' + stores_and_dept_preds_test.Date.astype(str)
sample_submission['Id'] = test_ids.values
sample_submission['Weekly_Sales'] = stores_and_dept_preds_test.Weekly_Sales.values
sample_submission.to_csv('submission_dummy_2.csv',index=False)

### Comments

Wow! This is an astonishing result! Comparing directly to the leaderboard, this would be a top 20 submission! It is clear that there are only a few entries for each store and department pair and that this approach would not generalize over time. But it is still impressive how much better the solution got just by restricting the mean.

This makes it clear that department and store identification plays a huge role in the predictions. If this were a model to be applied to new stores, absent from our data, the department configuration of the store would be critical information and we would not be able to use the store ID number. Since there is no explicit recommendation in the problem description and a bunch of solutions make use of this information, I will keep it in my solution.

> Just for the record, the following code:
``` python
stores_and_dept_preds_test = test_data.loc[:, ["Store", "Dept", "Date"]].merge(stores_and_deps_mean, on=["Store", "Dept"], how="left")
stores_and_dept_preds_test.loc[stores_and_dept_preds_test.Weekly_Sales.isnull(), "Weekly_Sales"] = pred_mean
test_ids = stores_and_dept_preds_test.Store.astype(str) + '_' + stores_and_dept_preds_test.Dept.astype(str) + '_' + stores_and_dept_preds_test.Date.astype(str)
sample_submission['Id'] = test_ids.values
sample_submission['Weekly_Sales'] = stores_and_dept_preds_test.Weekly_Sales.values
sample_submission.to_csv('submission_dummy_2.csv',index=False)
```
> yields public and private scores of 5351.79373 and 5103.70395 respectively. This would rank 437 in the private leaderboard. An amazing result, in my opinion, given that we have just taken the store and department means.

*Let's build some real models!*

## Linear Model

My choice of a linear model is based on its simplicity. Linear models have been well understood for a long time and yield pretty solid results as a first attempt. Without much effort, a very simple model (simple in this case meaning fast and reliable) can already deliver value on a problem. For these reasons a linear model will be my first model and serve as our *real* benchmark.

Linear models require special treatment for some features: one-hot encoding for categorical features and scaling for continuous ones. One-hot encoding increases the total number of features, so regularization terms will play an essential role in the final model.

### Preprocessing

As stated above, we have some categorical features and some continuous features. Categorical features will need labeling and one-hot encoding, and continuous features will require scaling. We will also handle missing values (NaN) for the markdowns and other features.
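As a rough illustration of the two transformations named above, the sketch below scales a continuous column and one-hot encodes a categorical one in pure Python. The helper names and toy values are hypothetical; the notebook uses scikit-learn's `StandardScaler` and `OneHotEncoder` for the real data.

```python
from statistics import mean, pstdev

def standard_scale(values):
    """Center to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(values, categories):
    """One indicator column per known category; unknown values give an all-zero row."""
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical toy values.
toy_scaled_sizes = standard_scale([100.0, 200.0, 300.0])
toy_encoded_types = one_hot(["A", "C", "B", "D"], categories=["A", "B", "C"])
print(toy_scaled_sizes)
print(toy_encoded_types)
```

The value `"D"` is not among the known categories, so its row is all zeros, which is the behaviour requested with `handle_unknown='ignore'` in the preprocessor.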

In [None]:
print("Continuous features:", continuous_features)
print("Categorical features:", categorical_features)

In [None]:
optimization_data.loc[:, categorical_features].dtypes

In [None]:
linear_preprocessor = ColumnTransformer([
    
    ('scaled_continous',
     Pipeline([
         ('imputer', SimpleImputer()), # This is not strictly necessary for well behaved features but it will serve as a general protection under bad data.
         ('scaler', StandardScaler())
     ]),
    ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'Size', 'Year', 'Day', 'Month', 'WeekOfYear', 'DayOfYear']
    ),
    
    ('markdowns',
     Pipeline([
         ('imputer', SimpleImputer(strategy="constant", fill_value=0)), # Markdowns have many missing values and change a lot over time, so I simply fill them with zeros
         ('scaler', StandardScaler())
     ]),
     ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']
    ),
    
    ("categorical",
     Pipeline([
         ("one_hot", OneHotEncoder(handle_unknown='ignore'))
     ]),
     (['Store', 'Dept', 'Type', 'HolidayType', 'DayOfWeek'])
    ),
    
    ("others",
     "passthrough",
     ['IsHoliday'] # IsHoliday is not actually categorical, it isn't necessary to put it in a OneHotEncoder
    )
    
])

In [None]:
linear_preprocessor.fit_transform(optimization_features)

### Model

In [None]:
linear_estimator = ElasticNet()

### Complete Pipeline

In [None]:
linear_model = Pipeline([
    ('preprocessor', linear_preprocessor),
    ('estimator', linear_estimator)
])

### Hyperparameters

In [None]:
linear_hyperparameters = {
    "estimator__l1_ratio": [0.2, 0.5, 0.8, 1],
    "estimator__alpha": [1e-1, 1, 1e1, 1e2, 1e3, 1e4, 1e5]
}

### Splits

In [None]:
kf = KFold(5)
splits = kf.split(optimization_features)

### Full Optimization

In [None]:
linear_optimizer = GridSearchCV(
    linear_model,
    linear_hyperparameters,
    scoring="neg_mean_absolute_error",
    cv=5,
    n_jobs=4,
    return_train_score=True,
    verbose=10
)

In [None]:
linear_optimizer.fit(optimization_features, optimization_target)

In [None]:
linear_optimizer.best_params_

In [None]:
pd.DataFrame(linear_optimizer.cv_results_).sort_values(by="rank_test_score").loc[:, ["mean_test_score", "std_test_score", "mean_train_score", "std_train_score"]].head(8)

The best model after the hyperparameter search is a Lasso regression (`l1_ratio` of 1) with a penalty coefficient (`alpha`) of 10. There is a small difference between train and test scores, which is acceptable compared with the absolute value of the error. One problem with my optimization setup is the holiday weights: the optimizer does not take into account that holidays weigh 5 times more in the final score than regular weeks, so a candidate that performs better on holidays may be overlooked.
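To see why an `l1_ratio` of 1 corresponds to a Lasso, the sketch below evaluates the penalty term that scikit-learn documents for `ElasticNet`, `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`, on a hypothetical weight vector. The function name and the toy weights are illustrative only.

```python
def elastic_net_penalty(coefficients, alpha, l1_ratio):
    """Penalty term of the ElasticNet objective for a given coefficient vector."""
    l1 = sum(abs(w) for w in coefficients)
    l2_squared = sum(w * w for w in coefficients)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared

toy_coefficients = [3.0, -4.0, 0.0]  # hypothetical coefficients
toy_lasso_penalty = elastic_net_penalty(toy_coefficients, alpha=10.0, l1_ratio=1.0)
toy_mixed_penalty = elastic_net_penalty(toy_coefficients, alpha=10.0, l1_ratio=0.5)
print(toy_lasso_penalty, toy_mixed_penalty)
```

With `l1_ratio=1.0` the squared term vanishes and only the absolute-value (Lasso) penalty remains.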

### Evaluation

Holidays are worth 5 times a regular prediction in the final score. Let's take this into account.
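The weighted mean absolute error used in the evaluation cells can be sketched in a few lines of pure Python. This is only an illustration with hypothetical toy values; the notebook itself calls `mean_absolute_error` with `sample_weight`, and the function name `weighted_mae` is not part of any library.

```python
def weighted_mae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Mean absolute error where holiday rows count holiday_weight times."""
    row_weights = [holiday_weight if h else 1.0 for h in is_holiday]
    weighted_errors = sum(w * abs(t - p) for w, t, p in zip(row_weights, y_true, y_pred))
    return weighted_errors / sum(row_weights)

# Hypothetical toy values: the second row is a holiday week.
toy_true = [100.0, 200.0, 300.0]
toy_pred = [110.0, 180.0, 300.0]
toy_holiday = [False, True, False]
toy_wmae = weighted_mae(toy_true, toy_pred, toy_holiday)
toy_plain_mae = weighted_mae(toy_true, toy_pred, toy_holiday, holiday_weight=1.0)
print(toy_wmae, toy_plain_mae)
```

Because the largest error falls on the holiday row, the weighted score is higher than the plain MAE for the same predictions.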

In [None]:
y_linear_pred_train = linear_optimizer.predict(optimization_features)
y_linear_pred = linear_optimizer.predict(validation_features)

In [None]:
optimization_performance_linear = mean_absolute_error(optimization_target, y_linear_pred_train, sample_weight=optimization_weights)
optimization_performance_linear

In [None]:
validation_performance_linear = mean_absolute_error(validation_target, y_linear_pred, sample_weight=weights)
validation_performance_linear

In [None]:
mean_absolute_error(optimization_target, y_linear_pred_train)

In [None]:
mean_absolute_error(validation_target, y_linear_pred)

### Submission

In [None]:
test_ids = full_test_data.Store.astype(str) + '_' + full_test_data.Dept.astype(str) + '_' + full_test_data.Date.astype(str)
sample_submission["Id"] = test_ids.values
y_linear_test = linear_optimizer.predict(full_test_data.loc[:, features])
sample_submission["Weekly_Sales"] = y_linear_test
sample_submission.to_csv('submission_linear.csv',index=False)

> The submission results for the linear model are 11891.07291 and 11520.97285 for the public and private leaderboards respectively.

### Explanation

#### Preparations for explanation

In [None]:
best_model = linear_optimizer.best_estimator_

In [None]:
linear_preprocessed_feature_names = []
for transformation, transformer, columns in best_model.named_steps["preprocessor"].transformers_:
    # Passthrough columns keep their original names.
    if transformer == "passthrough":
        linear_preprocessed_feature_names += columns
        continue
    # Dropped columns (e.g. an implicit remainder) produce no output features.
    if transformer == "drop":
        continue
    last_transformer = transformer.steps[-1]
    # Scaled columns map one-to-one to the input columns.
    if last_transformer[0] == "scaler":
        linear_preprocessed_feature_names += columns
    # One-hot encoded columns expand to one feature per category.
    if last_transformer[0] == "one_hot":
        categories = last_transformer[1].categories_
        for column, category in zip(columns, categories):
            linear_preprocessed_feature_names += [column + "_" + str(i) for i in category]

In [None]:
len(linear_preprocessed_feature_names)

#### SHAP

In [None]:
optimization_features_linear_preprocessed = best_model.named_steps["preprocessor"].transform(optimization_features).toarray()
validation_features_linear_preprocessed = best_model.named_steps["preprocessor"].transform(validation_features).toarray()
linear_explainer = shap.LinearExplainer(best_model.named_steps["estimator"],
                                        optimization_features_linear_preprocessed,
                                        feature_perturbation="interventional")
linear_shap_values = linear_explainer.shap_values(validation_features_linear_preprocessed)

In [None]:
shap.summary_plot(linear_shap_values, validation_features_linear_preprocessed, linear_preprocessed_feature_names)

The summary plot confirms most of our ideas so far. Size is the most important variable, indicating that the store size matters. Between Size and Year, the Department variables each appear with a single red dot, which corresponds to the value 1 of the one-hot encoding. If we recover the column sort from the Department X Stores sales matrix, we can check how it compares with the positions of these departments:

In [None]:
columns_sort

They appear in exactly the same position up to label 13, in the 9th position! The first inversion happens with the swap of labels 94 and 8.

#### Regression Weights

In [None]:
linear_model_weights_df = pd.DataFrame({"column": linear_preprocessed_feature_names, "weights": best_model.named_steps["estimator"].coef_}).sort_values("weights", ascending=False).reset_index(drop=True)
linear_model_weights_df.head()

In [None]:
f, ax = plt.subplots(figsize=(15, 3), dpi=200)
linear_model_weights_df[~(linear_model_weights_df["weights"].abs() < 100)].plot.bar(x="column", y="weights", ax=ax)

One cool feature of SHAP analysis for linear models is that, since SHAP takes the mean feature value into account, we get some interesting orderings. If we looked only at the weights, we would see *Size* after a bunch of department categories. With SHAP it is possible to identify that there is a big mass distributed over the variable's values.

## Gradient Boosting Model (XGBoost)

XGBoost is a far more complex model, and it requires detailed usage of its API to get the most out of it.

### Preprocessing

Since we will use the tree-based XGBoost (the gbtree booster option), we need neither scaling nor one-hot encoding. But, since we are removing the one-hot encoder, we have to turn the string categorical features into some numeric data type, which we will do with an OrdinalEncoder. I will also not handle NaN values, since XGBoost learns a default tree path for missing data: every sample with missing values still reaches a leaf, and such samples in the training dataset are accounted for in the loss function.
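As a small illustration of what ordinal encoding does to a string column, here is a pure-Python sketch that assigns integer codes to categories in sorted order. The helper names and toy values are hypothetical; scikit-learn's `OrdinalEncoder` is what the notebook actually uses, and it returns floats rather than Python integers.

```python
def fit_ordinal(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}

def transform_ordinal(values, mapping):
    """Replace each value by its code; an unseen category raises KeyError."""
    return [mapping[v] for v in values]

toy_types = ["B", "A", "C", "A"]  # hypothetical store types
toy_type_codes = fit_ordinal(toy_types)
toy_encoded_column = transform_ordinal(toy_types, toy_type_codes)
print(toy_type_codes, toy_encoded_column)
```

Tree splits only need an ordering of the codes, which is why this compact representation is enough for the gbtree booster.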

In [None]:
xgb_preprocessor = ColumnTransformer([
    
    ('labeler',
     OrdinalEncoder(),
    ['WeekOfYear', 'Type', 'HolidayType']
    ),
    
    ("others",
     "passthrough",
     ['Temperature','Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5',
      'CPI', 'Unemployment', 'Size', 'Year', 'Day', 'Month', 'DayOfYear', 'Store',
      'Dept', 'IsHoliday', 'DayOfWeek']
    )
    
])

xgb_preprocessed_feature_names = ['WeekOfYear', 'Type', 'HolidayType'] + ['Temperature','Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5',
      'CPI', 'Unemployment', 'Size', 'Year', 'Day', 'Month', 'DayOfYear', 'Store',
      'Dept', 'IsHoliday', 'DayOfWeek']

In [None]:
xgb_preprocessed_feature_names

Since I will use the learning API of xgboost, it is necessary to decouple preprocessing from the actual training, which opens space for data leakage problems. The only preprocessing step that I'm performing is the OrdinalEncoder, which could be done beforehand without any specific knowledge of the samples, so there is no problem in preprocessing all the data before it gets into the model.

### DMatrices

In [None]:
optimization_features_xgb_preprocessed = xgb_preprocessor.fit_transform(optimization_features)
validation_features_xgb_preprocessed = xgb_preprocessor.transform(validation_features)

In [None]:
dmatrix_optimization = xgb.DMatrix(optimization_features_xgb_preprocessed, optimization_target)
dmatrix_validation = xgb.DMatrix(validation_features_xgb_preprocessed, validation_target)

In [None]:
dmatrix_optimization.num_col()

### Hyperparameters

As mentioned before my choice for the base estimator will be trees (booster is gbtree). Boosting methods are extremely susceptible to overfitting. Optimizing hyperparameters for a model as complex as XGBoost is a hard task and requires a lot of effort and computation time. With my previous experiences in mind, I will set some general hyperparameters that certainly will help to reduce overfitting. These parameters are:
* Subsample (0.5)
    * Half of the dataset will be sampled for each boosting round.
* Colsample by tree (0.8)
    * 80% of the features will be selected for each tree. Since I will also pick a colsample by level, which samples from the columns already selected for the tree, we must not set this number too low or too few features will be available at each level.
* Colsample by level (0.95)
    * At every new tree level 95% of the tree's columns are sampled, so about 5% are left out, which is around 1 feature per level.
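The XGBoost documentation states that the `colsample_by*` parameters work cumulatively. A quick back-of-the-envelope calculation for the 21 preprocessed features of this notebook gives the expected number of candidate columns; how XGBoost rounds these to whole columns is an implementation detail that this sketch does not try to reproduce.

```python
n_xgb_features = 21  # length of xgb_preprocessed_feature_names in this notebook
colsample_bytree = 0.8
colsample_bylevel = 0.95

# Expected number of columns available to each tree, then to each level of that tree.
expected_columns_per_tree = n_xgb_features * colsample_bytree
expected_columns_per_level = expected_columns_per_tree * colsample_bylevel
print(round(expected_columns_per_tree, 2), round(expected_columns_per_level, 2))
```

So roughly one column is held out at each level, on top of the four or so already excluded from the tree.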

In [None]:
base_xgb_params = {
    "objective": "reg:squarederror",
    "booster": "gbtree",
    "subsample": 0.5,
    "colsample_bytree": 0.8,
    "colsample_bylevel": 0.95,
    "num_parallel_tree": 3, # Forests for the Win! This parameter controls the number of trees trained simultaneously at each boosting round.
    "eval_metric": "mae"
}

xgb_hyperparameters = {
    "max_depth": [3, 5, 10],
    "reg_alpha": [1e1, 1e3, 1e5],
    "reg_lambda": [1e2, 1e3, 1e4, 1e5], # let's choose some heavy regularization parameters
}

parameter_grid = [dict(**base_xgb_params, **i) for i in list(ParameterGrid(xgb_hyperparameters))]

### Booster

In [None]:
results = []
for params in tqdm(parameter_grid[:]):
    print(params)
    xgb_cv_results = xgb.cv(params, dmatrix_optimization, num_boost_round=20, nfold=5, early_stopping_rounds=1, verbose_eval=False)
    rounds = len(xgb_cv_results)
    performance_values = xgb_cv_results.iloc[-1, :].to_dict()
    print(performance_values)
    result_dict = dict(**params, **performance_values, rounds=rounds, params=params)
    results.append(result_dict)

We can see from the logs that the train and test scores are very close to each other, indicating that the new trees being added have some generalization power. We can still overfit the hyperparameters, but keeping both the train/test differences and the standard deviation across folds low puts us on track to build a good boosting model.

I've tried other hyperparameter grids, with deeper trees and more boosting rounds, but after checking the submission results they clearly indicated overfitted models.

In [None]:
results_df = pd.DataFrame(results).sort_values("test-mae-mean")
best_model = results_df.iloc[0, :]
best_params = best_model["params"]

results_df

In [None]:
best_params

In [None]:
xgb_estimator = xgb.train(best_params, dmatrix_optimization, num_boost_round=best_model["rounds"], verbose_eval=True)

In [None]:
y_xgb_preds_train = xgb_estimator.predict(dmatrix_optimization)
y_xgb_preds = xgb_estimator.predict(dmatrix_validation)

In [None]:
optimization_performance_xgboost = mean_absolute_error(optimization_target, y_xgb_preds_train, sample_weight=optimization_weights)
optimization_performance_xgboost

In [None]:
validation_performance_xgboost = mean_absolute_error(validation_target, y_xgb_preds, sample_weight=weights)
validation_performance_xgboost

In [None]:
mean_absolute_error(optimization_target, y_xgb_preds_train)

In [None]:
mean_absolute_error(validation_target, y_xgb_preds)

### Submission

In [None]:
test_ids = full_test_data.Store.astype(str) + '_' + full_test_data.Dept.astype(str) + '_' + full_test_data.Date.astype(str)
sample_submission['Id'] = test_ids.values
y_xgb_test = xgb_estimator.predict(xgb.DMatrix(xgb_preprocessor.transform(full_test_data.loc[:, features])))
sample_submission['Weekly_Sales'] = y_xgb_test
sample_submission.to_csv('submission_xgboost.csv',index=False)

> The submission results are 6364.36197 and 6111.85871 for the private and public leaderboards respectively.

### SHAP Explanations

In [None]:
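# Workaround for an incompatibility between some shap and xgboost versions:
# drop the first 4 bytes of the raw model buffer and patch save_raw so that
# shap.TreeExplainer receives the trimmed buffer.
model_bytearray = xgb_estimator.save_raw()[4:]
def myfun(self=None):
    return model_bytearray

xgb_estimator.save_raw = myfun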

In [None]:
explainer = shap.TreeExplainer(xgb_estimator, feature_perturbation='tree_path_dependent')
shap_values = explainer.shap_values(validation_features_xgb_preprocessed, check_additivity=False) # The additivity test is failing for less than 1e3 of difference.

In [None]:
shap.summary_plot(shap_values, validation_features_xgb_preprocessed, xgb_preprocessed_feature_names)

Again, departments, sizes, and stores appear as top features. Since we don't have a one-hot encoder it is a bit harder to read the departments' impact, but we can see that bigger department ids yield bigger SHAP values. The top departments from the linear model were 92, 95, 38, 72, 90, 40, 2, 91, 13, 8, 94, 4, 93, confirming that some departments are really important, especially the ones with high id values. Let's take a detailed look at the departments' impact.

In [None]:
shap.dependence_plot("Dept", shap_values, validation_features_xgb_preprocessed, xgb_preprocessed_feature_names)

We see the vertical bars indicating the SHAP values for each department. We can see again that departments around 90 have the most impact. We can also see that a high department number combined with a large store size really contributes to high sales values.

## CatBoost

As we have seen in the previous sections, our heuristic model based on stores and departments is really good! We put real effort into training our XGBoost model and, as the Results section shows, it still did not beat it. Given that these two variables encode so much information, I will make one last model that brings a key feature: categorical encoding. CatBoost is a really nice, very mature [boosting model from Yandex](https://github.com/catboost/catboost). It has a categorical encoding technique that already encodes target information, so this will be a combination of a powerful tree-based boosting algorithm with the information hidden in the store and department variables.

I will not perform any kind of hyperparameter optimization and will tell CatBoost to encode just the store and department variables. I expect to beat our heuristic model by a significant margin. Because this additional category encoding is based on target values, CatBoost introduces a permutation and sampling technique across boosting rounds to reduce overfitting, which significantly increases the computational time needed to train the model.
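The idea behind a permutation-based target encoding can be sketched in pure Python. This is a simplified illustration of an "ordered target statistic", in which each row is encoded using only the targets of rows that precede it in a random permutation. It is not CatBoost's actual implementation, which uses several permutations and further target handling; the function name, the `prior_weight` parameter and the toy data are all hypothetical.

```python
import random

def ordered_target_statistic(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each row from the targets of same-category rows seen earlier in a permutation."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)
    running_sums, running_counts = {}, {}
    encoded = [None] * len(categories)
    for idx in order:
        category = categories[idx]
        total = running_sums.get(category, 0.0)
        count = running_counts.get(category, 0)
        # The row's own target is not used for its own encoding.
        encoded[idx] = (total + prior_weight * prior) / (count + prior_weight)
        running_sums[category] = total + targets[idx]
        running_counts[category] = count + 1
    return encoded

# Hypothetical toy data: "store_dept" style categories and weekly sales.
toy_categories = ["1_1", "1_1", "2_1", "1_1"]
toy_targets = [100.0, 300.0, 50.0, 200.0]
toy_prior = sum(toy_targets) / len(toy_targets)
toy_target_encoding = ordered_target_statistic(toy_categories, toy_targets, prior=toy_prior)
print(toy_target_encoding)
```

Compared with the plain store and department mean of the dummy model, a row never sees its own target here, which is the mechanism meant to limit overfitting.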

Since I'm trying to capture the information encoded in stores and departments, I will only use the top 3 variables from XGBoost.

In [None]:
features

In [None]:
reduced_features = ['Size', 'Store', 'Dept',]
cat_cat_feaures = ['Store', 'Dept']

In [None]:
cat_estimator = cat.CatBoostRegressor(cat_features=cat_cat_feaures)

In [None]:
cat_estimator.fit(optimization_features.loc[:, reduced_features], optimization_target)

In [None]:
cat_estimator.tree_count_

In [None]:
y_cat_preds_train = cat_estimator.predict(optimization_features.loc[:, reduced_features])
y_cat_preds = cat_estimator.predict(validation_features.loc[:, reduced_features])

In [None]:
optimization_performance_cat = mean_absolute_error(optimization_target, y_cat_preds_train, sample_weight=optimization_weights)
optimization_performance_cat

In [None]:
validation_performance_cat = mean_absolute_error(validation_target, y_cat_preds, sample_weight=weights)
validation_performance_cat

In [None]:
mean_absolute_error(validation_target, y_cat_preds)

In [None]:
mean_absolute_error(optimization_target, y_cat_preds_train)

### Submission

In [None]:
test_ids = full_test_data.Store.astype(str) + '_' + full_test_data.Dept.astype(str) + '_' + full_test_data.Date.astype(str)
sample_submission['Id'] = test_ids.values
y_cat_test = cat_estimator.predict(full_test_data.loc[:, reduced_features])
sample_submission['Weekly_Sales'] = y_cat_test
sample_submission.to_csv('submission_cat.csv',index=False)

> The submission results are 6347.62468 and 6115.17444 for the private and public leaderboards respectively.

## Results

Here is our summary of performance results:

In [None]:
performances = pd.DataFrame(
    [[optimization_performance_dummy_1, optimization_performance_dummy_2, optimization_performance_linear, optimization_performance_xgboost, optimization_performance_cat],
     [validation_performance_dummy_1, validation_performance_dummy_2, validation_performance_linear, validation_performance_xgboost, validation_performance_cat]],
    columns = ["Dummy Model 1", "Dummy Model 2", "Linear Model", "XGBoost", "CatBoost"],
    index = ["Optimization", "Validation"]
)
performances

*Units in USD*

All three real models beat the first dummy model by a significant margin. Along the expected path, the error improves from USD 7989.3774 for our linear model to USD 3337.6414 for our XGBoost model, an improvement of approximately 58%. The CatBoost model, with only three features, got a very similar result. The really unexpected result is our Dummy Model 2, based on the combined mean of stores and departments: none of the models was able to beat it.
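The relative improvement quoted above can be checked with a couple of lines, using the two validation errors reported in the text:

```python
linear_validation_mae = 7989.3774   # weighted validation MAE of the linear model (USD)
xgboost_validation_mae = 3337.6414  # weighted validation MAE of the XGBoost model (USD)

relative_improvement = (linear_validation_mae - xgboost_validation_mae) / linear_validation_mae
print(f"{relative_improvement:.1%}")
```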

The dummy model had the best overall performance even in the submission leaderboard.

The explanations plots and analysis confirm some of the observations made in the EDA such as:
* Store Size is important.
* Departments and store configuration are some of the most important things to predict sales values.
* Not just store size, but department size also matters.
* Since the competition metric was not directly optimized, IsHoliday played a minor role.

The heuristic model based on stores and departments means was the best model. The best machine learning model was XGBoost.

Although holidays are worth 5 times more, their frequency is really low, which almost cancels the effect of the weights; only minor losses were observed for all the models.

### Sample visualization

As a sanity check let's pick a store at random and look at the sum of predicted values over departments and compare all models.

In [None]:
store_candidate = stores.sample(1).iloc[0, 0]
store_candidate

optimization_preds_df = pd.DataFrame({
    "Store": optimization_data.Store.values,
    "Dept": optimization_data.Dept.values,
    "Date": optimization_data.Date.values,
    "Weekly_Sales": optimization_data.Weekly_Sales.values,
    "dummy_1": y_mean_pred_train,
    "dummy_2": stores_and_dept_preds_train,
    "linear": y_linear_pred_train,
    "xgboost": y_xgb_preds_train,
    "catboost": y_cat_preds_train,
})

validation_preds_df = pd.DataFrame({
    "Store": validation_data.Store.values,
    "Dept": validation_data.Dept.values,
    "Date": validation_data.Date.values,
    "Weekly_Sales": validation_data.Weekly_Sales.values,
    "dummy_1": y_mean_pred,
    "dummy_2": stores_and_dept_preds,
    "linear": y_linear_pred,
    "xgboost": y_xgb_preds,
    "catboost": y_cat_preds,
})

test_preds_df = pd.DataFrame({
    "Store": full_test_data.Store.values,
    "Dept": full_test_data.Dept.values,
    "Date": full_test_data.Date.values,
    "dummy_2": stores_and_dept_preds_test.Weekly_Sales.values,
    "linear": y_linear_test,
    "xgboost": y_xgb_test,
    "catboost": y_cat_test,
})

summary_optimization = optimization_preds_df[optimization_preds_df.Store == store_candidate].groupby(["Date"], as_index=False).sum()
summary_validation = validation_preds_df[validation_preds_df.Store == store_candidate].groupby(["Date"], as_index=False).sum()
summary_test = test_preds_df[test_preds_df.Store == store_candidate].groupby(["Date"], as_index=False).sum()

plots = ["Weekly_Sales", "dummy_1", "dummy_2", "linear", "xgboost", "catboost"]
colors = {
    "Weekly_Sales": "red",
    "dummy_1": "green",
    "dummy_2": "orange",
    "linear": "brown",
    "xgboost": "blue",
    "catboost": "cyan"
}

linestyle = {
    "Weekly_Sales": "-."
}

f, ax = plt.subplots(figsize=(12, 4), dpi=130)
for p in plots:
    kwargs = {
        "label": p,
        "color": colors.get(p),
        "linewidth": 0.8,
        "linestyle": linestyle.get(p, "-")
    }
    ax.plot(summary_optimization.Date, summary_optimization[p], **kwargs)
    ax.plot(summary_validation.Date, summary_validation[p], **kwargs)
    if p != "Weekly_Sales" and p != "dummy_1":
        ax.plot(summary_test.Date, summary_test[p], **kwargs)

ax.axvspan(optimization_data.Date.min(), optimization_data.Date.max(), color='grey',alpha=0.14)
ax.axvspan(validation_data.Date.min(), validation_data.Date.max(), color='green',alpha=0.14)
ax.axvspan(test_preds_df.Date.min(), test_preds_df.Date.max(), color='red',alpha=0.14)
ax.set_title("Store " + str(store_candidate))
ax.legend(loc="upper right")
plot_holidays(holidays, ax)

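The dummy model now shows its weakness: it predicts a constant value for each store and department, so it cannot follow any variation over time.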

## Conclusions

We reach the end of the notebook in a very unusual situation, where a heuristic model is the best candidate so far. Before discussing which model to pick, I believe it is important to discuss deployment scenarios and the application purposes of these models.

This is a Kaggle competition focused on the Walmart selection process. The problem statement makes no reference to the real purpose of the model. Given that we don't have a specific usage context, we can list possible situations:

**New Stores Placement**

One possible application for this model is to investigate which factors impact sales. With this short analysis, it was clear that factors such as department configuration and store size impact sales. This analysis could be used to guide store changes or new store openings: hypothetical feature values could be extracted from possible new store locations and fed to the model to predict the future "store value". This application neglects the time index present in the data.

**Sales Forecast**

This is the title of the competition and the most probable use case. The model can be placed in a system that takes the store ids and departments and forecasts sales values for better storage planning and the anticipation of critical events. It is important to notice that all of our models, including our superpowered dummy model, depend on the previous historical data for the store and are tightly coupled with the store ids, so none of them would perform well on new stores (with new ids). If we wanted to apply these models to new stores, we would have to think carefully about how to use the store id information.

None of our models are suitable for the first case.

For the second case, a period of experimentation would put the generalization power of the XGBoost model to the test. With the current results, I am inclined to pick the XGBoost model for its clearer generalization, but it is really impressive what the Dummy Model is capable of. In a context where shadow predictions and a period of experimentation are not allowed, the Dummy Model could be a reasonable solution while we tweak the other model until better performance is reached. It is clear that the dummy model's generalization is limited; we can check this in the predictions plot.

Department number, Size and Store number encode most of the information: our CatBoost model, with only the top three XGBoost features, had similar performance. This raises the question "What kind of model generalization are we searching for?". Since the department number and store id play such a crucial role, what exactly are our predictions? Are we really learning where sales values come from? Are we capturing external effects from holidays and temperature, or just memorizing which stores and departments sell more?

If we are just memorizing, we have a very solid explanation of why our dummy model is so good. To better understand these results we really need to understand how this model would be used and what its main objective is.

## Further Work

We got curious results and the analysis opened a lot of future work paths. I will list some of these paths here:
* Understand why the dummy model was so good.
    * Powerful models weren't able to beat it, why?
    * We can make use of more advanced features of the boosting models and train them so they can beat our dummy model.
* Detailed usage of XGBoost boosting rounds
    * Clip new trees to avoid overfitting
    * Test the DART booster
* Better explore the target encoding from CatBoost. The dummy model is a very simple way to do this, a more powerful approach could lead to better results.
* Explore the time dimension.
    * No attention was given to feature engineering based on time; there is a whole new world of time series methods that can enhance our models.
    * We can combine multiple kinds of feature extraction and different time series modeling techniques to enhance our results.
* Optimize models with the desired metric.
    * All optimizations were made with the standard MSE. Dedicated optimization using the desired metric could select a better model for the problem purposes.

Thank you for your time.

Sincerely,

Otávio Vasques