# <font color="navy">Walmart Recruiting Store Sales Forecasting</font>
https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting

# Future Sales Forecasting Problem
### Introduction
Predicting future sales is often framed as a time series <br>
forecasting problem in which identifying trend and seasonality <br>
is the main task. Knowing both of these components, and assuming <br>
steady external factors, is a good starting point, and models <br>
like Exponential Smoothing are frequently used. <br>
<br>
When seasonality is not clear enough, stochastic models that use <br>
a fixed number of previous time states, such as ARIMA, are used. <br>
However, this kind of forecasting is likely to propagate errors <br>
when the forecast horizon is long. <br>

### My Strategy
To bring a new contribution that can be ensembled with <br>
traditional strategies for this problem, I have made some <br>
assumptions: <br>
* Department weekly sales depend on the order of the week within <br>
its month.
* The (Week, Month) tuple can explain human behavior; shopping is <br>
a human behavior, right?
* The number of weeks in a month tends to repeat over the years, <br>
depending on how many Mondays the month has.
* The sales variation of a given Store-Dept between Year0 and <br>
Year1 is almost stable and can be replicated between Year1 and <br>
Year2.
* Store-Dept weekly sales are correlated with the Size of the Store <br>
where the Dept is located.
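The assumption about the number of weeks per month can be checked with a quick sketch. It uses only the standard library and counts the Mondays of every month across the three years covered by the data; the helper name `mondays_in_month` is hypothetical and not part of this notebook's modules.

```python
import calendar


def mondays_in_month(year: int, month: int) -> int:
    """Count how many Mondays fall in the given month."""
    # monthcalendar returns Monday-first weeks; days outside the month are 0.
    weeks = calendar.monthcalendar(year, month)
    return sum(1 for week in weeks if week[calendar.MONDAY] != 0)


counts = {
    year: [mondays_in_month(year, month) for month in range(1, 13)]
    for year in (2010, 2011, 2012)
}
for year, per_month in counts.items():
    print(year, per_month)
```

Every month has either 4 or 5 Mondays, so the week order within a month is a small, recurring set of values.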

Given these assumptions, it is reasonable to model <br>
the sales of a specific Store-Dept in the next year as <br>
a function of the same Store-Dept's sales in the current year. Of course, <br>
further information should be added to improve accuracy. <br>
To perform this prediction, a **Linear Regression** is used, with <br>
x1 representing the Weekly_Sales of the Store-Dept and x2 the current <br>
Size of the whole store.
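In other words, the model has the form `y = b0 + b1 * x1 + b2 * x2`. A minimal sketch on synthetic data (the numbers below are made up for illustration and are not taken from the Walmart dataset) shows how such a model is fitted:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic inputs: x1 = current-year weekly sales, x2 = store size.
sales_year0 = rng.uniform(1_000, 50_000, size=n)
store_size = rng.uniform(40_000, 200_000, size=n)

# Synthetic target: next-year sales as a noisy linear function of both inputs.
sales_year1 = 0.95 * sales_year0 + 0.01 * store_size + rng.normal(0, 500, size=n)

X = np.column_stack([sales_year0, store_size])
model = LinearRegression().fit(X, sales_year1)

print("coefficients:", model.coef_)
print("intercept   :", model.intercept_)
print("R^2         :", model.score(X, sales_year1))
```

The fitted coefficients should land close to the 0.95 and 0.01 used to generate the data.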

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

import my_dao
import time_utils
import pretties
import stats
import process
import evaluation
import plotter
import download

from bokeh.plotting import show, output_notebook

In [None]:
pretties.max_data_frame_columns()
pretties.decimal_notation()
output_notebook()

# **The dataset**

Some semantic enrichment was applied to the data in order to have more <br> 
possibilities to explore it and, eventually, turn it into <br>
proper machine learning features. <br>
<br>
In this context, the following fields were built: <br>
<br>

### Train dataset
<ul>
  <li>sales_diff : a number, the absolute difference between the current and previous week's Sales</li>
  <li>sales_diff_p : a number, the relative difference between the current and previous week's Sales</li>
  <li>up_diff : a boolean indicating whether sales_diff is positive</li>
</ul>
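The real enrichment lives in `process.train_sales_semantic_enrichment`, which is not shown here. A minimal sketch of how such fields could be derived with pandas, assuming `sales_diff` is the signed week-over-week difference within each `store_dept` (the helper `add_sales_diffs` and the toy values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "store_dept": ["1_1"] * 4 + ["1_2"] * 3,
    "Date": pd.to_datetime([
        "2010-02-05", "2010-02-12", "2010-02-19", "2010-02-26",
        "2010-02-05", "2010-02-12", "2010-02-19",
    ]),
    "Weekly_Sales": [100.0, 120.0, 90.0, 90.0, 50.0, 75.0, 60.0],
})


def add_sales_diffs(df: pd.DataFrame) -> pd.DataFrame:
    """Add week-over-week sales differences per store_dept (illustrative only)."""
    df = df.sort_values(["store_dept", "Date"]).copy()
    grouped = df.groupby("store_dept")["Weekly_Sales"]
    previous = grouped.shift(1)                       # previous week's sales
    df["sales_diff"] = df["Weekly_Sales"] - previous  # signed difference
    df["sales_diff_p"] = df["sales_diff"] / previous  # relative difference
    df["up_diff"] = df["sales_diff"] > 0              # NaN compares as False
    return df


enriched = add_sales_diffs(toy)
print(enriched)
```

The first week of each `store_dept` has no previous week, so its differences are NaN and `up_diff` is False.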

<br>

### Features dataset
<ul>
  <li>pre_holiday : a boolean indicating whether there is a Holiday in the following week</li>
  <li>pos_holiday : a boolean indicating whether there was a Holiday in the previous week</li>
  <li>celsius : a number, the Temperature on the Celsius scale</li>
  <li>celsius_diff : a number, the absolute difference between the current and previous week's Temperature in Celsius</li>
  <li>week_n : a number, the order of the week within its month</li>
  <li>month_n : a number, represents the month</li>
    
</ul>
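The actual logic is in `process.features_semantic_enrichment`, which is not shown. The sketch below is one possible way to build these fields; in particular, it assumes `week_n` is computed from the day of the month as `(day - 1) // 7 + 1`, which may differ from the original implementation. The helper `enrich_features` and the toy rows are hypothetical.

```python
import pandas as pd

feat_toy = pd.DataFrame({
    "Store": [1] * 5,
    "Date": pd.to_datetime([
        "2010-11-12", "2010-11-19", "2010-11-26", "2010-12-03", "2010-12-10",
    ]),
    "Temperature": [59.0, 50.0, 41.0, 32.0, 50.0],  # Fahrenheit
    "IsHoliday": [False, False, True, False, False],
})


def enrich_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive holiday-neighbour, temperature and calendar fields (illustrative)."""
    df = df.sort_values(["Store", "Date"]).copy()
    holiday = df["IsHoliday"].astype(int).groupby(df["Store"])
    # shift(-1) looks one week ahead, shift(1) one week back, per store.
    df["pre_holiday"] = holiday.shift(-1).fillna(0).astype(bool)
    df["pos_holiday"] = holiday.shift(1).fillna(0).astype(bool)
    df["celsius"] = (df["Temperature"] - 32) * 5 / 9
    df["celsius_diff"] = df.groupby("Store")["celsius"].diff()
    df["month_n"] = df["Date"].dt.month
    df["week_n"] = (df["Date"].dt.day - 1) // 7 + 1  # assumed definition
    return df


enriched_feat = enrich_features(feat_toy)
print(enriched_feat)
```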

### The <font color="navy">store_dept</font>  and <font color="navy">wm_date</font> composite key
As mentioned in the first section, my strategy uses the tuple <br>
(Store, Dept) and the element that explains shopping behavior, <br>
the tuple (Week_n, Month_n), to group Weekly Sales for each year. <br>

Got confused? <br>
It is easy! <br>

Go ahead and check the help texts next to the outputs. :)
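As a small illustration of the idea, the sketch below builds both keys for the same Store-Dept in three different years. The string formats used for `store_dept` and `wm_date`, and the `week_n` formula, are assumptions made for this example; the real keys come from the `process` module.

```python
import pandas as pd

rows = pd.DataFrame({
    "Store": [29, 29, 29],
    "Dept": [5, 5, 5],
    "Date": pd.to_datetime(["2010-07-23", "2011-07-22", "2012-07-27"]),
    "Weekly_Sales": [1000.0, 1100.0, 1200.0],  # made-up values
})

# Hypothetical key formats: "<Store>_<Dept>" and "<week_n>_<month_n>".
rows["store_dept"] = rows["Store"].astype(str) + "_" + rows["Dept"].astype(str)
rows["week_n"] = (rows["Date"].dt.day - 1) // 7 + 1
rows["month_n"] = rows["Date"].dt.month
rows["wm_date"] = rows["week_n"].astype(str) + "_" + rows["month_n"].astype(str)

print(rows[["Date", "store_dept", "wm_date", "Weekly_Sales"]])
```

The three dates differ, but they all map to the 4th week of July, so the rows share one composite key and can be compared year over year.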

### Loading...

In [None]:
train = my_dao.load_dataset("train")
train = train.groupby("store_dept").apply(process.train_sales_semantic_enrichment)

test = my_dao.load_dataset("test")

feat = my_dao.load_features()
feat = process.features_semantic_enrichment(feat)

stores = my_dao.load_stores()

Merging train and test datasets with features dataset

In [None]:
train = train.merge(feat, how="left", left_on=["Store", "Date"], right_on=["Store", "Date"], suffixes=["", "_y"])
del train["IsHoliday_y"]
del train["timestamp_y"]
train = train.merge(stores, how="left", left_on=["Store"], right_on=["Store"])

In [None]:
test = test.merge(feat, how="left", left_on=["Store", "Date"], right_on=["Store", "Date"], suffixes=["", "_y"])
del test["IsHoliday_y"]
del test["timestamp_y"]
test = test.merge(stores, how="left", left_on=["Store"], right_on=["Store"])

In [None]:
cols = ['Date', 'Store', 'Dept', 'Weekly_Sales', 'pre_holiday', 'IsHoliday', 'pos_holiday', 'Fuel_Price', 
        'CPI', 'Unemployment', 'celsius', 'datetime', 'Type', 'sales_diff', 'sales_diff_p',
        'Size', 'Temperature', 'timestamp', 'store_dept', "day_n", "week_n", "month_n", "wm_date", "up_diff", "celsius_diff", "year"]

train = train[cols]
print("Shape: {}".format(train.shape))
train.sample(6)

Train dataset time interval

In [None]:
print("Train\n")
print("Initial date: {}".format(train["Date"].iloc[0]))
print("Final date  : {}".format(train["Date"].iloc[-1]))
print("Time interval (months): {}".format(time_utils.time_interval_months(train["Date"])))
print("Time interval (years) : {}".format(time_utils.time_interval_months(train["Date"]) / 12))

In [None]:
print("Test\n")
print("Initial date: {}".format(test["Date"].iloc[0]))
print("Final date  : {}".format(test["Date"].iloc[-1]))
print("Time interval (months): {}".format(time_utils.time_interval_months(test["Date"])))
print("Time interval (years) : {}".format(time_utils.time_interval_months(test["Date"]) / 12))

### Train & Validation partitions

#### Partition timestamp threshold
A timestamp threshold must be set to build the **train** and **validation** partitions. <br>
As the whole train dataset spans 32.7 months, I decided to use the first 24 months for fitting and the remaining 8 months for validation.<br>
This way, there are 2 entries for each (Store, Dept, week_n, month_n) composite key in the fitting stage. <br>
Fitting data is stored in use_train.<br>
Validation data is stored in use_valid.<br>
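The split itself can be sketched with plain pandas on a synthetic weekly calendar. This example compares `Timestamp` objects directly rather than using the notebook's `time_utils` helper, and the frame `weekly` is made up for illustration.

```python
import pandas as pd

# One row per week over the same period as the train dataset.
weekly = pd.DataFrame({"Date": pd.date_range("2010-02-05", "2012-10-26", freq="7D")})
weekly["Weekly_Sales"] = range(len(weekly))  # placeholder values

threshold = pd.Timestamp("2012-02-01")
fit_part = weekly[weekly["Date"] <= threshold]
valid_part = weekly[weekly["Date"] > threshold]

print("fitting weeks   :", len(fit_part))
print("validation weeks:", len(valid_part))
```

The fitting partition holds 104 weeks, that is two full years, which is what gives each composite key two entries.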

In [None]:
timestamp_threshold = time_utils.str_datetime_to_timestamp("2012-02-01", "%Y-%m-%d") #24 months from the first entry

use_train = train[train["timestamp"] <= timestamp_threshold]
use_valid = train[train["timestamp"] > timestamp_threshold]

In [None]:
print("Fitting dataset time interval\n")
print(pd.concat([use_train["Date"].head(1), use_train["Date"].tail(1)]))  # Series.append was removed in pandas 2.0
print()
print("Time interval (months): {}".format(time_utils.time_interval_months(use_train["Date"])))
print("Time interval (years) : {}".format(time_utils.time_interval_months(use_train["Date"]) / 12))

In [None]:
print("Validation dataset time interval\n")
print(pd.concat([use_valid["Date"].head(1), use_valid["Date"].tail(1)]))  # Series.append was removed in pandas 2.0
print()
print("Time interval (months): {}".format(time_utils.time_interval_months(use_valid["Date"])))
print("Time interval (years) : {}".format(time_utils.time_interval_months(use_valid["Date"]) / 12))

# Forecasting

### <font color="navy">Week-Month</font> data

### Transformation into WM Data...
The input data for the fitting stage is described a few cells below :)

In [None]:
try:
    wm_data_train = my_dao.load_week_month_data("wm_data_train")
except FileNotFoundError:
    wm_data_train = process.wm_data(use_train)
    my_dao.save_week_month_data(wm_data_train, "wm_data_train")

In [None]:
wm_data_train = process.format_wm_data_colnames(wm_data_train, "train")
wm_data_train.sample(4)

In [None]:
try:
    wm_data_valid = my_dao.load_week_month_data("wm_data_valid")
except FileNotFoundError:
    wm_data_valid = process.wm_data(use_valid)
    my_dao.save_week_month_data(wm_data_valid, "wm_data_valid")

In [None]:
wm_data_valid = process.format_wm_data_colnames(wm_data_valid, "valid")
wm_data_valid.sample(4)

Now that the train and validation datasets are laid out as tables with <br>
(Store, Dept, Week_n, Month_n) as their composite key (represented <br>
by the columns store_dept and wm_date), it is time to merge them <br>
on this key. <br>
An **inner** merge is chosen because the fit method of <br> 
sklearn.linear_model.LinearRegression doesn't work with NaN values. <br>
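A toy comparison makes the choice concrete: an inner merge keeps only the keys present on both sides, while an outer merge introduces NaN for the keys missing on one side. The frames `left` and `right` below are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({
    "store_dept": ["1_1", "1_1", "1_2"],
    "wm_date": ["1_2", "2_2", "1_2"],
    "year0_sales_train": [10.0, 12.0, 7.0],
})
right = pd.DataFrame({
    "store_dept": ["1_1", "1_2", "1_3"],
    "wm_date": ["1_2", "1_2", "1_2"],
    "year0_sales_valid": [11.0, 8.0, 5.0],
})

keys = ["wm_date", "store_dept"]
inner = left.merge(right, on=keys, how="inner")
outer = left.merge(right, on=keys, how="outer")

print("inner rows:", len(inner), "NaN cells:", int(inner.isna().sum().sum()))
print("outer rows:", len(outer), "NaN cells:", int(outer.isna().sum().sum()))
```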

In [None]:
xy = pd.merge(wm_data_train, wm_data_valid, 
              left_on=["wm_date", "store_dept"], right_on=["wm_date", "store_dept"], 
              how="inner", suffixes=["_train", "_valid"])

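# Key columns needed later for the submission file.
# Assumption: the merge suffixed Store and Dept the same way as Date (Date_train).
xy["Store"] = xy["Store_train"]
xy["Dept"] = xy["Dept_train"]
xy["Date"] = xy["Date_train"]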

print("Total groups: ", len(xy.drop_duplicates(["wm_date", "store_dept"])))
display(xy.head(4))
print("Hey, look at the table above and check the first two columns (store_dept and wm_date)")
print("They mean Store 29, Dept 5, Month_n 7 and the 4th week of the month.")
print("The following columns are the field values for each year.")

### The input table description

Now that you can see the merged dataset, it is time to explain it. <br>
As I said before, each entry of this table holds the <br>
field values of one (Store, Department, Week, Month). Even <br>
though not all fields are placed there, it was designed to receive <br>
whichever ones you want. <br>
This approach starts simple, as it should: only two input columns <br>
are used in the fitting stage: a) the Weekly_Sales of <br>
(Store, Dept) at each (Week_n, Month_n) time tag and b) the Size of <br>
the Store at that time tag. <br>
<br>
So, the Linear Regression fitting stage will be performed using <br>
Weekly Sales and store Size, right? But how? <br>
Remember the train and validation partition split? Right. <br>
Two years were reserved for the training dataset, so it is <br>
possible to train with year0's sales and size, targeting year1's <br>
sales. <br>
Then, the prediction quality is evaluated with the third year (year2), <br>
which was reserved for the validation stage. =) <br>
<br>
Note: unfortunately, the MarkDown fields only have valid values from Nov 2011.
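The table layout can be sketched as a pivot from long to wide format: one row per composite key and one sales column per year. This is only an illustration of the shape; the real transformation is `process.wm_data`, which is not shown, and the frame `long` and the `year_idx` column are hypothetical.

```python
import pandas as pd

long = pd.DataFrame({
    "store_dept": ["1_1"] * 4,
    "wm_date": ["1_2", "2_2", "1_2", "2_2"],
    "year": [2010, 2010, 2011, 2011],
    "Weekly_Sales": [100.0, 110.0, 105.0, 120.0],  # made-up values
})

# Index years relative to the first one: 2010 -> 0, 2011 -> 1.
long["year_idx"] = long["year"] - long["year"].min()

wide = long.pivot_table(
    index=["store_dept", "wm_date"],
    columns="year_idx",
    values="Weekly_Sales",
    aggfunc="first",
)
wide.columns = [f"year{idx}_sales" for idx in wide.columns]
wide = wide.reset_index()

print(wide)
```

With this layout, `year0_sales` can be used as an input column and `year1_sales` as the target for the same (Store, Dept, Week_n, Month_n).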

In [None]:
%matplotlib inline
xy.plot.scatter("year0_sales_train", "year1_sales_train", 
                title="scatter plot - sales year0 vs sales year1", 
               ylim=(0, 200000), xlim=(0, 200000), alpha=0.15, figsize=(6,6))

What does the chart above tell us? <br>
Despite some outliers, the majority of the points lie close <br> 
to the 45-degree diagonal, which means that the Weekly Sales <br>
of year0 and year1 tend, in general, to be close to each <br>
other.

In [None]:
%matplotlib inline
xy.plot.scatter("year1_sales_train", "year0_sales_valid", color="magenta",
                title="scatter plot - sales year1 vs sales year2", 
               ylim=(0, 200000), xlim=(0, 200000), alpha=0.3, figsize=(6,6))

And this one in magenta? <br>
It has the same structure as the previous chart, but <br>
this time with Weekly Sales from year1 to year2. <br>
At its right end the points seem to curve slightly <br>
upward, but most of the points have the same <br>
spread as in the previous chart.

Ok, let's move forward. <br>
Time to remove missing data in the second year (year1_train). <br>
There are valid values for year0_train, but not for all of <br>
year1_train.

In [None]:
print("NAs count on first year")
stats.freq(xy["year0_sales_train"].isna())

In [None]:
print("NAs count on second year")
stats.freq(xy["year1_sales_train"].isna())

In [None]:
print("NAs count on third year")
stats.freq(xy["year0_sales_valid"].isna())

In [None]:
not_na_xy = xy[(xy["year0_sales_train"].notna()) & (xy["year1_sales_train"].notna()) & (xy["year0_sales_valid"].notna())]

In [None]:
key_colnames = ["Store", "Dept", "Date"]  # column names needed to build the submission file

In [None]:
fitting_cols = ["year0_sales_train", "year0_size_train"] #first year data
x = not_na_xy[fitting_cols] 
x.head(4)

In [None]:
y = not_na_xy[["year1_sales_train"]] #first year target
y.head(4)

## Fitting

In [None]:
reg = LinearRegression().fit(x, y)
print(reg.score(x, y))
print("Function:")

# Print the fitted function; every coefficient and the intercept are rounded to 3 decimals.
print("y = {} * {} + {} * {} + {}".format(round(reg.coef_[0][0], 3), x.columns[0], round(reg.coef_[0][1], 3), x.columns[1], round(reg.intercept_[0], 3)))

## Applying to Validation dataset

Recap! <br>
Now it's time to apply the fitted Linear Regression function <br>
to the second-year data to predict the next year's sales.
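A minimal sketch of this window shift, on a made-up table: the model is fitted on the year0 columns against year1 sales, then fed the year1 columns to predict year2. The year1 columns are renamed to the names seen at fit time because recent scikit-learn versions check feature names; the names `table`, `fit_cols` and `next_cols` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy table where year1 sales are exactly 10% above year0 sales.
table = pd.DataFrame({
    "year0_sales_train": [100.0, 200.0, 300.0, 400.0],
    "year0_size_train": [10.0, 20.0, 10.0, 20.0],
    "year1_sales_train": [110.0, 220.0, 330.0, 440.0],
    "year1_size_train": [10.0, 20.0, 10.0, 20.0],
})

fit_cols = ["year0_sales_train", "year0_size_train"]
next_cols = [col.replace("0", "1") for col in fit_cols]

model = LinearRegression().fit(table[fit_cols], table["year1_sales_train"])

# Move the window one year: year1 values go in under the fit-time column names.
shifted = table[next_cols].rename(columns=dict(zip(next_cols, fit_cols)))
year2_pred = model.predict(shifted)

print(np.round(year2_pred, 2))
```

On this toy table the learned relation is a 10% increase, so the year2 predictions are the year1 sales times 1.1.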

In [None]:
x_valid = not_na_xy#[["year1_sales_train", "year1_size_train"] + key_colnames + ["year1_isholiday_train"]]
x_valid.head(4)

In [None]:
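# Shift the window one year: feed the year1 columns, renamed to the names used at fit time.
valid_cols = [fitting_col.replace("0", "1") for fitting_col in fitting_cols]
y_pred = reg.predict(x_valid[valid_cols].rename(columns=dict(zip(valid_cols, fitting_cols))))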

In [None]:
%matplotlib inline
pd.DataFrame(y_pred)[0].plot.hist(title="Validation Predicted")
y_valid = not_na_xy[["year0_sales_valid"]]
y_valid.plot.hist(title="Validation Real", color="magenta")

Looking at both charts above, the dispersion of the values at least seems to be very close.

## Evaluation

In [None]:
x_valid[fitting_cols]

In [None]:
x_valid = x_valid.rename({"year0_isholiday_valid": "IsHoliday", "year0_sales_valid": "Weekly_Sales"}, axis=1)

subm = evaluation.build_submission_df(test_df=x_valid[fitting_cols + key_colnames], 
                                      target_predicted=y_pred)

print("Validation prediction evaluation:\n")
print(evaluation.evaluate(subm, x_valid))
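The `evaluation` module is not shown in this notebook, so what `evaluation.evaluate` computes is not confirmed here. The competition itself ranks submissions by weighted mean absolute error (WMAE), with holiday weeks weighted 5 and all other weeks weighted 1. A minimal sketch of that metric, with a hypothetical helper name `wmae`:

```python
import numpy as np


def wmae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Weighted mean absolute error; holiday weeks count holiday_weight times."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))


# Absolute errors are 10, 10 and 30; the last week is a holiday week.
score = wmae([100, 200, 300], [110, 190, 330], [False, False, True])
print(score)
```

Here the weighted error sum is 10 + 10 + 5 * 30 = 170 and the weight sum is 7, so the score is 170 / 7.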

## Test
Now it is time to apply prediction to the test dataset.

These column names are updated to move the time window forward by one year.

In [None]:
test_cols = [fitting_col.replace("0", "1") for fitting_col in fitting_cols]

In [None]:
test.sample(4)

In [None]:
pd.concat([test["Date"].head(1), test["Date"].tail(1)])  # Series.append was removed in pandas 2.0

In [None]:
try:
    wm_data_test = my_dao.load_week_month_data("wm_data_test")
except FileNotFoundError:
    wm_data_test = process.wm_data(test)
    my_dao.save_week_month_data(wm_data_test, "wm_data_test")

In [None]:
wm_data_test = process.format_wm_data_colnames(wm_data_test, "test")

In [None]:
try:
    wm_data_train_valid = my_dao.load_week_month_data("wm_data_train_valid")
except FileNotFoundError:
    wm_data_train_valid = process.wm_data(train)
    my_dao.save_week_month_data(wm_data_train_valid, "wm_data_train_valid")

In [None]:
wm_data_train_valid = process.format_wm_data_colnames(wm_data_train_valid, "train")

In [None]:
xy_test = pd.merge(wm_data_train_valid, wm_data_test, 
                   left_on=["wm_date", "store_dept"], right_on=["wm_date", "store_dept"], 
                   how="right", suffixes=["_train", "_test"])

xy_test.sample(5)

### Filling NaN values
While merging the train and test Week-Month datasets on the <br>
composite key (Store, Dept, Week_n, Month_n), NA values <br>
appeared in the Sales and Size of the previous year's data. <br>


In [None]:
print("NAs frequency over predictive columns\n")
for test_col in test_cols:
    print(test_col)
    print(stats.freq(xy_test[test_col].isna()))
    print()

10.45% of the values are missing, and they can be filled <br>
in several ways. <br>
Maybe we can replace all missing values of a <br>
(Store, Dept, Week_n, Month_n) with the median of all <br>
values of that (Store, Dept), over all Week_n and Month_n. <br>
<br>
First question: how many (Store, Dept) with missing<br>
data are present at any Date?

In [None]:
not_na_xy_test = xy_test[xy_test[test_cols].notna().all(1)]
na_xy_test = xy_test[~xy_test.index.isin(not_na_xy_test.index)]

In [None]:
stats.freq(na_xy_test["store_dept"].isin(wm_data_train_valid["store_dept"]))

We are lucky! <br>
99.70% of all (Store, Dept, Week_n, Month_n) with <br>
missing data have at least one entry with valid <br>
Weekly Sales. <br>
<br>
So, let's take the median of all available Weekly <br>
Sales of this (Store, Dept) to fill the missing values! <br>
The other 0.30% can be filled with the median of all <br>
Sales of their Store.
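The two-level fallback can be sketched in vectorized pandas. This is not the notebook's `process.dummy_fill_store_dept_median` / `process.dummy_fill_store_median` pair, which is applied row by row against the historical table; it is a simplified version on a made-up frame that computes the medians from the same frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Store": [1, 1, 1, 1, 1, 2],
    "Dept": [1, 1, 1, 2, 3, 1],
    "year1_sales_train": [100.0, np.nan, 300.0, np.nan, 50.0, 80.0],
})
col = "year1_sales_train"

# First choice: median of the same (Store, Dept); NaN if the group has no valid value.
dept_median = df.groupby(["Store", "Dept"])[col].transform("median")
# Fallback: median of the whole Store.
store_median = df.groupby("Store")[col].transform("median")

df["filled"] = df[col].fillna(dept_median).fillna(store_median)
print(df)
```

The missing value in (Store 1, Dept 1) gets that group's median, 200, while (Store 1, Dept 2) has no valid sales at all and falls back to the Store 1 median, 100.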

filling

In [None]:
na_xy_test_filled = na_xy_test.apply(lambda row : process.dummy_fill_store_dept_median(row, wm_data_train_valid, test_cols), axis=1)
xy_test = pd.concat([not_na_xy_test, na_xy_test_filled])  # DataFrame.append was removed in pandas 2.0

print("NAs frequency over predictive columns\n")
for test_col in test_cols:
    print(test_col)
    print(stats.freq(xy_test[test_col].isna()))
    print()

In [None]:
not_na_xy_test = xy_test[xy_test[test_cols].notna().all(1)]
na_xy_test = xy_test[~xy_test.index.isin(not_na_xy_test.index)]

na_xy_test_filled = na_xy_test.apply(lambda row : process.dummy_fill_store_median(row, wm_data_train_valid, test_cols), axis=1)
xy_test = pd.concat([not_na_xy_test, na_xy_test_filled])  # DataFrame.append was removed in pandas 2.0

print("NAs frequency over predictive columns\n")
for test_col in test_cols:
    print(test_col)
    print(stats.freq(xy_test[test_col].isna()))
    print()

# Predicting Weekly_Sales for TEST dataset

In [None]:
x_test = pd.concat([not_na_xy_test, na_xy_test_filled])  # DataFrame.append was removed in pandas 2.0
x_test.head(4)

In [None]:
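# Rename the year1 columns to the names used at fit time so the feature names match.
y_test_pred = reg.predict(x_test[test_cols].rename(columns=dict(zip(test_cols, fitting_cols))))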

In [None]:
subm = evaluation.build_submission_df(test_df=x_test, 
                                      target_predicted=y_test_pred,
                                      store_colname="Store_test", 
                                      dept_colname="Dept_test", 
                                      date_colname="Date_test")

# subm["Weekly_Sales"] = subm["Weekly_Sales"].apply(lambda ws : round(ws, 4))

In [None]:
subm.reset_index().to_csv("submission.csv", index=False)

In [None]:
download.create_download_link(subm, "Submission Download", "submission.csv")

## Results
My best score on the Kaggle submission was <font color="navy">3710.35797</font>, which <br>
would place me in <font color="navy">334th</font> position out of <font color="navy">688</font> people. <br>
Reaching the Top <font color="navy">48.5%</font> of participants was not expected, as <br>
this approach was designed as a simple contribution <br> 
to be ensembled with traditional approaches. <br>

Considering that only two features were used, it can be <br>
concluded that this strategy (initial assumptions and data <br>
transformations based on Store-Dept-Week_n-Month_n) has <br>
great potential for ensembling, or even for use on its own.<br>

As this case had to be done in a very short time, <br>
there is plenty of work that can be done as next steps. <br>

## Next Steps 

### Time Series Similarity
Stores and Departments that have similar time series shapes <br>
may form a semantic cluster. It is worth <br>
checking whether a Linear Regression can be fitted for each of these <br>
clusters. <br>

### Further Human Behavior Modeling
As I stated in the initial assumptions that <br>
(Week_n, Month_n) can explain human behavior, there may be <br>
other community behavior that is influenced by <br>
the communities' geolocation. We don't have this information, <br>
but fields like CPI, Temperature and Fuel Price may <br>
be used for clustering, and the resulting groups may represent <br>
nearby communities. <br>

### Smoothing Filters and ETS Decomposition 
The Hodrick-Prescott filter was tried, but it can be explored further. <br>
ETS (Error-Trend-Seasonality) was not applied and could be given <br>
a chance. <br>

### Ensemble Application With Traditional Models
This (Week_n, Month_n) strategy may lead to an awesome result if <br> 
used together with Exponential Smoothing or ARIMA models. 


### Tracking Tests Execution
A Design of Experiments wasn't applied at all, and it could help <br>
identify how to better switch between modules, models, parameters, <br>
features and so on...

### More features
Remember, only two features were used. <br>
Holiday data must be used due to its weight and relevance. 
