# Corporación Favorita Store Sales Prediction
This notebook demonstrates time-series forecasting of store sales using data from Corporación Favorita, a large Ecuador-based grocery retailer.

It is divided into two sections:
- Feature Engineering and Comparison
- Error Analysis


## Feature Engineering and Comparison
This section builds on this [notebook](https://www.kaggle.com/ryanholbrook/exercise-seasonality), which provides a set of well-performing seasonality features, as shown in the first chart below.

In [None]:
# Setup notebook
from pathlib import Path
from learntools.time_series.style import *  # plot style settings
from learntools.time_series.utils import plot_periodogram, seasonal_plot

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess
from sklearn.metrics import mean_squared_log_error, mean_absolute_error
from wordcloud import WordCloud


comp_dir = Path('../input/store-sales-time-series-forecasting')

holidays_events = pd.read_csv(
    comp_dir / "holidays_events.csv",
    dtype={
        'type': 'category',
        'locale': 'category',
        'locale_name': 'category',
        'description': 'category',
        'transferred': 'bool',
    },
    parse_dates=['date'],
    infer_datetime_format=True,
)
holidays_events = holidays_events.set_index('date').to_period('D')

store_sales = pd.read_csv(
    comp_dir / 'train.csv',
    usecols=['store_nbr', 'family', 'date', 'sales'],
    dtype={
        'store_nbr': 'category',
        'family': 'category',
        'sales': 'float32',
    },
    parse_dates=['date'],
    infer_datetime_format=True,
)
store_sales['date'] = store_sales.date.dt.to_period('D')
store_sales = store_sales.set_index(['store_nbr', 'family', 'date']).sort_index()
average_sales = (
    store_sales
    .groupby('date').mean()
    .squeeze()
    .loc['2017']
)

oil = pd.read_csv(
    comp_dir / 'oil.csv',
    parse_dates=['date'],
    infer_datetime_format=True,
)
oil = oil.set_index('date').to_period('D')

# National and regional holidays in the training set
holidays = (
    holidays_events
    .query("locale in ['National', 'Regional']")
    .loc['2017':'2017-08-15', ['description']]
    .assign(description=lambda x: x.description.cat.remove_unused_categories())
)

X_holidays = pd.get_dummies(holidays)

In [None]:
y = average_sales.copy()

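# Monthly Fourier features (4 sine/cosine pairs)
fourier = CalendarFourier(freq='M', order=4)
dp = DeterministicProcess(
    index=y.index,
    constant=True,               # intercept
    order=1,                     # linear trend
    additional_terms=[fourier],  # monthly seasonality
    seasonal=True,               # weekly seasonal indicators
    drop=True,                   # drop perfectly collinear terms
)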
X = dp.in_sample()

In [None]:
def get_LR_pred(X, y, model=None):
    model = model if model else LinearRegression().fit(X, y)
    
    if len(y.shape) == 1:
        return pd.Series(model.predict(X),index=X.index)
    else:
        return pd.DataFrame(model.predict(X), index=X.index, columns=y.columns)

#y_pred = get_LR_pred(X, y)

def plot_pred(y, y_pred=None, index=None, targ_ax=None, 
              plot_title=None, ylabel="items sold", 
              pred_label="Seasonal", no_pp=False):
    y = y[index] if index is not None else y
        
    pp = {} if no_pp else plot_params
    targ_ax = y.plot(**pp, alpha=0.5, title=plot_title, ylabel=ylabel, ax=targ_ax)

    if y_pred is not None:
        targ_ax = plot_pred(y_pred, index=index, targ_ax=targ_ax, ylabel=None, no_pp=True)
    
    targ_ax.legend()
    
    return  targ_ax
    
#plot_pred(y, y_pred);
    

In [None]:
# Join to training data
# Join the one-hot encoded holidays; the raw `description` column is categorical text
X2 = X.join(X_holidays, on='date').fillna(0.0)

y_pred = get_LR_pred(X2, y)  
ax = plot_pred(y, y_pred, plot_title="Average Sales")

### Oil Price and Oil Price Trend
The blue seasonal pattern in the chart above follows the average store sales very well. However, on many occasions it overshoots or undershoots the peaks and dips. The number of items on promotion and the oil price could be good predictors for stretching or squeezing the seasonal pattern.

>Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.

The following charts compare the daily oil price with the moving sum of the changes in the oil price. The oil price itself may not have much predictive power, because the perception of the current oil price depends on the previous prices. For instance, in March an oil price of 50 was bad, since it had dropped from 54. In July, however, the same price was great, since it had risen from 44.

To visualize the trend, the oil data is transformed. First, the difference to the price of the previous day is calculated. Then the differences are summed up in a 14-day window. The charts show the "raw" oil price and the resulting oil trend. In March, when the price dropped from 54, the oil trend is below -4. At the beginning of August, the same oil price has a trend value of around +2.
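
The transformation can be illustrated on a few made-up prices. This is only a sketch: the values are hypothetical and not taken from `oil.csv`, and it uses a trailing 3-day window instead of the centered 14-day window to keep the numbers easy to follow. Because a sum of consecutive daily differences telescopes, the window sum equals the price change across the window.

```python
import pandas as pd

# Hypothetical daily prices, for illustration only (not from oil.csv)
price = pd.Series(
    [54.0, 53.0, 51.0, 50.0, 50.5, 52.0],
    index=pd.period_range("2017-03-01", periods=6, freq="D"),
)

daily_change = price.diff()                   # difference to the previous day
trend = daily_change.rolling(window=3).sum()  # trailing 3-day sum of the changes

# The sum of consecutive differences telescopes to
# "price today minus price three days ago"
change_3d = price - price.shift(3)

print(pd.DataFrame({"price": price, "trend": trend, "change_3d": change_3d}))
```

A falling price yields a negative trend value and a rising price a positive one, even when the price level itself is the same.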

In [None]:
wdw=14
oil_moving_sum = oil.loc['2017'].diff().rolling(
    window=wdw,          # 14-day window
    center=True,         # puts the sum at the center of the window
    min_periods=wdw//2,  # require at least half the window to be non-missing
).sum()                  # sum of the daily changes within the window

plot_pred(oil.loc['2017'], plot_title='Oil Price', ylabel='');
plot_pred(oil_moving_sum, plot_title='Oil Trend', ylabel='trend');

The following charts compare the previous model with one model trained with the oil trend and another one trained with the raw oil price. The differences are very small, so the mean absolute error (MAE) of each model is also given.
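
For reference, the MAE is the mean of the absolute differences between actual and predicted values, expressed in the units of the target. A minimal sketch with made-up numbers (not results from this notebook):

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([400.0, 450.0, 500.0, 380.0])
y_hat = np.array([410.0, 430.0, 505.0, 400.0])

# Mean absolute error: the average size of the misses
mae = np.mean(np.abs(y_true - y_hat))
print(f"MAE: {mae:.2f}")
```

`mean_absolute_error` from scikit-learn, which the notebook uses, computes the same quantity.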

In [None]:
oiltrend = X.join(oil_moving_sum.groupby('date').min(), on='date').fillna(0.0)
oilraw = X.join(oil.loc['2017'], on='date').fillna(0.0)

fig, axes =  plt.subplots(3,1, figsize=(11, 14))

y_pred = get_LR_pred(X2, y)
mae = mean_absolute_error(y, y_pred)
plot_pred(y, y_pred, plot_title=f'Without Oil Trend - MAE: {mae:.2f}', targ_ax=axes[0]);

y_pred = get_LR_pred(oiltrend, y)
mae = mean_absolute_error(y, y_pred)
plot_pred(y, y_pred, plot_title=f'With Oil Trend - MAE: {mae:.2f}', targ_ax=axes[1]);

y_pred = get_LR_pred(oilraw, y)
mae = mean_absolute_error(y, y_pred)
plot_pred(y, y_pred, plot_title=f'With Oil Raw - MAE: {mae:.2f}', targ_ax=axes[2]);

### Items on Promotion
The following chart shows the number of items on promotion per day along with its trend. Note that `average_promo` is computed as a sum over all stores and product families. In contrast to the oil data, the trend is calculated as the moving average over a 7-day window.
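
The effect of a centered window with `min_periods` can be seen on a small made-up series. This is only a sketch with a 3-day window and `min_periods=1`; the notebook itself uses a 7-day window and `min_periods=3`.

```python
import pandas as pd

# Made-up daily counts, for illustration only
counts = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Centered 3-day moving average; min_periods=1 lets the first and last
# value be averaged over a shorter window instead of becoming NaN
ma = counts.rolling(window=3, center=True, min_periods=1).mean()
print(ma.tolist())
```

At the edges, only two values fall inside the window, so the first and last entries are averages of two points rather than three.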

In [None]:
promo = pd.read_csv(
    comp_dir / 'train.csv',
    usecols=['store_nbr', 'family', 'date', 'onpromotion'],
    dtype={
        'store_nbr': 'category',
        'family': 'category',
        'onpromotion': 'float32',
    },
    parse_dates=['date'],
    infer_datetime_format=True,
)

promo['date'] = promo.date.dt.to_period('D')
promo = promo.set_index(['store_nbr', 'family', 'date']).sort_index()
average_promo = (
    promo
    .groupby('date').sum()
    .squeeze()
    .loc['2017']
)

wdw=7
promo_ma = average_promo.loc['2017'].rolling(
    window=wdw,          # 7-day window
    center=True,         # puts the average at the center of the window
    min_periods=wdw//2,  # require at least 3 non-missing days
).mean()                 # compute the mean (could also do median, std, min, max, ...)

plot_pred(average_promo, promo_ma, 
          plot_title=f"{wdw}-Day Moving Average", ylabel="items on promotion");

The following charts compare the initial model with one model trained with the promotion trend and another one trained with the unprocessed number of items on promotion. The initial model outperforms the other models on the MAE metric. However, the promotion-based predictions seem to be more accurate in August.

In [None]:
promotrend = X.join(promo_ma.groupby('date').min(), on='date').fillna(0.0)
promoraw = X.join(average_promo, on='date').fillna(0.0)

fig, axes =  plt.subplots(3,1, figsize=(11, 14))

y_pred = get_LR_pred(X2, y)
mae = mean_absolute_error(y, y_pred)
plot_pred(y, y_pred, plot_title=f'Without Promo Trend - MAE: {mae:.2f}', targ_ax=axes[0]);

y_pred = get_LR_pred(promotrend, y)
mae = mean_absolute_error(y, y_pred)
plot_pred(y, y_pred, plot_title=f'With Promotion Trend - MAE: {mae:.2f}', targ_ax=axes[1]);

y_pred = get_LR_pred(promoraw, y)
mae = mean_absolute_error(y, y_pred)
plot_pred(y, y_pred, plot_title=f'With Promotion Raw - MAE: {mae:.2f}', targ_ax=axes[2]);

## Error Analysis
The previous models were trained on the average sales over all stores and all product families in order to identify an overall seasonal pattern for the store sales. In this section, the model predicts the sales for each combination of store and product family. The error is then used to search for hints on how to improve the overall performance.

The MAE alone is not sufficient to compare the error across product families: families with high sales tend to have a higher MAE than families that sell only a few items. Therefore, the absolute error is divided by the average sales of each store and product family combination.
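
A small sketch with made-up numbers shows why the scaling matters. The column names `high_volume` and `low_volume` are hypothetical stand-ins for two store and product family combinations.

```python
import pandas as pd

# Made-up sales for two store/family combinations, for illustration only
actual = pd.DataFrame({"high_volume": [1000.0, 1200.0, 800.0],
                       "low_volume": [2.0, 4.0, 0.0]})
predicted = pd.DataFrame({"high_volume": [1100.0, 1100.0, 900.0],
                          "low_volume": [3.0, 3.0, 1.0]})

abs_error = (actual - predicted).abs()
mae = abs_error.mean()

# Divide by the average sales of each column; a zero mean is replaced by 1
# to avoid a division by zero (the same guard the notebook's code uses)
scaled_mae = mae / actual.mean().replace(0, 1)

print(mae)
print(scaled_mae)
```

The high-volume column has the larger MAE, but after scaling the low-volume column turns out to be the one that is predicted worse relative to its sales level.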

Here is a plot of the actual and predicted sales for a sample store and product family combination, followed by the error metric for the same combination.

In [None]:
y = store_sales.unstack(['store_nbr', 'family']).loc["2017"]
y_pred = get_LR_pred(promoraw, y) 
abse = (y - y_pred).abs() / y.mean().replace(0,1)

fig, axes =  plt.subplots(2,1, figsize=(11, 8))

plot_pred(y, y_pred, index=('sales', '1', 'PRODUCE'), targ_ax=axes[0],
          plot_title="Sales for Store 1 and PRODUCE");

ax = plot_pred(abse, index=('sales', '1', 'PRODUCE'), targ_ax=axes[1],
               plot_title="Error Metric for Store 1 and PRODUCE")
ax.set_ylim((-0.1,1));



The following table shows statistics of the calculated error, sorted by the mean. The top five rows have the highest error and the bottom five the lowest. The latter are probably product families that are not sold in the given store.

In [None]:
abse = (y - y_pred).abs() / y.mean().replace(0,1)
abse_srt = abse.describe().T.sort_values(by='mean', ascending=False)
abse_srt

The following charts show the five combinations with the highest error along with their actual sales and predictions. These are extreme cases: the stores sell only 1 or 2 items of these families within several months.

In [None]:
fig, axes = plt.subplots(5, 1, figsize=(11, 18))
for ax, idx in zip(axes, abse_srt.nlargest(5, 'mean').index.values):
    plot_pred(y, y_pred, index=idx, targ_ax=ax,
              plot_title=f'Sales for Store {idx[1]} and {idx[2]}')


In the following approach, the error is summed up for every store, as shown in the bar chart below.

In [None]:
eps = abse_srt.reset_index().groupby('store_nbr')['mean'].sum()
ax = eps.sort_values().plot(kind='bar',alpha=0.5, ylabel="",
                            title="Error per Store")
plt.xticks(rotation=0);

To find out whether the error is related to product families, the five product families with the largest error are extracted for every store. Both word clouds show the product families that were the hardest to predict: the red one for the stores with the largest error and the green one for the stores with the smallest error.

In [None]:
top5 = eps.nlargest(5).index.values
min5 = eps.nsmallest(5).index.values

tmp = abse_srt.reset_index()
wc = []
for store_nbr in top5:
    q = (tmp.store_nbr==store_nbr)
    wc = wc+ [f.replace(' ', '_') for f in tmp[q].nlargest(5, 'mean').family.values]

wordcloud = WordCloud(width=1400, height=400, #background_color='red',
                      random_state=1, colormap='Reds', 
                      collocations=False).generate(' '.join(wc))# Plot
plt.imshow(wordcloud);
plt.axis("off");


In [None]:
wc = []
for store_nbr in min5:
    q = (tmp.store_nbr==store_nbr)
    wc = wc+ [f.replace(' ', '_') for f in tmp[q].nlargest(5, 'mean').family.values]

wordcloud = WordCloud(width=1400, height=400, #background_color='red',
                      random_state=1, colormap='Greens', 
                      collocations=False).generate(' '.join(wc))# Plot
plt.imshow(wordcloud);
plt.axis("off");

Both word clouds have the following product families in common (in alphabetical order):
  - BABY CARE
  - BOOKS
  - HARDWARE
  - HOME APPLIANCES
  - SCHOOL AND OFFICE SUPPLIES
  
These seem to have a different seasonality pattern. Therefore, the dataset was split into "Top5 Error Product Families" and "Other", and each group got its own model.

 * Top 5 Error Product Families: [Fourier features](https://www.kaggle.com/ryanholbrook/seasonality#Fourier-Features-and-the-Periodogram) for a yearly pattern
 * Other: [Fourier features](https://www.kaggle.com/ryanholbrook/seasonality#Fourier-Features-and-the-Periodogram) for a monthly pattern
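
The difference between the two settings is the period of the sine/cosine pairs. The following hand-rolled sketch builds such pairs from the position of each date within its month or within the year. The helper `fourier_pairs` is hypothetical and is not how statsmodels implements `CalendarFourier`; it only illustrates the idea.

```python
import numpy as np
import pandas as pd

def fourier_pairs(position, order):
    """Hypothetical helper: sin/cos pairs for a position in [0, 1) within one period."""
    features = {}
    for k in range(1, order + 1):
        features[f"sin_{k}"] = np.sin(2 * np.pi * k * position)
        features[f"cos_{k}"] = np.cos(2 * np.pi * k * position)
    return pd.DataFrame(features)

dates = pd.date_range("2017-01-01", "2017-12-31", freq="D")

# Fraction of the month / of the year that has elapsed at each date
month_pos = pd.Series((dates.day - 1) / dates.days_in_month, index=dates)
year_pos = pd.Series((dates.dayofyear - 1) / 365, index=dates)  # 2017 has 365 days

monthly = fourier_pairs(month_pos, order=4)  # pattern repeats every month
yearly = fourier_pairs(year_pos, order=4)    # pattern repeats once per year

print(monthly.head())
print(yearly.head())
```

With the monthly period, the first day of every month gets identical feature values, whereas the yearly features keep changing throughout the year.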

In [None]:
top5_error_families = ['BABY CARE', 'BOOKS', 'HARDWARE', 'HOME APPLIANCES',  'SCHOOL AND OFFICE SUPPLIES']
other = [f for f in tmp.family.unique() if f not in top5_error_families]
store_sales.loc[:,top5_error_families,:]

tef = (
    store_sales.loc[:,top5_error_families,:]
    .groupby('date').mean()
    .squeeze()
    .loc['2017']
)

oth = (
    store_sales.loc[:,other,:]
    .groupby('date').mean()
    .squeeze()
    .loc['2017']
)

#plot_periodogram(tef);

In [None]:
fourier = CalendarFourier(freq='Y', order=4)
dp = DeterministicProcess(
    index=tef.index,
    constant=True,
    order=1,
    additional_terms=[fourier],
    seasonal=True,
    drop=True,
)
Xtef = dp.in_sample()
y_pred = get_LR_pred(Xtef, tef)  
plot_pred(tef, y_pred, 
          plot_title="Average Sales Top5 Error Product Families");

In [None]:
fourier = CalendarFourier(freq='M', order=4)
dp = DeterministicProcess(
    index=oth.index,
    constant=True,
    order=1,
    additional_terms=[fourier],
    seasonal=True,
    drop=True,
)
Xoth = dp.in_sample()
promoraw = Xoth.join(average_promo, on='date').fillna(0.0)
y_pred = get_LR_pred(promoraw, oth)  
plot_pred(oth, y_pred, 
          plot_title="Average Sales other");