## Abstract

This notebook looks for a good solution to the problem of predicting sales for Walmart stores and departments, using data provided by Kaggle. It covers data extraction, cleaning and exploratory data analysis, then tests different machine learning algorithms and chooses the one with the lowest error in cross validation.

## Notes

1. Some features are more important to the model than others. It is worth fitting models with features removed or added to check whether performance improves.
2. It is worth adding columns derived from other columns. Sometimes the information in a feature is better represented in another way.
3. In this notebook, "train data" means the union of the train and validation data. Wherever I reference train data, you can read train + validation.
4. I have chosen Random Forest Regressor and Prophet as the base algorithms for model predictions. There are a lot of other models that could be tested (Decision Tree, Boosting Algorithms, Neural Networks, Logistic Regression, etc.). If you have the time, it is worth testing some of them.
5. I have not used cross validation to measure Prophet performance, but I have used it to measure Random Forest Regressor. Although cross validation is almost always recommended to reduce overfitting, the technique is different for time series: dates in the train set have to be earlier than dates in the validation set, so it is not possible to randomly select a fold from the data. Because only two years of data were provided and I identified yearly seasonality, it became impracticable to apply cross validation to the time series model. To compare WMAE from Prophet and Random Forest, I have used the mean WMAE from cross validation in Random Forest.
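The time-ordering constraint in note 5 can be sketched with an expanding-window split, where each fold trains on the past and validates on the block that follows it. This is a minimal illustration, not the notebook's method: `expanding_window_splits` is a hypothetical helper and the series length of 143 is only an example value.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs where every training index
    comes before every validation index."""
    fold_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = k * fold_size
        # The last fold absorbs any leftover samples
        val_end = n_samples if k == n_splits else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, val_end))

# 143 weekly observations is an illustrative length
splits = list(expanding_window_splits(n_samples=143, n_splits=3))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

scikit-learn offers `TimeSeriesSplit` in `sklearn.model_selection` for the same purpose. With yearly seasonality and roughly two years of data, the early folds would contain less than one full seasonal cycle, which is the practical limitation the note describes.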

## Importing Libraries

In [None]:
#Importing general libraries
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from matplotlib import pyplot as plt
from matplotlib.gridspec import GridSpec
import seaborn as sns
from scipy import stats
from string import ascii_letters
import math
import random
from sklearn.ensemble import RandomForestRegressor


%matplotlib inline

## Loading Data

In [None]:
#Loading data from local csv files


df_stores = pd.read_csv('../input/data-files/stores.csv')
df_features = pd.read_csv('../input/data-files/features.csv')
df_train = pd.read_csv('../input/data-files/train.csv')
df_test = pd.read_csv('../input/data-files/test.csv')

### Merging Data Into a Single Dataframe

In [None]:
#Function to remove undesired columns and rename desired columns - It was created to clean dataframes resulting from left join operations between dataframes
def standadize_columns(df):
    # Work on a copy so that the caller's dataframe is not modified in place
    df_aux = df.copy()
    # Drop the columns that came from the right side of the join
    df_aux = df_aux.drop(columns=[x for x in df_aux.columns if '_right' in x or '_y' in x])
    # Remove the suffixes added to the columns from the left side
    df_aux.columns = [x.replace('_left', '').replace('_x', '') for x in df_aux.columns]
    return df_aux.drop_duplicates()

def deduplicate_array(arr_elements):
    elements = []
    for x in arr_elements:
        if x not in elements:
            elements.append(x)
    return elements

#Concatenate train and test data to obtain a more complete dataframe
df_train['Data_Type'] = 'Train'
df_test['Data_Type'] = 'Test'
df_train_test = pd.concat([df_train, df_test])

#Converting 'Date' column, for each dataframe, to datetime object
df_features['Date'] = pd.to_datetime(df_features['Date'])
df_train_test['Date'] = pd.to_datetime(df_train_test['Date'])
df_train['Date'] = pd.to_datetime(df_train['Date'])
df_test['Date'] = pd.to_datetime(df_test['Date'])

#Merging dataframe with features data
df_blended_data = df_train_test.merge(df_features, how='left', left_on=["Date", "Store"], right_on=["Date","Store"])
df_blended_data = standadize_columns(df_blended_data)

#Merging dataframe with stores data
df_blended_data = df_blended_data.join(df_stores.set_index('Store'), on = 'Store', how='left', 
                                          lsuffix='_left', rsuffix='_right')

df_blended_data = standadize_columns(df_blended_data)

### Adding New Columns

In [None]:
#Adding columns that show the week, month, year and day of each row's date
df_blended_data['Year_Month'] = df_blended_data['Date'].dt.to_period('M')
df_blended_data['Year'] = df_blended_data['Date'].dt.year
df_blended_data['Month'] = df_blended_data['Date'].dt.month
#DatetimeIndex.week was removed in pandas 2.0; isocalendar() gives the ISO week number
df_blended_data['Week'] = df_blended_data['Date'].dt.isocalendar().week.astype(int)
df_blended_data['Day'] = df_blended_data['Date'].dt.day
#Concatenate as text: adding the two integers would make different year/week pairs collide
df_blended_data['Year_Week'] = df_blended_data['Year'].astype(str) + '_' + df_blended_data['Week'].astype(str).str.zfill(2)

### Classify columns according to their type

In [None]:
categorical_features = ['Data_Type', 'Type', 'Year_Month', 'Year_Week', 'Year', 'Month', 'Week', 'Store', 'Dept']
numerical_features = ['Weekly_Sales', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 
                      'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment', 'Size'
                     ]

#Forcing Categorical Columns to be String
for feature in categorical_features:
    df_blended_data[feature] = df_blended_data[feature].astype('str')

## Inspecting Data

In [None]:
#Printing first lines of dataframe
display(df_blended_data.head())

#Printing Features general info
df_description = df_blended_data.groupby('Data_Type').describe()
df_T = df_description.T

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df_T)

#Printing column variables type and null count
df_blended_data.info()

As the output above shows, the markdown columns have a lot of null values. For column 'MarkDown2', more than 60% of the values are null. If these columns are not highly correlated with the amount of sales, it is recommended to eliminate them.
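The share of null values per column can be checked directly instead of being read off the `info()` output. The sketch below uses a small made-up frame in place of the merged data, and the 60% threshold is only an example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged data; the values are made up
toy = pd.DataFrame({
    'MarkDown1': [100.0, np.nan, np.nan, 250.0],
    'MarkDown2': [np.nan, np.nan, np.nan, 40.0],
    'CPI': [211.1, 211.2, 211.3, 211.4],
})

# Fraction of missing values per column, largest first
null_share = toy.isna().mean().sort_values(ascending=False)
print(null_share)

# Columns whose missing share exceeds an (arbitrary) 60% threshold
mostly_null = null_share[null_share > 0.6].index.tolist()
print(mostly_null)
```

Applied to the real data, the same two lines would run on `df_blended_data` restricted to the markdown columns.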

In [None]:
#Obtain holidays
df_blended_data[df_blended_data['IsHoliday'] == True]['Date'].unique()

In [None]:
#Find the weeks with the highest total sales so that they can be inspected closely
weekly_sales = df_blended_data.groupby(['Date'])['Weekly_Sales'].agg('sum').reset_index().copy()
weekly_sales = weekly_sales.sort_values(by=['Weekly_Sales'], ascending = False)
weekly_sales.head(10)

As we can see, the dates with the highest sales are holiday weeks or, in the case of Christmas, the weeks that precede the holiday. This is interesting because it suggests modelling the weeks right before Christmas, rather than Christmas week itself, as the Christmas holiday. It also suggests that people buy their supermarket products for Christmas in advance.

### Adding More Columns

Now it is time to break holidays down by type and to treat the weeks right before Christmas as holidays.

In [None]:
#Inserting Holiday Type column
#'Month' was cast to string above, so it is compared with string values here
df_blended_data['Holiday_Type'] = ''
df_blended_data.loc[(df_blended_data['IsHoliday'] == True) & (df_blended_data['Month'] == '2'), 'Holiday_Type'] = 'Super Bowl'
df_blended_data.loc[(df_blended_data['IsHoliday'] == True) & (df_blended_data['Month'] == '9'), 'Holiday_Type'] = 'Labor Day'
df_blended_data.loc[(df_blended_data['IsHoliday'] == True) & (df_blended_data['Month'] == '11'), 'Holiday_Type'] = 'Thanksgiving'
df_blended_data.loc[(df_blended_data['Date'] == pd.to_datetime('2010-12-24')), 'Holiday_Type'] = 'Christmas W-1'
df_blended_data.loc[(df_blended_data['Date'] == pd.to_datetime('2011-12-23')), 'Holiday_Type'] = 'Christmas W-1'
df_blended_data.loc[(df_blended_data['Date'] == pd.to_datetime('2010-12-17')), 'Holiday_Type'] = 'Christmas W-2'
df_blended_data.loc[(df_blended_data['Date'] == pd.to_datetime('2011-12-16')), 'Holiday_Type'] = 'Christmas W-2'
#W-3 is assumed to be the week before W-2
df_blended_data.loc[(df_blended_data['Date'] == pd.to_datetime('2010-12-10')), 'Holiday_Type'] = 'Christmas W-3'
df_blended_data.loc[(df_blended_data['Date'] == pd.to_datetime('2011-12-09')), 'Holiday_Type'] = 'Christmas W-3'
df_holidays = df_blended_data[df_blended_data['IsHoliday'] == True][['Date', 'IsHoliday', 'Holiday_Type']].drop_duplicates()

#Including Holiday Type in Array Of Categorical Features
categorical_features.append('Holiday_Type')

### Creating new columns that give numerical values to categorical features

It is important to give numerical values to categorical columns so that we can use them in regression models. In the code below, the numerical values for each categorical feature are assigned according to the order of average weekly sales for each possible value of the feature. For example, if Type = 'C' has higher average sales than Type = 'A', then std_Type (the numerical value for column Type) for Type = 'A' is lower than std_Type for Type = 'C'. This ordering was chosen because it tends to help regression models perform better.
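The ordering idea can be seen on a tiny example before reading the full loop. The sketch below uses made-up sales values and encodes a single column; `rank_by_type` is a name introduced here for illustration.

```python
import pandas as pd

# Made-up data: three store types with different average sales
toy = pd.DataFrame({
    'Type': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Weekly_Sales': [10.0, 20.0, 5.0, 7.0, 30.0, 50.0],
})

# Average target per category, sorted from lowest to highest
avg_sales = toy.groupby('Type')['Weekly_Sales'].mean().sort_values()

# Rank 0 goes to the category with the lowest average, and so on
rank_by_type = {category: rank for rank, category in enumerate(avg_sales.index)}
toy['std_Type'] = toy['Type'].map(rank_by_type)
print(toy)
```

Because the ranks are derived from the target, it is generally safer to compute the averages from training rows only, so that no information from validation rows leaks into the features.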

In [None]:
# For each categorical variable, create a column that lists a number for each value of the categorical variable. 
# The related values are in increasing order of impact on the amount of sales 
df_aux = df_blended_data
for feature in categorical_features:
    df_avg_sales = df_blended_data.groupby(feature)['Weekly_Sales'].agg('mean').reset_index().drop_duplicates().dropna()
    # Order categorical values according to weekly sales
    df_avg_sales = df_avg_sales.sort_values(by=['Weekly_Sales']).reset_index()
    new_col_name = 'std_' + feature
    numerical_features.append(new_col_name)
    df_avg_sales[new_col_name] = df_avg_sales.index
    df_aux = df_aux.join(df_avg_sales.set_index(feature), on = feature, how='left', 
                                              lsuffix='_left', rsuffix='_right')
    
    df_aux = standadize_columns(df_aux)

df_blended_data = df_aux.copy()
numerical_features = deduplicate_array(numerical_features)

## Correlation Between Features

In [None]:
# Code imported from https://seaborn.pydata.org/examples/many_pairwise_correlations.html

sns.set(style="white")
corr = df_blended_data[numerical_features].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(20, 16))
cmap = sns.diverging_palette(220, 15, as_cmap=True)
plt.title('Correlation Matrix', fontsize=18)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, 
            center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

corr_weekly_sales = corr['Weekly_Sales'].loc[~corr['Weekly_Sales'].index.isin(['Weekly_Sales'])]
corr_weekly_sales.sort_values().tail(15)

fig = plt.figure(figsize=(16,8))
plt.bar(corr_weekly_sales.index, corr_weekly_sales.values)
plt.xticks(rotation=90)
plt.grid()

plt.show()

Some features are highly correlated (|correlation| > 0.7), so it is not advisable to use them together in the models. For each pair of correlated features, it is recommended to keep the one that has the higher correlation with weekly sales and the higher importance in model prediction. It is also worth noting that the information associated with Department seems to have high importance in model prediction.
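The highly correlated pairs can also be listed programmatically rather than read from the heatmap. This is a minimal sketch on made-up columns `a`, `b` and `c`; the 0.7 threshold is the one mentioned above.

```python
import numpy as np
import pandas as pd

# Made-up features: 'b' is almost a multiple of 'a', 'c' is weakly related
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.1, 6.0, 8.2, 10.0],
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so that each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.7].index.tolist()
print(high_pairs)
```

For each pair returned, one of the two features would then be dropped according to the rule described above.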

## Categorical Variables Distribution

Looking at the distribution of the categorical features helps to understand the data. For example, if a certain value of a categorical feature occurs far more often than the others, it is good to know, because it can bias the model.

In [None]:
#catplot creates its own figure for each feature, so no figure is created beforehand
for feat in ['Data_Type', 'Type', 'Store', 'Dept']:
    sns.catplot(x= feat, kind = "count", palette="ch:.25", data=df_blended_data)
    plt.xticks(rotation=90)

## Weekly Sales Distribution For Each Numerical feature

Below, I have plotted boxplots of each numerical feature, broken down by 'Type', 'Store', 'Dept' and 'IsHoliday'. The graphs give us important information not only about the asymmetry of the data but also about how each of the four categorical features relates to the numerical features.

In [None]:
def order_cat_variables(cat_var, num_var):
    cat_sorted_average = df_blended_data.groupby([cat_var])[num_var].agg('mean').reset_index().sort_values(by=num_var)[cat_var]
    return cat_sorted_average.unique()

def print_pretty_subplot(df):
    qty_variables = len(numerical_features)
    # Two rows of two plots per numerical feature:
    # left column shows Type and Dept, right column shows Store and IsHoliday
    fig, axes = plt.subplots(2 * qty_variables, 2, figsize=(60, 300))
    for row, feature in enumerate(numerical_features):
        panels = [('Type', axes[2 * row, 0]), ('Store', axes[2 * row, 1]),
                  ('Dept', axes[2 * row + 1, 0]), ('IsHoliday', axes[2 * row + 1, 1])]
        for cat_var, ax in panels:
            try:
                # Categories are ordered by the mean of the numerical feature
                order = order_cat_variables(cat_var, feature)
                sns.boxplot(x=cat_var, y=feature, data=df, showfliers=False, order=order, ax=ax)
            except Exception as e:
                print("Error: ", str(e), " feature = ", feature, " category = ", cat_var)

numerical_features = deduplicate_array(numerical_features)
print_pretty_subplot(df_blended_data)

#### Graphs on the left side:

1. As shown in Boxplot 1, Type C has the highest weekly sales on average. Types A and B have a similar interquartile range, and 75% of their weekly sales are up to 16000.

2. Boxplot 2 shows that the departments with the largest weekly sales are 38, 95 and 92. It is worth mentioning that department 65 has more consistent weekly sales, as its quantiles are close to its average.

3. Boxplot 3 shows that Type C has higher average and quantiles for Temperature. When comparing departments, it is noticeable on Boxplot 4 that only departments 77, 43 and 39 behave differently than other departments, with lower average and quantiles.

4. There isn't much difference between Types when compared with fuel price, as shown on Boxplot 5. When comparing departments, however, Boxplot 6 shows that some departments present lower averages: 51, 78, 45 and 65.

5. Boxplot 7 shows that Type A has the highest average and quantiles for Markdown 1, while Type C has significantly lower values. Boxplot 8 compares departments for Markdown 1 and shows that the departments with the highest averages are 39 and 78. Most departments have similar average values; however, departments 65 and 43 have more consistent values, as their quantiles are closer to the average than those of other departments.

6. Boxplot 9 shows that all types have similar average but Type A presents more dispersed values. As for departments (Boxplot 10), 43 has the highest average and quantiles, while 51 has lower values.

7. Boxplot 11 shows that Type A has higher values, while Boxplot 12 shows that the only department that stood out was 77.

8. Boxplot 13 presents similar information as Boxplot 11. When comparing departments (Boxplot 14), the ones that stand out are 65, 51, with more consistent values, and 43, with higher average and quantiles.

9. Boxplot 15 presents similar information as Boxplot 13, where Type A has higher average and quantiles. When comparing the departments on Boxplot 16, the ones that stand out are 77, with the longest range, and 43, with lower values.

10. Boxplot 18 shows that Type A has the highest average, while Boxplot 19 shows that departments 65, 43 and 39 have the lowest values.

11. Boxplot 20 doesn't show much difference between types. However, Boxplot 21 shows that department 65 has the highest average and quantiles, and 39 has the most consistent values.

12. Boxplot 22 shows that Type A has the highest values, while Type C has the lowest. When comparing departments (Boxplot 23), the ones that stand out are 65, 50, 99, 37 and 39.

#### Graphs on the right side

1. Boxplot 1 shows that Stores 5, 33, 44, 3 and 38 have the lowest weekly sales, while Stores 2, 13, 14, 4 and 20 have the highest weekly sales.

2. Boxplot 2 does not show a difference in weekly sales whether it is a holiday or not.

3. Boxplot 3 shows that Stores 7, 26, 16 have the lowest values, while 10, 42 and 33 have the highest values.

4. Boxplot 4 shows that values are high when it is a holiday.

5. Boxplot 5 compares stores' weekly sales and fuel price, and shows that store 19 has the highest values.

6. Boxplot 7 shows that, when comparing with Markdown 1, stores 37, 43, 36, 33, 42, 38, 44 and 30 have the lowest weekly sales, while 27 and 19 have the largest weekly sales on average. Boxplot 8 does not show much difference whether it is a holiday or not.

7. Boxplot 9 shows that, when comparing with Markdown 2, stores 33, 37, 44, 38, 30 and 36 have the lowest weekly sales, while 27 and 13 have the longest range of weekly sales. Boxplot 10 shows that when it is a holiday, the range is longer.

8. Boxplot 11 shows that, when comparing with Markdown 3, stores 36, 33, 30, 44, 43, 42 and 36 have the lowest weekly sales, while 27 and 13 have the longest range of weekly sales. Boxplot 12 shows that when it is a holiday, the range is longer.

9. Boxplot 13 shows that, when comparing with Markdown 4, stores 36, 33, 30, 44, 43, 42 and 36 have the lowest weekly sales, while 12 and 4 have the longest range of weekly sales and the highest average weekly sales. Boxplot 14 shows that when it is a holiday, the range is slightly longer, although the average is lower.

10. Boxplot 15 shows that, when comparing with Markdown 5, store 39 stands out with the longest range of weekly sales and the highest average weekly sales. Boxplot 16 does not show a difference whether it is a holiday or not.

11. Boxplot 17 shows that CPI directly impacts the weekly sales of every store; however, Boxplot 18 shows that there is no difference whether it is a holiday or not.

12. Boxplot 18 shows that unemployment also directly impacts the weekly sales of every store; however, Boxplot 19 shows that there is no difference whether it is a holiday or not.


## Normalize data

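Standardizing the data is important because it puts the features on a comparable scale (zero mean and unit variance), which tends to help models that are sensitive to feature magnitudes. Tree-based models such as Random Forest are much less sensitive to it.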
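As a quick illustration of what the scaler computes, the sketch below standardizes a made-up array by hand. It assumes the population standard deviation, which is what `StandardScaler` uses.

```python
import numpy as np

# Made-up values for a single feature
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()   # 5.0
std = values.std()     # population standard deviation, 2.0

# Each value is expressed as a number of standard deviations from the mean
z = (values - mean) / std
print(z)
```

After the transformation the feature has mean 0 and standard deviation 1, but the shape of its distribution is unchanged.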

In [None]:
from sklearn import preprocessing

num_x = []
for feat in numerical_features:
    if feat not in num_x and feat != 'Weekly_Sales':
        num_x.append(feat)
        
scaler = preprocessing.StandardScaler().fit(df_blended_data[num_x])

#Columns used to put the rows in a fixed order
sort_columns = ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Data_Type',
       'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3',
       'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment', 'Type', 'Size',
       'Year_Month', 'Year', 'Month', 'Week', 'Day', 'Year_Week',
       'Holiday_Type', 'index', 'std_Data_Type', 'std_Type', 'std_Year_Month',
       'std_Year_Week', 'std_Year', 'std_Month', 'std_Week', 'std_Store',
       'std_Dept', 'std_Holiday_Type']

df_blended_data = df_blended_data.sort_values(by=sort_columns)

#creating dataframe that will be used to recover original data at the end of notebook
#(copies of the sorted dataframe keep the same row order)
df_original_data = df_blended_data.copy()

df_changed_data = df_blended_data.copy()

df_blended_data[num_x] = scaler.transform(df_blended_data[num_x])
df_changed_data[num_x] = scaler.transform(df_changed_data[num_x])

df_changed_data.columns = ['new_' + str(col) for col in df_changed_data.columns]

df_mix_data = pd.concat([df_original_data, df_changed_data], axis=1)

### Splitting Data and Removing MarkDown Columns

In [None]:
# The code below splits the data between train (train + validation) and test

markdown_columns = ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']

#Drop MarkDown Columns
train_data = df_blended_data[df_blended_data['Data_Type'] == 'Train'][numerical_features].drop(markdown_columns, axis = 1).to_numpy()
train_columns = df_blended_data[df_blended_data['Data_Type'] == 'Train'][numerical_features].drop(markdown_columns, axis = 1).columns
train_columns = list(train_columns)

train_columns.remove('Weekly_Sales')

nrow, ncol = train_data.shape
# The first column is the target (Weekly_Sales); the remaining columns are the features
y = train_data[:, 0]
X = train_data[:, 1:ncol]

## Features Importances

Calculating feature importances is recommended because it gives us an idea of which variables are important to the model and which are not.
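One way to act on the importances is to keep the smallest set of features that covers most of the total importance. The sketch below uses made-up importance values (in the notebook they would come from `rf.feature_importances_`), and the 90% cutoff is an arbitrary choice for illustration.

```python
# Hypothetical importances; the numbers are invented for this example
importances = {'std_Dept': 0.60, 'Size': 0.20, 'std_Store': 0.10,
               'CPI': 0.06, 'Unemployment': 0.04}

# Sort features from most to least important
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

selected, cumulative = [], 0.0
for name, value in ranked:
    selected.append(name)
    cumulative += value
    # Small tolerance guards against floating point rounding at the cutoff
    if cumulative >= 0.90 - 1e-9:
        break

print(selected)
```

The cumulative curve plotted in the next cell shows the same information graphically.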

In [None]:
#Codes imported from https://machinelearningmastery.com/calculate-feature-importance-with-python/ and https://jakevdp.github.io/PythonDataScienceHandbook/02.08-sorting.html

from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV

p = 0.25 # fraction of elements in the test set
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = p, random_state = 42)

np.random.seed(500)

#The commented code below was used to obtain the best hyperparameters of the random forest
'''
# Number of trees in random forest
n_estimators = [20, 100, 500]
# Number of features to consider at every split
max_features = [1.0, 'sqrt']
# Maximum number of levels in tree
max_depth = [10, 30, 50]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [2, 5]
# Method of selecting samples for training each tree
bootstrap = [False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
'''

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, min_samples_split=5,min_samples_leaf= 2,
                           max_features=0.5,max_depth=50,bootstrap=False)


rf.fit(x_train, y_train)
predictions = rf.predict(x_test)

def selection_sort(x):
    for i in range(len(x)):
        swap = i + np.argmax(x[i:])
        (x[i], x[swap]) = (x[swap], x[i])
    return x

feats = {}
feats_cum = {}
cum_importance = 0
for feature, importance in zip(train_columns, rf.feature_importances_):
    feats[feature] = importance

feats = dict(sorted(feats.items(), key=lambda item: item[1], reverse = True))
for k, v in feats.items():
    cum_importance = cum_importance + v
    feats_cum[k] = cum_importance

plt.bar([x[0] for x in feats.items()], [x[1] for x in feats.items()], label = 'Feature Importance')
plt.plot([x[0] for x in feats_cum.items()], [x[1] for x in feats_cum.items()], label = 'Cum. Feature Importance', color = 'red')
plt.xticks(rotation=90)
plt.legend(loc='lower right')
plt.show()

## Time Series Behavior

In [None]:
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf 
%matplotlib inline

weekly_sales = df_blended_data[df_blended_data['Data_Type'] == 'Train'].groupby('Date')['Weekly_Sales'].agg('sum').reset_index().drop_duplicates().sort_values(by=['Date'])

tserie = df_blended_data[df_blended_data['Data_Type'] == 'Train'].groupby('Date')['Weekly_Sales'].agg('sum').reset_index().drop_duplicates()
tserie.columns = ['ds', 'y'];
tserie['ds'] = pd.to_datetime(tserie['ds'])
tserie = tserie[['ds', 'y']]
tserie = tserie[tserie['ds'] < pd.to_datetime('2020-09-01')]

weekly_sales.set_index('Date', inplace = True)
weekly_sales.plot(label = 'Sales', legend = True)

plot_acf(tserie.set_index('ds'))

The autocorrelation graph above shows that lags 1 and 2 are the ones that really influence the values in a given week. This means that, for a given week, weekly sales from W-1 and W-2 probably have an impact on current week sales.
The time series plot suggests that there are periods with a strong seasonal effect (Thanksgiving and Christmas), and it is important to consider this in the model.
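To make the autocorrelation values less of a black box, the sketch below computes the sample autocorrelation at a given lag by hand. `autocorr` is a hypothetical helper using the usual estimator (lagged products divided by the total sum of squares), and the series is synthetic.

```python
import numpy as np

def autocorr(series, lag):
    """Sample autocorrelation at a positive lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    # Products of values 'lag' steps apart, relative to the total variation
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# Synthetic series with a period of 4, used only to illustrate the calculation
series = [1.0, 2.0, 3.0, 2.0] * 6

lag_1 = autocorr(series, 1)
lag_4 = autocorr(series, 4)
print(lag_1, lag_4)
```

In this synthetic series the value at lag 4 is high because the pattern repeats every four steps; in the sales data the same effect would be expected around lag 52.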

## Time Series Components

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
from matplotlib import pyplot

weekly_sales_2 = df_blended_data[(df_blended_data['Data_Type'] == 'Train')].groupby('Date')['Weekly_Sales'].agg('sum').reset_index().drop_duplicates().sort_values(by=['Date'])[['Weekly_Sales']]

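#'freq' was renamed to 'period' in recent statsmodels versions
result = seasonal_decompose(weekly_sales_2, model='additive', period = 52)
#result.plot() creates its own figure, so it is resized afterwards
fig = result.plot()
fig.set_size_inches(20, 20)
plt.xticks(rotation=90)

plt.show()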

The time series components validate our earlier hypothesis: we do indeed have periods with a strong seasonal effect (Thanksgiving and Christmas). The weekly sales trend is ascending, which suggests that we can expect higher weekly sales in the future.

## Prophet

In [None]:
from prophet import Prophet  # the package was named fbprophet before version 1.0
m = Prophet()
m.add_seasonality(name='yearly', period=365, fourier_order=5)

regressors = ['std_Dept', 'std_Store', 'std_Type', 'CPI', 'Unemployment', 'std_Week', 'std_Holiday_Type', 'Size']
for regressor in regressors:
    m.add_regressor(name = regressor, mode = 'multiplicative', standardize = False)

#Build separate lists so that appending to one does not change the others
interest_columns = regressors + ['Date']
features_group = interest_columns + ['Data_Type']

tserie = df_blended_data.groupby(features_group)['Weekly_Sales'].agg('sum').reset_index().drop_duplicates()
tserie = tserie.rename(columns = {'Date': 'ds', 'Weekly_Sales': 'y'}, inplace = False)
tserie['ds'] = pd.to_datetime(tserie['ds'])
tserie['ds'] = pd.to_datetime(tserie['ds'])

treino = tserie[tserie['Data_Type'] == 'Train']

teste = tserie[tserie['Data_Type'] == 'Test']

m.fit(treino)

periods_test = len(teste.ds.unique())

future = m.make_future_dataframe(periods=periods_test, freq = '7d')

future = future.merge(teste, how='inner', on = ['ds'])
future = future.sort_values(by=['ds'])
fcst = m.predict(future);

past_data = treino.groupby('ds')['y'].agg('sum')
fcst_day = fcst.groupby('ds')['yhat'].agg('sum')
past_data = past_data.reset_index()
fcst_day = fcst_day.reset_index()

plt.plot(past_data.ds, past_data.y, label = "Real")
plt.plot(fcst_day.ds, fcst_day.yhat, label = "Fitted Values")
plt.xticks(rotation=90)
plt.legend()
plt.show()


Prophet is not predicting well, as we can see in the graph above. I have to investigate why this happened, since satisfactory results were expected from this algorithm.

## Random Forest Regressor

In [None]:
predictions = rf.predict(x_test)
print(y_test, ' ', type(y_test), ' ', len(y_test))
print(predictions, ' ', type(predictions), ' ', len(predictions))
print(fcst, ' ', type(fcst), ' ', len(fcst))

## Calculating Error and Selecting Best Model

Since this Kaggle problem seems to use WMAE as the metric to evaluate model performance, this notebook shows how to calculate WMAE, but I am going to use RMSE as the performance measure.
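For reference, the weighted mean absolute error can be written as follows. This is a sketch of the metric as it is usually stated for this competition, assuming a weight of 5 for holiday weeks and 1 otherwise; the symbols $w_i$, $y_i$ and $\hat{y}_i$ are notation introduced here, not taken from the notebook.

$$
\mathrm{WMAE} = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \, \lvert y_i - \hat{y}_i \rvert,
\qquad
w_i = \begin{cases} 5 & \text{if week } i \text{ is a holiday week} \\ 1 & \text{otherwise} \end{cases}
$$

Here $y_i$ is the actual weekly sales value of row $i$ and $\hat{y}_i$ is the prediction, so an error made in a holiday week counts five times as much as the same error in a regular week.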

In [None]:
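#Function that calculates WMAE between predicted values and test values
def WMAE(y, y_pred, is_holiday):
    #Give weight 5 to holiday weeks and weight 1 to all other weeks
    hol_multiply = np.array([1 if x == 0 else 5 for x in is_holiday])
    abs_error = np.absolute(np.subtract(y_pred, y))
    #Weighted mean: sum of weighted absolute errors divided by the sum of weights
    return np.sum(np.multiply(abs_error, hol_multiply)) / np.sum(hol_multiply)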

In [None]:
from sklearn.model_selection import cross_val_score

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, min_samples_split=5,min_samples_leaf= 2,
                           max_features=0.5,max_depth=50,bootstrap=False)

used_features = ['Weekly_Sales', 'std_Dept', 'std_Store', 'std_Type', 'CPI', 'Unemployment', 
                 'Year', 'std_Week', 'std_Holiday_Type', 'Size']

#Null values are filled with zero and any remaining rows with null values are dropped. This is not the right way to do it.
#It is better to fill the null values with statistical measures (mean, median, etc.) of other rows with similar characteristics, or to use external variables

df_blended_data['CPI'] = df_blended_data['CPI'].fillna(0)
df_blended_data['Unemployment'] = df_blended_data['Unemployment'].fillna(0)
train_data = df_blended_data[df_blended_data['Data_Type'] == 'Train'][used_features].sort_values(by=used_features).dropna().to_numpy()
train_columns = df_blended_data[df_blended_data['Data_Type'] == 'Train'][used_features].columns
train_columns = list(train_columns)
train_columns.remove('Weekly_Sales')

nrow, ncol = train_data.shape
# The first column is the target (Weekly_Sales); the remaining columns are the features
y = train_data[:, 0]
X = train_data[:, 1:ncol]

p = 0.25 # fraction of elements to be used for testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = p, random_state = 42)
    
def evaluate_models(dict_model):
    best_rmse = float('inf')
    best_model = 'Prophet'
    for k, v in dict_model.items():
        if k == 'Prophet':
            #Prophet is not evaluated with cross validation (see Notes)
            pass
        else:
            v.fit(x_train, y_train)
            #The scorer returns negated RMSE values, so the sign is flipped back
            rmse_scores = -cross_val_score(v, x_train, y_train, scoring="neg_root_mean_squared_error", cv=5)
            mean_rmse = np.mean(rmse_scores)
            if mean_rmse < best_rmse:
                best_rmse = mean_rmse
                best_model = {k: v}
    
    return ('Best Model: ', best_model, '\nRMSE: ', best_rmse)

dict_model = {'RF': rf, 'Prophet': m}
print(evaluate_models(dict_model))

used_features = ['std_Dept', 'std_Store', 'std_Type', 'CPI', 'Unemployment', 
                 'Year', 'std_Week', 'std_Holiday_Type', 'Size']

df_test_data = df_blended_data[df_blended_data['Data_Type'] == 'Test'][used_features].sort_values(by=used_features).dropna()
test_data = df_test_data.to_numpy()
test_columns = df_blended_data[df_blended_data['Data_Type'] == 'Test'][used_features].columns
test_columns = list(test_columns)

nrow, ncol = test_data.shape
test = test_data[:,0:ncol]
x_test = test[:,0:ncol]
y_pred = rf.predict(x_test)

df_test_data['y'] = y_pred.tolist()

#Columns to be used on merge
interest_columns = ['std_Store', 'std_Dept', 'Date', 'new_std_Store', 'new_std_Dept',
                    'std_Week', 'new_std_Week',
                    'Store', 'Dept', 'Year', 'Week']

df_mix = df_mix_data[df_mix_data['Data_Type'] == 'Test'].groupby(interest_columns)['Weekly_Sales'].agg('count').reset_index()
df_test_data = df_test_data.rename(columns = {'std_Store': 'new_std_Store', 
                                              'std_Dept': 'new_std_Dept',
                                              'std_Week': 'new_std_Week'
                                             }, 
                                   inplace = False)

df_test_data['Year'] = df_test_data['Year'].astype(str).astype(int)
df_mix['Year'] = df_mix['Year'].astype(str).astype(int)

df_test_data.new_std_Store = df_test_data.new_std_Store.round(2)
df_test_data.new_std_Dept = df_test_data.new_std_Dept.round(2)
df_test_data.new_std_Week = df_test_data.new_std_Week.round(2)

df_mix.new_std_Store = df_mix.new_std_Store.round(2)
df_mix.new_std_Dept = df_mix.new_std_Dept.round(2)
df_mix.new_std_Week = df_mix.new_std_Week.round(2)

df_test_final = df_mix.merge(df_test_data, how='left', 
                             left_on=["new_std_Store", "new_std_Dept", "Year", "new_std_Week"], 
                             right_on=["new_std_Store", "new_std_Dept", "Year", "new_std_Week"])

#Replace NaN values by zero
df_test_final['y'] = df_test_final['y'].fillna(0)
df_test_final = df_test_final[['Store', 'Dept', 'Date', 'y']]

df_test_final['Date'] = pd.to_datetime(df_test_final['Date'])
df_test_final['Store'] = df_test_final['Store'].astype(str).astype(int)
df_test_final['Dept'] = df_test_final['Dept'].astype(str).astype(int)

df_test['Date'] = pd.to_datetime(df_test['Date'])

df_test_final = df_test.merge(df_test_final, how='left', 
                             left_on=["Store", "Dept", "Date"], 
                             right_on=["Store", "Dept", "Date"])


df_test_final = df_test_final.rename(columns = {'y': 'Weekly_Sales'}, inplace = False)
df_test_final['Id'] = df_test_final['Store'].astype(str) + '_' + df_test_final['Dept'].astype(str) + '_' + df_test_final['Date'].astype(str)
df_test_final = df_test_final[['Id', 'Weekly_Sales']]
df_test_final = df_test_final.sort_values(by=['Id'])

df_test_final.to_csv('submission.csv', index=False)

## Recommended Improvements
- Hyperparameter tuning (there are mechanisms that help with this: GridSearchCV, RandomizedSearchCV, etc.)
- Test new algorithms (XGBoost, Neural Networks, SARIMAX, etc.)
- Model holidays separately: holidays, especially Christmas and Thanksgiving, seem to have a big importance that is not recognized by the models implemented in this notebook
- Delete rows in the train set where Weekly_Sales < 0: this was not done because the objective is to present good results in the output of the Kaggle exercise. In real life, we know that we cannot have a number of sales lower than zero
- Delete outliers
- Investigate why Prophet has returned future predictions like this
- Fill NaN values of features with high importance (CPI, etc.) with statistical measures (mean, median, etc.) of other rows with similar characteristics. It is not recommended to drop these rows, because zero sales are predicted for rows in this situation
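The last recommendation can be sketched with a group-wise fill. The example below uses made-up CPI values and assumes that rows from the same store count as "similar"; `CPI_filled` is a column name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: two stores, each with one missing CPI value
toy = pd.DataFrame({
    'Store': [1, 1, 1, 2, 2, 2],
    'CPI': [210.0, np.nan, 212.0, 130.0, 132.0, np.nan],
})

# Median CPI of each store, broadcast back to the original rows
store_median = toy.groupby('Store')['CPI'].transform('median')

# Only the missing entries are replaced; observed values are kept
toy['CPI_filled'] = toy['CPI'].fillna(store_median)
print(toy)
```

Other grouping keys (for example store and year) or a forward fill along the date could be more appropriate for slowly changing series such as CPI; which one works best would need to be checked against validation error.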