# Claudio Santos

This notebook is part of the hiring process at Ze Delivery.

If you need help, please ask in the comments or contact me.

[Github](http://github.com/cfsantos) | [Linkedin](https://www.linkedin.com/in/cfsantos85/)

In this challenge, Walmart wants to predict store sales from features such as size, department and so on.

In this notebook, I will:

### Analyse the data

Check what features are relevant or not for a good prediction.

### Evaluate models

Using different machine learning models, I will choose the best one according to the evaluation I demonstrate.

### Conclusions

A brief conclusion of this work.

In [None]:
import os
import numpy as np 
import pandas as pd 

# For data visualization
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
%matplotlib inline

# For data modeling & prediction
from statistics import mean

from scipy import stats
import xgboost as xgb
from sklearn import preprocessing
from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_validate, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

import warnings
warnings.filterwarnings("ignore") 

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Let's load the data and take a quick look at it.

In [None]:
df_train = pd.read_csv("/kaggle/input/walmart-recruiting-store-sales-forecasting/train.csv.zip", compression='zip')
df_features = pd.read_csv("/kaggle/input/walmart-recruiting-store-sales-forecasting/features.csv.zip", compression='zip')
df_test = pd.read_csv("/kaggle/input/walmart-recruiting-store-sales-forecasting/test.csv.zip", compression='zip')
df_stores = pd.read_csv("/kaggle/input/walmart-recruiting-store-sales-forecasting/stores.csv")

In [None]:
df_train.tail(3)

In [None]:
df_test.tail(3)

In [None]:
df_stores.tail(3)

In [None]:
df_features.head(3)

It appears that the "Train" and "Test" dataframes connect to "Features" through the Store, Date and IsHoliday columns, and that "Stores" connects to "Train" and "Test" through the Store column. I will merge everything to make the data easier to work with.

In [None]:
df_train = pd.merge(df_train,df_features, on = ['Store','Date','IsHoliday'],how='inner')
df_train = pd.merge(df_train,df_stores, on= 'Store',how='inner')
df_test = pd.merge(df_test,df_features, on = ['Store','Date','IsHoliday'],how='inner')
df_test = pd.merge(df_test,df_stores, on= 'Store',how='inner')

As we can see, there are some missing (NaN) values, at least in the "Features" dataframe. Let's look at all the dataframes to search for more missing information.

In [None]:
"train", df_train.isna().mean(),"---------------------", 

Only the columns that come from "Features" have missing values. As described in the challenge, MarkDown1-5 are only available from a given date onward. Over 50% of this information is missing, and around 7% of CPI and Unemployment is missing too.

First, I will take a look at the importance of each feature. If necessary, I will deal with the missing information in a future step.

# Pre-processing

One thing bothers me: this work is evaluated by week, but the Date column holds specific days, apparently one every 7 days. I understand the analysis should be done using weeks instead of days, so I will derive year and week columns from the dates.

In [None]:
def convert_dates(dataframe):
    dataframe['Date'] = pd.to_datetime(dataframe['Date'])
    dataframe['year'] = dataframe['Date'].dt.year
    dataframe['week'] = dataframe['Date'].dt.isocalendar().week.astype('int64')  # Series.dt.week was removed in pandas 2.0
    
    return dataframe

df_test = convert_dates(df_test)
df_train = convert_dates(df_train)
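
The week number used here follows the ISO 8601 calendar, where weeks start on Monday and the first days of January can belong to the last week of the previous year. A minimal sketch with the standard library (the helper name `iso_year_week` is mine, not part of the notebook) shows the behaviour on the first date of the training data and on a year boundary:

```python
from datetime import date


def iso_year_week(d):
    """Return (ISO year, ISO week number) for a date."""
    iso = d.isocalendar()
    return iso[0], iso[1]


# First date in the training data: a Friday in ISO week 5 of 2010.
first_week = iso_year_week(date(2010, 2, 5))

# Year boundary: 1 January 2010 still belongs to ISO week 53 of 2009,
# so the calendar year and the ISO year can disagree.
boundary = iso_year_week(date(2010, 1, 1))

print(first_week, boundary)
```

This matters when `year` and `week` are used together as features, because `dt.year` returns the calendar year rather than the ISO year.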

Most of the columns are numerical; however, I would rather treat some of them (Store, Dept, IsHoliday, Type, year and week) as categories.

In [None]:
to_categorical = ['Store', 'Dept', 'IsHoliday', 'Type', 'year', 'week']
for column in to_categorical:
    df_train[column] = df_train[column].astype('category')
    
df_train.dtypes

Looks better. Let's see the correlation. 

In [None]:
df_train[['Dept', 'Store', 'week', 'year', 'IsHoliday']] = df_train[['Dept', 'Store', 'week', 'year', 'IsHoliday']].astype('int')

plt.figure(figsize=(18,12))
corr = df_train.corr(numeric_only=True)  # skip Date and Type, which are not numeric (needs pandas >= 1.5)
np.fill_diagonal(corr.values, np.nan)

sns.heatmap(corr, annot=True, fmt='.2f')

The MarkDown columns are not strongly correlated with Weekly_Sales. I am highlighting these columns because they have a lot of missing data, so it may be a good idea to drop them.
Let's check each variable, starting with the discrete ones. I will use boxplots to compare the quartiles.

In [None]:
def boxplot(column, x_size=15, y_size=10):
    fig = plt.figure(figsize=(x_size,y_size))
    sns.boxplot(y=df_train.Weekly_Sales, x=df_train[column])
    plt.ylabel('Weekly_Sales')
    plt.xlabel(column)


In [None]:
boxplot('Store')

There are some points to highlight here:

- All stores concentrate most of their sales in the third quartile, with several outliers.
- Store 28 apparently has some values below 0. It is less clear for the other stores, but negative sales seem to be possible. Ideally I would ask the store manager whether these negative values are correct before continuing. As that is not possible here, I will assume they are.
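
The points drawn individually in a boxplot are, with seaborn's default settings, the values further than 1.5 times the interquartile range (IQR) from the box. As a rough sketch of that rule in plain Python, using made-up numbers and a hypothetical helper `iqr_outliers`:

```python
from statistics import quantiles


def iqr_outliers(values, whisker=1.5):
    """Return the values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


# Made-up weekly sales for illustration only (not taken from the dataset).
toy_sales = [-50, 900, 1000, 1100, 1200, 1300, 1400, 1500, 9000]
outliers = iqr_outliers(toy_sales)
print(outliers)
```

Both the negative value and the very large one are flagged, which mirrors the two kinds of unusual points seen in the plots.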

In [None]:
boxplot('Dept', x_size=25)

Points to highlight:

- Again, some Depts apparently have negative values (more visible in Depts 6, 45 and 47).
- Depts 7 and 72 have lots of outliers too.

In [None]:
boxplot('week')

As described in the challenge, holiday sales are different. It is clear that weeks close to holidays (5, 47 and 51/52) sell more than other weeks.

In [None]:
boxplot('year')

There is not much difference in sales here. However, this field is important because, as described in the competition, the training data runs from 2010-02-05 to 2012-11-01. The lack of outliers in 2012 may therefore be due to the missing Thanksgiving and Christmas weeks, which produced many outliers in the previous analysis.

In [None]:
boxplot('Type')

Type is not clearly described in the challenge description. Here it is possible to see that Type C sells less than Types A and B. Type B has more outliers, and Type A has the highest mean because of the size of its second and third quartiles. It looks like an important feature.

In [None]:
boxplot('IsHoliday')

Holidays definitely influence Weekly_Sales, given the number of outliers and the higher mean.
Now let's check the continuous columns.

In [None]:
def correlation(column):
    print("----------------------------Column name: "+column+"----------------------------")
    print("Correlation: " + str(df_train['Weekly_Sales'].corr(df_train[column])))
    print("\n")

In [None]:
correlation("CPI")
correlation("Unemployment")
correlation("Temperature")
correlation("Size")
correlation("Fuel_Price")
correlation("MarkDown1")
correlation("MarkDown2")
correlation("MarkDown3")
correlation("MarkDown4")
correlation("MarkDown5")

For each column:

CPI: low correlation, looks not important

Unemployment: low correlation, apparently no importance

Temperature: low correlation, apparently no importance

Size: there is some correlation, maybe with some importance

Fuel_Price: low correlation, looks not important

MarkDown1-5: low correlation in all cases, apparently no importance
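
A caveat worth keeping in mind: `Series.corr` computes the Pearson coefficient by default, which only measures linear association. The sketch below, in plain Python with toy numbers and a hypothetical `pearson` helper, shows a feature that fully determines the target and still has zero correlation with it, so a low value alone does not prove a feature is useless for tree-based models.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


xs = [-2, -1, 0, 1, 2]
linear = [3 * x + 1 for x in xs]   # perfectly linear in x
quadratic = [x * x for x in xs]    # fully determined by x, but not linear

r_linear = pearson(xs, linear)
r_quadratic = pearson(xs, quadratic)
print(r_linear, r_quadratic)
```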

Something I decided to check: according to the description, each store has a single size. Is that correct? The following code prints the number of distinct sizes per store.


In [None]:
df_train.groupby('Store')['Size'].nunique()

So it is correct: one size for each store. It may make sense to use only one of these two variables.

From this analysis, I will make two different predictions: one using all features and another using only the ones I considered relevant for this case (Store, Dept, IsHoliday, Size, year, week and Type).

In [None]:
df_train = pd.get_dummies(df_train, columns=['Type'])
df_train.columns

In [None]:
df_test = pd.get_dummies(df_test, columns=['Type'])
df_test.columns
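
Encoding train and test separately works here as long as both contain every store type. As a hedged sketch on toy frames (not the competition data), reindexing the test frame to the training columns guards against a category that is missing from one of them:

```python
import pandas as pd

toy_train = pd.DataFrame({"Size": [100, 200, 300], "Type": ["A", "B", "C"]})
toy_test = pd.DataFrame({"Size": [150, 250], "Type": ["A", "B"]})  # no Type C

train_enc = pd.get_dummies(toy_train, columns=["Type"])
test_enc = pd.get_dummies(toy_test, columns=["Type"])

# Align the test columns with the training columns; absent dummies become 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
```

With the real data the reindex would also need to exclude the target column, which exists only in the training frame.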

In [None]:
df_train.columns

In [None]:
df_train.dtypes

## Machine Learning

I will start with a simple baseline: the average weekly sales for each combination of week, Dept and Store. My first score comes from predicting this historical mean.

In [None]:
mean_sales = df_train.groupby(["Store", "Dept", "week"], as_index=False).agg({"Weekly_Sales": "mean"})
df_val = df_test.merge(mean_sales, on=['Store', 'Dept', 'week'], how='left')
sample_submission = pd.read_csv("/kaggle/input/walmart-recruiting-store-sales-forecasting/sampleSubmission.csv.zip")

test_ids = df_test.Store.astype(str) + '_' + df_test.Dept.astype(str) + '_' + df_test.Date.astype(str)
sample_submission['Id'] = test_ids.values
sample_submission["Weekly_Sales"] = df_val["Weekly_Sales"]

# Apparently there are some missing values. I will fill the NaN values with 0 (I know I lose some score :( ).
sample_submission = sample_submission.fillna(0)
sample_submission.to_csv('submission_simple_mean.csv',index=False)
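
Filling unseen (Store, Dept, week) combinations with 0 is a blunt choice. One possible refinement, sketched below with plain dictionaries and made-up numbers, is to fall back to the mean of the same store and department over all weeks before giving up and returning 0. The function names are hypothetical, and this is not what the submission above did.

```python
from collections import defaultdict
from statistics import mean


def fit_group_means(rows):
    """rows: iterable of (store, dept, week, weekly_sales) tuples."""
    by_week = defaultdict(list)
    by_dept = defaultdict(list)
    for store, dept, week, sales in rows:
        by_week[(store, dept, week)].append(sales)
        by_dept[(store, dept)].append(sales)
    week_means = {key: mean(values) for key, values in by_week.items()}
    dept_means = {key: mean(values) for key, values in by_dept.items()}
    return week_means, dept_means


def predict(store, dept, week, week_means, dept_means, default=0.0):
    """Use the most specific mean available, then fall back."""
    key = (store, dept, week)
    if key in week_means:
        return week_means[key]
    return dept_means.get((store, dept), default)


# Toy history: store 1, dept 1, observed in weeks 5 and 6 only.
toy_rows = [(1, 1, 5, 100.0), (1, 1, 5, 300.0), (1, 1, 6, 500.0)]
week_means, dept_means = fit_group_means(toy_rows)

seen = predict(1, 1, 5, week_means, dept_means)       # mean of week 5
fallback = predict(1, 1, 47, week_means, dept_means)  # store/dept mean
unknown = predict(2, 9, 5, week_means, dept_means)    # default value
print(seen, fallback, unknown)
```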

This submission scored 3337.81 on the public leaderboard and 3460.96 on the private one. Considering I just calculated the mean value on the training set using 3 of the features, that is a strong result. If this challenge were still open, I would be around the 300th position, which I consider very good.

### More training after first prediction

Let's see some machine learning in action. I will perform 3 different tests here: one using the same features used in the mean submission, another using the features I considered most important according to my analysis, and a final one using all the features.

In this comparison I will use the Mean Absolute Error (MAE) because it is very close to the metric used in this challenge. scikit-learn reports it negated (`neg_mean_absolute_error`), so values closer to zero are better. I will also print the time used for training each model (in seconds), which could be important if time is short.
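
For reference, the challenge itself is scored with a weighted MAE (WMAE) in which holiday weeks count five times as much as regular weeks. A minimal sketch of that metric in plain Python, with toy numbers and a function name of my own choosing:

```python
def wmae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Weighted MAE: holiday weeks get `holiday_weight`, other weeks get 1."""
    weights = [holiday_weight if flag else 1.0 for flag in is_holiday]
    total = sum(w * abs(t - p) for w, t, p in zip(weights, y_true, y_pred))
    return total / sum(weights)


# Toy values for illustration only.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 200.0, 250.0]
is_holiday = [False, False, True]

score = wmae(y_true, y_pred, is_holiday)
plain_mae = wmae(y_true, y_pred, is_holiday, holiday_weight=1.0)
print(score, plain_mae)
```

The error on the holiday week dominates the weighted score, so a model that is good on average but poor on holidays can look better under plain MAE than it does on the leaderboard.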

First, using just the Store, Dept and week information:

In [None]:
kfold = KFold(n_splits=5, shuffle=True, random_state=35)  # random_state only takes effect with shuffle=True

x = df_train.loc[:, df_train.columns != 'Weekly_Sales']
x = x.drop(['Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4','MarkDown5', 'CPI', 
            'Unemployment', 'Size', 'IsHoliday', 'Type_A', 'Type_B', 'Type_C', 'year', 'Date'], axis=1)

y = df_train.loc[:, df_train.columns == 'Weekly_Sales']

models = {
    
    'xgboost' : xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1, max_depth = 5, alpha = 10, n_estimators = 10),
    'Bayesian' : BayesianRidge(),
    'LinearRegression': LinearRegression(),
    'DecisionTreeRegressor': DecisionTreeRegressor(random_state=1),
    'AdaBoostRegressor' : AdaBoostRegressor(n_estimators=50, learning_rate=.1, loss='square'),
    'ExtraTreesRegressor': ExtraTreesRegressor(n_estimators=50, max_features=1.0, random_state=35),  # 1.0 = all features, what 'auto' meant for regressors
    'RandomForestRegressor': RandomForestRegressor(n_estimators=50, random_state=35),
}

for model_name, model in models.items():
    results = cross_validate(model, x,y , cv=kfold, scoring=['neg_mean_absolute_error'], return_estimator=False)
    print(model_name, mean(results['test_neg_mean_absolute_error']), mean(results['fit_time']), mean(results['score_time']))

Now using the information I considered more important.

In [None]:
x = df_train.loc[:, df_train.columns != 'Weekly_Sales']
x = x.drop(['Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4','MarkDown5', 'CPI', 
            'Unemployment', 'Date'], axis=1)

y = df_train.loc[:, df_train.columns == 'Weekly_Sales']

models = {
    
    'xgboost' : xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1, max_depth = 5, alpha = 10, n_estimators = 10),
    'Bayesian' : BayesianRidge(),
    'LinearRegression': LinearRegression(),
    'DecisionTreeRegressor': DecisionTreeRegressor(random_state=1),
    'AdaBoostRegressor' : AdaBoostRegressor(n_estimators=50, learning_rate=.1, loss='square'),
    'ExtraTreesRegressor': ExtraTreesRegressor(n_estimators=50, max_features=1.0, random_state=35),  # 1.0 = all features, what 'auto' meant for regressors
    'RandomForestRegressor': RandomForestRegressor(n_estimators=50, random_state=35),
}

res = {}

for model_name, model in models.items():
    results = cross_validate(model, x,y , cv=kfold, scoring=['neg_mean_absolute_error'], return_estimator=False)
    print(model_name, mean(results['test_neg_mean_absolute_error']), mean(results['fit_time']), mean(results['score_time']))

Last case: using all features

In [None]:
x = df_train.loc[:, df_train.columns != 'Weekly_Sales']
x = x.drop(['Date'], axis=1)
x = x.fillna(0)
y = df_train.loc[:, df_train.columns == 'Weekly_Sales']

models = {
    'xgboost' : xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1, max_depth = 5, alpha = 10, n_estimators = 10),
    'Bayesian' : BayesianRidge(),
    'LinearRegression': LinearRegression(),
    'DecisionTreeRegressor': DecisionTreeRegressor(random_state=1),
    'AdaBoostRegressor' : AdaBoostRegressor(n_estimators=50, learning_rate=.1, loss='square'),
    'ExtraTreesRegressor': ExtraTreesRegressor(n_estimators=50, max_features=1.0, random_state=35),  # 1.0 = all features, what 'auto' meant for regressors
    'RandomForestRegressor': RandomForestRegressor(n_estimators=50, random_state=35),
}

res = {}

for model_name, model in models.items():
    results = cross_validate(model, x,y , cv=kfold, scoring=['neg_mean_absolute_error'], return_estimator=False)
    print(model_name, mean(results['test_neg_mean_absolute_error']), mean(results['fit_time']), mean(results['score_time']))

So the best MAE in the K-fold split came from the Extra Trees Regressor with the features I considered most important in the analysis.
I will train one and make a prediction.

In [None]:
x = df_train.loc[:, df_train.columns != 'Weekly_Sales']
x = x.drop(['Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4','MarkDown5', 'CPI', 
            'Unemployment', 'Date'], axis=1)

y = df_train.loc[:, df_train.columns == 'Weekly_Sales']

extratreeregressor = ExtraTreesRegressor(n_estimators=50, max_features=1.0, random_state=35)  # 1.0 = all features (formerly 'auto')
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=35)
extratreeregressor.fit(X_train, y_train)
y_pred = extratreeregressor.predict(X_test)
mean_absolute_error(y_test, y_pred)

In [None]:
x_val = df_test.drop(['Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4','MarkDown5', 'CPI', 
            'Unemployment', 'Date'], axis=1)

y_val = extratreeregressor.predict(x_val)

In [None]:
test_ids = df_test.Store.astype(str) + '_' + df_test.Dept.astype(str) + '_' + df_test.Date.astype(str)
sample_submission['Id'] = test_ids.values
sample_submission["Weekly_Sales"] = y_val

sample_submission = sample_submission.fillna(0)
sample_submission.to_csv('submission_extratreeregressor.csv',index=False)

I got 3153.03549 on the private score and 3046.07420 on the public one, which improves my position to around 210th.
What about using PCA to generate some features and then training the Extra Trees Regressor on them?

In [None]:
from sklearn.decomposition import PCA

x = df_train.loc[:, df_train.columns != 'Weekly_Sales']
x = x.drop(['Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4','MarkDown5', 'CPI', 
            'Unemployment', 'Date'], axis=1)
x = x.fillna(0)
y = df_train.loc[:, df_train.columns == 'Weekly_Sales']

pca = PCA(n_components=5)
pca.fit(x)
pca_features = pca.transform(x)

columns = ['pca_%i' % i for i in range(5)]
x = pd.DataFrame(pca_features, columns=columns, index=x.index)
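
PCA is driven by variance, so a column measured in large units (such as `Size`) can dominate the components when the inputs are not standardized. The sketch below uses NumPy only, with synthetic data and a hypothetical helper, to compare the share of variance captured by the first component before and after standardizing. Whether scaling would help the final score here is untested.

```python
import numpy as np


def first_component_share(X):
    """Fraction of total variance explained by the first principal component."""
    X_centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return variances[0] / variances.sum()


# Synthetic stand-ins for a large-scale and a small-scale column.
rng = np.random.default_rng(35)
size = rng.normal(150_000, 60_000, 500)         # roughly like Size
week = rng.integers(1, 53, 500).astype(float)   # roughly like week
X = np.column_stack([size, week])

raw_share = first_component_share(X)

# Standardize each column to zero mean and unit variance, then repeat.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_share = first_component_share(X_std)

print(raw_share, std_share)
```

On the raw data the first component is essentially the large-scale column alone, while after standardizing both columns contribute.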

In [None]:
extratreeregressor_pca = ExtraTreesRegressor(n_estimators=50, max_features=1.0, random_state=35)  # 1.0 = all features (formerly 'auto')
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=35)
extratreeregressor_pca.fit(X_train, y_train)
y_pred = extratreeregressor_pca.predict(X_test)
mean_absolute_error(y_test, y_pred)

In [None]:
x_val = df_test.drop(['Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4','MarkDown5', 'CPI', 
            'Unemployment', 'Date'], axis=1)
pca_features_val = pca.transform(x_val)
columns = ['pca_%i' % i for i in range(5)]

x_val = pd.DataFrame(pca_features_val, columns=columns, index=x_val.index)

y_val = extratreeregressor_pca.predict(x_val)

In [None]:
test_ids = df_test.Store.astype(str) + '_' + df_test.Dept.astype(str) + '_' + df_test.Date.astype(str)
sample_submission['Id'] = test_ids.values
sample_submission["Weekly_Sales"] = y_val

sample_submission = sample_submission.fillna(0)
sample_submission.to_csv('submission_extratreeregressor_pca.csv',index=False)

Another improvement: I got 3030.99599 on the private score and 2883.80360 on the public score, placing me under the 180th position.
What if we combined the original features with the 5 PCA-generated features? Could that work?

In [None]:
x = df_train.loc[:, df_train.columns != 'Weekly_Sales']
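x = pd.concat([x, pd.DataFrame(pca_features, columns=['pca_%i' % i for i in range(5)], index=x.index)], axis=1)  # string column names; recent scikit-learn rejects mixed str/int names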

x = x.drop(['Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4','MarkDown5', 'CPI', 
            'Unemployment', 'Date'], axis=1)
x = x.fillna(0)
y = df_train.loc[:, df_train.columns == 'Weekly_Sales']

In [None]:
extratreeregressor_pca_allfeatures = ExtraTreesRegressor(n_estimators=50, max_features=1.0, random_state=35)  # 1.0 = all features (formerly 'auto')
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=35)
extratreeregressor_pca_allfeatures.fit(X_train, y_train)
y_pred = extratreeregressor_pca_allfeatures.predict(X_test)
mean_absolute_error(y_test, y_pred)

In [None]:
x_val = df_test.drop(['Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4','MarkDown5', 'CPI', 
            'Unemployment', 'Date'], axis=1)
pca_features_val = pca.transform(x_val)
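x_val = pd.concat([x_val, pd.DataFrame(pca_features_val, columns=['pca_%i' % i for i in range(5)], index=x_val.index)], axis=1)  # string column names for the PCA features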

y_val = extratreeregressor_pca_allfeatures.predict(x_val)

In [None]:
test_ids = df_test.Store.astype(str) + '_' + df_test.Dept.astype(str) + '_' + df_test.Date.astype(str)
sample_submission['Id'] = test_ids.values
sample_submission["Weekly_Sales"] = y_val

sample_submission = sample_submission.fillna(0)
sample_submission.to_csv('submission_extratreeregressor_pca_allfeatures.csv',index=False)

This improved the private score to 3008.95940 but slightly worsened the public one (2884.29135). I still consider it an improvement.

## Conclusion

In this exercise I tried to generate the best possible prediction from the data provided by Walmart.
I showed how I selected the features I considered most important, based on correlation and other analyses.

In this specific case, simple math was enough to create a really good baseline (way better than the baseline provided by the challenge). It shows that, in some cases, it is possible to achieve a reasonably good result without any machine learning technique, just by understanding how each feature is relevant to the problem.

However, machine learning still paid off here. By comparing the results of 7 different supervised machine learning methods, it was possible to improve the results significantly. Combined with PCA, the model improved a little further.

### Further analysis

I am not sure this challenge really reflects the real world, at least as described. It would be nice to check other holidays or important selling dates like Easter, Mother's Day and Valentine's Day. There might be more hidden information there that could help improve our results.
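
As a rough sketch of how such dates could become features, the snippet below computes Mother's Day (second Sunday of May, the US convention) and checks whether an event falls in a given week. It assumes each Date marks the start of a 7-day week, which the challenge description does not confirm, and the helper names are hypothetical.

```python
from datetime import date, timedelta


def mothers_day(year):
    """Second Sunday of May for the given year."""
    first_of_may = date(year, 5, 1)
    days_to_sunday = (6 - first_of_may.weekday()) % 7  # Monday=0 ... Sunday=6
    return first_of_may + timedelta(days=days_to_sunday + 7)


def in_week(week_start, event):
    """True if `event` falls in the 7 days starting on `week_start`."""
    return week_start <= event < week_start + timedelta(days=7)


valentines_2011 = date(2011, 2, 14)
flag_week_of_feb_11 = in_week(date(2011, 2, 11), valentines_2011)
flag_week_of_feb_18 = in_week(date(2011, 2, 18), valentines_2011)

print(mothers_day(2011), mothers_day(2012))
print(flag_week_of_feb_11, flag_week_of_feb_18)
```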

Another idea is to isolate each store to create better models. Perhaps the data from a single store is not enough to train a good predictor, but grouping stores in the same region (one predictor per city, for example) could be a good idea.

### Improving results?

I used K-fold only to decide which model to use for my final results. It is possible I would achieve better results with an ensemble of Extra Trees Regressors.
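
One simple way to build such an ensemble, assuming several regressors trained with different seeds or on different folds, is to average their predictions element by element. The sketch below uses made-up prediction lists and a hypothetical helper, so it only illustrates the blending step, not the training:

```python
from statistics import mean


def average_predictions(prediction_sets):
    """Average several equally long lists of predictions element by element."""
    return [mean(values) for values in zip(*prediction_sets)]


# Made-up predictions from three hypothetical models for the same three rows.
preds_seed_1 = [100.0, 220.0, 310.0]
preds_seed_2 = [120.0, 200.0, 290.0]
preds_seed_3 = [110.0, 210.0, 300.0]

blended = average_predictions([preds_seed_1, preds_seed_2, preds_seed_3])
print(blended)
```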

Another point is to search for the best parameters for the chosen model. There may be other settings that could improve these results, both for the Extra Trees Regressor and for PCA.

One last thing to try could be time series models, such as recurrent neural networks (RNN). As the description makes clear, an analysis over time could produce a really good model.
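
A first step in that direction, whatever model is chosen, would be to validate only on weeks that come after the training weeks. The sketch below is one possible expanding-window splitter in plain Python. The function name and parameters are my own, and scikit-learn's `TimeSeriesSplit` offers a similar idea for row-ordered data.

```python
def expanding_window_splits(periods, n_splits, min_train=1):
    """Yield (train_periods, validation_periods) pairs in chronological order."""
    ordered = sorted(set(periods))
    fold_size = (len(ordered) - min_train) // n_splits
    if fold_size < 1:
        raise ValueError("not enough periods for the requested splits")
    for k in range(n_splits):
        cut = min_train + k * fold_size
        # Train on everything before the cut, validate on the next block.
        yield ordered[:cut], ordered[cut:cut + fold_size]


# Ten consecutive toy periods (for example, week indices).
toy_periods = list(range(1, 11))
splits = list(expanding_window_splits(toy_periods, n_splits=3, min_train=4))
for train_periods, val_periods in splits:
    print(train_periods, val_periods)
```

Each validation block lies strictly after its training block, so the model is never scored on weeks that precede the data it learned from.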



Best Regards,
Claudio Santos