# **Introduction**

This is a Kaggle notebook with a quantitative and qualitative analysis of Walmart’s weekly sales.

I focused on describing the decisions I made during a machine learning project. For this reason, I recorded plenty of qualitative analysis alongside the code, tables and graphs.

I start this project with a ‘Prelude’ (loading packages and checking the datasets). Then I do an extensive data analysis covering the main issues. After that, I prepare the dataset for the machine learning model; in this project, I applied XGBoost. Although the error results could be improved further, my approach gives more weight to explaining the variables than to forecasting precision. Finally, I briefly review the major insights and list a few next steps.

Let me know if you have any questions or suggestions. Feel free to drop me a message; I will try to answer as soon as possible.

I hope you have a productive time reading this notebook.

# Prelude

In [None]:
#loading packages

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

#graphs - boxplot
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'
import seaborn as sns
%matplotlib inline

#visualization
import plotly
import plotly.graph_objs as go
from ipywidgets import interact  # IPython.html.widgets was moved to the ipywidgets package

#Decomposition
from statsmodels.tsa.seasonal import seasonal_decompose
from matplotlib import pyplot
from pylab import rcParams

#ACF and PACF
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

#Split datasets
from sklearn.model_selection import train_test_split

#Machine learning and error analysis
import xgboost as xgb
from sklearn.metrics import mean_squared_error


#Parameter tuning
from sklearn.model_selection import GridSearchCV

#Display an image
from IPython.display import Image

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#loading the datasets
dt_st = pd.read_csv('/kaggle/input/walmart-recruiting-store-sales-forecasting/stores.csv', sep=',')
dt_feat = pd.read_csv('/kaggle/input/walmart-recruiting-store-sales-forecasting/features.csv.zip', sep=',')
dt_train = pd.read_csv('/kaggle/input/walmart-recruiting-store-sales-forecasting/train.csv.zip', sep=',')
dt_test = pd.read_csv('/kaggle/input/walmart-recruiting-store-sales-forecasting/test.csv.zip', sep=',')

Basic steps to figure out what we have in our hands:

In [None]:
#Dataframe basic information

dt_st.info()
dt_feat.info()
dt_train.info()
dt_test.info()

# Data Analysis

There are four datasets with different variable types: boolean, integer, float and object. I will start my analysis by understanding what information the two ‘support’ datasets, ‘Stores’ and ‘Features’, contain.

Let’s explore the Stores dataset and count how many stores there are of each type:

In [None]:
dt_st.head(3)

In [None]:
dt_st.tail(3)

In [None]:
#Adjusting the format of the float
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [None]:
dt_st.describe()

In [None]:
dt_st.groupby('Type')['Size'].describe()

In [None]:
#to figure out how many not available (NA) values we have in each column
for key,value in dt_st.items():  # iteritems() was removed in pandas 2.0; items() works in old and new versions
    print(key,value.isnull().sum())

In [None]:
bplot = sns.boxplot(y='Size', x='Type', data= dt_st, width=0.5, palette="bright")
bplot.axes.set_title("Boxplot of Stores: Type and distribution",fontsize=16)

#add swarmplot
bplot=sns.swarmplot(y='Size', x='Type',data=dt_st, color='black', alpha=0.75)

#setting the axis
bplot.set_xlabel("Type",fontsize=12)
bplot.set_ylabel("Size",fontsize=12)
bplot.tick_params(labelsize=10)

I like box plots. They are simple and useful! In one graph and a few lines of code, we can visualise a lot of information. For instance, from the box plot above, we can see:

- Type A contains the largest stores. The average size is around 180,000. Also, the spread of sizes seems to be higher than for B and C.
- Type B is a kind of ‘medium’ store with an average size of almost 100,000. However, a few stores have outlying sizes.
- Type C is the little brother, with low variance and sizes below 50,000.

When we group the Stores dataset by Type, we can also extract the frequencies. Types A and B are the most ‘popular’.


**Internal comment:** I like to write down the key information that can help us reach and understand our target (predicting the department-wide sales for each store). At this point, without considering the other data, I would expect more sales in stores of Types A and B, simply because there are more of them, and perhaps also because of their size. For now, this is only a very initial hypothesis.

Let's have a look at what we have in the features dataset:

In [None]:
dt_feat.head(3)

In [None]:
dt_feat.tail(3)

In [None]:
dt_feat.describe()

In [None]:
dt_feat.groupby('IsHoliday').describe(include=['object'])

In [None]:
#to figure out how many not available (NA) values we have in each column
for key,value in dt_feat.items():
    print(key,value.isnull().sum())

Well, this dataset looks like time series data. Actually, it is more than a time series: it is classic panel data (longitudinal data). We need to look at the information not only over time, but also across the 'individuals' (stores).

For this reason, we need the right visualization tools to get the right feel for the data. Let's have some fun:

**Correlation:**

It is important to understand the correlation between explanatory variables, because we want to avoid multicollinearity.

Multicollinearity does not reduce predictive power, but it can distort the estimated coefficients. So, if you want to know which coefficients matter most in a model, you should deal with explanatory variables that are highly correlated.
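As a rough illustration of how highly correlated pairs can be spotted before looking at the heatmap, here is a minimal sketch. The helper `high_corr_pairs`, the 0.8 threshold and the toy columns `a`, `b`, `c` are all assumptions made for this example; they are not part of the Walmart data.

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return (col_1, col_2, |corr|) for numeric column pairs at or above the threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

# Toy data (hypothetical): b is roughly 2 * a, c is an unrelated alternating pattern
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "c": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
})
pairs = high_corr_pairs(toy)
print(pairs)  # only the (a, b) pair passes the threshold
```

A pair flagged this way is a candidate for dropping one of the two variables, which is the reasoning applied to the markdowns later in this notebook.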

In [None]:
plt.figure(figsize=(15, 10))
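corr = dt_feat.select_dtypes(include=['number', 'bool']).corr()  #only numeric/boolean columns (Date is still a string here)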
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);

CPI and Unemployment are negatively correlated. This was expected: higher unemployment means fewer jobs and less demand for consumption, so inflation (CPI) goes down.

Another correlation worth highlighting is between Markdown 1 and Markdown 4 and, with less intensity, Markdown 5. In this case, the correlation is positive.

In [None]:
#I am going to make a copy, because I don't want to process the features yet. I am just testing
dt_feat_copy = dt_feat.copy()

In [None]:
#I changed to datetime format, because I want to do a graph using 'date' as x-axis in datetime format
dt_feat_copy['Date'] = pd.to_datetime(dt_feat_copy['Date'])

In [None]:
#to figure out how many not available (NA) values we have in each column
for key,value in dt_feat_copy.items():
    print(key,value.isnull().sum())

In [None]:
def f(var):
    plt.figure(figsize=(20,5))
    sns.lineplot(x="Date", y="{}".format(var), data=dt_feat_copy)
    
#if you are running this notebook online, run this function to make things more interactive
#interact(f, var=dt_feat_copy.columns[2:].tolist())

In [None]:
list_1 = dt_feat_copy.columns[2:].tolist()

In [None]:
#version off-line - no interaction
data_1=[]
for i in list_1:
    data_1.append(f(i))

To be honest, I am not a big fan of line graphs for analysing panel data, because we can miss plenty of ‘individual’ information. However, if we just want a general idea of what is going on without spending too much time on the task, it is fine. A line plot gives us a pleasant view of the general dynamics over time.
The cool thing is that we can see the dominant trend (the dark line in the middle) and also the spread of the ‘individual trends’ (the light band around it). There are no huge deviations along the series. However, we can note some exceptions, such as the fuel price around 2013-01.

On this date, the fuel price varied considerably between stores. This can negatively affect our estimation, but I believe it is too early to worry about it. I will put this information in my pocket and consult it during the modelling stage.

Ok, but which other useful insights can we take away?

From the graphs above, we can point out:

- Temperature: seasonal data consistent with Northern Hemisphere weather (winter at the end of the year and summer in the middle).
- Fuel_Price: although the fuel price fluctuates considerably, we can also note a positive trend over the period.
- Markdown 1: high volatility, but a considerable volume of promotions, with two peaks over the period.
- Markdown 2: a lower volume of promotions, with a few peaks distributed along the series.
- Markdown 3: a series with only two promotion peaks, but these two are the highest values among the markdowns.
- Markdown 4: very similar to Markdown 1 (high correlation), with the same dynamics and two peaks in the same periods.
- Markdown 5: similar to Markdown 1 and 4 (high correlation), but in a very ‘shy’ way, because the promotional amounts are smaller.
- CPI: substantial variation between stores (the light-blue band), with a positive trend over the period.
- Unemployment: like CPI, considerable variation between stores. In this case, however, we can note a downtrend.
- IsHoliday: a boolean variable with few positive observations over the period (4 per year). Note that four holidays (Super Bowl, Labor Day, Thanksgiving/Black Friday and Christmas) are already flagged here, but others are missing, such as Easter, Father’s Day and Mother’s Day. We need to include them in the pre-processing stage.

**Internal comments:** From here, some ideas are starting to appear in my mind and I would like to put them on the table now, because they will help us find our path when we do the feature engineering to complement these explanatory variables.

They will also give us a feel for which kind of machine learning model we should use here. For those who have experience with economic variables, we can also form a first impression of the impact of the independent variables on weekly sales. Let’s have a look:

For some variables we can expect a positive or negative impact; for others, such as temperature, we do not know.

(These are only expectations; we do not know yet what results will come!)

- Positive:
    - Markdowns: we expect markdowns to affect sales positively. The only problem is that we have plenty of missing values. One way to tackle this is to replace them with dummy variables. We lose the magnitude of the impact, but at least we can still measure the impact of these ‘events’.
    - IsHoliday: honestly, I do not know. But I guess people have more time to spend their money on holidays, so I expect a positive impact here.
- Negative:
    - Fuel price: this variable represents the cost of moving from one place to another. It can increase the price of the products (logistics) and also the opportunity cost of shopping, because if it is more expensive to go somewhere, people stay at home more.
- Both:
    - CPI (Consumer Price Index): well, this guy is tricky. CPI can help consumption, but it can also affect it negatively. Why? Because CPI measures the average change over time in the prices consumers pay for a basket of goods and services. So, in a very superficial explanation, an increase in CPI means consumers need to pay more for the same goods and services, leaving less money to consume other things. The other point to consider is that CPI includes retail prices, so it stands in here for the prices of Walmart sales. We should never exclude this variable, because it is our proxy for Walmart prices. Ok, it is not the perfect variable, but this is life; we work with what we have. But what will we do with the missing values (585)? Considering that CPI is a continuous variable with small variance in the USA, we can simply interpolate.
    - Unemployment: like CPI, unemployment can work in two directions. High unemployment, less consumption; low unemployment, high consumption. As with CPI, we can interpolate to deal with the missing unemployment data.
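The two strategies mentioned in this list (a dummy that flags whether a markdown was recorded, and interpolation for a smooth series such as CPI) can be sketched as follows. The toy values and the new column names `CPI_filled` and `MarkDown1_flag` are made up for illustration; the real treatment comes later in the pre-processing stage.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from features.csv)
toy = pd.DataFrame({
    "CPI": [210.0, np.nan, np.nan, 213.0],
    "MarkDown1": [np.nan, 500.0, np.nan, 120.0],
})

# Linear interpolation (the pandas default) fills gaps between known points
toy["CPI_filled"] = toy["CPI"].interpolate()

# Dummy: 1 when a markdown was recorded, 0 when it is missing.
# The size of the markdown is lost, but the 'event' is kept.
toy["MarkDown1_flag"] = toy["MarkDown1"].notna().astype(int)

print(toy)
```

Interpolation only makes sense for series that move smoothly between observations, which is why it is suggested for CPI and Unemployment but not for the markdowns.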

Wow! I wrote a lot. But it is important to describe these variables, because this is the foundation of everything. I promise to be more concise in the next steps.

Weekly Sales - the exploratory mission

What do we have in our Train dataset?


Train dataset

In [None]:
dt_train.head(3)

In [None]:
dt_train.tail(3)

In [None]:
for key,value in dt_train.items():
    print(key,value.isnull().sum())

I am going to merge the features, stores and train datasets, because it will help us understand our target variable. This is not the pre-processing stage yet; think of it as preparation for it.

My focus here is weekly sales.

In [None]:
dt_explor = dt_train.copy()

In [None]:
dt_explor = dt_explor.merge(dt_st, how='left').merge(dt_feat, how='left')

In [None]:
dt_explor['Date'] = pd.to_datetime(dt_explor['Date'])

In [None]:
dt_explor.head(3)

In [None]:
dt_explor.tail(3)

After merging, I am going to do some exercises. I will start with weekly sales, to compare it with our independent variables.

In [None]:
plt.figure(figsize=(20,5))
sns.lineplot(x="Date", y="Weekly_Sales", data=dt_explor)
plt.title('Weekly Sales')

For those who like time series tools, decomposition is the first thing you should do. Basically, it breaks the time series down into systematic and unsystematic components.

To do it, I did a little manipulation to get one big series. This series ignores the individual aspects; it is just the total weekly sales per date.

In [None]:
#Let's find the total aggregate sales per date (observed) and transform it into a 'time series'
serie_1 = dt_explor.groupby('Date')['Weekly_Sales'].sum()

So, now we have built a time series. What are we looking for?

- The average weekly sales: **level**
- The general dynamic in the series: **trend**
- Is there any repeating short-term cycle in the series? **Seasonality**
- Some random variation without explanation: **noise**
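To make these four components concrete, here is a small hand-made additive decomposition on a synthetic series. It uses a centered moving average for the trend and per-phase means for the seasonality, which is, as far as I understand, the same classical idea behind `seasonal_decompose`. The period of 4, the linear trend and the seasonal values are assumptions chosen so that the result is easy to check by eye.

```python
import numpy as np

period = 4
t = np.arange(24)
trend_true = 100 + 2.0 * t                        # level 100 plus a linear trend
season_true = np.array([5.0, -3.0, -4.0, 2.0])    # repeats every 4 steps, sums to zero
y = trend_true + np.tile(season_true, len(t) // period)

# Centered moving average for an even period: half weights on both ends
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
trend_est = np.convolve(y, weights, mode="valid")  # loses period/2 points on each side

half = period // 2
detrended = y[half:-half] - trend_est
phases = t[half:-half] % period

# Seasonal component: average detrended value at each position of the cycle
season_est = np.array([detrended[phases == p].mean() for p in range(period)])

# Whatever is left is the noise (zero here, because the toy series has none)
resid = detrended - season_est[phases]

print(season_est)
```

With real data, such as the weekly sales series, the residual will not be zero, and its shape is what tells us whether something systematic is still unexplained.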

In [None]:
rcParams['figure.figsize'] = 11, 9
result = seasonal_decompose(serie_1, model='additive', period=52) #52 weeks per year; older statsmodels versions call this argument 'freq'
result.plot()
pyplot.show()

Hmmm, interesting. I was expecting a more obvious seasonal pattern in weekly sales, but apparently there are only two seasonal peaks (holidays: Thanksgiving/Black Friday and Christmas). At first look, if we ignore these two peaks, the series is almost stable, with an uptrend and an average level of around 50 million.

Ah, ok, maybe other holidays have an effect too, such as Mother’s Day (May in the U.S.) and Father’s Day (June in the U.S.), neither of which is included in our ‘IsHoliday’ variable. However, these other holidays seem marginal at a glance.

One thing that catches my attention in the ‘Resid’ graph is the error in April. The deviations at the end of the year are due to Thanksgiving/Black Friday and Christmas. But why this recurring ‘error’ in April? Easter? Yeah, maybe. We should include a dummy to correct it. Let’s analyse the impact of these outliers (Thanksgiving/Black Friday and Christmas) on our weekly sales in more detail.

Now I will go back to analysing the panel data.

In [None]:
#selecting the period of the peaks.
peak_1 = (dt_explor['Date'] > '2010-11') & (dt_explor['Date'] <= '2010-12-26')
peak_2 = (dt_explor['Date'] > '2011-11') & (dt_explor['Date'] <= '2011-12-26')

In [None]:
#selecting the year of the peaks.
y_1 = (dt_explor['Date'] > '2010-1') & (dt_explor['Date'] < '2011-1')
y_2 = (dt_explor['Date'] > '2011-1') & (dt_explor['Date'] < '2012-1')

I am the kind of person who likes the old-school style. Sometimes it is good to make some calculations from scratch, just to get an idea of the magnitudes.

I want to find out how much these peaks represent in total sales. I am looking for the 'share' of sales by peak.

In [None]:
#sales in the peak 1 and its respective year (full) 
sales_1 = dt_explor.loc[peak_1, 'Weekly_Sales'].sum()
sales_total_1 = dt_explor.loc[y_1, 'Weekly_Sales'].sum()

In [None]:
#sales in the peak 2 and its respective year (full) 
sales_2 = dt_explor.loc[peak_2, 'Weekly_Sales'].sum()
sales_total_2 = dt_explor.loc[y_2, 'Weekly_Sales'].sum()

In [None]:
#Calculating the share
print('Share of peak 1: {}'.format(sales_1/sales_total_1)) 
print('Share of peak 2: {}'.format(sales_2/sales_total_2))

So, these peaks represent almost 20% of the total sales in their respective years. Ok, I know, the data for 2010 is not complete. But wait, this gave me an idea: let’s do a crazy exercise using the year 2011 (actually, not so crazy).

How much were the sales before Thanksgiving and Christmas in 2011? I will compare this amount with the value of the 2011 peak. The idea is to figure out the real importance of these events in total sales.


In [None]:
y_2011_without_christ = (dt_explor['Date'] > '2011-1') & (dt_explor['Date'] < '2011-11')

In [None]:
sales_withou_christ = dt_explor.loc[y_2011_without_christ, 'Weekly_Sales'].sum()
print('Sales without the Christmas peak: {}'.format(sales_withou_christ))
print('Sales in peak (Thanksgiving and Christmas): {}'.format(sales_2))
print('Peak against rest of the year: {}'.format(sales_2/sales_withou_christ))

Ho-Ho-Ho!

In only two months, Thanksgiving (Black Friday) and Christmas represent almost 25% of the sales the stores made during the rest of the year! Yeah, Santa Claus smiles.

But for us, it demands special attention, because this peak distorts the pattern of our series. To treat it, maybe a dummy is enough.
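A peak dummy of this kind could look like the sketch below. The toy dates, the column name `peak_dummy` and the choice of flagging all of November and December are assumptions for illustration only; the exact window would have to be chosen from the data.

```python
import pandas as pd

# Toy weekly dates (hypothetical), already in datetime format
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2011-10-28", "2011-11-25", "2011-12-23", "2012-01-06"])
})

# 1 for weeks that fall in November or December, 0 otherwise
toy["peak_dummy"] = toy["Date"].dt.month.isin([11, 12]).astype(int)

print(toy)
```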

Extra exercise: what if weekly sales were a single time series (remember, it is panel data)?

I will plot the autocorrelation function (ACF) of the series against its previous lags and the partial autocorrelation function (PACF). Basically, we can interpret them as:
- ACF: correlation of the present values with the past values at each lag
- PACF: correlation with a given lag after removing the effect of the shorter lags
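To see what the ACF measures, here is a minimal sketch that computes the sample autocorrelation by hand with NumPy. The helper `sample_acf` and the toy period-4 wave are assumptions for illustration; the estimator divides by the full sum of squares, which I believe matches the usual default convention, but treat that as an assumption too.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (denominator: full sum of squares)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    values = [1.0]
    for k in range(1, max_lag + 1):
        values.append(np.sum(centered[k:] * centered[:-k]) / denom)
    return np.array(values)

# Toy series (hypothetical): a wave that repeats every 4 steps
wave = np.tile([1.0, 0.0, -1.0, 0.0], 10)
acf = sample_acf(wave, 4)
print(acf)
```

For this wave, the autocorrelation is strongly negative at lag 2 (half a cycle) and strongly positive at lag 4 (a full cycle). This is the kind of signature we look for at lag 52 in weekly data with yearly seasonality.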

In [None]:
plot_acf(serie_1,lags=20)
plt.show()

In [None]:

plot_pacf(serie_1,lags=20)
plt.show()

**Sales by Type and Dept.**

Let's start by making some functions to calculate the share of sales by type and dept. They will be very useful.

In [None]:
#This function will be very useful when we are looking for the total sum of sales for item 's' of a certain column 'd'.
def sales(s,d): return dt_explor.loc[dt_explor[d] == s, 'Weekly_Sales'].sum()

In [None]:
# y is the column of the dataframe and x is the 'item' of this column whose share you want to find
def share_all(x,y): return sales(x,y)/dt_explor['Weekly_Sales'].sum()

In [None]:
dt_explor.groupby('Type')['Weekly_Sales'].sum()

In [None]:
#Share of sales by Type
print('Share A: {}'.format(share_all('A','Type')))
print('Share B: {}'.format(share_all('B','Type')))
print('Share C: {}'.format(share_all('C','Type')))

Do you remember when I wrote a few lines ago that we could expect more sales from A and B? Well, here is our evidence. But I was not expecting Type A to be so dominant.

From this information, we can expect the dynamics of total weekly sales to follow the pattern of Type 'A' and, to a lesser extent, 'B'. Stores of Type 'C' have less influence on total sales.
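The shares printed above can also be obtained in a single pass with `groupby`. This is only an alternative sketch on toy numbers (the values below are made up, not the real Walmart totals):

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Type": ["A", "A", "B", "C"],
    "Weekly_Sales": [40.0, 20.0, 30.0, 10.0],
})

# Total sales per type divided by the grand total gives the share of each type
share_by_type = toy.groupby("Type")["Weekly_Sales"].sum() / toy["Weekly_Sales"].sum()

print(share_by_type)
```

The same expression works for any grouping column, for example `'Dept'`, so it could replace repeated calls to a per-item helper.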

Just for fun, let's decompose the total sales of Type A per date. Since Type 'A' represents almost 65% of total sales, the systematic and unsystematic components will probably look very similar to those of the main series.

In [None]:
#function to find the total of sales by Type per date.
#note: the name 'type' shadows Python's built-in type(); it is kept because the next cells call it
def type(x): 
    a = dt_explor.loc[dt_explor['Type'] == x]
    a = a.groupby('Date')['Weekly_Sales'].sum()
    return a

In [None]:
type_a = type('A')

In [None]:
rcParams['figure.figsize'] = 11, 9
result2 = seasonal_decompose(type_a, model='additive', period=52)
result2.plot()
pyplot.show()

Yeah, almost the same as the main series.

However, I am curious. What about the others?

In [None]:
type_b = type('B')
type_c = type('C')

In [None]:
rcParams['figure.figsize'] = 11, 9
result2 = seasonal_decompose(type_b, model='additive', period=52)
result2.plot()
pyplot.show()

Type B is very similar to Type 'A' and the main series. And Type C?

In [None]:
rcParams['figure.figsize'] = 11, 9
result2 = seasonal_decompose(type_c, model='additive', period=52)
result2.plot()
pyplot.show()

Type C is a unique guy. Look at the ‘Resid’ graph: there is a systematic error, almost a seasonal error, throughout the period. The other fun thing is the ‘peak’. There is a negative peak at the end of the year! Why?

Perhaps Type C is a kind of super-premium store. Walmart is a popular retailer whose profit comes from the diversity and quantity of the products it sells. For this reason, Walmart stores are enormous. However, as we have seen above, Type C stores are the smallest. My hypothesis is therefore that Type C only sells special products for special people. What kind of store does not sell at the end of the year? Either its customers are travelling or the stores are closed.

Ok, but this is only a hypothesis. Unfortunately, we cannot confirm it with the information we have here. But I particularly enjoy making this kind of interpretation; it helps us improve our power of explanation (qualitative analysis).

Store

Analysis by Department:

In [None]:
a = dt_explor.groupby(['Dept'])['Weekly_Sales'].sum()

In [None]:
plt.rcdefaults()
plt.style.use('seaborn')
fig, ax = plt.subplots(figsize=(10,20))
ax.barh(a.index,a)
ax.set_yticks(a.index)
ax.set_yticklabels(a.index)
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_ylabel('Dept')
ax.set_xlabel('Weekly Sales')
ax.set_title('Sum of Weekly Sales by Department', color='C0')

plt.show()

In [None]:
#How much do the top 3 depts in sales (92, 95 and 38) represent in total sales?
share_all(92,'Dept')+share_all(95,'Dept')+share_all(38,'Dept')


Some departments sell more than others. As we can see above, the top 3 departments in sales are responsible for almost 20% of the sales. This is insane, because there are almost 99 unique departments and only 3 of them represent 20%!

Ok, but do these departments belong to the same store, or do all the stores have the same departments?

In [None]:
#lets take the dept 92 as example
dt_explor.loc[dt_explor['Dept']==92].head(3)

In [None]:
dt_explor.loc[dt_explor['Dept']==92].tail(3)

In [None]:
#function to find the total of observations of some dept by Type
def dept(i): 
    print(dt_explor.loc[dt_explor['Dept']==i].groupby(['Type'])['Dept'].count())

In [None]:
#Let's pick up some dept. randomly, just to check its frequency distribution
dept(92), dept(6), dept(23)

From the exercise above, we can realise that there are different frequency distributions of departments by Type of store. 

In [None]:
plt.figure(figsize=(15, 10))
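corr = dt_explor.select_dtypes(include=['number', 'bool']).corr()  #only numeric/boolean columns (Type is a string, Date is a datetime)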
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap="YlGnBu",
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);

# Pre-processing

In this project, I will merge the available datasets, then do feature engineering and selection. After that, I will split the data into the new train and test sets.

In [None]:
#I will create a column to label our train and test datasets. It will make our lives easier when we split them
dt_train['label'] = 'train'
dt_test['label'] = 'test'

In [None]:
#Because I am going to concatenate, I will create a Weekly_Sales column in the test set with NaN values, just to be safe
dt_test['Weekly_Sales'] = np.nan

dt_all = pd.concat([dt_train, dt_test])

In [None]:
dt_train.shape, dt_test.shape

In [None]:
dt_all.shape

In [None]:
dt_all.tail(3)

In [None]:
dt_all.head(3)

In [None]:
#let's merge
dt_all = dt_all.merge(dt_st, how='left').merge(dt_feat, how='left')

In [None]:
dt_all.shape

In [None]:
dt_all.head(3)

In [None]:
dt_all.tail(3)

In [None]:
dt_all['Date'] = pd.to_datetime(dt_all['Date'])

**Features engineering**

Missing values



In [None]:
#To have a visualization of the space/dimension of the missing values into our dataset 
sns.heatmap(dt_all.notnull(), cbar=False, yticklabels='',cmap="Blues")

Do you see what I am seeing?

We know that the null values of weekly sales belong to the test dataset (because I created this column there a few lines ago). With this as a reference, we can note a tricky fact about the missing values of the markdowns, CPI and Unemployment.

**For the markdowns, the missing values are concentrated in the training dataset. For CPI and Unemployment, however, the missing values are in the test dataset.**

This means that how we fill in the missing markdown values will affect model training, while the missing CPI and Unemployment values will affect our test results.
Well, if I had more time, maybe I would try a forecasting exercise to find suitable CPI and Unemployment values. But I believe interpolation is enough because, as we have seen, both series have a very clear dominant trend over the period and there are only a few observations to fill. This is not a perfect world, but this is life.

With the markdowns, at the beginning I was thinking of creating a dummy and going with the flow. Now, with this information, if I create a dummy I may lose a lot of explanatory power for the forecast. I am thinking of filling them with the median grouped by Store, Dept, Temperature and IsHoliday, because it ‘preserves’ the main ‘individual’ features of each markdown according to its store and dept (place) and its temperature and holiday (seasonal/time aspect).
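The idea of filling a missing value with a statistic of its own group can be sketched compactly with `groupby(...).transform`. Note that this is a simplified alternative on toy data, grouping by Store only; the notebook itself builds separate grouped tables and merges them back. The column name `MarkDown1_filled` and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "MarkDown1": [100.0, np.nan, 300.0, np.nan, 50.0, 70.0],
})

# Median of each store, broadcast back to every row of that store (NaN is ignored)
group_median = toy.groupby("Store")["MarkDown1"].transform("median")

# Only the missing entries are replaced; observed values are left untouched
toy["MarkDown1_filled"] = toy["MarkDown1"].fillna(group_median)

print(toy)
```

If a whole group has no observed value, its median is also missing, so a second fallback (such as the backward fill used in this notebook) is still needed.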

In [None]:
#Making a copy to preserve our original dataset
dt_all_interp = dt_all.copy()

In [None]:
#aggregating the markdowns using the median by Store, Dept, Temperature and IsHoliday.
dt_all_interp_v2 = dt_all_interp.groupby(['Store','Dept','Temperature','IsHoliday'])[['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']].median().reset_index()

In [None]:
#then applying the backward filling method 
dt_all_interp_v2 = dt_all_interp_v2.bfill()

In [None]:
#Checking null values
for key,value in dt_all_interp_v2.items():
    print(key,value.isnull().sum())

In [None]:
#Now I am going to interpolate CPI and Unemployment
dt_all_interp_v3 = dt_all_interp.groupby(['Store','Temperature'])[['CPI','Unemployment']].median().reset_index()

In [None]:
dt_all_interp_v3['CPI'] = dt_all_interp_v3['CPI'].interpolate()
dt_all_interp_v3['Unemployment'] = dt_all_interp_v3['Unemployment'].interpolate()

In [None]:
#Checking null values
for key,value in dt_all_interp_v3.items():
    print(key,value.isnull().sum())

In [None]:
#it's time to put it back into the full dataset. First, let's rename the columns, because we do not want to overwrite the original data 
dt_all_interp_v2.rename(columns={'MarkDown1':'1_mk','MarkDown2':'2_mk','MarkDown3':'3_mk','MarkDown4':'4_mk','MarkDown5':'5_mk'}, inplace=True)
dt_all_interp_v3.rename(columns={'CPI':'inter_CPI','Unemployment':'inter_unempl'}, inplace=True)

In [None]:
#merging
dt_all =dt_all.merge(dt_all_interp_v2, on=['Store','Dept','Temperature','IsHoliday'], how = 'inner').merge(dt_all_interp_v3, on=['Store','Temperature'], how = 'inner')

In [None]:
#replacing the missing values with the filled versions, then dropping the helper columns
fill_map = {'MarkDown1':'1_mk','MarkDown2':'2_mk','MarkDown3':'3_mk','MarkDown4':'4_mk','MarkDown5':'5_mk','CPI':'inter_CPI','Unemployment':'inter_unempl'}
for col, helper in fill_map.items():
    #assigning back avoids relying on inplace changes through attribute access
    dt_all[col] = dt_all[col].fillna(dt_all[helper])
dt_all = dt_all.drop(columns=list(fill_map.values()))

In [None]:
#checking missing values
for key,value in dt_all.items():
    print(key,value.isnull().sum())

**Dummies!** 

Yeah, everybody loves dummies. But we need to be careful not to use too many dummies and overfit our model.

When we work with dummy variables, we need only (n-1) of them. For instance, if we use dummies for the month, we cannot create a dummy for every month. We need to pick one month and exclude it. This ‘excluded’ month will be our ‘control’ (baseline).

Why? Because when all the month dummies are zero, the observation belongs to our control month.
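The (n-1) rule can be shown on a toy column. The sketch below uses the `drop_first=True` option of `pd.get_dummies`, which drops the first category; this is a swapped-in shortcut for illustration, since the notebook drops its reference categories by hand instead.

```python
import pandas as pd

# Toy data (hypothetical): three store types, so only two dummies are needed
toy = pd.DataFrame({"Type": ["A", "B", "C", "A"]})

# drop_first=True removes the dummy of the first category ('A'), which becomes the baseline
dummies = pd.get_dummies(toy, columns=["Type"], drop_first=True)

print(dummies.astype(int))
```

Rows where both `Type_B` and `Type_C` are zero are the Type A stores, which is exactly the ‘control’ interpretation described above.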

Ok, but which variables did I turn into dummies? I chose almost all the categorical variables, because some models do not work with raw categorical variables (XGBoost, regression models, etc.):

- Dept: yes, it is categorical. Although it is a number, it is not a sequence of values; it is only the label of the department. Be careful: the risk of treating it as ‘ordinal’ is creating a super variable that explains everything, since, as we have seen, sales are strongly related to the department.
- Type
- IsHoliday
- Size (same case as Dept: it behaves like a categorical variable. But I will exclude it, so I did not create a dummy)

**Back to the Future:**

After running the machine learning model, I found that there were too many dummies. Despite my warning at the beginning, and although the model error looked great, the model seemed to be overfitting. For this reason, I kept only a few dummies: Type, IsHoliday and some departments.

In [None]:
dt_all = pd.get_dummies(dt_all, columns=["Type",'IsHoliday'])

In [None]:
def dummy_92(c):
    if c['Dept'] == 92:
        return 1
    else:
        return 0

def dummy_6(c):
    if c['Dept'] == 6:
        return 1
    else:
        return 0

def dummy_23(c):
    if c['Dept'] == 23:
        return 1
    else:
        return 0

In [None]:
#dummies for depts 92, 6 and 23 (the depts inspected above; note that the top 3 in sales were 92, 95 and 38)
dt_all['dept_92'] = dt_all.apply(dummy_92, axis=1)
dt_all['dept_6'] = dt_all.apply(dummy_6, axis=1)
dt_all['dept_23'] = dt_all.apply(dummy_23, axis=1)

In [None]:
dt_all.head()

In [None]:
#excluding the unneeded dummies (the reference categories)
dt_all = dt_all.drop(columns=['Type_C','IsHoliday_False'])

In [None]:
for key,value in dt_all.items():
    print(key,value.isnull().sum())

In [None]:
dt_all.shape

**Feature Selection**

This is the time to go back to our notes and think a little about them. Feature selection is a consequence of our data analysis.
So, ok, which variables and why?

- Size: no. I have seen that type and size are strongly correlated, so I prefer to use the Type dummies. I am a little wary of variables with large deviations and prefer to avoid them.
- Temperature: yes. It is our seasonal element. It will help us fit aspects that vary with the season of the year.
- Fuel price: yes. It can be a proxy for the opportunity cost of shopping, and it also has an ‘individual’ component.
- Markdowns: yes, but only some of them. Markdown 1 and 4 are strongly correlated, so I will exclude Markdown 4 (it has more missing values in the original data).
- CPI: yes. It is our proxy for prices in the retail sector. It also works as a proxy for purchasing power.
- Unemployment: yes. It is our proxy for demand and purchasing power.
- Dummies: yes, all of them. They are our categorical variables. We have seen that sales differ across departments, types, stores and holidays.

**Internal comment:** we should consider a variable that represents the competition, such as the demand for online products or competitors’ prices. Other important variables for retail are the education level and age of the population. But importing external datasets is not allowed; otherwise we could use some Google Trends series as a proxy.


In [None]:
#excluding variables
dt_all = dt_all.drop(columns=['Dept','Store','Size','MarkDown4'])

In [None]:
#count missing values per column (DataFrame.iteritems was removed in pandas 2.0, items() replaces it)
for key, value in dt_all.items():
    print(key, value.isnull().sum())

# **Preparing training, validation and testing dataset**

In [None]:
#splitting data
train = dt_all[dt_all.label=='train'].reset_index(drop=True)
test = dt_all[dt_all.label=='test'].reset_index(drop=True)

In [None]:
train = train.drop(columns=['label'])
test = test.drop(columns=['label'])

In [None]:
train.head()

In [None]:
train.shape, test.shape

In [None]:
#X_test
X_test = test.iloc[:,2:]

In [None]:
X_test.head(3)

In [None]:
#train
y_train,X_train =  train.iloc[:,1],train.iloc[:,2:]

In [None]:
#train_2 and validation set
X_train_2, X_valid, y_train_2, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=123)

# **Machine learning algorithm: XGboost**

Since this is a forecasting problem with a reasonably sized dataset, the suitable models that I would apply are: SGD, Lasso, ElasticNet, Ridge Regression, SVR and Ensemble Regressors. Ensemble regressors combine the predictions of several base estimators built with a learning algorithm to improve robustness over a single estimator.

Among ensemble regressors, there are averaging methods (Bagging, Random Forests, etc.) and boosting methods (AdaBoost, Gradient Boosted Trees, etc.). 
I applied XGBoost because the implementation is easy and fast. Besides, boosting with tree ensembles (sets of classification and regression trees) usually gives very good results on this kind of tabular data.
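
To make the averaging-versus-boosting distinction concrete, here is a small sketch on synthetic data. It uses scikit-learn's `RandomForestRegressor` (averaging) and `GradientBoostingRegressor` (boosting) rather than XGBoost, and the dataset and settings are assumptions chosen only for illustration, not the ones used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (assumption: stands in for the real features)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=123)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=123)

models = {
    # Averaging: many independent trees, predictions are averaged
    "averaging (random forest)": RandomForestRegressor(n_estimators=50, random_state=123),
    # Boosting: trees are built sequentially, each one correcting the previous errors
    "boosting (gradient boosting)": GradientBoostingRegressor(n_estimators=50, random_state=123),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    preds = model.predict(X_va)
    scores[name] = float(np.sqrt(mean_squared_error(y_va, preds)))
    print(f"{name}: RMSE = {scores[name]:.2f}")
```

Which family wins depends on the data and the tuning, so the printed numbers should be read as a comparison of mechanics, not as a ranking.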

XGBoost is well known for producing strong results and often outperforms other machine learning models on structured (tabular) data, where it has become one of the most popular algorithms. XGBoost uses the gradient boosting (GBM) framework at its core. It is an optimised distributed gradient boosting library that manages the bias-variance trade-off by combining several weak models into a powerful ensemble. One important step to achieve satisfactory results is tuning its parameters. 

Parameters are essential to give the right ‘orientation’ to the sequential models. Generally speaking, there are 3 main categories of parameters: 1) Tree-specific parameters (regularisation of the individual models), 2) Boosting parameters (control the boosting operation) and 3) Other general parameters that keep the overall operation running. 
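
As a rough illustration of these three groups, the sketch below sorts a few common `XGBRegressor` arguments into them. The grouping and the values are my own assumptions for illustration; they are not the tuned values of this notebook.

```python
# 1) Tree-specific parameters: shape and regularisation of each individual tree
tree_params = {
    "max_depth": 4,           # maximum depth of a tree
    "min_child_weight": 1,    # minimum sum of instance weight needed in a child
    "gamma": 0,               # minimum loss reduction required to make a split
    "subsample": 0.8,         # fraction of rows sampled for each tree
    "colsample_bytree": 0.7,  # fraction of columns sampled for each tree
}

# 2) Boosting parameters: how the sequence of trees is built
boosting_params = {
    "learning_rate": 0.1,     # shrinkage applied to each new tree
    "n_estimators": 50,       # number of boosting rounds (trees)
}

# 3) General parameters: overall operation of the model
general_params = {
    "objective": "reg:squarederror",  # loss function for regression
    "random_state": 123,              # reproducibility
    "n_jobs": -1,                     # use all available cores
}

# Merged dictionary, ready to be unpacked, e.g. xgb.XGBRegressor(**all_params)
all_params = {**tree_params, **boosting_params, **general_params}
print(sorted(all_params))
```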

I will start with cross-validation. Then I will apply a grid search to make my life easier in parameter tuning. After that, I will train the model on the training set, then test its results on the validation set for final adjustment. At the end, I will use the final model to build the submission.

In [None]:
Image('../input/models-sklearn/Screenshot from 2020-07-12 02-20-33.png')

source: [ScikitLearn](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)



XGBoost gets efficiency gains by utilizing its own optimized data structure for datasets called a DMatrix.

In [None]:
# Create the DMatrix (XGBoost's optimized data structure) from the training data
sales_dmatrix = xgb.DMatrix(data=X_train_2,label=y_train_2)

# Parameter dictionary for each tree: params - just a generic example without tuning
# ("reg:squarederror" replaces the deprecated "reg:linear" objective)
params = {"objective":"reg:squarederror", "max_depth":4}


#Let's start parameter tuning by seeing how the number of boosting rounds (number of trees you build) impacts the out-of-sample performance of the XGBoost model

# Perform cross-validation with early stopping
cv_results = xgb.cv(dtrain=sales_dmatrix, params=params, nfold=3, num_boost_round=50, early_stopping_rounds=10, metrics="rmse", as_pandas=True, seed=123)

print(cv_results)

Ok, good. I saw how the number of boosting rounds (number of trees you build) impacts the out-of-sample performance of the XGBoost model. Now let's do more parameter tuning.
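
To read the cross-validation table, one option is to pick the round with the lowest mean out-of-sample RMSE. The sketch below does this on a small made-up table; the helper name `best_boosting_round` and the numbers are assumptions, while the column name `test-rmse-mean` is the one `xgb.cv` produces with `metrics="rmse"` and `as_pandas=True`.

```python
import pandas as pd

def best_boosting_round(cv_results, metric_col="test-rmse-mean"):
    """Return (number_of_rounds, score) at the lowest mean out-of-sample metric."""
    best_idx = cv_results[metric_col].idxmin()
    # Rows are 0-indexed, so round number = index + 1
    return int(best_idx) + 1, float(cv_results.loc[best_idx, metric_col])

# Made-up values shaped like the output of xgb.cv (not real results)
toy_cv = pd.DataFrame({
    "train-rmse-mean": [900.0, 700.0, 550.0, 450.0, 400.0],
    "test-rmse-mean": [950.0, 760.0, 640.0, 610.0, 625.0],
})

n_rounds, score = best_boosting_round(toy_cv)
print(n_rounds, score)  # 4 610.0
```

In the toy table the training error keeps falling while the out-of-sample error turns upward after round 4, which is the pattern early stopping is meant to catch. When early stopping triggers, `xgb.cv` should return the table only up to the best round, so the last row is then the one to look at.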

In [None]:
# Creating dict. of range of parameters to grid
gbm_param_grid = {
    'learning_rate': [0.01,0.1,0.5],
    'colsample_bytree': [0.3, 0.7],
    'n_estimators': [50],
    'max_depth': [2, 5]
}

# the regressor
gbm = xgb.XGBRegressor()

# Grid search (yes!): 
grid_mse = GridSearchCV(estimator=gbm, param_grid=gbm_param_grid, scoring='neg_mean_squared_error', cv=4, verbose=1)

#Fit the parameters!
grid_mse.fit(X_train_2, y_train_2)

# best parameters and lowest RMSE
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))

According to the grid search, the parameters above are the best for my model. Let's use them in the model. 
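
Instead of copying the best values by hand, the fitted grid search can hand them over directly. This is a minimal sketch on synthetic data, using scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor` (an assumption to keep the example self-contained); the same pattern should apply to `grid_mse` in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption: stands in for X_train_2, y_train_2)
X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=123)

param_grid = {
    "learning_rate": [0.01, 0.1, 0.5],
    "max_depth": [2, 5],
    "n_estimators": [50],
}

grid = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=123),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=4,
)
grid.fit(X, y)

# Rebuild the model from best_params_ so the values never go out of sync
final_model = GradientBoostingRegressor(random_state=123, **grid.best_params_)
final_model.fit(X, y)

# With the default refit=True, grid.best_estimator_ is already refitted on all the data
print(grid.best_params_)
```

The benefit is mostly bookkeeping: if the grid changes later, the final model follows automatically.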

In [None]:
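# Now, back to the model with these new parameters
# ("reg:squarederror" replaces the deprecated "reg:linear"; reg_alpha is the L1 regularisation term)
xg_reg_2 = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.7, learning_rate = 0.5, max_depth = 5, reg_alpha = 10, n_estimators = 50)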

In [None]:
#fitting the model to our training dataset
xg_reg_2.fit(X_train_2,y_train_2)

In [None]:
#Feature importance
xgb.plot_importance(xg_reg_2)
plt.show()

In [None]:
#!pip install shap
#import shap
#shap.initjs()

# explain the model's predictions using SHAP

#explainer = shap.TreeExplainer(xg_reg_2)
#shap_values = explainer.shap_values(X_train_2,check_additivity=False)

# summarize the effects of all the features
#shap.summary_plot(shap_values, X_train_2)

I used a little trick here, because Kaggle notebooks could not run 'shap.TreeExplainer' (July 2020). However, I really need to include this graph, because it gives us a good view of each feature's impact. 

To solve it, I downgraded the xgboost version (!pip install xgboost==1.0.0) and ran it again. 

Then there was another problem: the Kaggle notebook does not accept a downgraded version of the package. So I downloaded the notebook, ran it in Colab (Jupyter works too), and brought the result back here. 

I lost a lot of time doing this, because every time I updated this notebook, the error appeared again. So I decided to add only an image and that's it. 

If you run this notebook in Colab or Jupyter following the steps mentioned above (downgraded version, etc.), the next graph will appear correctly. Otherwise, drop me a message. 

In [None]:
Image('../input/shap-features/Screenshot from 2020-07-11 19-46-04.png')

I liked this model. Although we could apply more treatments to minimise the error, I think the features are very well balanced. 

If Walmart’s shareholders asked me: ‘According to your model, what are the most important features that impact our sales and why?’

I would say:

- CPI: it is a proxy for our prices and also represents our consumers’ purchasing power. For this reason, it is responsible for both positive and negative impacts.
- Unemployment: it is the thermometer of our potential market: the more people working, the larger the potential market, and vice versa. 
- Temperature: it is our seasonal variable, correlated with months and special dates (end of the year). 
- Markdowns: I was expecting only positive effects here, because some markdowns are related to special dates. One hypothesis that could explain this ‘dual’ impact is that some markdowns only happen when sales are decreasing. Considering that I filled the missing values by back-interpolating them, the model may be picking up this delayed correlation between falling sales and markdowns. 
- Fuel price: it is the opportunity cost; for this reason, when the fuel price is high, sales decrease. The fuel price can also affect the final price of the products. 

So, one shareholder could ask: ‘Why did you not comment on the categorical variables (dept and type)?’

I am very careful about including a lot of categorical variables without understanding the meaning of each one well, because a single variable can concentrate many features with high explanatory potential. 
If I were just forecasting and taking part in a forecasting competition, fine, I would include all the categorical variables, because it would very likely reduce the error.

But if it were my job, I don’t know. For sales forecasting, I am not sure this kind of model is sustainable in the mid and long term.
For instance, imagine that we have a company that sells fish and chips. We buy a huge database of our consumers, identified by ‘ID’, with their demand for fish and chips and other features. Then someone includes the ‘ID’ column in the model. Wow, perfect, the model explains everything very well, nice job. In the next month, consumers and their ‘IDs’ start to appear and disappear. So how are we going to forecast the ‘IDs’ of our future consumers in order to keep using our model? It makes little sense. 
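
The ‘ID’ problem can be shown in a few lines. In this sketch (toy IDs, made up for illustration), the model is trained on dummy columns for customers a, b and c; when a new customer d arrives, there is no column for it, so its row carries no ID information at all.

```python
import pandas as pd

# Toy data (assumption): customers seen in training vs. customers seen next month
train_ids = pd.DataFrame({"ID": ["a", "b", "c"]})
new_ids = pd.DataFrame({"ID": ["c", "d"]})  # 'd' was never seen in training

train_dummies = pd.get_dummies(train_ids["ID"], prefix="ID")
new_dummies = pd.get_dummies(new_ids["ID"], prefix="ID")

# Align the new data to the columns the model was trained on:
# ID_d is dropped, and the missing ID_a / ID_b columns are filled with 0
aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0).astype(int)
print(aligned)

# Rows where every ID dummy is 0 belong to customers the model has never seen
unseen_rows = int((aligned.sum(axis=1) == 0).sum())
print("rows with an unseen ID:", unseen_rows)
```

For those rows the model can only fall back on the remaining features, which is why I prefer features that describe what is behind the ID.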

In this case, dept and store are our ‘IDs’. Inside each ‘ID’ there are hidden features that we do not include, because no dataset is available and we are not allowed to add external variables. If it were allowed, we could include series that represent recent trends in consumer behaviour, competitors’ movements (digital stores, such as Amazon), education profile, etc. It is not the dept or store that explains the sales per se, but what is inside these categorical variables. For this reason, I only included the top 3 depts in sales in my model (they probably represent the most popular products). If I included all those categorical features, we would be over-fitting our model. 

Why does it work so well in prediction if I add the store and dept columns? Because most stores and depts are the same in both datasets (train and test set), so in this case the result does not suffer. Once again, it is okay to include these variables for the forecast. However, as with time series models, we cannot use this model (with a lot of categorical features) to understand the impact of each action on sales.

In [None]:
#applying the model to make the predictions, based on the features of the validation dataset
preds = xg_reg_2.predict(X_valid)

In [None]:
#let's see the error - rmse 
rmse = np.sqrt(mean_squared_error(y_valid, preds))
print("RMSE: %f" % (rmse))

In [None]:
#function to measure the error, based on the criteria of this competition
def wmae(dataset, real, predicted):
    weights = dataset.IsHoliday_True.apply(lambda x: 5 if x else 1)
    return np.round(np.sum(weights*abs(real-predicted))/(np.sum(weights)), 2)

In [None]:
wmae(X_valid,y_valid,preds)
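
The `wmae` function above weights each absolute error by 5 in holiday weeks and by 1 otherwise, then divides by the sum of the weights. A small worked example with made-up numbers (the helper `wmae_from_arrays` is a hypothetical array-based variant, written only for this illustration):

```python
import numpy as np

def wmae_from_arrays(is_holiday, real, predicted):
    """Weighted mean absolute error: weight 5 for holiday weeks, 1 otherwise."""
    weights = np.where(np.asarray(is_holiday, dtype=bool), 5, 1)
    errors = np.abs(np.asarray(real, dtype=float) - np.asarray(predicted, dtype=float))
    return float(np.sum(weights * errors) / np.sum(weights))

# Made-up values: three weeks, the second one is a holiday week
is_holiday = [False, True, False]
real = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

# weights = [1, 5, 1], absolute errors = [10, 10, 30]
# weighted sum = 10 + 50 + 30 = 90, sum of weights = 7
result = wmae_from_arrays(is_holiday, real, predicted)
print(round(result, 2))  # 12.86
```

The plain mean absolute error of the same three weeks would be 50 / 3 ≈ 16.67, so the weighting shifts the score towards how well the holiday week is predicted.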

# **Conclusion and next steps**

I had a great time doing this exercise. It was a very good experience. The main insights that I would like to take away are:

- The train and test sets were built to reflect not only the quality of the model, but also the way missing values are treated
- There is a small issue between the shap package and the current XGBoost version. The workaround is to downgrade the XGBoost version
- XGBoost is very fast and easy to implement. The algorithm is fantastic. But parameter tuning is an art.  


Next time, I would come back here and test these hypotheses: 

- Feature selection: 

    - I could build a time series model to predict CPI and Unemployment, then use the results to replace the missing values. 
    - Markdowns and weekly sales: understand the causality. Is it the markdown that affects sales, or do some markdowns happen only because sales are decreasing?
    - Create new variables from external sources, such as Google Trends, Twitter, etc. 
    
- Spend more time on parameter tuning
- Better error analysis to understand where I can improve
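
As a starting point for the first hypothesis, the sketch below fills missing CPI values with a straight-line trend fitted on the observed values. A linear trend is the simplest possible stand-in for a real time series model (an assumption on my side), and the numbers and the helper name `fill_with_linear_trend` are made up for illustration.

```python
import numpy as np
import pandas as pd

def fill_with_linear_trend(series):
    """Fill NaNs with a straight-line trend fitted on the observed values."""
    values = series.to_numpy(dtype=float)
    t = np.arange(len(values))
    observed = ~np.isnan(values)
    # Least-squares line through the observed points
    slope, intercept = np.polyfit(t[observed], values[observed], deg=1)
    filled = values.copy()
    filled[~observed] = intercept + slope * t[~observed]
    return pd.Series(filled, index=series.index, name=series.name)

# Made-up weekly CPI values with the last two weeks missing
cpi = pd.Series([210.0, 210.5, 211.0, 211.5, np.nan, np.nan], name="CPI")

filled = fill_with_linear_trend(cpi)
print(filled.round(2).tolist())
```

On the real data this would be applied per store, since CPI and Unemployment vary by store, and a proper time series model could replace the line once the mechanics are in place.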


I hope you have enjoyed this notebook. I really appreciate your attention all the way to this conclusion!

Thank you!

# **Submission**

In [None]:
#forecast in test set
preds_final = xg_reg_2.predict(X_test)

In [None]:
#Preparing the submission
dt_submission = dt_test.copy()
dt_submission['weeklySales'] = preds_final

In [None]:
#adapting the predictions to the submission format (Id = Store_Dept_Date)
dt_submission['id'] = dt_submission['Store'].astype(str) + '_' +  dt_submission['Dept'].astype(str) + '_' +  dt_submission['Date'].astype(str)
dt_submission = dt_submission[['id', 'weeklySales']]
dt_submission = dt_submission.rename(columns={'id': 'Id', 'weeklySales': 'Weekly_Sales'})

In [None]:
dt_submission.head(3)

In [None]:
dt_submission.info()

In [None]:
dt_submission.to_csv('output_submission.csv', index=False)