### Kaggle Leaderboard
- Public Score: 3348.10600
- Private Score: 3463.42705

## Student Project Team
The Hague University of Applied Sciences <br><br>
Christiaan Morreau: https://www.linkedin.com/in/christiaan-morreau-507565187/ <br>
Marcus van Gulik: https://www.linkedin.com/in/marcus-van-gulik-7718b91bb/ <br>
Gerard Draadjer: https://www.linkedin.com/in/gerard-draadjer-635751179/ <br>


#### Walmart business problem
For this project, Walmart made data available on Kaggle so that (aspiring) data scientists could help with its business problem. In this particular case, Walmart provided a few years' worth of weekly retail data and asked for a model that can predict future sales. Such a model could be very valuable because it would lead to a better understanding of multiple facets of the business, such as sales fluctuations, warehouse-to-store logistics and seasonal markdowns.

#### Workflow:
1. Understanding Data
<ul>
<li>1.1 Imports and First Impressions</li>
<li>1.2 Univariate Analysis</li>
<li>1.3 Bivariate Analysis</li>
<li>1.4 Further EDA on important features</li>
</ul>
2. Feature Engineering
<ul>
<li>2.1 Basic Feature Changes</li>
<li>2.2 Feature Engineering Idea 1</li>
<li>2.3 Feature Engineering Idea 2</li>
<li>2.4 Feature Engineering Idea 3</li>
<li>2.5 Feature Engineering Idea 4</li>
<li>2.6 Feature Engineering Idea 5</li>
<li>2.7 Feature Engineering Idea 6</li>
</ul>
3. Modeling
<ul>
<li>3.1 Model Preparation</li>
<li>3.2 Correlation of Prepared Features</li>
<li>3.3 Model and Feature Selection & Hyperparameter Tuning</li>
<li>3.4 Final Model & Submission</li>
<li>3.5 Visualizing predictions</li>
</ul>

# 1. Understanding Data

## 1.1 Imports and First Impressions

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
pd.set_option('display.max_columns', None)
sns.set_style('darkgrid')

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from pandas.core.common import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

In [None]:
stores = pd.read_csv('/kaggle/input/walmart-recruiting-store-sales-forecasting/stores.csv')
features = pd.read_csv( '/kaggle/input/walmart-recruiting-store-sales-forecasting/features.csv.zip')
test = pd.read_csv('/kaggle/input/walmart-recruiting-store-sales-forecasting/test.csv.zip')
train = pd.read_csv('/kaggle/input/walmart-recruiting-store-sales-forecasting/train.csv.zip')

In [None]:
print(train.shape)
train.head()

In [None]:
print(test.shape)
test.head()

In [None]:
print(stores.shape)
stores.head()

In [None]:
print(features.shape)
features.head()

The train and test data are split by date.
- The train data contains rows that go from 2010 till the 26th of October 2012. 
- The test data contains rows that go from November 2nd 2012 till July 26th, 2013.

To make analysis and preprocessing more convenient, we'll keep the datasets combined until the modeling step.

In [None]:
# concatenating test and train datasets
train['dataset'] = 'train'
test['dataset'] = 'test'
train_test = pd.concat([train, test])

# Merge all data
train_test = train_test.merge(stores, how='left').merge(features, how='left')

# Creating date-time objects and some extra date-time info
train_test['Date'] = pd.to_datetime(train_test['Date'])
train_test['Year'] = train_test['Date'].dt.year
train_test['Month'] = train_test['Date'].dt.month
# .dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
train_test['Week'] = train_test['Date'].dt.isocalendar().week.astype(int)
train_test['DayOfTheMonth'] = train_test['Date'].dt.day

train_test.head()
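
Because the combined frame carries a `dataset` column, it can be split back into its two parts once modeling starts. A minimal sketch on toy data; the frames `toy_train` and `toy_test` and their values are made up for illustration and are not part of the Walmart data:

```python
import pandas as pd

# Toy stand-ins for the real train and test frames (hypothetical values)
toy_train = pd.DataFrame({'Store': [1, 1], 'Weekly_Sales': [100.0, 120.0]})
toy_test = pd.DataFrame({'Store': [1]})

# Tag each part, then stack them; test rows get NaN for Weekly_Sales
toy_train['dataset'] = 'train'
toy_test['dataset'] = 'test'
combined = pd.concat([toy_train, toy_test], ignore_index=True)

# ... shared preprocessing would happen here ...

# Recover the two parts from the tag column
train_part = combined[combined['dataset'] == 'train'].drop(columns='dataset')
test_part = combined[combined['dataset'] == 'test'].drop(columns=['dataset', 'Weekly_Sales'])
```

Keeping the tag as a plain string column makes the split a simple boolean filter, at the cost of having to drop the column again before fitting a model.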

In [None]:
# test and train data are separated by dates
display(train.Date, test.Date)

In [None]:
pd.DataFrame(train_test.dtypes).reset_index().rename(columns={'index':'Columns', 0:'Type'})

In [None]:
train_test.info()

In [None]:
train_test.isna().sum()

## 1.2 Univariate Analysis

### Numeric Features
For numeric features, the distribution matters most, including the mean, median and mode. We'll inspect it with histograms and boxplots.

In [None]:
# all numeric columns
numeric = train_test.select_dtypes(include=['number']).copy()

# discrete number columns for bar-graphs
disc_num_var = ['Year','Month','Week','DayOfTheMonth']

# continuous number columns for histograms
cont_num_var = []
for i in numeric:
    if i not in disc_num_var:
        cont_num_var.append(i)
        
print('Discrete:', disc_num_var)
print('Continuous:', cont_num_var)

In [None]:
# all categorical columns
categoric = train_test.select_dtypes(exclude=['number']).drop(['Date', 'dataset'], axis=1).copy()
categoric.columns

- All categorical columns are nominal.

Let's plot histograms of the distributions of the continuous numeric variables.

In [None]:
fig = plt.figure(figsize=(14,10))

for index, col in enumerate(cont_num_var): 
    plt.subplot(4,4,index+1) 
    sns.distplot(numeric.loc[:,col].dropna(), kde=False) 
    plt.xlabel(None)
    plt.title(col, fontsize=12)
fig.tight_layout(pad=1.0) 

Let's plot boxplots for a better overview of the outliers.

In [None]:
fig = plt.figure(figsize=(15,16))

for index, col in enumerate(cont_num_var):
    plt.subplot(6,4,index+1)
    sns.boxplot(y=col, data=numeric.dropna())
    plt.ylabel(None)
    plt.title(col, fontsize=12)
fig.tight_layout(pad=1.0)

- Because the markdowns seem to either consist mostly of one value or have a lot of outliers above 0, they are most likely connected to one of the holidays.
- Furthermore, only the unemployment rate seems to have some outliers.
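
The points a boxplot draws as outliers follow the 1.5 × IQR whisker rule by default, so the visual impression above can be backed by a count. A minimal sketch; the helper `count_iqr_outliers` and the toy values are hypothetical:

```python
import pandas as pd

def count_iqr_outliers(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot whisker rule."""
    s = series.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    outside = (s < q1 - k * iqr) | (s > q3 + k * iqr)
    return int(outside.sum())

# Hypothetical unemployment rates with one clearly deviating value
toy = pd.Series([7.0, 7.5, 8.0, 8.2, 8.5, 14.0])
n_outliers = count_iqr_outliers(toy)
```

Applied per column of `numeric`, a helper like this would give one outlier count per feature to compare against the plots.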

Let's create some plots for the discrete number variables

In [None]:
fig = plt.figure(figsize=(15,8))

for index, col in enumerate(disc_num_var):
    plt.subplot(2, 2, index+1)
    sns.countplot(x=col, data=numeric.dropna()) 
    plt.ylabel(None)
    plt.title(col)
fig.tight_layout(pad=1.0)

Plotting barplots for categorical values.

In [None]:
# to be able to make a countplot for this boolean variable we have to change its type
categoric['IsHoliday'] = categoric['IsHoliday'].apply(str)

fig = plt.figure(figsize=(10,6))
for index, col in enumerate(categoric):
    plt.subplot(2,1,index+1)
    sns.countplot(x=categoric[col], data=categoric.dropna())
    plt.xlabel(None)
    plt.title(col, fontsize=14)
fig.tight_layout(pad=1.0)

## 1.3 Bivariate Analysis

Let's look at multicollinearity.

In [None]:
plt.figure(figsize=(10,7))
# joining isholiday and type
numeric['Type'] = categoric['Type'].replace({'A': 3, 'B': 2, 'C': 1})
numeric['IsHoliday'] = categoric['IsHoliday'].replace({'False': 0, 'True': 1})
cor = numeric.corr()
sns.heatmap(cor, linewidths=0.2, cmap='Blues') 
plt.show()

The columns below seem to be strongly related to each other:
- MarkDown1 - MarkDown4
- Week - Month
- Type - Size
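
Reading pairs off a heatmap is subjective, so the same information can also be pulled from the correlation matrix directly. A minimal sketch, assuming a threshold of 0.8 (an arbitrary choice) and using a made-up helper `top_correlated_pairs` on toy data:

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],   # nearly proportional to 'a'
    'c': [5, 1, 4, 2, 3],    # only weakly related to 'a' and 'b'
})
pairs = top_correlated_pairs(toy)
```

On the notebook's data this would be called as `top_correlated_pairs(numeric)`.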

Let's look at the correlation coefficients between Weekly_Sales and its features. 

In [None]:
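# restrict to numeric/boolean columns; since pandas 2.0 corr() no longer drops non-numeric columns silently
train_test.select_dtypes(include=['number', 'bool']).corr()[['Weekly_Sales']].apply(abs).sort_values('Weekly_Sales', ascending=False).head(10)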

- At first sight there don't seem to be high correlations with the target variable. Before drawing conclusions, we want to make sure we aren't missing non-linear relationships, such as quadratic or exponential ones.

Let's make scatterplots to confirm the above.

In [None]:
fig = plt.figure(figsize=(20,20))
for index, col in enumerate(numeric):
    plt.subplot(5,4,index+1)
    sns.scatterplot(x=col, y='Weekly_Sales', data=numeric.dropna())
    plt.title(col, fontsize=14)
    plt.xlabel(None)
    plt.ylabel('Weekly_Sales')
fig.tight_layout(pad=1.0)    
plt.show()
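
A numeric complement to the scatterplots is to compare Pearson with Spearman (rank) correlation. The sketch below uses synthetic data, not the Walmart data, to show what each coefficient can and cannot pick up:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(-3, 3, 61))
toy = pd.DataFrame({
    'x': x,
    'quadratic': x ** 2,        # symmetric, non-monotonic relationship
    'exponential': np.exp(x),   # monotonic but non-linear relationship
})

# correlation of every column with x under both methods
pearson = toy.corr(method='pearson')['x']
spearman = toy.corr(method='spearman')['x']
```

Spearman recovers the exponential relationship that Pearson understates, but both stay near zero for the symmetric quadratic one, which is why the scatterplots remain worth inspecting.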

#### To sum up: 
We identified six features that seem most relevant for predicting our target variable. These features either came up in our analysis or stood out through our interpretation of their possible meaning. 
1. MarkDowns:
    - So far, we noticed a linear relationship between MarkDown3 and IsHoliday. 
2. Type & Size:
    - We'll further investigate this multicollinearity and its behaviour with regard to Weekly_Sales.
3. IsHoliday:
    - We're going to further explore the relation between holidays and sales. We'll also check whether all holidays are included.
4. Year:
    - We want to explore Weekly_Sales from 2010 through 2012 to see whether there is a year-over-year trend.
5. Store:
    - We are going to further explore the relationship between sales and the individual stores.
6. Dept:
    - Since we expect the departments to be identical across all stores and to sell the same goods (e.g. an electronics department and a groceries department), we'd like to further investigate the differences between the departments themselves.

## 1.4 Further EDA on important features

#### 1. Markdowns

Are markdowns on average lower or higher during holidays?

In [None]:
# Creating a column with markdown_totals
train_test['markdown_totals'] = train_test[['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']].sum(axis=1)
train_test.groupby('IsHoliday')[['markdown_totals']].mean()

- Why are markdowns higher during holidays? In the description they mention markdowns PRECEDE holidays.

Plot markdown by date and store.

In [None]:
plt.figure(figsize=(15,5))

# working on xticks labels with dates
weeks = [str(i) for i in range(1,53)]
weeks[5] = '6 - superbowl'
weeks[12] = 'easter - 2010 / 13'
weeks[13] = 'easter - 2012 / 14'
weeks[15] = 'easter - 2011 / 16'
weeks[21] = 'memorial day - 22'
weeks[26] = 'independence day - 27'
weeks[35] = '36 - laborday'
weeks[46] = '47 - thanksgiving'
weeks[51] = '52 - christmas'

# plotting markdowns 1 to 5
for i in range(1,6):
    markdown_df = train_test[train_test['MarkDown' + str(i)] > 0].groupby('Week')[['MarkDown' + str(i)]].sum()
    plt.plot(markdown_df.index, markdown_df[['MarkDown' + str(i)]].values, label='MarkDown' + str(i))

plt.xticks(np.arange(1, 53, step=1), labels=weeks, rotation=90)
plt.legend(['Super Bowl','Christmas','Thanksgiving','Super Bowl','Unknown'], loc=2)
plt.title('Weekly markdown sums (over 3 years) per markdown type', fontsize=16)
plt.ylabel('markdown_totals', fontsize=12)
plt.xlabel('Weeks', fontsize=12)
plt.show()

- We see that some markdowns do precede holidays and others fall in the holiday weeks themselves. This probably has to do with the kind of holiday.

Is the markdown policy the same among all stores? Let's make a plot.

In [None]:
# plotting
plt.figure(figsize=(15,6))
# we drop store 28 since it is an outlier at one particular moment, which hurts the readability of our plot.
for i in train_test.Store.value_counts().sort_index().index.drop(28):
    mark_per_store = train_test[(train_test['Store'] == i) & (train_test['markdown_totals'] > 0)].groupby('Date')[['markdown_totals']].mean()
    plt.plot(mark_per_store.index, mark_per_store.values, label='store' + str(i))
print(len(train_test.Store.unique()), ' different stores')
plt.title('Markdown vs Stores')
plt.ylabel('Markdown Totals')
plt.xlabel('Date')
plt.show()

In [None]:
plt.figure(figsize=(15,5))

# working on xticks labels with dates
weeks = [str(i) for i in range(1,53)]
weeks[4] = 'superbowl - 2013 / 5'
weeks[5] = 'superbowl - 2012 / 6'
weeks[13] = 'easter - 2012 & 2013 / 14'
weeks[21] = 'memorial day - 22'
weeks[26] = 'independence day - 27'
weeks[35] = 'laborday - 36'
weeks[46] = 'thanksgiving - 47'
weeks[50] = 'christmas - 2010 / 51'
weeks[51] = 'christmas - 2011 / 52'

# making datasets for next plot
mark_totals_2011 = train_test[(train_test['Year'] == 2011) & train_test['Week'].isin(list(range(44,53)))].groupby(train_test['Week'])[['markdown_totals']].sum()
mark_totals_2012 = train_test[train_test['Year'] == 2012].groupby(train_test['Week'])[['markdown_totals']].sum()
mark_totals_2013 = train_test[train_test['Year'] == 2013].groupby(train_test['Week'])[['markdown_totals']].sum()

# plotting 
plt.plot(mark_totals_2011.index, mark_totals_2011['markdown_totals'])
plt.plot(mark_totals_2012.index, mark_totals_2012['markdown_totals'])
plt.plot(mark_totals_2013.index, mark_totals_2013['markdown_totals'])

plt.xticks(np.arange(1, 53, step=1), labels=weeks, rotation=90)
plt.legend(['2011', '2012', '2013'], fontsize=16)
plt.title('Sum of markdown totals - Per week & Year (department/ stores combined)', fontsize=18)
plt.ylabel('markdown totals', fontsize=16)
plt.xlabel('Week', fontsize=16)
plt.show()

#### 2. Type & Size 

Earlier, we noticed multicollinearity between type and size. Also, both features had a small correlation with Weekly_Sales. Let's again make a plot to verify our assumptions.

In [None]:
for i in train_test['Type'].unique():
    local_data = train_test[train_test['Type'] == i]
    plt.scatter(local_data['Size'], local_data['Weekly_Sales'], label='Type: %s' %i, alpha=0.5)
    
plt.title('Size vs. Sales vs. Type')
plt.legend(loc=2)
plt.show()

- It seems that smaller stores are labeled as type C, medium sized stores as type B, and the biggest ones as type A.

What is the correlation between Types and Weekly_Sales? Let's plot!

In [None]:
plt.title('Type vs. average Sales')
sns.barplot(y='Weekly_Sales', x='Type', data=train_test.dropna())
plt.show()

- It seems that the biggest stores also have the highest Weekly_Sales.

#### 3. IsHoliday

Let's have a look at the mean weekly sales between holiday periods and regular ones!

In [None]:
train_test.groupby('IsHoliday')[['Weekly_Sales']].mean().rename(columns={'Weekly_Sales': 'Average Weekly Sales'})

Something we didn't do now, but might still want to investigate later:
- Sales are higher when there are holidays. To what extent is this caused by markdowns?

#### 4. Year

Let's plot the mean Sales over the Years

In [None]:
# making datasets for next plot
weekly_sales_2010 = train_test[train_test['Year'] == 2010].groupby('Week')[['Weekly_Sales']].mean()
weekly_sales_2011 = train_test[train_test['Year'] == 2011].groupby('Week')[['Weekly_Sales']].mean()
weekly_sales_2012 = train_test[train_test['Year'] == 2012].groupby('Week')[['Weekly_Sales']].mean()

In [None]:
plt.figure(figsize=(15,5))

# working on xticks labels with dates
weeks = [str(i) for i in range(1,53)]
weeks[5] = 'superbowl - 6'
weeks[12] = 'easter - 2010 / 13'
weeks[13] = 'easter - 2012 & 2013 / 14'
weeks[15] = 'easter - 2011 / 16'
weeks[21] = 'memorial day - 22'
weeks[26] = 'independence day - 27'
weeks[35] = 'laborday - 36'
weeks[46] = 'thanksgiving - 47'
weeks[50] = 'christmas - 2010 / 51'
weeks[51] = 'christmas - 2011 / 52'

# plotting 
plt.plot(weekly_sales_2010.index, weekly_sales_2010['Weekly_Sales'])
plt.plot(weekly_sales_2011.index, weekly_sales_2011['Weekly_Sales'])
plt.plot(weekly_sales_2012.index, weekly_sales_2012['Weekly_Sales'])

plt.xticks(np.arange(1, 53, step=1), labels=weeks, rotation=90)
plt.legend(['2010', '2011', '2012'], fontsize=16)
plt.title('Average Weekly Sales (department/ stores combined) - Per Year', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.xlabel('Week', fontsize=16)
plt.show()

- We see that Christmas sales happen in the preceding week, so we'll have to adjust for this later on.

#### 5. Stores

Let's have a look at the differences in sales between the stores.

In [None]:
plt.figure(figsize=(20,8))
sns.barplot(x='Store', y='Weekly_Sales', data=train_test)
plt.title('Average Sales - per Store', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.xlabel('Store', fontsize=16)
plt.show()

- This gives us an interesting insight into the differences in sales between stores. Later on we'll look at how to make better use of this feature by converting the data.

To further confirm the preceding insight, let's plot sales over time per store.

In [None]:
# plotting
plt.figure(figsize=(15,6))
for i in train_test.Store.unique():
    sale_per_store = train_test[(train_test['Store'] == i)].groupby('Date')[['Weekly_Sales']].mean()
    plt.plot(sale_per_store.index, sale_per_store.values, label='store' + str(i))

print(len(train_test.Store.unique()), ' different stores')
plt.title('Average Sales vs stores')
plt.ylabel('Average Weekly sales')
plt.xlabel('date')
plt.show()

#### 6. Departments

We'll want to do the same thing for departments as for stores, so let's plot!

In [None]:
plt.figure(figsize=(20,8))
sns.barplot(x='Dept', y='Weekly_Sales', data=train_test)
plt.title('Average Sales - per Dept', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.xlabel('Dept', fontsize=16)
plt.show()

- We see the same for departments as with stores. There seems to be a big difference in sales between departments. 
- Some departments also appear to have no sales on average. That would be a false conclusion: these departments are absent from the train dataset but present in the test dataset. This means our model will need to predict for departments it hasn't seen before.
- Let's verify the different sales per department.

In [None]:
plt.figure(figsize=(15,6))
for i in train_test.Dept.unique():
    sale_per_dept = train_test[(train_test['Dept'] == i)].groupby('Date')[['Weekly_Sales']].mean()
    plt.plot(sale_per_dept.index, sale_per_dept.values, label='department: ' + str(i))

print(len(train_test.Dept.unique()), ' different departments')
plt.title('Average Sales vs departments', fontsize=14)
plt.ylabel('Average Weekly sales')
plt.xlabel('Date')
plt.show()

- Now it's clear that sales differ strongly between stores and between departments, which probably also causes the mean to differ from the median. We'll confirm this below.

In [None]:
# creating datasets for next plot
weekly_sales_mean = train_test['Weekly_Sales'].groupby(train_test['Date']).mean()
weekly_sales_median = train_test['Weekly_Sales'].groupby(train_test['Date']).median()

# plotting
plt.figure(figsize=(15,5))
plt.plot(weekly_sales_mean.index, weekly_sales_mean.values)
plt.plot(weekly_sales_median.index, weekly_sales_median.values)
plt.legend(['Mean', 'Median'], loc='best', fontsize=16)
plt.title('Weekly Sales - Mean and Median', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.xlabel('Date', fontsize=16)
plt.show()

In [None]:
train_test[['Weekly_Sales']].describe()

- The median is much lower than the mean. This is caused by a couple of stores/departments that have huge weekly sales.
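
The effect described above is easy to reproduce on a handful of made-up numbers: one very large department is enough to pull the mean far above the median.

```python
import pandas as pd

# Hypothetical weekly sales for five departments, one of them much larger than the rest
toy = pd.DataFrame({
    'Dept': ['1', '2', '3', '4', '5'],
    'Weekly_Sales': [1_000.0, 2_000.0, 3_000.0, 4_000.0, 90_000.0],
})

mean_sales = toy['Weekly_Sales'].mean()      # pulled up by Dept 5
median_sales = toy['Weekly_Sales'].median()  # does not depend on how large Dept 5 is
```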

# 2. Feature Engineering

At this point we have concluded our EDA. The next step is to look whether we can create more meaningful features which could be used in the final model for sales predictions. 

## 2.1 Basic Feature Changes

In [None]:
# Again concatenating test and train datasets
train['dataset'] = 'train'
test['dataset'] = 'test'
train_test = pd.concat([train, test])

# Merge all data
train_test = train_test.merge(stores, how='left').merge(features, how='left')

# Creating date-time objects and some extra date-time info
train_test['Date'] = pd.to_datetime(train_test['Date'])
train_test['Year'] = train_test['Date'].dt.year
train_test['Month'] = train_test['Date'].dt.month
# .dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
train_test['Week'] = train_test['Date'].dt.isocalendar().week.astype(int)
train_test['DayOfTheWeek'] = train_test['Date'].dt.dayofweek
train_test['DayOfTheMonth'] = train_test['Date'].dt.day

# labelencoding the type column by order
train_test['Type_encoded'] = train_test['Type'].replace({'A': 3, 'B': 2,'C':1})

# Fahrenheit to Celsius
train_test['Temperature'] = round((train_test['Temperature'] - 32) * 5/9,2)

# Creating a column with markdown_totals
train_test['markdown_totals'] = train_test[['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']].sum(axis=1)

# Department and Store should be categorical values
train_test['Dept'] = train_test['Dept'].apply(str)
train_test['Store'] = train_test['Store'].apply(str)

## 2.2 Feature Engineering Idea 1

- In the plot of sales over the years, we noticed that the IsHoliday feature misses some important dates. To help our future model predict from the holiday feature, we completed the data with more relevant holidays.
- Also, to let our model interpret the feature, we created dummy variables from the new column with holiday types and kept the old IsHoliday column.

In [None]:
# holidays
train_test.loc[train_test.Week==6, 'IsHoliday'] = 'superbowl'
train_test.loc[train_test.Week==22, 'IsHoliday'] = 'memorial'
train_test.loc[train_test.Week==27, 'IsHoliday'] = 'independence'
train_test.loc[train_test.Week==36, 'IsHoliday'] = 'laborday'
train_test.loc[train_test.Week==47, 'IsHoliday'] = 'thanksgiving'
# christmas
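# the lines below fill 'weekday', so create that column (object dtype, since it will hold strings)
train_test['weekday'] = pd.Series(np.nan, index=train_test.index, dtype=object)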
train_test.loc[(train_test.Year==2010) & (train_test.Week==51), 'IsHoliday'] = 'christmas'
train_test.loc[(train_test.Year==2010) & (train_test.Week==51), 'weekday'] = 'friday'
train_test.loc[(train_test.Year==2011) & (train_test.Week==52), 'IsHoliday'] = 'christmas'
train_test.loc[(train_test.Year==2011) & (train_test.Week==52), 'weekday'] = 'monday'
train_test.loc[(train_test.Year==2012) & (train_test.Week==52), 'IsHoliday'] = 'christmas'
train_test.loc[(train_test.Year==2012) & (train_test.Week==52), 'weekday'] = 'tuesday'
train_test.loc[(train_test.Year==2012) & (train_test.Week==51), 'weekday'] = 'before_CM'
# easter
train_test.loc[(train_test.Year==2010) & (train_test.Week==13), 'IsHoliday'] = 'easter'
train_test.loc[(train_test.Year==2011) & (train_test.Week==16), 'IsHoliday'] = 'easter'
train_test.loc[(train_test.Year==2012) & (train_test.Week==14), 'IsHoliday'] = 'easter'
train_test.loc[(train_test.Year==2013) & (train_test.Week==13), 'IsHoliday'] = 'easter'

# setting rest of holidays to NaN
holidays = ['superbowl','laborday','thanksgiving','christmas','easter','independence','memorial']
train_test.loc[~train_test['IsHoliday'].isin(holidays), 'IsHoliday'] = np.nan

# holiday dummies
holiday_dummies = pd.get_dummies(train_test['IsHoliday'], prefix='Holiday')

# extra holiday column for WMAE
train_test['old_IsHoliday'] =[1 if i != 0 else 0 for i in train_test['IsHoliday'].fillna(0)]

# concat dummies with data
train_test = pd.concat([train_test, holiday_dummies], axis=1)
train_test.head()
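
The `old_IsHoliday` column is kept for the WMAE (weighted mean absolute error) metric. A minimal sketch of that metric, assuming holiday weeks get weight 5 and all other weeks weight 1, as the competition's evaluation page describes; the function name `wmae` and the toy numbers are made up for illustration:

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday, holiday_weight=5):
    """Weighted mean absolute error with a larger weight on holiday weeks."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # weight 5 (assumed) for holiday rows, 1 for all other rows
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))

# Toy example: the third week is a holiday week
score = wmae([100, 200, 300], [110, 190, 330], [0, 0, 1])
```

Because the holiday row counts five times, its error of 30 dominates the score even though it is only one of three rows.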

## 2.3 Feature Engineering Idea 2

So far we have created dummy variables for the different holiday types. However, holidays do not affect sales in all departments, so these dummy variables are not yet a great indicator of weekly sales. Therefore we'll try to create a new feature that only applies to the departments that are positively affected by each holiday.

In [None]:
# Long functions - run them once!
def get_affected_depts(holiday):
    # make dataframe with overall means and means on holiday
    overall_mean = train_test[~train_test['IsHoliday'].isin(holidays)].groupby('Dept')[['Weekly_Sales']].mean()
    mean_on_holiday = train_test[train_test['IsHoliday'] == holiday].groupby('Dept')[['Weekly_Sales']].mean()
    means = pd.concat([overall_mean, mean_on_holiday], axis=1, keys=['overall_mean', '%s_mean' %holiday])

    # select departments with higher mean sales on the given holiday than overall
    affected_departments = []
    for label, row in means.iterrows():
        if row['overall_mean'].iloc[0] < row['%s_mean' %holiday].iloc[0]:
            affected_departments.append(label)
    return affected_departments

# create feature that returns True if the department is one of the affected ones by holiday_x and whether holiday_x is True
    # - "NaN"
    # - "LD" = (laborday)
    # - "SB" = (superbowl)
    # - "TG" = (thanksgiving)
    # - "CM" = (christmas) 
    # - "ID" = (independence day)
    # - "EA" = (easter)    
    # - "MD"= (memorial day)
    
# get affected department-numbers
christ_depts = get_affected_depts('christmas')
thanks_depts = get_affected_depts('thanksgiving')
bowl_depts = get_affected_depts('superbowl')
labor_depts = get_affected_depts('laborday')
easter_depts = get_affected_depts('easter')
independence_depts = get_affected_depts('independence')
memorial_depts = get_affected_depts('memorial')

def assign_holidays_to_affected_depts(row):
    if np.logical_and(row.Dept in labor_depts, row.Holiday_laborday == 1) :
        return 'LD' 
    elif np.logical_and(row.Dept in bowl_depts, row.Holiday_superbowl == 1) :
        return 'SB'
    elif np.logical_and(row.Dept in thanks_depts, row.Holiday_thanksgiving == 1) :
        return 'TG'
    elif np.logical_and(row.Dept in christ_depts, row.Holiday_christmas == 1) :
        return 'CM'
    elif np.logical_and(row.Dept in easter_depts, row.Holiday_easter == 1) :
        return 'EA'
    elif np.logical_and(row.Dept in independence_depts, row.Holiday_independence == 1) :
        return 'ID'
    elif np.logical_and(row.Dept in memorial_depts, row.Holiday_memorial == 1) :
        return 'MD'
    else :
        return np.nan
    
# this function creates the new feature - (running takes a while)
IsHoliday_x_and_affected_dept = train_test.apply(lambda i: assign_holidays_to_affected_depts(i), axis=1)

In [None]:
# getting dummies and assign
train_test['IsHoliday_x_and_affected_dept'] = IsHoliday_x_and_affected_dept
hol_af_dummies = pd.get_dummies(train_test['IsHoliday_x_and_affected_dept'], prefix='Af')
train_test = pd.concat([train_test, hol_af_dummies], axis=1)
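
Since the row-wise `apply` above takes a while, the same labelling can also be expressed with boolean masks. This is only a sketch of an alternative, not the notebook's method; the dictionaries `affected` and `codes` and the toy frame are hypothetical stand-ins for the `*_depts` lists and the short holiday codes:

```python
import numpy as np
import pandas as pd

# Toy rows: holiday label per row (NaN = no holiday) and the department as a string
toy = pd.DataFrame({
    'Dept': ['1', '2', '1', '3'],
    'IsHoliday': ['christmas', 'christmas', np.nan, 'easter'],
})
affected = {'christmas': ['1'], 'easter': ['3']}   # holiday -> affected Dept labels (made up)
codes = {'christmas': 'CM', 'easter': 'EA'}        # holiday -> short code

# start with all-missing labels, then fill one holiday at a time
feature = pd.Series(np.nan, index=toy.index, dtype=object)
for holiday, depts in affected.items():
    mask = (toy['IsHoliday'] == holiday) & toy['Dept'].isin(depts)
    feature[mask] = codes[holiday]
toy['IsHoliday_x_and_affected_dept'] = feature
```

Each row carries at most one holiday label, so the order in which the holidays are processed does not change the result.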

In [None]:
sns.barplot(x=train_test.IsHoliday, y=train_test['Weekly_Sales'], order=['superbowl','easter','christmas','memorial','independence','laborday','thanksgiving'])
plt.xticks(ticks=[0,1,2,3,4,5,6], labels=['SB','EA','CM','MD','ID','LD','TG'])
plt.title('Old Holiday Feature', fontsize=14)
plt.ylabel('Mean Weekly_Sales')
plt.show()

In [None]:
sns.barplot(x=train_test.IsHoliday_x_and_affected_dept, y=train_test['Weekly_Sales'])
plt.title('New Holiday Feature', fontsize=14)
plt.ylabel('Mean Weekly_Sales')
plt.show()

- Looking at the y-axis, we can see that the new feature is less affected by certain departments. This lets us further differentiate between the departments and holidays that seem to drive weekly sales.

## 2.4 Feature Engineering Idea 3

For each holiday type there is still a huge difference between the affected departments and the average weekly_sales. The reason is probably that some departments are affected more than others, in absolute or relative terms. A possible solution is to compute the percentage change in weekly_sales of the affected departments, which may give a better indication of what sales could eventually be.

In [None]:
def get_holiday_change(holiday):
    overall_mean = train_test[train_test['old_IsHoliday'] == False].groupby('Dept')[['Weekly_Sales']].mean()
    mean_on_holiday = train_test[train_test['Holiday_' + holiday] == True].groupby('Dept')[['Weekly_Sales']].mean()
    means = pd.concat([overall_mean, mean_on_holiday], axis=1, keys=['overall_mean', '%s_mean' %holiday])
    means['mean_change_%s' %holiday] = (means['%s_mean' %holiday] - means['overall_mean']) / abs(means['overall_mean']) * 100    
    return means[['mean_change_%s' %holiday]]

holiday_changes = pd.concat([get_holiday_change('thanksgiving'), 
                                 get_holiday_change('christmas'), 
                                 get_holiday_change('laborday'),
                                 get_holiday_change('superbowl'),
                                 get_holiday_change('independence'),
                                 get_holiday_change('easter'),
                                 get_holiday_change('memorial')], axis=1)

def assign_change_per_holiday(row):
    if row.Holiday_thanksgiving == True:
        return holiday_changes[['mean_change_thanksgiving']].loc[row['Dept']][0]
    elif row.Holiday_christmas == True:
        return holiday_changes[['mean_change_christmas']].loc[row['Dept']][0]
    elif row.Holiday_superbowl == True:
        return holiday_changes[['mean_change_superbowl']].loc[row['Dept']][0]
    elif row.Holiday_laborday == True:
        return holiday_changes[['mean_change_laborday']].loc[row['Dept']][0]
    elif row.Holiday_easter == True:
        return holiday_changes[['mean_change_easter']].loc[row['Dept']][0]
    elif row.Holiday_memorial == True:
        return holiday_changes[['mean_change_memorial']].loc[row['Dept']][0]
    elif row.Holiday_independence == True:
        return holiday_changes[['mean_change_independence']].loc[row['Dept']][0]
    else:
        return np.nan
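
The percentage change computed in `get_holiday_change` boils down to `(holiday mean - overall mean) / |overall mean| * 100` per department. A small worked example on made-up numbers; the column `is_holiday` is a simplified stand-in for the holiday dummies:

```python
import pandas as pd

toy = pd.DataFrame({
    'Dept': ['1', '1', '1', '2', '2', '2'],
    'is_holiday': [False, False, True, False, False, True],
    'Weekly_Sales': [100.0, 300.0, 300.0, 50.0, 150.0, 50.0],
})

# baseline: mean sales per department outside holiday weeks
overall = toy[~toy['is_holiday']].groupby('Dept')['Weekly_Sales'].mean()
# mean sales per department in holiday weeks
on_holiday = toy[toy['is_holiday']].groupby('Dept')['Weekly_Sales'].mean()

# relative change in percent; abs() keeps the sign meaningful for a negative baseline
pct_change = (on_holiday - overall) / overall.abs() * 100
```

Here department 1 sells 50% more than its baseline on the holiday and department 2 sells 50% less.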

In [None]:
# Long function, only run once!
percentual_change = train_test.apply(lambda i: assign_change_per_holiday(i), axis=1)

In [None]:
# assign percentage change
train_test['percentual_change'] = percentual_change

# plot
train_test.plot('percentual_change','Weekly_Sales', kind='scatter', alpha=0.5)
plt.show()

- Looking at the resulting plot we have to conclude that this feature will not be able to contribute to a better predictive model.

## 2.5 Feature Engineering Idea 4

To further distinguish between holidays and affected departments we want to create a feature which also considers store-type. 

In [None]:
# creating a feature that combines affected departments per holiday and types. E.g: Af_CM_TA, Af_CM_TB, Af_CM_TC, etc...
for i in ['Af_CM', 'Af_EA', 'Af_LD', 'Af_SB', 'Af_TG', 'Af_ID', 'Af_MD']:
    selected_data = train_test[train_test[i] == 1]
    TA = selected_data[selected_data['Type'] == 'A']['Type'].index
    TB = selected_data[selected_data['Type'] == 'B']['Type'].index
    TC = selected_data[selected_data['Type'] == 'C']['Type'].index
    train_test[i + '_TA'] = 0
    train_test[i + '_TB'] = 0
    train_test[i + '_TC'] = 0
    train_test.loc[TA, i + '_TA'] = 1
    train_test.loc[TB, i + '_TB'] = 1
    train_test.loc[TC, i + '_TC'] = 1
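
Each of these columns is an interaction between a holiday dummy and a store type: it is 1 only when both conditions hold. A compact sketch of the same idea on toy data, shown for a single `Af_CM` column with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    'Type': ['A', 'B', 'C', 'A'],
    'Af_CM': [1, 1, 0, 0],
})

# one interaction column per store type: 1 if the row is an affected Christmas row of that type
for store_type in ['A', 'B', 'C']:
    col = 'Af_CM_T' + store_type
    toy[col] = ((toy['Af_CM'] == 1) & (toy['Type'] == store_type)).astype(int)
```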

## 2.6 Feature Engineering Idea 5

In the EDA we saw that in some instances the markdowns preceded the sales. To implement this in the model we create a feature that places the relevant markdown values in the same week as the sales they precede. We also created a feature that indicates whether a certain markdown type is present in a given period.

In [None]:
# markdown presence
for n in range(1, 6):
    train_test['md%d_present' % n] = train_test['MarkDown%d' % n].notnull().astype(int)

# markdown lag: shift every markdown column down by one row
# (same result as dropping the last row and prepending a NaN, without the hard-coded index 536633)
# note: this is a plain row shift, so it does not restart at Store/Dept boundaries
for n in range(1, 6):
    train_test['mark%d_lag' % n] = train_test['MarkDown%d' % n].shift(1)
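
The lag above moves the whole column down by one row, so the first row of each store/department series receives a value from the previous series. A possible variant, sketched here on toy data and not used in this notebook, restarts the lag within each Store/Dept group; the column names `mark1_lag_grouped` and `mark1_lag_plain` are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    'Store': ['1', '1', '1', '2', '2'],
    'Dept': ['1', '1', '1', '1', '1'],
    'Date': pd.to_datetime(['2012-01-06', '2012-01-13', '2012-01-20',
                            '2012-01-06', '2012-01-13']),
    'MarkDown1': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# sort so that "previous row" means "previous week" within each series
toy = toy.sort_values(['Store', 'Dept', 'Date'])
# lag that restarts for every Store/Dept combination
toy['mark1_lag_grouped'] = toy.groupby(['Store', 'Dept'])['MarkDown1'].shift(1)
# plain row shift for comparison: leaks the last value of store 1 into store 2
toy['mark1_lag_plain'] = toy['MarkDown1'].shift(1)
```

The two columns differ only in the first row of store 2, where the grouped version is missing and the plain version carries over the value 30 from store 1.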

## 2.7 Feature Engineering Idea 6

Earlier we found that weekly_sales is most affected by 'Store', 'Department', 'Week' and special events such as holidays. Therefore we created a feature that calculates the average sales grouped by these features. We also computed the total weekly_sales per store and week, split by year, so that we can divide the former feature by the store's total weekly_sales and obtain the percentage attributable to each department. Finally, because Christmas falls on a Tuesday in 2012 and on a Monday in 2011, we have to adjust the results so that the model won't predict the same sales for Christmas 2012 as were observed in weeks 51 and 52 of 2011.

In [None]:
train_test['IsHoliday'] = train_test['IsHoliday'].replace({np.nan: 'NA'})
# copy the slice so the assignment below does not trigger a SettingWithCopyWarning
first_part = train_test[['Store','Dept', 'IsHoliday', 'Week', 'weekday', 'Weekly_Sales']].copy()
first_part['weekday'] = first_part['weekday'].replace({np.nan: 'NA'})
# average sales per store, department, holiday, week and weekday
first_part = first_part.groupby(['Store','Dept', 'IsHoliday', 'Week', 'weekday'])[['Weekly_Sales']].mean()
print(first_part.shape)
# inspect the Christmas rows of week 52
first_part_flat = first_part.reset_index()
first_part_flat[(first_part_flat['IsHoliday'] == 'christmas') & (first_part_flat['Week'] == 52)].head()

In [None]:
second_part = train_test[(train_test.Year == 2012) & (train_test.Week.isin([51,52]))].groupby(['Store','Dept', 'IsHoliday', 'Week', 'weekday'])[['Weekly_Sales']].mean()
second_part = second_part.reset_index()
second_part.head()

In [None]:
# finding the right values for the 2012 data for weeks 51 and 52
for label, row in second_part.iterrows():
    store = row['Store']
    dept = row['Dept']
    weekday = row['weekday']

    try:
        if weekday == 'before_CM': # not officially Christmas week yet, but many purchases are already made
            weekly_sale_before_CM = first_part.loc[(store, dept, 'christmas', 51, 'friday'), 'Weekly_Sales'] * 0.85 #2010
            second_part.loc[label, 'Weekly_Sales'] = weekly_sale_before_CM

        elif weekday == 'tuesday': # officially Christmas week, but because Christmas falls on a Tuesday the purchases are divided with the previous week
            weekly_sale_CM = first_part.loc[(store, dept, 'christmas', 51, 'friday'), 'Weekly_Sales'] * 0.7 #2010
            second_part.loc[label, 'Weekly_Sales'] = weekly_sale_CM

    except KeyError:
        # the department wasn't in the train data, or not in the 2010 train data
        second_part.loc[label, 'Weekly_Sales'] = np.nan # we'll later impute it with a median of stores per week and holiday

In [None]:
# create a df with values for all of the rows excluding the NaN-values
weekly_dept_mean = pd.merge(first_part, second_part, on=['Store','Dept','IsHoliday','Week','weekday'], how='left')
print(weekly_dept_mean.shape)
weekly_dept_mean.fillna(0, inplace=True)
weekly_dept_mean['Weekly_Sales'] = weekly_dept_mean['Weekly_Sales_x'] + weekly_dept_mean['Weekly_Sales_y']
weekly_dept_mean['Weekly_Sales'] = weekly_dept_mean['Weekly_Sales'].replace({0: np.nan})
print(weekly_dept_mean.isna().sum())

# departments for which some values are still NaN; they are not in the train data
missing_sales = weekly_dept_mean[weekly_dept_mean['Weekly_Sales'].isna()]
display(missing_sales['Dept'].value_counts().sort_index(), missing_sales)
weekly_dept_mean = weekly_dept_mean[['Store','Dept','IsHoliday','Week','weekday','Weekly_Sales']].copy()

# impute rows of departments that only exist in the test data with the median of the store per week and holiday
store_medians = train_test[['Store', 'IsHoliday', 'Week', 'Weekly_Sales']].copy()
store_medians['IsHoliday'] = store_medians['IsHoliday'].replace({np.nan: 'NA'})
store_medians = store_medians.groupby(['Store','Week','IsHoliday'])[['Weekly_Sales']].median()

weekly_dept_mean.fillna(0, inplace=True)
weekly_dept_mean['Weekly_Sales'] = weekly_dept_mean.apply(lambda i: store_medians.loc[(i['Store'], i['Week'], i['IsHoliday']), 'Weekly_Sales'] if i['Weekly_Sales'] == 0 else i['Weekly_Sales'], axis=1)
weekly_dept_mean = weekly_dept_mean.set_index(['Store','Dept','IsHoliday','Week','weekday'])[['Weekly_Sales']]
weekly_dept_mean

In [None]:
# Long-running cell - run it only once!
train_test['weekday'] = train_test['weekday'].replace({np.nan: 'NA'})

# look up, for every row, the averaged value from the table built above
dept_weekly_mean_per_store = train_test[['Store', 'Dept', 'IsHoliday', 'Week', 'weekday']].apply(lambda i: weekly_dept_mean.loc[(i['Store'], i['Dept'], i['IsHoliday'], i['Week'], i['weekday']), 'Weekly_Sales'], axis=1)

In [None]:
# Long-running cell - run it only once!
# total sales per store, holiday, week and year (despite the variable name this is a sum, not a maximum)
weekly_store_max = train_test.groupby(['Store', 'IsHoliday', 'Week', 'Year'])[['Weekly_Sales']].sum()
# drop the combinations without sales (the test rows sum to 0), then average over the years
weekly_store_max = weekly_store_max[weekly_store_max['Weekly_Sales'] != 0]
weekly_store_max = weekly_store_max.groupby(['Store', 'IsHoliday', 'Week'])[['Weekly_Sales']].mean()
weekly_store_max_total = train_test[['Store', 'IsHoliday', 'Week']].apply(lambda i: weekly_store_max.loc[(i['Store'], i['IsHoliday'], i['Week']), 'Weekly_Sales'], axis=1)
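
The two lookups above call `.loc` once per row, which is why they are slow. A possible alternative, sketched here on hypothetical toy frames and not tested against the real data, is a left merge on the key columns. It attaches the looked-up value to every row in one vectorised step and leaves NaN where a key is missing instead of raising a `KeyError`:

```python
import pandas as pd

# Hypothetical lookup table, shaped like the grouped means in the notebook.
lookup = pd.DataFrame({
    'Store': [1, 1, 2],
    'Week': [5, 6, 5],
    'store_week_mean': [500.0, 650.0, 900.0],
})

# Hypothetical rows that need a value; store 3 is not in the lookup table.
rows = pd.DataFrame({
    'Store': [1, 2, 1, 3],
    'Week': [6, 5, 5, 5],
})

# A left merge keeps every row of `rows`, in order, and fills unmatched keys with NaN.
merged = rows.merge(lookup, on=['Store', 'Week'], how='left')
print(merged)
```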

In [None]:
# assign newly created features to the data
train_test['dept_weekly_mean_per_store'] = dept_weekly_mean_per_store
train_test['weekly_store_mean_total'] = weekly_store_max_total
# getting percentage
train_test['dept_percentage'] = train_test['dept_weekly_mean_per_store'] / train_test['weekly_store_mean_total'] * 100
# resetting IsHoliday column NaN Features
train_test['IsHoliday'] = train_test['IsHoliday'].replace({'NA': np.nan})
train_test['weekday'] = train_test['weekday'].replace({'NA': np.nan})

# 3 Modeling

To build the best possible predictive model we followed a few essential steps. We started with model preparation, where we converted our data into a machine-readable format. After a final look at the linear correlations, we selected the best features and model by evaluating (through trial and error) different feature combinations and models. Finally, we created a submission file based on our best model to achieve our best possible score in the Kaggle competition.

## 3.1 Model Preparation

In [None]:
# get dummies for features that are not machine readable yet
dept_dummies = pd.get_dummies(train_test['Dept'], drop_first=True, prefix='Dept')
store_dummies = pd.get_dummies(train_test['Store'], drop_first=True, prefix='Store')
month_dummies = pd.get_dummies(train_test['Month'], drop_first=True, prefix='Month')
week_dummies = pd.get_dummies(train_test['Week'], drop_first=True, prefix='Week')
year_dummies = pd.get_dummies(train_test['Year'], drop_first=True, prefix='Year')
type_dummies = pd.get_dummies(train_test['Type'], prefix='Type').drop('Type_C',axis=1)
old_IsHoliday_dummies = pd.get_dummies(train_test['old_IsHoliday'], prefix='old_IsHoliday', drop_first=True)

# concat dummies with data
train_test = pd.concat([train_test, type_dummies, week_dummies, month_dummies, store_dummies, dept_dummies, year_dummies, old_IsHoliday_dummies], axis=1)
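
As a small illustration of what `drop_first=True` does in the cell above, the following sketch encodes a hypothetical three-category column with and without it. Dropping one level avoids a redundant column, because the dropped category is implied when all remaining dummies are 0.

```python
import pandas as pd

# Hypothetical toy column with the three store types.
toy = pd.DataFrame({'Type': ['A', 'B', 'C', 'A']})

full = pd.get_dummies(toy['Type'], prefix='Type')
reduced = pd.get_dummies(toy['Type'], prefix='Type', drop_first=True)

print(full.columns.tolist())     # ['Type_A', 'Type_B', 'Type_C']
print(reduced.columns.tolist())  # ['Type_B', 'Type_C']
```

For `Type` the notebook drops `Type_C` explicitly rather than the first level; either choice removes exactly one reference category.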

## 3.2 Correlation of all Model Features

Now that we have all of the new features, let's look at how they correlate with Weekly_Sales.

In [None]:
# pearson correlation (numeric columns only; non-numeric columns raise an error in pandas >= 2.0)
check_corr = train_test.corr(numeric_only=True)

In [None]:
check_corr[['Weekly_Sales']].sort_values('Weekly_Sales', ascending=False).head(10)

In [None]:
# plotting best indicator feature
plt.scatter(train_test['dept_weekly_mean_per_store'], train_test['Weekly_Sales'])
plt.title('New Feature', fontsize=14)
plt.ylabel('Weekly_Sales', fontsize=12)
plt.xlabel('Assigned weekly average sales', fontsize=12)
plt.show()

In [None]:
# Multicollinearity
plt.figure(figsize=(12,8))
corr_features = ['Weekly_Sales', 'Size','dept_weekly_mean_per_store', 'dept_percentage','Dept_92','weekly_store_mean_total']
cor = check_corr.loc[corr_features, corr_features]
threshold = cor < 0.7  # mask (hide) every correlation below 0.7
sns.heatmap(cor, mask=threshold, cmap='Blues')
plt.show()

- Some of the newly created features look like strong predictors of Weekly_Sales. Let's see how they really behave in the models!
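
Besides reading the heatmap, highly correlated feature pairs can also be listed directly. The following is a minimal sketch on hypothetical synthetic columns `a`, `b` and `c`; the 0.7 threshold mirrors the mask used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.1, size=200),  # nearly collinear with a
    'c': rng.normal(size=200),                      # independent noise
})

corr = toy.corr().abs()

# Keep only the upper triangle so every pair appears once and the diagonal is excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs > 0.7]
print(high)
```

Only the pair (`a`, `b`) should be reported here. Applied to `check_corr`, the same steps would list every pair of model features above the threshold.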

## 3.3 Model- and Feature Selection & Hyperparameter Tuning

With all our features combined, we ran multiple model configurations to see which features scored best according to the models' feature-importance attribute. Afterwards we tried different settings and parameters in the models to achieve an even better final score.
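
Impurity-based importances from `feature_importances_` can be biased toward features with many distinct values, so permutation importance on held-out data is a common cross-check. The sketch below is not part of the original workflow. It uses hypothetical synthetic data and scikit-learn's `permutation_importance`, with small settings chosen only to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 2))  # column 0 is informative, column 1 is pure noise
y_toy = 3 * X_toy[:, 0] + rng.normal(scale=0.1, size=300)

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
rf_toy = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the validation set and measure the drop in score.
result = permutation_importance(rf_toy, X_va, y_va, n_repeats=5, random_state=0)
print(result.importances_mean)
```

The informative column should receive a clearly larger mean importance than the noise column.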

In [None]:
# all features in lists for feature selection
affected_features = ['Af_CM', 'Af_EA', 'Af_LD', 'Af_SB', 'Af_TG', 'Af_ID', 'Af_MD']
affected_features_types = ['Af_CM_TA', 'Af_EA_TA', 'Af_LD_TA', 'Af_SB_TA', 'Af_TG_TA', 'Af_CM_TB', 'Af_EA_TB', 'Af_LD_TB', 'Af_SB_TB', 'Af_TG_TB', 'Af_CM_TC', 'Af_EA_TC', 'Af_LD_TC', 'Af_SB_TC', 'Af_TG_TC', 'Af_ID_TA', 'Af_ID_TB', 'Af_ID_TC', 'Af_MD_TA', 'Af_MD_TB', 'Af_MD_TC']

dept_features = dept_dummies.columns.to_list()
type_features = type_dummies.columns.to_list()
store_features = store_dummies.columns.to_list()
week_features = week_dummies.columns.to_list()
month_features = month_dummies.columns.to_list()
year_features = year_dummies.columns.to_list()

markpresent_features = ['md1_present', 'md2_present', 'md3_present', 'md4_present', 'md5_present']
marklag_features = ['mark1_lag', 'mark2_lag', 'mark3_lag', 'mark4_lag', 'mark5_lag']
markdown_features = ['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']
other_features = ['Size', 'old_IsHoliday', 'dept_weekly_mean_per_store', 'weekly_store_mean_total', 'dept_percentage']

# The final features-combination-input
feature_names = other_features + dept_features

# get data
X = train_test[train_test['dataset'] == 'train'][feature_names].fillna(0)
y = train_test[train_test['dataset'] == 'train']['Weekly_Sales']

In [None]:
# split data
X_train, X_valid, y_train, y_valid = train_test_split(X,y, test_size=0.6)

# modeling 
rf = RandomForestRegressor(verbose=2, n_estimators=57)

# fitting model
rf.fit(X_train, y_train)
names = train_test[feature_names]

print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names), reverse=True)) # see link 4

In [None]:
# neural network
nn = MLPRegressor(verbose=2)
nn.fit(X_train,y_train.astype(int))

The competition is evaluated on the Weighted Mean Absolute Error (WMAE), with a weight of 5 for holiday weeks and 1 otherwise. The function below implements this formula.
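
Written out, with $y_i$ the actual sales of row $i$, $\hat{y}_i$ the predicted sales, and $w_i = 5$ for holiday weeks and $w_i = 1$ otherwise, the metric is:

$$
\mathrm{WMAE} = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \left| y_i - \hat{y}_i \right|
$$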

In [None]:
# scoring
def WMAE(data, real, predicted):
    # holiday weeks count five times as much as regular weeks
    weights = np.where(data['old_IsHoliday'] == 1, 5, 1)
    return np.round(np.sum(weights * np.abs(real - predicted)) / np.sum(weights), 2)
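
As a quick sanity check of the metric, here is a worked example with three hypothetical rows. The helper `wmae` is a stand-alone variant written for this sketch: it takes the holiday flags directly instead of a DataFrame and does not round.

```python
import numpy as np

def wmae(is_holiday, real, predicted):
    """Weighted MAE with weight 5 for holiday rows and 1 otherwise (illustrative helper)."""
    weights = np.where(np.asarray(is_holiday) == 1, 5, 1)
    errors = np.abs(np.asarray(real, dtype=float) - np.asarray(predicted, dtype=float))
    return np.sum(weights * errors) / np.sum(weights)

is_holiday = [0, 1, 0]
real = [100.0, 200.0, 300.0]
predicted = [110.0, 180.0, 300.0]

# (1*10 + 5*20 + 1*0) / (1 + 5 + 1) = 110 / 7
score = wmae(is_holiday, real, predicted)
print(round(score, 2))  # 15.71
```

The single holiday row contributes 100 of the 110 weighted error points, which shows how strongly holiday weeks dominate the score.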

In [None]:
# predicting with RandomForest
y_pred_rf = rf.predict(X_valid)
print('rf-WMAE:', WMAE(X_valid, y_valid, y_pred_rf))

# plotting rf-predictions against y_valid
plt.scatter(y_pred_rf, y_valid)
plt.title('Random Forest')
plt.show()

plt.scatter(X_valid.index, y_valid)
plt.scatter(X_valid.index, y_pred_rf, c='red', alpha=0.1)
plt.show()

In [None]:
# predicting with Neural Network
y_pred_nn = nn.predict(X_valid)
print('nn-WMAE:', WMAE(X_valid, y_valid, y_pred_nn))

# plotting nn-predictions against y_valid
plt.scatter(y_pred_nn, y_valid)
plt.title('Neural Net')
plt.show()

plt.scatter(X_valid.index, y_valid)
plt.scatter(X_valid.index, y_pred_nn, c='red', alpha=0.1)
plt.show()

## 3.4 Final Model & Submission

Finally, based on the WMAE, we selected the best model, hyperparameters and feature combination to predict on the given test set.

In [None]:
# for now we take X and y to fit the model on all the data
rf = RandomForestRegressor(verbose=2, n_estimators=57, random_state=42)
rf.fit(X, y)
names = train_test[feature_names]

print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names), reverse=True)) # see link 4

In [None]:
# neural network
nn = MLPRegressor(verbose=2, random_state=42)
nn.fit(X,y.astype(int))

In [None]:
# predicting on test data for submission
test_predict = train_test[train_test['dataset'] == 'test'][feature_names].fillna(0)
test_y_pred = rf.predict(test_predict)

# # neural_net predicting on test data for submission
# test_predict = train_test[train_test['dataset'] == 'test'][feature_names].fillna(0)
# test_y_pred = nn.predict(test_predict)

In [None]:
# submitting
testfile = pd.concat([train_test[train_test['dataset'] == 'test'].reset_index(drop=True), pd.DataFrame(test_y_pred)], axis=1).rename(columns={0:'prediction'})

submission = pd.DataFrame({'id':pd.Series([''.join(list(filter(str.isdigit, x))) for x in testfile['Store'].astype(str)]).map(str) + '_' +
                           pd.Series([''.join(list(filter(str.isdigit, x))) for x in testfile['Dept'].astype(str)]).map(str)  + '_' +
                           testfile['Date'].astype(str).map(str),
                          'Weekly_Sales':testfile['prediction']})
submission.head()

In [None]:
submission.to_csv('Walmart_submission.csv',index=False)
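
The submission id has the form Store_Dept_Date. Assuming `Store` and `Dept` are integer columns and `Date` is already a `YYYY-MM-DD` string (assumptions about the data, which is why the notebook's digit filter may still be needed), the id can be built by plain string concatenation. The sketch below uses a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy test rows with predictions.
toy = pd.DataFrame({
    'Store': [1, 1],
    'Dept': [1, 2],
    'Date': ['2012-11-02', '2012-11-09'],
    'prediction': [25000.0, 48000.5],
})

toy_submission = pd.DataFrame({
    'id': toy['Store'].astype(str) + '_' + toy['Dept'].astype(str) + '_' + toy['Date'].astype(str),
    'Weekly_Sales': toy['prediction'],
})
print(toy_submission)
```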

## 3.5 Inspect predictions 

Let's plot the new predictions over time and see how they differ from the train data!

In [None]:
# making datasets for next plot
weekly_sales_2012_test = testfile[testfile['Year'] == 2012].groupby(testfile['Week'])[['prediction']].mean()
weekly_sales_2013_test = testfile[testfile['Year'] == 2013].groupby(testfile['Week'])[['prediction']].mean()

plt.figure(figsize=(15,5))

# working on xticks labels with dates
weeks = [str(i) for i in range(1,53)]
weeks[5] = 'superbowl - 6'
weeks[12] = 'easter - 2010 & 2013 / 13'
weeks[13] = 'easter - 2012 / 14'
weeks[15] = 'easter - 2011 / 16'
weeks[21] = 'memorial day - 22'
weeks[26] = 'independence day - 27'
weeks[35] = 'laborday - 36'
weeks[46] = 'thanksgiving - 47'
weeks[50] = 'christmas - 2010 / 51'
weeks[51] = 'christmas - 2011 & 2012 / 52'

# plotting 
plt.plot(weekly_sales_2010.index, weekly_sales_2010['Weekly_Sales'])
plt.plot(weekly_sales_2011.index, weekly_sales_2011['Weekly_Sales'])
plt.plot(weekly_sales_2012.index, weekly_sales_2012['Weekly_Sales'])

plt.plot(weekly_sales_2012_test.index, weekly_sales_2012_test['prediction'], marker='D')
plt.plot(weekly_sales_2013_test.index, weekly_sales_2013_test['prediction'], marker='D')

plt.xticks(np.arange(1, 53, step=1), labels=weeks, rotation=90)
plt.legend(['2010','2011','2012','2012_predictions', '2013_predictions'], fontsize=16)
plt.title('Average predicted Weekly Sales - Per week & year', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.xlabel('Week', fontsize=16)
plt.show()

### Bibliography
1. https://www.kaggle.com/avelinocaio/walmart-store-sales-forecasting
2. https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/notebooks?sortBy=voteCount&group=everyone&pageSize=20&competitionId=3816
3. https://www.startpagina.nl/v/wetenschap/wiskunde/vraag/660756/bereken-correlatie-tussen-nominaal-meetniveau/
4. https://blog.datadive.net/selecting-good-features-part-iii-random-forests/
5. https://www.timeanddate.com/holidays/us/?hol=1
6. https://towardsdatascience.com/explaining-feature-importance-by-example-of-a-random-forest-d9166011959e
7. https://explained.ai/rf-importance/index.html#5
8. https://www.inc.com/bill-murphy-jr/heres-crazy-reason-why-holiday-season-is-6-days-shorter-in-2019.html

### <i>Points of improvement</i>
1. 1.4 Further EDA on important Features
    - Sales are higher when there are holidays. We could further explore to what extent this is caused by markdowns.
2. 3.4 Final Model & Submission
    - Even though it seems that we found a great explanatory variable for Weekly_Sales, there are plenty of better scores on Kaggle than ours. We therefore expect the test data to differ somewhat from the train data on which we fitted the model. To improve our model we could explore these differences further and adjust for them.
    - The test data seems to differ from the train data in a couple of aspects:
        - There are departments we need to predict for that are not in the train data
        - Some of the holidays fall on other dates/weeks, which makes them harder to predict
    - Also, there seem to be more official holidays that could be added to make better use of the holiday feature
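
As a starting point for the first difference listed above, the Store/Dept combinations that occur only in the test data can be found with a left merge and `indicator=True`. This is a minimal sketch on hypothetical toy frames; on the real data the two frames would be the train and test parts of `train_test`.

```python
import pandas as pd

# Hypothetical toy data; department 99 of store 2 appears only in the test set.
train = pd.DataFrame({'Store': [1, 1, 2], 'Dept': [1, 2, 1]})
test = pd.DataFrame({'Store': [1, 2, 2], 'Dept': [1, 1, 99]})

train_pairs = train[['Store', 'Dept']].drop_duplicates()
test_pairs = test[['Store', 'Dept']].drop_duplicates()

# '_merge' is 'left_only' for test pairs that have no match in the train pairs.
flagged = test_pairs.merge(train_pairs, on=['Store', 'Dept'], how='left', indicator=True)
unseen = flagged.loc[flagged['_merge'] == 'left_only', ['Store', 'Dept']]
print(unseen)
```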