# Project Objective

The main purpose of this project is to correctly predict sales, especially in holiday weeks, despite not having complete or ideal historical data. Part of the challenge is to model the effects of markdowns on these holiday weeks.

The data for this project is available in:
https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

pd.options.display.max_columns = 30

## Importing files

Next, we will import and analyze all data files in order to evaluate all the information we have. From the "Data Description" section from Kaggle, we can see that there are date columns in "train.csv", "test.csv" and "features.csv" files. Hence, we will import the "Date" column from these files as “datetime64” data type.

## Train file

The first file is the train file. Each row of this file represents one week of sales for one department of one store. It also reports how much was sold and whether that week is a special holiday week.

In [None]:
#importing data
train = pd.read_csv("../input/walmart-recruiting-store-sales-forecasting/train.csv.zip", parse_dates = ["Date"])

def snake_case(dataframe):
    #convert column names to snake_case
    dataframe.columns = dataframe.columns.str.lower()
    dataframe = dataframe.rename(columns = {"isholiday" : "is_holiday"})
    return dataframe

#converting column names to snake_case
train = snake_case(train)

#exploring the data
print(train.shape)
train.head()

## Test file

It contains the same columns as the train file, except for "weekly_sales", which is our target. The test file has about 27% as many rows as the train file.

In [None]:
#importing data
test = pd.read_csv("../input/walmart-recruiting-store-sales-forecasting/test.csv.zip", parse_dates = ["Date"])

#converting column names to snake_case
test = snake_case(test)

#exploring the data
print(test.shape)
print(test.shape[0] / train.shape[0]) #how long is the test dataframe in comparison to train the dataframe
test.head()

## Features file

The features are stored separately from the train and test dataframes. They must be merged with both the train and test dataframes in order to train the model and predict the weekly sales. The merge will be done on the "store" and "date" columns.

In [None]:
#importing data
features = pd.read_csv("../input/walmart-recruiting-store-sales-forecasting/features.csv.zip", parse_dates = ["Date"])

#converting column names to snake_case
features = snake_case(features)

#exploring the data
print(features.shape)
features.head()

## Stores file

The characteristics of each store are also separated from the train and test dataframes. Therefore, the stores dataframe will also be merged with both of them, on the "store" column.

In [None]:
#importing data
stores = pd.read_csv("../input/walmart-recruiting-store-sales-forecasting/stores.csv")

#converting column names to snake_case
stores = snake_case(stores)

#exploring the data
print(stores.shape)
stores.head()

## Sample Submission file

The sample submission file contains one Id column that is not present in the train dataframe nor in the test dataframe. Hence, the Id column will be created in both test and train dataframes by concatenating the "store", "dept", and "date" with underscores.
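
As a minimal sketch of the Id construction described above, the snippet below builds the identifier on a toy dataframe. The toy values, the `id` column name and the `YYYY-MM-DD` date format are assumptions made for illustration; the format should be checked against the sample submission file.

```python
import pandas as pd

# Toy rows standing in for the test dataframe (values are made up)
toy = pd.DataFrame({
    "store": [1, 1],
    "dept": [1, 2],
    "date": pd.to_datetime(["2012-11-02", "2012-11-09"]),
})

# Build the Id as "<store>_<dept>_<YYYY-MM-DD>" (date format is an assumption)
toy["id"] = (
    toy["store"].astype(str)
    + "_" + toy["dept"].astype(str)
    + "_" + toy["date"].dt.strftime("%Y-%m-%d")
)

print(toy["id"].tolist())
```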

In [None]:
#importing data
sample = pd.read_csv("../input/walmart-recruiting-store-sales-forecasting/sampleSubmission.csv.zip")

#exploring the data
print(sample.shape)
sample.head()

## Checking for duplicated information

We will check whether the data contain duplicated information. This matters because duplicated keys would add new rows when the dataframes are merged. For train and test, we should have one row for each "store-dept-date" combination. For features, we should have one row for each "store-date" combination. For the stores dataframe, we must have one row for each store. Finally, the sample should contain one row for each "Id".

In [None]:
#checking for duplicates. If the sum is greater than 0, the dataframe has duplicated rows.
print("duplicated in train:", train.duplicated(subset = ["store", "dept", "date"]).sum())
print("duplicated in test:", test.duplicated(subset = ["store", "dept", "date"]).sum())
print("duplicated in features:", features.duplicated(subset = ["store", "date"]).sum())
print("duplicated in stores:", stores.duplicated(subset = ["store"]).sum())
print("duplicated in sample:", sample.duplicated(subset = ["Id"]).sum())

There are no duplicated rows and we can proceed to check for missing values.
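
As an additional safeguard, pandas can enforce key uniqueness at merge time. The sketch below uses the `validate` argument of `pd.merge` on toy dataframes; the toy names and values are made up for illustration.

```python
import pandas as pd

# Toy frames (made-up values): two sales rows for one store, one stores row
toy_sales = pd.DataFrame({"store": [1, 1], "dept": [1, 2]})
toy_stores = pd.DataFrame({"store": [1], "size": [151315]})

# validate="many_to_one" raises pandas.errors.MergeError if "store"
# is duplicated in the right-hand dataframe
merged = pd.merge(toy_sales, toy_stores, how="left", on="store",
                  validate="many_to_one")

# A left merge against unique keys keeps the row count unchanged
print(len(merged) == len(toy_sales))
```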

## Checking for missing values

We will check how much missing data we have and choose the best way to deal with that.

In [None]:
train.info()

No missing values in the "train" dataframe.

In [None]:
test.info()

No missing values in the "test" dataframe.

In [None]:
stores.info()

No missing values in the "stores" dataframe.

In [None]:
features.info()

There are missing values in the "features" dataframe. We have to deal with them in order to apply machine learning techniques. We will investigate these data further.

In [None]:
#importing data visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

#plotting missing data
fig_missing_data, ax_missing_data = plt.subplots(figsize = (18,6))
sns.heatmap(features.isnull(), ax = ax_missing_data)

ax_missing_data.set_title("Missing data by column")
ax_missing_data.set_ylabel("Row index")

plt.show()

We will have to deal with 7 columns: all five "markdown" columns, the "cpi" column and the "unemployment" column. We will start with "cpi" and "unemployment", which have much less missing data than any of the five "markdown" columns.

## Dealing with "cpi" and "unemployment" missing values

From the "Data Description" section from Kaggle, "cpi" stands for consumer price index and "unemployment" is the unemployment rate. Both vary only a little over time, so one idea is to fill each missing value with the mean of the preceding week and the subsequent week. We will plot both columns to check whether this is a valid option.
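
A minimal sketch of this idea, on a made-up toy series: for a single missing week with known neighbours, linear interpolation gives exactly the mean of the preceding and subsequent values. It only applies when both neighbours exist.

```python
import numpy as np
import pandas as pd

# Toy weekly series with one missing week (values are made up)
toy_cpi = pd.Series(
    [210.0, np.nan, 212.0],
    index=pd.to_datetime(["2012-01-06", "2012-01-13", "2012-01-20"]),
)

# Linear interpolation treats the points as equally spaced, so a single
# gap is filled with the mean of its two neighbours
filled = toy_cpi.interpolate(method="linear")
print(filled.iloc[1])  # 211.0
```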

In [None]:
#registering converters to avoid warning
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

def plot_feature(axes_object, dataframe, column_name, store = None):
    #plot single store
    if store is not None:
        dataframe_mask = dataframe["store"] == store
        
        hue = None #only one line will be plotted
        
        title = "Store {} - {} data".format(store, column_name.title())
    
    #plot all stores
    else:
        # creating mask with only true values
        dataframe_mask = pd.Series(np.ones(len(dataframe), dtype=bool), index = dataframe.index)
        
        hue = dataframe[dataframe_mask]["store"] #create one line for each store
        
        title = "All Stores - {} data".format(column_name.title())
    
    #plotting
    sns.lineplot(x = dataframe[dataframe_mask]["date"], y = dataframe[dataframe_mask][column_name], hue = hue, ax = axes_object)
    
    #setup
    axes_object.set_title(title)
    axes_object.set_xlabel("Date")
    axes_object.set_ylabel(column_name.title())
    
    plt.setp(axes_object.xaxis.get_majorticklabels(), rotation=45)

    
#creating figure with four graphs
fig_time, axs_time = plt.subplots(2, 2, figsize = (15, 15))
    
plot_feature(axs_time[0, 0], features, "cpi", store = 1)
plot_feature(axs_time[1, 0], features, "unemployment", store = 1)
plot_feature(axs_time[0, 1], features, "cpi")
plot_feature(axs_time[1, 1], features, "unemployment")

If values were missing in the middle of the series, we would expect gaps in the lines. All the lines are continuous, so the missing values must occur either at the first dates or at the last dates of each store.

From the heatmap we plotted earlier, it seems like both "cpi" and "unemployment" data are missing in similarly spaced chunks of rows. Since the data from features dataframe is ordered by store and then by date, this could mean that the missing data occurs at the same dates for all stores.

Now we will check if the missing values for "cpi" and "unemployment" really occur in the beginning or the end of the time series.

In [None]:
#Checking if "cpi" and "unemployment" data are missing in the same rows
(features["cpi"].isnull() != features["unemployment"].isnull()).sum() # When summing boolean values, True equals 1 
                                                                      # and False equals 0

Both "cpi" and "unemployment" have missing data in the same rows. Thus, we can use either of these columns to create a mask for missing values.

In [None]:
#creating mask
missing_mask = features["cpi"].isnull()

#dates with missing cpi and unemployment data
missing_dates = features[missing_mask]["date"].unique()
sorted(missing_dates)

In [None]:
#all dates in the dataframe
all_dates = features["date"].unique()
sorted(all_dates)

As expected, the missing data occurs in the last 13 weeks of the time series. It seems that from 2013-05-03 onward, no more "cpi" and "unemployment" data was collected. We will verify if that happens to every store.

In [None]:
#column that identifies whether cpi is null
features["cpi_isnull"] = missing_mask

# pivot table to check if "cpi" and "unemployment" data are missing in the same dates for all stores
pivot_missing_cpi = features.pivot_table(values = "cpi_isnull", index = "date", columns = "store", aggfunc = "sum")
pivot_missing_cpi

There are 45 stores. If the cpi is missing for every store in a specific date (row), the sum of all the values of the row will be 45. Similarly, if all values are False, the sum of all values of that row will be 0.

In [None]:
#counting how many stores have missing "cpi" data for each date
count_missing_stores = pivot_missing_cpi.sum(axis = 1)

# printing the value counts to be sure that all values will be either 0 (no missing data for all stores) 
# or 45 (all stores missing cpi data)
print(count_missing_stores.value_counts())
count_missing_stores.tail(15)

From the above, we can conclude that all stores are missing "cpi" and "unemployment" data from 2013-05-03 onward. If we want to use "cpi" and "unemployment" as features to predict future sales, the lack of data for the most recent dates can be quite troublesome. For instance, it is likely that the train file contains older sales and the test file contains more recent sales. If that is true, all missing "cpi" and "unemployment" data will occur in the test file and none in the train file.

As a consequence, if we want to train and test our model using "cpi" and "unemployment" data as features, we will have to manually fill the missing values, all of which are present in the test file. Hence, our model will likely perform far worse on the test set in comparison to the training set.

We have to verify if all missing values for "cpi" and "unemployment" occur on dates that are present exclusively in the test dataframe.

In [None]:
# checking how many times "cpi" and "unemployment" will be missing in each dataframe, after merging with features
missing_cpi_train = train["date"].isin(missing_dates).sum()
missing_cpi_test = test["date"].isin(missing_dates).sum()

print(missing_cpi_train, missing_cpi_test)

All missing "cpi" and "unemployment" values fall in the test dataframe and none in the train dataframe. About one third of the 115064 rows of the test dataframe (38162 rows) will have missing values in these columns. For now, we will fill the missing values with the value from the closest date, but we must remember that dropping these columns entirely may turn out to be the better option.

The last date with data for the "cpi" and "unemployment" columns is 2013-04-26. The value from that date will be used to fill the missing values.
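
The same last-known-value fill can also be sketched per store with a grouped forward fill. The toy dataframe and its values below are made up for illustration; this is an alternative formulation under those assumptions, not a replacement for the cell that follows.

```python
import numpy as np
import pandas as pd

# Toy features table: two stores, last week missing (values are made up)
toy = pd.DataFrame({
    "store": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime(["2013-04-19", "2013-04-26", "2013-05-03"] * 2),
    "cpi": [224.0, 225.0, np.nan, 130.0, 131.0, np.nan],
    "unemployment": [6.3, 6.3, np.nan, 8.0, 7.9, np.nan],
})

cols = ["cpi", "unemployment"]

# Sort so that "forward" means forward in time within each store
toy = toy.sort_values(["store", "date"])

# Propagate the last known value of each store into its missing rows
toy[cols] = toy.groupby("store")[cols].ffill()
print(toy)
```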

In [None]:
#dropping "cpi_isnull" column that was used specifically for our previous sanity check
features = features.drop("cpi_isnull", axis = 1)

#slicing the dataframe with the date of 2013-04-26
features_2013_04_26 = features[features["date"] == "2013-04-26"]

#filling the CPI and Unemployment missing values with the data from the day 2013-04-26 for each store
for store in range(1, 46):
    
    #values to be used to fill
    cpi_value =  features_2013_04_26[features_2013_04_26["store"] == store]["cpi"].iloc[0]
    unemployment_value =  features_2013_04_26[features_2013_04_26["store"] == store]["unemployment"].iloc[0]
    
    #filling the missing values
    indexes = features[(features["store"] == store) & features["cpi"].isnull()].index
    features.loc[indexes, "cpi"] = cpi_value
    features.loc[indexes, "unemployment"] = unemployment_value

Finally, we will plot "cpi" and "unemployment" columns again to see if the data in the last weeks of the time series are now present in the charts and to check if the values seem far from reality.

In [None]:
fig_time, axs_time = plt.subplots(2, 2, figsize = (15, 15))
    
plot_feature(axs_time[0, 0], features, "cpi", store = 1)
plot_feature(axs_time[1, 0], features, "unemployment", store = 1)
plot_feature(axs_time[0, 1], features, "cpi")
plot_feature(axs_time[1, 1], features, "unemployment")

Using the last known value to fill the missing values seems to be an appropriate approximation.

## Dealing with "markdown" missing values

Now that we have dealt with the "cpi" and "unemployment" columns, we will deal with the remaining five "markdown" columns, which have much more missing data than "cpi" and "unemployment".

In [None]:
#markdown column names
mark_cols = ["markdown{}".format(count) for count in range(1,6)]

#percentage of missing values
features[mark_cols].isnull().sum() / len(features[mark_cols])

More than half of all data is missing. We could drop these columns entirely or try to fill them with a meaningful value. Since "markdown" is likely to have an influence on sales, we will keep these columns for now.

From the "Data Description" section from Kaggle, the following is explained:

"MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA."

From that, we can infer that a missing "markdown" value is probably not a true zero: the markdown may have existed but was not recorded. Since we cannot be sure whether NA means zero or an unknown non-zero value, we will try both approaches and keep the one that gives us the best results.

## Filling with zero approach

The first approach will be to fill all missing data with zeroes.

In [None]:
features_zero = features.copy()
features_zero[mark_cols] = features[mark_cols].fillna(0).copy()
features_zero

In [None]:
features_zero.info()

## Filling with mean approach

The second approach will be to fill all missing data with the mean value of the column. From Kaggle's Data Description:

"...Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas."

Thus, "markdown" values may be significantly different depending on whether it is a holiday week or not. With this in mind, a better approach is to calculate two means for each "markdown" column: one for holiday weeks and another for non-holiday weeks.

In [None]:
#calculating the mean of "markdown" columns
markdown_holiday = features.groupby("is_holiday")[mark_cols].mean()
markdown_holiday

As expected, "markdown" values are quite different during holiday weeks. We will fill the missing values with different means for holiday weeks and non-holiday weeks.
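
A compact way to express a group-wise mean fill is `groupby(...).transform("mean")`. The sketch below applies it to one made-up "markdown1" column; the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data: two holiday rows and three non-holiday rows (values are made up)
toy = pd.DataFrame({
    "is_holiday": [True, True, False, False, False],
    "markdown1": [100.0, np.nan, 10.0, 30.0, np.nan],
})

# Mean of each is_holiday group, broadcast back to the original rows
# (missing values are ignored when computing the mean)
group_mean = toy.groupby("is_holiday")["markdown1"].transform("mean")

# Fill each missing value with the mean of its own group
toy["markdown1"] = toy["markdown1"].fillna(group_mean)
print(toy["markdown1"].tolist())  # [100.0, 100.0, 10.0, 30.0, 20.0]
```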

In [None]:
#identifying the index of holiday rows and non holiday rows
features_holiday_index = features[features["is_holiday"]].index
features_not_holiday_index = features[~features["is_holiday"]].index

features_mean = features.copy()

#filling with the appropriate mean
features_mean.loc[features_holiday_index, mark_cols] = (features_mean.loc[features_holiday_index, 
                                                                          mark_cols
                                                                         ].fillna(markdown_holiday.iloc[1])) # Holiday
features_mean.loc[features_not_holiday_index, mark_cols] = (features_mean.loc[features_not_holiday_index, 
                                                                              mark_cols
                                                                             ].fillna(markdown_holiday.iloc[0])) # Non Holiday
features_mean

In [None]:
features_mean.info()

## Merging Data

All features we treated so far are related to the stores and none are related to the departments. Hence, our machine learning model will at first predict the sales at each date for each store. The sales of each department will then be predicted from the share that each department represents in the sales of its store.

Therefore, the first step we will take before merging the data is to group the sales of the train dataframe and the test dataframe by date and by store. Only then will we merge the features with the train and test dataframes. Finally, the boolean column "is_holiday" will be converted to 1 and 0.
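
To make the store-to-department step concrete, here is a minimal sketch under stated assumptions: the share of each department is taken as its fraction of the total historical sales of its store, and a hypothetical store-level prediction is split with those shares. The toy values and the names `history`, `share`, `predicted_sales` and `predicted_dept_sales` are made up for illustration.

```python
import pandas as pd

# Toy history: store 1 with two departments over two weeks (values are made up)
history = pd.DataFrame({
    "store": [1, 1, 1, 1],
    "dept": [1, 2, 1, 2],
    "weekly_sales": [300.0, 100.0, 500.0, 100.0],
})

# Total historical sales of each department in each store
dept_sales = history.groupby(["store", "dept"], as_index=False)["weekly_sales"].sum()

# Share of each department in the total sales of its store
dept_sales["share"] = (
    dept_sales["weekly_sales"]
    / dept_sales.groupby("store")["weekly_sales"].transform("sum")
)

# Hypothetical store-level prediction to be distributed across departments
store_prediction = pd.DataFrame({"store": [1], "predicted_sales": [1000.0]})

dept_prediction = dept_sales[["store", "dept", "share"]].merge(store_prediction, on="store")
dept_prediction["predicted_dept_sales"] = (
    dept_prediction["share"] * dept_prediction["predicted_sales"]
)
print(dept_prediction)
```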

In [None]:
#merging features with stores dataframes
features_zero = pd.merge(features_zero, stores, how = "left", on = ["store"])
features_mean = pd.merge(features_mean, stores, how = "left", on = ["store"])

#function to group train and test dataframes
def group_dataframe(dataframe, is_train = True):
    #grouping dataframe by store and date
    grouped = dataframe.groupby(["store", "date"])
    
    if is_train:
        #sum of sales and mean of is_holiday, selected by name instead of by position
        grouped = grouped.agg(weekly_sales = ("weekly_sales", "sum"),
                              is_holiday = ("is_holiday", "mean"))
    
    else:
        #the test dataframe has no sales, so only the mean of is_holiday is kept
        grouped = grouped.agg(is_holiday = ("is_holiday", "mean"))
        
    #return dataframe with indexes reset
    return grouped.reset_index()

# merging features with train and test dataframes. We are leaving "is_holiday" column from features
# out to avoid duplicating columns
feature_df_cols = ['store', 'date', 'temperature', 'fuel_price', 'markdown1', 'markdown2',
                   'markdown3', 'markdown4', 'markdown5', 'cpi', 'unemployment', 'type', 'size']

grouped_train = group_dataframe(train, True)
grouped_test = group_dataframe(test, False)

train_zero = pd.merge(grouped_train, features_zero[feature_df_cols], how = "left", on = ["store", "date"])
train_mean = pd.merge(grouped_train, features_mean[feature_df_cols], how = "left", on = ["store", "date"])

#Converting boolean to zero or one
train_zero["is_holiday"] = train_zero["is_holiday"].astype(int)
train_mean["is_holiday"] = train_mean["is_holiday"].astype(int)

## Exploratory analysis

Now we will investigate which features seem to influence the sales. We will use only the train_zero dataframe in our analysis, rather than both train_zero and train_mean, because they are the same except for the "markdown" columns. When investigating the "markdown" columns we will use both to see how they differ.

In [None]:
#Creating exploratory dataframe
exploratory = train_zero.copy()

#function to create time related columns
def add_time_columns(dataframe):
    #add time related columns
    dataframe["year"] = dataframe["date"].dt.year
    dataframe["month"] = dataframe["date"].dt.month
    dataframe["year_month"] = dataframe["date"].dt.to_period("M")
    dataframe["week"] = dataframe["date"].dt.isocalendar().week.astype(int) #ISO week number; .dt.week was removed in pandas 2.0
    
    return dataframe

#creating dummy function to explore categorical columns
def add_dummy(column_name, dataframe):
    # add dummy columns
    dummy_df = pd.get_dummies(dataframe[column_name], prefix = column_name, dtype = int) #integer dummies so they can be normalized later
    dataframe = pd.concat([dataframe, dummy_df], axis = 1)
    return dataframe

#adding time related columns and dummy columns to dataframes
exploratory = add_time_columns(exploratory)
exploratory = add_dummy('type', exploratory)

exploratory.head()

Firstly, we will evaluate how sales change monthly to see if we can find any pattern.

In [None]:
#monthly sales
exploratory.groupby("year_month")["weekly_sales"].sum().plot()
plt.title("Monthly Sales")
plt.ylabel("Revenue")
plt.show()

Clearly, sales are much higher in December than in other months. We will zoom in and plot the dates (weeks) on the x axis to see whether this peak is spread throughout the month or concentrated in one week.

In [None]:
#weekly sales
exploratory.groupby("date")["weekly_sales"].sum().plot()
plt.title("Weekly Sales")
plt.ylabel("Revenue")
plt.show()

It seems that sales are higher throughout December and peak in one week of December. To see whether "is_holiday" is a useful feature for our model, we will plot a vertical line for each holiday date and see if the lines match the peaks.

In [None]:
#list of holidays from the exploratory set
holiday_weeks = exploratory.query("is_holiday == 1")["date"].unique()

#weekly sales
exploratory.groupby("date")["weekly_sales"].sum().plot()
plt.title("Weekly Sales")
plt.ylabel("Revenue")

#plot holidays
for holiday in holiday_weeks: 
    plt.axvline(holiday, c = "red", lw = 0.5 )

plt.show()

Surprisingly, only Thanksgiving (end of November) matches a significant peak. The other holidays do not match a peak, and December sales are higher before the Christmas holiday week. Hence, the "is_holiday" column is unlikely to correlate well with weekly sales.

In [None]:
#correlation (numeric columns only)
correlation_table = np.abs(exploratory.corr(numeric_only = True)["weekly_sales"]).sort_values(ascending = False)

#removing weekly_sales
correlation_table = correlation_table.drop("weekly_sales")

sns.barplot(x = correlation_table.values, y = correlation_table.index, orient = "h")
plt.show()

The "is_holiday" column correlates very poorly with "weekly_sales". Moreover, it seems that many other columns do not correlate strongly either. The size of the store and its type are the strongest indicators of the sales potential of a store.

However, as we saw previously, there is some seasonality in sales and we have no feature to capture that information. Instead of using "is_holiday" to capture seasonality in sales, we will create a column that fits those peaks in sales better. First, we will create a table to see if the strong sales occur in the same weeks of the year, every year.

In [None]:
pd.options.display.max_rows = 150
#table to see in what week of the year sales are the strongest
exploratory.groupby("date")[["weekly_sales", "is_holiday", "week"]].agg(["sum", "mean"]).iloc[:, [0,3,5]]

Sales are the strongest in weeks 47, 49, 50 and 51, every year. We will create one column called "is_strong_sales" to identify these strong sales weeks. This new feature will probably correlate better with "weekly_sales" than "is_holiday".

In [None]:
#creating new column
exploratory["is_strong_sales"] = 0

# it is 1 if weeks are 47, 49, 50 or 51
strong_weeks = [47, 49, 50, 51]

exploratory["is_strong_sales"] = exploratory["is_strong_sales"].mask(exploratory["week"].isin(strong_weeks), 1)

We will plot our correlation table again to see how well our new column performs.

In [None]:
#correlation (numeric columns only)
correlation_table = np.abs(exploratory.corr(numeric_only = True)["weekly_sales"]).sort_values(ascending = False)

#removing weekly_sales
correlation_table = correlation_table.drop("weekly_sales")

sns.barplot(x = correlation_table.values, y = correlation_table.index, orient = "h")
plt.show()

Now we have one new column to prepare our model for seasonality effects. From the figure above, we can identify what are likely our best features so far: "size", "type", "is_strong_sales" and "markdown". Now, we must verify whether markdown is a better feature when N/A is filled with zeroes or with mean values.

In [None]:
#changing the name of exploratory dataframe for standardization
exploratory_zero = exploratory

#Creating exploratory dataframe using train_mean instead of train_zero
exploratory_mean = train_mean.copy()

#adding time related columns and dummy columns to dataframes
exploratory_mean = add_time_columns(exploratory_mean)
exploratory_mean = add_dummy('type', exploratory_mean)

#creating new column
exploratory_mean["is_strong_sales"] = 0

# it is 1 if weeks are 47, 49, 50 or 51
exploratory_mean["is_strong_sales"] = exploratory_mean["is_strong_sales"].mask(exploratory_mean["week"].isin(strong_weeks), 1)

exploratory_mean.head()

In [None]:
#correlation (numeric columns only)
correlation_table_mean = np.abs(exploratory_mean.corr(numeric_only = True)["weekly_sales"]).sort_values(ascending = False)

#removing weekly_sales
correlation_table_mean = correlation_table_mean.drop("weekly_sales")

sns.barplot(x = correlation_table_mean.values, y = correlation_table_mean.index, orient = "h")
plt.show()

Using train_mean instead of train_zero, the correlation of "markdown5" and "markdown1" with weekly_sales increased while the correlation of "markdown2", "markdown3" and "markdown4" decreased. However, the difference was small, so we will double check it with a Machine Learning model.

## Using Machine Learning to choose fill N/A method

We must decide which version of the exploratory dataframe to use: the zero version or the mean version. For that, we will use both with K-Nearest Neighbors and see which of them performs better.

First, we will normalize the data in order to avoid unintentionally weighting the features.

In [None]:
def normalize(list_of_columns, dataframe):
    #normalize dataframe making it range from 0 to 1
    normal_df = ((dataframe[list_of_columns] - dataframe[list_of_columns].min()) / 
                 (dataframe[list_of_columns].max() - dataframe[list_of_columns].min()))

    return normal_df

#features
numerical_cols = ['is_holiday', 'temperature', 'fuel_price', 'markdown1', 'markdown2', 'markdown3', 'markdown4', 'markdown5', 
                  'cpi', 'unemployment', 'size', 'type_A', 'type_B', 'type_C', 'is_strong_sales', 'week', 'month']

normal_train_zero = normalize(numerical_cols, exploratory_zero)
normal_train_mean = normalize(numerical_cols, exploratory_mean)

target = exploratory_zero["weekly_sales"]

We will shuffle the rows, run 10-fold cross-validation and calculate the root mean squared error using K-Nearest Neighbors. Each dataframe version (mean or zero) will be evaluated twice: once using all features and once using the top 6 features, "size", "type_A", "type_B", "markdown1", "markdown5" and "is_strong_sales". Note that "store" was not considered a feature because its number is simply an ID, and "type_C" was not selected as one of the best features because it is collinear with "type_A" and "type_B".

In [None]:
# importing models
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold, cross_val_score

#top 6 features
top_features = ["size", "type_A", "type_B", "markdown1", "markdown5", "is_strong_sales"]

#models
knn = KNeighborsRegressor()
kf = KFold(10, shuffle = True)

#root_mean_squared_error
knn_zero_all_features_rmse = (-np.mean(cross_val_score(knn, normal_train_zero, target, cv = kf, 
                                                 scoring = "neg_mean_squared_error")))**(1/2)
knn_zero_six_features_rmse = (-np.mean(cross_val_score(knn, normal_train_zero[top_features], target, cv = kf, 
                                                 scoring = "neg_mean_squared_error")))**(1/2)

knn_mean_all_features_rmse = (-np.mean(cross_val_score(knn, normal_train_mean, target, cv = kf, 
                                                 scoring = "neg_mean_squared_error")))**(1/2)
knn_mean_six_features_rmse = (-np.mean(cross_val_score(knn, normal_train_mean[top_features], target, cv = kf, 
                                                 scoring = "neg_mean_squared_error")))**(1/2)

#zero dataframe
print("zero version error:", knn_zero_all_features_rmse, knn_zero_six_features_rmse)

#mean dataframe
print("mean version error:", knn_mean_all_features_rmse, knn_mean_six_features_rmse)

As the zero version tends to be slightly better, we will use only that version from now on.

## Dummy regressor

Our model must perform better than a minimum benchmark. We will predict the mean sales value for every row and calculate the error, which will be considered our minimum benchmark. From now on, we will calculate the Weighted Mean Absolute Error, which is the official metric of this challenge. The error formula is available at:

https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/overview/evaluation
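
For reference, the metric can be written as follows, where $y_i$ is the actual sales value of row $i$, $\hat{y}_i$ is the predicted value, $n$ is the number of rows, and $w_i$ is the weight of the row (5 if the week is a holiday week, 1 otherwise):

$$
\mathrm{WMAE} = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \left| y_i - \hat{y}_i \right|
$$

With these weights, an error made in a holiday week counts five times as much as the same error in a regular week.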

In [None]:
#Importing dummy regressor and cross_val_predict
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_predict

#model
dr = DummyRegressor()
kf = KFold(10, shuffle = True)

#predictions
dummy_predictions = cross_val_predict(dr, normal_train_zero, target, cv = kf)

#function to calculate the Weighted Mean Absolute Error
def wmae(predictions, correct_value, is_holiday_column):
    #size of the series/vector
    size = len(correct_value)

    #creating series object with weights set to 1
    weights = pd.Series(np.ones(size), index = correct_value.index).astype(int)

    #changing weights to 5 when it is holiday 
    weights = weights.mask(is_holiday_column == 1, 5)

    #error metric
    wmae_value = (np.abs(correct_value - predictions) * weights).sum() / weights.sum()
    
    return wmae_value

#calculating the error
dummy_wmae = wmae(dummy_predictions, target, normal_train_zero["is_holiday"])

print("dummy error:", dummy_wmae)

## Predicting sales using Machine Learning
We will use a "Random Forests" model to predict the sales of each store because it handles non-linearities in our data well and tends to overfit less than "Decision Trees". First, we will select the best features for the model. Second, we will choose the best parameters by using "Grid Search".

Finally, in a later section, we will take the predicted store sales and distribute them proportionally according to how relevant each department is to each store. With this distribution we can calculate the error of our model.

## Random Forests - Feature Selection

We will start by selecting the best features for our model. From "numerical_cols", we have the following available features:
```python
numerical_cols = ['is_holiday', 'temperature', 'fuel_price', 'markdown1', 'markdown2', 'markdown3', 'markdown4', 
                  'markdown5', 'cpi', 'unemployment', 'size', 'type_A', 'type_B', 'type_C', 'is_strong_sales', 'week', 
                  'month']
```
We will use recursive feature elimination in order to optimize the feature selection. Since we are not fixing a random state, the selected features vary. We will run the feature selection 10 times and select the most recurring features.

In [None]:
#importing libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

#function to select the best features by recursive feature elimination
def select_features(X_train, y_train):
    #recursive feature elimination in random forests, 10 fold, shuffling rows
    rfr = RandomForestRegressor(n_estimators = 10)
    kf = KFold(10, shuffle = True)
    
    #fitting recursive feature elimination model
    selector = RFECV(rfr, cv = kf)
    selector.fit(X_train,y_train)
    
    #best features
    best_columns = list(X_train.columns[selector.support_])
    
    return best_columns

#assigning features and target
X_train = normal_train_zero
y_train = target

#dictionary to hold 10 features selection by recursive feature elimination
dic_features = {}

for count in range(10):
    features = select_features(X_train, y_train)
    dic_features[count] = features

## Resulting dictionary

I saved the dictionary so that the code does not need to be run again, since it takes a long time.
Now we will count how many times each feature occurred and select those that appeared often.

In [None]:
#resulting dictionary
dic_features = {
                    0: ['temperature', 'fuel_price', 'markdown3', 'cpi', 'unemployment', 'size', 'type_A', 'type_B', 
                        'is_strong_sales', 'week'],

                    1: ['cpi', 'unemployment', 'size', 'is_strong_sales', 'week'],

                    2: ['temperature', 'fuel_price', 'markdown3', 'markdown4', 'cpi', 'unemployment', 'size', 'type_A', 
                        'type_B', 'is_strong_sales', 'week', 'month'],

                    3: ['temperature', 'fuel_price', 'markdown3', 'markdown4', 'cpi', 'unemployment', 'size', 'type_A', 
                        'type_B', 'is_strong_sales', 'week', 'month'],

                    4: ['cpi', 'unemployment', 'size', 'is_strong_sales', 'week'],

                    5: ['cpi', 'unemployment', 'size', 'type_A', 'is_strong_sales', 'week'],

                    6: ['temperature', 'fuel_price', 'markdown3', 'markdown4', 'cpi', 'unemployment', 'size', 'type_A', 
                        'type_B', 'type_C', 'is_strong_sales', 'week', 'month'],

                    7: ['cpi', 'unemployment', 'size', 'is_strong_sales', 'week'],

                    8: ['cpi', 'unemployment', 'size', 'is_strong_sales', 'week'],

                    9: ['temperature', 'fuel_price', 'markdown1', 'markdown3', 'markdown4', 'markdown5', 'cpi', 
                        'unemployment', 'size', 'type_A', 'type_B', 'type_C', 'is_strong_sales', 'week', 'month']
                }

# one big list to hold the features of all runs
features_list = []

for run in dic_features:
    #adding each run to the list
    features_list += dic_features[run]

#counting how often each feature occurred
pd.Series(features_list).value_counts()

Every feature that appeared four times or more will be selected. The others will be discarded to avoid overfitting the model.
We already discussed collinearity issues with type_C, so it is no surprise that it was not selected. Markdown 5 and Markdown 1 may also not be important for predicting sales. Some columns never appeared at all: "is_holiday", for instance, may not be a good indicator of sales. The column "is_strong_sales", which we created, and "week" may describe seasonality better than "is_holiday".
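The threshold rule can also be applied in code rather than by reading the counts. The snippet below is a minimal sketch on a toy dictionary; `toy_runs` and `frequent_features` are hypothetical names used only for illustration, and the toy threshold of 2 stands in for the threshold of 4 used in this notebook.

```python
from collections import Counter

# Toy stand-in for dic_features: run index -> features kept in that run
toy_runs = {
    0: ["cpi", "size", "week", "temperature"],
    1: ["cpi", "size", "week"],
    2: ["cpi", "size", "week", "temperature", "markdown1"],
    3: ["cpi", "size", "week"],
}

def frequent_features(runs, min_count):
    """Return the features selected in at least `min_count` runs."""
    # Flatten all runs into one stream of feature names and count them
    counts = Counter(feature for selected in runs.values() for feature in selected)
    # Keep only the features that reach the threshold, most frequent first
    return [feature for feature, count in counts.most_common() if count >= min_count]

selected = frequent_features(toy_runs, min_count=2)
print(selected)
```

Calling `frequent_features(dic_features, min_count=4)` on the real dictionary should give the same set of columns that is picked by hand in the next section.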

## Tuning Random Forests
Having selected the features, we can now select the best performing hyperparameters of the Random Forests model. Again, we will run the parameter optimization algorithm 10 times without setting a random_state in order to see which parameters recur most often.
We will adopt the most recurring configuration.

In [None]:
#selected features
features_list = ['temperature', 'fuel_price', 'markdown3', 'markdown4', 'cpi', 'unemployment', 'size', 'type_A', 
                 'type_B', 'is_strong_sales', 'week', 'month']

#importing grid search and k-fold for model tuning
from sklearn.model_selection import GridSearchCV, KFold

#importing again to be able to run this cell even if feature selection cell was not run
from sklearn.ensemble import RandomForestRegressor

#assigning again to be able to run this cell even if feature selection cell was not run
X_train = normal_train_zero
y_train = target

#function to select the best parameters of Random Forests
def select_hyperparams(features_columns, X_train, y_train):
    
    hyperparameters = {
                        "n_estimators": [10],
                        "max_depth": [None, 8, 13, 18],
                        "min_samples_leaf": [1, 4],
                        "min_samples_split": [2, 4, 5, 6, 7]
                      }
    
    rfr = RandomForestRegressor(n_jobs = 4)
    kf = KFold(10, shuffle = True)

    grid = GridSearchCV(rfr, param_grid = hyperparameters, cv = kf)
    grid.fit(X_train[features_columns], y_train)
    best_params = grid.best_params_
    best_score = grid.best_score_
    return best_params, best_score

#dictionaries to hold 10 hyperparameters selection and scores
dic_params = {}
dic_score = {}

for count in range(10):
    best_params, best_score = select_hyperparams(features_list, X_train, y_train)
    dic_params[count] = best_params
    dic_score[count] = best_score

## Resulting dictionary

I saved the dictionary so that the code does not need to be run again, since it takes a long time.
Now we will see which configuration occurred most often.

In [None]:
dic_params = {
                0: {'max_depth': 13, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 10},
                1: {'max_depth': 18, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 10},
                2: {'max_depth': 18, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 10},
                3: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 4, 'n_estimators': 10},
                4: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 10},
                5: {'max_depth': 18, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 10},
                6: {'max_depth': 13, 'min_samples_leaf': 1, 'min_samples_split': 4, 'n_estimators': 10},
                7: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 10},
                8: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 4, 'n_estimators': 10},
                9: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 4, 'n_estimators': 10}
             }

The most recurring configuration was max_depth = None, min_samples_leaf = 1 and min_samples_split = 4 (3 of the 10 runs). Although we ran the search with 10 estimators, increasing the number of estimators averages over more trees and reduces the variance of the predictions. However, this benefit has diminishing returns and comes at the expense of training time. Hence, we will use 100 estimators.
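The tally of configurations can be done programmatically as well. The following is a small sketch on toy data; `toy_params` and `most_common_config` are hypothetical names, not part of the notebook. Dictionaries are not hashable, so each configuration is converted to a sorted tuple of its items before counting.

```python
from collections import Counter

# Toy stand-in for dic_params: run index -> best parameters found in that run
toy_params = {
    0: {"max_depth": None, "min_samples_split": 4},
    1: {"max_depth": 18, "min_samples_split": 5},
    2: {"max_depth": None, "min_samples_split": 4},
}

def most_common_config(params_by_run):
    """Return the most frequent parameter dictionary and how often it occurred."""
    # Sorting by parameter name gives every identical configuration the same key
    counts = Counter(tuple(sorted(params.items())) for params in params_by_run.values())
    config, occurrences = counts.most_common(1)[0]
    return dict(config), occurrences

best_config, times_seen = most_common_config(toy_params)
print(best_config, times_seen)
```

With only 10 runs the winner can be decided by a single vote, so it is worth checking the runner-up configurations too.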

## Predicting using Random Forests

After selecting the features and tuning the Random Forests model, we can finally make predictions.

In [None]:
#importing again to be able to run this cell even if previous cells were not run
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

#assigning again to be able to run this cell even if previous cells were not run
X_train = normal_train_zero
y_train = target

rfr = RandomForestRegressor(n_estimators = 100, max_depth = None, min_samples_leaf = 1, min_samples_split = 4)
kf = KFold(10, shuffle = True)

rfr_predictions = cross_val_predict(rfr, X_train[features_list], y_train, cv = kf)
wmae(rfr_predictions, y_train, X_train["is_holiday"])

This is the prediction error for the weekly sales of each store. It is much better than our dummy prediction (WMAE: 479000) and also better than our k-nearest neighbors prediction (WMAE: 189000), although the latter was not tuned for hyperparameters or features. These error values may vary between runs because of the cross-validation shuffling.
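To make these numbers easier to interpret, here is a tiny worked example of the weighted mean absolute error. It is only a sketch: `wmae_sketch` is a hypothetical stand-in for the `wmae` function defined earlier, and it assumes the competition's weighting of 5 for holiday weeks and 1 for all other weeks.

```python
import numpy as np

def wmae_sketch(predictions, actual, is_holiday, holiday_weight=5):
    """Weighted mean absolute error; holiday rows count `holiday_weight` times."""
    predictions = np.asarray(predictions, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Assumed weighting: holiday weeks weigh 5, regular weeks weigh 1
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1)
    return (np.abs(actual - predictions) * weights).sum() / weights.sum()

# Absolute errors are 10, 10 and 30; only the last week is a holiday week
toy_error = wmae_sketch([100, 200, 300], [110, 190, 330], [0, 0, 1])
print(toy_error)  # (10*1 + 10*1 + 30*5) / (1 + 1 + 5) = 170 / 7
```

A miss in a holiday week costs five times as much as the same miss in a regular week, which is why holiday behaviour matters so much for this metric.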

Next, we need to translate these store predictions into department predictions. We could create new features for each department and build a whole new model to predict the sales of each department, or we could tackle the issue with a simpler solution: distributing the prediction of each store proportionally, according to how representative each department is in that store for that week of the year.

For now, we will adopt the latter solution and see if our error is within a desired range.
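As an illustration of the idea, the sketch below distributes one store-level prediction over two departments. The data and the names `toy_sales`, `share` and `store_prediction` are made up for this example only.

```python
import pandas as pd

# Toy historical sales: one store, two weeks of the year, two departments
toy_sales = pd.DataFrame({
    "store": [1, 1, 1, 1],
    "week": [5, 5, 6, 6],
    "dept": [1, 2, 1, 2],
    "weekly_sales": [30.0, 70.0, 50.0, 50.0],
})

# Sales per department for each (store, week) pair
by_dept = toy_sales.pivot_table(values="weekly_sales", index=["store", "week"],
                                columns="dept", aggfunc="sum", fill_value=0)

# Share of each department within its store and week (each row sums to 1)
share = by_dept.div(by_dept.sum(axis=1), axis=0)

# A store-level prediction for week 5 is split according to those shares
store_prediction = 200.0
dept_predictions = share.loc[(1, 5)] * store_prediction
print(dept_predictions)
```

In week 5, department 1 historically made 30% of the store's sales, so it receives 30% of the predicted 200.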

In [None]:
# creating train dataframe with week column
reference_train = train.copy()
# Series.dt.week was removed in pandas 2.0; isocalendar().week returns the same ISO week number
reference_train["week"] = reference_train["date"].dt.isocalendar().week.astype(int)

#pivot_table with weekly sales by store and by department
dept_reference_train = reference_train.pivot_table("weekly_sales", ["store", "week"], "dept", aggfunc = "sum", fill_value = 0)

#weekly sales by store
store_weekly_sales = dept_reference_train.sum(axis = 1)

#finding how representative each department is within a store at a certain week of the year
proportion_sales = dept_reference_train.div(store_weekly_sales, axis = 0)

proportion_sales.head()

We will create a function to check this table and find how representative a department is in a store and use the apply method on both train and test dataframes. By doing this, we will create a new column with the proportions.
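A row-wise `apply` can be slow on a dataframe with several hundred thousand rows. One possible alternative, shown here only as a sketch on toy data, is to reshape the proportion table to long format and merge it. The names `toy_share`, `long_share` and `toy_rows` are hypothetical, and the sketch assumes the table has a (store, week) index and one column per department, like `proportion_sales`.

```python
import pandas as pd

# Toy proportion table: rows are (store, week), columns are departments
toy_share = pd.DataFrame(
    {1: [0.3, 0.5], 2: [0.7, 0.5]},
    index=pd.MultiIndex.from_tuples([(1, 5), (1, 6)], names=["store", "week"]),
)
toy_share.columns.name = "dept"

# Long format: one row per (store, week, dept) with its proportion
long_share = toy_share.stack().rename("proportion").reset_index()

# Rows that need a proportion, in arbitrary order
toy_rows = pd.DataFrame({"store": [1, 1, 1], "week": [6, 5, 5], "dept": [2, 1, 2]})

# A left merge keeps every row and its original order
with_share = toy_rows.merge(long_share, on=["store", "week", "dept"], how="left")
print(with_share)
```

A left merge also leaves `NaN` instead of raising an error when a combination is missing from the table, which makes such cases easy to inspect.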

In [None]:
#function that returns the proportion of each department of each store
def proportion_by_dept(row):
    
    return proportion_sales.loc[(row["store"], row["week"]), row["dept"]]

#copying the train dataframe 
train_predictions = train.copy()

#creating the proportion column
train_predictions["proportion"] = reference_train.apply(proportion_by_dept, axis = 1)
train_predictions.head()

Now we need to concatenate the values predicted by our model into the grouped_train dataframe, which is the train dataframe but with the weekly_sales values grouped by store.

In [None]:
# adding predictions column to the grouped train dataframe
grouped_train_predictions = pd.concat([grouped_train, pd.Series(rfr_predictions)], axis = 1)
grouped_train_predictions = grouped_train_predictions.rename(columns = {0:"store_predictions"})
grouped_train_predictions.head()

Finally, this dataframe will be merged into the train_predictions dataframe in order to bring in the store predictions. The values of this column will be multiplied by the values in the proportion column to calculate the predicted sales of each department of each store.

In [None]:
# adding store predictions column to the train dataframe
train_predictions = pd.merge(train_predictions, grouped_train_predictions[["store", "date", "store_predictions"]],
                             on = ["store", "date"], how = "left")

# predict department sales based on the relevance of the department in each store
train_predictions["predicted_department_sales"] = train_predictions["proportion"] * train_predictions["store_predictions"]
train_predictions.head()

The same function used to calculate the error of the grouped train dataframe can be used to calculate the error for each department. Since sales by department are much lower than sales by store, the absolute error is also much lower.

In [None]:
#weighted error of the department sales predictions
train_wmae = wmae(train_predictions["predicted_department_sales"], train_predictions["weekly_sales"], 
                  train_predictions["is_holiday"])

train_wmae

Now, we will set another minimum benchmark to see how our model compares.

In [None]:
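#importing again to be able to run this cell even if previous cells were not run
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_predict

#Dummy model
dr = DummyRegressor()
kf = KFold(10, shuffle = True)

#predictions (the dummy model ignores the features, but a 2D input is the safer choice for scikit-learn)
dummy_predictions = cross_val_predict(dr, train[["is_holiday"]], train["weekly_sales"], cv = kf)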

#weighted error of the department sales using the dummy model
dummy_wmae = wmae(dummy_predictions, train["weekly_sales"], train["is_holiday"])
dummy_wmae

Our model seems to be significantly better than the dummy benchmark. We will create a dashboard to explore how close our predictions are to the real data.

In [None]:
def plot_results(axes_object, grouped_dataframe, store_id):
    
    #plot single store
    dataframe_mask = grouped_dataframe["store"] == store_id

    hue = None #only one line will be plotted

    title = "Store {} - prediction vs real data".format(store_id)

    #plotting target data (recent seaborn versions require x and y as keyword arguments)
    sns.lineplot(x = grouped_dataframe[dataframe_mask]["date"], y = grouped_dataframe[dataframe_mask]["weekly_sales"], hue = hue,
                 ax = axes_object, color = "blue")
    
    #plotting random forests predictions
    sns.lineplot(x = grouped_dataframe[dataframe_mask]["date"], y = grouped_dataframe[dataframe_mask]["store_predictions"], hue = hue,
                 ax = axes_object, color = "green")

    #setup
    axes_object.set_title(title)
    axes_object.set_xlabel("Date")
    axes_object.set_ylabel("Sales value")
    axes_object.legend(["Weekly Sales", "Predicted Value"])

    #rotating the ticks
    plt.setp(axes_object.xaxis.get_majorticklabels(), rotation=45)
    plt.tight_layout()

def create_dashboard(rows, columns, store_list, grouped_dataframe):
    
    #creating dashboard with rows x columns graphs
    fig, axs = plt.subplots(rows, columns, figsize = (18, 18))

    #plotting one graph for each store
    for count, store_id in enumerate(store_list):
        row = int(count / columns)
        column = count % columns
        plot_results(axs[row, column], grouped_dataframe, store_id)

    plt.show()

#creating a 3x3 dashboard with stores from 1 to 9
create_dashboard(3, 3, range(1,10), grouped_train_predictions)

Our predictions seem to track the real data very well. Now we will prepare our test file to submit it to Kaggle.

## Preparing the test file

We have to prepare our test file the same way we did with the train file. Firstly, we will add the features to the test dataframe.

In [None]:
#creating dataframe with all features
test_zero = pd.merge(grouped_test, features_zero[feature_df_cols], how = "left", on = ["store", "date"])

#Converting boolean to zero or one
test_zero["is_holiday"] = test_zero["is_holiday"].astype(int)

#adding time related columns and dummy columns to dataframes
test_zero = add_time_columns(test_zero)
test_zero = add_dummy('type', test_zero)

#creating new column "is_strong_sales"
test_zero["is_strong_sales"] = 0
test_zero["is_strong_sales"] = test_zero["is_strong_sales"].mask(test_zero["week"].isin(strong_weeks), 1)

test_zero.head()

Secondly, the numeric features will be normalized.

In [None]:
#normalizing numerical columns
normal_test = normalize(numerical_cols, test_zero)
normal_test

Finally, we are ready to predict the sales in the test file using the Random Forests algorithm.

## Predicting sales from the test file

Now we will fit our whole data into the model and predict the weekly sales for each store. After that, we will predict the sales for each department the same way we did with the train dataframe.

Then we are ready to create our submission file.

In [None]:
#assigning again to be able to run this cell even if previous cells were not run
X_train = normal_train_zero
y_train = target
X_test = normal_test

#model with the optimized hyperparameters
rfr_test = RandomForestRegressor(n_estimators = 100, max_depth = None, min_samples_leaf = 1, min_samples_split = 4)

#fitting the model with our selected features 
rfr_test.fit(X_train[features_list], y_train)

#predicting the test file with our selected features
rfr_test_predictions = rfr_test.predict(X_test[features_list])
rfr_test_predictions

We now repeat the same steps as before to predict the sales by department.

In [None]:
#test file with predictions
test_predictions = test.copy()

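#adding week column (Series.dt.week was removed in pandas 2.0; isocalendar().week returns the same ISO week number)
test_predictions["week"] = test_predictions["date"].dt.isocalendar().week.astype(int)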

#creating the proportion column
test_predictions["proportion"] = test_predictions.apply(proportion_by_dept, axis = 1)

# adding predictions column to the grouped test dataframe
grouped_test_predictions = pd.concat([grouped_test, pd.Series(rfr_test_predictions)], axis = 1)
grouped_test_predictions = grouped_test_predictions.rename(columns = {0:"store_predictions"})

# adding store predictions column to the test dataframe
test_predictions = pd.merge(test_predictions, grouped_test_predictions[["store", "date", "store_predictions"]],
                             on = ["store", "date"], how = "left")

# predict department sales based on the relevance of the department in each store
test_predictions["predicted_department_sales"] = test_predictions["proportion"] * test_predictions["store_predictions"]
test_predictions.head()

The "Id" column and the submission file in the correct format will be created in the next cell.
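For reference, the snippet below builds such an identifier on toy data. It is a sketch that assumes the Id has the form store_dept_date with the date written as YYYY-MM-DD, as in the competition's sample submission; `toy_test` and `toy_ids` are hypothetical names.

```python
import pandas as pd

# Toy test rows with a datetime column, as in the test dataframe
toy_test = pd.DataFrame({
    "store": [1, 1],
    "dept": [1, 2],
    "date": pd.to_datetime(["2012-11-02", "2012-11-09"]),
})

# Formatting the date explicitly avoids depending on how datetimes convert to strings
toy_ids = (
    toy_test["store"].astype(str) + "_"
    + toy_test["dept"].astype(str) + "_"
    + toy_test["date"].dt.strftime("%Y-%m-%d")
)
print(toy_ids.tolist())
```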

In [None]:
#creating Id column
test_predictions[['store', 'dept', 'date']] = test_predictions[['store', 'dept', 'date']].astype(str)
test_predictions['Id'] = test_predictions[['store', 'dept', 'date']].agg('_'.join, axis=1)

#creating submission file
my_sample = test_predictions[["Id", "predicted_department_sales"]].copy()
my_sample = my_sample.rename(columns = {"predicted_department_sales" : "Weekly_Sales"})

In [None]:
#saving csv file
my_sample.to_csv("submission.csv", index = False)
my_sample.head()

## Conclusion

The main goal of this project was to predict the weekly sales of each department of each Walmart store, and we took many steps to get there. We started by filling missing (NaN) values with values that seemed consistent with the real ones. Then we engineered new time-related features and one feature to identify strong sales periods, based on what we learned by exploring the data. After that, we noticed that all features were related to the stores and not to the departments of each store. With that in mind, we created a Random Forests model to predict the sales of each store in each week. That value was then distributed over the departments according to how representative each one was in the store at that time of the year.

With a single optimized model we were able to achieve a relatively low weighted mean absolute error (Train data: 1409, Kaggle submission: 3448). To lower the error further, we could create new features from existing ones (binning temperature, CPI and unemployment rate, for instance). We could also tune other algorithms (K-Nearest Neighbors, Linear Regression, Neural Networks), optimizing the features and hyperparameters for each of them. Finally, we could broaden the range of our feature "is_strong_sales" in order to reduce overfitting (the current configuration might be too specific to the train data).
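As a hint of what the binning idea could look like, here is a minimal sketch using `pd.cut`. The thresholds, labels and toy temperatures are hypothetical and were not tuned on the data.

```python
import pandas as pd

# Toy temperatures; the bin edges below are arbitrary and only illustrate the idea
toy_temperature = pd.Series([28.0, 55.0, 71.0, 93.0], name="temperature")
bins = [-float("inf"), 40, 70, float("inf")]
labels = ["cold", "mild", "hot"]

# Each value falls into exactly one interval: (-inf, 40], (40, 70] or (70, inf)
temperature_bin = pd.cut(toy_temperature, bins=bins, labels=labels)

# One-hot encode the bins so that a model can use them as features
binned = pd.get_dummies(temperature_bin, prefix="temperature")
print(binned)
```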

There are many things to do to improve the model even more. It was a long way in these seven days but plenty was achieved!

Thank you for reading!
