# **Walmart Store Sales Prediction**
**made by: Marx Cerqueira**

<img src="https://i.ytimg.com/vi/XRRu9cea1sg/maxresdefault.jpg" alt="some text" width=500 height=400 align="left">


This is an end-to-end data science project that treats weekly sales forecasting as a regression problem adapted for time series. Four machine learning models were built to forecast the weekly sales. The predictions are delivered to users as a submission CSV at the end of the notebook.

**In this notebook, I have included the following contents:**

**Table of Contents**

* [1 Project Solution Planning](#section-one)
* [2 Business Problem](#section-two)
* [3 Imports](#section-three)
* [4 Loading Data](#section-four)
* [5 Data Description](#section-five)
    - [5.1 Rename Columns](#subsection-five-one)
    - [5.2 Data Dimension](#subsection-five-two)
    - [5.3 Data Types](#subsection-five-three)
    - [5.4 Check NA Values](#subsection-five-four)
    - [5.5 Fillout NAs](#subsection-five-five)
    - [5.6 Change dtypes](#subsection-five-six)
    - [5.7 Descriptive Statistics](#subsection-five-seven)
* [6 Feature Engineering](#section-six)
    - [6.1 Hypothesis Mindmap](#subsection-six-one)
    - [6.2 Hypothesis](#subsection-six-two)
    - [6.3 Final Hypothesis List](#subsection-six-three)
    - [6.4 Feature Engineering](#subsection-six-four)
* [7 Variable Filtering](#section-seven)
* [8 Exploratory Data Analysis (EDA)](#section-eight)
    - [8.1 Univariate Analysis](#subsection-eight-one)
    - [8.2 Bivariate Analysis](#subsection-eight-two)
    - [8.3 Multivariate Analysis](#subsection-eight-three)
* [9 Data Preparation](#section-nine)
    - [9.1 Split dataframe into training and validation](#subsection-nine-one)
    - [9.2 Checking Features Outliers Presence](#subsection-nine-two)
    - [9.3 Feature Rescaling](#subsection-nine-three)
    - [9.4 Feature Transformation](#subsection-nine-four)
    - [9.5 Apply Transformations on Validation Dataset](#subsection-nine-five)
* [10 Feature Selection](#section-ten)
* [11 Machine Learning Models](#section-eleven)
    - [11.1 Average Model](#subsection-eleven-one)
    - [11.2 Linear Regression Model](#subsection-eleven-two)
    - [11.3 Linear Regression Model - Lasso](#subsection-eleven-three)
    - [11.4 Random Forest](#subsection-eleven-four)
    - [11.5 XGBoost Regressor](#subsection-eleven-five)
    - [11.6 LightGBM Regressor](#subsection-eleven-six)
    - [11.7 Compare Models Performance](#subsection-eleven-seven)
* [12 Hyperparameter Fine-Tuning](#section-twelve)
    - [12.1 Random Search](#subsection-twelve-one)
    - [12.2 Final Model](#subsection-twelve-two)
* [13 Error Interpretation](#section-thirteen)
    - [13.1 Business Performance - Store Granularity](#subsection-thirteen-one)
    - [13.2 Business Performance - Department Granularity](#subsection-thirteen-two)
    - [13.3 Total Performance](#subsection-thirteen-one)
    - [13.4 Machine Learning Performance](#subsection-thirteen-two)
* [14 Model Submission](#section-fourteen)
    - [14.1 Load Model and Scalers](#subsection-thirteen-one)
    - [14.2 Data ETL](#subsection-thirteen-two)
* [15 Conclusion](#section-fifteen)
* [16 Next Steps](#section-sixteen)

## 1 PROJECT SOLUTION PLANNING
<a id="section-one"></a>

### 1.1 Input

1. Business problem
    - The CFO wants to reinvest in all stores, so he needs to know how much revenue each store will bring in to plan the investment now.
    
2. Datasets:

    - **stores.csv**
    - **train.csv**
    - **test.csv**
    - **features.csv**


### 1.2 Output


1. Deliverables:

- Model's performance and results report with the following topics:
    - What's the weekly sales in dollars of each store and department?
    - Predictions will be available to stakeholders as a CSV file
             
2. Business Report with all insights

### 1.3 Tasks

**Project Development Method**

The project was developed based on the CRISP-DS (Cross-Industry Standard Process - Data Science, a.k.a. CRISP-DM) project management method, with the following steps:

- Project Planning;
- Business Understanding;
- Data Collection;
- Data Cleaning;
- Exploratory Data Analysis (EDA);
- Data Preparation;
- Machine Learning Modelling and fine-tuning;
- Model and Business performance evaluation / Results;
- Model deployment.


<img src="https://www.researchgate.net/profile/Youssef-Tounsi-2/publication/341627969/figure/fig1/AS:903550875996160@1592434724532/CRISP-DM-data-mining-framework.png" alt="some text" width=500 height=400 align="left">


## 2 BUSINESS PROBLEM
<a id="section-two"></a>

Walmart Stores Sales

- A multinational retail corporation that operates a chain of hypermarkets.

- Walmart owns hypermarkets (also called supercenters), discount department stores, and grocery stores, and is based in the United States.

- Business Model: Product sales.

The problem:
- The CFO wants to reinvest in all stores, so he needs to know how much revenue each store will bring in to plan the investment now.

Goal:
- Predict the weekly sales of all stores.

Deliverables:
- Model's performance and results report with the following topics:
    - What's the weekly sales in dollars of each store and department?
    - Predictions will be available in a CSV file that stakeholders can access from a smartphone

## 3 IMPORTS
<a id="section-three"></a>

In [None]:
pip install inflection

In [None]:
import math
import pandas                as pd
import numpy                 as np
import seaborn               as sns
import matplotlib.pyplot     as plt
import datetime
import inflection
import warnings
import random
import pickle
import json

import xgboost               as xgb
import lightgbm              as lgbm
    
from pandas.api.types        import is_string_dtype, is_numeric_dtype
from matplotlib              import gridspec
from scipy                   import stats as ss
from sklearn.preprocessing   import RobustScaler, MinMaxScaler, LabelEncoder
from sklearn.ensemble        import RandomForestRegressor
from sklearn.metrics         import mean_absolute_error, mean_squared_error
from sklearn.linear_model    import LinearRegression, Lasso
from sklearn.model_selection import RandomizedSearchCV
from boruta                  import BorutaPy

from IPython.core.display    import HTML
from IPython.display         import Image

# Python version
from platform                import python_version
print('Python version used in this Jupyter notebook:', python_version())
warnings.filterwarnings( 'ignore' )

### 3.1 Helper Functions

In [None]:
def jupyter_settings():
    %matplotlib inline
    %pylab inline
    
    plt.style.use( 'bmh' )
    plt.rcParams['figure.figsize'] = [25, 12]
    plt.rcParams['font.size'] = 24
    
    display( HTML( '<style>.container { width:100% !important; }</style>') )
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option( 'display.expand_frame_repr', False )
    
    sns.set()
    
def cramer_v( x, y ):
    cm = pd.crosstab( x, y ).values # contingency table
    n = cm.sum()
    r, k = cm.shape
    
    chi2 = ss.chi2_contingency( cm )[0]
    phi2 = chi2 / n
    # bias correction is applied to phi2 (chi2/n), not to chi2 itself
    phi2corr = max( 0, phi2 - (k-1)*(r-1)/(n-1) )
    
    kcorr = k - (k-1)**2/(n-1)
    rcorr = r - (r-1)**2/(n-1)
    
    return np.sqrt( phi2corr / min( kcorr-1, rcorr-1 ) )

def mean_absolute_percentage_error( y, yhat ):
    y, yhat = np.array(y), np.array(yhat)
    return np.mean( np.abs( ( y-yhat ) / y ))

def mean_percentage_error( y, yhat ):
    return np.mean( ( y - yhat ) / y )

# Define the function to evaluate the models
def weighted_mean_absolute_error(df, y, yhat):
    weights = df.is_holiday.apply(lambda x: 5 if x else 1)
    return np.round(np.sum(weights*abs(y-yhat))/(np.sum(weights)), 2)

def ml_error( df,model_name, y, yhat):
    mae = mean_absolute_error( y,yhat )
    mape = mean_absolute_percentage_error( y,yhat )
    rmse = np.sqrt(mean_squared_error( y,yhat ))
    WMAE = weighted_mean_absolute_error(df, y, yhat)
    
    return pd.DataFrame( {'Model Name': model_name,
                          'MAE': mae,
                          'RMSE': rmse,
                          'WMAE': WMAE}, index=[0])

# time-series cross validation implementation
def cross_validation( x_training, kfold, model_name, model, verbose=False ):
    mae_list = []
    mape_list = []
    rmse_list = []
    WMAE_list = []
     
    for k in reversed( range( 1, kfold+1 ) ): #k-fold implementation
        if verbose:
            print( '\nKFold Number: {}'.format( k ) )
        # start and end date for validation 
        start_date_validation = x_training['date'].max() - datetime.timedelta( weeks=k*22) # first week of the validation window
        end_date_validation = x_training['date'].max() - datetime.timedelta( weeks=(k-1)*22) # last week of the validation window

        # filtering dataset
        training = x_training[x_training['date'] < start_date_validation]
        validation = x_training[(x_training['date'] >= start_date_validation) & (x_training['date'] <= end_date_validation)]

        # training and validation dataset
        # training
        xtraining = training.drop( ['date', 'weekly_sales'], axis=1 ) 
        ytraining = training['weekly_sales']

        # validation
        xvalidation = validation.drop( ['date', 'weekly_sales'], axis=1 )
        yvalidation = validation['weekly_sales']

        # model
        m = model.fit( xtraining, ytraining )

        # prediction
        yhat = m.predict(xvalidation)

        # performance
        m_result = ml_error( xvalidation, model_name, np.expm1( yvalidation ), np.expm1( yhat ) )

        # store performance of each kfold iteration
        mae_list.append(  m_result['MAE'] )
        rmse_list.append( m_result['RMSE'] )
        WMAE_list.append( m_result['WMAE'])

    return pd.DataFrame( {'Model Name': model_name,
                          'MAE CV':  np.round( np.mean( mae_list ), 2 ).astype( str )  + ' +/- ' + np.round( np.std( mae_list ), 2 ).astype( str ),
                          'RMSE CV': np.round( np.mean( rmse_list ), 2 ).astype( str ) + ' +/- ' + np.round( np.std( rmse_list ), 2 ).astype( str ),
                          'WMAE CV': np.round( np.mean( WMAE_list ), 2 ).astype( str ) + ' +/- ' + np.round( np.std( WMAE_list ), 2 ).astype( str )}, index=[0] )
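
To make the windowing in `cross_validation` concrete, the sketch below computes only the fold boundaries, with no model involved. The `fold_windows` helper and the example end date are illustrative assumptions, not part of the notebook; the 22-week window mirrors the value used above.

```python
import datetime

def fold_windows(last_date, kfold, window_weeks=22):
    """Return (fold, validation_start, validation_end) tuples, oldest fold first.

    Hypothetical helper: mirrors the date arithmetic in cross_validation().
    """
    windows = []
    for k in reversed(range(1, kfold + 1)):
        start = last_date - datetime.timedelta(weeks=k * window_weeks)
        end = last_date - datetime.timedelta(weeks=(k - 1) * window_weeks)
        windows.append((k, start, end))
    return windows

# Assumed last training date, for illustration only
last_date = datetime.date(2012, 10, 26)
for k, start, end in fold_windows(last_date, kfold=3):
    # Each fold trains on everything strictly before `start`
    print(f"fold {k}: train < {start}, validate {start} .. {end}")
```

Because both validation bounds are inclusive, the end date of one fold equals the start date of the next, so that boundary week is evaluated in two folds.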


In [None]:
jupyter_settings()
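
The `weighted_mean_absolute_error` helper weights holiday weeks five times as heavily as regular weeks. The toy example below (invented numbers, not project data) recomputes the metric by hand next to the plain MAE to show how a single holiday-week miss dominates the score.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: three regular weeks and one holiday week
toy = pd.DataFrame({
    "is_holiday": [False, False, True, False],
    "y":          [100.0, 200.0, 300.0, 400.0],
    "yhat":       [110.0, 190.0, 250.0, 400.0],
})

weights = np.where(toy["is_holiday"], 5, 1)        # 5 for holiday weeks, 1 otherwise
abs_err = (toy["y"] - toy["yhat"]).abs()           # 10, 10, 50, 0

wmae = (weights * abs_err).sum() / weights.sum()   # (10 + 10 + 250 + 0) / 8
mae = abs_err.mean()                               # (10 + 10 + 50 + 0) / 4

print(f"WMAE = {wmae:.2f}, MAE = {mae:.2f}")
```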

## 4 LOADING DATA
<a id="section-four"></a>

In [None]:
#project home path for importing files
# home_path = '/home/marxcerqueira/repos/ze-delivery-prediction-case/'
home_kaggle = '../input/'

#loading datasets available for this project
df_sales_raw  = pd.read_csv(home_kaggle + 'dataset/train.csv', low_memory = False)
df_test_raw   = pd.read_csv(home_kaggle + 'dataset/test.csv', low_memory = False)
df_features   = pd.read_csv(home_kaggle + 'dataset/features.csv', low_memory = False)
df_stores     = pd.read_csv(home_kaggle + 'dataset/stores.csv', low_memory = False)

In [None]:
#first look at the dataframes
df_sales_raw.head()

In [None]:
#first look at the dataframes
df_features.head()

In [None]:
#first look at the dataframes
df_stores.head()

**Datasets Feature Description**

**stores.csv**

This file contains anonymized information about the 45 stores, indicating the type and size of store.

**train.csv**

This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

- Store - the store number
- Dept - the department number
- Date - the week
- Weekly_Sales -  sales for the given department in the given store
- IsHoliday - whether the week is a special holiday week

**test.csv**

This file is identical to **train.csv**, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.

**features.csv**

This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

- Store - the store number
- Date - the week
- Temperature - average temperature in the region
- Fuel_Price - cost of fuel in the region
- MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
- CPI - the consumer price index
- Unemployment - the unemployment rate
- IsHoliday - whether the week is a special holiday week

For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):

- Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
- Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
- Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
- Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
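
The listed holiday weeks can be turned into a lookup that tells the four holidays apart, something the boolean `IsHoliday` flag alone cannot do. This is a minimal sketch: the `HOLIDAY_WEEKS` dictionary, the `holiday_name` helper, and the sample dates are illustrative assumptions rather than part of the notebook.

```python
import pandas as pd

# Holiday weeks exactly as listed above
HOLIDAY_WEEKS = {
    "super_bowl":   ["2010-02-12", "2011-02-11", "2012-02-10", "2013-02-08"],
    "labor_day":    ["2010-09-10", "2011-09-09", "2012-09-07", "2013-09-06"],
    "thanksgiving": ["2010-11-26", "2011-11-25", "2012-11-23", "2013-11-29"],
    "christmas":    ["2010-12-31", "2011-12-30", "2012-12-28", "2013-12-27"],
}

# Flatten into a date -> holiday-name mapping
date_to_holiday = {
    pd.Timestamp(d): name
    for name, dates in HOLIDAY_WEEKS.items()
    for d in dates
}

def holiday_name(dates):
    """Map a Series of dates to holiday names ('none' for regular weeks)."""
    return pd.to_datetime(dates).map(date_to_holiday).fillna("none")

# Toy dates for illustration
sample = pd.Series(["2010-02-05", "2010-02-12", "2011-11-25", "2012-12-28"])
print(holiday_name(sample).tolist())
```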

### 4.1 Merge datasets

The keys used here are 'Store' (to join stores onto features) and 'Store', 'Date' and 'IsHoliday' (to join the result onto the sales data).

In [None]:
# merge datasets into one
df_store_feature = df_features.merge(df_stores, on = 'Store', how = 'left')

# main dataframe for exploring
df0 = df_sales_raw.merge(df_store_feature, on = ['Store', 'Date', 'IsHoliday'], how = 'left').sort_values(['Store','Dept','Date'])

In [None]:
#take a first look at the main dataset after merges
df0.head()
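
A left merge silently duplicates rows if the right-hand keys are not unique, so it can be worth checking the merge explicitly. The sketch below uses small invented frames (not the project data) together with the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

# Toy frames, invented for illustration
sales = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Date": ["2010-02-05"] * 3,
    "IsHoliday": [False] * 3,
    "Weekly_Sales": [100.0, 50.0, 75.0],
})
features = pd.DataFrame({
    "Store": [1, 2],
    "Date": ["2010-02-05"] * 2,
    "IsHoliday": [False] * 2,
    "Temperature": [42.3, 38.5],
})

merged = sales.merge(
    features,
    on=["Store", "Date", "IsHoliday"],
    how="left",
    validate="many_to_one",   # raises MergeError if the right-hand keys are not unique
    indicator=True,           # adds a `_merge` column: 'both' or 'left_only'
)

# A left merge on unique right-hand keys keeps the row count of the left frame
print(len(sales), len(merged))
print(merged["_merge"].value_counts())
```

Rows marked `left_only` in `_merge` would be sales records with no matching feature row, which would otherwise surface later only as unexplained NAs.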

## 5 DATA DESCRIPTION
<a id="section-five"></a>

In [None]:
#Copy dataset
df1 = df0.copy()

### 5.1 Rename Columns
<a id="subsection-five-one"></a>

In [None]:
cols_old = df1.columns

snakecase = lambda x: inflection.underscore(x)

cols_new = list( map( snakecase, cols_old ) )

#Rename Columns
df1.columns = cols_new

In [None]:
#checking cols transformation
df1.columns

### 5.2 Data Dimension
<a id="subsection-five-two"></a>

In [None]:
# checking data dimensions to see if we have enough computational power
print( 'Number of Rows: {}'.format( df1.shape[0] ) )
print( 'Number of Cols: {}'.format( df1.shape[1] ) )

### 5.3 Data Types
<a id="subsection-five-three"></a>

In [None]:
# checking features dtypes
df1.dtypes

### 5.4 Check NA values
<a id="subsection-five-four"></a>

In [None]:
# checking the number of NA values and their percentage of the total number of rows
missing_count = df1.isnull().sum() # the count of missing values
value_count = df1.isnull().count() # the count of all values

missing_percentage = round(missing_count/value_count *100, 2) # the percentage of missing values
missing_df = pd.DataFrame({'missing values count': missing_count, 'percentage': missing_percentage})
missing_df

In [None]:
# missing na chart
barchart = missing_df.plot.bar(y='percentage')
for index, percentage in enumerate( missing_percentage ):
    barchart.text( index, percentage, str(percentage)+'%')

- The MarkDown 1-5 columns have NAs; all other columns are complete.

- They contain many missing values: more than 64% of each markdown column is NA.

- These columns correspond to the promotional activities carried out at different stores.

- The promotional markdowns only started after November 2011 and do not run at all stores all the time, which explains why these columns have so many NAs.

- We will study their relationship with weekly sales during the exploratory data analysis and then decide how to handle these columns and their missing values.

### 5.5 Fillout NA
<a id="subsection-five-five"></a>

In [None]:
# replacing NAs with 0
df1 = df1.fillna(0)

- NAs are replaced with 0, even though this may introduce bias into the model.
- In the next CRISP cycle we will take a deeper look at this.
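
One way to limit the bias from zero-filling, sketched here as a possible option for a later cycle rather than what this notebook does, is to keep an indicator column that records which values were originally missing. The toy frame and the `*_missing` column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration
toy = pd.DataFrame({
    "mark_down1": [np.nan, 1200.0, np.nan, 0.0],
    "mark_down2": [np.nan, np.nan, 300.0, 50.0],
})

markdown_cols = ["mark_down1", "mark_down2"]
for col in markdown_cols:
    # 1 where the value was originally missing, 0 where it was reported
    toy[f"{col}_missing"] = toy[col].isna().astype(int)

# Same zero-fill as above, but the flags preserve the missing-versus-zero distinction
toy[markdown_cols] = toy[markdown_cols].fillna(0)
print(toy)
```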

### 5.6 Change Types
<a id="subsection-five-six"></a>

In [None]:
#converting feature 'date' to datetime
df1['date'] = pd.to_datetime( df1[ 'date' ] )

In [None]:
df1.dtypes # checking datatypes transformation

### 5.7 Descriptive Statistics
<a id="subsection-five-seven"></a>

- Descriptive statistics give a first understanding of the business problem through the features and help detect data errors.

In [None]:
# separate numerical and categorical attributes
num_attributes = df1.select_dtypes( include = 'number')
cate_attributes = df1.select_dtypes( include = 'object')

#### 5.7.1 Numerical Attributes

In [None]:
# Central Tendency - Mean, median
ct1 = pd.DataFrame( num_attributes.apply( np.mean ) ).T
ct2 = pd.DataFrame( num_attributes.apply( np.median ) ).T

# Dispersion - std, min, max, range, skew, kurtosis
d1 = pd.DataFrame(num_attributes.apply( np.std )).T
d2 = pd.DataFrame(num_attributes.apply( min )).T
d3 = pd.DataFrame(num_attributes.apply( max )).T
d4 = pd.DataFrame(num_attributes.apply( lambda x: x.max() - x.min() )).T
d5 = pd.DataFrame(num_attributes.apply( lambda x: x.skew() )).T
d6 = pd.DataFrame(num_attributes.apply( lambda x: x.kurtosis() )).T

#concatenate
m1 = pd.concat( [d2, d3, d4, ct1, ct2, d1, d5, d6] ).T.reset_index()


m1.columns = ['attributes', 'min', 'max', 'range', 'mean', 'median', 'std', 'skew', 'kurtosis']
m1

In [None]:
# check numerical features distribution
num_attributes.hist(bins = 50);
plt.style.use('tableau-colorblind10');

- Histograms were used to check how the feature distributions behave.

#### 5.7.2 Categorical Attributes

In [None]:
# check unique values of categorical features
cate_attributes.apply( lambda x: x.unique().shape[0])

In [None]:
cate_attributes.type.value_counts()

In [None]:
# plot boxplots of categorical features against target variable
aux1 = df1[(df1['type'] != '0') & (df1['weekly_sales'] > 0)]

plt.subplot (1, 3, 1)
sns.boxplot(x='type', y= 'weekly_sales', data=aux1);

- A boxplot gives a good indication of how the values in the data are spread out.
- Although boxplots may seem primitive compared to a histogram or density plot, they take up less space, which is useful when comparing distributions between many groups or datasets.

## 6 FEATURE ENGINEERING
<a id="section-six"></a>

- In this section we created a hypothesis mindmap to help us formulate hypotheses and, after that, engineer features.

In [None]:
df2 = df1.copy()

### 6.1 Hypothesis Mindmap

Created based on what affects the business problem:

1) Phenomenon: What phenomenon am I modeling?

2) Agents: Who are the agents that act on the phenomenon of interest? (all entities that impact the phenomenon)

3) Agent attributes: What describes each agent? (e.g., a client's age, salary, profession, etc.)

4) List of Hypotheses: Hypotheses to validate with the data


- Insights are generated in two ways: by surprise and by contradicting an existing belief.
- Hypotheses are bets; they must be written as a statement in relation to the response variable.
- A hypothesis describes a correlation, not a cause-and-effect relationship.

In [None]:
# Hypothesis Mindmap to help us to create business hypothesis
Image(home_kaggle + 'images2/DAILY_STORE_SALES.png')

### 6.2 Creating Hypothesis

#### **6.2.1 Store Hypothesis**

**1.** Stores with more employees should sell more.

**2.** Stores with greater inventory capacity should sell more.

**3.** Larger stores should sell more.

**4.** Stores with larger assortments should sell more.

**5.** Type A stores should sell more

**6.** Stores with more departments should sell more

#### **6.2.2 Product Hypothesis**

**1.** Stores that invest more in Marketing should sell more.

**2.** Stores with more product exposure should sell more.

**3.** Stores with lower priced products should sell more.

**4.** Stores with more aggressive promotions (bigger discounts) should sell more.

**5.** Stores with longer active promotions should sell more.

#### **6.2.3 Time Hypothesis**

**1.** Stores during the Christmas holiday should sell more.

**2.** Stores should sell more over the years.

**3.** Stores should sell more in the second half of the year.

**4.** Stores should sell more after the 2nd week each month.

**5.** Stores should sell less on weekends.

**6.** Stores should sell more during holidays.

#### **6.2.4 Macroeconomics**

**1.** Places with lower temperatures sell more

**2.** Locations with lower gas prices sell more.

**3.** Places with higher unemployment rate sell less

**4.** Places with a high consumer confidence index sell more

### 6.3 Final Hypothesis List

- Prioritization based on available features in the dataset
- The hypothesis list below will be validated during Exploratory Data Analysis

**1.** Larger stores should sell more.

**2.** Type A stores should sell more.

**3.** Stores with more departments should sell more.

**4.** Stores with more aggressive promotions (bigger discounts) should sell more. (markdowns)

**5.** Stores during the Christmas holiday should sell more.

**6.** Stores should sell more over the years.

**7.** Stores should sell more in the second half of the year.

**8.** Stores should sell more after the 2nd week each month.

**9.** Places with lower temperatures sell more

**10.** Locations with lower gas prices sell more.

**11.** Places with higher unemployment rate sell less

**12.** Places with a high consumer confidence index sell more

### 6.4 Feature Engineering

In [None]:
# year
df2['year'] = df2['date'].dt.year

# month
df2['month'] = df2['date'].dt.month

# day
df2['day'] = df2['date'].dt.day

# week of year
df2['week_of_year'] = df2['date'].dt.isocalendar().week.astype('int64')

# year week
df2['year_week'] = df2['date'].dt.strftime( '%Y-%W' )

# year quarter
df2['quarter'] = df2['date'].dt.to_period('Q')

## 7 VARIABLE FILTERING 
<a id="section-seven"></a>

In [None]:
df3 = df2.copy()

The motivation behind variable filtering is business constraints.

### 7.1 Rows Filtering

- Rows are filtered first because this reduces the dataset volume and speeds up processing;
- Rows with negative weekly sales are removed; they are a small share of the data, so dropping them should add little bias to the model. Note that the filter below keeps only `weekly_sales >= 1`, so zero and sub-dollar records are dropped as well;
- In the next CRISP cycle we can investigate the negative sales and ask the business teams about the relevant business constraints.

In [None]:
# checking the impact of removing negative sales
# percentage of records that will be deleted by this action
df3[df3['weekly_sales']< 0].shape[0]/df3.shape[0] * 100

In [None]:
df3 = df3[df3['weekly_sales']>= 1]
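
The check above counts only negative sales, while the filter keeps rows with `weekly_sales >= 1`. The sketch below, on invented toy values rather than the project data, shows how the two shares can differ when both are measured.

```python
import pandas as pd

# Toy sales values, invented for illustration
toy = pd.DataFrame({"weekly_sales": [-50.0, 0.0, 0.5, 1.0, 250.0, 1800.0, 32000.0, 4.2]})

negative = toy["weekly_sales"] < 0
below_one = toy["weekly_sales"] < 1          # complement of the `>= 1` filter

print(f"negative rows:      {negative.mean() * 100:.1f}%")
print(f"rows dropped (< 1): {below_one.mean() * 100:.1f}%")

# Equivalent to keeping weekly_sales >= 1
filtered = toy[~below_one]
```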

## 8 EXPLORATORY DATA ANALYSIS (EDA)
<a id="section-eight"></a>

How do the variables impact the phenomenon, in this case weekly sales?

How strong is this impact?

It serves to measure the impact of features in relation to the response variable (target)

3 goals:
- gain business experience
- validate business hypotheses (insights)
- select variables that are important to the model

In [None]:
df4 = df3.copy()

### 8.1 Univariate Analysis

#### 8.1.1 Response Variable (target)

In [None]:
# plot target variable distribution
fig = plt.figure( figsize = (14, 6), constrained_layout=True)
sns.histplot(df4['weekly_sales'], kde = False); # histplot replaces the deprecated distplot
plt.style.use('tableau-colorblind10');

- The distribution is not close to normal.
- It is strongly right-skewed.
- Many algorithms tend to perform better when the response variable is closer to normal, so we may apply a log transformation to it later.
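
The effect of a log transformation on a right-skewed variable can be sketched with synthetic data. The values below are generated only for illustration and are not the project's sales; `np.log1p` and `np.expm1` are used because `expm1` already appears in the cross-validation helper to convert predictions back to dollars.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "sales", generated only for illustration
rng = np.random.default_rng(42)
sales = pd.Series(rng.lognormal(mean=9.0, sigma=1.2, size=5_000)) + 1.0

log_sales = np.log1p(sales)          # log(1 + x), defined for x >= 0
restored = np.expm1(log_sales)       # inverse transform, back to the original scale

print(f"skew before: {sales.skew():.2f}")
print(f"skew after:  {log_sales.skew():.2f}")
```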

#### 8.1.2 Numerical Variable

In [None]:
num_attributes = df4.select_dtypes( include = 'number')

In [None]:
# histogram for numerical features
num_attributes.hist(bins = 45);
plt.style.use('tableau-colorblind10');

**Overall:** None of the variables follows a normal distribution

 - store: some stores have fewer weekly sales data points than others;
 - dept: some departments have fewer weekly sales data points than others;
 - temperature: the closest to a normal distribution, slightly skewed to the left (-);
 - week_of_year: some specific weeks have more sales data points;
 - month: more sales data points in months 4 and 7;
 - year: more sales data points in the year 2011.

#### 8.1.3 Categorical Variable

In [None]:
cate_attributes = df4.select_dtypes( include = 'object')

In [None]:
# type
plt.subplot(2, 2, 1)
sns.countplot(data = df4, x = 'type')

plt.subplot(2, 2, 2)
# `fill` replaces the deprecated `shade` argument
sns.kdeplot(x = df4[df4['type'] == 'A']['weekly_sales'], fill = True)
sns.kdeplot(x = df4[df4['type'] == 'B']['weekly_sales'], fill = True)
sns.kdeplot(x = df4[df4['type'] == 'C']['weekly_sales'], fill = True);

plt.subplot(2, 2, 3)
sns.boxplot(x='type', y= 'weekly_sales', data=df4);

plt.subplot(2, 2, 4)
sns.boxplot(x='type', y= 'weekly_sales', data=df4, showfliers=False);
plt.style.use('tableau-colorblind10');

- There are more sales data points for type A stores
- Store type contributes little to the response variable, since the distributions overlap.
- The median of A is the highest and that of C is the lowest

### 8.2 Bivariate Analysis

#### H1. Larger stores should sell more.

**TRUE** Larger stores have higher sales

In [None]:
# compare the size and type of store with weekly sales
aux = df4[['weekly_sales', 'type', 'size']].groupby(['type', 'size']).mean().reset_index()

sns.barplot(x = 'size', y = 'weekly_sales', data = aux, hue = 'type')
plt.xticks(rotation = 75);
plt.style.use('tableau-colorblind10');

In [None]:
# scatterplot from df4
sns.scatterplot(x = df4['size'], y = df4['weekly_sales']);

In [None]:
# scatterplot aggregated sizes
sns.scatterplot(x = aux['size'], y = aux['weekly_sales']);

In [None]:
# type and size boxplot
plt.subplot(1, 2, 1)
sns.boxplot(x='type', y='size', data=df4)
plt.style.use('tableau-colorblind10');

plt.subplot(1, 2, 2)
sns.boxplot(x='type', y= 'weekly_sales', data=df4, showfliers=False);
plt.style.use('tableau-colorblind10');

In [None]:
#check how different type stores performed over years
aux2 = df4[['weekly_sales', 'type', 'year_week']].groupby(['type', 'year_week']).mean().reset_index()
aux2.pivot( index = 'year_week', columns = 'type', values = 'weekly_sales').plot();


In [None]:
# correlation between size and weekly sales
sns.heatmap(aux[['size', 'weekly_sales']].corr(method= 'pearson'), annot= True); # numeric columns only

- From the boxplot, we can infer that type A stores are the largest and type C stores are the smallest
- There is no overlap in size among A, B, and C, so type is a strong predictor of size
- Larger stores have higher sales (the medians of size and of sales follow the same order across types)

#### H2. Type A stores should sell more.
**TRUE** Type A stores sell more over time, but this is because type is strongly correlated with size (hypothesis 1)

#### H3. Stores with more departments should sell more.
**TRUE** Stores with more departments tend to sell more.

In [None]:
# compare the number of departments of each store with weekly sales
aux = df4[['store', 'dept', 'weekly_sales']].groupby(['store']).agg(n_store_dept = ('dept', 'nunique'),
                                                                    weekly_sales = ('weekly_sales', 'mean')).reset_index()

# compare the store number with weekly sales
aux2 = df4[['store', 'weekly_sales']].groupby(['store']).mean().reset_index()

plt.subplot(2, 1, 1)
sns.barplot(x= 'n_store_dept', y= 'weekly_sales', data= aux);
plt.style.use('tableau-colorblind10');

plt.subplot(2, 1, 2)
sns.barplot(x= 'store', y= 'weekly_sales', data= aux2);
plt.style.use('tableau-colorblind10');

In [None]:
# checking each department individually by type
f, ax = plt.subplots(figsize=(10, 50))
sns.boxplot(x='weekly_sales', y= 'dept', data=df4, showfliers=False, hue="type",orient="h");
plt.style.use('tableau-colorblind10');

- Each department shows a different level of sales
- Department may be a powerful variable for predicting sales
- When department and store type are considered together, departments in type A stores generally show the highest sales
- Type and department may have an interaction effect
- Some departments are missing (e.g., 15, 73...)
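
A quick way to look for an interaction between department and store type is a pivot of mean sales by both. The sketch below uses invented toy numbers, not the project data, and only illustrates the check.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "dept":         [1, 1, 1, 2, 2, 2, 1, 2],
    "type":         ["A", "B", "C", "A", "B", "C", "A", "A"],
    "weekly_sales": [30000.0, 20000.0, 9000.0, 50000.0, 45000.0, 1000.0, 34000.0, 54000.0],
})

# Mean sales per department (rows) and store type (columns)
table = toy.pivot_table(index="dept", columns="type", values="weekly_sales", aggfunc="mean")
print(table)

# Without an interaction, the A-minus-C gap would be similar in every department
gap = table["A"] - table["C"]
print(gap)
```

In this toy table the gap differs widely between the two departments, which is the pattern an interaction effect would produce.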

#### H4. Stores with more aggressive promotions (bigger discounts) should sell more. (markdowns)

**FALSE** The influence of markdowns on weekly sales is not very clear. In the next CRISP cycle we will do a more in-depth analysis


In [None]:
# compare weekly sales with all mark downs features
aux = df4[['weekly_sales', 'mark_down1', 'mark_down2', 'mark_down3', 'mark_down4', 'mark_down5']].groupby(df4['week_of_year']).mean()

plt.figure(figsize=(20,8))

sns.lineplot(x = aux.index, y = aux.weekly_sales.values)
sns.lineplot(x = aux.index, y = aux.mark_down1.values)
sns.lineplot(x = aux.index, y = aux.mark_down2.values)
sns.lineplot(x = aux.index, y = aux.mark_down3.values)
sns.lineplot(x = aux.index, y = aux.mark_down4.values)
sns.lineplot(x = aux.index, y = aux.mark_down5.values)

plt.grid()

plt.xticks(np.arange(1, 53, step=1))
plt.legend(['weekly_sales', 'mark_down1', 'mark_down2', 'mark_down3', 'mark_down4', 'mark_down5'], loc='best', fontsize=16)
plt.title('Average Weekly Sales - Mark Downs', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.xlabel('Week', fontsize=16)
plt.show()

#### H5. Stores during the Christmas holiday week should sell more.
**TRUE** Weekly sales peak during weeks 50 and 51

In [None]:
# plot sales week by week of the year for each year
weekly_sales_2010 = df4[df4['year']==2010]['weekly_sales'].groupby(df4['week_of_year']).mean()
weekly_sales_2011 = df4[df4['year']==2011]['weekly_sales'].groupby(df4['week_of_year']).mean()
weekly_sales_2012 = df4[df4['year']==2012]['weekly_sales'].groupby(df4['week_of_year']).mean()

plt.figure(figsize=(20,8))

sns.lineplot(x = weekly_sales_2010.index, y = weekly_sales_2010.values)
sns.lineplot(x = weekly_sales_2011.index, y = weekly_sales_2011.values)
sns.lineplot(x = weekly_sales_2012.index, y = weekly_sales_2012.values)

plt.grid()

plt.xticks(np.arange(1, 53, step=1))
plt.legend(['2010', '2011', '2012'], loc='best', fontsize=16)
plt.title('Average Weekly Sales - Per Year', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.xlabel('Week', fontsize=16)
plt.show()

In [None]:
# compare weekly sales with holidays
aux = df4[['is_holiday', 'weekly_sales']].groupby('is_holiday').mean().reset_index();

fig = plt.figure(figsize = (18,12))
plt.subplot(211)
sns.barplot(data = aux, x= 'is_holiday', y= 'weekly_sales');

aux2 = df4[['month','is_holiday', 'weekly_sales']].groupby(['month','is_holiday']).mean().reset_index();

plt.subplot(212)
sns.barplot(data = aux2, x= 'month', y= 'weekly_sales', hue= 'is_holiday');
plt.xticks(rotation = 45);

- Around Thanksgiving and Christmas, sales rise by a large margin in every year
- Sales increase during holidays
- Sales are high in the weeks leading up to Christmas

#### H6. Stores should sell more over the years.
**FALSE** The weekly sales average is roughly stable over the years. It appears to decline slightly in 2012, but that is because the 2012 Christmas season is not in the data yet.

In [None]:
# average weekly sales over the years
aux  = df4[['year', 'is_holiday','weekly_sales']].groupby(['year', 'is_holiday']).mean().reset_index()
aux2 = df4[['year','weekly_sales']].groupby(['year']).mean().reset_index()

plt.subplot(3, 1, 1)
sns.barplot(data = aux2, x = 'year', y= 'weekly_sales');


plt.subplot(3, 1, 2)
sns.barplot(data = aux, x = 'year', y= 'weekly_sales', hue= 'is_holiday');

plt.subplot(3, 1, 3)
sns.regplot(data = aux, x= 'year', y= 'weekly_sales');

In [None]:
# correlation between year and weekly sales
sns.heatmap(aux.corr(method= 'pearson'), annot= True);

#### **H7.** Stores should sell more in the second half of the year.
**TRUE** Stores sell more in the second half of the year, mainly because of the last quarter, which includes Thanksgiving and Christmas sales

In [None]:
# month sales performance by year
aux = df4[['year', 'month','weekly_sales']].groupby(['year', 'month']).mean().reset_index()
sns.barplot(data = aux, x = 'month', y= 'weekly_sales', hue= 'year');

In [None]:
# aggregate month related to sales
plt.subplot(131)
aux = df4[['month','weekly_sales']].groupby(['month']).mean().reset_index()
sns.barplot(data = aux, x = 'month', y= 'weekly_sales');

plt.subplot(132)
sns.regplot(data = aux, x= 'month', y= 'weekly_sales');

plt.subplot(133)
sns.heatmap(aux.corr(method= 'pearson'), annot = True);

In [None]:
# sales performance by quarter
aux2010 = df4[df4['year']==2010][['weekly_sales', 'quarter']].groupby('quarter').mean().reset_index()
aux2011 = df4[df4['year']==2011][['weekly_sales', 'quarter']].groupby('quarter').mean().reset_index()
aux2012 = df4[df4['year']==2012][['weekly_sales', 'quarter']].groupby('quarter').mean().reset_index()

plt.subplot(131)
sns.barplot(data = aux2010, x = 'quarter', y = 'weekly_sales')
plt.subplot(132)
sns.barplot(data = aux2011, x = 'quarter', y = 'weekly_sales')
plt.subplot(133)
sns.barplot(data = aux2012, x = 'quarter', y = 'weekly_sales');

#### **H8.** Stores should sell more after the 2nd week each month.
**FALSE** Sales are almost the same.

In [None]:
# plot to check sales after and before 2 weeks of each month
aux = df4[['day', 'weekly_sales']].groupby('day').mean().reset_index()
aux['before_after'] = aux['day'].apply(lambda x: 'before_15_days' if x <= 15 else
                                                  'after_15_days')

sns.barplot(data = aux, x= 'before_after', y= 'weekly_sales');

#### **H9.** Places with lower temperatures sell more
**FALSE** Temperature seems to have no relationship with weekly sales

In [None]:
# relationship between temperature and sales over the years
aux = df4[['temperature', 'weekly_sales', 'year_week']].groupby('year_week').mean().reset_index()

sns.lineplot(data = aux, x = 'year_week', y = 'weekly_sales')
plt.xticks(rotation = 90);
ax2 = plt.twinx()
sns.lineplot(data=aux,x = 'year_week', y = 'temperature', color="r", ax=ax2);

In [None]:
# temperature x weekly sales
sns.scatterplot(x = df4.temperature, y = df4.weekly_sales);

- There seems to be no relationship between the temperature in the region and the weekly sales of the stores.
- At low and very high temperatures sales seem to dip a bit, but in general there is no clear relationship.

#### **H10.** Locations with lower gas prices sell more.
**FALSE** There is no clear relationship.

In [None]:
# relationship between fuel price and sales over the years
aux = df4[['fuel_price', 'weekly_sales', 'year_week']].groupby('year_week').mean().reset_index()

sns.lineplot(data = aux, x = 'year_week', y = 'weekly_sales')
plt.xticks(rotation = 90);
ax2 = plt.twinx()
sns.lineplot(data=aux,x = 'year_week', y = 'fuel_price', color="r", ax=ax2);

In [None]:
# fuel price x weekly sales
sns.scatterplot(x = df4.fuel_price, y = df4.weekly_sales);

- Between fuel price and the sales there doesn't seem to exist any clear relationship

#### **H11.** Places with higher unemployment rate sell less
**FALSE** There is no clear relationship.

In [None]:
# relationship between Unemployment rate and sales over the years
aux = df4[['unemployment', 'weekly_sales', 'year_week']].groupby('year_week').mean().reset_index()

sns.lineplot(data = aux, x = 'year_week', y = 'weekly_sales')
plt.xticks(rotation = 90);
ax2 = plt.twinx()
sns.lineplot(data=aux,x = 'year_week', y = 'unemployment', color="r", ax=ax2);

In [None]:
# Unemployment rate x weekly sales
sns.scatterplot(x = df4.unemployment, y = df4.weekly_sales);

#### **H12.** Places with a high consumer confidence index sell more
**FALSE** There is no clear relationship.

In [None]:
# relationship between CPI and sales over the years
aux = df4[['cpi', 'weekly_sales', 'year_week']].groupby('year_week').mean().reset_index()

sns.lineplot(data = aux, x = 'year_week', y = 'weekly_sales')
plt.xticks(rotation = 90);
ax2 = plt.twinx()
sns.lineplot(data=aux,x = 'year_week', y = 'cpi', color="r", ax=ax2);

In [None]:
# CPI x weekly sales
sns.scatterplot(x = df4.cpi, y = df4.weekly_sales);

#### **8.2.13 Hypothesis Summary**

In [None]:
# Hypothesis Summary to select feature relevance to the model
summary = pd.DataFrame({'Hypothesis':['Larger stores should sell more.',
                                      'Type A stores should sell more.',
                                      'Stores with more departments should sell more.',
                                      'Stores with more aggressive promotions (bigger discounts) should sell more. (markdowns)',
                                      'Stores during the Christmas holiday should sell more.',
                                      'Stores should sell more over the years.',
                                      'Stores should sell more in the second half of the year.',
                                      'Stores should sell more after the 2nd week each month.',
                                      'Places with lower temperatures sell more',
                                      'Locations with lower gas prices sell more.',
                                      'Places with higher unemployment rate sell less.',
                                      'Places with a high consumer confidence index sell more.',
                                     ],
                        'True / False':['True', 'True', 'True', 'False', 'True', 'False', 'True', 'False', 'False',
                                        'False','False', 'False'],
                        'Relevance':['High', 'Medium', 'Low', 'Medium', 'Low', 'Low', 'High', 'Low', 'Low', 
                                     'Low', 'Low', 'Low']}, 
                        index=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
summary

### 8.3 Multivariate Analysis

- Check the correlations between the numerical columns to see how they relate to weekly sales and to confirm the inferences drawn in the EDA above.

In [None]:
# correlation among all variables
correlation = (num_attributes.corr( method = 'pearson' ))
sns.heatmap( correlation, annot = True );

In [None]:
def plot_corr(col):
    a = correlation[col].sort_values(ascending=False).to_frame()
    a.columns = ['']
    a.drop(col, axis=0, inplace=True)
    plot = sns.heatmap( a, annot=True, cmap='Blues').set_title(col);
    
    return plot

plot_corr('weekly_sales');

## 9 DATA PREPARATION
<a id="section-nine"></a>

In [None]:
# copy of dataset
df5 = df4.copy()

The rescaling methods applied below are chosen based on each feature's distribution shape and on the boxplot outlier analysis.

- Standard Scaler: applied to variables whose distribution is close to normal;
- Min-Max Scaler: applied to variables with little outlier influence;
- Robust Scaler: applied to variables with strong outlier influence.
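
To make the difference between the three scalers concrete, here is a minimal sketch on a toy feature with one extreme value. The data is hypothetical and only illustrates how each scikit-learn scaler reacts to an outlier; it is not taken from the Walmart dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# toy feature with a single extreme value (hypothetical data)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {
    'standard': StandardScaler().fit_transform(values).ravel(),
    'min_max': MinMaxScaler().fit_transform(values).ravel(),
    'robust': RobustScaler().fit_transform(values).ravel(),
}

# compare how each scaler places the ordinary values and the outlier
for name, result in scaled.items():
    print(name, np.round(result, 2))
```

With Min-Max scaling the outlier defines the range, so the ordinary values are squeezed close to zero. The Robust Scaler centers on the median and divides by the interquartile range, so the ordinary values keep a usable spread.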

### 9.1 Split dataframe into training and validation

- Here we split the dataframe into training and validation sets (roughly 85/15).
- We do this before data preparation to avoid data leakage.
- Data preparation is fitted and applied on the training dataset (fit_transform) and then only applied to the validation dataset (transform).
- The same is done with the test data at the end of modelling.
- Splitting on the temporal variable is more reliable whenever the dataset includes a date and we want to predict something in the future that depends on it.
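
The split-then-fit order described above can be sketched on toy data. The dataframe, the two-week validation window and the variable names are assumptions made for illustration only; the notebook itself uses a 22-week window.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# hypothetical weekly data: 10 weeks of a single feature
toy = pd.DataFrame({
    'date': pd.date_range('2012-01-06', periods=10, freq='W-FRI'),
    'fuel_price': [3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9],
})

# the last 2 weeks become the validation set
cutoff = toy['date'].max() - pd.Timedelta(weeks=2)
train = toy[toy['date'] <= cutoff].copy()
valid = toy[toy['date'] > cutoff].copy()

scaler = MinMaxScaler()
# learn min/max on the training rows only
train['fuel_price'] = scaler.fit_transform(train[['fuel_price']]).ravel()
# reuse the training parameters on the validation rows
valid['fuel_price'] = scaler.transform(valid[['fuel_price']]).ravel()
```

Because the scaler never sees the validation rows, scaled validation values can fall outside the [0, 1] range. That is expected and is the price of avoiding leakage.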

In [None]:
# checking the max date in the dataframe
df5[['store', 'date']].groupby('store').max().reset_index()['date'][0] 

In [None]:
# checking 22 weeks before
df5[['store', 'date']].groupby('store').max().reset_index()['date'][0] - datetime.timedelta( weeks = 22 )

In [None]:
# Splitting dataframe into train and validation.
# Validation will have the last 22 weeks of sales, which represents 16% of the data,
# starting at 2012-05-25 until the last day of sales

# Train dataset (.copy() avoids SettingWithCopyWarning in the transformations below)
X_train = df5[df5['date'] < '2012-05-25'].copy()
y_train = X_train['weekly_sales']

# Validation dataset
X_validation = df5[df5['date'] >= '2012-05-25'].copy()
y_validation = X_validation['weekly_sales']

print( 'Training Min Date: {}'.format( X_train['date'].min() ) )
print( 'Training Max Date: {}'.format( X_train['date'].max() ) )

print( '\nValidation Min Date: {}'.format( X_validation['date'].min() ) )
print( 'Validation Max Date: {}'.format( X_validation['date'].max() ) )

In [None]:
# check the proportion of data points in the validation set
X_validation.shape[0]/df5.shape[0] * 100

### 9.2 Checking features outliers presence

The boxplots below check for outliers in the following features:

- 'temperature', 'fuel_price', 'mark_down1', 'mark_down2', 'mark_down3', 'mark_down4', 'mark_down5', 'cpi', 'unemployment', 'size', 'year', 'month', 'day', 'week_of_year'

In [None]:
# plot boxplots of the numerical features to check for outliers
features = ['temperature', 'fuel_price', 'mark_down1', 'mark_down2', 'mark_down3',
            'mark_down4', 'mark_down5', 'cpi', 'unemployment', 'size',
            'year', 'month', 'day', 'week_of_year']

for i, col in enumerate(features, start=1):
    plt.subplot(4, 4, i)  # the 14 plots fit in a 4x4 grid
    sns.boxplot(x=df5[col])

plt.tight_layout()

### 9.3 Feature Normalization

**None of the attributes follows a normal distribution, so the Standard Scaler is not applied.**

### 9.4 Feature Rescaling

In [None]:
rs = RobustScaler()  # chosen for features with strong outliers:
                     # centers on the median and scales by the interquartile range, so it is robust to outliers

mms = MinMaxScaler() # very sensitive to outliers

#features for Robust Scalers
#temperature
X_train['temperature'] = rs.fit_transform( X_train[['temperature']].values ) 
# pickle.dump(rs, open(home_path+'/parameters/temperature_scaler.pkl', 'wb'))

# mark_down1
X_train['mark_down1'] = rs.fit_transform( X_train[['mark_down1']].values ) 
# pickle.dump(rs, open(home_path+'/parameters/mark_down1_scaler.pkl', 'wb'))

# mark_down2
X_train['mark_down2'] = rs.fit_transform( X_train[['mark_down2']].values ) 
# pickle.dump(rs, open(home_path+'/parameters/mark_down2_scaler.pkl', 'wb'))

# mark_down3
X_train['mark_down3'] = rs.fit_transform( X_train[['mark_down3']].values ) 
# pickle.dump(rs, open(home_path+'/parameters/mark_down3_scaler.pkl', 'wb'))

# mark_down4
X_train['mark_down4'] = rs.fit_transform( X_train[['mark_down4']].values ) 
# pickle.dump(rs, open(home_path+'/parameters/mark_down4_scaler.pkl', 'wb'))

# mark_down5
X_train['mark_down5'] = rs.fit_transform( X_train[['mark_down5']].values ) 
# pickle.dump(rs, open(home_path+'/parameters/mark_down5_scaler.pkl', 'wb'))

# unemployment
X_train['unemployment'] = rs.fit_transform( X_train[['unemployment']].values ) 
# pickle.dump(rs, open(home_path+'/parameters/unemployment_scaler.pkl', 'wb'))

#Features for MinMaxScaler
# fuel_price
X_train['fuel_price'] = mms.fit_transform( X_train[['fuel_price']].values )
# pickle.dump(mms, open(home_path + '/parameters/fuel_price_scaler.pkl', 'wb'))

# cpi
X_train['cpi'] = mms.fit_transform( X_train[['cpi']].values )
# pickle.dump(mms, open(home_path + '/parameters/cpi_scaler.pkl', 'wb'))

# size
X_train['size'] = mms.fit_transform( X_train[['size']].values )
# pickle.dump(mms, open(home_path + '/parameters/size_scaler.pkl', 'wb'))

# year
X_train['year'] = mms.fit_transform( X_train[['year']].values )
# pickle.dump(mms, open(home_path + '/parameters/year_scaler.pkl', 'wb'))

### 9.5 Feature Transformation

#### Encoding

In [None]:
# is_holiday
X_train['is_holiday'] = X_train['is_holiday'].apply(lambda x: 1 if x == True else 0)

# type - Label Encoder
le = LabelEncoder()
X_train['type'] = le.fit_transform( X_train['type'] )
# pickle.dump(le, open(home_path + '/parameters/type_scaler.pkl', 'wb'))

#### Response (Target) Variable Transformation

In [None]:
X_train['weekly_sales'] = np.log1p( X_train['weekly_sales'] )
sns.distplot(X_train['weekly_sales']);

- The log transformation is arguably the most popular transformation for making skewed data approximately normal.
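
As a small sketch of why `np.log1p` is paired with `np.expm1` later in the notebook, the round trip below uses hypothetical sales values. Note that `log1p` is only defined for values greater than -1.

```python
import numpy as np

sales = np.array([0.0, 120.5, 15_000.0, 250_000.0])  # hypothetical weekly sales

log_sales = np.log1p(sales)       # log(1 + x), defined at x = 0
recovered = np.expm1(log_sales)   # exp(x) - 1, the inverse of log1p

print(np.round(log_sales, 3))
print(np.allclose(recovered, sales))
```

Models are trained on the transformed target, so predictions have to be passed through `np.expm1` before errors are reported in the original sales units.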

#### Nature Transformation

In [None]:
# month
X_train['month_sin'] = X_train['month'].apply( lambda x: np.sin( x * ( 2. * np.pi/12 ) ) )
X_train['month_cos'] = X_train['month'].apply( lambda x: np.cos( x * ( 2. * np.pi/12 ) ) )

# day 
X_train['day_sin'] = X_train['day'].apply( lambda x: np.sin( x * ( 2. * np.pi/30 ) ) )
X_train['day_cos'] = X_train['day'].apply( lambda x: np.cos( x * ( 2. * np.pi/30 ) ) )

# week_of_year
X_train['week_of_year_sin'] = X_train['week_of_year'].apply( lambda x: np.sin( x * ( 2. * np.pi/52 ) ) )
X_train['week_of_year_cos'] = X_train['week_of_year'].apply( lambda x: np.cos( x * ( 2. * np.pi/52 ) ) )

- Applied to features with cyclic behavior, i.e. values that repeat over time (month, day, week of year).
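
The point of the sine/cosine encoding is that the end of a cycle lands next to its beginning. A minimal sketch follows, where `encode_month` is a hypothetical helper and not part of the notebook:

```python
import numpy as np

def encode_month(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2.0 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

# distance on the circle between pairs of months
dec_to_jan = np.linalg.norm(encode_month(12) - encode_month(1))
jun_to_jan = np.linalg.norm(encode_month(6) - encode_month(1))

print(round(dec_to_jan, 3), round(jun_to_jan, 3))
```

With the raw month numbers, December (12) and January (1) are 11 units apart. After the encoding they are closer to each other than June and January are, which matches the calendar.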

In [None]:
#checking dataframe after scaling to see if everything went through
X_train.head()

### 9.6 Apply Transformations in the Validation dataset

In [None]:
# loading scalers

#Robust Scaler 
temperature_scaler  = pickle.load(open(home_kaggle + 'parameters/temperature_scaler.pkl', 'rb'))
mark_down1_scaler   = pickle.load(open(home_kaggle + 'parameters/mark_down1_scaler.pkl', 'rb'))
mark_down2_scaler   = pickle.load(open(home_kaggle + 'parameters/mark_down2_scaler.pkl', 'rb'))
mark_down3_scaler   = pickle.load(open(home_kaggle + 'parameters/mark_down3_scaler.pkl', 'rb'))
mark_down4_scaler   = pickle.load(open(home_kaggle + 'parameters/mark_down4_scaler.pkl', 'rb'))
mark_down5_scaler   = pickle.load(open(home_kaggle + 'parameters/mark_down5_scaler.pkl', 'rb'))
unemployment_scaler = pickle.load(open(home_kaggle + 'parameters/unemployment_scaler.pkl', 'rb'))

#MinMax Scaler 
fuel_price_scaler   = pickle.load(open(home_kaggle + 'parameters/fuel_price_scaler.pkl', 'rb'))
cpi_scaler          = pickle.load(open(home_kaggle + 'parameters/cpi_scaler.pkl', 'rb'))
size_scaler         = pickle.load(open(home_kaggle + 'parameters/size_scaler.pkl', 'rb'))
year_scaler         = pickle.load(open(home_kaggle + 'parameters/year_scaler.pkl', 'rb'))

#Label encoder
type_scaler         = pickle.load(open(home_kaggle + 'parameters/type_scaler.pkl', 'rb'))

In [None]:
# Applying all transformations on validation dataset
#Validation dataset features transform - Robust Scaler
X_validation['temperature'] = temperature_scaler.transform( X_validation[['temperature']].values ) 
X_validation['mark_down1']  = mark_down1_scaler.transform( X_validation[['mark_down1']].values ) 
X_validation['mark_down2']  = mark_down2_scaler.transform( X_validation[['mark_down2']].values ) 
X_validation['mark_down3']  = mark_down3_scaler.transform( X_validation[['mark_down3']].values ) 
X_validation['mark_down4']  = mark_down4_scaler.transform( X_validation[['mark_down4']].values ) 
X_validation['mark_down5']  = mark_down5_scaler.transform( X_validation[['mark_down5']].values ) 
X_validation['unemployment']= unemployment_scaler.transform( X_validation[['unemployment']].values ) 

##Validation dataset features transform - MinMaxScaler
X_validation['fuel_price'] = fuel_price_scaler.transform( X_validation[['fuel_price']].values )
X_validation['cpi']        = cpi_scaler.transform( X_validation[['cpi']].values )
X_validation['size']       = size_scaler.transform( X_validation[['size']].values )
X_validation['year']       = year_scaler.transform( X_validation[['year']].values )

##Validation dataset features transform - Label Encoder
X_validation['type'] = type_scaler.transform( X_validation['type'] )

# is_holiday
X_validation['is_holiday'] = X_validation['is_holiday'].apply(lambda x: 1 if x == True else 0)

# target variable
X_validation['weekly_sales'] = np.log1p( X_validation['weekly_sales'] )

##Validation dataset features transform - Natural Transformations
# month
X_validation['month_sin'] = X_validation['month'].apply( lambda x: np.sin( x * ( 2. * np.pi/12 ) ) )
X_validation['month_cos'] = X_validation['month'].apply( lambda x: np.cos( x * ( 2. * np.pi/12 ) ) )

# day 
X_validation['day_sin'] = X_validation['day'].apply( lambda x: np.sin( x * ( 2. * np.pi/30 ) ) )
X_validation['day_cos'] = X_validation['day'].apply( lambda x: np.cos( x * ( 2. * np.pi/30 ) ) )

# week_of_year
X_validation['week_of_year_sin'] = X_validation['week_of_year'].apply( lambda x: np.sin( x * ( 2. * np.pi/52 ) ) )
X_validation['week_of_year_cos'] = X_validation['week_of_year'].apply( lambda x: np.cos( x * ( 2. * np.pi/52 ) ) )

In [None]:
# targets (log-transformed weekly sales) for validation and training
y_validation = X_validation['weekly_sales']

y_train = X_train['weekly_sales'] 

## 10 FEATURE SELECTION
<a id="section-ten"></a>

- Select the most relevant features that describe our dataset (phenomenon) and remove collinear features, since they explain the same part of the phenomenon.

- Always prefer the simplest model (Occam's razor).
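
One common way to spot collinear features is a correlation threshold. The sketch below is an illustration on synthetic data, with an assumed threshold of 0.95 and made-up column values; it is not the selection procedure used in this notebook, which relies on Boruta, EDA and Random Forest importance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'size': base,
    'size_copy': base * 2.0 + rng.normal(scale=0.01, size=200),  # nearly duplicates 'size'
    'temperature': rng.normal(size=200),
})

corr = toy.corr(method='pearson').abs()
# keep only the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# a column is dropped when it is highly correlated with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

print(to_drop)
```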

In [None]:
df6 = X_train.copy()

In [None]:
# deleting features after feature engineering derivation and transformations. Deleting original variables.
cols_drop = ['week_of_year', 'day', 'month', 'year_week', 'quarter']
df6 = df6.drop(cols_drop, axis = 1)

In [None]:
#double checking dtypes before run models
df6.dtypes

### 10.1 Boruta as a Feature Selector

- I ran Boruta locally because of the computational power it requires. The lines below are commented out so the algorithm doesn't run every time.

In [None]:
# # creating training and test dataset for Boruta, because it can't be a dataframe type
# X_train_n = df6.drop( ['date', 'weekly_sales'], axis=1 ).values
# y_train_n = y_train.values.ravel()

# # Define RandomForestRegressor
# rf = RandomForestRegressor( n_jobs=-1 )

# # Define Boruta
# boruta = BorutaPy( rf, n_estimators='auto', verbose=2, random_state=42 ).fit( X_train_n, y_train_n )

### 10.2 Best Features from Boruta

In [None]:
# cols_selected = boruta.support_.tolist()

# X_train_fs = df6.drop(['date', 'weekly_sales'], axis = 1)
# cols_selected_boruta = X_train_fs.iloc[ :, cols_selected].columns.tolist()

# # Not selected boruta features
# cols_not_selected_boruta = np.setdiff1d(X_train_fs.columns, cols_selected_boruta)

### 10.3 Best Features from Random Forest

In [None]:
X_train = df6.drop( ['date', 'weekly_sales'], axis=1 ).copy()
y_train = df6['weekly_sales'].copy()

In [None]:
# train random forest regressor
rf = RandomForestRegressor(n_estimators = 200, n_jobs =-1, random_state = 42)
rf.fit(X_train, y_train)

# feature importance data frame
feat_imp = pd.DataFrame({'feature': X_train.columns,
                        'feature_importance': rf.feature_importances_})\
                        .sort_values('feature_importance', ascending=False)\
                        .reset_index(drop=True)


# plot feature importance
plt.subplots(figsize=(12,6))
sns.barplot(x='feature_importance', y='feature', data=feat_imp, orient='h', color='royalblue')\
    .set_title('Feature Importance');

### 10.4 Manual Feature Selection

- Features were selected by combining the results of the Boruta algorithm, the EDA and the Random Forest feature importance.

In [None]:
cols_selected_boruta = ['store',
'dept',
'is_holiday',
'type',
'cpi',
'size',
'month_cos',
'week_of_year_cos']
# columns to add
feat_to_add = ['date', 'weekly_sales']

# final features

cols_selected_boruta_full = cols_selected_boruta.copy()
cols_selected_boruta_full.extend( feat_to_add )

In [None]:
# cols_selected_boruta = ['store',
# 'dept',
# 'is_holiday',
# 'temperature',
# 'fuel_price',
# 'mark_down3',
# 'cpi',
# 'unemployment',
# 'type',
# 'size',
# 'month_cos',
# 'day_sin',
# 'day_cos',
# 'week_of_year_sin',
# 'week_of_year_cos']
# # columns to add
# feat_to_add = ['date', 'weekly_sales']

# # final features

# cols_selected_boruta_full = cols_selected_boruta.copy()
# cols_selected_boruta_full.extend( feat_to_add )

In [None]:
pd.DataFrame(data = cols_selected_boruta, columns = ['feature_selected'])

## 11 MACHINE LEARNING ALGORITHM MODELS
<a id="section-eleven"></a>

**Five different algorithms are going to be used to predict the target variable:**

- **Average:** the baseline model. It always predicts the average, which makes it a useful reference for comparing the other models.

- **Linear Regression:** fits a linear function of the features to the target by minimizing the squared error. A regularized variant (Lasso) is evaluated as well.

- **Random Forest:** a tree-based ensemble in which multiple decision trees are built with the bagging method and, for regression, their predictions are averaged. Since the algorithm does not try to find a linear function to describe the event, it works for problems with more complex behaviour.

- **XGBoost:** also a tree-based model, but the trees are built in a different way. While Random Forest builds each tree independently, XGBoost builds one tree at a time, each learning from its predecessors. Therefore, this algorithm does not combine results at the end of the process; it combines them along the way.

- **LightGBM:** a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:
    - Faster training speed and higher efficiency.
    - Lower memory usage.
    - Better accuracy.
    - Support of parallel, distributed, and GPU learning.
    - Capable of handling large-scale data.

- If several models perform similarly, we should choose the least complex one, since simpler models tend to generalize better.
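
To illustrate the idea of an average baseline, here is a minimal sketch on toy data. It is a variant that learns the per-store mean from training rows only and then scores it on validation rows; the dataframes and values are hypothetical.

```python
import pandas as pd

# hypothetical training and validation rows
train = pd.DataFrame({'store': [1, 1, 2, 2],
                      'weekly_sales': [10.0, 14.0, 100.0, 80.0]})
valid = pd.DataFrame({'store': [1, 2],
                      'weekly_sales': [13.0, 95.0]})

# the "model" is just the mean weekly sales of each store in the training data
store_mean = (train.groupby('store')['weekly_sales'].mean()
                   .rename('predictions').reset_index())

# attach the store mean to every validation row and measure the error
baseline = valid.merge(store_mean, how='left', on='store')
mae = (baseline['weekly_sales'] - baseline['predictions']).abs().mean()

print(baseline)
print(mae)
```

Any model that cannot beat this error is not adding value over a simple average.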

In [None]:
# Applying selected features by boruta on train and validation datasets
x_train = X_train[ cols_selected_boruta ].copy() #selecting only the columns selected by boruta
x_validation = X_validation[ cols_selected_boruta ].copy()

# Time Series Data Preparation for cross-validation
x_training = df6[ cols_selected_boruta_full ].copy()

### 11.1 Average Model

In [None]:
aux1 = x_validation.copy()
aux1['weekly_sales'] = y_validation.copy()

# Predictions
aux2 = aux1[['store', 'weekly_sales']].groupby('store').mean().reset_index().rename(columns = {'weekly_sales': 'predictions'})
aux1 = pd.merge( aux1, aux2, how= 'left', on='store')
yhat_baseline = aux1['predictions']

# Performance
baseline_result = ml_error( aux1 ,'Average Model', np.expm1( y_validation ), np.expm1( yhat_baseline ))
baseline_result

### 11.2 Linear Regression Model

In [None]:
# Model
lr = LinearRegression().fit(x_train, y_train)

# Prediction 
yhat_lr = lr.predict( x_validation )

# Performance
lr_result = ml_error( x_validation,'Linear Regression', np.expm1(y_validation), np.expm1(yhat_lr))
lr_result

#### Linear Regression Model - Cross Validation

In [None]:
lr_result_cv = cross_validation( x_training, 5, 'Linear Regression', lr, verbose=False )
lr_result_cv

### 11.3 Linear Regression Regularized Model - Lasso

In [None]:
# model
lrr = Lasso( alpha=0.01 ).fit( x_train, y_train )

# prediction
yhat_lrr = lrr.predict( x_validation )

# performance
lrr_result = ml_error( x_validation,'Linear Regression - Lasso', np.expm1( y_validation ), np.expm1( yhat_lrr ) )
lrr_result

#### Linear Regression Regularized Model - Lasso - Cross Validation

In [None]:
lrr_result_cv = cross_validation( x_training, 5, 'Linear Regression Regularized Model - Lasso', lrr, verbose=False )
lrr_result_cv

### 11.4 Random Forest Regressor

In [None]:
# model
rf = RandomForestRegressor( n_estimators = 100, n_jobs =-1, random_state=7 ).fit( x_train, y_train )

# prediction
yhat_rf = rf.predict( x_validation )

# performance
rf_result = ml_error( x_validation,'Random Forest Regressor', np.expm1( y_validation ), np.expm1( yhat_rf ) )
rf_result

#### Random Forest Regressor - Cross Validation

In [None]:
rf_result_cv = cross_validation( x_training, 5, 'Random Forest Regressor', rf, verbose=False )
rf_result_cv

### 11.5 XGBoost Regressor

In [None]:
# model
model_xgb = xgb.XGBRegressor( objective='reg:squarederror',
                              n_estimators = 100, random_state=7).fit( x_train, y_train )

# prediction
yhat_xgb = model_xgb.predict( x_validation )

# performance
xgb_result = ml_error(x_validation ,'XGBoost Regressor', np.expm1( y_validation ), np.expm1( yhat_xgb ) )
xgb_result

#### XGBoost Regressor - Cross Validation

In [None]:
xgb_result_cv = cross_validation( x_training, 5, 'XGBoost Regressor', model_xgb, verbose=False )
xgb_result_cv

### 11.6 LightGBM Regressor

In [None]:
# model
model_lgbm = lgbm.LGBMRegressor(n_estimators = 100, n_jobs =-1, random_state=7).fit( x_train, y_train )

# prediction
yhat_lgbm = model_lgbm.predict( x_validation )

# performance
lgbm_result = ml_error(x_validation ,'LightGBM Regressor', np.expm1( y_validation ), np.expm1( yhat_lgbm ) )
lgbm_result

#### LightGBM Regressor - Cross Validation

In [None]:
lgbm_result_cv = cross_validation( x_training, 5, 'LightGBM Regressor', model_lgbm, verbose=False )
lgbm_result_cv

### 11.7 Compare Model's Performance

#### 11.7.1 Single Performance - 1 fold

In [None]:
results = pd.concat( [baseline_result, lr_result, lrr_result, rf_result, xgb_result, lgbm_result ] ).set_index('Model Name')
results.sort_values('RMSE')

#### 11.7.2 Real Performance - Cross Validation - 5 folds

In [None]:
results_cv = pd.concat([lr_result_cv , lrr_result_cv , rf_result_cv, xgb_result_cv, lgbm_result_cv]).set_index('Model Name')
results_cv

- The chosen model was the one with the best cross-validation performance -> **Random Forest Regressor**
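
The `cross_validation` helper used above is defined outside this section, so its exact logic is not shown here. As an assumption about the general idea, the sketch below builds time-ordered folds in an expanding-window style, where every validation block comes strictly after its training rows. `time_series_folds`, `weeks_per_fold` and the toy data are hypothetical.

```python
import pandas as pd

def time_series_folds(df, kfold, weeks_per_fold):
    """Yield (train, valid) pairs; each validation block lies after its training rows."""
    last_date = df['date'].max()
    for k in range(kfold, 0, -1):
        valid_start = last_date - pd.Timedelta(weeks=k * weeks_per_fold)
        valid_end = last_date - pd.Timedelta(weeks=(k - 1) * weeks_per_fold)
        train = df[df['date'] <= valid_start]
        valid = df[(df['date'] > valid_start) & (df['date'] <= valid_end)]
        yield train, valid

# hypothetical data: 12 consecutive weeks
toy = pd.DataFrame({'date': pd.date_range('2012-01-06', periods=12, freq='W-FRI'),
                    'weekly_sales': range(12)})

folds = list(time_series_folds(toy, kfold=3, weeks_per_fold=2))
for train, valid in folds:
    print(len(train), 'train rows ->', len(valid), 'validation rows')
```

A random K-fold split would mix future weeks into the training data, which is why a time-aware scheme is preferred for forecasting problems.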

In [None]:
# Blocked time series k fold Cross-validation strategy used in the models
Image(home_kaggle + 'images2/blocked-time-series-kfold.png')

## 12 HYPERPARAMETERS FINE TUNING
<a id="section-twelve"></a>

### 12.1 Random Search

In [None]:
# param = {
#     'n_estimators': [300, 400] ,
#     'max_features': ['auto'],
#     'max_depth': [21,25,31],
#     'min_samples_split':[5, 7] ,
#     'min_samples_leaf': [1, 2, 5],
#         }

# MAX_EVAL = 5; # number of iterations

In [None]:
# final_result = pd.DataFrame()

# for i in range(MAX_EVAL):
#     # choose values for parameters randomly
#     hp = { k: random.sample(v, 1)[0] for k, v in param.items() }
#     print(hp)
    
#     # model
#     model_rf = RandomForestRegressor(n_estimators = hp['n_estimators'],
#                                   max_features = hp['max_features'],
#                                   max_depth = hp['max_depth'],
#                                   min_samples_split = hp['min_samples_split'],
#                                   min_samples_leaf = hp['min_samples_leaf'],
#                                   random_state = 7,
#                                   n_jobs = -1)
    
#     # performance
#     result = cross_validation( x_training, 5, 'Random Forest Regressor', model_rf, verbose = True)
#     final_result = pd.concat([final_result, result])

# final_result

### 12.2 Final Model

In [None]:
param_tuned = {
    'n_estimators':300,
    'max_features': 1.0,  # all features; equivalent to the former 'auto' for regressors
    'min_samples_split': 5,
    'min_samples_leaf': 1 ,
        }

In [None]:
# model
model_rf_tuned = RandomForestRegressor(n_estimators = param_tuned['n_estimators'],
                                  max_features      = param_tuned['max_features'],
                                  min_samples_split = param_tuned['min_samples_split'],
                                  min_samples_leaf  = param_tuned['min_samples_leaf'],
                                  n_jobs = -1,
                                  random_state = 7).fit(x_train, y_train)

#prediction
yhat_rf_tuned = model_rf_tuned.predict(x_validation)

# performance
rf_tuned_result = ml_error( x_validation,'Random Forest Regressor', np.expm1(y_validation) , np.expm1(yhat_rf_tuned))
rf_tuned_result

## 13 ERROR INTERPRETATION
<a id="section-thirteen"></a>

In [None]:
# selecting validation dataframe to evaluate error
df13 = X_validation[cols_selected_boruta_full].copy()

# revert the log1p transformation to get sales back on the original scale
df13['weekly_sales'] = np.expm1(df13['weekly_sales'])
df13['predictions'] = np.expm1(yhat_rf_tuned)

In [None]:
# checking new error features
df13.head()

### 13.1 Business Performance - Store Granularity

- In this section we evaluate the business performance for each store, looking at the sales for the next 22 weeks.
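
The per-store aggregation boils down to summing the predictions and widening that sum by the store's mean absolute error. A compact sketch on hypothetical data, using a single `groupby().agg()` call instead of separate merges:

```python
import pandas as pd

# hypothetical observed sales and predictions for two stores
toy = pd.DataFrame({
    'store': [1, 1, 2, 2],
    'weekly_sales': [100.0, 120.0, 50.0, 70.0],
    'predictions': [110.0, 110.0, 40.0, 60.0],
})
toy['abs_error'] = (toy['weekly_sales'] - toy['predictions']).abs()

# total predicted sales and mean absolute error per store
per_store = toy.groupby('store').agg(predictions=('predictions', 'sum'),
                                     MAE=('abs_error', 'mean')).reset_index()

# scenarios: predicted total minus / plus the store's MAE
per_store['worst_scenario'] = per_store['predictions'] - per_store['MAE']
per_store['best_scenario'] = per_store['predictions'] + per_store['MAE']

print(per_store)
```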

In [None]:
# sum of prediction
df131 = df13[['store', 'predictions']].groupby('store').sum().reset_index()

# MAE, MAPE and WMAE
df13_aux1 = df13[['store', 'weekly_sales', 'predictions']].groupby('store').apply(lambda x: mean_absolute_error(x['weekly_sales'], x['predictions']) ).reset_index().rename(columns = {0: 'MAE'})

df13_aux2 = df13[['store', 'weekly_sales', 'predictions']].groupby('store').apply(lambda x: mean_absolute_percentage_error(x['weekly_sales'], x['predictions']) ).reset_index().rename(columns = {0: 'MAPE'})

df13_aux3 = df13[['store', 'weekly_sales', 'predictions']].groupby('store').apply(lambda x: weighted_mean_absolute_error(df13 ,x['weekly_sales'], x['predictions']) ).reset_index().rename(columns = {0: 'WMAE'})


# merge
df13_aux4 = pd.merge(df13_aux1, df13_aux2, how = 'inner', on = 'store')
df13_aux5 = pd.merge(df13_aux4, df13_aux3, how = 'inner', on = 'store')

df132 = pd.merge(df131, df13_aux5, how = 'inner', on = 'store')

# Scenarios

df132['worst_scenario'] = df132['predictions'] - df132['MAE']
df132['best_scenario'] = df132['predictions'] + df132['MAE']


df132 = df132[['store', 'predictions', 'worst_scenario', 'best_scenario', 'MAE', 'MAPE', 'WMAE']]

In [None]:
# stores with the lowest WMAE (best predictions)
df132.sort_values('WMAE', ascending = True).head().style.format({'predictions': '${0:,.2f}', 'worst_scenario': '${:,.2f}',  'best_scenario': '${0:,.2f}', 'MAE': '${0:,.2f}', 'MAPE': '{:.2%}'})

In [None]:
# stores with the highest WMAE (worst predictions)
df132.sort_values('WMAE', ascending = False).head().style.format({'predictions': '${0:,.2f}', 'worst_scenario': '${:,.2f}',  'best_scenario': '${0:,.2f}', 'MAE': '${0:,.2f}', 'MAPE': '{:.2%}'})

In [None]:
# scatterplot
sns.scatterplot(x = 'store', y = 'WMAE', data = df132 );

### 13.2 Business Performance - Department Granularity

In [None]:
# sum of prediction
df10 = df13[['store', 'dept', 'predictions']].groupby(['store','dept']).sum().reset_index()

# MAE, MAPE and WMAE
df10_aux1 = df13[['store', 'dept', 'weekly_sales', 'predictions']].groupby(['store','dept']).apply(lambda x: mean_absolute_error(x['weekly_sales'], x['predictions']) ).reset_index().rename(columns = {0: 'MAE'})

df10_aux2 = df13[['store', 'dept', 'weekly_sales', 'predictions']].groupby(['store','dept']).apply(lambda x: mean_absolute_percentage_error(x['weekly_sales'], x['predictions']) ).reset_index().rename(columns = {0: 'MAPE'})

df10_aux3 = df13[['store', 'dept', 'weekly_sales', 'predictions']].groupby(['store','dept']).apply(lambda x: weighted_mean_absolute_error(df13 ,x['weekly_sales'], x['predictions']) ).reset_index().rename(columns = {0: 'WMAE'})


# merge
df10_aux4 = pd.merge(df10_aux1, df10_aux2, how = 'inner', on = ['store','dept'])
df10_aux5 = pd.merge(df10_aux4, df10_aux3, how = 'inner', on = ['store','dept'])

df12 = pd.merge(df10, df10_aux5, how = 'inner', on = ['store','dept'])

# Scenarios

df12['worst_scenario'] = df12['predictions'] - df12['MAE']
df12['best_scenario'] = df12['predictions'] + df12['MAE']


df12 = df12[['store','dept', 'predictions', 'worst_scenario', 'best_scenario', 'MAE', 'MAPE', 'WMAE']]

In [None]:
# scatterplot
sns.scatterplot(x = 'store', y = 'WMAE', data = df12 );

### 13.3 Total Performance

In [None]:
# Total Walmart sales predictions for the next 22 weeks
df133 = df132[['predictions', 'worst_scenario', 'best_scenario']].apply( lambda x: np.sum( x ), axis=0 ).reset_index().rename( columns={'index': 'Scenario', 0:'Values'} )
df133['Values'] = df133['Values'].map( '${:,.2f}'.format )
df133

### 13.4 Machine Learning Performance 

In [None]:
# defining error and error rate
df13['error'] = df13['weekly_sales'] - df13['predictions']
df13['error_rate'] = df13['predictions'] / df13['weekly_sales']

In [None]:
# ML error analysis
plt.subplot( 2, 2, 1 )
sns.lineplot( x='date', y='weekly_sales', data=df13, label='SALES' )
sns.lineplot( x='date', y='predictions', data=df13, label='PREDICTIONS' )

plt.subplot( 2, 2, 2 )
sns.lineplot( x='date', y='error_rate', data=df13 )
plt.axhline( 1, linestyle='--', color = 'red')

plt.subplot( 2, 2, 3 )
# histplot replaces the deprecated distplot (requires seaborn >= 0.11)
sns.histplot( df13['error'], kde=True, stat='density' )

plt.subplot( 2, 2, 4 )
sns.scatterplot( x=df13['predictions'], y=df13['error'] );
plt.style.use('tableau-colorblind10');

- The first graph shows that the predictions (orange line) tend to follow the observed sales (blue line) over the last 22 weeks, which indicates that the model captures the overall sales pattern well.

- The second graph shows the error rate over the same period. The error rate is the ratio between predicted and observed values, so a value of 1 (dashed red line) is a perfect prediction, values above 1 are overpredictions, and values below 1 are underpredictions. The model does show some high rates, so it could perform better.

- A well-behaved regression model should have residuals that are roughly normally distributed with mean zero. In the third graph, the errors are centered around zero and their distribution resembles a bell-shaped curve.

- The last graph is a scatterplot of the predictions against the error for each weekly sales record. Ideally, all data points would be concentrated within a "tube", which represents low error variance across all values the sales prediction can assume. In our case some predictions have high error values, so we can put more stress on the model or create better features to improve performance.
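
The visual checks above can be complemented with a few numbers. The snippet below is a minimal sketch on a hypothetical toy dataframe (`toy` stands in for `df13`; its values are made up for illustration). It computes the same `error` and `error_rate` columns and then summarizes how centered the residuals are and how often the model overpredicts.

```python
import pandas as pd

# hypothetical toy data standing in for df13
toy = pd.DataFrame({
    'weekly_sales': [1000.0, 1500.0, 800.0, 1200.0],
    'predictions':  [ 950.0, 1600.0, 820.0, 1180.0],
})

# same definitions as in the notebook
toy['error'] = toy['weekly_sales'] - toy['predictions']
toy['error_rate'] = toy['predictions'] / toy['weekly_sales']

summary = {
    'mean_error': toy['error'].mean(),   # close to 0 means little systematic bias
    'std_error': toy['error'].std(),     # spread of the residuals
    'share_overpredicted': (toy['error_rate'] > 1).mean(),
}
print(summary)
```

A mean error far from zero would suggest a systematic bias that the histogram alone can make hard to quantify.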

## 14 MODEL SUBMISSION
<a id="section-fourteen"></a>

In [None]:
model_rf_tuned

In [None]:
# Save Trained Model
# pickle.dump(model_rf_tuned, open(home_path + 'model/model_walmart.pkl', 'wb'))

### 14.1 Loading Model and Scalers

In [None]:
#loading scalers
#Robust Scaler 
temperature_scaler  = pickle.load(open(home_kaggle + 'parameters/temperature_scaler.pkl', 'rb'))
mark_down1_scaler   = pickle.load(open(home_kaggle + 'parameters/mark_down1_scaler.pkl', 'rb'))
mark_down2_scaler   = pickle.load(open(home_kaggle + 'parameters/mark_down2_scaler.pkl', 'rb'))
mark_down3_scaler   = pickle.load(open(home_kaggle + 'parameters/mark_down3_scaler.pkl', 'rb'))
mark_down4_scaler   = pickle.load(open(home_kaggle + 'parameters/mark_down4_scaler.pkl', 'rb'))
mark_down5_scaler   = pickle.load(open(home_kaggle + 'parameters/mark_down5_scaler.pkl', 'rb'))
unemployment_scaler = pickle.load(open(home_kaggle + 'parameters/unemployment_scaler.pkl', 'rb'))

#MinMax Scaler 
fuel_price_scaler   = pickle.load(open(home_kaggle + 'parameters/fuel_price_scaler.pkl', 'rb'))
cpi_scaler          = pickle.load(open(home_kaggle + 'parameters/cpi_scaler.pkl', 'rb'))
size_scaler         = pickle.load(open(home_kaggle + 'parameters/size_scaler.pkl', 'rb'))
year_scaler         = pickle.load(open(home_kaggle + 'parameters/year_scaler.pkl', 'rb'))

#Label encoder
type_scaler         = pickle.load(open(home_kaggle + 'parameters/type_scaler.pkl', 'rb'))

In [None]:
# loading trained model
model = pickle.load(open(home_kaggle + 'models/model_walmart.pkl', 'rb'))
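
The loading cells above repeat the same `pickle.load(open(...))` pattern and never close the file handles. As a possible alternative, the sketch below wraps the pattern in two small helpers. The names `load_pickle` and `load_scalers` are hypothetical, and the `<name>_scaler.pkl` naming is assumed from the file names used above. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import pickle
from pathlib import Path

def load_pickle(path):
    """Load one pickled object and close the file handle afterwards."""
    with open(path, 'rb') as f:
        return pickle.load(f)

def load_scalers(param_dir, names):
    """Return {name: object} for each '<name>_scaler.pkl' inside param_dir."""
    param_dir = Path(param_dir)
    return {name: load_pickle(param_dir / f'{name}_scaler.pkl') for name in names}

# usage sketch (paths follow the notebook's layout, shown as an assumption):
# scalers = load_scalers(home_kaggle + 'parameters', ['temperature', 'cpi', 'size'])
# temperature_scaler = scalers['temperature']
```

Keeping the scalers in one dictionary also makes it easier to loop over them when transforming several columns.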

### 14.2 Data ETL

In [None]:
# defining data pipeline
def data_merge(df_test, df_features, df_stores):
    # merge datasets into one
    df_store_feature = df_features.merge(df_stores, on = 'Store', how = 'left')
    
    # main dataframe for exploring
    df_test = df_test.merge(df_store_feature, on = ['Store', 'Date', 'IsHoliday'], how = 'left').sort_values(['Store','Dept','Date'])
    return df_test

def data_cleaning(df1):
    ### Rename Columns
    cols_old = df1.columns
    snakecase = lambda x: inflection.underscore(x)
    
    cols_new = list( map( snakecase, cols_old ) )
    #Rename Columns
    df1.columns = cols_new
    
    # replacing NAs with 0
    df1 = df1.fillna(0)

    #converting feature 'date' to datetime
    df1['date'] = pd.to_datetime( df1[ 'date' ] )
    
    return df1

def data_feature_engineering(df2):
    # year
    df2['year'] = df2['date'].dt.year

    # month
    df2['month'] = df2['date'].dt.month
    
    # day
    df2['day'] = df2['date'].dt.day
    
    # week of year
    df2['week_of_year'] = df2['date'].dt.isocalendar().week.astype('int64')
    
    # year week
    df2['year_week'] = df2['date'].dt.strftime( '%Y-%W' )
    
    # year quarter
    df2['quarter'] = df2['date'].dt.to_period('Q')
    
    return df2

def data_preparation(df5):
    # Applying all transformations on validation dataset
    #Validation dataset features transform - Robust Scaler
    df5['temperature'] = temperature_scaler.transform( df5[['temperature']].values ) 
    df5['mark_down1']  = mark_down1_scaler.transform( df5[['mark_down1']].values ) 
    df5['mark_down2']  = mark_down2_scaler.transform( df5[['mark_down2']].values ) 
    df5['mark_down3']  = mark_down3_scaler.transform( df5[['mark_down3']].values ) 
    df5['mark_down4']  = mark_down4_scaler.transform( df5[['mark_down4']].values ) 
    df5['mark_down5']  = mark_down5_scaler.transform( df5[['mark_down5']].values ) 
    df5['unemployment']= unemployment_scaler.transform( df5[['unemployment']].values ) 
    
    ##Validation dataset features transform - MinMaxScaler
    df5['fuel_price'] = fuel_price_scaler.transform( df5[['fuel_price']].values )
    df5['cpi']        = cpi_scaler.transform( df5[['cpi']].values )
    df5['size']       = size_scaler.transform( df5[['size']].values )
    df5['year']       = year_scaler.transform( df5[['year']].values )
    
    ##Validation dataset features transform - Label Encoder
    df5['type'] = type_scaler.transform( df5['type'] )
    
    # is_holiday
    df5['is_holiday'] = df5['is_holiday'].apply(lambda x: 1 if x == True else 0)
    
    ##Validation dataset features transform - Natural Transformations
    # month
    df5['month_sin'] = df5['month'].apply( lambda x: np.sin( x * ( 2. * np.pi/12 ) ) )
    df5['month_cos'] = df5['month'].apply( lambda x: np.cos( x * ( 2. * np.pi/12 ) ) )
    
    # day 
    df5['day_sin'] = df5['day'].apply( lambda x: np.sin( x * ( 2. * np.pi/30 ) ) )
    df5['day_cos'] = df5['day'].apply( lambda x: np.cos( x * ( 2. * np.pi/30 ) ) )
    
    # week_of_year
    df5['week_of_year_sin'] = df5['week_of_year'].apply( lambda x: np.sin( x * ( 2. * np.pi/52 ) ) )
    df5['week_of_year_cos'] = df5['week_of_year'].apply( lambda x: np.cos( x * ( 2. * np.pi/52 ) ) )
    
    cols_selected = ['store','dept','is_holiday','type','cpi','size','month_cos','week_of_year_cos']
    return df5[cols_selected]

def get_prediction(model, original_data, test_data):
    # prediction
    pred = model.predict(test_data)
    
    # add the predictions to the original data so the resulting table is easy to read
    # (expm1 converts the predictions back from the log1p scale)
    original_data['weekly_sales'] = np.expm1(pred)
    
    # creating id column
    original_data['id'] = original_data['Store'].astype(str) + '_' +  original_data['Dept'].astype(str) + '_' +  original_data['Date'].astype(str)
    original_data = original_data[['id', 'weekly_sales']].copy()
    return original_data
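
The sine and cosine columns built in `data_preparation` encode calendar features as points on a circle, so that the end of a cycle sits next to its start. The snippet below is a small sketch of why this helps. The helper `cyclical_encode` is a hypothetical name introduced here for illustration; it applies the same formula as the lambdas above.

```python
import numpy as np

def cyclical_encode(values, period):
    """Map values with the given period onto the unit circle (sin, cos)."""
    values = np.asarray(values, dtype=float)
    angle = 2.0 * np.pi * values / period
    return np.sin(angle), np.cos(angle)

months = np.array([1, 6, 12])  # January, June, December
month_sin, month_cos = cyclical_encode(months, 12)

# distance in the encoded space: December-January vs June-January
dist_dec_jan = np.hypot(month_sin[2] - month_sin[0], month_cos[2] - month_cos[0])
dist_jun_jan = np.hypot(month_sin[1] - month_sin[0], month_cos[1] - month_cos[0])
print(dist_dec_jan, dist_jun_jan)
```

With the raw month numbers, December (12) and January (1) look far apart. In the encoded space they are neighbors, while June and January remain far apart.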

In [None]:
#Applying ETL into test dataset
df1_test = data_merge(df_test_raw, df_features, df_stores)
df2_test = data_cleaning(df1_test)
df3_test = data_feature_engineering(df2_test)
df4_test = data_preparation(df3_test)
df_submission = get_prediction(model, df_test_raw, df4_test)

In [None]:
# Final result sample
df_submission.head()

In [None]:
# final result to csv
df_submission.to_csv('./submission.csv',index=False )
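
Before uploading the file, a few quick checks can catch common problems. The sketch below rebuilds a tiny submission from hypothetical rows (`raw` and `log_pred` are made-up stand-ins for the raw test data and the model output). It assumes, based on the `np.expm1` call in `get_prediction`, that the model predicts on the log1p scale.

```python
import numpy as np
import pandas as pd

# hypothetical rows standing in for the raw test data
raw = pd.DataFrame({
    'Store': [1, 1],
    'Dept': [1, 2],
    'Date': ['2012-11-02', '2012-11-02'],
})
# assumption: the model output is on the log1p scale
log_pred = np.log1p(np.array([24000.0, 50000.0]))

submission = pd.DataFrame({
    'id': raw['Store'].astype(str) + '_' + raw['Dept'].astype(str) + '_' + raw['Date'].astype(str),
    'weekly_sales': np.expm1(log_pred),  # back to the original sales scale
})

checks = {
    'one_row_per_input': len(submission) == len(raw),
    'unique_ids': bool(submission['id'].is_unique),
    'no_missing': not submission['weekly_sales'].isna().any(),
}
print(checks)
```

The same three checks can be run on `df_submission` directly. A failing `unique_ids` check would point to duplicated rows introduced during the merges.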

## 15 CONCLUSION

- In this project, all the steps needed to take a complete Data Science project to production were carried out. In one cycle of the CRISP-DM project management methodology, the Random Forest algorithm reached satisfactory performance in predicting sales revenue for Walmart stores and their departments up to 22 weeks in advance, and the exploratory data analysis produced useful business insights. The project therefore met its goal of finding a suitable solution for the company's stakeholders to access sales predictions through a CSV file that could later be deployed.

## 16 NEXT STEPS

- In the next CRISP-DM cycle, we should create specific models for the stores where predictions were most difficult.
- Create new features for these stores.
- Put more stress on the machine learning models.
- Improve the predictive model by using ensembling to combine models into a better one.
- Create an anomaly detection step to better understand outliers and sales during holidays and the weeks before them.
- Perform a time series analysis.
- Deploy the model so it can be accessed from a smartphone.
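
As an illustration of the ensembling idea listed above, the sketch below blends the predictions of two models with a weighted average. The function `blend_predictions`, the variable names, and all values are hypothetical. In practice the weights would be chosen on a validation set, not fixed by hand.

```python
import numpy as np

def blend_predictions(predictions, weights=None):
    """Weighted average of several models' predictions (one row per model)."""
    predictions = np.asarray(predictions, dtype=float)
    return np.average(predictions, axis=0, weights=weights)

# hypothetical predictions from two different models for three weeks
pred_model_a = [100.0, 200.0, 300.0]
pred_model_b = [110.0, 190.0, 330.0]

blended = blend_predictions([pred_model_a, pred_model_b], weights=[0.75, 0.25])
print(blended)
```

Averaging tends to help most when the individual models make different kinds of errors, which is why combining different algorithm families is a common starting point.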