<img src="https://www.bestdesigns.co/uploads/inspiration_images/4350/990__1511457498_404_walmart.png" alt="WALMART LOGO" />

# Walmart : predict weekly sales

## Company's Description 📇

Walmart Inc. is an American multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores from the United States, headquartered in Bentonville, Arkansas. The company was founded by Sam Walton in 1962.

## Project 🚧

Walmart's marketing service has asked you to build a machine learning model able to estimate the weekly sales in their stores, with the best precision possible on the predictions made. Such a model would help them understand better how the sales are influenced by economic indicators, and might be used to plan future marketing campaigns.

## Goals 🎯

The project can be divided into three steps:

- Part 1 : perform an EDA and all the preprocessing needed to prepare the data for machine learning
- Part 2 : train a **linear regression model** (baseline)
- Part 3 : avoid overfitting by training a **regularized regression model**

## Scope of this project 🖼️

For this project, you'll work with a dataset that contains information about weekly sales achieved by different Walmart stores, and other variables such as the unemployment rate or the fuel price, that might be useful for predicting the amount of sales. The dataset has been taken from a Kaggle competition, but we made some changes compared to the original data. Please make sure that you're using **our** custom dataset (available on JULIE). 🤓

## Deliverable 📬

To complete this project, your team should: 

- Create some visualizations
- Train at least one **linear regression model** on the dataset, that predicts the amount of weekly sales as a function of the other variables
- Assess the performance of the model using a metric that is relevant for regression problems
- Interpret the coefficients of the model to identify what features are important for the prediction
- Train at least one model with **regularization (Lasso or Ridge)** to reduce overfitting
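
As a minimal sketch of the assessment step, R² and RMSE are two common regression metrics available through scikit-learn. The arrays below are made-up values for illustration, not results from this dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration only
y_true = np.array([1.5e6, 1.8e6, 0.7e6, 2.2e6])
y_pred = np.array([1.4e6, 1.9e6, 0.8e6, 2.0e6])

# R2: share of the target's variance explained by the predictions
r2 = r2_score(y_true, y_pred)
# RMSE: typical error size, expressed in the same unit as the target
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"R2   : {r2:.3f}")
print(f"RMSE : {rmse:,.0f}")
```

Computing the same metrics on both the train and the test sets makes a gap between the two (a sign of overfitting) easy to spot.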


## Helpers 🦮

Here are a few tips to help you complete this project: 

### Part 1 : EDA and data preprocessing

Start your project by exploring your dataset : create figures, compute some statistics, etc.

Then, you'll have to perform some preprocessing on the dataset. You can follow the guidelines from the *preprocessing template*. Some transformations specific to this dataset will also be needed, for example on the *Date* column, which can't be included in the model as is. Below are some hints that might help you 🤓

 #### Preprocessing to be planned with pandas

 **Drop rows where target values are missing :**
 - Here, the target variable (Y) corresponds to the column *Weekly_Sales*, which contains some missing values.
 - We never use imputation techniques on the target : it might create some bias in the predictions !
 - Therefore, we simply drop the rows of the dataset for which the value in *Weekly_Sales* is missing.
 
**Create usable features from the *Date* column :**
The *Date* column cannot be included in the model as is. You can either drop this column or create new columns containing the following numeric features : 
- *year*
- *month*
- *day*
- *day of week*
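
The date features above could be derived as in the sketch below. It assumes the dates are stored as `DD-MM-YYYY` strings (as in `18-02-2011`); the toy frame `df` is made up for illustration. Passing an explicit `format` prevents an ambiguous date such as `03-06-2011` from being read month-first.

```python
import pandas as pd

# Toy data for illustration only (not the real dataset)
df = pd.DataFrame({"Date": ["18-02-2011", "03-06-2011", None]})

# Explicit format: day first, then month, then year; missing dates become NaT
dates = pd.to_datetime(df["Date"], format="%d-%m-%Y")

df["year"] = dates.dt.year
df["month"] = dates.dt.month
df["day"] = dates.dt.day
df["dayofweek"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6

print(df)
```

Rows with a missing date end up with missing date features, which can then be dropped or imputed like any other numeric column.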

**Drop rows containing invalid values or outliers :**
In this project, a value of a numeric feature is considered an outlier if it falls outside the range $[\bar{X} - 3\sigma, \bar{X} + 3\sigma]$. This concerns the columns *Temperature*, *Fuel_Price*, *CPI* and *Unemployment*.
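
One possible way to apply this rule is sketched below on made-up data. The helper `drop_outliers` is hypothetical (not part of pandas), and keeping the rows that contain missing values for later imputation is a design choice assumed here, not a requirement of the project.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only: one extreme value and one missing value
rng = np.random.default_rng(0)
df = pd.DataFrame({"Temperature": rng.normal(loc=60.0, scale=5.0, size=50)})
df.loc[0, "Temperature"] = 500.0   # obvious outlier
df.loc[1, "Temperature"] = np.nan  # missing value, to be imputed later

def drop_outliers(data, columns, n_sigma=3):
    """Drop rows outside mean +/- n_sigma * std, keeping rows with NaN."""
    keep = pd.Series(True, index=data.index)
    for col in columns:
        mean, std = data[col].mean(), data[col].std()
        within = (data[col] - mean).abs() <= n_sigma * std
        # A comparison with NaN is False, so missing values are kept explicitly
        keep &= within | data[col].isna()
    return data.loc[keep]

cleaned = drop_outliers(df, ["Temperature"])
print(len(df), "->", len(cleaned))
```

Without the `isna()` term, every row with a missing value in one of the checked columns would be removed along with the outliers.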
 


**Target variable (Y) that we will try to predict, to be separated from the other columns** : *Weekly_Sales*

 **------------**

 #### Preprocessings to be planned with scikit-learn

 **Explanatory variables (X)**
We need to identify which columns contain categorical variables and which columns contain numerical variables, as they will be treated differently.

 - Categorical variables : Store, Holiday_Flag
 - Numerical variables : Temperature, Fuel_Price, CPI, Unemployment, Year, Month, Day, DayOfWeek
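
The two groups can be routed to different treatments with a `ColumnTransformer`. The sketch below selects columns by name on a small made-up DataFrame; the choice of imputation strategies is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data for illustration only
X = pd.DataFrame({
    "Store": [1.0, 2.0, 1.0, 3.0],
    "Holiday_Flag": [0.0, np.nan, 1.0, 0.0],
    "Temperature": [15.3, 5.8, np.nan, 21.0],
    "CPI": [214.8, 128.6, 212.4, 134.9],
})

categorical_features = ["Store", "Holiday_Flag"]
numeric_features = ["Temperature", "CPI"]

# Numeric columns: fill missing values with the median, then standardize
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
# Categorical columns: fill with the most frequent value, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(drop="first")),
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

X_prepared = preprocessor.fit_transform(X)
print(X_prepared.shape)
```

Selecting columns by name rather than by position keeps the preprocessing valid if the column order changes, but it requires passing a DataFrame (not a NumPy array) to `fit_transform`.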

### Part 2 : Baseline model (linear regression)
Once you've trained a first model, don't forget to assess its performance on the train and test sets. Are you satisfied with the results ?
Besides, it would be interesting to analyze the values of the model's coefficients to know what features are important for the prediction. To do so, the `.coef_` attribute of scikit-learn's LinearRegression class might be useful. Please refer to the following link for more information 😉 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

### Part 3 : Fight overfitting
In this last part, you'll have to train a **regularized linear regression model**. You'll find below some useful classes in scikit-learn's documentation :
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso

**Bonus question**

In regularized regression models, there's a hyperparameter called *the regularization strength* that can be fine-tuned to get the best generalized predictions on a given dataset. This fine-tuning can be done thanks to scikit-learn's GridSearchCV class : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Also, you'll find here some examples of how to use GridSearchCV together with Ridge or Lasso models : https://alfurka.github.io/2018-11-18-grid-search/
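
As a minimal sketch of this tuning step, the example below searches a log-spaced grid of `alpha` values for a Ridge model. The data is synthetic and the grid bounds are an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data for illustration only
X, y = make_regression(n_samples=80, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Log-spaced grid of regularization strengths (an arbitrary choice)
param_grid = {"alpha": np.logspace(-3, 3, 13)}

# 4-fold cross-validation on the training set only
search = GridSearchCV(Ridge(), param_grid, cv=4, scoring="r2")
search.fit(X_train, y_train)

print("Best alpha:", search.best_params_["alpha"])
print("Mean CV R2:", search.best_score_)
print("Test R2:", search.score(X_test, y_test))
```

A log-spaced grid covers several orders of magnitude with few candidates, which is usually more informative than a fine linear grid for a regularization strength.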

In [1]:

#######
# Base lib
#
import pandas as pd
import numpy as np
import time
###
# machine learning libs
#####
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import  OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import r2_score, mean_squared_error
##
# graphics
#####
import seaborn as sns

from IPython.display import display

import warnings
# Silence deprecation warnings to keep the notebook output readable
warnings.filterwarnings("ignore", category=DeprecationWarning)

# The dataset is small (150 rows), so let pandas display all of them
pd.set_option('display.max_rows',150)

 ## File reading and basic exploration 📰📰

In [2]:
# Import dataset
print("Loading dataset...", end=" ")
dataset = pd.read_csv("Walmart_Store_sales.csv")
print("...Done.")

# Basic stats
print("Number of rows : {}".format(dataset.shape[0]))
print()

print("Display of dataset: ")
display(dataset.head())
print()

print("Percentage of missing values: ")
display(100*dataset.isnull().sum()/dataset.shape[0])

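# Correlation matrix of the numeric columns (Date is still a string here).
# Styler.set_precision was removed in pandas 2.0; Styler.format(precision=...) replaces it.
dataset.corr(numeric_only=True).style.background_gradient(cmap='coolwarm').format(precision=2)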
#pd.plotting.scatter_matrix(dataset, figsize  = [15, 15])

Loading dataset... ...Done.
Number of rows : 150

Display of dataset: 


| | Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment |
|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 18-02-2011 | 1572117.54 | NaN | 59.61 | 3.045 | 214.777523 | 6.858 |
| 1 | 13.0 | 25-03-2011 | 1807545.43 | 0.0 | 42.38 | 3.435 | 128.616064 | 7.47 |
| 2 | 17.0 | 27-07-2012 | NaN | 0.0 | NaN | NaN | 130.719581 | 5.936 |
| 3 | 11.0 | NaN | 1244390.03 | 0.0 | 84.57 | NaN | 214.556497 | 7.346 |
| 4 | 6.0 | 28-05-2010 | 1644470.66 | 0.0 | 78.89 | 2.759 | 212.412888 | 7.092 |



Percentage of missing values: 


Store            0.000000
Date            12.000000
Weekly_Sales     9.333333
Holiday_Flag     8.000000
Temperature     12.000000
Fuel_Price       9.333333
CPI              8.000000
Unemployment    10.000000
dtype: float64



| | Store | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment |
|---|---|---|---|---|---|---|---|
| Store | 1.0 | 0.12 | -0.03 | -0.26 | 0.18 | -0.59 | 0.22 |
| Weekly_Sales | 0.12 | 1.0 | 0.04 | -0.17 | -0.02 | -0.29 | 0.06 |
| Holiday_Flag | -0.03 | 0.04 | 1.0 | -0.19 | -0.12 | 0.17 | 0.1 |
| Temperature | -0.26 | -0.17 | -0.19 | 1.0 | 0.05 | 0.14 | -0.03 |
| Fuel_Price | 0.18 | -0.02 | -0.12 | 0.05 | 1.0 | -0.16 | 0.09 |
| CPI | -0.59 | -0.29 | 0.17 | 0.14 | -0.16 | 1.0 | -0.35 |
| Unemployment | 0.22 | 0.06 | 0.1 | -0.03 | 0.09 | -0.35 | 1.0 |


In [3]:

# Drop rows with a missing value in Weekly_Sales, the target variable
dataset = dataset.dropna(subset = ['Weekly_Sales'])

# The dataset is small, so let's try not to remove too many rows (the order of the steps matters).
# Removing outliers is mandatory for this project.
# Drop rows containing outliers (using boolean masks)
outliers_col = ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment']

before = dataset.shape[0]
print('About to remove outliers : ... ', end= " ")

for col in outliers_col:
    # Comparisons with NaN evaluate to False, so rows with a missing value
    # in this column are dropped together with the outliers
    to_keep = np.abs(dataset[col]-dataset[col].mean()) <= (3*dataset[col].std())
    dataset = dataset.loc[to_keep,:]

print('Done removing outlier, {} rows removed.'.format(before - dataset.shape[0]))

print("Percentage of missing values: ")
display(100*dataset.isnull().sum()/dataset.shape[0])

About to remove outliers : ...  Done removing outlier, 46 rows removed.
Percentage of missing values: 


Store            0.000000
Date            11.111111
Weekly_Sales     0.000000
Holiday_Flag    11.111111
Temperature      0.000000
Fuel_Price       0.000000
CPI              0.000000
Unemployment     0.000000
dtype: float64



In [4]:
print ('shape before dropna on date', dataset.shape)
dataset = dataset.dropna(subset = ['Date'])
print ('shape after dropna on date', dataset.shape)

# Rows with a missing Date were dropped above; the date is now split into numeric features.
# Note: year turns out to be strongly correlated with Fuel_Price.
# Caution: dates are DD-MM-YYYY strings; without an explicit format (e.g. format='%d-%m-%Y'),
# ambiguous dates can be parsed month-first.
dataset['year'] = pd.DatetimeIndex(dataset['Date']).year
dataset['month'] = pd.DatetimeIndex(dataset['Date']).month
dataset['day'] = pd.DatetimeIndex(dataset['Date']).day.astype(int)
dataset['dayofweek'] = pd.DatetimeIndex(dataset['Date']).dayofweek
# Convert Temperature from Fahrenheit to Celsius
dataset['Temperature'] = (dataset['Temperature'] - 32) * 5/9

# Drop Date now that it has been split into numeric features
# (Store is kept and will be treated as a categorical feature)
dataset = dataset.drop(columns=['Date'])
dataset.head()

shape before dropna on date (90, 8)
shape after dropna on date (80, 8)


| | Store | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment | year | month | day | dayofweek |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 1572117.54 | NaN | 15.338889 | 3.045 | 214.777523 | 6.858 | 2011 | 2 | 18 | 4 |
| 1 | 13.0 | 1807545.43 | 0.0 | 5.766667 | 3.435 | 128.616064 | 7.47 | 2011 | 3 | 25 | 4 |
| 4 | 6.0 | 1644470.66 | 0.0 | 26.05 | 2.759 | 212.412888 | 7.092 | 2010 | 5 | 28 | 4 |
| 6 | 15.0 | 695396.19 | 0.0 | 21.0 | 4.069 | 134.855161 | 7.658 | 2011 | 3 | 6 | 6 |
| 7 | 20.0 | 2203523.2 | 0.0 | 4.405556 | 3.617 | 213.023622 | 6.961 | 2012 | 3 | 2 | 4 |


In [5]:
# Styler.set_precision was removed in pandas 2.0; Styler.format(precision=...) replaces it
dataset.corr().style.background_gradient(cmap='coolwarm').format(precision=2)



| | Store | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment | year | month | day | dayofweek |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Store | 1.0 | 0.14 | -0.05 | -0.33 | 0.19 | -0.58 | 0.26 | 0.02 | -0.24 | 0.06 | -0.05 |
| Weekly_Sales | 0.14 | 1.0 | 0.01 | -0.08 | -0.01 | -0.42 | 0.08 | -0.05 | -0.07 | -0.04 | 0.14 |
| Holiday_Flag | -0.05 | 0.01 | 1.0 | -0.33 | -0.22 | 0.22 | 0.09 | -0.09 | 0.36 | -0.13 | -0.24 |
| Temperature | -0.33 | -0.08 | -0.33 | 1.0 | -0.09 | 0.2 | -0.25 | -0.16 | 0.01 | 0.16 | 0.19 |
| Fuel_Price | 0.19 | -0.01 | -0.22 | -0.09 | 1.0 | -0.24 | -0.04 | 0.85 | -0.24 | 0.05 | 0.02 |
| CPI | -0.58 | -0.42 | 0.22 | 0.2 | -0.24 | 1.0 | -0.09 | -0.07 | 0.21 | 0.13 | 0.01 |
| Unemployment | 0.26 | 0.08 | 0.09 | -0.25 | -0.04 | -0.09 | 1.0 | -0.13 | -0.2 | -0.07 | -0.02 |
| year | 0.02 | -0.05 | -0.09 | -0.16 | 0.85 | -0.07 | -0.13 | 1.0 | -0.18 | -0.07 | -0.13 |
| month | -0.24 | -0.07 | 0.36 | 0.01 | -0.24 | 0.21 | -0.2 | -0.18 | 1.0 | -0.06 | -0.39 |
| day | 0.06 | -0.04 | -0.13 | 0.16 | 0.05 | 0.13 | -0.07 | -0.07 | -0.06 | 1.0 | 0.24 |


In [6]:
# Separate target variable Y from features X
target_name = 'Weekly_Sales'

print("Separating labels from features...")
Y = dataset.loc[:,target_name]

# Keeping all the rest of the columns
X = dataset.loc[:,[c for c in dataset.columns if c!=target_name]]
X_cols = X.columns

# Convert pandas DataFrames to numpy arrays before using scikit-learn
print("Convert pandas DataFrames to numpy arrays...")
X = X.values
Y = Y.tolist()

print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

Separating labels from features...
Convert pandas DataFrames to numpy arrays...
Dividing into train and test sets...


In [7]:
# Create pipeline for numeric features
# Temperature, Fuel_Price, CPI, Unemployment, year, month, day, dayofweek
numeric_features = [2,3,4,5,6,7,8,9] # Positions of numeric columns in X_train/X_test
numeric_transformer = Pipeline(steps=[
    # missing values will be replaced by the column's median
    ('imputer', SimpleImputer(strategy='median')), 
    # Standardize numeric features (zero mean, unit variance)
    ('scaler', StandardScaler())
])

# Create pipeline for categorical features
# Store, Holiday_Flag
categorical_features = [0,1] # Positions of categorical columns in X_train/X_test
categorical_transformer = Pipeline(
    steps=[
    # missing values will be replaced by most frequent value
    ('imputer', SimpleImputer(strategy='most_frequent')), 
    # the first category of each feature is dropped to avoid perfectly collinear dummy columns
    ('encoder', OneHotEncoder(handle_unknown='ignore', drop='first')) 
    ])

# Use ColumnTransformer to make a preprocessor object that describes all the treatments to be done
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

In [8]:
dict_regressor = [
    {
    'classifier' : 'LinearRegression',
    'init_params': "n_jobs=-1",
    'best_params': "n_jobs=-1",
    'grid_params': "{}",   
    },
    {
    'classifier' : 'Ridge',
    'init_params': "",
    'best_params': "alpha=0.0",
    'grid_params': "{'alpha': np.arange(0,100,0.5)}",
    },
    {
    'classifier' : 'Lasso',
    'init_params': '',
    'best_params': "alpha=1e-12",
    'grid_params': "{'alpha' : [10**(-a) for a in range(100)]}",
    },
]
dfmodels = pd.DataFrame(dict_regressor)
# Result columns; float columns are initialized with 0.0 so they can hold scores and timings
dfmodels['train_r2score']=0.0
dfmodels['test_r2score']=0.0
dfmodels['train_support']=0
dfmodels['test_support']=0
dfmodels['elapsed']=0.0

In [9]:
###########
# Run best params
############################
t0_total = time.time()

for i, row in dfmodels.iterrows():
    t0 = time.time()
    classifier = eval(dfmodels["classifier"][i]+"("+dfmodels["best_params"][i]+")")
    classifier.fit(X_train, Y_train)
    Y_train_pred = classifier.predict(X_train)
    Y_test_pred = classifier.predict(X_test)
    dfmodels.loc[i, "train_r2score"] = r2_score(Y_train, Y_train_pred)
    dfmodels.loc[i, "test_r2score"] = r2_score(Y_test, Y_test_pred)
    dfmodels.loc[i, "train_support"] = len(X_train)
    dfmodels.loc[i, "test_support"] = len(X_test)
    dfmodels.loc[i, "elapsed"] = time.time() - t0

    print(dfmodels.loc[i,'classifier'], dfmodels.loc[i,'train_r2score'], dfmodels.loc[i,'test_r2score'])

print("Total test time is : ", time.time() - t0_total)

dfmodels.sort_values('test_r2score', ascending=False)

LinearRegression 0.9829071179842509 0.9636785841392872
Ridge 0.9829071179842509 0.963678584139299
Lasso 0.9803920892875131 0.9755048475960104
Total test time is :  0.019078969955444336


  model = cd_fast.enet_coordinate_descent(


| | classifier | init_params | best_params | grid_params | train_r2score | test_r2score | train_support | test_support | elapsed |
|---|---|---|---|---|---|---|---|---|---|
| 2 | Lasso | | alpha=1e-12 | `{'alpha' : [10**(-a) for a in range(100)]}` | 0.980392 | 0.975505 | 64 | 16 | 0.005705 |
| 1 | Ridge | | alpha=0.0 | `{'alpha': np.arange(0,100,0.5)}` | 0.982907 | 0.963679 | 64 | 16 | 0.003896 |
| 0 | LinearRegression | n_jobs=-1 | n_jobs=-1 | `{}` | 0.982907 | 0.963679 | 64 | 16 | 0.007276 |


In [None]:
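from warnings import filterwarnings
from sklearn.exceptions import ConvergenceWarning

# Silence all warnings, including the ConvergenceWarning raised by Lasso for very small alpha values
filterwarnings('ignore')
filterwarnings('ignore', category=ConvergenceWarning)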

###########
# Find best params
############################
t0_total = time.time()

for i, row in dfmodels.iterrows():
    t0 = time.time()
    classifier = eval(dfmodels["classifier"][i]+ "(" +dfmodels["init_params"][i]+ ")" )
    params = eval(dfmodels["grid_params"][i])
    best_classifier = GridSearchCV(classifier, params, n_jobs=-1, verbose=0, cv=4)
    best_classifier.fit(X_train, Y_train)
    Y_train_pred = best_classifier.predict(X_train)
    Y_test_pred = best_classifier.predict(X_test)
    dfmodels.loc[i, "train_r2score"] = r2_score(Y_train, Y_train_pred)
    dfmodels.loc[i, "test_r2score"] = r2_score(Y_test, Y_test_pred)
    dfmodels.loc[i, "train_support"] = len(X_train)
    dfmodels.loc[i, "test_support"] = len(X_test)
    dfmodels.loc[i, "elapsed"] = time.time() - t0

    print(dfmodels.loc[i,'classifier'], 
        dfmodels.loc[i,'train_r2score'], 
        dfmodels.loc[i,'test_r2score'],
        best_classifier.best_params_
        )
print("Total test time is : ", time.time() - t0_total)

dfmodels.sort_values('train_r2score', ascending=False)

In [15]:
linear_reg = LinearRegression()
linear_reg.fit(X_train, Y_train)
linear_reg.coef_

array([  -46179.95853135,   -67273.55580543,  2028871.29049168,
        -109978.60486507,  -163053.98752843,     9380.77732587,
         -42732.95417749,   -23787.82404723,   272750.62023671,
       -1496729.1993555 ,  4936466.51518072, -1493219.41030603,
         -83360.68187364,   161960.34937282, -1033695.80074445,
       -1464761.13974251,  4910458.28824236,   135002.76810253,
        4844914.15321229,  2141935.82875519,  3367889.04926786,
         -70634.45652752,  3500795.5065363 ,  3719025.06971062,
        4041589.53886296,   745479.2305262 ,  -179510.39947708])

In [16]:
np.argmax(abs(linear_reg.coef_))

10

In [17]:
display(X_train[0])

array([ 0.58589224,  1.56199819, -1.05536176,  0.30814094,  0.18751465,
       -1.00659072, -1.10218284,  1.66003771,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        1.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ])