# **Project Scope**

Having a well-defined structure before starting a task makes it easier to execute efficiently. The same holds when building a machine learning model: once you have built a model on a dataset, you can break the work down into steps and define a structured machine learning pipeline.

This notebook covers the process of building an end-to-end machine learning pipeline and implementing it on the BigMart sales prediction dataset.

The dataset contains information about the stores, products and historical sales. We will predict the sales of the products in the stores.

We will start by building a prototype machine learning pipeline that will help us define the actual machine learning pipeline.

In [None]:
#Importing libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Data Exploration and Preprocessing

In [None]:
#loading train data
train = pd.read_csv("../input/big-mart-sales-prediction/Train.csv")

In [None]:
#check for missing values
train.isna().sum()

Only Item_Weight and Outlet_Size have missing values.

Item_Weight is a continuous variable. We can use either mean or median to impute the missing values, but here we will use mean.

Outlet_Size is a categorical variable, so we will use the mode to impute the missing values in the column.

In [None]:
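#impute missing values in Item_Weight using mean
#assign the result back: inplace fillna on a selected column is unreliable in recent pandas
train['Item_Weight'] = train['Item_Weight'].fillna(train['Item_Weight'].mean())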
train.Item_Weight.isna().sum()

In [None]:
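#impute missing values in Outlet_Size using mode
#assign the result back: inplace fillna on a selected column is unreliable in recent pandas
train['Outlet_Size'] = train['Outlet_Size'].fillna(train['Outlet_Size'].mode()[0])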
train.Outlet_Size.isna().sum()

Most machine learning models, including the scikit-learn estimators used here, cannot work directly with categorical (string) data. We will convert the categorical variables into numeric types.

In [None]:
#checking categorical variables in the data
train.dtypes

Our data has the following categorical variables:

* Item_Identifier
* Item_Fat_Content
* Item_Type
* Outlet_Identifier
* Outlet_Size
* Outlet_Type
* Outlet_Location_Type

We will use the category_encoders library to convert these variables into binary (one-hot) variables. We will not convert Item_Identifier.
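
As a rough illustration of what one-hot encoding produces, the sketch below encodes a toy `Outlet_Type` column. It uses `pandas.get_dummies` as a stand-in for the encoder used in this notebook (an assumption, made to keep the example light on dependencies), and the three rows are made up. The resulting column names follow the same `<column>_<category>` pattern that appears later in this notebook.

```python
import pandas as pd

# Toy data (made up for illustration); the real notebook uses Train.csv
toy = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Supermarket Type3", "Grocery Store"],
    "Item_MRP": [50.0, 120.5, 249.8],
})

# One binary column per category; numeric columns are left untouched
encoded = pd.get_dummies(toy, columns=["Outlet_Type"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one of the new columns set to 1, which is what lets a model treat every category as its own numeric feature.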

In [None]:
import category_encoders as ce

#create an object of OneHotEncoder
OHE = ce.OneHotEncoder(cols=['Item_Fat_Content',
                            'Item_Type',
                            'Outlet_Identifier',
                            'Outlet_Size',
                            'Outlet_Location_Type',
                            'Outlet_Type'],use_cat_names=True)

#encode the variables
train = OHE.fit_transform(train)

In [None]:
train.head()

Now that we have taken care of our categorical variables, we move on to the continuous variables.
We will normalize the data so that the variables have a similar range.
We will use the StandardScaler class to do this. It rescales a column to zero mean and unit variance; here we apply it to Item_MRP.
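
To make the effect concrete, here is a minimal sketch on made-up prices. It assumes the default `StandardScaler` settings, under which each value becomes `(x - mean) / std`, with `std` being the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up prices, shaped as a single-feature column (n_samples, 1)
prices = np.array([50.0, 100.0, 150.0, 200.0]).reshape(-1, 1)

scaled = StandardScaler().fit_transform(prices)

# The same result by hand: subtract the mean, divide by the standard deviation
manual = (prices - prices.mean()) / prices.std()

print(scaled.ravel())
print(np.allclose(scaled, manual))
```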

In [None]:
from sklearn.preprocessing import StandardScaler
#create an object of the StandardScaler
scaler = StandardScaler()

#fit with the Item_MRP
scaler.fit(np.array(train.Item_MRP).reshape(-1,1))

#transform the data
train.Item_MRP = scaler.transform(np.array(train.Item_MRP).reshape(-1,1))

# Building the Model
We will use the Linear Regression and the Random Forest Regressor to predict the sales. We will create a validation set using the train_test_split() function.

We set test_size=0.25 so that the validation set holds 25% of the data points and the train set holds the remaining 75%.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

#separate the independent and target variables
train_X = train.drop(columns=['Item_Identifier', 'Item_Outlet_Sales'])
train_Y = train['Item_Outlet_Sales']

#split the data
train_x, valid_x, train_y, valid_y = train_test_split(train_X, train_Y, test_size=0.25) 

#shape of train test splits
train_x.shape, valid_x.shape, train_y.shape, valid_y.shape

Now that we have split our data, we will train a linear regression model on this data and check its performance on the validation set. We will use RMSE as an evaluation metric.
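
For reference, RMSE is the square root of the mean squared difference between the actual and predicted values. A minimal sketch with made-up numbers follows; the helper name `rmse` is hypothetical and not part of scikit-learn.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up sales values and predictions
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 320.0]

print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones, and the result is in the same units as the target.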

In [None]:
#LinearRegression
LR = LinearRegression()

#fit the model
LR.fit(train_x, train_y)

#predict the target on train and validation data
train_pred = LR.predict(train_x)
valid_pred = LR.predict(valid_x)

# RMSE on train and validation data
print('RMSE on train data: ', mean_squared_error(train_y, train_pred)**(0.5))
print('RMSE on validation data: ', mean_squared_error(valid_y, valid_pred)**(0.5))

We will train a random forest regressor and see if we can get an improvement on the train and validation errors.

In [None]:
#RandomForestRegressor
RFR = RandomForestRegressor(max_depth=10)

#fitting the model
RFR.fit(train_x, train_y)

#predict the target on train and validation data
train_pred = RFR.predict(train_x)
valid_pred = RFR.predict(valid_x)

#RMSE on train and validation data
print('RMSE on train data :', mean_squared_error(train_y, train_pred)**(0.5))
print('RMSE on validation data :', mean_squared_error(valid_y, valid_pred)**(0.5))


We can see a significant improvement on the RMSE values. The random forest algorithm also gives us a feature importance for every variable in the data.

We have 45 features and not all of these features may be useful in forecasting. We will select the top 7 features which had a major contribution in forecasting sales values.

If the model performance is similar in both cases (by using 45 features and by using 7 features), then we should only use the top 7 features, in order to keep the model simple and efficient.

The goal is to have a less complex model without compromising on the overall model performance.
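
The selection step can also be done programmatically rather than by copying names from a plot. The sketch below uses synthetic data (an assumption, so that it runs without the BigMart files) and a hypothetical `k = 2`; the column names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic features: only 'signal_a' and 'signal_b' drive the target
X = pd.DataFrame(rng.normal(size=(300, 5)),
                 columns=["signal_a", "signal_b", "noise_1", "noise_2", "noise_3"])
y = 5.0 * X["signal_a"] + 3.0 * X["signal_b"] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(max_depth=10, random_state=2).fit(X, y)

# Rank features by importance and keep the names of the top k
k = 2
importances = pd.Series(model.feature_importances_, index=X.columns)
top_k = importances.nlargest(k).index.tolist()

# Reduced feature matrix containing only the selected columns
X_top = X[top_k]
print(top_k)
```

The same `top_k` list can be used to index both the training and validation frames, which keeps the two column sets in sync.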

In [None]:
#plot the 7 most important features
plt.figure(figsize=(10,8))
feat_importances = pd.Series(RFR.feature_importances_, index = train_x.columns)
feat_importances.nlargest(7).plot(kind='barh');

In [None]:
#training data with top 7 features
train_x_7 = train_x[['Item_MRP',
                      'Outlet_Type_Grocery Store',
                      'Item_Visibility',
                      'Outlet_Identifier_OUT027',
                      'Outlet_Type_Supermarket Type3',
                      'Item_Weight',
                      'Outlet_Establishment_Year']]

#validation data with top 7 important features
valid_x_7 = valid_x[['Item_MRP',
                      'Outlet_Type_Grocery Store',
                      'Item_Visibility',
                      'Outlet_Identifier_OUT027',
                      'Outlet_Type_Supermarket Type3',
                      'Item_Weight',
                      'Outlet_Establishment_Year']]

#create an object of the RandomForestRegressor Model
RFR_with_7 = RandomForestRegressor(max_depth=10, random_state=2)


In [None]:
#fit the model
RFR_with_7.fit(train_x_7, train_y)

#predict the target on the training and validation data
pred_train_with_7 = RFR_with_7.predict(train_x_7)
pred_valid_with_7 = RFR_with_7.predict(valid_x_7)

#RMSE on train and validation data
print('RMSE on train data: ', mean_squared_error(train_y, pred_train_with_7)**(0.5))
print('RMSE on validation data: ', mean_squared_error(valid_y, pred_valid_with_7)**(0.5))

Using only 7 features has given us almost the same performance as the previous model where we were using 45 features. Now we will identify the final set of features that we need and the preprocessing steps for each of them.

# Identifying features to build the Machine Learning pipeline
We must list the final set of features, and the preprocessing steps each of them needs, for use in the ML pipeline. Since the RandomForestRegressor model with 7 features gave us almost the same performance as the previous model with 45 features, we will only use these 7 features in our ML pipeline.

# Selected features and preprocessing steps
* **Item_MRP:** It holds the price of the products. During the preprocessing step we used a standard scaler to scale these values.
* **Outlet_Type_Grocery Store:** A binary column which indicates if the outlet type is a grocery store or not. To use this information in the model building process, we will add a binary feature in the existing data that contains 1 (if outlet type is a grocery store) and 0 (if the outlet type is something else).
* **Item_Visibility:** Denotes visibility of products in the store. Since this variable had a small value range and no missing values, we did not apply any preprocessing steps on this variable.
* **Outlet_Type_Supermarket Type3:** Another binary column indicating if the outlet type is a 'supermarket_type_3' or not. To capture this information we will create a binary feature that stores 1 (if outlet type is supermarket_type_3) and 0 (if not).
* **Outlet_Identifier_OUT027:** This feature specifies whether the outlet identifier is 'OUT027' or not. Similar to the previous example, we will create a separate column that carries 1 (if outlet identifier is OUT027) or 0 (otherwise).
* **Outlet_Establishment_Year:** This describes the year of establishment of the stores. Since we did not perform any transformation on values in this column, we will not preprocess it in the pipeline.
* **Item_Weight:** During preprocessing we observed that this column had missing values. These missing values were imputed using the average of the column. This has to be taken into account while building the pipeline.

We will drop the other columns since we will not use them to train the model.


# Pipeline Design
We have built a prototype to understand the preprocessing requirement for our data. It is now time to form a pipeline design based on our learning from the prototype. We will define the pipeline in 3 stages:

1. Create the required binary features
2. Perform required data preprocessing and transformations:
   * Drop the columns that are not required
   * Missing value imputation (Item_Weight) by average
   * Scale the Item_MRP
3. Random Forest Regressor

# 1. Create the required binary features
We will create a custom transformer that will add 3 new binary columns to the existing data.

* Outlet_Type: Grocery Store
* Outlet_Type: Supermarket Type3
* Outlet_Identifier_OUT027

# 2. Data Preprocessing and transformations
We will use a column transformer to do the required transformations. It will contain 3 steps:

* Drop the columns that are not required for model training
* Impute missing values in the column Item_Weight using the average
* Scale the column Item_MRP using StandardScaler()

# 3. Use the model to predict the target on the cleaned data
This will be the final step in the pipeline. In the last two steps we preprocessed the data and made it ready for the model building process. We will use this data and build a machine learning model to predict the Item Outlet Sales.

# Building the pipeline
We will read the dataset and separate the independent and target variables from the training dataset.

In [None]:
#importing required libraries
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
import category_encoders as ce 
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

In [None]:
#read training dataset
train = pd.read_csv("../input/big-mart-sales-prediction/Train.csv")

In [None]:
#separate the independent and target variables
train_x = train.drop(columns=['Item_Outlet_Sales'])
train_y = train['Item_Outlet_Sales']

We need to create 3 new binary columns using a custom transformer. Here are the steps we need to follow to create a custom transformer.

* Define a class OutletTypeEncoder
* Inherit from BaseEstimator when defining the class
* The class must contain fit and transform methods
* In the transform method, we will define all the 3 columns that we want after the first stage in our ML pipeline.
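
The contract described in this list can be sketched on toy data. The class name `FlagEncoder` and the values below are hypothetical. The sketch also inherits from `TransformerMixin`, which is an addition not used in this notebook; it supplies `fit_transform()` on top of the `fit` and `transform` methods.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FlagEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: adds a 0/1 flag for one category value."""

    def __init__(self, column="Outlet_Type", value="Grocery Store"):
        self.column = column
        self.value = value

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit just returns self
        return self

    def transform(self, X):
        # Copy first so the caller's DataFrame is left unchanged
        X = X.copy()
        X["is_" + self.column.lower()] = (X[self.column] == self.value) * 1
        return X

# Made-up rows for illustration
toy = pd.DataFrame({"Outlet_Type": ["Grocery Store", "Supermarket Type1"]})
out = FlagEncoder().fit_transform(toy)
print(out)
```

Multiplying the boolean comparison by 1 converts `True`/`False` into 1/0, which is the same idiom used for the three binary columns in this notebook.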

In [None]:
# import the BaseEstimator
from sklearn.base import BaseEstimator

# define the class OutletTypeEncoder
# This will be our custom transformer that will create 3 new binary columns
# custom transformer must have methods fit and transform

class OutletTypeEncoder(BaseEstimator):

    def __init__(self):
        pass

    def fit(self, x_dataset, y=None):
        # nothing to learn from the data, so fit simply returns the transformer
        return self

    def transform(self, x_dataset):
        # work on a copy so the caller's DataFrame is not modified in place
        x_dataset = x_dataset.copy()
        x_dataset['outlet_grocery_store'] = (x_dataset['Outlet_Type'] == 'Grocery Store')*1
        x_dataset['outlet_supermarket_3'] = (x_dataset['Outlet_Type'] == 'Supermarket Type3')*1
        x_dataset['outlet_identifier_OUT027'] = (x_dataset['Outlet_Identifier'] == 'OUT027')*1

        return x_dataset

Next we will define the pre-processing steps required before the model building process.

* Drop the columns Item_Identifier, Outlet_Identifier, Item_Fat_Content, Item_Type, Outlet_Size, Outlet_Location_Type and Outlet_Type
* Impute missing values in column Item_Weight with mean
* Scale the column Item_MRP using StandardScaler()

This will be the second step in our machine learning pipeline. After this step, the data will be ready to be used by the model to make predictions.
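
One detail worth knowing about this step is that `ColumnTransformer` returns the transformed columns first, in the order the transformers are listed, followed by the passthrough columns. A minimal sketch on made-up data (column names borrowed from the dataset, values invented):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Made-up rows; Item_Weight has one missing value
toy = pd.DataFrame({
    "Item_Identifier": ["A", "B", "C", "D"],
    "Item_Weight": [10.0, np.nan, 14.0, 12.0],
    "Item_MRP": [50.0, 100.0, 150.0, 200.0],
    "Item_Visibility": [0.01, 0.02, 0.03, 0.04],
})

pre = ColumnTransformer(
    remainder="passthrough",
    transformers=[
        ("drop_columns", "drop", ["Item_Identifier"]),
        ("impute_item_weight", SimpleImputer(strategy="mean"), ["Item_Weight"]),
        ("scale_data", StandardScaler(), ["Item_MRP"]),
    ],
)

out = pre.fit_transform(toy)

# Columns of `out`: imputed Item_Weight, scaled Item_MRP, then Item_Visibility
print(out.shape)
print(out)
```

The output is a plain NumPy array, so the original column names are no longer attached; this does not matter for the model, but it is worth keeping in mind when inspecting intermediate results.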

In [None]:
# Drop the columns that are not required (identifiers and raw categorical columns)
# Impute the missing values in column Item_Weight by mean
# Scale the data in the column Item_MRP
pre_process = ColumnTransformer(remainder='passthrough',
                                transformers=[('drop_columns', 'drop', ['Item_Identifier',
                                                                        'Outlet_Identifier',
                                                                        'Item_Fat_Content',
                                                                        'Item_Type',
                                                                        'Outlet_Size',
                                                                        'Outlet_Location_Type',
                                                                        'Outlet_Type'
                                                                       ]),
                                              ('impute_item_weight', SimpleImputer(strategy='mean'), ['Item_Weight']),
                                              ('scale_data', StandardScaler(),['Item_MRP'])])

# Predict the target
This will be the final block of the machine learning pipeline. We will specify 3 steps – create binary columns, preprocess the data, train a model.

When we call fit() on a pipeline object, all three steps are executed in order. After training, we call predict(), which applies the fitted preprocessing steps to the input and then uses the trained model to generate the predictions.
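
To illustrate what `fit()` and `predict()` do on a pipeline, the sketch below uses a smaller two-step pipeline on synthetic data. Both are assumptions made for brevity, and `LinearRegression` stands in for the random forest. It checks that `predict()` on the pipeline matches applying the fitted steps by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic training data and a few unseen rows
X_train = rng.normal(loc=100.0, scale=20.0, size=(50, 2))
y_train = 2.0 * X_train[:, 0] - X_train[:, 1] + rng.normal(size=50)
X_new = rng.normal(loc=100.0, scale=20.0, size=(5, 2))

pipe = Pipeline(steps=[("scale", StandardScaler()),
                       ("model", LinearRegression())])

# fit(): fits and applies each transformer in turn, then fits the final model
pipe.fit(X_train, y_train)

# predict(): applies the already-fitted transformers, then the model
pipeline_pred = pipe.predict(X_new)

# The same thing done step by step
manual_pred = pipe.named_steps["model"].predict(
    pipe.named_steps["scale"].transform(X_new))

print(np.allclose(pipeline_pred, manual_pred))
```

The scaler is never refitted on `X_new`; it reuses the mean and standard deviation learned from the training data, which is why new data can be passed straight to `predict()`.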

In [None]:
# Define the Pipeline
# Step 1: get the outlet binary columns
# Step 2: pre-processing
# Step 3: train a Random Forest model
model_pipeline = Pipeline(steps=[('get_outlet_binary_columns', OutletTypeEncoder()), 
                                 ('pre_processing',pre_process),
                                 ('random_forest', RandomForestRegressor(max_depth=10,random_state=2))
                                 ])
# fit the pipeline with the training data
model_pipeline.fit(train_x,train_y)

# predict target values on the training data
model_pipeline.predict(train_x)

Now we will read the test dataset. We only need to call predict() on the pipeline object to make predictions on the test data, because the pipeline applies the preprocessing fitted on the training data before running the model.

In [None]:
# read the test data
test_data = pd.read_csv("../input/big-mart-sales-prediction/Test.csv")

# predict target variables on the test data 
y_sub = model_pipeline.predict(test_data)

In [None]:
y_sub

In [None]:
sub = pd.read_csv("../input/big-mart-sales-prediction/Submission.csv")
sub["Item_Outlet_Sales"] = y_sub
sub.head()

In [None]:
sub.to_csv("submission.csv", index=False)