# **PROJECT SCOPE**
This notebook covers the process of using XGBoost to predict the outlet sales on Big Mart sales prediction data. 

# The dataset
The dataset contains information about the stores, products and historical sales. We will predict the sales of the products in the stores.

# XGBoost
This is a Machine Learning algorithm that deals with structured data, and uses the gradient boosting (GBM) framework at its core.

Boosting is a sequential technique which works on the principle of an ensemble. It combines a set of weak learners and delivers improved prediction accuracy. At any instant t, the model outcomes are weighted based on the outcomes of the previous instant t-1. The outcomes predicted correctly are given a lower weight and the misclassified ones are weighted higher. In gradient boosting for regression, the analogous step is to fit each new learner to the residual errors of the current ensemble.

Thus the basic idea behind boosting algorithms is to build a weak model, draw conclusions about feature importance and parameters, and then use those conclusions to build a new, stronger model that reduces the errors of the previous one.

The default base learners of XGBoost are tree ensembles. The tree ensemble model is a set of classification and regression trees (CART). Trees are grown one after another, and each subsequent iteration attempts to reduce the error left by the previous ones (the misclassification rate for classification, or the squared error for a regression task like this one).
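
The sequential idea described above can be sketched by hand for squared-error regression. This is a minimal illustration on synthetic data, not XGBoost's actual implementation (which adds regularization, second-order gradients and more). The names `X_toy`, `y_toy`, `learning_rate` and `n_rounds` are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
n_rounds = 20

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y_toy, y_toy.mean())
mse_per_round = [np.mean((y_toy - prediction) ** 2)]

for _ in range(n_rounds):
    residuals = y_toy - prediction               # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)                   # weak learner fitted to the residuals
    prediction += learning_rate * tree.predict(X_toy)
    mse_per_round.append(np.mean((y_toy - prediction) ** 2))

print(f"Training MSE before boosting: {mse_per_round[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_per_round[-1]:.4f}")
```

Each shallow tree only corrects part of the remaining error, so the training error shrinks round by round.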




In [None]:
#Importing libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Data Exploration and Preprocessing

In [None]:
#loading data
train = pd.read_csv("../input/big-mart-sales-prediction/Train.csv")

In [None]:
train.shape

Our data contains 8523 rows of data with 12 columns.

In [None]:
train.head()

In [None]:
train.info()

In [None]:
#check for missing values
train.isna().sum()

Only Item_Weight and Outlet_Size have missing values.

Item_Weight is a continuous variable. We can use either the mean or the median to impute the missing values; here we will use the mean.

Outlet_Size is a categorical variable, so we will use the mode to impute the missing values in the column.
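
To make the difference between these choices concrete, here is a small sketch on a made-up frame (the `toy` DataFrame and its column names are assumptions, not part of the Big Mart data). With an outlier present, the mean and the median fill in noticeably different values.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column and an outlier (40.0)
toy = pd.DataFrame({
    "weight": [5.0, 7.0, np.nan, 9.0, 40.0],
    "size": ["Small", "Medium", None, "Medium", "High"],
})

mean_fill = toy["weight"].fillna(toy["weight"].mean())      # pulled up by the outlier
median_fill = toy["weight"].fillna(toy["weight"].median())  # robust to the outlier
mode_fill = toy["size"].fillna(toy["size"].mode()[0])       # most frequent category

print(mean_fill[2], median_fill[2], mode_fill[2])
```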

In [None]:
#impute missing values in Item_Weight using mean
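#assign the result back instead of using inplace=True on a column, which recent pandas versions warn about
train['Item_Weight'] = train['Item_Weight'].fillna(train['Item_Weight'].mean())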

In [None]:
train.Item_Weight.isna().sum()

In [None]:
#impute missing values in Outlet_Size using mode
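#assign the result back instead of using inplace=True on a column, which recent pandas versions warn about
train['Outlet_Size'] = train['Outlet_Size'].fillna(train['Outlet_Size'].mode()[0])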

In [None]:
train.Outlet_Size.isna().sum()

Most machine learning models cannot work directly with categorical (string) data. We will convert the categorical variables into numeric types.

In [None]:
#checking categorical variables in the data
train.dtypes

Our data has the following categorical variables

* Item_Identifier
* Item_Fat_Content
* Item_Type
* Outlet_Identifier
* Outlet_Size
* Outlet_Type
* Outlet_Location_Type

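We will use target encoding to convert these variables: each category is replaced by the mean of Item_Outlet_Sales for that category. Item_Identifier is not encoded, because it is dropped before modelling.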
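
As a hedged sketch of the idea, the snippet below target-encodes a made-up column. The helpers `fit_target_encoding` and `apply_target_encoding` are hypothetical names introduced for this example. It also shows one common precaution: computing the category means on the training rows only and falling back to the global mean for unseen categories, which avoids leaking target information from validation rows.

```python
import pandas as pd

def fit_target_encoding(df, column, target):
    """Return a Series mapping each category in `column` to the mean of `target`."""
    return df.groupby(column)[target].mean()

def apply_target_encoding(df, column, mapping, default):
    """Map categories to their encoded value; unseen categories get `default`."""
    return df[column].map(mapping).fillna(default)

# Made-up training and validation rows, for illustration only
toy_train = pd.DataFrame({
    "outlet_type": ["A", "A", "B", "B"],
    "sales": [100.0, 200.0, 10.0, 30.0],
})
toy_valid = pd.DataFrame({"outlet_type": ["A", "B", "C"]})  # "C" was never seen

mapping = fit_target_encoding(toy_train, "outlet_type", "sales")
global_mean = toy_train["sales"].mean()
toy_valid["outlet_type_encoded"] = apply_target_encoding(
    toy_valid, "outlet_type", mapping, global_mean
)
print(toy_valid)
```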

In [None]:
#target encoders: replace each category with the mean Item_Outlet_Sales of that category
Item_Fat_Content_mean = train.groupby('Item_Fat_Content')['Item_Outlet_Sales'].mean()
train['Item_Fat_Content'] = train['Item_Fat_Content'].map(Item_Fat_Content_mean)
Item_Type_mean = train.groupby('Item_Type')['Item_Outlet_Sales'].mean()
train['Item_Type'] = train['Item_Type'].map(Item_Type_mean)
Outlet_Identifier_mean = train.groupby('Outlet_Identifier')['Item_Outlet_Sales'].mean()
train['Outlet_Identifier'] = train['Outlet_Identifier'].map(Outlet_Identifier_mean)
Outlet_Size_mean = train.groupby('Outlet_Size')['Item_Outlet_Sales'].mean()
train['Outlet_Size'] = train['Outlet_Size'].map(Outlet_Size_mean)
Outlet_Location_Type_mean = train.groupby('Outlet_Location_Type')['Item_Outlet_Sales'].mean()
train['Outlet_Location_Type'] = train['Outlet_Location_Type'].map(Outlet_Location_Type_mean)
Outlet_Type_mean = train.groupby('Outlet_Type')['Item_Outlet_Sales'].mean()
train['Outlet_Type'] = train['Outlet_Type'].map(Outlet_Type_mean)

In [None]:
train.head()

In [None]:
train.shape

Now that we have taken care of our categorical variables, we move on to the continuous variables. We will standardize Item_MRP so that it has zero mean and unit variance, using scikit-learn's StandardScaler.
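
For reference, StandardScaler subtracts the column mean and divides by the population standard deviation. The short check below, on a made-up `prices` array, compares the manual formula with the scaler's output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up prices, shaped as a single feature column
prices = np.array([[50.0], [100.0], [150.0], [200.0]])

# Manual z-score: (x - mean) / std, with the population std (ddof=0)
manual = (prices - prices.mean()) / prices.std()

# Same transformation via scikit-learn
scaled = StandardScaler().fit_transform(prices)

print(np.allclose(manual, scaled))
```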

In [None]:
from sklearn.preprocessing import StandardScaler
#create an object of the StandardScaler
scaler = StandardScaler()

#fit with the Item_MRP
scaler.fit(np.array(train.Item_MRP).reshape(-1,1))

#transform the data
train.Item_MRP = scaler.transform(np.array(train.Item_MRP).reshape(-1,1))

# The model
We will build the model using Trees as base learners using XGBoost's scikit-learn compatible API.

In [None]:
#importing libraries
import xgboost as xgb
from sklearn.metrics import mean_squared_error

Separate the target variable and rest of the variables using .iloc to subset the data.

In [None]:
X, y = train.iloc[:,:-1],train.iloc[:,-1]

In [None]:
X

In [None]:
#drop Item Identifier column
X = X.drop(columns=['Item_Identifier']) 

In [None]:
X

In [None]:
y

Now we will convert the dataset into a DMatrix, an optimized data structure that XGBoost supports and that contributes to its performance and efficiency gains.

In [None]:
data_dmatrix = xgb.DMatrix(data=X,label=y)

Now we will create the train and test sets for a hold-out evaluation of the results, using the train_test_split function from sklearn's model_selection module with test_size equal to 20% of the data. To keep the results reproducible, a random_state is also assigned.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

The next step is to instantiate an XGBoost regressor object by calling the XGBRegressor() class from the XGBoost library with the hyper-parameters passed as arguments.

In [None]:
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 10)

Fit the regressor to the training set and make predictions on the test set using the familiar .fit() and .predict() methods.

In [None]:
xg_reg.fit(X_train,y_train)

preds = xg_reg.predict(X_test)

Compute the RMSE by taking the square root of the value returned by the mean_squared_error function from sklearn's metrics module.

In [None]:
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

The RMSE for the outlet sales prediction came out to be around 1661.80
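
As a reminder of what this number means, RMSE is the square root of the average squared difference between the actual and predicted values, so it is expressed in the same units as the target. A tiny worked example on made-up values (`y_true` and `y_pred` are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

# Errors are 1, 0, -2, -1; squared errors are 1, 0, 4, 1; their mean is 1.5
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"manual: {rmse_manual:.4f}, sklearn: {rmse_sklearn:.4f}")
```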

# k-fold Cross Validation using XGBoost

In order to build more robust models, we will do a k-fold cross validation where all the entries in the original training dataset are used for both training as well as validation. Also, each entry is used for validation just once. 
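
The property that each entry is validated exactly once can be checked with scikit-learn's KFold on a handful of made-up row indices. This sketch only illustrates the splitting scheme; xgb.cv() performs its own fold handling internally.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 9
rows = np.arange(n_samples).reshape(-1, 1)  # stand-in for a feature matrix

kfold = KFold(n_splits=3, shuffle=True, random_state=123)
validation_counts = np.zeros(n_samples, dtype=int)

for fold, (train_idx, valid_idx) in enumerate(kfold.split(rows)):
    validation_counts[valid_idx] += 1
    print(f"fold {fold}: train={train_idx.tolist()} valid={valid_idx.tolist()}")

# Every row should have been used for validation exactly once
print(validation_counts)
```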

We will create a hyper-parameter dictionary, params, which holds all the hyper-parameters and their values as key-value pairs. It excludes n_estimators because we will use num_boost_round (the number of trees we build, analogous to n_estimators) instead.

We will use these parameters to build a 3-fold cross-validation model by invoking XGBoost's cv() function and store the results in the cv_results DataFrame. We are using the DMatrix object we created before.

In [None]:
params = {"objective":"reg:squarederror",'colsample_bytree': 0.3,'learning_rate': 0.1,
                'max_depth': 5, 'alpha': 10}

cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=50,early_stopping_rounds=10,metrics="rmse", as_pandas=True, seed=123)

cv_results contains train and test RMSE metrics for each boosting round.

In [None]:
cv_results.head()

Extract and print the final boosting round metric.

In [None]:
print((cv_results["test-rmse-mean"]).tail(1))

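We can see that the RMSE for the outlet sales prediction has reduced compared to last time and came out to be around 1120.77. Note that this run uses up to 50 boosting rounds instead of the 10 trees used earlier, so part of the improvement is likely due to the additional rounds rather than to cross-validation itself.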

# Visualize Boosting Trees and Feature Importance

We will now visualize individual trees from the fully boosted model that XGBoost creates using the entire dataset. XGBoost has a plot_tree() function that makes this type of visualization easy. Once we train a model using the XGBoost learning API, we can pass it to the plot_tree() function along with the number of trees we want to plot using the num_trees argument.

In [None]:
xg_reg = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=10)

We will visualize our XGBoost models by examining the importance of each feature column in the original dataset within the model.

This involves counting the number of times each feature is split on across all boosting rounds (trees) in the model, and then visualizing the result as a bar graph, with the features ordered according to how many times they appear. XGBoost's plot_importance() function allows us to do exactly this.

In [None]:
#set the figure size before plotting so that it applies to this figure
plt.rcParams['figure.figsize'] = [8, 6]
xgb.plot_importance(xg_reg)
plt.show()

As we can see, the feature Item_Weight has been given the highest importance score among all the features.