# Prediction of sales

### Problem Statement
The dataset contains sales data for 1559 products across 10 stores in different cities, along with certain attributes of each product and store. The aim is to build a model that predicts the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which the store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Please note that the data may have missing values, since some stores might not report all of the data due to technical glitches. These values will need to be treated accordingly.
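
One common way to treat such gaps is to record where a value was missing and then fill it in. The sketch below assumes median imputation on a toy frame. The `_missing_ind` suffix mirrors the indicator columns that appear in the prepared data further down, but the preparation steps actually used for that file are not shown in this notebook, so this is only an illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real file (values are made up for illustration).
toy = pd.DataFrame({"Item_Weight": [9.3, np.nan, 17.5, np.nan, 8.93]})

# Flag the missing entries first, so the information survives the fill.
toy["Item_Weight_missing_ind"] = toy["Item_Weight"].isna().astype(int)

# Fill the gaps with the median of the observed values (an assumed strategy).
median_weight = toy["Item_Weight"].median()
toy["Item_Weight"] = toy["Item_Weight"].fillna(median_weight)

print(toy)
```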



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load the data.

data = pd.read_csv("../DS_auticon_week_3/data_feature_engineering_exercise.csv", delimiter=',')

In [3]:
data.shape

(8523, 44)

In [4]:
data.head(10)

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Item_Outlet_Sales,Item_Weight_missing_ind,Outlet_Size_missing_ind,...,Outlet_Identifier_OUT045,Outlet_Identifier_OUT046,Outlet_Identifier_OUT049,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3,Item_Identifier_Category_DR,Item_Identifier_Category_FD,Item_Identifier_Category_NC
0,9.3,1,0.016047,249.8092,1999,2,3,3735.138,0,0,...,0,0,1,0,1,0,0,0,1,0
1,5.92,2,0.019278,48.2692,2009,2,1,443.4228,0,0,...,0,0,0,0,0,1,0,1,0,0
2,17.5,1,0.01676,141.618,1999,2,3,2097.27,0,0,...,0,0,1,0,1,0,0,0,1,0
3,19.2,2,0.0,182.095,1998,0,1,732.38,0,1,...,0,0,0,1,0,0,0,0,1,0
4,8.93,0,0.0,53.8614,1987,3,1,994.7052,0,0,...,0,0,0,0,1,0,0,0,0,1
5,10.395,2,0.0,51.4008,2009,2,1,556.6088,0,0,...,0,0,0,0,0,1,0,0,1,0
6,13.65,2,0.012741,57.6588,1987,3,1,343.5528,0,0,...,0,0,0,0,1,0,0,0,1,0
7,19.0,1,0.12747,107.7622,1985,2,1,4022.7636,1,0,...,0,0,0,0,0,0,1,0,1,0
8,16.2,2,0.016687,96.9726,2002,0,2,1076.5986,0,1,...,1,0,0,0,1,0,0,0,1,0
9,19.2,2,0.09445,187.8214,2007,0,2,4710.535,0,1,...,0,0,0,0,1,0,0,0,1,0


In [5]:
# Identify the features X and target variable y.

y = data["Item_Outlet_Sales"]
X = data.drop("Item_Outlet_Sales", axis=1)

In [6]:
X.head(10)

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Item_Weight_missing_ind,Outlet_Size_missing_ind,Outlet_Operating_Year,...,Outlet_Identifier_OUT045,Outlet_Identifier_OUT046,Outlet_Identifier_OUT049,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3,Item_Identifier_Category_DR,Item_Identifier_Category_FD,Item_Identifier_Category_NC
0,9.3,1,0.016047,249.8092,1999,2,3,0,0,21,...,0,0,1,0,1,0,0,0,1,0
1,5.92,2,0.019278,48.2692,2009,2,1,0,0,11,...,0,0,0,0,0,1,0,1,0,0
2,17.5,1,0.01676,141.618,1999,2,3,0,0,21,...,0,0,1,0,1,0,0,0,1,0
3,19.2,2,0.0,182.095,1998,0,1,0,1,22,...,0,0,0,1,0,0,0,0,1,0
4,8.93,0,0.0,53.8614,1987,3,1,0,0,33,...,0,0,0,0,1,0,0,0,0,1
5,10.395,2,0.0,51.4008,2009,2,1,0,0,11,...,0,0,0,0,0,1,0,0,1,0
6,13.65,2,0.012741,57.6588,1987,3,1,0,0,33,...,0,0,0,0,1,0,0,0,1,0
7,19.0,1,0.12747,107.7622,1985,2,1,1,0,35,...,0,0,0,0,0,0,1,0,1,0
8,16.2,2,0.016687,96.9726,2002,0,2,0,1,18,...,1,0,0,0,1,0,0,0,1,0
9,19.2,2,0.09445,187.8214,2007,0,2,0,1,13,...,0,0,0,0,1,0,0,0,1,0
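
The `Unnamed: 0` column in the output above looks like a row index that was written to the CSV when the file was saved, and it currently stays in `X` as a feature. A minimal sketch, using a small in-memory CSV and not the real file, of how `index_col=0` would keep such a column out of the feature matrix:

```python
import io

import pandas as pd

# A tiny CSV with an unnamed first column, as produced by DataFrame.to_csv()
# with its default index=True. The values are illustrative only.
csv_text = ",Item_Weight,Item_MRP\n0,9.3,249.8092\n1,5.92,48.2692\n"

# Default read: the saved index comes back as a regular column.
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# index_col=0 turns the first column back into the row index.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(clean.columns))
```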


In [7]:
y.head(10)

0    3735.1380
1     443.4228
2    2097.2700
3     732.3800
4     994.7052
5     556.6088
6     343.5528
7    4022.7636
8    1076.5986
9    4710.5350
Name: Item_Outlet_Sales, dtype: float64

Split the data into an 80% training set and a 20% test set.

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8,
                                                    test_size=0.2)

In [10]:
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (6818, 43)
Shape of X_test: (1705, 43)
Shape of y_train: (6818,)
Shape of y_test: (1705,)
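
The split above is not seeded, so each run produces a different train/test partition and slightly different scores. A minimal sketch on synthetic data, assuming a fixed `random_state` (the value 42 is arbitrary), of how to make the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y (100 rows, 3 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# The same seed gives the same partition on every call.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
X_train_b, X_test_b, y_train_b, y_test_b = split_b

print(X_train_a.shape, X_test_a.shape)
print(np.array_equal(X_test_a, X_test_b))
```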


First, we will make a baseline model.

In [11]:
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [12]:
base = DummyRegressor(strategy = 'mean')

In [13]:
base.fit(X_train, y_train)

DummyRegressor()

In [14]:
y_base = base.predict(X_test)

In [15]:
y_base

array([2182.13049698, 2182.13049698, 2182.13049698, ..., 2182.13049698,
       2182.13049698, 2182.13049698])

In [16]:
MSE_b = mean_squared_error(y_test, y_base)
print(f"MSE_b : {MSE_b}")

MSE_b : 2936287.9236949654


In [17]:
r2_b = r2_score(y_test, y_base)
print(f"r2_b : {r2_b}")

r2_b : -6.027462453372934e-06
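
An R-squared of almost exactly zero is what a mean predictor should give, since R-squared measures the improvement over predicting the mean of the evaluated targets. It comes out slightly negative here because the baseline predicts the training mean, not the test mean. A small worked example with made-up numbers:

```python
def r2(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy targets (illustrative values, not taken from the dataset).
y_train_toy = [1.0, 2.0, 3.0, 4.0]
y_test_toy = [2.0, 3.0, 5.0]

train_mean = sum(y_train_toy) / len(y_train_toy)
test_mean = sum(y_test_toy) / len(y_test_toy)

# Predicting the training mean for every test row: R-squared is at most 0.
r2_train_mean = r2(y_test_toy, [train_mean] * len(y_test_toy))

# Predicting the test mean itself gives exactly 0.
r2_test_mean = r2(y_test_toy, [test_mean] * len(y_test_toy))

print(r2_train_mean, r2_test_mean)
```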


Now, we will recreate the Linear Regression model from Tuesday.

In [18]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

In [19]:
model = LinearRegression()

In [20]:
param_grid = {'normalize': [False, True]}

In [21]:
grid_search = GridSearchCV(model, param_grid, cv=10, n_jobs=-1)
grid_search.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=LinearRegression(), n_jobs=-1,
             param_grid={'normalize': [False, True]})

In [22]:
grid_search.best_params_

{'normalize': False}

In [23]:
grid_search.best_score_

0.5590678052936487

In [24]:
y_pred = grid_search.predict(X_test)

In [25]:
MSE = mean_squared_error(y_test, y_pred)
print(f"MSE : {MSE}")

MSE : 1291684.6549837454


In [26]:
r2 = r2_score(y_test, y_pred)
print(f"r2 : {r2}")

r2 : 0.5600933988315882


For my own interest, I will also create a Decision Tree model with Randomized Search for comparison.

In [27]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

In [28]:
tree = DecisionTreeRegressor()

In [29]:
param_grid = {'max_depth': randint(1, 8)}

In [30]:
rand_search = RandomizedSearchCV(tree, param_grid, cv=10, n_jobs=-1)
rand_search.fit(X_train, y_train)

RandomizedSearchCV(cv=10, estimator=DecisionTreeRegressor(), n_jobs=-1,
                   param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f8a8ccc4880>})

In [31]:
rand_search.best_params_

{'max_depth': 5}

In [32]:
# Note that the default score for the Sklearn Decision Tree 
# model is the R-squared value.

rand_search.best_score_

0.584209333496771

In [33]:
y_tree = rand_search.predict(X_test)

In [34]:
MSE_tree = mean_squared_error(y_test, y_tree)
print(f"MSE_tree : {MSE_tree}")

MSE_tree : 1170882.1635574733


In [35]:
r2_tree = r2_score(y_test, y_tree)
print(f"r2_tree : {r2_tree}")

r2_tree : 0.6012348749735932


In [36]:
rand_search.cv_results_

{'mean_fit_time': array([0.02268949, 0.01092403, 0.00832596, 0.01862173, 0.02087927,
        0.01007688, 0.01348097, 0.02246566, 0.00817056, 0.01904771]),
 'std_fit_time': array([0.00113894, 0.00093334, 0.00189362, 0.00147438, 0.00142158,
        0.00122568, 0.00163738, 0.00159237, 0.00205464, 0.00158407]),
 'mean_score_time': array([0.00278664, 0.00201802, 0.00198467, 0.00197303, 0.00206256,
        0.00195262, 0.00193529, 0.00210321, 0.00211966, 0.00272202]),
 'std_score_time': array([1.52357202e-03, 1.10590442e-04, 1.15474713e-04, 1.12304325e-04,
        1.28027003e-04, 5.43824835e-05, 6.51743951e-05, 1.88597121e-04,
        4.36298707e-04, 6.39638880e-04]),
 'param_max_depth': masked_array(data=[6, 2, 1, 5, 6, 2, 3, 6, 1, 4],
              mask=[False, False, False, False, False, False, False, False,
                    False, False],
        fill_value='?',
             dtype=object),
 'params': [{'max_depth': 6},
  {'max_depth': 2},
  {'max_depth': 1},
  {'max_depth': 5},
  {'max

We covered data preparation and feature engineering last week. Now it's time to build some predictive models.

## Model Building

### Ensemble Models

Try different ensemble models (Random Forest Regressor, Gradient Boosting, XGBoost).

Calculate the mean squared error on the test set. Explore how different parameters of the model affect the results and the performance of the model.

- Use GridSearchCV to find optimal parameters of models.
- Compare against the Linear Regression model from Tuesday.

We may use Randomized Search instead of Grid Search.

# Case 1: Random Forest Model

In [37]:
from sklearn.ensemble import RandomForestRegressor

In [38]:
model_1 = RandomForestRegressor()

In [39]:
param_grid_1 = {'max_depth': randint(5, 7),
                'min_samples_split': randint(2, 11),
                'n_estimators': randint(100, 150)}

In [40]:
rand_search_1 = RandomizedSearchCV(model_1, param_grid_1, cv=10, n_jobs=-1)
rand_search_1.fit(X_train, y_train)

RandomizedSearchCV(cv=10, estimator=RandomForestRegressor(), n_jobs=-1,
                   param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f8a899d35e0>,
                                        'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f8a899d3910>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f8a899d3850>})

In [41]:
rand_search_1.best_params_

{'max_depth': 5, 'min_samples_split': 7, 'n_estimators': 137}

In [42]:
# Note that the default score for the Sklearn Random Forest 
# model is the R-squared value.

rand_search_1.best_score_

0.5950101966395361

In [43]:
y_pred_1 = rand_search_1.predict(X_test)

In [44]:
MSE_1 = mean_squared_error(y_test, y_pred_1)
print(f"MSE_1 : {MSE_1}")

MSE_1 : 1162023.7454943722


In [45]:
r2_1 = r2_score(y_test, y_pred_1)
print(f"r2_1 : {r2_1}")

r2_1 : 0.6042517696675358


I will ask for mentor help about putting bounds on hyperparameters.

On January 27, 2021, I received mentor help from Socorro at 1 PM.  My questions 
and answers are in my copybook.

In [46]:
rand_search_1.cv_results_

{'mean_fit_time': array([1.62313588, 1.01547337, 1.24892569, 1.36835306, 1.40013597,
        1.05551167, 1.35125673, 1.19309118, 1.36727667, 1.40882213]),
 'std_fit_time': array([0.02487504, 0.0063711 , 0.0527874 , 0.07553921, 0.00958761,
        0.00802106, 0.01444976, 0.01353514, 0.07400479, 0.02434636]),
 'mean_score_time': array([0.01442261, 0.01017711, 0.01251714, 0.01275613, 0.01275415,
        0.01051559, 0.01274304, 0.01157579, 0.01232021, 0.0121676 ]),
 'std_score_time': array([0.000364  , 0.00016011, 0.00148096, 0.00084708, 0.00022324,
        0.00015073, 0.00026388, 0.00047937, 0.00059586, 0.00052956]),
 'param_max_depth': masked_array(data=[6, 5, 6, 5, 6, 5, 5, 5, 6, 5],
              mask=[False, False, False, False, False, False, False, False,
                    False, False],
        fill_value='?',
             dtype=object),
 'param_min_samples_split': masked_array(data=[4, 3, 2, 7, 5, 3, 6, 4, 5, 9],
              mask=[False, False, False, False, False, False, False

It seems that the 'max_depth' hyperparameter is best at a value of around 5.  
The optimal values for the other hyperparameters seem random.
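
One way to check impressions like these is to load `cv_results_` into a DataFrame and sort by rank, instead of reading the raw dictionary. The sketch below runs a deliberately small search on synthetic data; the dataset, `n_iter`, `cv`, parameter ranges, and seeds are assumptions chosen for speed, not the settings used above.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for X_train / y_train.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    {"max_depth": randint(2, 7), "n_estimators": randint(10, 30)},
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

# One row per sampled candidate, best candidate first.
results = pd.DataFrame(search.cv_results_)
cols = [
    "param_max_depth",
    "param_n_estimators",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
summary = results[cols].sort_values("rank_test_score")
print(summary)
```

Comparing the differences in `mean_test_score` against `std_test_score` helps judge whether a hyperparameter really matters or whether the differences are within fold-to-fold noise.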

# Case 2: Gradient Boosting Model

In [47]:
from sklearn.ensemble import GradientBoostingRegressor
from scipy.stats import uniform

In [48]:
model_2 = GradientBoostingRegressor()

In [49]:
param_grid_2 = {'max_depth': randint(2, 4),
                'min_samples_split': randint(2, 11),
                'n_estimators': randint(100, 150),
                'learning_rate': uniform(0.01, 0.1),
                'subsample': uniform(0, 1)}
                #'min_samples_leaf': randint(2, 11)}
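
`scipy.stats.uniform(loc, scale)` is parameterised by a start point and a width, so it covers `[loc, loc + scale]`. That means `uniform(0.01, 0.1)` above samples learning rates from 0.01 to 0.11, not from 0.01 to 0.1, and `uniform(0, 1)` can propose `subsample` values very close to 0. A short check of the ranges, plus one assumed alternative for `subsample` (the bounds 0.5 and 1.0 are illustrative only):

```python
from scipy.stats import uniform

# uniform(loc, scale) spans [loc, loc + scale].
learning_rate_dist = uniform(0.01, 0.1)
lr_low, lr_high = learning_rate_dist.ppf([0, 1])
print(lr_low, lr_high)

# To sample from [low, high], pass scale = high - low.
low, high = 0.5, 1.0
subsample_dist = uniform(low, high - low)
sub_low, sub_high = subsample_dist.ppf([0, 1])
print(sub_low, sub_high)
```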

In [50]:
rand_search_2 = RandomizedSearchCV(model_2, param_grid_2, cv=10, n_jobs=-1)
rand_search_2.fit(X_train, y_train)

RandomizedSearchCV(cv=10, estimator=GradientBoostingRegressor(), n_jobs=-1,
                   param_distributions={'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f8a899e6e50>,
                                        'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f8a8cf13550>,
                                        'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f8a899e61f0>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f8a89a8ef40>,
                                        'subsample': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f8a899e6b80>})

In [51]:
rand_search_2.best_params_

{'learning_rate': 0.04634554097276373,
 'max_depth': 2,
 'min_samples_split': 4,
 'n_estimators': 140,
 'subsample': 0.9090431586326877}

In [52]:
# Note that the default score for the Sklearn GradientBoostingRegressor 
# model is the R-squared value.

rand_search_2.best_score_

0.5972017776157801

In [53]:
y_pred_2 = rand_search_2.predict(X_test)

In [54]:
MSE_2 = mean_squared_error(y_test, y_pred_2)
print(f"MSE_2 : {MSE_2}")

MSE_2 : 1156605.057788428


In [55]:
r2_2 = r2_score(y_test, y_pred_2)
print(f"r2_2 : {r2_2}")

r2_2 : 0.6060972018961512


In [56]:
rand_search_2.cv_results_

{'mean_fit_time': array([0.31973617, 0.68302219, 0.8185509 , 0.50844073, 0.54572222,
        0.31283934, 0.11216178, 0.25887923, 0.57497835, 0.88707361]),
 'std_fit_time': array([0.00969105, 0.00945327, 0.01963141, 0.01541329, 0.00977021,
        0.02718302, 0.00272342, 0.01269476, 0.03203303, 0.02781488]),
 'mean_score_time': array([0.00304732, 0.00387919, 0.00356357, 0.00292382, 0.0031533 ,
        0.0035825 , 0.0044229 , 0.00327804, 0.00346222, 0.00327628]),
 'std_score_time': array([0.00041156, 0.00090494, 0.00042412, 0.00015587, 0.00018084,
        0.00064437, 0.00146648, 0.00036558, 0.00067343, 0.00067103]),
 'param_learning_rate': masked_array(data=[0.08477888132427576, 0.02407443992262292,
                    0.08019536581033582, 0.04838739032826756,
                    0.06644192852557367, 0.017208636423399374,
                    0.0683922121809434, 0.09156803834806808,
                    0.10914837462045224, 0.04634554097276373],
              mask=[False, False, False, Fal

Note that I tried Grid Search with cv equal to 2, but it ran longer than 
Randomized Search and its results were worse.

It seems that the 'max_depth' hyperparameter is best at a value of around 2.  
Adding the 'subsample' hyperparameter to Randomized Search seems to help.  
Adding the 'min_samples_leaf' hyperparameter to Randomized Search seemed to 
worsen the results, so it was removed.  
The optimal values for the other hyperparameters seem random.

# Case 3: XGBoost Model

In [57]:
from xgboost import XGBRegressor
from scipy.stats import expon

In [58]:
model_3 = XGBRegressor()

In [59]:
param_grid_3 = {'max_depth': randint(2, 4), 
                'n_estimators': randint(100, 150),
                'learning_rate': uniform(0, 0.1),
                'colsample_bytree': uniform(0.5, 0.5)}
                #'lambda': uniform(0, 1)}
                #'gamma': expon()}
                #'subsample': uniform(0, 1)}


In [60]:
rand_search_3 = RandomizedSearchCV(model_3, param_grid_3, cv=10, n_jobs=-1)
rand_search_3.fit(X_train, y_train)

RandomizedSearchCV(cv=10,
                   estimator=XGBRegressor(base_score=None, booster=None,
                                          colsample_bylevel=None,
                                          colsample_bynode=None,
                                          colsample_bytree=None, gamma=None,
                                          gpu_id=None, importance_type='gain',
                                          interaction_constraints=None,
                                          learning_rate=None,
                                          max_delta_step=None, max_depth=None,
                                          min_child_weight=None, missing=nan,
                                          monotone_constraints=None,
                                          n_estimators=100,...
                   param_distributions={'colsample_bytree': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f8a8d483670>,
                                        'learning_rate': <s

In [61]:
rand_search_3.best_params_

{'colsample_bytree': 0.8912210916706643,
 'learning_rate': 0.05679670474973517,
 'max_depth': 2,
 'n_estimators': 131}

In [62]:
# Note that the default score for the XGBoost model is the R-squared value.

rand_search_3.best_score_

0.597266843598183

In [63]:
y_pred_3 = rand_search_3.predict(X_test)

In [64]:
MSE_3 = mean_squared_error(y_test, y_pred_3)
print(f"MSE_3 : {MSE_3}")

MSE_3 : 1160073.5248812167


In [65]:
r2_3 = r2_score(y_test, y_pred_3)
print(f"r2_3 : {r2_3}")

r2_3 : 0.6049159526150932


In [66]:
rand_search_3.cv_results_

{'mean_fit_time': array([0.62325201, 0.64959438, 0.76644573, 0.92232895, 0.84989266,
        0.72423356, 0.60058713, 0.61549513, 0.61118762, 0.73147058]),
 'std_fit_time': array([0.03801657, 0.03247576, 0.04620511, 0.03200696, 0.05120935,
        0.0224101 , 0.03172412, 0.02597211, 0.02311299, 0.09291206]),
 'mean_score_time': array([0.0043303 , 0.00441606, 0.00430689, 0.0041934 , 0.00466993,
        0.00465193, 0.00396867, 0.00448618, 0.00360906, 0.00364094]),
 'std_score_time': array([0.00061387, 0.0005286 , 0.0005401 , 0.0007186 , 0.00077485,
        0.00077197, 0.00061395, 0.00055283, 0.0005065 , 0.00071052]),
 'param_colsample_bytree': masked_array(data=[0.7035202163422261, 0.8521224661841749,
                    0.8457319731576225, 0.7513272083944118,
                    0.8912210916706643, 0.7534496194855722,
                    0.5790655148269472, 0.8025272258583427,
                    0.5901643021317555, 0.5338622254034191],
              mask=[False, False, False, False, Fal

It seems that the 'max_depth' hyperparameter is best at a value of around 2 or 3.  
Adding the 'colsample_bytree' hyperparameter, sampled from a continuous 
uniform distribution on [0.5, 1], to Randomized Search seemed to improve the 
results.  
Adding the 'subsample', 'gamma', or 'lambda' hyperparameters to Randomized 
Search seemed to worsen the results, so they were removed.  
The optimal values for the other hyperparameters seem random.  
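
The searches in this notebook are unseeded, so part of each difference between runs may be sampling noise: every run draws different candidates and different folds. Below is a hedged sketch, on synthetic data and with scikit-learn's `GradientBoostingRegressor` standing in for XGBoost, of fixing the seeds so that two search spaces are evaluated on the same folds. The helper `run_search`, the ranges, and all seeds are illustrative assumptions.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic regression data standing in for X_train / y_train.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# A seeded splitter, so every search sees identical folds.
folds = KFold(n_splits=3, shuffle=True, random_state=0)


def run_search(space):
    """Run a small, fully seeded randomized search over `space`."""
    search = RandomizedSearchCV(
        GradientBoostingRegressor(n_estimators=30, random_state=0),
        space,
        n_iter=4,
        cv=folds,
        random_state=0,
    )
    search.fit(X_demo, y_demo)
    return search


base_space = {
    "max_depth": randint(2, 4),
    "learning_rate": uniform(0.01, 0.09),
}
# Same space plus one extra hyperparameter under test.
extended_space = dict(base_space, subsample=uniform(0.5, 0.5))

base_search = run_search(base_space)
extended_search = run_search(extended_space)
print(base_search.best_score_, extended_search.best_score_)
```

Seeding makes each result repeatable, but with only a handful of sampled candidates a small gap in `best_score_` can still reflect which candidates happened to be drawn, so repeating the comparison over several seeds is a safer basis for keeping or dropping a hyperparameter.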

In [67]:
print(f"MSE_b : {MSE_b}")
print(f"MSE : {MSE}")
print(f"MSE_tree : {MSE_tree}")
print(f"MSE_1 : {MSE_1}")
print(f"MSE_2 : {MSE_2}")
print(f"MSE_3 : {MSE_3}")

MSE_b : 2936287.9236949654
MSE : 1291684.6549837454
MSE_tree : 1170882.1635574733
MSE_1 : 1162023.7454943722
MSE_2 : 1156605.057788428
MSE_3 : 1160073.5248812167


All 3 ensemble models perform better than the Linear Regression model on the test set.  
Gradient Boosting (Case 2) has the lowest MSE, followed by XGBoost (Case 3) and Random Forest (Case 1).  
The single Decision Tree model performs nearly as well as the ensemble models.
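
Because MSE is in squared sales units, the gaps are easier to judge as RMSE and as a reduction relative to the baseline. The numbers below are copied from the output above; the dictionary keys are just labels.

```python
import math

# Test-set MSE values as printed by the summary cell.
mse = {
    "baseline": 2936287.9236949654,
    "linear": 1291684.6549837454,
    "decision_tree": 1170882.1635574733,
    "random_forest": 1162023.7454943722,
    "gradient_boosting": 1156605.057788428,
    "xgboost": 1160073.5248812167,
}

# Print models from lowest to highest MSE.
for name, value in sorted(mse.items(), key=lambda kv: kv[1]):
    rmse = math.sqrt(value)
    reduction = 1 - value / mse["baseline"]
    print(f"{name:18s} RMSE={rmse:8.1f}  MSE reduction vs baseline={reduction:6.1%}")

best = min(mse, key=mse.get)
print("Lowest MSE:", best)
```

The tree-based models differ from each other by roughly 1% in MSE, which on a single unseeded split may well be within run-to-run variation.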

# Update

The Random Forest model was improved by reducing the range of the 'max_depth' 
hyperparameter in Randomized Search to just {5, 6}.