 # Predicting Black Friday Sale Values

From Kaggle:<br>

A retail company “ABC Private Limited” wants to understand customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared a purchase summary of various customers for selected high-volume products from last month.<br>
The data set also contains customer demographics (age, gender, marital status, `city_type`, `stay_in_current_city`), product details (`product_id` and product category) and the total `purchase_amount` from last month.

Now, they want to build a model to predict the purchase amount of customers against various products, which will help them create personalized offers for customers against different products.

## Import Packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
import os
warnings.filterwarnings('ignore')
%matplotlib inline

## Load Data

In [None]:
data_train=pd.read_csv('../input/black-friday-sales-prediction/train_oSwQCTC (1)/train.csv')
data_test=pd.read_csv('../input/black-friday-sales-prediction/test_HujdGe7 (1)/test.csv')

Due to the memory limits of Kaggle notebooks, I am forced to subset the data. Note that this keeps only the first 10,000 rows of each file.

In [None]:
data_train=data_train[0:10000].copy()  # copy so later column assignments do not act on a view
data_test=data_test[0:10000].copy()
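
Taking the first 10,000 rows preserves whatever order the file was written in. If the file happens to be sorted (for example by user), a random sample may be more representative. A minimal sketch of that alternative, where the toy frame `df` is a hypothetical stand-in for the real training data:

```python
import pandas as pd

# Hypothetical stand-in for the full training frame
df = pd.DataFrame({"User_ID": range(100), "Purchase": range(100, 200)})

# Draw 10 rows at random; random_state makes the draw reproducible
subset = df.sample(n=10, random_state=42).reset_index(drop=True)
print(subset.shape)
```

On the real data the same call would be applied to `data_train` with `n=10000`.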

# Brief look at the data<br>
What columns do we have? 

In [None]:
print(data_train.columns.values)

What types are they?

In [None]:
print(data_train.dtypes)

View the data 

In [None]:
data_train.head(5)

Are there NAs in cols?

In [None]:
data_train.isna().sum()

The only NAs are in the product category columns. Since not every product belongs to more than one category, this is expected.

Is the target variable normally distributed?

In [None]:
import pylab 
import scipy.stats as stats

In [None]:
stats.probplot(data_train['Purchase'], dist="norm", plot=pylab)
pylab.show()

Histogram

In [None]:
plt.hist(data_train['Purchase'])
plt.show()

Not very normal - let's log transform. 

In [None]:
data_train['Purchase_log']=np.log(data_train['Purchase'])
stats.probplot(data_train['Purchase_log'], dist="norm", plot=pylab)
pylab.show()

This did not improve the distribution. A Box-Cox transformation would be more appropriate here.

In [None]:
data_train['Purchase'], lmbda = stats.boxcox(data_train['Purchase'])  # keep lmbda for the inverse transform
stats.probplot(data_train['Purchase'], dist="norm", plot=pylab)  # plot the Box-Cox values, not Purchase_log
pylab.show()
plt.hist(data_train['Purchase'])
plt.show()

The data now appears reasonably normal. We will need to remember to back-transform the predictions later using `inv_boxcox` with the same `lmbda`.
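
For reference, Box-Cox maps a positive value y to (y^λ - 1)/λ when λ is non-zero and to log(y) when λ is zero, so it can only be undone with the same λ that was fitted. A minimal round-trip sketch on synthetic positive data (the gamma-distributed `y` is an assumption for illustration, not the real purchase column):

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Toy positive values standing in for purchase amounts (Box-Cox requires y > 0)
rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=4000.0, size=1000) + 1.0

# Forward transform: lambda is estimated by maximum likelihood
y_bc, lam = stats.boxcox(y)

# Back-transforming with the same lambda recovers the original values
y_back = inv_boxcox(y_bc, lam)
print(np.allclose(y, y_back))
```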

# Formatting categories

In [None]:
category_cols=['Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years',
               'Marital_Status', 'Product_Category_1', 'Product_Category_2', 'Product_Category_3']

In [None]:
for col in category_cols:
    data_train[col] = pd.Categorical(data_train[col])
    
for col in category_cols:
    data_test[col] = pd.Categorical(data_test[col])

How many unique values are in each categorical column? If there are too many, one-hot (dummy) encoding may not be practical.

In [None]:
cat_uniques = pd.DataFrame([[i, len(data_train[i].unique())] for i in data_train[category_cols].columns], columns=['Variable', 'Unique Values']).set_index('Variable')
print(cat_uniques)
    
### Visualisation
n=len(category_cols)
fig,ax = plt.subplots(n,1, figsize=(6,n*2), sharex=True)
for i in range(n):
    plt.sca(ax[i])
    col = category_cols[i]
    sns.barplot(x=col, y='Purchase', data=data_train)
    
## Encoding
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()

In [None]:
for col in category_cols:
    data_train[col] = le.fit_transform(data_train[col])
    
for col in category_cols:
    data_test[col] = le.fit_transform(data_test[col])
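
Calling `fit_transform` separately on the train and test sets gives each set its own mapping, so the same category can receive different integer codes whenever the two sets do not contain exactly the same values. One possible alternative (not what this notebook does) is to fit the encoder once on the training data and reuse it on the test data. The sketch below uses scikit-learn's `OrdinalEncoder` on toy frames; the frames and the choice of `-1` for unseen values are assumptions for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames; "C" appears only in the test set
train = pd.DataFrame({"City_Category": ["A", "B", "A", "B"]})
test = pd.DataFrame({"City_Category": ["B", "A", "C"]})

# Unseen categories get the code -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = enc.fit_transform(train[["City_Category"]])  # learn the mapping on train
test_codes = enc.transform(test[["City_Category"]])        # reuse it on test

print(train_codes.ravel())
print(test_codes.ravel())
```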
    

# Split data

In [None]:
X=data_train.drop(['Purchase', 'Purchase_log', 'User_ID', 'Product_ID'], axis=1)
y=data_train['Purchase']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test  = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline Regression Algo Testing<br>
Import regressors

In [None]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Lasso, Ridge

RMSE is required as the primary performance metric. We'll also define some others.

In [None]:
from sklearn.metrics import r2_score

In [None]:
def rmse(y_true, y_preds):
    return np.sqrt(((y_preds - y_true) ** 2).mean())

In [None]:
def mean_absolute_percentage_error(y_true, y_pred): 
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

In [None]:
reg_algos = [
    RandomForestRegressor(),
    SVR(kernel='rbf'),
    DecisionTreeRegressor(),
    GradientBoostingRegressor(),
    MLPRegressor(),
    Ridge(),
    Lasso()]

In [None]:
for algo in reg_algos:
    algo.fit(X_train, y_train)
    name = algo.__class__.__name__
    
    print("_"*30)
    print(name)
    
    print('****Results****')
    test_predictions = algo.predict(X_test)  # predictions on the hold-out split
    
    # calculate score
    RMSE=rmse(y_test, test_predictions)
    r2=r2_score(y_test, test_predictions)
    MAPE=mean_absolute_percentage_error(y_test, test_predictions)
    
    print("RMSE: {:.4}".format(RMSE))
    print("R^2: {:.4}".format(r2))
    print("MAPE: {:.4}".format(MAPE))
    
print("_"*30)
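
Because `Purchase` was Box-Cox transformed before the split, the scores printed above are on the transformed scale rather than in purchase units. To report an error in the original units, both the targets and the predictions can be back-transformed first. A minimal sketch on synthetic data, where the noisy `pred_bc` is a hypothetical stand-in for real model output:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

rng = np.random.default_rng(1)
y_raw = rng.gamma(shape=2.0, scale=4000.0, size=500) + 100.0  # toy purchase amounts
y_bc, lam = stats.boxcox(y_raw)

# Hypothetical predictions on the transformed scale: truth plus small noise
pred_bc = y_bc + rng.normal(scale=0.1, size=y_bc.shape)

rmse_transformed = rmse(y_bc, pred_bc)
# Back-transform both sides before scoring to get an error in original units
rmse_original = rmse(inv_boxcox(y_bc, lam), inv_boxcox(pred_bc, lam))

print("RMSE (Box-Cox scale): {:.4}".format(rmse_transformed))
print("RMSE (original units): {:.4}".format(rmse_original))
```

The two numbers are not comparable with each other; only the second can be read as a typical error in purchase amount.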

The top-performing algorithms are gradient boosting (GB) and random forest (RF). This is most likely because there are many interactions between variables that I did not feature engineer, and tree ensembles can pick these up on their own.<br>
Let's tune both and combine them.

# Tuning Random Forest

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
n_estimators=[100, 500, 1000, 1500, 2000] # Define params
max_features=[1.0, 'sqrt'] # 1.0 = all features; 'auto' is no longer accepted by recent scikit-learn
max_depth=[1, 10,  50, 100]
max_depth.append(None)
min_samples_split=[2,5,10,20,50]
min_samples_leaf=[1,5,10]

In [None]:
grid_params={'n_estimators':n_estimators,
             'max_features':max_features,
             'max_depth':max_depth,
             'min_samples_split':min_samples_split,
             'min_samples_leaf':min_samples_leaf}

In [None]:
rf=RandomForestRegressor(random_state=40) # Initiate base model

Fit RF CV Search

In [None]:
rf_rand = RandomizedSearchCV(estimator=rf, 
                             param_distributions=grid_params, 
                             scoring='neg_root_mean_squared_error',
                             n_iter=500,
                             cv=3,
                             random_state=40,
                             verbose = 2,n_jobs=-1) 

In [None]:
rf_rand.fit(X_train, y_train)
print("Best CV RMSE:", -rf_rand.best_score_)
print("Best RF params:")
print (rf_rand.best_params_)

Make final rf object

In [None]:
tuned_rf=RandomForestRegressor(**rf_rand.best_params_)

# Tuning Gradient Boost

In [None]:
learning_rate = [1, 0.5, 0.1, 0.05, 0.01, 0.001]
n_estimators = [100, 500, 1000, 1500, 2000]
max_depths = [1, 10,  50, 100]
min_samples_splits = [2,5,10,20,50]
min_samples_leafs = [1,5,10]

In [None]:
grid_params={'n_estimators':n_estimators,
             'learning_rate':learning_rate,
             'max_depth':max_depths,                   # use the GB lists defined above,
             'min_samples_split':min_samples_splits,   # not the RF ones left over from earlier
             'min_samples_leaf':min_samples_leafs}

In [None]:
gb=GradientBoostingRegressor(random_state=40)
gb_rand = RandomizedSearchCV(estimator=gb, 
                             param_distributions=grid_params, 
                             scoring='neg_root_mean_squared_error',
                             n_iter=500,
                             cv=3,
                             random_state=40,
                             verbose = 2,n_jobs=-1) 

In [None]:
gb_rand.fit(X_train, y_train)
print("Best CV RMSE:", -gb_rand.best_score_)
print("Best GB params:")
print (gb_rand.best_params_)

Make final gb object

In [None]:
tuned_gb=GradientBoostingRegressor(**gb_rand.best_params_)

# Voting Regressor

In [None]:
from sklearn.ensemble import VotingRegressor

In [None]:
ensemble_reg = VotingRegressor(estimators=[('tuned_rf',tuned_rf), 
               ('tuned_gb',tuned_gb)])
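
With no weights given, `VotingRegressor` fits a clone of each base estimator and returns the plain average of their predictions. A small self-contained check of that behaviour on synthetic data (the data and the small `n_estimators` values are assumptions chosen to keep the example fast):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

rf = RandomForestRegressor(n_estimators=20, random_state=0)
gb = GradientBoostingRegressor(n_estimators=20, random_state=0)
vote = VotingRegressor(estimators=[("rf", rf), ("gb", gb)]).fit(X, y)

# The fitted clones live in estimators_; the ensemble output is their mean
manual = np.mean([est.predict(X) for est in vote.estimators_], axis=0)
print(np.allclose(vote.predict(X), manual))
```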

Now let's test how each of these performs on the hold-out set.

In [None]:
final_algos = [
    tuned_rf,
    tuned_gb,
    ensemble_reg]

In [None]:
performance=[]
for algo in final_algos:
    
    algo.fit(X_train, y_train)
    name = algo.__class__.__name__
    
    print("_"*30)
    print(name)
    
    print('****Results****')
    test_predictions = algo.predict(X_test)  # predictions on the hold-out split
    
    # calculate score
    RMSE=rmse(y_test, test_predictions)
    r2=r2_score(y_test, test_predictions)
    MAPE=mean_absolute_percentage_error(y_test, test_predictions)
    
    print("RMSE: {:.4}".format(RMSE))
    print("R^2: {:.4}".format(r2))
    print("MAPE: {:.4}".format(MAPE))
    
    cols=["Algo", "RMSE"]
    performance_df = pd.DataFrame([[name, RMSE]], columns=cols)
    performance.append(performance_df)
    
print("_"*30)

In [None]:
performance=pd.concat(performance, axis=0).sort_values(by='RMSE')
best_algo=performance['Algo'].values[0] # get best algo

In [None]:
print('The best performing algorithm is: ' , best_algo)

In [None]:
if best_algo=='VotingRegressor':
    print('Ensembling via voting improved performance')
else:
    print('Ensembling via voting did not improve performance')
    
# Assign best algo for final predictions
if best_algo=='VotingRegressor':
    final_algo=ensemble_reg
elif best_algo=='GradientBoostingRegressor':  # elif, so the else below cannot overwrite the ensemble choice
    final_algo=tuned_gb
else:
    final_algo=tuned_rf
    
### Final Predictions
# Lastly, I will predict on the test data
user_ids=data_test['User_ID']
product_ids=data_test['Product_ID']
data_test=data_test.drop(['User_ID', 'Product_ID'], axis=1)
final_preds = final_algo.predict(data_test)

Inverse BoxCox

In [None]:
from scipy.special import inv_boxcox
final_preds=inv_boxcox(final_preds, lmbda)

Format for submission

In [None]:
final_df=pd.DataFrame({'Purchase':final_preds,
                      'User_ID':user_ids,
                      'Product_ID':product_ids})

In [None]:
final_df.to_csv('submission.csv', index=False)  # leave the row index out of the submission file