# Introduction

A retail company wants to understand the purchase behavior of its customers across products in different categories during the Black Friday sale. The available data is a purchase summary for selected high-volume products sold during last year's Black Friday period. It includes customer demographics, product information, and the total purchase amount.

The objective is to develop a model that predicts the purchase amount of customers for various products. This would help the company create personalized offers for customers against different products and understand which areas generate more sales during Black Friday.

# Importing and installing Libraries

In [None]:
pip install pywaffle

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor 
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
import lightgbm
from lightgbm import LGBMRegressor
from sklearn.metrics import make_scorer
from pywaffle import Waffle

# Loading and Understanding Data

In [None]:
train_data = pd.read_csv("../input/black-friday/train.csv")
test = pd.read_csv("../input/black-friday/test.csv")

Checking the variable names and the first 5 rows of the data frame

In [None]:
train_data.head()

Checking the data types and non-null counts of the variables in the data

In [None]:
train_data.info()

Understanding the summary statistics of all variables in the data set:
1. Most variables, such as Occupation and the product categories, are masked with integers, and City_Category is masked with letters.
2. Product P00265242 is the most popular product, with 1880 occurrences.
3. Male buyers are more frequent in the dataset than female buyers.
4. The age group with the most transactions was 26-35.
5. Occupation '4' had the most transactions.
6. The City Category with the most transactions was B.
7. The highest number of purchasers had a '1 year stay' in the current city.
8. The data set has more singles (Marital status 0) than married people (Marital status 1).


In [None]:
train_data.describe(include = 'all')

Checking the number of unique values in each of the variables. User ID and Product ID have a lot of unique values and hence can't be used directly in the model. We need to explore what features can be extracted from these two columns.


In [None]:
train_data.nunique()

Checking the number of null values for each of the variables. Product Category 2 and 3 have a high proportion of nulls, and we need to explore what can be done with these null values.


In [None]:
train_data.isna().sum()
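
Raw null counts are easier to judge as a share of rows. A minimal sketch on a made-up frame (the `toy` data and the `null_share` name are illustrative assumptions, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data; the values are made up.
toy = pd.DataFrame({
    "Product_Category_1": [1, 5, 8, 3],
    "Product_Category_2": [2.0, np.nan, np.nan, 4.0],
    "Product_Category_3": [np.nan, np.nan, np.nan, 5.0],
})

# isna() yields booleans, so the column mean is the fraction of missing rows.
null_share = toy.isna().mean()
print(null_share)
```

The same call on `train_data` would show how large the missing share is for Product Category 2 and 3.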

# Data Visualization and Gathering Insights

Exploring the frequency distributions of all the columns. There are noticeable variations in the variables and this can be helpful in explaining the variation of the purchase amounts in the data set.

In [None]:
train_data.hist(figsize=(20,10), color = 'teal')


In [None]:
gender = train_data['Gender'].value_counts()

fig = plt.figure(
    FigureClass=Waffle, 
    rows=5,
    columns=10,
    values=gender,
    title={'label': 'Gender Distribution', 'loc': 'center','size':20},
    labels=["{}({})".format(a, b) for a, b in zip(gender.index, gender) ],
    legend={'loc': 'upper left', 'bbox_to_anchor': (1,1)},
    font_size=35, 
    icons = ['male','female'],
    icon_legend=True,
    figsize=(12, 8)
)
Marital_Status = train_data['Marital_Status'].value_counts()
fig = plt.figure(
    FigureClass=Waffle, 
    rows=5,
    columns=10,
    values=Marital_Status,
    title={'label': 'Marital Status Distribution', 'loc': 'center','size':20},
    labels=["{}({})".format(a, b) for a, b in zip(Marital_Status.index, Marital_Status) ],
    legend={'loc': 'upper left', 'bbox_to_anchor': (1,1)},
    font_size=35,
    icons = 'ring',
    icon_legend=True,
    figsize=(12, 8)
)

The variation in Gender can help explain the variation in purchase amounts. However, the variation in Marital Status is too small to explain much of it. The average purchase value of single men is slightly higher than that of married men, while the trend is the opposite for women.

In [None]:
fig,ax = plt.subplots(figsize=(20,6),ncols=2,nrows=1)
sns.barplot(x="Gender",y="Purchase",hue="Marital_Status",estimator=np.mean,data=train_data,ax=ax[0] , palette="Set2").set_title(label = 'Gender vs Marital Status Average Purchase Distribution', size =15)
sns.barplot(x="Gender",y="Purchase",hue="Marital_Status",estimator=np.sum,data=train_data,ax=ax[1] , palette="Set2").set_title(label = 'Gender vs Marital Status Purchase Distribution', size =15)

In [None]:
City = train_data['City_Category'].value_counts()
fig = plt.figure(
    FigureClass=Waffle, 
    rows=5,
    columns=10,
    values=City,
    title={'label': 'City Category Distribution', 'loc': 'center','size':20},
    labels=["{}({})".format(a, b) for a, b in zip(City.index, City) ],
    legend={'loc': 'upper left', 'bbox_to_anchor': (1,1)},
    font_size=35,
    icons = 'star',
    icon_legend=True,
    figsize=(12, 8)
)

Stay = train_data['Stay_In_Current_City_Years'].value_counts()
fig = plt.figure(
    FigureClass=Waffle, 
    rows=5,
    columns=10,
    values=Stay,
    title={'label': 'Stay in Current City Distribution', 'loc': 'center','size':20},
    labels=["{}({})".format(a, b) for a, b in zip(Stay.index, Stay) ],
    legend={'loc': 'upper left', 'bbox_to_anchor': (1,1)},
    font_size=35,
    icons = 'circle', 
    icon_legend=True,
    figsize=(12, 8)
)

Customers from city category C buy products of higher purchase value, while stay in current city shows some variation.

In [None]:
fig,ax = plt.subplots(figsize=(20,8),ncols=1,nrows=1)
sns.barplot(x="City_Category",y="Purchase",hue="Stay_In_Current_City_Years",estimator=np.mean,data=train_data, palette="Set2").set_title(label = 'City Category vs Stay in Current City Purchases Distribution', size =20)

The largest number of purchasers is in the 26-35 age group, while the average purchase shows a positive trend with age.

In [None]:

Age = train_data['Age'].value_counts()
fig = plt.figure(
    FigureClass=Waffle, 
    rows=5,
    columns=10,
    values=Age,
    title={'label': 'Age Distribution', 'loc': 'center','size':20},
    labels=["{}({})".format(a, b) for a, b in zip(Age.index, Age) ],
    legend={'loc': 'upper left', 'bbox_to_anchor': (1,1)},
    font_size=35,
    icon_legend=True,
    figsize=(12, 8)
)

In [None]:
fig,ax = plt.subplots(figsize=(12,8),ncols=1,nrows=1)
sns.barplot(x="Age",y="Purchase",estimator=np.mean,data=train_data, palette="Set2", order=["0-17", "18-25","26-35","36-45","46-50","51-55","55+"]).set_title(label = 'Age bucket vs Average Purchase Distribution', size =20)

52% of sales are shared among 5 occupations, while the occupation-wise average spend shows some variation.

In [None]:
Occupation_percent =train_data.groupby('Occupation').Purchase.agg(['sum'])
Occupation_percent=Occupation_percent.apply(lambda x: 100 * x / float(x.sum())).reset_index()
Occupation_percent=Occupation_percent.sort_values(by = ['sum'],ascending=False)
explode = (0,0,0,0,0,0,0,0,0,0,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.5,0.5,0.5,0.5)
plt.figure(figsize=(20,12))
plt.pie(Occupation_percent['sum'],labels=Occupation_percent['Occupation'], explode= explode,autopct='%1.0f%%', counterclock=True,startangle=45, colors = sns.color_palette('Set2'))
plt.title(label= 'Share of Total Purchase Amount By Occupation', loc = 'center', size = 20)  # the pie shows each occupation's share of the summed purchases
plt.show()
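
The combined share of the largest occupations is a sum over the sorted per-occupation percentages. A minimal sketch of that calculation on made-up numbers (the `toy` frame and the choice of the top 2 groups are illustrative assumptions, not values from the dataset):

```python
import pandas as pd

# Made-up purchases for four occupation codes.
toy = pd.DataFrame({
    "Occupation": [0, 0, 1, 4, 7],
    "Purchase": [100, 100, 50, 450, 300],
})

# Percentage of total purchase amount contributed by each occupation.
share = toy.groupby("Occupation")["Purchase"].sum()
share = 100 * share / share.sum()

# Combined share of the two largest occupations.
top2 = share.sort_values(ascending=False).head(2).sum()
print(share.to_dict(), top2)
```

On `train_data`, replacing `head(2)` with `head(5)` would give the combined share of the five largest occupations.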

In [None]:
fig,ax = plt.subplots(figsize=(12,8),ncols=1,nrows=1)
sns.barplot(x="Occupation",y="Purchase",estimator=np.mean,data=train_data, palette="Set2").set_title(label = 'Occupation vs Average Purchase Distribution', size =20)

Number 10 in Product Category 1 has the highest mean purchase value, followed by number 7. Number 5 is bought most frequently.

In [None]:
fig,ax = plt.subplots(figsize=(20,4),ncols=2,nrows=1)
sns.barplot(x="Product_Category_1",y="Purchase",estimator=np.mean,data=train_data,ax=ax[0], palette="Set2").set_title(label = 'Product Category 1 vs Average Purchase Distribution', size =20)
sns.countplot(x="Product_Category_1",data=train_data,ax=ax[1], palette="Set2").set_title(label = 'Total Count Distribution In Product Category 1', size =20)

Number 10 in Product Category 2 and number 3 in Product Category 3 are premium products with high average purchase values.

In [None]:
fig,ax = plt.subplots(figsize=(20,4),ncols=2,nrows=1)
sns.barplot(x="Product_Category_2",y="Purchase",estimator=np.mean,data=train_data,ax=ax[0], palette="Set2").set_title(label = 'Product Category 2 vs Average Purchase Distribution', size =20)
sns.countplot(x="Product_Category_2",data=train_data,ax=ax[1], palette="Set2").set_title(label = 'Total Count Distribution In Product Category 2', size =20)

In [None]:
fig,ax = plt.subplots(figsize=(20,4),ncols=2,nrows=1)
sns.barplot(x="Product_Category_3",y="Purchase",estimator=np.mean,data=train_data,ax=ax[0], palette="Set2").set_title(label = 'Product Category 3 vs Average Purchase Distribution', size =20)
sns.countplot(x="Product_Category_3",data=train_data,ax=ax[1], palette="Set2").set_title(label = 'Total Count Distribution In Product Category 3', size =20)

In 50% of transactions, customers buy a product with a purchase value between roughly 5,500 and 13,000 during Black Friday.

In [None]:
def triple_plot(x, title):
    fig, ax = plt.subplots(1,3,figsize=(15,4),sharex=True)
    # histplot with kde=True replaces distplot, which is deprecated in recent seaborn
    sns.histplot(x, kde=True, stat='density', ax=ax[0], color = 'teal')
    ax[0].set(xlabel=None)
    ax[0].set_title('Histogram + KDE')
    sns.boxplot(x=x, ax=ax[1], color = 'lightslategrey')
    ax[1].set(xlabel=None)
    ax[1].set_title('Boxplot')
    sns.violinplot(x=x, ax=ax[2], color = 'springgreen')
    ax[2].set(xlabel=None)
    ax[2].set_title('Violin plot')
    fig.suptitle(title, fontsize=16)
    plt.tight_layout(pad=3.0)
    plt.show()
triple_plot(train_data['Purchase'],'Distribution of Purchase')
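
The box in the boxplot spans the first to the third quartile, which is where the "middle 50%" range comes from. A minimal sketch of reading that range off numerically, using a made-up `purchase` series rather than the real column:

```python
import pandas as pd

# Made-up purchase amounts standing in for train_data['Purchase'].
purchase = pd.Series([2000, 5000, 6000, 8000, 9000, 12000, 14000, 20000])

# First and third quartiles bound the central half of the values.
q1, q3 = purchase.quantile([0.25, 0.75])

# Fraction of observations that fall inside [q1, q3].
inside = purchase.between(q1, q3).mean()
print(q1, q3, inside)
```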

# Data Processing

Creating initial file for Submission

In [None]:
submission = test[['User_ID','Product_ID']].copy()  # copy, so a Purchase column can be added later without a SettingWithCopyWarning

Splitting the input train data into train and development sets, so that the model is fit on train and evaluated on development data before being used for predictions on test. Started with an 80-20 split and, after selecting the model, used 99.9% of the data (frac=0.999 below) for the final predictions.

In [None]:
train = train_data.sample(frac=0.999,random_state=0) #random state is a seed value
dev = train_data.drop(train.index)
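
The split works because `sample` keeps the original index labels, so dropping those labels leaves exactly the held-out rows. A minimal sketch on a made-up frame with the initial 80-20 ratio (`toy`, `train_part` and `dev_part` are illustrative names):

```python
import pandas as pd

# Ten made-up rows standing in for train_data.
toy = pd.DataFrame({"Purchase": range(10)})

# Randomly pick 80% of the rows; random_state makes the draw reproducible.
train_part = toy.sample(frac=0.8, random_state=0)

# Everything not sampled becomes the development set.
dev_part = toy.drop(train_part.index)
print(len(train_part), len(dev_part))
```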

Converting the Age and Stay in Current City buckets to numbers, so that these columns are numeric rather than categorical

In [None]:
# astype(int) makes the result a true integer column; Stay_In_Current_City_Years is read as strings because of '4+'
train['Age'] = train['Age'].replace(['0-17','18-25','26-35','36-45','46-50','51-55','55+'],[9,22,31,41,48,53,60]).astype(int)
train['Stay_In_Current_City_Years'] = train['Stay_In_Current_City_Years'].replace(['4+'],[5]).astype(int)

dev['Age'] = dev['Age'].replace(['0-17','18-25','26-35','36-45','46-50','51-55','55+'],[9,22,31,41,48,53,60]).astype(int)
dev['Stay_In_Current_City_Years'] = dev['Stay_In_Current_City_Years'].replace(['4+'],[5]).astype(int)

test['Age'] = test['Age'].replace(['0-17','18-25','26-35','36-45','46-50','51-55','55+'],[9,22,31,41,48,53,60]).astype(int)
test['Stay_In_Current_City_Years'] = test['Stay_In_Current_City_Years'].replace(['4+'],[5]).astype(int)
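
The same bucket-to-number conversion can be written once as a dictionary and reused for train, dev and test. A minimal sketch, assuming the same representative ages as above (`age_map` and the sample series are illustrative names):

```python
import pandas as pd

# Representative age for each bucket, matching the values used in the cell above.
age_map = {'0-17': 9, '18-25': 22, '26-35': 31, '36-45': 41,
           '46-50': 48, '51-55': 53, '55+': 60}

ages = pd.Series(['0-17', '26-35', '55+'])

# map() looks each bucket up in the dictionary and returns an integer series.
numeric_age = ages.map(age_map)
print(numeric_age.tolist())
```

One design difference worth knowing: `map` turns any bucket missing from the dictionary into NaN, whereas `replace` leaves it unchanged.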

Converting Categories to Numerical type because LightGBM expects the categorical data to be encoded as numbers

In [None]:
train['Gender'] = train['Gender'].replace(['M','F'],[0,1])
train['City_Category'] = train['City_Category'].replace(['A','B','C'],[1,2,3])

dev['Gender'] = dev['Gender'].replace(['M','F'],[0,1])
dev['City_Category'] = dev['City_Category'].replace(['A','B','C'],[1,2,3])

test['Gender'] = test['Gender'].replace(['M','F'],[0,1])
test['City_Category'] = test['City_Category'].replace(['A','B','C'],[1,2,3])

There are a lot of missing values in Product Category 2 and 3. Instead of imputing them, a new category '0' is created. Imputing with the mode was also tried, but treating the missing values as a new category gave the best score.

In [None]:
train = train.fillna(0)
dev = dev.fillna(0)
test = test.fillna(0)

As the User ID and Product ID columns have to be dropped, new features that approximate them are created: Average Cost (mean purchase amount per product) for Product ID and Buying Power (mean purchase amount per user) for User ID. The test data has no purchase amounts from which to calculate these, so the values computed on train are merged into test and dev. Any users or products in test or dev that do not appear in train are imputed with the global averages.

In [None]:
train['Average_Cost'] = train.groupby(['Product_ID'])['Purchase'].transform('mean')
train['Buying_Power'] =  train.groupby(['User_ID'])['Purchase'].transform('mean')

product_price = train[['Average_Cost','Product_ID']].drop_duplicates()
average_cost = product_price['Average_Cost'].mean()
print("Average Cost of Products is ", average_cost)

user_buying_power =  train[['Buying_Power','User_ID']].drop_duplicates()
buying_power = user_buying_power['Buying_Power'].mean()
print("Average Buying Power of users is ", buying_power)

In [None]:
print("Dev dimensions before adding features ", dev.shape)
print("Test dimensions before adding features ", test.shape)

dev = dev.merge(product_price, how = 'left', left_on = 'Product_ID',right_on = 'Product_ID')
dev = dev.merge(user_buying_power, how = 'left', left_on = 'User_ID',right_on = 'User_ID')

test = test.merge(product_price, how = 'left', left_on = 'Product_ID',right_on = 'Product_ID')
test = test.merge(user_buying_power, how = 'left', left_on = 'User_ID',right_on = 'User_ID')

print("Dev dimensions after adding features ", dev.shape)
print("Test dimensions after adding features ", test.shape)

In [None]:
print("Nulls in Dev before imputing features ", dev.isna().sum())
print("Nulls in Test before imputing features ", test.isna().sum())

dev['Average_Cost']  = dev['Average_Cost'].fillna(average_cost)
dev['Buying_Power']  = dev['Buying_Power'].fillna(buying_power)

test['Average_Cost']  = test['Average_Cost'].fillna(average_cost)
test['Buying_Power']  = test['Buying_Power'].fillna(buying_power)

print("Nulls in Dev after imputing features ", dev.isna().sum())
print("Nulls in Test after imputing features ", test.isna().sum())
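
The merge-then-fill steps above amount to a lookup with a fallback. A minimal sketch of the same idea using `map` on made-up data (the toy frames and the unseen product `P9` are illustrative assumptions):

```python
import pandas as pd

# Made-up training purchases for two products.
train_toy = pd.DataFrame({
    "Product_ID": ["P1", "P1", "P2"],
    "Purchase": [100, 300, 500],
})
# P9 never appears in the training data.
test_toy = pd.DataFrame({"Product_ID": ["P1", "P2", "P9"]})

# Mean purchase per product, learned from train only.
product_mean = train_toy.groupby("Product_ID")["Purchase"].mean()

# Fallback for unseen products: the mean of the per-product means.
global_mean = product_mean.mean()

test_toy["Average_Cost"] = test_toy["Product_ID"].map(product_mean).fillna(global_mean)
print(test_toy)
```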

Creating a list of categorical columns, which can be used as input for LightGBM

In [None]:
categorical_columns = ["Gender", "Occupation", "City_Category", "Stay_In_Current_City_Years",
                       "Marital_Status", "Product_Category_1", "Product_Category_2", "Product_Category_3"]

The one-hot encoding code is commented out, as it is required only for Linear Regression, Decision Tree and Random Forest. LightGBM has an internal way of handling categorical data.

In [None]:
# train = pd.get_dummies(train, columns= categorical_columns)
# dev = pd.get_dummies(dev, columns= categorical_columns)
# test = pd.get_dummies(test, columns= categorical_columns)

# for i in list(train.columns):
#     if i not in list(dev.columns):
#         dev[i] = 0
#     if i not in list(test.columns):
#         test[i] = 0
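
When the one-hot path is used, the loop that adds missing columns can also be expressed with `reindex`, which fixes the column order at the same time. A minimal sketch on made-up frames (this is an alternative formulation, not the code that was run for the results):

```python
import pandas as pd

# Category C appears in train but not in test.
train_toy = pd.DataFrame({"City_Category": ["A", "B", "C"]})
test_toy = pd.DataFrame({"City_Category": ["A", "A", "B"]})

train_enc = pd.get_dummies(train_toy, columns=["City_Category"])
test_enc = pd.get_dummies(test_toy, columns=["City_Category"])

# Add any column seen only in train (filled with 0) and match train's column order.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc.columns.tolist())
```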

Dropping the User ID and Product ID columns

In [None]:
train = train.drop(columns =['User_ID','Product_ID'])
dev = dev.drop(columns =['User_ID','Product_ID'])
test = test.drop(columns =['User_ID','Product_ID'])

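Moving the target column to the end and aligning dev to the same column order, so that the positional slicing used below picks the features and the target correctly. The remaining lines are commented out, as they are needed only with the one-hot encoded columns

In [None]:
train = train[ [ col for col in train.columns if col != 'Purchase' ] + ['Purchase'] ]
dev = dev[train.columns]  # the merge appended Average_Cost and Buying_Power after Purchase in dev
# test = test[train.columns]
# test = test.drop(columns =['Purchase'])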

Checking correlation of numerical variables

In [None]:
sns.heatmap(train[ [ col for col in train.columns if col not in categorical_columns ]].corr(),annot=True,cmap='RdYlGn',linewidths=0.2,annot_kws={'size':10})
#run this only with LightGBM Columns

Checking scatter plots of numerical variables

In [None]:
sns.pairplot(train[ [ col for col in train.columns if col not in categorical_columns ]], palette = 'Set2') #run this only with LightGBM Columns

Creating the datasets for modelling

In [None]:
X_train = train.iloc[:,:-1].values
X_dev = dev.iloc[:,:-1].values
y_train = train.iloc[:,-1].values
y_actual = dev.iloc[:,-1].values

X_test = test.iloc[:,:].values

Scaling the data based on the training data so that all columns are on the same scale

In [None]:
Scaler = StandardScaler()

Scaler.fit(X_train)
X_train = Scaler.transform(X_train)
X_dev = Scaler.transform(X_dev)
X_test = Scaler.transform(X_test)
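
Fitting the scaler on train only means dev and test are transformed with the training mean and standard deviation, not their own. A minimal sketch with made-up numbers (`X_tr` and `X_new` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One made-up feature: training values 0 and 10, a new value 20.
X_tr = np.array([[0.0], [10.0]])
X_new = np.array([[20.0]])

# The scaler learns mean 5 and standard deviation 5 from the training rows.
scaler = StandardScaler().fit(X_tr)

# The new row is scaled with those training statistics: (20 - 5) / 5.
scaled = scaler.transform(X_new)
print(scaler.mean_, scaled)
```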

# Train, Predict and Test Model Performance

Creating a scorer to measure the model performance

In [None]:
def rmse(predictions, targets): 
  return np.sqrt(((predictions - targets) ** 2).mean())

rmse_score = make_scorer(rmse, greater_is_better=False)
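
With `greater_is_better=False`, the scorer returns the negated RMSE so that scikit-learn can always maximize the score; this is why the best score is multiplied by -1 before printing later on. A minimal sketch using a dummy model on made-up data (the `DummyRegressor` and the arrays are illustrative assumptions):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer

def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())

rmse_score = make_scorer(rmse, greater_is_better=False)

# A model that always predicts the mean target (2.0), so every error is 1.0.
X = np.zeros((4, 1))
y = np.array([1.0, 3.0, 1.0, 3.0])
model = DummyRegressor(strategy="mean").fit(X, y)

# The scorer is called as scorer(estimator, X, y) and flips the sign of the RMSE.
value = rmse_score(model, X, y)
print(value)
```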

Ran all the basic models to check RMSE and selected the model with the lowest RMSE for further hyperparameter tuning

In [None]:
# reg = LinearRegression()
# reg.fit(X_train,y_train)
# y_pred = reg.predict(X_dev)
# print("Linear Regression RMSE on 20% data is ", rmse(y_pred,y_actual))

In [None]:
# reg = DecisionTreeRegressor()
# reg.fit(X_train,y_train)
# y_pred = reg.predict(X_dev)
# print("Decision Tree Regressor RMSE on 20% data is ", rmse(y_pred,y_actual))

In [None]:
# reg = RandomForestRegressor()
# reg.fit(X_train,y_train)
# y_pred = reg.predict(X_dev)
# print("Random Forest Regressor RMSE on 20% data is ", rmse(y_pred,y_actual))

In [None]:
# reg = LGBMRegressor(metric = 'rmse')
# reg.fit(X_train,y_train)
# y_pred = reg.predict(X_dev)
# print("LightGBM RMSE on 20% data is ", rmse(y_pred,y_actual))

LightGBM gave us the best score. We will tune this model further to improve its performance, using cross-validation over 4 folds.

Performing a step wise grid search for tuning the model

In [None]:
lgb = LGBMRegressor(metric = 'rmse', categorical_columns = categorical_columns,subsample = 0.5, num_leaves = 500, num_iterations =200,  random_state=0 )
param_test ={'learning_rate' : [0.05,0.1,0.2,0.3]}

Total_sets = 100

from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
gs = RandomizedSearchCV(
    estimator=lgb, param_distributions=param_test, 
    n_iter=Total_sets,
    scoring=rmse_score,
    cv=4,
    refit=True,
    random_state=314,
    n_jobs = 4,
    verbose=True)
gs.fit(X_train, y_train)
print('Best score reached: {} with params: {} '.format(-1*gs.best_score_, gs.best_params_))
y_pred  = gs.predict(X_dev)
score =  rmse(y_pred,y_actual)
print('RMSE on dev data: {}'.format(score))
y_test = gs.predict(X_test)
submission['Purchase'] = y_test  # assign the prediction array directly; row order matches test
submission.to_csv("./submission_jupyter.csv")

In [None]:
lgb = LGBMRegressor(metric = 'rmse', categorical_columns = categorical_columns,subsample = 0.5, num_leaves = 500, num_iterations =200,  random_state=0,learning_rate = 0.1)
lgb.fit(X_train, y_train)
y_predicted = lgb.predict(X_train)

This solution ranked in the top 10 percent in the hackathon.

Plotting the feature importance. As expected, Average Cost and Buying Power are the features with the highest importance. As seen in the visualizations, City, Stay in City, Marital Status and Gender did not show much variation, which led to a lower effect on the purchase amount. Occupation and the product categories also had a significant effect on the purchase amount.

In [None]:
# zip stops at the shorter input, so the trailing 'Purchase' column name is ignored
feature_imp = pd.DataFrame(sorted(zip(lgb.feature_importances_,train.columns)), columns=['Value','Feature'])

plt.figure(figsize=(15, 10))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
plt.title(label= 'LightGBM Feature Importance', size = 20)
plt.tight_layout()
plt.show()

Checking the scatter of Purchase vs Predicted Purchase for the complete dataset

In [None]:
plt.scatter(y_train, y_predicted, c = 'green')
plt.xlabel("Purchase")
plt.ylabel("Predicted Purchase")
plt.title(label = "Purchase vs Predicted Purchase", size = 20)
plt.show()

Checking the Scatter of Purchase vs Residual for the complete dataset

In [None]:
plt.scatter(y_train, y_train -y_predicted, c = 'red')
plt.xlabel("Purchase")
plt.ylabel("Residual")
plt.title(label = "Purchase vs Residual", size = 20)
plt.show()