<h1><b><u>Predict Black Friday Sales</u></b></h1>
<h2><b>Background About Data</b></h2>
<p>A retail company “ABC Private Limited” wants to understand customer purchase behaviour (specifically, the purchase amount) against various products of different categories. They have shared a purchase summary of various customers for selected high-volume products from last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and the total purchase_amount from last month.</p>

<h2><b>Problem Statement</b></h2>
<p>They now want to build a model that predicts the purchase amount of a customer for various products, which will help them create personalized offers for customers across different products.</p>




In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


<h2><b> Understanding the Data</b></h2>
<p> The data set consists of 2 files:
<ol>
<li>train.csv: This file will be used to build the model</li>
<li>test.csv: This file will be used to predict the purchase </li>
</ol>
The data set consists of following Columns:
<ul>
<li>User_ID : User id of the customer</li>
<li>Product_ID: Product id of the product</li>
<li>Gender: male or female</li>
<li>Age: Age in bins, i.e. 0-17, 18-25, 26-35, 36-45, 46-50, 51-55, 55+</li>
<li>Occupation: Occupation (Masked)</li>
<li>City_Category: Category of the City (A,B,C)</li>
<li>Stay_In_Current_City_Years: Number of years the customer has stayed in the current city</li>
<li>Marital_Status: 0-Unmarried, 1-Married</li>
<li>Product_Category_1: Product Category (Masked)</li>
<li>Product_Category_2: Product may also belong to another category (Masked)</li>
<li>Product_Category_3: Product may also belong to another category (Masked)</li>
<li>Purchase: Purchase Amount (Target Variable)</li></ul>

<p><b>Questions that may be interesting to follow up:</b>
<ol>
<li>Which type of client spends more?</li>
<li>Which product category had the highest sales?</li>
<li>Who spends more: married or unmarried customers?</li>
<li>Which products are bought most, by age and gender?</li>
</ol>

<h2><b>Analysis step</b></h2>
<p>The goal is to identify the most important variables and to choose the best regression model for predicting the target variable.
The analysis is therefore divided into six stages:
<ol>
<li>Exploratory data analysis (EDA)</li>
<li>Data Pre-processing</li>
<li>Feature engineering</li>
<li>Modeling</li>
<li>Improving the Model (Hyperparameter tuning)</li>
<li>Ensembling</li>
</ol>







In [None]:
import pandas as pd
filename = "../input/black-friday/train.csv"

train=  pd.read_csv(filename)

train.head()

In [None]:
test= pd.read_csv("../input/black-friday/test.csv")
test.head()

In [0]:
train.describe()

<p>It seems that the columns Product_Category_2 and Product_Category_3 have null values.</p>

In [None]:
# Checking for null values
train['Product_Category_1'].isna().mean()*100, train['Product_Category_2'].isna().mean()*100, train['Product_Category_3'].isna().mean()*100

<p>Product_Category_3 has the most null values, close to 70 percent of its rows, so we drop that feature and keep Product_Category_1 and Product_Category_2.</p>
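<p>The decision above can be sketched as a small threshold rule. This is only an illustration on a made-up toy frame: the name <code>toy</code> and the 60 percent cut-off are assumptions, not values taken from this notebook.</p>

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train; the values are made up for illustration.
toy = pd.DataFrame({
    "Product_Category_1": [1, 5, 8, 1, 5],
    "Product_Category_2": [2.0, np.nan, 14.0, 8.0, np.nan],
    "Product_Category_3": [np.nan, np.nan, np.nan, 16.0, np.nan],
})

# Percentage of missing values per column.
null_pct = toy.isna().mean() * 100

# Hypothetical cut-off: drop any column with more than 60% missing values.
threshold = 60
to_drop = null_pct[null_pct > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(null_pct.round(1))
print("Dropped:", to_drop)
```

<p>A threshold keeps the rule explicit, so the same check can be repeated if the data changes.</p>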

In [None]:
# dropping Product_Category_3 column
train.drop(["Product_Category_3"],  axis=1, inplace=True)


In [0]:
train.columns

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split  
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

<h2><b>1. Exploratory Data Analysis</b></h2>



<h3>Distribution of the target variable: Purchase</h3>


In [None]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,7))
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(x=train.Purchase, bins=25, kde=True)
plt.xlabel("Amount spent in Purchase")
plt.ylabel("Number of Buyers")
plt.title("Purchase amount Distribution")

<p>The target variable appears to follow a roughly Gaussian (normal) distribution.</p>
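<p>A visual impression of normality can be backed up with two summary numbers, skewness and excess kurtosis, both of which are close to zero for a normal distribution. The sketch below uses a synthetic sample (the mean and spread are arbitrary assumptions), not the real Purchase column; the same two calls could be applied to <code>train.Purchase</code>.</p>

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Purchase column; parameters are arbitrary.
rng = np.random.default_rng(0)
sample = pd.Series(rng.normal(loc=9000, scale=5000, size=10_000), name="Purchase")

# Skewness measures asymmetry; pandas' kurt() returns excess kurtosis,
# so both values are near 0 for normally distributed data.
skewness = sample.skew()
excess_kurtosis = sample.kurt()

print("skewness:", round(skewness, 3))
print("excess kurtosis:", round(excess_kurtosis, 3))
```

<p>Values far from zero would suggest that the histogram only looks bell-shaped at first glance.</p>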

<p>Now that we’ve analysed our target variable, let’s consider our predictors (independent variables, IV). Let’s start by seeing which of our features are numeric.</p>

In [None]:
numeric_features = train.select_dtypes(include=[np.number])
numeric_features.dtypes

<h3> Distribution of the variable Marital_Status</h3>

In [None]:
sns.countplot(x=train.Marital_Status)

<p>As expected, more purchases on Black Friday come from single people than from married people.</p>



<h3>Distribution of the variable Product_Category_1</h3>

In [None]:
sns.countplot(x=train.Product_Category_1)
plt.xticks()

<p>From the distribution of Product_Category_1, it is clear that three categories stand out: 1, 5 and 8. Unfortunately, the categories are masked, so we do not know what each number represents.</p>

<h3>Distribution of the variable Product_Category_2</h3>

In [None]:
sns.countplot(x=train.Product_Category_2)
plt.xticks(rotation=90)

<h3>Correlation between Numerical Predictors (IV) and Target Variable (DV)</h3>

In [None]:
corr = numeric_features.corr()


In [None]:
#correlation matrix
f, ax = plt.subplots(figsize=(20, 9))
sns.heatmap(corr,  annot=True,annot_kws={'size': 15})

<p>There seems to be no multicollinearity among our predictors, which is a good thing, although there is some correlation among the product categories.</p>
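<p>Reading a heatmap by eye gets harder as the number of columns grows. A minimal sketch of listing strongly correlated pairs programmatically is shown below; the toy columns <code>a</code>, <code>b</code>, <code>c</code> and the 0.8 cut-off are assumptions made for illustration.</p>

```python
import numpy as np
import pandas as pd

# Toy data: "b" is nearly a scaled copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a * 0.9 + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

# Absolute pairwise correlations.
corr = toy.corr().abs()

# Keep only the upper triangle so each pair is listed once
# and the diagonal (always 1.0) is ignored.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Hypothetical cut-off for "strongly correlated".
high = pairs[pairs > 0.8]
print(high)
```

<p>An empty result for the real numeric features would support the reading of the heatmap above.</p>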

<h2>Analysis on Categorical Predictors</h2>



<h3>Distribution of the variable Gender</h3>

In [None]:
sns.countplot(x=train.Gender)

Most of the buyers are male.

In [None]:
sns.countplot(x=train.Age)

<p>Most purchases are made by people between 18 and 45 years old.</p>

In [None]:
sns.countplot(x=train.City_Category)

<p>Most of the buyers are from city B.</p>

In [None]:
sns.countplot(x=train.Stay_In_Current_City_Years)

The trend suggests that the longer someone has lived in a city, the less likely they are to buy new things. Someone who is new in town and needs many new things for their house may therefore take advantage of the low Black Friday prices to purchase everything they need.

<h2> Bivariate Analysis</h2>

Now it is time to understand the relationship between our target variable and the predictors, as well as the relationships among the predictors.

In [None]:
marital_status_pivot= train.pivot_table(index='Marital_Status',values='Purchase', aggfunc=np.mean)
marital_status_pivot

In [None]:
marital_status_pivot.plot(kind='bar', color='blue',figsize=(12,7))
plt.xlabel("Marital_Status")
plt.ylabel("Purchase")
plt.title("Marital_Status and Purchase Analysis")
plt.xticks(rotation=0)
plt.show()

<p>We had more single customers than married ones. However, on average an individual customer tends to spend the same amount regardless of whether he or she is married.</p>

In [None]:
Product_category_1_pivot = train.pivot_table(index='Product_Category_1', values="Purchase", aggfunc=np.mean)
Product_category_1_pivot

In [None]:
Product_category_1_pivot.plot(kind='bar', color='green',figsize=(12,7))
plt.xlabel("Product_Category_1")
plt.ylabel("Purchase")
plt.title("Product_Category_1 and Purchase Analysis")
plt.xticks(rotation=0)
plt.show()

<p>Although more products were bought in categories 1, 5 and 8, the average amount spent in those three categories is not the highest. It is interesting to see other categories appearing with high purchase values despite having little impact on the number of sales.</p>

In [None]:
Product_category_2_pivot = train.pivot_table(index='Product_Category_2', values="Purchase", aggfunc=np.mean)


In [None]:
Product_category_2_pivot.plot(kind='bar', color='brown',figsize=(12,7))
plt.xlabel("Product_Category_2")
plt.ylabel("Purchase")
plt.title("Product_Category_2 and Purchase Analysis")
plt.xticks(rotation=0)
plt.show()

In [None]:
gender_pivot = train.pivot_table(index='Gender', values="Purchase", aggfunc=np.mean)
gender_pivot

In [None]:
gender_pivot.plot(kind='bar', color='orange',figsize=(12,7))
plt.xlabel("Gender")
plt.ylabel("Purchase")
plt.title("Gender and Purchase Analysis " "AVERAGE")
plt.xticks(rotation=0)
plt.show()

On average, male customers spend more per purchase than female customers, and the percentage of male buyers is also higher than that of female buyers.

In [None]:
age_pivot = train.pivot_table(index='Age', values="Purchase", aggfunc=np.sum)
age_pivot

In [None]:
age_pivot.plot(kind='bar', color='pink',figsize=(12,7))
plt.xlabel("Age")
plt.ylabel("Purchase")
plt.title("Age and Purchase Analysis " "TOTAL")
plt.xticks(rotation=0)
plt.show()

The total amount spent is in line with the number of purchases made in each age group.

In [None]:
city_pivot = train.pivot_table(index='City_Category', values="Purchase", aggfunc=np.mean)
city_pivot

In [None]:
city_pivot.plot(kind='bar', color='blue',figsize=(12,7))
plt.xlabel("City_Category")
plt.ylabel("Purchase")
plt.title("City_Category and Purchase Analysis")
plt.xticks(rotation=0)
plt.show()


We saw previously that city type ‘B’ had the highest number of purchases registered. However, the city whose buyers spend the most is city type ‘C’.
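The difference between "most purchases" and "highest spend per purchase" is easy to see when count, mean and sum are computed side by side. The following is a minimal sketch on made-up numbers (the toy values are assumptions, not figures from the data set):

```python
import pandas as pd

# Toy transactions; amounts are invented for illustration.
toy = pd.DataFrame({
    "City_Category": ["A", "B", "B", "B", "C", "C"],
    "Purchase": [8000, 7000, 9000, 8000, 12000, 10000],
})

# One row per city type: number of purchases, average and total amount.
summary = toy.groupby("City_Category")["Purchase"].agg(["count", "mean", "sum"])
print(summary)

# The group with the most rows need not be the one with the highest average.
most_rows = summary["count"].idxmax()
highest_avg = summary["mean"].idxmax()
print(most_rows, highest_avg)
```

In this toy frame, B has the most purchases while C has the highest average, mirroring the pattern described above.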

In [None]:
Stay_In_Current_City_Years_pivot = train.pivot_table(index='Stay_In_Current_City_Years', values="Purchase", aggfunc=np.mean)
Stay_In_Current_City_Years_pivot

In [None]:
Stay_In_Current_City_Years_pivot.plot(kind='bar', color='red',figsize=(12,7))
plt.xlabel("Stay_in_Current_City_Years")
plt.ylabel("Purchase")
plt.title("Stay_in_Current_City_Years and Purchase Analysis")
plt.xticks(rotation=0)
plt.show()

Again, we see the same pattern as before: on average, people tend to spend the same amount on purchases regardless of their group. People who are new to the city account for the higher number of purchases, but individually they tend to spend the same amount regardless of how many years they have lived in their current city.

<h2><b>Data Pre-Processing</b></h2>

In [None]:
test.head()

It is often convenient to combine the train and test sets into one, perform data cleaning and feature engineering once, and split them again afterwards. This way we do not have to repeat the same code for both datasets. Let’s combine them into a dataframe df with a source column specifying where each observation belongs.

In [None]:
# Join Train and Test Dataset
train['source']='train'
test['source']='test'

df = pd.concat([train,test], ignore_index = True, sort = False)

print(train.shape, test.shape, df.shape)
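
The combine, transform and split pattern can be sketched end to end on a tiny example. The frames <code>toy_train</code> and <code>toy_test</code> and their values are hypothetical; only the pattern mirrors what this notebook does.

```python
import pandas as pd

# Tiny stand-ins for train and test; test has no Purchase column.
toy_train = pd.DataFrame({"Gender": ["F", "M", "M"], "Purchase": [8370, 15200, 1422]})
toy_test = pd.DataFrame({"Gender": ["M", "F"]})

# Tag each row with its origin, then stack the two frames.
toy_train["source"] = "train"
toy_test["source"] = "test"
combined = pd.concat([toy_train, toy_test], ignore_index=True, sort=False)

# A shared transformation is applied once to all rows.
combined["Gender"] = combined["Gender"].map({"F": 0, "M": 1})

# Split back using the tag; test rows carry NaN in Purchase, so drop it there.
train_part = combined.loc[combined["source"] == "train"].drop(columns="source")
test_part = combined.loc[combined["source"] == "test"].drop(columns=["source", "Purchase"])

print(train_part)
print(test_part)
```

Note that the test rows receive NaN in Purchase after the concatenation, which is why that column has to be dropped again before predicting.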

Since the train set no longer contains the column Product_Category_3, it has to be deleted from the test set as well as from the combined dataframe.

In [None]:
test.drop(["Product_Category_3"],  axis=1, inplace=True)
df.drop(["Product_Category_3"],  axis=1, inplace=True)

In [None]:
print(train.shape, test.shape, df.shape)

<h3> Dealing with Null-Values</h3>

In [None]:
#Check the percentage of null values per variable
df.isnull().mean()*100

In [None]:
# Replacing null values in Product_Category_2 with the median of the train column
# (assigning the result back avoids inplace changes on a column selection)
df["Product_Category_2"] = df["Product_Category_2"].fillna(train["Product_Category_2"].median())
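
The imputation step can be checked on a small example. The sketch below assumes a toy frame with a source column like the one used in this notebook; the values are made up. The median is computed from the train rows only and then used to fill every missing value.

```python
import numpy as np
import pandas as pd

# Toy combined frame; values are invented for illustration.
toy = pd.DataFrame({
    "Product_Category_2": [2.0, np.nan, 8.0, 14.0, np.nan, 16.0],
    "source": ["train", "train", "train", "train", "test", "test"],
})

# Median of the train rows only (NaN values are skipped by default).
train_median = toy.loc[toy["source"] == "train", "Product_Category_2"].median()

# Fill missing values in both train and test rows with that median.
toy["Product_Category_2"] = toy["Product_Category_2"].fillna(train_median)

print(train_median)
print(toy)
```

Using the train rows for the statistic keeps information from the test set out of the preprocessing.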


<p>Removing Product_Category_1 groups 19 and 20 from the train rows, as these values do not appear in Product_Category_2.</p>


In [None]:
#Get the index of all train rows with Product_Category_1 equal to 19 or 20

ind = df.index[(df.Product_Category_1.isin([19,20])) & (df.source == "train")]
df = df.drop(ind)

In [None]:
df.shape

<h3> Dealing with Categorical Values</h3>

In [None]:
df.dtypes

<p> The categorical columns are Product_ID, Gender, Age, City_Category, Stay_In_Current_City_Years and source.</p>

In [None]:
#Select the names of all string (object) columns except the helper column source
category_cols = df.select_dtypes(include=['object']).columns.drop(["source"])
#Print frequency of categories
for col in category_cols:
    #Number of times each value appears in the column
    frequency = df[col].value_counts()
    print("\nThis is the frequency distribution for " + col + ":")
    print(frequency)

<h2><b>Feature Engineering</b></h2>

<h3>Converting gender to binary</h3>

In [None]:
gender_dict = {'F':0, 'M':1}
df["Gender"] = df["Gender"].apply(lambda x: gender_dict[x])

df["Gender"].value_counts()

<h3>Converting Age to numeric values</h3>

In [None]:
age_dict={'0-17':0, '18-25':1, '26-35':2, '36-45':3, '46-50':4, '51-55':5, '55+':6}
df['Age']=df['Age'].apply(lambda x:age_dict[x])
df['Age'].value_counts()

<h3>Converting city_category to Numeric</h3>

In [None]:
city={'A':0,'B':1,'C':2}
df['City_Category']=df['City_Category'].apply(lambda x: city[x])
df['City_Category'].value_counts()

<h3> Converting Stay_In_Current_City_Years to numeric</h3>

In [None]:
def stay(Stay_In_Current_City_Years):
        if Stay_In_Current_City_Years == '4+':
            return 4
        else:
            return Stay_In_Current_City_Years
df['Stay_In_Current_City_Years'] = df['Stay_In_Current_City_Years'].apply(stay).astype(int) 
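
The encodings above can also be written with <code>Series.map</code> and a string replacement. This is an alternative sketch on a toy frame, not the method used in this notebook; the helper names <code>age_order</code> and <code>age_codes</code> are hypothetical.

```python
import pandas as pd

# Toy rows with the same string formats as the original columns.
toy = pd.DataFrame({
    "Age": ["0-17", "26-35", "55+", "18-25"],
    "Stay_In_Current_City_Years": ["0", "4+", "2", "1"],
})

# Ordered age bins; the position in the list becomes the numeric code.
age_order = ["0-17", "18-25", "26-35", "36-45", "46-50", "51-55", "55+"]
age_codes = {label: code for code, label in enumerate(age_order)}
toy["Age"] = toy["Age"].map(age_codes)

# Strip the literal "+" from "4+" and convert the column to integers.
toy["Stay_In_Current_City_Years"] = (
    toy["Stay_In_Current_City_Years"].str.replace("+", "", regex=False).astype(int)
)

print(toy)
```

Unlike a dictionary lookup inside <code>apply</code>, <code>map</code> yields NaN for labels that are missing from the dictionary, so it is worth checking for nulls afterwards.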

<h2>Exporting Data</h2>

In [None]:
#Divide into test and train:
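# .copy() gives independent frames, so the inplace drops below do not act on views of df
train = df.loc[df['source']=="train"].copy()
test = df.loc[df['source']=="test"].copy()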

#Drop unnecessary columns:
test.drop(['source'],axis=1,inplace=True)
train.drop(['source'],axis=1,inplace=True)

#Export files as modified versions:
train.to_csv("train_clean.csv",index=False)
test.to_csv("test_clean.csv",index=False)

In [None]:
train= pd.read_csv('train_clean.csv')
train.head()

In [None]:
test= pd.read_csv('test_clean.csv')
test.head()

<h2><b>Modelling</b></h2>

In [None]:
X = train.drop(['Product_ID','User_ID','Purchase'], axis=1)
y = train["Purchase"]

In [None]:
# splitting train and test set
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=0.2, random_state=42)

In [None]:
%%time
# model: baseline random forest with default hyperparameters



rf_regressor = RandomForestRegressor(n_jobs=-1, 
                              random_state=42)

rf_regressor.fit(X_train, y_train)


In [None]:
rf_regressor.score(X_test, y_test)

<h3> Prediction and Metrics </h3>

In [None]:
y_pred = rf_regressor.predict(X_test)

In [None]:
R2 = r2_score(y_test, y_pred)
MAE = mean_absolute_error(y_test, y_pred)
MSE = mean_squared_error(y_test, y_pred)
print("R2_Score: {}\n Mean_absolute_error: {}\n Mean_Square_error: {}".format(R2, MAE, MSE))
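
To make the three metrics concrete, they can be computed by hand with NumPy. The arrays below are made-up numbers chosen so the arithmetic is easy to follow; they are not predictions from the model.

```python
import numpy as np

# Invented true values and predictions.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 370.0])

residuals = y_true - y_hat

# Mean absolute error: average size of the errors.
mae = np.mean(np.abs(residuals))

# Mean squared error: average squared error, which penalises large misses.
mse = np.mean(residuals ** 2)

# Root mean squared error: MSE brought back to the units of the target.
rmse = np.sqrt(mse)

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

Because MSE is in squared units of the purchase amount, the RMSE is often easier to interpret next to the MAE.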

In [None]:
fig, ax = plt.subplots(figsize=(12,5))
ax = plt.scatter(y_test, y_pred, c="brown")

<h2><b>Improving Model</b></h2>

Optimizing hyperparameters for machine learning models is a key step in making accurate predictions. Hyperparameters define characteristics of the model that can impact model accuracy and computational efficiency. They are typically set prior to fitting the model to the data. In contrast, parameters are values estimated during the training process that allow the model to fit the data. Hyperparameters are often optimized through trial and error; multiple models are fit with a variety of hyperparameter values, and their performance is compared.

Cross-validation is often used to determine the optimal values for hyperparameters; we want to identify a model structure that performs the best on records it has not been trained on. A variety of hyperparameter values should be considered.
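
As a rough illustration of the idea, the sketch below runs a small randomized search with cross-validation on synthetic data. The data from <code>make_regression</code>, the tiny grid and the names <code>X_toy</code>, <code>y_toy</code> and <code>search</code> are assumptions chosen to keep the example fast; they are not the settings used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem standing in for the real data.
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Candidate hyperparameter values; deliberately tiny so the search is quick.
param_distributions = {
    "n_estimators": [10, 20],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 5],
}

# Sample 3 combinations and score each with 3-fold cross-validation.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=3,
    cv=3,
    scoring="r2",
    random_state=0,
)
search.fit(X_toy, y_toy)

# Best sampled combination and its mean cross-validated R^2.
print(search.best_params_)
print(round(search.best_score_, 3))
```

Each candidate is scored only on folds it was not trained on, which is what makes the comparison between hyperparameter settings fair.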

In [None]:
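# 1. RandomizedSearchCV

# max_features=1.0 uses all features, which is what "auto" meant for regressors;
# the string "auto" is no longer accepted by recent scikit-learn releases
grid =  {"n_estimators": [10,50,100],
       "max_depth": [None,10,20,30,40,50,],
       "max_features": [1.0, "sqrt"],
       "min_samples_leaf": [2,10,15],
       "min_samples_split": [2,5,20]}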

In [None]:
randomsearchCV = RandomizedSearchCV(rf_regressor, param_distributions = grid, n_iter = 5, cv=5,  verbose = True, n_jobs=-1)

In [None]:
%%time

randomsearchCV.fit(X_train, y_train)

In [None]:
randomsearchCV.best_params_

<p> Fitting the data with the best parameters</p>

In [None]:
# max_features=1.0 is the equivalent of the former 'auto' setting for regressors
rf_regressor_tune = RandomForestRegressor(n_estimators=100, max_depth = 40, max_features = 1.0, min_samples_leaf =15,
                                     min_samples_split=5 )

In [None]:
rf_regressor_tune.fit(X_train, y_train) 

In [None]:
y_pred_tune = rf_regressor_tune.predict(X_test)


In [None]:
R2_tune= r2_score(y_test, y_pred_tune)
MAE_tune= mean_absolute_error(y_test, y_pred_tune)
MSE_tune= mean_squared_error(y_test, y_pred_tune)
print("R2_Score_tune: {}\n Mean_absolute_error_tune: {}\n Mean_Square_error_tune: {}".format(R2_tune, MAE_tune, MSE_tune))

In [None]:
# compare predictions before and after tuning

compare = {"R^2_score":[R2_tune, R2],
            "Mean Squared Error": [MSE_tune, MSE],
            "Mean Absolute Error": [MAE_tune, MAE]}


# a flat list gives a plain index; the nested list created a MultiIndex
Compare = pd.DataFrame(compare, index=["After tune", "Before tune"])
Compare


On all three metrics the model improved after hyperparameter tuning: we got a higher R^2 and a lower MSE and MAE than before tuning.

In [None]:
fig, ax = plt.subplots(figsize=(12,5))
ax = plt.scatter(y_test, y_pred_tune, c="brown")

<p>Make predictions on the test data with the tuned model.</p>

In [None]:
test.head()

In [None]:
# The number of features used to train the model must match the input
# Dropping the User_ID, Product_ID and Purchase columns
# .copy() lets us add the prediction column later without a SettingWithCopyWarning
predicted= test[['User_ID','Product_ID']].copy()
test =test.drop(['User_ID','Product_ID','Purchase'],axis=1)

In [None]:
test.head()

In [None]:
test_pred = rf_regressor_tune.predict(test)
test_pred

In [None]:
predicted['Predicted_Purchase']=test_pred

In [None]:
predicted.head()

In [None]:
# saving the predicted purchases to a CSV file
predicted.to_csv("predict.csv",index=False)
