* A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, purchase amount) across products of different categories. They have shared a purchase summary of various customers for selected high-volume products from last month.
* The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and the total purchase_amount from last month.

* They now want to build a model that predicts a customer's purchase amount for various products, which will help them create personalized offers for customers across different products.

In [None]:
#Load Libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns #importing seaborn module 
import warnings
from collections import Counter
from sklearn.preprocessing import LabelEncoder, StandardScaler 
from sklearn import metrics
warnings.filterwarnings('ignore')  # suppress warnings so they are not displayed in the notebook
plt.style.use('ggplot')
plt.rcParams['figure.figsize']=[6,3]
plt.rcParams['figure.dpi']=80

In [None]:
#Load train and test files 
data = pd.read_csv('../input/black-friday/train.csv')
test = pd.read_csv('../input/black-friday/test.csv')

Step 1 : Explore train and test datasets

In [None]:
#First look at train
data.sample(5)

In [None]:
#First look at test
test.sample(5)

In [None]:
#Shape of train and test
print('There are {} rows and {} columns in train'.format(data.shape[0],data.shape[1]))
print('There are {} rows and {} columns in test'.format(test.shape[0],test.shape[1]))

In [None]:
#Check Missing values in train
data.isna().sum()

In [None]:
#Check Missing values in test
test.isna().sum()

Product_Category_2 & Product_Category_3 have many missing values in both train & test
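
To put a number on "many", the raw counts can be turned into percentages. The sketch below is only an illustration: `toy` is a made-up frame standing in for the train/test data, and its values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train/test; the values are invented for illustration.
toy = pd.DataFrame({
    "Product_Category_1": [1, 5, 8, 3],
    "Product_Category_2": [2.0, np.nan, 14.0, np.nan],
    "Product_Category_3": [np.nan, np.nan, 17.0, np.nan],
})

# isna() marks missing cells; the column mean of that mask is the missing fraction.
missing_pct = (toy.isna().mean() * 100).round(1)
print(missing_pct)
```

The same expression applied to `data` or `test` would give the missing share per column in the real files.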

In [None]:
#Check data types in train
data.info()

In [None]:
#Check data types in test
test.info()

In [None]:
#Lets describe train
data.describe()

In [None]:
#Lets describe test
test.describe()

In [None]:
#Lets concatenate train & test
df=pd.concat([data,test])
df.shape 

Step 2 : Data cleaning 

In [None]:
#Explore numerical variable - Stay_In_Current_City_Years
df.Stay_In_Current_City_Years.value_counts()
#Total 5 unique values

In [None]:
#Lets remove '+' symbol and convert to object
df['Stay_In_Current_City_Years'] = df['Stay_In_Current_City_Years'].apply(lambda x: x.replace('+', '')
                                if isinstance(x, str) else x).astype(int)
df['Stay_In_Current_City_Years'] = df['Stay_In_Current_City_Years'].astype('object')
df.Stay_In_Current_City_Years.value_counts()

Variable "Stay_In_Current_City_Years" cleaned
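
The same cleaning can also be written with pandas' vectorized string methods instead of `apply`. This is a minimal sketch on a made-up column (`toy` is hypothetical), assuming every value is a string such as `'4+'`:

```python
import pandas as pd

# Hypothetical sample of the raw column; '4+' means four or more years.
toy = pd.DataFrame({"Stay_In_Current_City_Years": ["2", "4+", "1", "0", "4+"]})

# regex=False treats '+' as a literal character rather than a regex quantifier.
cleaned = (
    toy["Stay_In_Current_City_Years"]
    .str.replace("+", "", regex=False)
    .astype(int)
)
print(cleaned.tolist())
```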

Step 3 : Exploratory Data Analysis (EDA)

In [None]:
#Explore categorical variables - Gender & Age
cat_col_1 = [
 'Gender',
 'Age',
 ]
count = 1
for cols in cat_col_1:
    plt.subplot(2, 2, count)
    df[cols].value_counts().plot.pie(shadow=True,autopct='%1.1f%%',radius=1.5,textprops={'fontsize': 10} )
    count +=1
    plt.subplot(2, 2, count)
    plt.tight_layout()
    sns.countplot(x=cols, data=df)
    fig=plt.gcf()
    fig.set_size_inches(12,7)
    plt.xticks(fontsize=10)
    plt.yticks(fontsize=3)
    plt.xticks(rotation=30)
    count+=1

* Gender : 75.3% male vs 24.7% female
* Age    : Most buyers (39.9%) belong to the 26-35 age group and the fewest (2.7%) to the 0-17 age group
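
Percentages like these can be read off the pie charts, but they can also be computed directly. A small sketch on invented data (`toy` is hypothetical):

```python
import pandas as pd

# Invented sample: three male rows and one female row.
toy = pd.DataFrame({"Gender": ["M", "M", "M", "F"]})

# normalize=True returns fractions instead of counts; scale to percent.
share = toy["Gender"].value_counts(normalize=True).mul(100).round(1)
print(share)
```

Note that each row in this data set is a purchase, so such shares describe purchase records rather than distinct customers.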

In [None]:
#Explore categorical variables - Marital_Status, City_Category & Stay_In_Current_City_Years
cat_col_2 = [
 'Marital_Status',
 'City_Category',
 'Stay_In_Current_City_Years',
 ]
count = 1
for cols in cat_col_2:
    plt.subplot(3, 2, count)
    df[cols].value_counts().plot.pie(shadow=True,autopct='%1.1f%%',radius=1.2,textprops={'fontsize': 10} )
    count +=1
    plt.subplot(3, 2, count)
    plt.tight_layout()
    sns.countplot(x=cols, data=df)
    fig=plt.gcf()
    fig.set_size_inches(10,7)
    plt.xticks(fontsize=8)
    plt.yticks(fontsize=3)
    plt.xticks(rotation=30)
    count+=1 

* Marital_Status : 59% not married vs 41% married
* City_Category  : Most buyers belong to City_Category=B (42.1%) and the fewest to City_Category=A (26.8%)
* Stay_In_Current_City_Years : Most buyers have stayed in their city for 1 year and the fewest for 0 years (i.e. recently moved)

In [None]:
#Explore categorical variable - Occupation
cat_col_3 = ['Occupation',]
count = 1
for cols in cat_col_3:
    plt.subplot(1, 2, count)
    df[cols].value_counts().plot.pie(shadow=True,autopct='%1.1f%%',radius=1.4,textprops={'fontsize': 9} )
    count +=1
    plt.subplot(1, 2, count)
    plt.tight_layout()
    #df.Occupation.value_counts().sort_values().plot(kind = 'bar')
    sns.countplot(x="Occupation", data=df,facecolor=(0, 0, 0, 0), linewidth=5, edgecolor=sns.color_palette("dark", 5))
    fig=plt.gcf()
    fig.set_size_inches(12,7)
    plt.xticks(fontsize=10)
    plt.yticks(fontsize=3)
    plt.xticks(rotation=30)
    count+=1

Occupation : Maximum buyers belong to Occupation category = 4 and minimum buyers belong to category = 8

In [None]:
#Explore categorical variable - Product_Category_1
cat_col_4 = ['Product_Category_1',]
count = 1
for cols in cat_col_4:
    plt.subplot(1, 2, count)
    df[cols].value_counts().plot.pie(shadow=True,autopct='%1.1f%%',radius=1.9,textprops={'fontsize': 8} )
    count +=1
    plt.subplot(1, 2, count)
    plt.tight_layout()
    plt.style.use('ggplot')
    df.Product_Category_1.value_counts().sort_values().plot(kind = 'bar')
    fig=plt.gcf()
    plt.title("Product_Category_1", fontsize=15) 
    fig.set_size_inches(15,7)
    plt.xticks(fontsize=10)
    plt.yticks(fontsize=9)
    plt.xticks(rotation=30)
    count+=1

In [None]:
#Explore categorical variable - Product_Category_2
cat_col_5 = ['Product_Category_2',]
count = 1
for cols in cat_col_5:
    plt.subplot(1, 2, count)
    df[cols].value_counts().plot.pie(shadow=True,autopct='%1.1f%%',radius=1.9,textprops={'fontsize': 8} )
    count +=1
    plt.subplot(1, 2, count)
    plt.tight_layout()
    plt.style.use('fivethirtyeight')
    df.Product_Category_2.value_counts().sort_values().plot(kind = 'bar')
    fig=plt.gcf()
    plt.title("Product_Category_2", fontsize=15) 
    fig.set_size_inches(15,10)
    plt.xticks(fontsize=8)
    plt.yticks(fontsize=9)
    plt.xticks(rotation=30)
    count+=1

For Product_Category_2 : 8.0 is the most frequent value and 7.0 the least frequent

In [None]:
#Explore categorical variable - Product_Category_3
cat_col_6 = ['Product_Category_3',]
count = 1
for cols in cat_col_6:
    plt.subplot(1, 2, count)
    df[cols].value_counts().plot.pie(shadow=True,autopct='%1.1f%%',radius=1.9,textprops={'fontsize': 8} )
    count +=1
    plt.subplot(1, 2, count)
    plt.tight_layout()
    plt.style.use('ggplot')
    df.Product_Category_3.value_counts().sort_values().plot(kind = 'bar')
    fig=plt.gcf()
    plt.title("Product_Category_3", fontsize=15) 
    fig.set_size_inches(15,10)
    plt.xticks(fontsize=8)
    plt.yticks(fontsize=9)
    plt.xticks(rotation=30)
    count+=1

For Product_Category_3 : 16.0 is the most frequent value and 3.0 the least frequent

In [None]:
#Let's explore the numerical variables - User_ID, Product_ID & Purchase
# Let's look at the distribution of Purchase
sns.set(style='whitegrid', palette="deep", font_scale=1.1, rc={"figure.figsize": [8, 5]})
sns.distplot(
    df['Purchase'], norm_hist=False, kde=False, bins=20, hist_kws={"alpha": 1}
).set(xlabel='Purchase', ylabel='Count');

* The highest frequency (>50k rows) is seen for Purchase values between 5000 and 8600
* A few amounts as high as 23961 are also seen
* The minimum value is 12

In [None]:
# Lets see how User_ID analysis looks like
sns.set(style='darkgrid', palette="rocket", font_scale=1.1, rc={"figure.figsize": [8, 5]})
sns.distplot(
    df['User_ID'], norm_hist=False, kde=False, bins=10, hist_kws={"alpha": 1}
).set(xlabel='User_ID', ylabel='Count');

The frequency distribution looks roughly uniform across the User_ID range

In [None]:
#Variable - Product_ID
#Let's remove the 'P' prefix and convert Product_ID to an integer
df['Product_ID'] = df['Product_ID'].apply(lambda x: x.replace('P', '')
                                if isinstance(x, str) else x).astype(int)

In [None]:
# Lets see how Product_ID analysis looks like
sns.set(style='darkgrid', palette="Set1", font_scale=1.1, rc={"figure.figsize": [8, 5]})
sns.distplot(
    df['Product_ID'], norm_hist=False, kde=False, bins=50, hist_kws={"alpha": 1}
).set(xlabel='Product_ID', ylabel='Frequency');

Maximum frequency of product id can be seen from 110000 to 120000

Lets see relationship between categorical and numerical variables

In [None]:
#Variable - Age Vs Purchase
#Catplot Age+Purchase
sns.catplot(x='Age',y='Purchase',kind='point',data=df, order=['0-17', '18-25', '26-35', '36-45',  '46-50', '51-55', '55+'],)

* Age group 51-55 has the highest mean purchase (9520+); the point plot shows the mean by default, not the median
* Age group 0-17 has the lowest mean purchase (8920+), plausibly because minors depend on their parents for buying
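
Since a point plot draws the mean of each group by default, it can help to tabulate mean and median side by side before quoting either. A minimal sketch on invented numbers (`toy` is hypothetical):

```python
import pandas as pd

# Invented purchases for two age groups.
toy = pd.DataFrame({
    "Age": ["0-17", "0-17", "51-55", "51-55", "51-55"],
    "Purchase": [100, 300, 200, 400, 900],
})

# One row per age group, with both summary statistics as columns.
summary = toy.groupby("Age")["Purchase"].agg(["mean", "median"])
print(summary)
```

In the invented 51-55 group the single large purchase pulls the mean (500) above the median (400), which is the kind of gap worth checking in the real data.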

In [None]:
#Variable - Age Vs Purchase with hue = Gender
#Catplot Age+Purchase
sns.catplot(x='Age',y='Purchase',kind='point',data=df, order=['0-17', '18-25', '26-35', '36-45',  '46-50', '51-55', '55+'],hue='Gender')

* Males have a higher purchase amount than females across all age categories
* The 51-55 age group is the highest in both genders

In [None]:
#Catplot Age+Purchase+City_Category+Gender
sns.catplot(x='Age',y='Purchase',kind='point',data=df,col='City_Category',hue='Gender', order=['0-17', '18-25', '26-35', '36-45',  '46-50', '51-55', '55+'])

* In City category B & C : The mean purchase amount is higher for males than for females
* For City category A : For age groups 46-50 & 55+, females purchase more than males; for the other age groups males purchase more than females
* In City category B : The mean purchase amount is higher for age group 0-17 than for 18-25, 26-35, 36-45 & 46-50, which is interesting

In [None]:
#Catplot Age+Purchase+Stay_In_Current_City_Years+Gender
sns.catplot(x='Age',y='Purchase',kind='point',data=df,col='Stay_In_Current_City_Years',hue='Gender', order=['0-17', '18-25', '26-35', '36-45',  '46-50', '51-55', '55+'])

Irrespective of stay in current city, male purchasing is more than female

In [None]:
#Catplot Marital_Status+Purchase
sns.catplot(x='Marital_Status',y='Purchase',kind='point',data=df)

Unmarried people buy more than married people 

In [None]:
#Catplot Age+Purchase+Marital_Status
sns.catplot(x='Age',y='Purchase',kind='point',data=df,hue='Marital_Status', order=['0-17', '18-25', '26-35', '36-45',  '46-50', '51-55', '55+'])

* In the 46-50 age group, married people spent more than unmarried people
* In all other age groups, unmarried people spent more than married people

In [None]:
#Catplot Age+Purchase+Marital_Status+Gender
sns.catplot(x='Age',y='Purchase',kind='point',data=df,col='Marital_Status',hue='Gender', order=['0-17', '18-25', '26-35', '36-45',  '46-50', '51-55', '55+'])

* Irrespective of marital status, males purchase more than females across all age groups
* Age group 0-17 has no married buyers, as expected

In [None]:
# Boxen plot of Age vs Purchase, ordered by ascending median purchase
sorted_nb = df.groupby(['Age'])['Purchase'].median().sort_values()
sns.boxenplot(x=df['Age'], y=df['Purchase'], order=list(sorted_nb.index))

We can see again that the 51-55 group has the highest median Purchase amount and is the highest spender

In [None]:
# Boxplot of Age Vs Purchase in horizontal orientation across Gender
g = sns.catplot(x="Purchase", y="Age", row="Gender",
                kind="box", orient="h", height=2.5, aspect=3,
                data=df)
g.set(xscale="log")

In [None]:
#Facetgrid for Occupation  + Purchase + City_Category  & hue = Gender
cond_plot = sns.FacetGrid(data=df, col='Occupation', hue='Gender', col_wrap=4)
cond_plot.map(sns.stripplot, 'City_Category', 'Purchase');

* For Occupation=9 and city category = B, only female buyers are present
* For Occupation=9 and city category = C, female buyers have very good presence

In [None]:
#Facetgrid for Occupation + Purchase + Age & hue = Gender
cond_plot = sns.FacetGrid(data=df, col='Occupation', hue='Gender', col_wrap=4)
cond_plot.map(sns.scatterplot, 'Age', 'Purchase');

In [None]:
df.sample(5)

In [None]:
'''#Impute mode into Product_Category_2 & Product_Category_3
df['Product_Category_2'].fillna(df['Product_Category_2'].value_counts().index[0], inplace=True)
df['Product_Category_3'].fillna(df['Product_Category_3'].value_counts().index[0], inplace=True)
#df.isna().sum()'''

# Fill the missing data with the next valid value in each column (backward fill)
df['Product_Category_2'] = df['Product_Category_2'].bfill()
df['Product_Category_3'] = df['Product_Category_3'].bfill()

In [None]:
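# Backward fill cannot fill trailing NaNs, so fill whatever remains with the mode
df['Product_Category_2'] = df['Product_Category_2'].fillna(df['Product_Category_2'].value_counts().index[0])
df['Product_Category_3'] = df['Product_Category_3'].fillna(df['Product_Category_3'].value_counts().index[0])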

In [None]:
df.isnull().sum()

All missing values are imputed 
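
The two imputation strategies used above behave quite differently, which the following sketch tries to make visible. It runs on an invented series `s`, not on the real columns:

```python
import numpy as np
import pandas as pd

# Invented column with gaps, including one at the very end.
s = pd.Series([np.nan, 8.0, np.nan, np.nan, 14.0, 8.0, np.nan])

# Backward fill copies the next valid value upwards; a trailing NaN has no
# "next" value, so it stays missing.
bfilled = s.bfill()

# Mode imputation replaces every gap with the most frequent value.
mode_filled = s.fillna(s.mode().iloc[0])

print(bfilled.tolist())
print(mode_filled.tolist())
```

Backward fill depends on row order, so in a concatenated train/test frame a gap can be filled from an unrelated neighbouring row. That is worth keeping in mind when interpreting features built this way.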

In [None]:
#Convert Product_Category_1, Product_Category_2 & Product_Category_3 to int (the imputed columns are float)
df['Product_Category_1']  = df['Product_Category_1'].astype('int')
df['Product_Category_2']  = df['Product_Category_2'].astype('int')
df['Product_Category_3']  = df['Product_Category_3'].astype('int')

In [None]:
df.info()

In [None]:
#Keep a backup copy of the cleaned dataframe
df_backup = df.copy()

In [None]:
#Shape of the combined dataframe
print('There are {} rows and {} columns in the combined dataframe'.format(df.shape[0],df.shape[1]))

In [None]:
#Final Look at data
df.head()

In [None]:
#Converting categorical variables to dummy variables
df=pd.get_dummies(df,drop_first=True)

In [None]:
#Lets do a correlation plot for entire dataframe
sns.heatmap(df.corr(),annot=True,cmap='RdYlGn',linewidths=0.2,annot_kws={'size':10})
fig=plt.gcf()
fig.set_size_inches(18,12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.show()

We can see negative correlation of Purchase with product category 1
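
With many dummy columns the heatmap gets crowded, so a sorted list of correlations with the target can be easier to read. A minimal sketch on invented data (`toy` and its values are hypothetical):

```python
import pandas as pd

# Invented data in which Purchase falls as Product_Category_1 rises.
toy = pd.DataFrame({
    "Product_Category_1": [1, 2, 3, 4, 5],
    "Occupation": [3, 1, 4, 1, 5],
    "Purchase": [500, 400, 300, 200, 100],
})

# Take the Purchase column of the correlation matrix, drop the self-correlation,
# and sort so the most negative features come first.
corr_with_target = toy.corr()["Purchase"].drop("Purchase").sort_values()
print(corr_with_target)
```

Pearson correlation treats the category codes as ordered numbers, so for nominal codes such as product categories the sign and size are only a rough hint.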

In [None]:
# Segregating train and test from df
train=df[:data.shape[0]]
test1=df[data.shape[0]:]

In [None]:
#Shape of train and test
print('There are {} rows and {} columns in train'.format(train.shape[0],train.shape[1]))
print('There are {} rows and {} columns in test'.format(test1.shape[0],test1.shape[1]))

In [None]:
train.head(5)

In [None]:
test1.head(5)

Drop target column(Purchase) from train and test

In [None]:
# Reassign instead of dropping in place: train and test1 are slices of df
train = train.drop('Purchase', axis = 1)
test1 = test1.drop('Purchase', axis = 1)

Train and Test split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train.values, data['Purchase'].values, test_size = 0.3, random_state = 4)

In [None]:
#### Scale input values ####
sc_x = StandardScaler() 
X_train = sc_x.fit_transform(X_train)  
X_test = sc_x.transform(X_test)
test1_sc = sc_x.transform(test1.values)  # pass an array, matching how the scaler was fitted

Model Building starts

We will use XGB Regressor to predict purchase amounts

Approx Run time : 23 mins

In [None]:
#XGBoost Regressor
# Import XGBoost Regressor
from xgboost import XGBRegressor

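#Create an XGBoost Regressor
# early_stopping_rounds is set on the constructor (supported from xgboost 1.6);
# passing it to fit() is deprecated and was removed in later releases
reg = XGBRegressor(n_estimators=3600, learning_rate=0.05, early_stopping_rounds=5)

# Train the model using the training sets 
reg.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=0)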
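
Early stopping picks the number of trees by watching the `eval_set`, so scoring the model on that same set afterwards tends to give a slightly optimistic estimate. One common alternative, sketched here with NumPy only and invented data, is a three-way split. The 60/20/20 proportions and all names below are assumptions, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(4)
n_rows = 1000                      # hypothetical row count
X = rng.normal(size=(n_rows, 3))   # invented features
y = rng.normal(size=n_rows)        # invented target

# Shuffle the row indices once, then carve out test, validation and train parts.
idx = rng.permutation(n_rows)
n_test = int(0.2 * n_rows)
n_val = int(0.2 * n_rows)
test_idx = idx[:n_test]
val_idx = idx[n_test:n_test + n_val]
train_idx = idx[n_test + n_val:]

X_tr, y_tr = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]    # would be passed as eval_set for early stopping
X_te, y_te = X[test_idx], y[test_idx]    # used only once, for the final metrics
print(len(train_idx), len(val_idx), len(test_idx))
```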

In [None]:
#Predicting Test data with the model
y_test_pred = reg.predict(X_test)

In [None]:
# Model Evaluation
acc_xgb = metrics.r2_score(y_test, y_test_pred)
print('R^2:', acc_xgb)
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_test, y_test_pred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_test, y_test_pred))
print('MSE:',metrics.mean_squared_error(y_test, y_test_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))
RMSE_xgb=np.sqrt(metrics.mean_squared_error(y_test, y_test_pred))

> **We got RMSE value of 2525.15 with XGB**
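
For reference, the metrics printed above can be reproduced by hand. The sketch below uses NumPy only and invented numbers; `regression_report` is a hypothetical helper name, not part of scikit-learn.

```python
import numpy as np

def regression_report(y_true, y_pred, n_features):
    """Return R^2, adjusted R^2, MAE and RMSE for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    n = len(y_true)
    ss_res = np.sum(resid ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted R^2 penalizes R^2 for the number of features used.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "r2": r2,
        "adj_r2": adj_r2,
        "mae": np.mean(np.abs(resid)),
        "rmse": np.sqrt(np.mean(resid ** 2)),
    }

# Invented values, small enough to check by hand.
report = regression_report([3, 5, 7, 9], [2, 5, 8, 9], n_features=1)
print(report)
```

RMSE is in the same units as Purchase, so it can be read as a typical size of the prediction error, with large misses weighted more heavily than in MAE.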

In [None]:
# Visualizing the differences between actual and predicted purchase amounts
plt.scatter(y_test, y_test_pred)
plt.xlabel("Actual purchase")
plt.ylabel("Predicted purchase")
plt.title("Actual vs predicted purchase")
plt.show()

In [None]:
# Checking residuals
plt.scatter(y_test_pred,y_test-y_test_pred)
plt.title("Predicted vs residuals")
plt.xlabel("Predicted")
plt.ylabel("Residuals")
plt.show()

Prepare submission file

In [None]:
#Predict on final test data set
predicted_prices = reg.predict(test1_sc)
# We will look at the predicted prices to ensure we have something sensible.
print(predicted_prices)

In [None]:
#Prepare submission file
my_submission = pd.DataFrame({'Purchase': predicted_prices, 'User_ID': test.User_ID,'Product_ID': test.Product_ID })
# you could use any filename. We choose submission here
my_submission.to_csv('./submission_rahulpednekar.csv', index=False)

