<a href="https://colab.research.google.com/github/saumilhj/projects/blob/main/BFS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**BLACK FRIDAY SALE**

Dataset from Kaggle: https://www.kaggle.com/datasets/rajeshrampure/black-friday-sale

In [None]:
import pandas as pd
import plotly.express as px

Import data

In [None]:
df = pd.read_csv('data_bfs.csv')

In [None]:
df.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     550068 non-null  int64  
 1   Product_ID                  550068 non-null  object 
 2   Gender                      550068 non-null  object 
 3   Age                         550068 non-null  object 
 4   Occupation                  550068 non-null  int64  
 5   City_Category               550068 non-null  object 
 6   Stay_In_Current_City_Years  550068 non-null  object 
 7   Marital_Status              550068 non-null  int64  
 8   Product_Category_1          550068 non-null  int64  
 9   Product_Category_2          376430 non-null  float64
 10  Product_Category_3          166821 non-null  float64
 11  Purchase                    550068 non-null  int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 50.4+ MB


Check duplicates

In [None]:
df.duplicated().sum()

0

Check NaN

In [None]:
df.isna().sum()

User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2            173638
Product_Category_3            383247
Purchase                           0
dtype: int64

A NaN in Product_Category_2 or Product_Category_3 is taken to mean that the purchase did not involve that category, so NaN values are replaced with 0.

In [None]:
df['Product_Category_2'] = df['Product_Category_2'].fillna(0)
df['Product_Category_3'] = df['Product_Category_3'].fillna(0)

In [None]:
df = df.astype({'Product_Category_2': 'int', 'Product_Category_3': 'int'})
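The fill-and-cast step can be sketched on a toy frame. The values below are made up for illustration, and the sketch assumes 0 is never a real category code, so it is safe to use as a "no category" marker.

```python
import pandas as pd

# Toy data (hypothetical values) mimicking the two sparse category columns
toy = pd.DataFrame({
    "Product_Category_2": [6.0, None, 14.0],
    "Product_Category_3": [14.0, None, None],
})

cat_cols = ["Product_Category_2", "Product_Category_3"]

# Fill missing categories with 0, then cast the float columns to integers
toy = toy.fillna({c: 0 for c in cat_cols}).astype({c: "int64" for c in cat_cols})

print(toy)
```

Filling before casting matters here: an integer column cannot hold NaN, so the cast would fail if the order were reversed.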

Total customers and total products

In [None]:
print(f"Total customers = {df['User_ID'].nunique()}\n")
# The category columns hold category codes, so count non-zero entries instead of summing the codes
print(f"Purchases recorded per category column:\nProduct Category 1 = {(df['Product_Category_1'] != 0).sum()}\nProduct Category 2 = {(df['Product_Category_2'] != 0).sum()}\nProduct Category 3 = {(df['Product_Category_3'] != 0).sum()}")

Total customers = 5891

Purchases recorded per category column:
Product Category 1 = 550068
Product Category 2 = 376430
Product Category 3 = 166821


Spread of average spend by each customer

In [None]:
# Select Purchase before averaging so non-numeric columns are never passed to mean()
df_user_purchase = df.groupby('User_ID', as_index=False)['Purchase'].mean()
df_user_purchase.head()

Unnamed: 0,User_ID,Purchase
0,1000001,9545.514286
1,1000002,10525.61039
2,1000003,11780.517241
3,1000004,14747.714286
4,1000005,7745.292453


In [None]:
fig = px.histogram(df_user_purchase, x='Purchase',
                   title='Spread of average spending by customers')
fig.show()

The distribution of average purchase per user closely resembles a normal curve.
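One way to put a number on that resemblance is to look at skewness and excess kurtosis, which are both near 0 for a large sample from a normal distribution. The sketch below is only an illustration: `shape_summary` is a hypothetical helper and the input values are made up. On the real data it would be called with `df_user_purchase['Purchase']`.

```python
import pandas as pd


def shape_summary(values: pd.Series) -> dict:
    """Return sample skewness and excess kurtosis of a numeric series."""
    return {
        "skew": float(values.skew()),
        "excess_kurtosis": float(values.kurt()),
    }


# Hypothetical per-user averages, symmetric around 9000 (not taken from the dataset)
toy_avg = pd.Series([7000, 8000, 9000, 10000, 11000], dtype="float64")

summary = shape_summary(toy_avg)
print(summary)
```

A skewness far from 0 would indicate a lopsided distribution, while a strongly positive excess kurtosis would indicate heavier tails than a normal curve.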

Average spending based on age group and gender

In [None]:
df_age_gender = df.groupby(['Age', 'Gender'], as_index=False)['Purchase'].mean()
df_age_gender

Unnamed: 0,Age,Gender,Purchase
0,0-17,F,8338.771985
1,0-17,M,9235.17367
2,18-25,F,8343.180201
3,18-25,M,9440.942971
4,26-35,F,8728.251754
5,26-35,M,9410.337578
6,36-45,F,8959.844056
7,36-45,M,9453.193643
8,46-50,F,8842.098947
9,46-50,M,9357.471509


In [None]:
fig = px.bar(df_age_gender, y='Purchase', x='Age', color='Gender',
             title='Average spend of age groups segregated into gender',
             barmode='group')
fig.show()

Average male spending is higher than average female spending in every age group. However, average spend varies very little from one age group to the next.
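The gap between the two genders can be made explicit by pivoting the grouped table. The sketch below uses the ten rows shown in the output above, rounded to two decimals; the column name `gap_M_minus_F` is an illustrative choice.

```python
import pandas as pd

# Mean purchase per age group and gender, copied from the rows shown above (rounded)
rows = [
    ("0-17", "F", 8338.77), ("0-17", "M", 9235.17),
    ("18-25", "F", 8343.18), ("18-25", "M", 9440.94),
    ("26-35", "F", 8728.25), ("26-35", "M", 9410.34),
    ("36-45", "F", 8959.84), ("36-45", "M", 9453.19),
    ("46-50", "F", 8842.10), ("46-50", "M", 9357.47),
]
shown = pd.DataFrame(rows, columns=["Age", "Gender", "Purchase"])

# One row per age group, one column per gender
wide = shown.pivot(index="Age", columns="Gender", values="Purchase")

# Positive values mean men spend more on average in that age group
wide["gap_M_minus_F"] = wide["M"] - wide["F"]
print(wide.round(2))
```

On the full data the same pivot can be applied directly to `df_age_gender`.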

Average purchase amount by city category, split by age group

In [None]:
df_city_age = df.groupby(['City_Category', 'Age'], as_index=False)['Purchase'].mean()
df_city_age

Unnamed: 0,City_Category,Age,Purchase
0,A,0-17,8615.110456
1,A,18-25,8833.734084
2,A,26-35,8952.503004
3,A,36-45,8990.333997
4,A,46-50,8348.526752
5,A,51-55,9508.505001
6,A,55+,8485.945424
7,B,0-17,8917.295308
8,B,18-25,9031.706985
9,B,26-35,9149.193178


In [None]:
fig = px.bar(df_city_age, x='Age', y='Purchase', color='City_Category',
             barmode='group')
fig.show()

Customers in city category C have a higher average spend in every age group except 55+.
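Reading the leading city off a grouped bar chart can be cross-checked in code. The following sketch picks the city category with the highest average per age group using `idxmax`. The numbers are made up for illustration and are not taken from the dataset; on the real data `df_city_age` would take the place of `toy`.

```python
import pandas as pd

# Hypothetical averages for two age groups (made-up numbers, not from the dataset)
toy = pd.DataFrame({
    "City_Category": ["A", "B", "C", "A", "B", "C"],
    "Age": ["26-35", "26-35", "26-35", "55+", "55+", "55+"],
    "Purchase": [8900.0, 9100.0, 9700.0, 9400.0, 9300.0, 9200.0],
})

# idxmax returns the row label of the largest Purchase within each age group
top_rows = toy.loc[toy.groupby("Age")["Purchase"].idxmax()]

# Map each age group to its highest-spending city category
top_city = top_rows.set_index("Age")["City_Category"]
print(top_city)
```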

Average purchase based on occupation

In [None]:
df_occ = df.groupby('Occupation', as_index=False)['Purchase'].mean().sort_values(by='Purchase')
# Cast to string so the occupation codes are plotted as categories rather than numbers
df_occ['Occupation'] = df_occ['Occupation'].astype(str)
df_occ

Unnamed: 0,Occupation,Purchase
9,9,8637.743761
19,19,8710.627231
20,20,8836.494905
2,2,8952.481683
1,1,8953.19327
10,10,8959.355375
0,0,9124.428588
18,18,9169.655844
3,3,9178.593088
11,11,9213.845848


In [None]:
fig = px.bar(df_occ, y='Occupation', x='Purchase', color='Purchase',
             orientation='h',
             color_continuous_scale=px.colors.sequential.Plasma_r)
fig.show()

Occupations 17, 12 and 15 have the three highest average spends, in that order.
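The ranking can also be extracted without reading it off the chart, for example with `DataFrame.nlargest`. The sketch below uses made-up averages that only mirror the ordering stated above; on the real data `df_occ` would take the place of `toy_occ`.

```python
import pandas as pd

# Hypothetical averages per occupation code (made-up numbers, not from the dataset)
toy_occ = pd.DataFrame({
    "Occupation": ["9", "0", "15", "12", "17"],
    "Purchase": [8600.0, 9100.0, 9750.0, 9800.0, 9850.0],
})

# nlargest returns the n rows with the highest Purchase, sorted in descending order
top3 = toy_occ.nlargest(3, "Purchase")
print(top3["Occupation"].tolist())
```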