
In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler 
from sklearn.cluster import KMeans

In [None]:
path = '/content/sample_data/Mall_Customers.csv'

In [None]:
df = pd.read_csv(path)

In [None]:
df.head()

,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


In [None]:
df.dtypes

CustomerID                 int64
Gender                    object
Age                        int64
Annual Income (k$)         int64
Spending Score (1-100)     int64
dtype: object

In [None]:
# CustomerID is only a row identifier and carries no information for clustering, so drop it
df = df.drop(columns='CustomerID')

In [None]:
df.describe()

,Age,Annual Income (k$),Spending Score (1-100)
count,200.0,200.0,200.0
mean,38.85,60.56,50.2
std,13.969007,26.264721,25.823522
min,18.0,15.0,1.0
25%,28.75,41.5,34.75
50%,36.0,61.5,50.0
75%,49.0,78.0,73.0
max,70.0,137.0,99.0


Check for null values

In [None]:
df.isnull().sum()

Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64

**Exploratory Data Analysis**

Gender

In [None]:
df['Gender'].value_counts()

Female    112
Male       88
Name: Gender, dtype: int64

In [None]:
fig1 = px.pie(df,names='Gender',title='Ratio of Females vs Males')
fig1.show()

Age

In [None]:
fig2 = px.box(df,x='Gender',y='Age',color='Gender')
fig2.show()

The minimum age for both males and females is 18.
The median age (the centre line of each box) is about 37 for males and 35 for females.
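
A box plot draws the median rather than the mean, so the figures read off it can be cross-checked numerically. The following is a minimal sketch on a few made-up rows (the `sample` frame is hypothetical, not the Mall_Customers data); the same `groupby` call could be applied to `df`.

```python
import pandas as pd

# Made-up rows for illustration only (not the real dataset).
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "Age": [19, 37, 70, 18, 35, 68],
})

# Per-gender summary; 'median' is what the box plot's centre line shows,
# while 'mean' can differ noticeably when the ages are skewed.
age_summary = sample.groupby("Gender")["Age"].agg(["min", "median", "mean"])
print(age_summary)
```

Comparing the `median` and `mean` columns makes it clear which statistic a remark about the plot refers to.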

In [None]:
fig3 = px.histogram(df,x='Age',nbins=15,color='Gender',title='Age group with most number of customers')
fig3.show()

Age group 30-34 has the most customers, followed by 35-39 and 45-49.

In [None]:
fig4 = px.histogram(df,x='Age',y='Spending Score (1-100)',color='Gender',nbins=15,title='Spending Score for Age group')
fig4.show()

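Age group 30-34 has the highest total Spending Score, followed by 35-39 and 20-24. The bars show the sum of scores per age bin, so they also reflect how many customers fall in each bin.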

In [None]:
fig5 = px.histogram(df,x='Age',y='Annual Income (k$)',nbins=15,color='Gender',title='Annual Income Age Group Wise')
fig5.show()

Age group 30-34 has the highest total Annual Income, followed by 35-39 and 45-49. As above, the bars are per-bin sums, so larger groups naturally score higher.
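
When `px.histogram` is given a `y` column, it sums `y` within each bin by default, so the two histograms above mix head count with per-person values. A rough sketch of how totals and means can be separated is shown below. The rows are made up, and the 5-year bin edges are an assumption (Plotly's automatic bins may differ).

```python
import pandas as pd

# Made-up rows for illustration only (not the real dataset).
sample = pd.DataFrame({
    "Age": [20, 22, 31, 32, 33, 47],
    "Spending Score (1-100)": [80, 70, 50, 40, 60, 30],
})

# Assumed 5-year, left-closed bins: [20, 25), [25, 30), ..., [45, 50)
bins = list(range(20, 55, 5))
sample["Age group"] = pd.cut(sample["Age"], bins=bins, right=False)

# Count, total and mean per bin; observed=True drops empty bins.
per_bin = (
    sample.groupby("Age group", observed=True)["Spending Score (1-100)"]
    .agg(["count", "sum", "mean"])
)
print(per_bin)
```

In this toy data two bins have the same total but different means, which is exactly the distinction a summed histogram hides. Passing `histfunc='avg'` to `px.histogram` plots per-bin means directly.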

In [None]:
fig6 = px.scatter(df, x='Annual Income (k$)',y='Spending Score (1-100)',color='Age',title= 'Annual Income vs Spending Score Scatter Plot')
fig6.show()

Younger customers appear to spend more despite having lower incomes.
Age groups 30-34 and 35-39 combine high Annual Income, high Spending Score and the largest customer counts.
Customers aged 30-34 therefore look like the core regular customers and should be the top priority.

**Customer Segmentation using K-Means**

In [None]:
df.head()

,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,Male,19,15,39
1,Male,21,15,81
2,Female,20,16,6
3,Female,23,16,77
4,Female,31,17,40


**Data Standardisation**

Assign numerical values to categorical data

In [None]:
df['Gender'] = df['Gender'].map({'Male':1,'Female':0})

In [None]:
df.head()

,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,19,15,39
1,1,21,15,81
2,0,20,16,6
3,0,23,16,77
4,0,31,17,40


In [None]:
# Standardise every column to zero mean and unit variance
X = StandardScaler().fit_transform(df)
X[0:5]

array([[ 1.12815215, -1.42456879, -1.73899919, -0.43480148],
       [ 1.12815215, -1.28103541, -1.73899919,  1.19570407],
       [-0.88640526, -1.3528021 , -1.70082976, -1.71591298],
       [-0.88640526, -1.13750203, -1.70082976,  1.04041783],
       [-0.88640526, -0.56336851, -1.66266033, -0.39597992]])
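
For reference, `StandardScaler` replaces each value with its z-score, (x - mean) / std, computed per column with the population standard deviation. The sketch below checks this by hand on the Age and Annual Income values of the five rows shown by `df.head()`; because it uses only those five rows, the numbers differ from the array above, which was scaled over all 200 customers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Annual Income (k$) for the five rows printed by df.head()
data = np.array([
    [19.0, 15.0],
    [21.0, 15.0],
    [20.0, 16.0],
    [23.0, 16.0],
    [31.0, 17.0],
])

scaled = StandardScaler().fit_transform(data)

# Manual z-score; np.std defaults to ddof=0 (population standard deviation)
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(scaled, manual))
```

Scaling matters here because K-Means relies on Euclidean distance, so an unscaled column with a large range would dominate the clustering.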

**To find the best value of k**

**Elbow method**

In [None]:
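# Within-cluster sum of squares (inertia) for k = 1..20
wcss = []
for k in range(1, 21):
    # fixed seed and explicit n_init so the curve is reproducible
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)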

In [None]:
fig = px.line(x=np.arange(1,21,1),y=wcss,markers=True)
fig.update_layout(
    title="Elbow Method",
    xaxis_title="K Value",
    yaxis_title="WCSS",
)
fig.show()

From the graph above, we choose k = 5.
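
Reading an elbow off a curve is somewhat subjective. One common cross-check, which is not part of this notebook, is the silhouette score, where higher values indicate better-separated clusters. The sketch below is a minimal illustration on synthetic blobs; `X_demo`, the blob centres and the seed are all made up for the example, and in the notebook the loop would run on `X` instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: five well-separated 2-D blobs (hypothetical, for illustration)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6], [3, 12]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

# Silhouette score needs at least 2 clusters, so start from k = 2
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

If the silhouette score and the elbow point to different values of k, it is worth inspecting the clusters for both before settling on one.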

**KMeans model**

In [None]:
k_means = KMeans(init="k-means++", n_clusters=5, n_init=15)
k_means.fit(X)

KMeans(n_clusters=5, n_init=15)

In [None]:
print("Cluster Centers: ",k_means.cluster_centers_)

Cluster Centers:  [[-0.88640526  0.7517978  -0.51757746 -0.4420241 ]
 [-0.88640526 -0.75047453 -0.00501655  0.6979562 ]
 [ 1.12815215 -0.76072691  0.05496398  0.83369302]
 [ 0.25517727  0.0729628   1.14279271 -1.32381522]
 [ 1.12815215  1.22385356 -0.4498575  -0.44231533]]
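
These centers are expressed in standardised units, which makes them hard to read. If the fitted scaler is kept in a variable, `inverse_transform` maps the centers back to the original units. The sketch below assumes exactly that; the `raw` data, column meanings and seed are made up, and the notebook itself does not store its scaler.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic two-column data (think: age, income); values are hypothetical
rng = np.random.default_rng(0)
raw = np.vstack([
    rng.normal(loc=[25, 30], scale=[2, 3], size=(30, 2)),
    rng.normal(loc=[55, 90], scale=[2, 3], size=(30, 2)),
])

# Keep the fitted scaler so the transformation can be undone later
scaler = StandardScaler().fit(raw)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaler.transform(raw))

# Cluster centers converted back to the original units
centers_original = scaler.inverse_transform(model.cluster_centers_)
print(np.round(centers_original, 1))
```

The converted centers are directly comparable to the per-cluster means of the unscaled columns.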


In [None]:
df1 = df.copy()

In [None]:
df1['cluster']  = k_means.labels_
df1['Gender'] = df1['Gender'].map({1:'Male',0:'Female'})
df1.head()

,Gender,Age,Annual Income (k$),Spending Score (1-100),cluster
0,Male,19,15,39,2
1,Male,21,15,81,2
2,Female,20,16,6,0
3,Female,23,16,77,1
4,Female,31,17,40,0


**Visualization of KMeans Cluster**

In [None]:
fig7 = px.scatter_3d(df1,x='cluster',y='Age',z='Spending Score (1-100)',color ='Annual Income (k$)',title='3-D Visualisation of KMeans')
fig7.show()

In [None]:
fig8 = px.scatter(df1,x='Annual Income (k$)',y='Spending Score (1-100)',color='cluster',title='2-D Visualisation of KMeans')
fig8.show()

In [None]:
fig9 = px.scatter_3d(df1,x='cluster',y='Age',z='Spending Score (1-100)',color ='cluster')
fig9.show()

In [None]:
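# numeric_only=True skips the string 'Gender' column, which recent pandas versions refuse to average
df1.groupby('cluster').mean(numeric_only=True)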

cluster,Age,Annual Income (k$),Spending Score (1-100)
0,49.325581,47.0,38.813953
1,28.392857,60.428571,68.178571
2,28.25,62.0,71.675
3,39.866667,90.5,16.1
4,55.903226,48.774194,38.806452


Clusters 1 and 2 have a similar mean age (about 28), similar Annual Income (about 60-62 k$) and the highest Spending Scores (about 68-72).
Cluster 3 (mean age about 40) has the highest Annual Income (90.5 k$) but the lowest Spending Score (16.1).
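
Cluster means are easier to interpret alongside cluster sizes, since a very small cluster can have an extreme mean. The following is a minimal sketch of such a profile table using named aggregation; the `sample` rows and the output column names (`customers`, `mean_age`, `mean_spending`) are made up for illustration, and the same call could be applied to `df1`.

```python
import pandas as pd

# Made-up rows for illustration only (not the notebook's clustering result)
sample = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1, 2],
    "Age": [50, 48, 28, 30, 26, 40],
    "Spending Score (1-100)": [40, 38, 70, 66, 68, 16],
})

# One row per cluster: size plus the mean of each feature of interest
profile = sample.groupby("cluster").agg(
    customers=("Age", "size"),
    mean_age=("Age", "mean"),
    mean_spending=("Spending Score (1-100)", "mean"),
)
print(profile)
```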