# Mall Customer Segmentation Analysis

In this notebook, I cluster mall customers based on their annual income and spending score and analyze the resulting segments.

# 1. Exploratory Data Analysis

In [None]:
# Importing libraries needed
from sklearn.preprocessing import StandardScaler
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
import os
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv') # Read dataset file
df.head(5) # Get first 5 rows of the dataset

In [None]:
df.info() # Get dataset info

In [None]:
# Renaming columns
df.rename(index=str, columns={'Annual Income (k$)': 'Income',
                              'Spending Score (1-100)': 'Score'}, inplace=True)
df.head()

In [None]:
df.shape # Get the dataset dimension

In [None]:
df.columns # Get column indexes

In [None]:
# Check if there are any missing values
df.isnull().any().any()

In [None]:
# Get the descriptive statistics of the dataset
df.describe(include='all')

In [None]:
# Create distribution plots for Annual Income, Customer's Age, and Spending Score

plt.rcParams['figure.figsize'] = (20, 5)
sns.set(style = 'whitegrid')

# Distribution plot for annual income
plt.subplot(1, 3, 1)
sns.histplot(df['Income'], kde = True) # histplot replaces the deprecated distplot
plt.title('Distribution of Annual Income', fontsize = 16)
plt.xlabel('Range of Annual Income')
plt.ylabel('Count')

# Distribution plot for customer's age
plt.subplot(1, 3, 2)
sns.histplot(df['Age'], kde = True, color = 'red')
plt.title("Distribution of Customer's Age", fontsize = 16)
plt.xlabel('Range of Age')
plt.ylabel('Count')

# Distribution plot for spending score
plt.subplot(1, 3, 3)
sns.histplot(df['Score'], kde = True, color = 'orange')
plt.title('Distribution of Spending Score', fontsize = 16)
plt.xlabel('Range of Spending Score')
plt.ylabel('Count')
plt.show()

In [None]:
# Visualization on number of customer based on gender

plt.figure(1 , figsize = (15 , 5))
sns.countplot(y = 'Gender' , data = df)
plt.show()

In [None]:
# Visualization of the percentage of customers by gender

size = df['Gender'].value_counts()
labels = size.index # take the labels from the counts so they always match the slices
colors = ['lightgreen', 'orange']
explode = [0, 0.1]

# Plot pie chart
plt.rcParams['figure.figsize'] = (9, 9)
plt.pie(size, colors = colors, explode = explode, labels = labels, shadow = True, autopct = '%.2f%%')
plt.title('Gender', fontsize = 20)
plt.axis('off')
plt.legend()
plt.show()

In [None]:
# Customer's distribution based on age
plt.figure(figsize=(20,5))
sns.countplot(x=df['Age'])
plt.xticks(rotation=90)
plt.title('Age Distribution')

Customers between 25 and 40 years old visit the mall more than any other age group.

In [None]:
# Spending score comparison based on gender
sns.boxplot(x=df['Gender'], y=df['Score'])
plt.title('Spending Score Comparison Based on Gender')

This box plot compares the spending scores of female and male customers; the line inside each box is the median. The median spending score of female customers is higher than that of male customers, and both their highest and their lowest spending scores are higher as well.

In [None]:
# Customer's distribution based on annual income

plt.figure(figsize=(25,5))
sns.countplot(x=df['Income'])
plt.title('Annual Income Distribution')

Customers with annual incomes of 54k and 78k are the most frequent visitors to the mall.

In [None]:
# Heatmap correlation

plt.figure(figsize=(10,10))
sns.heatmap(df.select_dtypes('number').corr(),linewidths=.1,cmap="YlGnBu", annot=True) # correlate numeric columns only; 'Gender' is text
plt.yticks(rotation=0);
plt.title('Heatmap Correlation on Mall Customer Segmentation')

In [None]:
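# Pair plot visualization to see whether gender has a direct relation to customer segmentation
X = df.drop(['CustomerID', 'Gender'], axis=1) # numeric columns kept for clustering: Age, Income, Score
g = sns.pairplot(df.drop('CustomerID', axis=1), hue='Gender', aspect=1.5)
g.fig.suptitle('Pair Plot Visualization on Gender', y=1.02) # title for the whole grid, not just the last panel
plt.show()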

In [None]:
#Visualization of Spending score over income

plt.bar(df['Income'],df['Score'])
plt.title('Spending Score Over Income')
plt.xlabel('Income')
plt.ylabel('Score')

Customers with incomes in the ranges 20k-40k and 70k-100k have the highest spending scores.

In [None]:
# Show the data that we are going to cluster

plt.scatter(df['Income'],df['Score'])
plt.title('Spending Score over Income')
plt.xlabel('Income')
plt.ylabel('Spend Score')

By visual inspection, the data appears to fall into 5 clusters.

# 2. Clustering Model: K-Means Clustering

The K-Means clustering algorithm requires the number of clusters, K, to be chosen in advance. To find a suitable K, I use the Elbow Method.
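The quantity the Elbow Method tracks is the inertia: the sum of squared distances from each sample to the centre of its assigned cluster. The following is a minimal sketch on synthetic data that recomputes it by hand and compares it with `KMeans.inertia_`. The `make_blobs` data and the `*_toy` names are illustrative assumptions, not part of the mall dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 3 groups (illustrative only, not the mall data)
X_toy, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=0)

km_toy = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)

# Look up the centre each sample was assigned to, then sum the squared distances
assigned_centres = km_toy.cluster_centers_[km_toy.labels_]
manual_inertia = float(np.sum((X_toy - assigned_centres) ** 2))

print(manual_inertia, km_toy.inertia_)
```

Inertia shrinks as K grows, so the elbow is read as the point where adding another cluster stops reducing it by much.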

In [None]:
# Defining elbow point to determine K value
from sklearn.cluster import KMeans

# Inertia list
clusters = []
for i in range(1,11):
  km = KMeans(n_clusters=i).fit(X)
  clusters.append(km.inertia_)

# Plot inertia
fig, ax = plt.subplots(figsize=(8, 4))
sns.lineplot(x=list(range(1, 11)), y=clusters, ax=ax, marker=".", markersize=10)
ax.set_title('Elbow Method')
ax.set_xlabel('Clusters')
ax.set_ylabel('Inertia')

In [None]:
# Defining Elbow Point

fig, ax = plt.subplots(figsize=(8, 4))
sns.lineplot(x=list(range(1, 11)), y=clusters, ax=ax, marker=".", markersize=10)
ax.set_title('Elbow Method')
ax.set_xlabel('Clusters')
ax.set_ylabel('Inertia')
ax.annotate('Optimal Elbow Point', xy=(5, 80000), xytext=(5, 150000), xycoords='data',          
             arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='blue', lw=2))

plt.show()

As the plot shows, the elbow is at 5 clusters, so I set K = 5.

In [None]:
# Fit K-Means on the feature columns only
features = ['Age', 'Income', 'Score']
km5 = KMeans(n_clusters=5).fit(X[features])

# Add the cluster labels as a new column
X['Labels'] = km5.labels_

# Plot 5-clusters K-Means
plt.figure(figsize=(8,4))
sns.scatterplot(x=X['Income'], y=X['Score'], hue=X['Labels'],
                palette=sns.color_palette('hls', 5))
plt.title('5-Clusters K-Means')
plt.show()

To measure the quality of the clustering model, I use the Silhouette Coefficient.

In [None]:
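# Silhouette Coefficient of K-Means Model

from sklearn import metrics
# Score on the feature columns only, so the 'Labels' column is not treated as a feature
round(metrics.silhouette_score(X[['Age', 'Income', 'Score']], X['Labels']), 2)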

# 3. Clustering Model: DBSCAN Clustering

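DBSCAN Clustering is quite different from K-Means Clustering. For DBSCAN, we have to determine two values: the minimum number of points/samples (MinPts) and epsilon. Epsilon is the maximum distance between two points for them to count as neighbors. MinPts is the number of points, including the point itself, that must lie within epsilon of a point for it to be a core point of a cluster. Epsilon is not a bound on the distance between any two points in the same cluster, since clusters grow by chaining neighboring core points.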
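To make the two parameters concrete, here is a small sketch on hand-made points. The coordinates and the `eps`/`min_samples` values are invented for illustration. Two tight groups become clusters, while a single isolated point has too few neighbors and receives the noise label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of 5 points each, plus one isolated point (illustrative data)
group_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
group_b = group_a + 5.0
isolated = np.array([[10.0, 10.0]])
points = np.vstack([group_a, group_b, isolated])

# A point is a core point if at least 4 points (itself included) lie within 0.5 of it
db_toy = DBSCAN(eps=0.5, min_samples=4).fit(points)

print(db_toy.labels_)                     # cluster id per point, -1 marks noise
print(len(db_toy.core_sample_indices_))   # number of core points
```

Every point in the two groups is a core point here, because all five members of a group lie within `eps` of each other.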

In [None]:
# Determine MinPts and Epsilon

from sklearn.neighbors import NearestNeighbors

data_c = pd.DataFrame({'Age': df['Age'], 'Income': df['Income'], 'Score': df['Score']})
neighbors = NearestNeighbors(n_neighbors=6) # n_neighbors is the MinPts
neighbors_fit = neighbors.fit(data_c)
distances, indices = neighbors_fit.kneighbors(data_c)
distances = np.sort(distances, axis=0) # sort each column in ascending order
distances = distances[:,1] # column 0 is each point's distance to itself; column 1 is its nearest other point
plt.plot(distances)
plt.xlabel('Points sorted by distance')
plt.ylabel('Distance to nearest neighbor')


From this graph, the epsilon value is estimated at around 11.0. In the next step, MinPts is tuned empirically until the desired number of labels is reached.

In [None]:
# Create DBSCAN Object

from sklearn.cluster import DBSCAN 

# Fit on the feature columns only, so labels from an earlier model are not used as a feature
features = ['Age', 'Income', 'Score']
db = DBSCAN(eps=11, min_samples=6).fit(X[features])

X['Labels'] = db.labels_
plt.figure(figsize=(12, 4))
sns.scatterplot(x=X['Income'], y=X['Score'], hue=X['Labels'], 
                palette=sns.color_palette('hls', np.unique(db.labels_).shape[0]))
plt.title('DBSCAN with epsilon 11, min samples 6')
plt.show()

In [None]:
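# Silhouette Coefficient of DBSCAN Model

# Score on the feature columns only, so the 'Labels' column is not treated as a feature
round(metrics.silhouette_score(X[['Age', 'Income', 'Score']], X['Labels']), 2)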

As we can see, DBSCAN doesn't perform very well here because the data is not dense enough. The label -1 marks outliers (noise), and many points end up with that label. A larger dataset might have given better results.

# 4. Clustering Model: Mean Shift Clustering

In Mean Shift Clustering, we need to determine a bandwidth value. The bandwidth is the distance/size scale of the kernel function, i.e. the size of the “window” across which the mean is calculated. This parameter can be set manually or estimated with the provided `estimate_bandwidth` function, which is also called automatically if the bandwidth is not set.

Unlike the popular K-Means clustering algorithm, Mean Shift does not require specifying the number of clusters in advance. The number of clusters is determined by the algorithm with respect to the data.
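The effect of the bandwidth can be sketched on synthetic data. The `make_blobs` data, the `quantile=0.2` setting, and the deliberately oversized bandwidth of 100 are assumptions chosen for illustration. The number of clusters is read off the fitted labels rather than passed in, and a window wide enough to cover every point merges everything into a single cluster.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic data: three groups centred 10 units apart (illustrative only)
X_toy, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [0, 10]],
                      cluster_std=0.8, random_state=0)

# Bandwidth estimated from the data
bw = estimate_bandwidth(X_toy, quantile=0.2)
ms_toy = MeanShift(bandwidth=bw).fit(X_toy)
n_found = len(np.unique(ms_toy.labels_))

# A window that covers every point collapses all samples into one cluster
ms_wide = MeanShift(bandwidth=100).fit(X_toy)
n_wide = len(np.unique(ms_wide.labels_))

print(f"estimated bandwidth: {bw:.2f}, clusters found: {n_found}")
print(f"bandwidth 100, clusters found: {n_wide}")
```

A smaller `quantile` gives a smaller bandwidth and usually more clusters, so it is worth trying a few values before settling on one.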

In [None]:
# Create Mean Shift Object

from sklearn.cluster import MeanShift, estimate_bandwidth

# Fit on the feature columns only, so labels from an earlier model are not used as a feature
features = ['Age', 'Income', 'Score']

# Estimate the bandwidth from the data
bandwidth = estimate_bandwidth(X[features], quantile=0.1)
ms = MeanShift(bandwidth=bandwidth).fit(X[features])

X['Labels'] = ms.labels_
plt.figure(figsize=(12, 8))
sns.scatterplot(x=X['Income'], y=X['Score'], hue=X['Labels'], 
                palette=sns.color_palette('hls', np.unique(ms.labels_).shape[0]))
plt.title('MeanShift')
plt.show()

In [None]:
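# Silhouette Coefficient of Mean Shift Model

# Score on the feature columns only, so the 'Labels' column is not treated as a feature
round(metrics.silhouette_score(X[['Age', 'Income', 'Score']], X['Labels']), 2)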

# 5. Evaluation on Clustering Models

Since the dataset has no ground-truth labels, I can't measure clustering accuracy using the Rand Index or cross-validation. However, I can measure the quality of the clustering models using the silhouette coefficient. The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
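For a single sample, the silhouette coefficient is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes the mean coefficient directly from that definition on four invented 1-D points and checks it against `silhouette_score`. The helper `manual_silhouette` and the toy data are hypothetical, written only for this illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def manual_silhouette(points, labels):
    """Mean silhouette coefficient computed directly from the definition."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        dists = np.linalg.norm(points - p, axis=1)
        same = labels == own
        same[i] = False                      # exclude the point itself
        if not same.any():
            scores.append(0.0)               # convention for single-point clusters
            continue
        a = dists[same].mean()               # mean distance within own cluster
        b = min(dists[labels == other].mean()
                for other in np.unique(labels) if other != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated pairs on a line (illustrative data)
toy_points = [[0.0], [1.0], [10.0], [11.0]]
toy_labels = [0, 0, 1, 1]

manual = manual_silhouette(toy_points, toy_labels)
reference = silhouette_score(np.array(toy_points), toy_labels)
print(round(manual, 2), round(reference, 2))
```

Because the two pairs are far apart relative to their internal spacing, b is much larger than a for every point and the score comes out close to 1.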

My K-Means model has a Silhouette Coefficient of 0.44, the DBSCAN model 0.18, and the Mean Shift model 0.45. Mean Shift has the highest coefficient, but its margin over K-Means is negligible.

# 6. Conclusion

Based on the Mean Shift clustering results, we can now describe the 5 clusters in detail:

- `Label 0` is mid income and mid spending
- `Label 1` is high income and high spending
- `Label 2` is high income and low spending
- `Label 3` is low income and high spending
- `Label 4` is low income and low spending
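A per-cluster profile like the list above can be read off a table of mean values per label. The sketch below shows the idea on a small hypothetical DataFrame. Its values are invented for illustration and are not taken from the mall dataset.

```python
import pandas as pd

# Hypothetical labelled customers (values invented for illustration)
toy = pd.DataFrame({
    'Income': [15, 18, 85, 90, 55, 60],
    'Score':  [80, 75, 15, 20, 50, 45],
    'Labels': [3, 3, 2, 2, 0, 0],
})

# Mean income and spending score for each cluster label
profile = toy.groupby('Labels')[['Income', 'Score']].mean()
print(profile)
```

On the notebook's data, the equivalent call would be `X.groupby('Labels')[['Income', 'Score']].mean()` once `X['Labels']` holds the Mean Shift labels.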

In [None]:
# Plot swarmplot to analyze clusters
X['Labels'] = ms.labels_

fig = plt.figure(figsize=(20,8))
ax = fig.add_subplot(121)
sns.swarmplot(x='Labels', y='Income', data=X, ax=ax)
ax.set_title('Labels According to Annual Income')

ax = fig.add_subplot(122)
sns.swarmplot(x='Labels', y='Score', data=X, ax=ax)
ax.set_title('Labels According to Scoring')

plt.show()