# CUSTOMER SEGMENTATION AND CLUSTERING

# CONTEXT

You own a supermarket mall, and through membership cards you have some basic data about your customers: customer ID, age, gender, annual income, and spending score.
The spending score is a value you assign to each customer based on parameters you define, such as customer behavior and purchasing data.

# PROBLEM STATEMENT

The goal is to gain insight into the customers of a mall.
We need to build a system that groups customers into clusters
based on their purchases and other attributes in the dataset.
For example, one group may spend a lot of money at the mall, while
another spends much less.
Knowing these groups helps the mall's management make decisions
that improve its marketing plan.

Clustering is an unsupervised learning technique.

# WORKFLOW

1. Gathering relevant data.
2. Understanding the k-means clustering technique.
3. Understanding how the k-means algorithm segments customers.
4. Using the pandas library and data visualization to analyze the data.
5. Data slicing (data abstraction).
6. Applying the algorithm to the data.
7. Plotting the elbow curve.
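
Steps 2 and 3 rely on the k-means algorithm, which alternates between assigning every point to its nearest centroid and moving each centroid to the mean of its assigned points. The block below is a minimal NumPy sketch of that loop, not the scikit-learn implementation used later in this notebook. The helper name `kmeans_sketch` and the toy (income, spending) values are assumptions made up for illustration.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain Lloyd-style k-means; a hypothetical helper for illustration only."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the assignment is stable
        centroids = new_centroids
    return labels, centroids

# made-up (income, spending) pairs forming two well-separated groups
toy = np.array([[15., 80.], [16., 82.], [17., 78.],
                [90., 10.], [92., 12.], [88., 14.]])
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves carry no meaning: a different seed can swap which group is called 0 and which is called 1.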

In [1]:
#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

# A. IMPORTING THE DATA


In [2]:
#load the dataset first so that customer_data exists before it is modified
customer_data = pd.read_csv('customer_segmen.csv')

In [None]:
#drop the ID column and shorten the remaining column names
customer_data.drop(['CustomerID'],axis=1,inplace=True)
customer_data.columns = ['Gender','Age','Income','Spending']

# B. EXPLORING THE DATA

This project uses the Mall Customer Segmentation Data dataset published on Kaggle.

In [None]:
customer_data.head()

In [None]:
customer_data.shape

In [None]:
customer_data.info()

In [None]:
customer_data.isnull().sum()
#no null values

In [None]:
#encode gender as a number: Female = 1, anything else = 0
customer_data['Gender'] = (customer_data['Gender'] == "Female").astype(int)
customer_data.head()

# C. VISUALIZING THE DATA

---> Pie chart to check the distribution of male and female population in the dataset. 

In [None]:
labels = ['Female', 'Male']
#order the counts to match the labels (Female is encoded as 1, Male as 0)
size = customer_data['Gender'].value_counts().reindex([1, 0])
colors = ['blue', 'orange']
explode = [0, 0.1]

plt.rcParams['figure.figsize'] = (9, 9)
plt.pie(size, colors = colors, explode = explode, labels = labels, shadow = True, autopct = '%.2f%%')
plt.title('Gender', fontsize = 15)
plt.axis('off')
plt.legend()
plt.show()

Female customers outnumber male customers in the dataset.

---> Distribution of number of customers in each age group.

In [None]:
age18_25 = customer_data.Age[(customer_data.Age <= 25) & (customer_data.Age >= 18)]
age26_35 = customer_data.Age[(customer_data.Age <= 35) & (customer_data.Age >= 26)]
age36_45 = customer_data.Age[(customer_data.Age <= 45) & (customer_data.Age >= 36)]
age46_55 = customer_data.Age[(customer_data.Age <= 55) & (customer_data.Age >= 46)]
age55above = customer_data.Age[customer_data.Age >= 56]

x = ["18-25","26-35","36-45","46-55","56+"]
y = [len(age18_25.values),len(age26_35.values),len(age36_45.values),len(age46_55.values),len(age55above.values)]

plt.figure(figsize=(15,6))
sns.barplot(x=x, y=y, palette="rocket")
plt.title("Number of Customers by Age Group")
plt.xlabel("Age")
plt.ylabel("Number of Customers")
plt.show()

The 26–35 age group contains more customers than any other age group.
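
The same age grouping can be written more compactly with `pd.cut`, which assigns every value to a labelled bin. The sketch below uses a short made-up list of ages in place of `customer_data['Age']`, and the last bin is labelled `56+` because it starts at age 56. The bin edges are chosen to match the filters used in the cell above.

```python
import numpy as np
import pandas as pd

# made-up ages standing in for customer_data['Age']
ages = pd.Series([18, 25, 26, 35, 36, 45, 46, 55, 56, 70])

# right-closed bins: (17, 25], (25, 35], (35, 45], (45, 55], (55, inf]
bins = [17, 25, 35, 45, 55, np.inf]
labels = ["18-25", "26-35", "36-45", "46-55", "56+"]

age_groups = pd.cut(ages, bins=bins, labels=labels)

# number of customers that fall into each age group
counts = age_groups.value_counts(sort=False)
print(counts)
```

The resulting `counts` could be passed to `sns.barplot` as `x=counts.index` and `y=counts.values`. The same pattern applies to the spending score and income bins.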

---> Bar plot to visualize the number of customers according to their spending scores. 

In [None]:
ss1_20 = customer_data["Spending"][(customer_data["Spending"] >= 1) & (customer_data["Spending"] <= 20)]
ss21_40 = customer_data["Spending"][(customer_data["Spending"] >= 21) & (customer_data["Spending"] <= 40)]
ss41_60 = customer_data["Spending"][(customer_data["Spending"] >= 41) & (customer_data["Spending"] <= 60)]
ss61_80 = customer_data["Spending"][(customer_data["Spending"] >= 61) & (customer_data["Spending"] <= 80)]
ss81_100 = customer_data["Spending"][(customer_data["Spending"] >= 81) & (customer_data["Spending"] <= 100)]

ssx = ["1-20", "21-40", "41-60", "61-80", "81-100"]
ssy = [len(ss1_20.values), len(ss21_40.values), len(ss41_60.values), len(ss61_80.values), len(ss81_100.values)]

plt.figure(figsize=(15,6))
sns.barplot(x=ssx, y=ssy, palette="nipy_spectral_r")
plt.title("Spending Scores")
plt.xlabel("Score")
plt.ylabel("Number of Customers")
plt.show()

The largest group of customers has a spending score in the range 41–60.

---> Bar plot to visualize the number of customers according to their annual income.

In [None]:
ai0_30 = customer_data["Income"][(customer_data["Income"] >= 0) & (customer_data["Income"] <= 30)]
ai31_60 = customer_data["Income"][(customer_data["Income"] >= 31) & (customer_data["Income"] <= 60)]
ai61_90 = customer_data["Income"][(customer_data["Income"] >= 61) & (customer_data["Income"] <= 90)]
ai91_120 = customer_data["Income"][(customer_data["Income"] >= 91) & (customer_data["Income"] <= 120)]
ai121_150 = customer_data["Income"][(customer_data["Income"] >= 121) & (customer_data["Income"] <= 150)]


aix = ["$ 0 - 30,000", "$ 30,001 - 60,000", "$ 60,001 - 90,000", "$ 90,001 - 120,000", "$ 120,001 - 150,000"]
aiy = [len(ai0_30.values), len(ai31_60.values), len(ai61_90.values), len(ai91_120.values), len(ai121_150.values)]


plt.figure(figsize=(15,6))
sns.barplot(x=aix, y=aiy, palette="Set2")
plt.title("Annual Incomes")
plt.xlabel("Income")
plt.ylabel("Number of Customers")
plt.show()

The largest share of customers has an annual income between $60,000 and $90,000.

In [None]:
plt.figure(figsize=(20,5))

# Age
plt.subplot(1,3,1)
sns.histplot(customer_data['Age'],kde=True,stat="density", kde_kws=dict(cut=3), alpha=.4)

# Income
plt.subplot(1,3,2)
sns.histplot(customer_data['Income'],kde=True,stat="density", kde_kws=dict(cut=3), alpha=.4)

# Spending
plt.subplot(1,3,3)
sns.histplot(customer_data['Spending'],kde=True,stat="density", kde_kws=dict(cut=3), alpha=.4)

# D. CLUSTERING ALGORITHMS


# a. K-Means Clustering

The elbow method for choosing K can be summarized in the following steps:

1. Compute K-Means clustering for different values of K by varying K from 1 to 14 clusters.

2. For each K, calculate the total within-cluster sum of squares (WCSS).

3. Plot the curve of WCSS vs the number of clusters K.

4. The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters.
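
WCSS is the sum, over all points, of the squared distance between a point and the centroid of the cluster it belongs to. scikit-learn exposes this value as the `inertia_` attribute of a fitted `KMeans` model. The sketch below recomputes it by hand on synthetic data from `make_blobs` (an assumption, not the mall data) to show that the two numbers agree.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic stand-in for the customer features (not the mall data)
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

model = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)

# look up, for every point, the centroid of the cluster it was assigned to
assigned_centroids = model.cluster_centers_[model.labels_]

# WCSS: total squared distance from each point to its own centroid
wcss_manual = np.sum((X - assigned_centroids) ** 2)

print(wcss_manual, model.inertia_)
```

Because adding clusters can only shrink these distances, WCSS keeps falling as K grows, which is why the elbow method looks for the point where the decrease levels off rather than for a minimum.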

FINDING CLUSTERS

In [None]:
from sklearn.cluster import KMeans
wcss = []

#fit K-Means for K = 1..14 and record the WCSS (inertia_) of each model
for each in range(1,15):
    kmeans = KMeans(n_clusters=each,init="k-means++",random_state=0)
    kmeans.fit(customer_data)
    wcss.append(kmeans.inertia_)

In [None]:
plt.plot(range(1,15), wcss, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')
plt.show()

OPTIMUM NUMBER OF CLUSTERS = 5
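
Reading the elbow off a plot is a judgment call, so a second criterion can help confirm the choice. One common option, which this notebook does not otherwise use, is the silhouette score from `sklearn.metrics`: it is higher when clusters are compact and well separated. The sketch below runs it on synthetic data with five known groups (an assumption, not the mall data).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic data with five well-separated groups (not the mall data)
centers = [[0, 0], [10, 0], [0, 10], [10, 10], [5, 20]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.8, random_state=0)

scores = {}
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# the K with the highest average silhouette score
best_k = max(scores, key=scores.get)
print(best_k)
```

On the real customer data the two criteria may not agree exactly, in which case the elbow plot and the interpretability of the segments should both be weighed.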

TRAINING THE K-MEANS CLUSTERING MODEL

In [None]:
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=0)
Y = kmeans.fit_predict(customer_data)
print(Y)

This clustering analysis separates the mall's customers into five segments, which we label Miser, General, Target, Spendthrift, and Careful. The segments are described in terms of annual income and spending score, which are reportedly the most useful attributes for segmenting mall customers.

VISUALIZING CLUSTERS

In [None]:
kmeans_2 = KMeans(n_clusters=5, init="k-means++", random_state=0)

#fit on the four feature columns and store each customer's cluster label
features = ['Gender', 'Age', 'Income', 'Spending']
customer_data['label'] = kmeans_2.fit_predict(customer_data[features])

#one scatter call per cluster so that each cluster gets its own color
colors = ['red', 'blue', 'orange', 'yellow', 'purple']
for label, color in enumerate(colors):
    cluster = customer_data[customer_data.label == label]
    plt.scatter(cluster.Spending, cluster.Income, color=color, label=f'Cluster {label + 1}')

#cluster_centers_ follows the feature order: Spending is column 3, Income is column 2
plt.scatter(kmeans_2.cluster_centers_[:,3], kmeans_2.cluster_centers_[:,2], color='black', label='Centroids')
plt.xlabel('Spending')
plt.ylabel('Income')
plt.legend()
plt.show()

Cluster 1 contains customers with high annual income but low spending score.

Cluster 2 contains customers with average annual income and average spending score.

Cluster 3 contains customers with high annual income and high spending score.

Cluster 4 contains customers with low annual income but high spending score.

Cluster 5 contains customers with low annual income and low spending score.
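
Descriptions like these can be checked numerically by averaging income and spending within each cluster. The sketch below does this with `groupby` on a small made-up table. The `labelled` DataFrame and its values are assumptions standing in for `customer_data` after the `label` column has been added.

```python
import pandas as pd

# made-up labelled customers standing in for customer_data after clustering
labelled = pd.DataFrame({
    "Income":   [15, 18, 85, 90, 50, 55],
    "Spending": [80, 75, 15, 10, 50, 45],
    "label":    [0, 0, 1, 1, 2, 2],
})

# mean income and spending score per cluster
profile = labelled.groupby("label")[["Income", "Spending"]].mean()

# number of customers in each cluster
profile["Count"] = labelled.groupby("label").size()
print(profile)
```

Since K-Means numbers its clusters arbitrarily, a table like this is the reliable way to tell which label corresponds to which segment in a given run.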

# b. Hierarchical Clustering

In [None]:
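from scipy.cluster.hierarchy import linkage,dendrogram
#exclude the K-Means 'label' column so it does not influence the linkage
merg = linkage(customer_data.drop(columns=['label'], errors='ignore'),method='ward')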

dendrogram(merg, leaf_rotation=90)
plt.xlabel("Data points")
plt.ylabel("Euclidean distance")
plt.show()

# CONCLUSION

As the dendrogram shows, a horizontal line drawn through the tallest vertical branches, where the gap between successive merges is largest, crosses five vertical lines. So the optimal number of clusters for hierarchical clustering is 5.
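
Once the number of clusters has been read off the dendrogram, the tree can be cut into flat cluster labels with `scipy.cluster.hierarchy.fcluster`. The sketch below does this on synthetic data with five groups (an assumption, not the mall data), using the `maxclust` criterion to request at most five clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# synthetic stand-in for the customer features (not the mall data)
centers = [[0, 0], [10, 0], [0, 10], [10, 10], [5, 20]]
X, _ = make_blobs(n_samples=100, centers=centers, cluster_std=0.8, random_state=0)

# same Ward linkage as in the dendrogram cell
merg = linkage(X, method="ward")

# cut the tree so that at most 5 flat clusters remain (labels start at 1)
hier_labels = fcluster(merg, t=5, criterion="maxclust")
print(np.unique(hier_labels, return_counts=True))
```

These labels can then be compared with the K-Means labels, for example with `pd.crosstab`, to see how far the two methods agree on the segments.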