# **Customer Segmentation Analysis: Clustering (K-Means)**

## Problem Statement

Use customer data from several shopping complexes to segment customers, then use the resulting segments to strengthen our brand's marketing strategy in new and existing markets.

## Stakeholders

The main stakeholders for this analysis are:
1. Customers: customers of both the brand and the shopping complex.
2. Occupants: shopkeepers and workers in the shopping complex.
3. Mall Management: the mall's management and marketing team, who will take decisions based on the analysis.
4. Brand Top-Level Management: the brand's marketing team, who will take decisions based on the analysis.
5. Government and Local Statutory Bodies
6. Investors of the brand
7. Employees in the company
8. Vendors (including the shopping complex)

## Business Metric

To quantify the objective of the brand, the Key Performance Indicators (KPI) are:
1. Customer Behaviour and Purchasing Data (Spending Score)
2. Annual Income 
3. Ratio of Spending Score to Annual Income (0-1)
4. Customer Retention
5. Gender Ratio
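
The notebook does not state how KPI 3 is computed or scaled to the 0-1 range. The sketch below shows one possible reading: divide the spending score by the annual income, then min-max scale the result. The function name `spending_income_ratio`, the choice of min-max scaling and the sample values are all assumptions made for illustration.

```python
import numpy as np

def spending_income_ratio(spending_score, annual_income_k):
    """Spending score per k$ of income, min-max scaled to [0, 1] (assumed scaling)."""
    spending = np.asarray(spending_score, dtype=float)
    income = np.asarray(annual_income_k, dtype=float)
    raw = spending / income                # assumes every income is greater than 0
    span = raw.max() - raw.min()
    if span == 0:                          # all customers share the same ratio
        return np.zeros_like(raw)
    return (raw - raw.min()) / span

# Made-up values, not rows from the dataset
ratio = spending_income_ratio([40, 80, 10, 60], [20, 20, 50, 100])
print(ratio.round(3))
```

A customer with a ratio near 1 spends a lot relative to income, while a ratio near 0 marks a customer whose spending is low relative to income.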

## Data Science Metric

Metrics to evaluate our K-Means clustering model:
1. Comparison of various clustering techniques/algorithms
2. Elbow Method (calculation of optimal number of clusters)
3. Tightness of clusters
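
Tightness of clusters can be quantified in several ways, and the notebook does not commit to one. A common choice is the silhouette coefficient, which compares each sample's mean distance to its own cluster with its mean distance to the nearest other cluster. The sketch below is a small NumPy version for illustration; the function name `mean_silhouette` and the toy points are assumptions, and it expects at least two clusters and no duplicate points.

```python
import numpy as np

def mean_silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance (needs >= 2 clusters)."""
    X = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (n_samples, n_samples)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    unique = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue                       # a singleton cluster scores 0 by convention
        # Mean distance to the other members of the same cluster (dist[i, i] is 0)
        a = dist[i, own].sum() / (n_own - 1)
        # Smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == other].mean() for other in unique if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Toy data: two well-separated pairs of points
toy_points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good_score = mean_silhouette(toy_points, [0, 0, 1, 1])   # clusters match the pairs
bad_score = mean_silhouette(toy_points, [0, 1, 0, 1])    # clusters split the pairs
print(round(good_score, 3), round(bad_score, 3))
```

Values close to 1 indicate tight, well-separated clusters, and values below 0 indicate samples that sit closer to another cluster than to their own. scikit-learn provides the same measure as `sklearn.metrics.silhouette_score`.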

#### Import necessary libraries

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering 
from sklearn.cluster import DBSCAN 
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn import metrics
import plotly as py
import plotly.graph_objs as go

## Data Extraction, Transformation and Loading (ETL)

#### Load dataset into the notebook

In [None]:
# importing the dataset
data = pd.read_csv('/kaggle/input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')
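# keep an independent copy; plain assignment would only create a second name for the same DataFrame
org_data = data.copy()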
data.shape

#### Review Dataset

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.info()

In [None]:
data.isnull().sum()

In [None]:
data.describe()

## Exploratory Data Analysis

#### Gender Distribution

In [None]:
# take the labels from value_counts() so each label always matches its count
size = data['Gender'].value_counts()
labels = size.index
colors = ['#ffb4c4', '#008bff']
explode = [0, 0.1]

plt.rcParams['figure.figsize'] = (9, 9)
plt.pie(size, colors = colors, explode = explode, labels = labels, shadow = True, autopct = '%.2f%%')
plt.title('Gender', fontsize = 15)
plt.axis('off')
plt.legend()
plt.show()

The pie chart above shows the gender distribution of the mall's customers.

Female customers account for 56% of the records and male customers for 44%. That is a notable gap, particularly if men make up the larger share of the surrounding population.

#### Annual Income and Age Distribution

In [None]:
plt.rcParams['figure.figsize'] = (14, 6)

plt.subplot(1, 2, 1)
sns.set(style = 'whitegrid')
# histplot replaces the deprecated distplot; kde = True keeps the density curve
sns.histplot(x = data['Annual Income (k$)'], kde = True)
plt.title('Distribution of Annual Income', fontsize = 15)
plt.xlabel('Range of Annual Income')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
sns.set(style = 'whitegrid')
sns.histplot(x = data['Age'], kde = True, color = 'orange')
plt.title('Distribution of Age', fontsize = 15)
plt.xlabel('Range of Age')
plt.ylabel('Count')
plt.show()

The plots above show the distributions of Annual Income and Age.

Few customers earn more than 100k US dollars. Most earn around 50-75k US dollars, and the lowest income is around 20k US dollars.

The most regular customers of the mall are around 30-35 years of age, whereas senior citizens are the least frequent visitors. Youngsters are fewer in number than middle-aged customers.

#### Age Distribution

In [None]:
plt.rcParams['figure.figsize'] = (16, 8)
# pass the column as x=; recent seaborn versions no longer accept it positionally
sns.countplot(x = data['Age'], palette = 'winter')
plt.title('Distribution of Age', fontsize = 14)
plt.show()

The bar graph above shows the number of customers at each age.

Customers aged 27 to 39 are very frequent, although there is no clear pattern within that range. Older age groups are less frequent than younger ones. Interestingly, ages 18 and 60 have an equal number of visitors. Customers aged about 55, 56, 64 and 69 visit the shopping complex least often, while customers aged about 32 are the most frequent visitors.

#### Annual Income Distribution

In [None]:
temp = data['Annual Income (k$)'].astype('int')
plt.rcParams['figure.figsize'] = (21, 8)
sns.countplot(x = temp, palette = 'summer')
plt.title('Distribution of Annual Income', fontsize = 15)
plt.show()

This graph shows the number of customers at each income level in more detail. Annual incomes range from 15k to 137k US dollars, and customers across this range visit the shopping complex with comparable frequency. Customers with annual incomes of 54k and 60k US dollars account for the most visits in the data provided.

#### Spending Score Distribution

In [None]:
plt.rcParams['figure.figsize'] = (21, 8)
sns.countplot(x = data['Spending Score (1-100)'], palette = 'autumn')
plt.title('Distribution Graph of Spending Score', fontsize = 15)
plt.show()

This is the most important graph from the perspective of the shopping complex and the brands, as it gives an idea of the Spending Score of the customers visiting the complex.

In general, most customers have a Spending Score in the range of 35-60. Interestingly, there are also customers with Spending Scores of 1 and 99, which shows that the mall caters to a wide variety of customers with varying needs and requirements.

#### Violin Plots and Swarm Plots

In [None]:
plt.figure(1 , figsize = (15 , 7))
n = 0 
for cols in ['Age' , 'Annual Income (k$)' , 'Spending Score (1-100)']:
    n += 1 
    plt.subplot(1 , 3 , n)
    plt.subplots_adjust(hspace = 0.5 , wspace = 0.5)
    sns.violinplot(x = cols , y = 'Gender' , data = data , palette = 'vlag')
    sns.swarmplot(x = cols , y = 'Gender' , data = data)
    plt.ylabel('Gender' if n == 1 else '')
    plt.title('Violin Plots & Swarm Plots' if n == 2 else '')
plt.show()

##### Male Population

The number of male customers decreases as age increases. Many datapoints have an annual income of 50-100k US dollars, and the count drops drastically above 100k. The spending score is spread fairly evenly across the male population.

##### Female Population

The number of female customers also decreases as age increases, with most datapoints between the ages of 20 and 40. Many datapoints have an annual income of 50-100k US dollars, and the count drops drastically above 100k. The spending score is spread fairly evenly across the female population, with a slight peak at 50 points.

## Data Analysis 

#### Pair Plots and Correlation

In [None]:
sns.pairplot(data)
# plt.title would only label the last subplot; suptitle labels the whole grid
plt.suptitle('Pairplot for the Data', fontsize = 15, y = 1.02)
plt.show()

The pair plot above shows the pairwise relationships between the attributes of the dataset. Strongly correlated features appear as points along a straight line, while uncorrelated features appear as a scattered cloud.

None of the attributes is strongly correlated with another, so no feature is redundant and we proceed with all of them.

#### Scatter Plots

In [None]:
plt.figure(1 , figsize = (15 , 6))
for gender in ['Male' , 'Female']:
    plt.scatter(x = 'Age' , y = 'Annual Income (k$)' , data = data[data['Gender'] == gender] ,
                s = 200 , alpha = 0.5 , label = gender)
plt.xlabel('Age'), plt.ylabel('Annual Income (k$)') 
plt.title('Age vs Annual Income w.r.t Gender')
plt.legend()
plt.show()

The graph above suggests that customers between the ages of 30 and 40 earn more than any other age group. According to the data, annual income also levels off as age increases.

In [None]:
plt.figure(1 , figsize = (15 , 6))
for gender in ['Male' , 'Female']:
    plt.scatter(x = 'Annual Income (k$)',y = 'Spending Score (1-100)' ,
                data = data[data['Gender'] == gender] ,s = 200 , alpha = 0.5 , label = gender)
plt.xlabel('Annual Income (k$)'), plt.ylabel('Spending Score (1-100)') 
plt.title('Annual Income vs Spending Score w.r.t Gender')
plt.legend()
plt.show()

This graph clearly shows a cluster of customers with an annual income of 40-70k US dollars and a spending score of 40-60.

## Clustering using K-Means

#### The Elbow Method

A fundamental step for a clustering algorithm such as K-Means is to determine the optimal number of clusters, k, into which the data should be grouped. The Elbow Method is one of the most popular ways to determine this value. It relies on one of two measures:
1. Distortion: the average of the squared distances from each sample to the center of its cluster. Typically, the Euclidean distance metric is used.
2. Inertia: the sum of squared distances of samples to their closest cluster center.

We iterate k from 1 to 10, record the inertia for each value, and look for the "elbow": the value of k beyond which adding more clusters no longer reduces the inertia substantially.
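
The two measures differ only in whether the squared distances are summed or averaged. The sketch below computes both with NumPy for a fixed set of cluster centers. The function name `inertia_and_distortion` and the toy points are assumptions made for illustration; in the notebook itself the inertia comes from the `inertia_` attribute of the fitted `KMeans` model.

```python
import numpy as np

def inertia_and_distortion(points, centers):
    """Return (sum, mean) of squared distances from each point to its nearest center."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared Euclidean distance from every point to every center,
    # shape (n_points, n_centers)
    sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = sq_dist.min(axis=1)          # squared distance to the closest center
    return float(nearest.sum()), float(nearest.mean())

# Toy data: each point lies exactly 1 unit from its nearest center
toy_points = [[0, 0], [0, 2], [10, 0], [10, 2]]
toy_centers = [[0, 1], [10, 1]]
inertia_value, distortion_value = inertia_and_distortion(toy_points, toy_centers)
print(inertia_value, distortion_value)     # 4.0 1.0
```

Because distortion is just inertia divided by the number of samples, both curves have the same shape and point to the same elbow.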

#### Segmentation using Age and Spending Score

In [None]:
X1 = data[['Age' , 'Spending Score (1-100)']].iloc[: , :].values
inertia = []
for n in range(1 , 11):
    algorithm = (KMeans(n_clusters = n ,init='k-means++', n_init = 10 ,max_iter=300, 
                        tol=0.0001,  random_state= 111  , algorithm='elkan') )
    algorithm.fit(X1)
    inertia.append(algorithm.inertia_)

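We select the number of clusters from the inertia curve. Inertia is the sum of squared distances between the data points and their nearest centroid, so lower is better. Because it always falls as k grows, we look for the point where the decrease levels off rather than for the minimum.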
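
The elbow is usually read off the plot by eye. As a cross-check, it can be located programmatically. The sketch below uses one simple heuristic, which is not part of the original notebook: rescale the curve to the unit square and pick the k whose point lies furthest from the straight line joining the first and last points. The function name `elbow_by_chord_distance` and the inertia values are made up for illustration.

```python
import numpy as np

def elbow_by_chord_distance(ks, inertias):
    """Return the k whose point is furthest from the line joining the curve's endpoints."""
    ks = np.asarray(ks, dtype=float)
    inertias = np.asarray(inertias, dtype=float)
    # Rescale both axes to [0, 1] so that neither dominates the distance
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (inertias - inertias.min()) / (inertias.max() - inertias.min())
    # Perpendicular distance of every point from the chord (first point -> last point)
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    dist = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(ks[np.argmax(dist)])

# Made-up inertia values that flatten out after k = 4
example_ks = list(range(1, 9))
example_inertias = [1000, 500, 250, 100, 90, 85, 82, 80]
elbow = elbow_by_chord_distance(example_ks, example_inertias)
print(elbow)  # 4
```

In the notebook, the same function could be called with `np.arange(1, 11)` and the `inertia` list. The result is a hint, not a verdict: when the curve bends gradually, several neighbouring values of k can be equally defensible.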

In [None]:
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 11) , inertia , 'o')
plt.plot(np.arange(1 , 11) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters') , plt.ylabel('Inertia')
plt.show()

In [None]:
algorithm = (KMeans(n_clusters = 4 ,init='k-means++', n_init = 10 ,max_iter=300, 
                        tol=0.0001,  random_state= 111  , algorithm='elkan') )
algorithm.fit(X1)
labels1 = algorithm.labels_
centroids1 = algorithm.cluster_centers_

In [None]:
h = 0.02
x_min, x_max = X1[:, 0].min() - 1, X1[:, 0].max() + 1
y_min, y_max = X1[:, 1].min() - 1, X1[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()]) 

In [None]:
plt.figure(1 , figsize = (15 , 7) )
plt.clf()
Z = Z.reshape(xx.shape)
plt.imshow(Z , interpolation='nearest', 
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap = plt.cm.Pastel2, aspect = 'auto', origin='lower')

plt.scatter( x = 'Age' ,y = 'Spending Score (1-100)' , data = data , c = labels1 , 
            s = 200 )
plt.scatter(x = centroids1[: , 0] , y =  centroids1[: , 1] , s = 300 , c = 'blue' , alpha = 0.5)
plt.ylabel('Spending Score (1-100)') , plt.xlabel('Age')
plt.show()

Looking at the clustering plot of customer age against spending score, we have grouped the customers into 4 categories based on our intuition: variety-seeking customers (blue), loyal customers (grey), senior-citizen target customers (yellow) and young target customers (green). With these groups in hand, we can design different marketing strategies and policies to raise the customers' spending scores in the shopping complex.

#### Segmentation using Annual Income and Spending Score

In [None]:
X2 = data[['Annual Income (k$)' , 'Spending Score (1-100)']].iloc[: , :].values
inertia = []
for n in range(1 , 11):
    algorithm = (KMeans(n_clusters = n ,init='k-means++', n_init = 10 ,max_iter=300, 
                        tol=0.0001,  random_state= 111  , algorithm='elkan') )
    algorithm.fit(X2)
    inertia.append(algorithm.inertia_)

In [None]:
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 11) , inertia , 'o')
plt.plot(np.arange(1 , 11) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters') , plt.ylabel('Inertia')
plt.show()

In [None]:
algorithm = (KMeans(n_clusters = 5 ,init='k-means++', n_init = 10 ,max_iter=300, 
                        tol=0.0001,  random_state= 111  , algorithm='elkan') )
algorithm.fit(X2)
labels2 = algorithm.labels_
centroids2 = algorithm.cluster_centers_

In [None]:
h = 0.02
x_min, x_max = X2[:, 0].min() - 1, X2[:, 0].max() + 1
y_min, y_max = X2[:, 1].min() - 1, X2[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z2 = algorithm.predict(np.c_[xx.ravel(), yy.ravel()]) 

In [None]:
plt.figure(1 , figsize = (15 , 7) )
plt.clf()
Z2 = Z2.reshape(xx.shape)
plt.imshow(Z2 , interpolation='nearest', 
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap = plt.cm.Pastel2, aspect = 'auto', origin='lower')

plt.scatter( x = 'Annual Income (k$)' ,y = 'Spending Score (1-100)' , data = data , c = labels2 , s = 200 )
plt.scatter(x = centroids2[: , 0] , y =  centroids2[: , 1] , s = 300 , c = 'blue' , alpha = 1)
plt.ylabel('Spending Score (1-100)') , plt.xlabel('Annual Income (k$)')
plt.show()

The mall customers can be broadly divided into 5 groups based on their income and their purchases in the mall.

Cluster 4 (yellow) contains customers with low annual income and low spending scores. This is quite reasonable, as people with low salaries tend to buy less and spend carefully. The shops and the mall will be least interested in this cluster.

Cluster 2 (green) contains customers with low income but higher spending scores. For some reason these customers buy products more often despite their low income, perhaps because they are more than satisfied with the mall's services. The shops and the mall might not target them very actively, but will still not want to lose them.

Cluster 5 (pink) contains customers with average income and an average spending score. They will not be the prime targets of the shops or the mall either, but they will be considered, and other data analysis techniques may be used to increase their spending score.

Cluster 1 (grey) contains customers with high income and high spending scores. This is the ideal case for the mall and the shops, as these customers are the prime source of profit. They might be regular customers who are convinced by the mall’s facilities.

Cluster 3 (orange) contains customers with high income but low spending scores, which is interesting. These customers may be unsatisfied or unhappy with the mall’s services. They can be prime targets for the mall, as they have the potential to spend money, so the mall authorities will try to add new facilities that attract them and meet their needs.
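
Descriptions such as "high income, low spending score" are read off the plot, and the cluster numbers and colours can change between runs. A small per-cluster summary makes those descriptions checkable. The sketch below is an illustrative NumPy helper; the name `profile_clusters` and the toy values are assumptions, not part of the notebook.

```python
import numpy as np

def profile_clusters(features, labels):
    """Return {label: {'size': n, 'mean': per-feature mean}} for every cluster."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    profile = {}
    for label in np.unique(labels):
        members = features[labels == label]          # rows assigned to this cluster
        profile[int(label)] = {"size": len(members), "mean": members.mean(axis=0)}
    return profile

# Made-up (annual income, spending score) pairs with made-up cluster labels
toy_features = [[15, 10], [20, 20], [80, 90], [90, 80], [85, 15]]
toy_labels = [0, 0, 1, 1, 2]
profile = profile_clusters(toy_features, toy_labels)
for label, stats in profile.items():
    print(label, stats["size"], stats["mean"])
```

In the notebook, calling `profile_clusters(X2, labels2)` would give the mean income and mean spending score of each of the 5 clusters, which can then be matched to the descriptions above.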

#### Segmentation using Age, Annual Income and Spending Score

In [None]:
X3 = data[['Age' , 'Annual Income (k$)' ,'Spending Score (1-100)']].iloc[: , :].values
inertia = []
for n in range(1 , 11):
    algorithm = (KMeans(n_clusters = n ,init='k-means++', n_init = 10 ,max_iter=300, 
                        tol=0.0001,  random_state= 111  , algorithm='elkan') )
    algorithm.fit(X3)
    inertia.append(algorithm.inertia_)

In [None]:
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 11) , inertia , 'o')
plt.plot(np.arange(1 , 11) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters') , plt.ylabel('Inertia')
plt.show()

In [None]:
algorithm = (KMeans(n_clusters = 6 ,init='k-means++', n_init = 10 ,max_iter=300, 
                        tol=0.0001,  random_state= 111  , algorithm='elkan') )
algorithm.fit(X3)
labels3 = algorithm.labels_
centroids3 = algorithm.cluster_centers_

In [None]:
# Attach the labels to a copy so the original DataFrame keeps only its own columns
plot_df = data.copy()
plot_df['label3'] = labels3
trace1 = go.Scatter3d(
    x= plot_df['Age'],
    y= plot_df['Spending Score (1-100)'],
    z= plot_df['Annual Income (k$)'],
    mode='markers',
    marker=dict(
        color = plot_df['label3'],
        size= 20,
        line=dict(
            color= plot_df['label3'],
            width= 12
        ),
        opacity=0.8
    )
)
# Use a separate name for the trace list; reusing `data` would overwrite the DataFrame
traces = [trace1]
layout = go.Layout(
    title= 'Clusters',
    scene = dict(
            xaxis = dict(title  = 'Age'),
            yaxis = dict(title  = 'Spending Score'),
            zaxis = dict(title  = 'Annual Income')
        )
)
fig = go.Figure(data=traces, layout=layout)
py.offline.iplot(fig)

## Accuracy

Because K-Means is an unsupervised learning technique, there are no ground-truth labels against which to measure accuracy directly.
Instead, we check how consistent the clusters are by comparing them with those obtained from other techniques: Agglomerative Clustering, DBSCAN and Mean Shift.

In [None]:
fig = plt.figure(figsize=(20, 15))

# Fit every model on the same three numeric features. Labels are kept out of X so
# that one model's output is never fed to the next model as an extra feature.
X = org_data[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

def plot_clusters(ax, labels, title):
    # Size the palette to the number of labels found (DBSCAN also emits -1 for noise)
    palette = sns.color_palette('hls', len(np.unique(labels)))
    # x and y must be passed by keyword in recent seaborn versions
    sns.scatterplot(x=X['Annual Income (k$)'], y=X['Spending Score (1-100)'],
                    hue=labels, style=labels, palette=palette, s=60, ax=ax)
    ax.set_title(title)

##### KMeans #####
km5 = KMeans(n_clusters=5, n_init=10, random_state=111).fit(X)
plot_clusters(fig.add_subplot(221), km5.labels_, 'KMeans with 5 Clusters')

##### Agglomerative Clustering #####
agglom = AgglomerativeClustering(n_clusters=5, linkage='average').fit(X)
plot_clusters(fig.add_subplot(222), agglom.labels_, 'Agglomerative with 5 Clusters')

##### DBSCAN #####
db = DBSCAN(eps=11, min_samples=6).fit(X)
plot_clusters(fig.add_subplot(223), db.labels_, 'DBSCAN with epsilon 11, min samples 6')

##### MEAN SHIFT #####
bandwidth = estimate_bandwidth(X, quantile=0.1)
ms = MeanShift(bandwidth=bandwidth).fit(X)
plot_clusters(fig.add_subplot(224), ms.labels_, 'MeanShift')

plt.tight_layout()
plt.show()

As the four plots above show, the techniques produce about the same number of clusters with a similar structure. This suggests that the K-Means segmentation is consistent and not an artifact of a single algorithm.
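
A visual comparison can be backed by a number. One option, not used in the original notebook, is the Rand index: the fraction of sample pairs on which two clusterings agree, either by placing both samples in the same cluster or by placing them in different clusters. The sketch below is a minimal NumPy version; the function name `rand_index` and the toy label lists are assumptions made for illustration.

```python
import numpy as np

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree (1.0 = identical grouping)."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    # same_x[i, j] is True when samples i and j share a cluster in clustering x
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(len(a), k=1)   # count each unordered pair once
    return float((same_a[upper] == same_b[upper]).mean())

# Toy labels: `second` is `first` with the cluster ids renamed, `third` merges groups
first = [0, 0, 1, 1, 2, 2]
second = [1, 1, 0, 0, 2, 2]
third = [0, 0, 0, 1, 1, 1]
print(rand_index(first, second))           # 1.0, since cluster ids do not matter
print(round(rand_index(first, third), 3))  # 0.667
```

In the notebook, the labels of two fitted models (for example `km5.labels_` and `agglom.labels_`) could be compared this way. Note that this sketch treats the DBSCAN noise label `-1` as an ordinary cluster. scikit-learn offers a chance-corrected variant as `sklearn.metrics.adjusted_rand_score`.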

## Conclusion

Based on the results of the analyses above, we have formulated the following strategies:

##### From the first segmentation (Age and Spending Score):

1. Customers in the grey cluster are brand loyalists. They needn't be directly targeted through active campaigns. They can be retained through brand loyalty programs and benefits.
2. Customers in the blue cluster are moody buyers. They shouldn't be directly targeted, and their purchasing tendency should be treated as specific to them.
3. Customers in the green and yellow clusters, though they have a decent spending score, have scope for being targeted with campaigns. The customers in the yellow cluster are older and need a campaign inclined more towards brick-and-mortar marketing. The customers in the green cluster can be targeted more with digital campaigns. The transitional area between the two age groups can be addressed with a suitable mix of both types of marketing.

##### From the second segmentation (Annual Income and Spending Score):

1. Customers in the grey cluster are star customers. They have a high annual income along with a high spending score. Brand loyalty programs would help to retain them.
2. Customers in the light green cluster shouldn't be targeted with any campaigns, since they lack surplus purchasing power and may be loyal to some other brand, or may be utilitarian buyers.
3. Customers in the dark green cluster are loyal, but should be retained through both benefits and brand loyalty programs along with active targeting campaigns (featuring the cost benefit of products), since they can be easily swayed by competitors who offer better perceived value.
4. Customers in the orange cluster have the capacity to pay and can be won over to our brand through new-customer benefits or brand loyalty programs that convey perceived value. They must also be targeted through campaigns that encourage product trial and use.
5. Customers in the blue cluster form the primary target. They must be actively targeted through campaigns (featuring cost benefit) so that they choose our brand over others.

Thus, a primary target has been identified: customers in the age group of 16-40 with a moderate household income and a moderate spending score. Secondary targets and non-target areas have also been identified. Information known before the analyses, such as the gender ratio of customers, can also be used to optimise the brand's campaigns depending on budget and scale.