# Introduction to Clustering Algorithms and its Use Cases

##### This analysis is part of this Medium blog post: https://medium.com/delvify/introduction-to-clustering-algorithms-and-its-use-cases-35c1655c91e7



## 1. Introduction
Have you ever been in the middle of a conversation and suddenly forgotten what an object is called? If you speak multiple languages and constantly switch between them, you have probably dealt with this situation. At that point you try to recall the word and, let's admit it, most of the time you fail and end up describing the object by associating it with something similar.

In doing so, we look for similarities or associations between this object and other perceivably similar objects in order to group them together. This is essentially what a clustering algorithm does.

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other clusters. - [1]

### But, what if I don't know what that object is to begin with?
When you don't have labels to work with, you are dealing with an unsupervised problem. If you have labels for only some of the instances, it is a semi-supervised problem. Unlike a definitive label, a cluster has no single precise definition, which is why there are many algorithms that differ significantly in how they define clusters.

Some of the ways a cluster can be defined are as follows:
1. Distance: How far are the instances from each other?
2. Centroids: How far is an instance from the center of each cluster?
3. Density: How dense is each cluster?
4. Distribution: How well do the instances fit the distribution, and hence the shape, assumed for each cluster?

When membership is definitive, in the sense that an instance either belongs to a cluster or does not, the clustering is termed hard clustering. When an instance can belong to a cluster to a certain degree, it is termed soft clustering. The latter provides a score for each instance per cluster. This score could be based on a distance measure or on a similarity measure such as the Gaussian radial basis function (RBF).
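
To make the difference concrete, here is a minimal sketch in NumPy. The instances, the two cluster centers and the bandwidth `gamma` are made-up values for illustration; they do not come from the mall dataset.

```python
import numpy as np

# Hypothetical 2-D instances and two cluster centers, made up for illustration.
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [5.0, 5.0]])
centers = np.array([[1.0, 1.5], [8.0, 8.0]])

# Euclidean distance from every instance to every center, shape (n_samples, n_clusters).
distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)

# Hard clustering: each instance gets exactly one label, the nearest center.
hard_labels = distances.argmin(axis=1)

# Soft clustering: one score per instance per cluster.
# Here the score is a Gaussian RBF similarity, exp(-gamma * distance^2);
# gamma is an assumed bandwidth, not a value from the original analysis.
gamma = 0.1
soft_scores = np.exp(-gamma * distances ** 2)

print(hard_labels)
print(soft_scores.round(3))
```

The last instance sits between the two centers, so its hard label hides the fact that its similarity to both clusters is low, which the soft scores make visible.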

## 2. Exploratory Data Analysis
Without further ado, let's analyze a dataset to get a better understanding of clustering analysis: the Mall customer segmentation data! - [2]

In [None]:
import pandas as pd 
import numpy as np

data = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')
data.head()

In [None]:
data.info()

In [None]:
data.describe()

The spending score is based on parameters such as customer engagement and purchasing behavior, and it ranges from 1 to 100. The dataset has 200 customers and no missing values in any field. Before we begin any clustering analysis, it's always a good idea to understand the shape and distribution of the features you are dealing with. Let's take a look at the pairplot of this dataset.

In [None]:
import seaborn as sns 
import matplotlib.pyplot as plt 

data.columns = ['CustomerID', 'Gender', 'Age', 'AnnualIncome', 'SpendingScore']
plot = sns.pairplot(data, corner = True)
plot.fig.suptitle("Pairplot of the data", y = 1, fontsize = 20) 
plt.show()

What can we infer from these plots? There seem to be interesting patterns or groupings between spending score and annual income, and between age and annual income, as expected. Spending score seems to follow a roughly normal distribution, whereas annual income is skewed. Do the scales of these features differ by a large amount? Not really, although feature scaling can still benefit a model. When choosing a feature scaling method, we need to make sure that outliers do not have a significant effect on it.
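
As a hedged illustration of why outliers matter for scaling, the sketch below compares scikit-learn's `StandardScaler` and `RobustScaler` on a handful of made-up income values with one extreme entry. The numbers are hypothetical and are not taken from the mall dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical annual incomes (k$) with one extreme value; not taken from the dataset.
income = np.array([[15.0], [40.0], [60.0], [62.0], [78.0], [500.0]])

standard = StandardScaler().fit_transform(income)  # centers on the mean, scales by the standard deviation
robust = RobustScaler().fit_transform(income)      # centers on the median, scales by the interquartile range

# Range covered by the five typical values after each scaling.
print(np.ptp(standard[:-1]), np.ptp(robust[:-1]))
```

With the standard scaler, the single extreme value inflates the standard deviation and squeezes the typical incomes into a narrow band; the robust scaler keeps them spread out.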

But what about gender? Gender is a categorical variable, so what if we visualize the same plot split by gender?

In [None]:
plot = sns.pairplot(data, hue = 'Gender', corner = True)
plot.fig.suptitle("Pairplot of the data by Gender", y = 1.05, fontsize = 20) 
plt.show()

Now the patterns change visibly. However, does gender really have a significant impact in this dataset? Let's take a look at the histogram of age by gender to see whether there is any significant relationship.

In [None]:
sns.histplot(data = data, x = 'Age', hue = 'Gender', bins = 15, kde = True, multiple = 'dodge').set_title('Histogram analysis for Age by Gender')
plt.show()

As you can see, age seems to have a significant relationship with the dataset. But how do we visualize this? Age is a numerical field, so to visualize it we take a page from stratified sampling and form multiple strata without affecting the distribution too much.
Based on this distribution, we can create strata to get a clear picture of how many customers fall into each one. As shown in the plot below, I used 15–30, 30–45, 45–60 and above 60 as the strata. The two plots follow roughly the same proportions within each bin.

In [None]:
data['AgeRange'] = pd.cut(data["Age"],bins=[0, 15, 30, 45, 60, np.inf],labels=['<15', '15-30', '30-45', '45-60', '>60'])

fig, axes = plt.subplots(1, 2, figsize = (13, 4))
fig.suptitle('Strata Comparison for Age and Age Range')

sns.histplot(ax=axes[0], x='Age', data = data, hue = 'Gender', multiple = 'dodge')
axes[0].set_title('Age')
sns.histplot(ax=axes[1], x='AgeRange', data = data, hue = 'Gender', multiple = 'dodge')
axes[1].set_title('AgeRange')

plt.show()

Now that we have created the strata, let's visualize the impact of age range on this dataset using the pairplot.

In [None]:
plot = sns.pairplot(data, hue = 'AgeRange', corner = True)
plot.fig.suptitle("Pairplot of the data by Age Range", y = 1.05, fontsize = 20) 
plt.show()

This plot already reveals the potential of cluster analysis! If we look closely at the spending score vs annual income plot, the cluster formations, although vague at this point, are clearly visible. Let's take one more look at spending scores by age range and gender, and then it's time to model our clusters!

In [None]:
fig, axes = plt.subplots(1, 2, figsize = (13, 4))
fig.suptitle('Spending Scores by Age Range and Gender')

sns.barplot(ax=axes[0], data = data, x='AgeRange', order = ['15-30', '30-45','45-60', '>60'] , y = 'SpendingScore', hue = 'Gender')
axes[0].set_title('Age Range vs Spending Score by Gender')
sns.barplot(ax=axes[1], data = data, x='Gender', y = 'SpendingScore', hue = 'AgeRange')
axes[1].set_title('Gender vs Spending Score by Age Range')
            
plt.show()

## 3. k-Means
Before heading into the analysis, let's try to capture the idea behind this beautiful and simple algorithm. It works as follows:
1. You provide the number of clusters, called 'k' (hence the name k-Means), and the algorithm initializes k centroids randomly.
2. Each instance is assigned to the cluster whose centroid is nearest according to the distance measure.
3. Based on step 2, each centroid is updated to the mean of the instances assigned to its cluster.
4. Steps 2 and 3 are repeated until the centroids stop changing.
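
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation that PyCaret or scikit-learn use: the function name `kmeans`, its parameters and the synthetic data are assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means (Lloyd's algorithm) on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct instances as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign every instance to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned instances
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs, made up for illustration.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids.round(2))
```

On data this cleanly separated the result does not depend much on the random start; on real data it can, which is why library implementations usually try several initializations.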

Using the PyCaret package in Python, let's run the k-Means model and see what the results look like. When we set up the data, we can pass an optional parameter that controls whether profiling is performed. The profiling results are part of the Kaggle notebook, if you would like to explore more information about the data. The profiling also confirms, based on the correlation matrices, that gender is not significant in this dataset. By default the model is assigned k = 4. Let's visualize how this model looks on our dataset:

In [None]:
pip install pycaret

In [None]:
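from pycaret.clustering import *
# Remove stray whitespace from the column names before passing the data to PyCaret
data.columns = data.columns.str.strip()
# profile = True also generates a data profiling report.
# Note: `silent` is a PyCaret 2.x argument; newer releases may not accept it, in which case drop it.
classifier = setup(data, silent = True, preprocess = True, profile = True)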

In [None]:
kmeans = create_model('kmeans')
plot_model(kmeans, plot = 'cluster')

From this visualization, cluster 2 seems to be well separated from the other clusters. However, cluster 1 appears to be spread out, and the margins between cluster 1, cluster 3 and cluster 4 seem quite thin. Do you also notice an instance in the 3D plot that looks like an anomaly? As you can see, this algorithm depends heavily on the value of k, which means you have to run it several times to arrive at a solution you are comfortable with.

### How do I know what model to be comfortable with if I don't have labels to begin with?

As with any analysis, we need to define a performance measure, or a way to evaluate the model, in order to know how it would perform for our use cases. We definitely want to limit any surprises after we deploy the model! The algorithm is guaranteed to provide a solution. But is it the optimal solution? Let's find out by understanding how to evaluate a cluster model.

In [None]:
evaluate_model(kmeans)

In [None]:
plot_model(kmeans, plot = 'tsne')

In [None]:
plot_model(kmeans, plot = 'elbow')

In [None]:
plot_model(kmeans, plot = 'silhouette')

### Inertia 
Inertia is the sum of squared distances between each instance and its closest centroid (dividing by the number of instances gives the mean squared distance). The lower this value, the denser the clusters. But is it always a good measure? As the number of clusters increases, inertia keeps decreasing, reaching zero when every instance is its own cluster. Hence inertia depends heavily on the value of k as well. How do we decide on a value for k?
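
As a small worked example, the sketch below computes inertia as the sum of squared distances to the closest centroid, which is how scikit-learn reports it (dividing by the number of instances would give the mean). The helper name `inertia` and the toy data are assumptions made for illustration.

```python
import numpy as np

def inertia(X, centroids):
    """Sum of squared distances from each instance to its closest centroid."""
    sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.min(axis=1).sum()

# Hypothetical 1-D toy data with two obvious groups.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

one_cluster = inertia(X, np.array([[6.5]]))           # centroid = overall mean
two_clusters = inertia(X, np.array([[2.0], [11.0]]))  # centroids = group means
six_clusters = inertia(X, X)                          # every instance is its own centroid

print(one_cluster, two_clusters, six_clusters)  # 125.5 4.0 0.0
```

Going from one to two clusters captures the real structure, while going all the way to six drives inertia to zero without telling us anything useful. That is why inertia alone cannot pick k.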

### Elbow plot
Plotting inertia against the number of clusters gives a much more informative picture. We can see that there is an elbow at k = 5. Beyond k = 5 the distortion score flattens out, which suggests that k = 5 could be a good choice. But this approach is somewhat brute force and computationally expensive, as you have to run the model multiple times with different values of k.
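
The brute-force nature of the elbow method is easy to see when written out by hand. The sketch below uses scikit-learn's `KMeans` directly rather than PyCaret, and fits it on synthetic blobs from `make_blobs` as a stand-in for the customer features; both choices are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the customer features: 5 blobs in 2-D (not the mall data).
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.8, random_state=0)

ks = range(1, 11)
# One full model fit per candidate k: this loop is the "brute force" part.
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

for k, value in zip(ks, inertias):
    print(f"k={k:2d}  inertia={value:10.1f}")
```

Plotting `inertias` against `ks` gives the elbow plot; the elbow is the value of k after which the curve stops dropping sharply.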

### Silhouette score
A better measure is the mean silhouette coefficient over all instances. For each instance, let a be the mean distance to the other instances in the same cluster (the mean intra-cluster distance) and let b be the mean distance to the instances of the nearest other cluster. The silhouette coefficient is then `(b - a) / max(a, b)`. It varies from -1 to 1 and can be interpreted as follows:
1. Value ~ 1 → The instance is well packed within its defined cluster 
2. Value ~ 0 → The instance is closer to the boundary of its defined cluster 
3. Value ~ -1 → The instance most probably does not belong in its defined cluster 
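
To see where these values come from, here is a minimal NumPy sketch of the per-instance computation, with a the mean distance to the other members of the instance's own cluster and b the mean distance to the members of the nearest other cluster. The function name and the toy data are assumptions for illustration; in practice scikit-learn's `silhouette_samples` does the same job.

```python
import numpy as np

def silhouette_coefficients(X, labels):
    """Silhouette coefficient (b - a) / max(a, b) for every instance."""
    distances = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # convention: an instance alone in its cluster scores 0
        # a: mean distance to the other members of the instance's own cluster
        a = distances[i, same].sum() / (same.sum() - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(distances[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Hypothetical 1-D example: two tight groups, plus one instance given a questionable label.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [6.5]])
labels = np.array([0, 0, 0, 1, 1, 0])
scores = silhouette_coefficients(X, labels)
print(scores.round(2))
```

The last instance (6.5) is labelled as cluster 0 although it is closer on average to cluster 1, so its coefficient comes out negative. Averaging `scores` gives the silhouette score of the whole clustering.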

When we plot every instance's silhouette coefficient, grouped by cluster, we get a silhouette diagram, which gives a much more informative evaluation of the clusters. Let's take a look at the significance of the various elements of this plot.

* Height: Each cluster has a knife-shaped formation. Its height reflects the number of instances in that cluster.
* Width: The width of each formation shows the silhouette coefficients of the instances in the cluster. The wider it is, the better.
* Dashed line: The dashed line is the average silhouette coefficient. If the majority of the instances in a cluster have a coefficient lower than this line, we can confidently say that the cluster has room for improvement, as its instances are too close to other clusters. Hence the majority of the instances should extend beyond this line, closer to 1.

In this manner, the silhouette score and its diagram are a good guide for choosing an optimal k.
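
One way to use the score for choosing k is to compute it for a range of candidates and compare. The sketch below does this with scikit-learn's `KMeans` and `silhouette_score` on synthetic blobs; the data and the candidate range are assumptions for illustration, not the mall data or PyCaret's own procedure.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with 4 blobs (not the mall data).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=1)

scores = {}
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The highest average score is only a starting point; the silhouette diagram for that k should still be checked for clusters that fall mostly below the average line.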

Based on the above, our model has an average silhouette score of 0.4, which shows room for improvement, as also suggested by the elbow plot with k = 5. Most of the instances do not extend beyond the dashed line. Cluster 1 seems to perform the worst: some of its instances tend towards -1, which indicates that they probably belong to neighboring clusters. Cluster 1 has a lot of room for improvement.
As suggested by these plots, let's re-run the algorithm with k = 5. Let's take a look at the 3D and 2D plots first.

In [None]:
kmeans_optimized = create_model('kmeans', num_clusters = 5)
plot_model(kmeans_optimized, plot = 'cluster')

In [None]:
plot_model(kmeans_optimized, plot = 'tsne')

In [None]:
plot_model(kmeans_optimized, plot = 'silhouette')

Visually, the cluster grouping seems to perform better with k = 5. However, there seem to be some marginal instances between cluster 1 and cluster 3. What about the silhouette diagram?

Even though the elbow plot suggested k = 5, the model seems to perform worse than before, as the majority of instances do not extend beyond the average silhouette score. Cluster 3 shows a tendency towards -1, which suggests that some of its instances may belong to neighboring clusters. There is definitely room for improvement. As you probably guessed, the value of k, the number of instances within each cluster and the shape of the clusters all influence how well k-Means fits the data.

## 3. Beyond k-Means

What are the limitations of k-Means? 
* Dependency on k and on initial value assignment: The algorithm depends heavily on the value of k, which means we need to run it multiple times in order to evaluate and choose the best-fitting model. Additionally, the model depends on the initial assignment of the centroids, which is randomized.
* Numerical influence: k-Means can only handle numerical data. Although the purpose of this analysis was to introduce clustering analysis and ways to evaluate it, as with any algorithm, appropriate data preprocessing, categorical variable handling and feature engineering can improve the performance of the model.
* Shape of the cluster: k-Means is heavily shape dependent. The algorithm assumes that we are mainly dealing with spherical formations. When the shape of the data and its distribution do not align with this assumption, we need to look for other algorithms better suited to the shape of the clusters. But k-Means acts as a great first step in understanding the possible underlying shape of each cluster.
* Number of members: The number of members within each cluster matters for this algorithm. It roughly assumes that each cluster has the same number of instances, which may be too idealistic for many real-world use cases.
* Dimensions: We worked with very few features. There may be numerous other factors impacting the spending score. In higher dimensions, distance-based similarity measures tend to saturate: the distances between instances converge towards a constant value, which can heavily bias your model.
* Outliers: As we saw in our analysis above, outliers can have a significant impact on a centroid, as they drag it and distort the shape of the cluster. Sometimes, outliers may even get their own cluster!
* Convergence to local minima: k-Means is guaranteed to converge, but possibly to a local minimum rather than the global minimum. Hence the solution will not always be optimal.
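
The dependency on initialization and the risk of local minima can be observed directly. The sketch below, using scikit-learn's `KMeans` on synthetic blobs (an assumed stand-in for real data), compares single runs from purely random starts with a model that tries several k-means++ initializations and keeps the best one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (not the mall data).
X, _ = make_blobs(n_samples=400, centers=6, cluster_std=1.0, random_state=3)

# One purely random initialization per run: each run may stop in a different local minimum.
single_runs = [
    KMeans(n_clusters=6, init="random", n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(10)
]

# Several k-means++ initializations; scikit-learn keeps the run with the lowest inertia.
multi = KMeans(n_clusters=6, init="k-means++", n_init=10, random_state=0).fit(X)

print(np.round(single_runs, 1))
print(round(multi.inertia_, 1))
```

If the single runs report noticeably different inertia values, some of them have settled in a local minimum. Repeating the initialization reduces that risk but does not guarantee the global minimum.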

## 4. How can I use clustering analysis in my business use cases?
1. Exploratory data analysis - What do we do with this data? We ask this question whenever we deal with a new dataset. Clustering is an integral part of analyzing data, especially in the exploratory data analysis stage. It can help you understand the nuances between your features and give you a better idea of how to take it from there. After all, any good analysis should increase the number of questions to ask!
2. Customer segmentation - As shown in the example above, understanding buyer personas and grouping them can help attain various growth objectives, be it marketing, audience retargeting, demand planning or even customer retention. When a customer lands on your page, for instance, we can map them to a cluster based on their interactions and make strategic decisions on how to guide them towards your target KPI. Whether that means improving recommendations for an uplift in conversion or engaging customers better to achieve a higher retention rate, clustering analysis comes to the rescue!
3. Visual search - Have you ever been taking a stroll in the park and noticed some amazing shoes? You suspect that they are from your favorite brand, so you snap a photo and search online, where you get all the similar-looking shoes, if not the very same pair, to choose from. Clustering algorithms can group images into clusters based on their similarity and, when shown a new image, simply try to identify which group the image resembles most.
4. Fraud detection - An anomaly is anything that deviates from what we define as standard, or anything that is not expected. When an instance shown to a cluster model behaves strangely, in the sense that it belongs to none of the clusters, it could be something to watch out for or something that simply needs your attention! Flagging these instances could potentially save you from danger. But make sure the instance is an anomaly and not a novelty (curious about this? More on that later!).
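
As a rough sketch of the fraud detection idea, one simple approach is to flag instances that lie unusually far from every centroid. The example below uses scikit-learn's `KMeans` on synthetic data; the blob layout, the 99th-percentile threshold and the two new instances are all assumptions chosen for illustration, not a recommended production rule.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic "normal" behaviour: three blobs (made up for illustration).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6]],
                  cluster_std=0.5, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# transform() returns the distance from each instance to every centroid.
nearest = model.transform(X).min(axis=1)
# Assumed rule of thumb: flag anything farther out than the 99th percentile of the training distances.
threshold = np.percentile(nearest, 99)

# Two hypothetical new instances: one inside a cluster, one between all three.
new_instances = np.array([[0.2, -0.1], [3.0, 3.0]])
distances = model.transform(new_instances).min(axis=1)
flags = distances > threshold
print(flags)
```

The same fitted model also covers the segmentation and visual search cases: `model.predict(new_instances)` maps a new instance to its most similar cluster.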

## 5. Resources
[1] Cluster analysis, Wikipedia 
[2] Mall customer segmentation data, Vijay Choudhary, Kaggle 
[3] Clustering analysis, Annette Catherine Paul, Kaggle 
[4] Unsupervised learning techniques, Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, O'Reilly 
[5] k-Means advantages and disadvantages - Clustering in Machine Learning, Google developer courses