# Clustering Algorithms

There are many types of clustering algorithms.

Many algorithms use similarity or distance measures between examples in the feature space in an effort to discover dense regions of observations. As such, it is often good practice to scale data prior to using clustering algorithms.

Central to all of the goals of cluster analysis is the notion of the degree of similarity (or dissimilarity) between the individual objects being clustered. A clustering method attempts to group the objects based on the definition of similarity supplied to it.


The scikit-learn library provides a suite of different clustering algorithms to choose from.

A list of 10 of the more popular algorithms is as follows:

- Affinity Propagation
- Agglomerative Clustering
- BIRCH
- DBSCAN
- K-Means
- Mini-Batch K-Means
- Mean Shift
- OPTICS
- Spectral Clustering
- Mixture of Gaussians

Each algorithm offers a different approach to the challenge of discovering natural groups in data.

There is no best clustering algorithm, and no easy way to find the best algorithm for your data without using controlled experiments.

Source: https://machinelearningmastery.com/clustering-algorithms-with-python/

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from numpy import unique
from numpy import where
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
df = pd.read_csv("/kaggle/input/customer-segmentation-tutorial-in-python/Mall_Customers.csv")

df.head()

In [None]:
df.drop(["CustomerID"], axis = 1, inplace = True)

In [None]:
df.info()

**Encode the Categorical Feature, i.e. Gender**

In [None]:
encoder = LabelEncoder()
Gender_ec = encoder.fit_transform(df.iloc[:,0])

In [None]:
df["Gender"] = Gender_ec

In [None]:
df.head()
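
To make the encoding explicit, here is a minimal sketch of how `LabelEncoder` assigns integer codes. The `genders` list is a hypothetical stand-in for the Gender column, not the real data. The encoder sorts the labels, so with these two values "Female" becomes 0 and "Male" becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the Gender column (hypothetical values, not the real data).
genders = ["Male", "Female", "Female", "Male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(genders)

# classes_ lists the original labels; a label's position is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

Keeping the fitted `encoder` around also allows `encoder.inverse_transform` to recover the original labels later.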

**Scaling**

When you’re working with a distance-based model, it is important to scale the features so that they are centered around zero and have comparable variance. If a feature’s variance is orders of magnitude larger than the variance of the other features, that feature can dominate the distance computation, which is not something we want happening in our model.
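
A small sketch of why this matters, using hypothetical toy values (ages in years, incomes in dollars) rather than the mall data. The standardisation is written out by hand with NumPy; it subtracts the column mean and divides by the population standard deviation, which is also what `StandardScaler` does by default.

```python
import numpy as np

# Hypothetical toy customers: [age in years, income in dollars].
X = np.array([[25.0, 40_000.0],
              [60.0, 41_000.0],
              [26.0, 80_000.0]])

def euclidean(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Unscaled: the income column swamps the 35-year age gap.
raw_01 = euclidean(X[0], X[1])
raw_02 = euclidean(X[0], X[2])

# Standardise each column: subtract the mean, divide by the standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_01 = euclidean(Z[0], Z[1])
scaled_02 = euclidean(Z[0], Z[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

Before scaling, customer 0 looks far closer to customer 1 than to customer 2 purely because income is measured in large units; after scaling the two distances are of similar size.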

In [None]:
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

In [None]:
df1 = pd.DataFrame(data = scaled, columns = ["Gender", "Age", "Annual Income (k$)", "Spending Score (1-100)"])

In [None]:
df1.head()

**PCA for Dimensionality Reduction**

Given a collection of points in two, three, or higher dimensional space, a "best fitting" line can be defined as one that minimizes the average squared distance from a point to the line. The next best-fitting line can be similarly chosen from directions perpendicular to the first. Repeating this process yields an orthogonal basis in which different individual dimensions of the data are uncorrelated. These basis vectors are called principal components, and the related procedures are called principal component analysis (PCA).
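
The description above can be sketched with a singular value decomposition in NumPy. This is an illustration on hypothetical random data, not the code scikit-learn runs internally. It checks the two properties mentioned: the directions are orthogonal and the projected coordinates are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated 3-D data: the second column is mostly a copy of the first.
a = rng.normal(size=200)
X = np.column_stack([a, 0.8 * a + 0.2 * rng.normal(size=200), rng.normal(size=200)])

Xc = X - X.mean(axis=0)                 # centre each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]                     # first two principal directions (rows)
scores = Xc @ components.T              # data projected onto those directions

explained = (S ** 2) / np.sum(S ** 2)   # share of variance per component
cov_scores = np.cov(scores, rowvar=False)
print(explained)
print(cov_scores)
```

The off-diagonal entries of `cov_scores` are numerically zero, which is the "uncorrelated" property in practice.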

In [None]:
pca = PCA(n_components = 2)
df2 = pca.fit_transform(df1)

In [None]:
df2.shape

In [None]:
plt.scatter(df2[:, 0], df2[:, 1])

**KMeans Clustering and Marketeer Report**

The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares (defined below). This algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a large range of application areas in many different fields.

The k-means algorithm divides a set of $N$ samples $X$ into $K$ disjoint clusters $C$, each described by the mean $\mu_j$ of the samples in the cluster. The means are commonly called the cluster “centroids”; note that they are not, in general, points from $X$, although they live in the same space.

The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion:

$$\sum_{i=1}^{N} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right)$$

Inertia can be recognized as a measure of how internally coherent clusters are. It suffers from various drawbacks:

- Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters, or manifolds with irregular shapes.

- Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”). Running a dimensionality reduction algorithm such as Principal component analysis (PCA) prior to k-means clustering can alleviate this problem and speed up the computations.
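
The inertia criterion can be computed directly from its definition. The sketch below uses hypothetical toy points and hand-picked centroids; `inertia` is an illustrative helper name, not a scikit-learn function (a fitted `KMeans` model reports its own value as `inertia_`).

```python
import numpy as np

def inertia(X, centroids):
    """Sum over samples of the squared distance to the nearest centroid."""
    # Pairwise squared distances, shape (n_samples, n_centroids).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Hypothetical toy data: two tight groups on a line.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = np.array([[0.0, 1.0], [10.0, 1.0]])   # one centroid per group
bad = np.array([[5.0, 1.0], [5.0, 1.0]])     # both centroids in the middle

print(inertia(X, good))
print(inertia(X, bad))
```

Centroids placed inside the groups give a much lower value than centroids placed between them, which is what k-means tries to achieve.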

In [None]:
from sklearn.cluster import KMeans

plt.style.use('fivethirtyeight')  # set the style before drawing
model = KMeans(n_clusters = 5)
yhat = model.fit_predict(df2)  # cluster label for every row of df2
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)[0]  # row indices belonging to this cluster
    plt.scatter(df2[row_ix, 0], df2[row_ix, 1])
plt.title("Sklearn version of KMeans cluster")

In [None]:
print(model.labels_)

`model.labels_` gives the cluster assigned to each object, in the same order as the rows of the input.

In [None]:
# Nice Pythonic way to get the indices of the points for each corresponding cluster
mydict = {i: np.where(model.labels_ == i)[0] for i in range(model.n_clusters)}

# Transform this dictionary into list (if you need a list as result)
dictlist = []
for key, value in mydict.items():
    temp = [key, value]
    dictlist.append(temp)

In [None]:
# Each entry is [cluster id, array of row indices of the objects in that cluster]
dictlist[0]

In [None]:
## To get the array of our original encoded dataset 
df3 = df.values

In [None]:
## To get items from the original dataset, index the array with each cluster's row indices
columns = ["Gender", "Age", "Annual Income (k$)", "Spending Score (1-100)"]

# dictlist[k] is [cluster id, row indices], so dictlist[k][1] selects the rows of cluster k
cluster_1 = pd.DataFrame(df3[dictlist[0][1]], columns = columns)
cluster_2 = pd.DataFrame(df3[dictlist[1][1]], columns = columns)
cluster_3 = pd.DataFrame(df3[dictlist[2][1]], columns = columns)
cluster_4 = pd.DataFrame(df3[dictlist[3][1]], columns = columns)
cluster_5 = pd.DataFrame(df3[dictlist[4][1]], columns = columns)


In [None]:
## The objects in clusters
cluster_1

In [None]:
## Final Report of Cluster 1


print("*" * 75)
print("The Average age of Customers in cluster 1 is:")
print(cluster_1.Age.mean())
print("*" * 75)
print("The Number of Male(1) and female(0) customers in cluster 1 are:")
print(cluster_1["Gender"].value_counts())
print("*" * 75)
print("The Average annual income (in thousands of dollars) of Customers in cluster 1 is:")
print(cluster_1["Annual Income (k$)"].mean())
print("*" * 75)
print("The Mean, Median and Mode of spending Score of people in cluster 1 are:")
print(cluster_1["Spending Score (1-100)"].mean())
print(cluster_1["Spending Score (1-100)"].median())
print(cluster_1["Spending Score (1-100)"].mode().tolist())
print("*" * 75)

In [None]:
## Final Report of Cluster 2

print("*" * 75)
print("The Average age of Customers in cluster 2 is:")
print(cluster_2.Age.mean())
print("*" * 75)
print("The Number of Male(1) and female(0) customers in cluster 2 are:")
print(cluster_2["Gender"].value_counts())
print("*" * 75)
print("The Average annual income (in thousands of dollars) of Customers in cluster 2 is:")
print(cluster_2["Annual Income (k$)"].mean())
print("*" * 75)
print("The Mean, Median and Mode of spending Score of people in cluster 2 are:")
print(cluster_2["Spending Score (1-100)"].mean())
print(cluster_2["Spending Score (1-100)"].median())
print(cluster_2["Spending Score (1-100)"].mode().tolist())
print("*" * 75)

In [None]:
## Final Report of Cluster 3

print("*" * 75)
print("The Average age of Customers in cluster 3 is:")
print(cluster_3.Age.mean())
print("*" * 75)
print("The Number of Male(1) and female(0) customers in cluster 3 are:")
print(cluster_3["Gender"].value_counts())
print("*" * 75)
print("The Average annual income (in thousands of dollars) of Customers in cluster 3 is:")
print(cluster_3["Annual Income (k$)"].mean())
print("*" * 75)
print("The Mean, Median and Mode of spending Score of people in cluster 3 are:")
print(cluster_3["Spending Score (1-100)"].mean())
print(cluster_3["Spending Score (1-100)"].median())
print(cluster_3["Spending Score (1-100)"].mode().tolist())
print("*" * 75)

In [None]:
## Final Report of Cluster 4

print("*" * 75)
print("The Average age of Customers in cluster 4 is:")
print(cluster_4.Age.mean())
print("*" * 75)
print("The Number of Male(1) and female(0) customers in cluster 4 are:")
print(cluster_4["Gender"].value_counts())
print("*" * 75)
print("The Average annual income (in thousands of dollars) of Customers in cluster 4 is:")
print(cluster_4["Annual Income (k$)"].mean())
print("*" * 75)
print("The Mean, Median and Mode of spending Score of people in cluster 4 are:")
print(cluster_4["Spending Score (1-100)"].mean())
print(cluster_4["Spending Score (1-100)"].median())
print(cluster_4["Spending Score (1-100)"].mode().tolist())
print("*" * 75)

In [None]:
## Final Report of Cluster 5

print("*" * 75)
print("The Average age of Customers in cluster 5 is:")
print(cluster_5.Age.mean())
print("*" * 75)
print("The Number of Male(1) and female(0) customers in cluster 5 are:")
print(cluster_5["Gender"].value_counts())
print("*" * 75)
print("The Average annual income (in thousands of dollars) of Customers in cluster 5 is:")
print(cluster_5["Annual Income (k$)"].mean())
print("*" * 75)
print("The Mean, Median and Mode of spending Score of people in cluster 5 are:")
print(cluster_5["Spending Score (1-100)"].mean())
print(cluster_5["Spending Score (1-100)"].median())
print(cluster_5["Spending Score (1-100)"].mode().tolist())
print("*" * 75)
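
The five reports above repeat the same statistics per cluster. As an alternative sketch, a single pandas `groupby` can produce one summary table. The `customers` frame and `labels` list below are hypothetical stand-ins for the encoded dataset and the KMeans labels, and the output column names are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the encoded customer table and the KMeans labels.
customers = pd.DataFrame({
    "Gender": [1, 0, 0, 1, 0, 1],
    "Age": [20, 22, 45, 50, 47, 21],
    "Annual Income (k$)": [15, 17, 80, 85, 78, 16],
    "Spending Score (1-100)": [80, 75, 20, 25, 20, 90],
})
labels = [0, 0, 1, 1, 1, 0]

report = (
    customers.assign(cluster=labels)
    .groupby("cluster")
    .agg(
        n_customers=("Age", "size"),
        mean_age=("Age", "mean"),
        n_male=("Gender", "sum"),          # Gender is encoded as Male = 1
        mean_income_k=("Annual Income (k$)", "mean"),
        mean_score=("Spending Score (1-100)", "mean"),
        median_score=("Spending Score (1-100)", "median"),
    )
)
print(report)
```

With the real data, `df.assign(cluster=model.labels_)` would play the role of `customers.assign(cluster=labels)`.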

# 8 other Clustering Algorithms

**Affinity Propagation**

AffinityPropagation creates clusters by sending messages between pairs of samples until convergence. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability for one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen, and hence the final clustering is given.


Affinity Propagation can be interesting as it chooses the number of clusters based on the data provided. For this purpose, the two important parameters are the preference, which controls how many exemplars are used, and the damping factor which damps the responsibility and availability messages to avoid numerical oscillations when updating these messages.

The main drawback of Affinity Propagation is its complexity. The algorithm has a time complexity of the order $O(N^2 T)$, where $N$ is the number of samples and $T$ is the number of iterations until convergence. Further, the memory complexity is of the order $O(N^2)$ if a dense similarity matrix is used, but reducible if a sparse similarity matrix is used. This makes Affinity Propagation most appropriate for small to medium sized datasets.
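
To get a feel for the quadratic memory cost, here is a back-of-the-envelope sketch. It assumes the dense similarity matrix stores one 64-bit float (8 bytes) per pair of samples; `dense_similarity_bytes` is an illustrative helper and the sample counts are arbitrary examples.

```python
def dense_similarity_bytes(n_samples, bytes_per_value=8):
    """Memory for an n x n matrix, assuming 8 bytes (one 64-bit float) per value."""
    return n_samples * n_samples * bytes_per_value

for n in (200, 10_000, 100_000):
    gib = dense_similarity_bytes(n) / 2**30
    print(f"{n:>7} samples -> {gib:.3f} GiB")
```

A few hundred samples cost almost nothing, while a hundred thousand samples would already need tens of gigabytes for the matrix alone.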

In [None]:
from sklearn.cluster import AffinityPropagation
plt.style.use('fivethirtyeight')  # set the style before drawing
model = AffinityPropagation(damping=0.9)
model.fit(df2)
yhat = model.predict(df2)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)[0]  # rows assigned to this exemplar
    plt.scatter(df2[row_ix, 0], df2[row_ix, 1])
# Optional: mark the exemplars (note that the attribute lives on the model, not on the label)
#plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1], marker = '+', label='Clusters', c = "red")
plt.title("Sklearn version of Affinity Propagation")

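This method isn't really good for our dataset because the clusters are not distinct. Also note that Affinity Propagation does not flag outliers: if you print the labels for this model and see -1 for every sample, that indicates (in some scikit-learn versions) that the algorithm did not converge.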

**Agglomerative Hierarchical Clustering**

Agglomerative clustering is the most common type of hierarchical clustering used to group objects in clusters based on their similarity. It’s also known as AGNES (Agglomerative Nesting). The algorithm starts by treating each object as a singleton cluster. Next, pairs of clusters are successively merged until all clusters have been merged into one big cluster containing all objects. The result is a tree-based representation of the objects, called a dendrogram.
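
The merge sequence can be inspected with SciPy, which this notebook does not otherwise use, so treat it as an assumed extra dependency. The sketch below runs on hypothetical toy points and uses Ward linkage, which is also the default linkage of `AgglomerativeClustering`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: three well-separated pairs of points.
X = np.array([[0.0, 0.0], [0.1, 0.0],
              [5.0, 5.0], [5.1, 5.0],
              [10.0, 0.0], [10.1, 0.0]])

# Each row of Z records one merge: the two clusters joined, the distance
# between them, and the size of the new cluster.
Z = linkage(X, method="ward")

# Cut the tree so that at most three clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
print(Z.shape)
print(labels)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the tree itself.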

In [None]:
from sklearn.cluster import AgglomerativeClustering
plt.style.use('fivethirtyeight')  # set the style before drawing
model = AgglomerativeClustering(n_clusters = 5)
yhat = model.fit_predict(df2)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)[0]  # row indices belonging to this cluster
    plt.scatter(df2[row_ix, 0], df2[row_ix, 1])
plt.title("Sklearn version of Agglomerative Clustering")


**BIRCH**

The Birch algorithm builds a tree called the Clustering Feature Tree (CFT) for the given data. The data is essentially lossy compressed to a set of Clustering Feature nodes (CF Nodes). The CF Nodes have a number of subclusters called Clustering Feature subclusters (CF Subclusters) and these CF Subclusters located in the non-terminal CF Nodes can have CF Nodes as children.

The CF Subclusters hold the necessary information for clustering which prevents the need to hold the entire input data in memory. This information includes:

- Number of samples in a subcluster.
- Linear Sum: an n-dimensional vector holding the sum of all samples.
- Squared Sum: the sum of the squared L2 norm of all samples.
- Centroids: linear sum / n_samples, stored to avoid recalculation.
- Squared norm of the centroids.

The Birch algorithm has two parameters, the threshold and the branching factor. The branching factor limits the number of subclusters in a node and the threshold limits the distance between the entering sample and the existing subclusters.

This algorithm can be viewed as an instance or data reduction method, since it reduces the input data to a set of subclusters which are obtained directly from the leaves of the CFT. This reduced data can be further processed by feeding it into a global clusterer. This global clusterer can be set by n_clusters. If n_clusters is set to None, the subclusters from the leaves are directly read off, otherwise a global clustering step labels these subclusters into global clusters (labels) and the samples are mapped to the global label of the nearest subcluster.
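
The reason these summary statistics are enough can be shown with a small NumPy sketch. The helper names (`make_cf`, `merge_cf`, `centroid`, `radius`) are illustrative and are not part of scikit-learn's `Birch` API. The point is that two subclusters can be merged, and their centroid and spread recovered, without revisiting the raw samples.

```python
import numpy as np

def make_cf(points):
    """Clustering feature of a batch: (count, linear sum, squared sum)."""
    points = np.asarray(points, dtype=float)
    return len(points), points.sum(axis=0), float((points ** 2).sum())

def merge_cf(cf_a, cf_b):
    """Two subclusters merge by adding their statistics component-wise."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

def centroid(cf):
    n, linear_sum, _ = cf
    return linear_sum / n

def radius(cf):
    """Root-mean-square distance of the samples from the centroid."""
    n, linear_sum, squared_sum = cf
    c = linear_sum / n
    return float(np.sqrt(max(squared_sum / n - c @ c, 0.0)))

a = [[0.0, 0.0], [2.0, 0.0]]
b = [[4.0, 0.0]]
merged = merge_cf(make_cf(a), make_cf(b))
print(centroid(merged), radius(merged))
```

The merged centroid equals the mean of all three points even though only the two summaries were combined.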

In [None]:
from sklearn.cluster import Birch
plt.style.use('fivethirtyeight')  # set the style before drawing
model = Birch(threshold=0.01, n_clusters=5)
model.fit(df2)
yhat = model.predict(df2)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)[0]  # row indices belonging to this cluster
    plt.scatter(df2[row_ix, 0], df2[row_ix, 1])
# Note: Birch exposes subcluster_centers_ rather than cluster_centers_
plt.title("Sklearn version of BIRCH")

**DBSCAN**

The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples). There are two parameters to the algorithm, min_samples and eps, which define formally what we mean when we say dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster.
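
The notion of a core sample can be sketched in a few lines of NumPy. `core_sample_mask` is an illustrative helper on hypothetical toy points, not scikit-learn's implementation. As in scikit-learn, the point itself counts towards `min_samples`.

```python
import numpy as np

def core_sample_mask(X, eps, min_samples):
    """True where a point has at least min_samples points within eps (itself included)."""
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    neighbour_counts = (dists <= eps).sum(axis=1)
    return neighbour_counts >= min_samples

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense blob
              [0.35, 0.0],                                      # on the blob's edge
              [5.0, 5.0]])                                      # isolated point
mask = core_sample_mask(X, eps=0.3, min_samples=4)
print(mask)
```

The four blob points are core samples. The edge point is not core but lies within `eps` of a core sample, so DBSCAN would attach it to the cluster; the isolated point would be labelled as noise.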

In [None]:
from sklearn.cluster import DBSCAN
plt.style.use('fivethirtyeight')  # set the style before drawing
model = DBSCAN(eps=0.30)
yhat = model.fit_predict(df2)  # a label of -1 marks noise points
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)[0]  # row indices belonging to this cluster
    plt.scatter(df2[row_ix, 0], df2[row_ix, 1])
plt.title("Sklearn version of DBSCAN")

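With these settings, this algorithm is not suitable for our dataset. Note that DBSCAN labels noise points as -1, so that value is not a cluster of its own.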

**Mini-Batch K-Means**

The main idea of the Mini-Batch K-means algorithm is to use small random batches of data of a fixed size, so they can be stored in memory. In each iteration, a new random sample from the dataset is obtained and used to update the clusters, and this is repeated until convergence. Each mini batch updates the clusters using a convex combination of the values of the prototypes and the data, applying a learning rate that decreases with the number of iterations. This learning rate is the inverse of the number of data points assigned to a cluster during the process. As the number of iterations increases, the effect of new data is reduced, so convergence can be detected when no changes in the clusters occur in several consecutive iterations.

Empirical results suggest that it can obtain a substantial saving of computational time at the expense of some loss of cluster quality, but no extensive study of the algorithm has been done to measure how the characteristics of the datasets, such as the number of clusters or its size, affect the partition quality.
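
The update rule described above can be sketched as follows. This is a simplified illustration on hypothetical data with hand-picked starting centres, not scikit-learn's `MiniBatchKMeans` implementation; `minibatch_update` is a made-up helper name.

```python
import numpy as np

def minibatch_update(centers, counts, batch):
    """One mini-batch step: assign each sample, then nudge its centre towards it."""
    for x in batch:
        j = int(np.argmin(((centers - x) ** 2).sum(axis=1)))  # nearest centre
        counts[j] += 1
        eta = 1.0 / counts[j]                                 # per-centre learning rate
        centers[j] = (1.0 - eta) * centers[j] + eta * x       # convex combination
    return centers, counts

rng = np.random.default_rng(0)
# Hypothetical data: two groups around (0, 0) and (10, 10).
data = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
                  rng.normal(10.0, 0.5, size=(100, 2))])
centers = np.array([[1.0, 1.0], [9.0, 9.0]])   # rough initial guesses
counts = np.zeros(2, dtype=int)

for _ in range(20):
    batch = data[rng.choice(len(data), size=10, replace=False)]
    centers, counts = minibatch_update(centers, counts, batch)

print(centers)
print(counts)
```

Because the learning rate is the inverse of the per-centre count, each centre ends up as the running mean of the samples assigned to it, and later samples move it less and less.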

In [None]:
from sklearn.cluster import MiniBatchKMeans
plt.style.use('fivethirtyeight')  # set the style before drawing
model = MiniBatchKMeans(n_clusters=5)
model.fit(df2)
yhat = model.predict(df2)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)[0]  # row indices belonging to this cluster
    plt.scatter(df2[row_ix, 0], df2[row_ix, 1])
# Optional: mark the centroids (the attribute lives on the model, not on the label)
#plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1], marker = '+', label='Clusters', c = "red")
plt.title("Sklearn version of Mini-Batch K-Means")

**OPTICS**

The OPTICS algorithm shares many similarities with the DBSCAN algorithm, and can be considered a generalization of DBSCAN that relaxes the eps requirement from a single value to a value range. The key difference between DBSCAN and OPTICS is that the OPTICS algorithm builds a reachability graph, which assigns each sample both a reachability_ distance, and a spot within the cluster ordering_ attribute; these two attributes are assigned when the model is fitted, and are used to determine cluster membership. If OPTICS is run with the default value of inf set for max_eps, then DBSCAN style cluster extraction can be performed repeatedly in linear time for any given eps value using the cluster_optics_dbscan method. Setting max_eps to a lower value will result in shorter run times, and can be thought of as the maximum neighborhood radius from each point to find other potential reachable points.
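
As a sketch of the repeated extraction mentioned above, the snippet below fits OPTICS once on hypothetical blob data and then calls `cluster_optics_dbscan` for two different `eps` values. The data, `min_samples=5` and the chosen radii are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

rng = np.random.default_rng(0)
# Hypothetical data: two tight blobs plus one far-away point.
X = np.vstack([rng.normal(0.0, 0.1, size=(30, 2)),
               rng.normal(3.0, 0.1, size=(30, 2)),
               [[10.0, 10.0]]])

model = OPTICS(min_samples=5).fit(X)   # max_eps is left at its default of infinity

# Reuse the single fit to extract DBSCAN-style labels at several radii.
results = {}
for eps in (0.5, 5.0):
    labels = cluster_optics_dbscan(
        reachability=model.reachability_,
        core_distances=model.core_distances_,
        ordering=model.ordering_,
        eps=eps,
    )
    n_clusters = len(set(labels.tolist()) - {-1})   # -1 marks noise
    n_noise = int((labels == -1).sum())
    results[eps] = (n_clusters, n_noise)

print(results)
```

Only the extraction step is repeated per radius; the expensive neighbourhood computation happens once in `fit`.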

In [None]:
from sklearn.cluster import OPTICS

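plt.style.use('fivethirtyeight')  # set the style before drawing
# Note: eps only takes effect with cluster_method="dbscan"; the default "xi" method ignores it
model = OPTICS(eps=0.8)
yhat = model.fit_predict(df2)  # a label of -1 marks noise points
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)[0]  # row indices belonging to this cluster
    plt.scatter(df2[row_ix, 0], df2[row_ix, 1])
plt.title("Sklearn version of OPTICS clustering")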

This is by far the worst-suited algorithm for our dataset.

**Spectral Clustering**

For two clusters, SpectralClustering solves a convex relaxation of the normalised cuts problem on the similarity graph: cutting the graph in two so that the weight of the edges cut is small compared to the weights of the edges inside each cluster. This criterion is especially interesting when working on images, where graph vertices are pixels, and weights of the edges of the similarity graph are computed using a function of a gradient of the image.
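
The two-cluster case can be sketched with plain NumPy. This follows the textbook normalised-cut relaxation (the sign of the second eigenvector of the normalised graph Laplacian) and is not scikit-learn's internal code. The points and the RBF parameter `gamma = 0.5` are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical toy data: two small groups, around x = 0 and x = 3.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, -0.2], [-0.1, 0.1],
              [3.0, 0.0], [3.2, 0.1], [2.9, -0.1], [3.1, 0.2]])

# Similarity graph: Gaussian (RBF) edge weights, gamma = 0.5 chosen arbitrarily.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
W = np.exp(-0.5 * sq_dists)
np.fill_diagonal(W, 0.0)

degree = W.sum(axis=1)
d_inv_sqrt = 1.0 / np.sqrt(degree)
# Symmetric normalised Laplacian: I - D^{-1/2} W D^{-1/2}.
L_sym = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

eigvals, eigvecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
fiedler = d_inv_sqrt * eigvecs[:, 1]       # rescaled second-smallest eigenvector
labels = (fiedler > 0).astype(int)         # its sign gives the two-way cut
print(labels)
```

Which group receives label 0 or 1 is arbitrary, because an eigenvector is only defined up to sign.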



In [None]:
from sklearn.cluster import SpectralClustering

plt.style.use('fivethirtyeight')  # set the style before drawing
model = SpectralClustering(n_clusters = 5)
yhat = model.fit_predict(df2)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)[0]  # row indices belonging to this cluster
    plt.scatter(df2[row_ix, 0], df2[row_ix, 1])
plt.title("Sklearn version of Spectral clustering")

**GMM**

Gaussian mixture models (GMMs) are often used for data clustering. You can use GMMs to perform either hard clustering or soft clustering on query data.

To perform hard clustering, the GMM assigns each query data point to the multivariate normal component that maximizes the component posterior probability, given the data. Hard clustering assigns a data point to exactly one cluster.

Additionally, you can use a GMM to perform a more flexible clustering on data, referred to as soft (or fuzzy) clustering. Soft clustering methods assign a score to a data point for each cluster. The value of the score indicates the association strength of the data point to the cluster. As opposed to hard clustering methods, soft clustering methods are flexible because they can assign a data point to more than one cluster. When you perform GMM clustering, the score is the posterior probability.

GMM clustering can accommodate clusters that have different sizes and correlation structures within them. Therefore, in certain applications, GMM clustering can be more appropriate than methods such as k-means clustering. Like many clustering methods, GMM clustering requires you to specify the number of clusters before fitting the model. The number of clusters specifies the number of components in the GMM.
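
In scikit-learn, the two modes correspond to `predict` (hard) and `predict_proba` (soft) on a fitted `GaussianMixture`. The sketch below uses hypothetical two-blob data and a fixed `random_state` for repeatability.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical data: two overlapping Gaussian blobs.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

hard = gmm.predict(X)          # one component index per sample
soft = gmm.predict_proba(X)    # posterior probability of each component, per sample

print(hard[:5])
print(soft[:5].round(3))
```

Each row of `soft` sums to one, and the hard label is simply the component with the largest posterior probability in that row.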

In [None]:
from sklearn.mixture import GaussianMixture

plt.style.use('fivethirtyeight')  # set the style before drawing
model = GaussianMixture(n_components = 5)
yhat = model.fit_predict(df2)  # hard assignment: most probable component per row
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)[0]  # row indices belonging to this cluster
    plt.scatter(df2[row_ix, 0], df2[row_ix, 1])
plt.title("Sklearn version of Gaussian Mixture")

Use whichever of these algorithms fits your data best. You can create a per-cluster report for any of them in the same way as shown after the KMeans clustering.