**Credit to: Berkay Alan**

**Unsupervised Learning - Clustering**

**20 December 2022**

## Content


***

- What's Unsupervised Learning?
- Clustering
- K-Means Clustering (Theory - Exploratory Data Analysis - Preprocessing - Model- Tuning)
    - Color - Image Quantization 
- Hierarchical Clustering (Theory - Model)
- DBSCAN (Density-based spatial clustering) (Theory - Model)

## Resources

- **The Elements of Statistical Learning: Data Mining, Inference, and Prediction** - Trevor Hastie, Robert Tibshirani, Jerome Friedman (Springer Series in Statistics)

- **An Introduction to Statistical Learning** - Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani

- [**Machine Learning by Shervine Amidi**](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-unsupervised-learning)

- [**10 Clustering Algorithms With Python by Machine Learning Mastery**](https://machinelearningmastery.com/clustering-algorithms-with-python/)

- [**K-Means: Getting The Optimal Number Of Clusters**](https://www.analyticsvidhya.com/blog/2021/05/k-mean-getting-the-optimal-number-of-clusters/)

- [**Elbow Method for optimal value of k in K-Means**](https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/)

- [**Silhouette Algorithm to determine the optimal value of k**](https://www.geeksforgeeks.org/silhouette-algorithm-to-determine-the-optimal-value-of-k/?ref=rp)

- [**4 Distance Measures for Machine Learning by Machine Learning Mastery**](https://machinelearningmastery.com/distance-measures-for-machine-learning/)

- [**Working with Images in Python using Matplotlib**](https://www.geeksforgeeks.org/working-with-images-in-python-using-matplotlib/)

- [**Hierarchical Clustering by Statquest**](https://www.youtube.com/watch?v=7xHsRkOdVwo&ab_channel=StatQuestwithJoshStarmer)

- [**ML | Hierarchical clustering (Agglomerative and Divisive clustering)**](https://www.geeksforgeeks.org/ml-hierarchical-clustering-agglomerative-and-divisive-clustering/)

- [**DBSCAN Clustering Algorithm in Machine Learning**](https://www.kdnuggets.com/2020/04/dbscan-clustering-algorithm-machine-learning.html)

## Importing Libraries

In [None]:
from warnings import filterwarnings
filterwarnings("ignore")

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mping
import seaborn as sns
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster import hierarchy
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split,cross_val_score,cross_val_predict,ShuffleSplit,GridSearchCV
from sklearn.cluster import KMeans,AgglomerativeClustering,DBSCAN
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs,make_moons
from yellowbrick.cluster import KElbowVisualizer
# import time
from matplotlib.colors import ListedColormap
from skompiler import skompile
from joblib import dump, load

import plotly.express as px

To see all rows and columns of a DataFrame, we will raise the display limits of pandas.

In [None]:
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)

## What's Unsupervised Learning?

 Unsupervised learning is a set of statistical tools intended for the setting in which we have only a set of features X1, X2, . . . , Xp measured on n observations. We are not interested in prediction, because **we do not have an associated response variable Y** . Rather, the goal is to discover interesting things about the measurements on X1, X2, . . . , Xp. Is there an informative way to visualize the data? Can we discover subgroups among the variables or among the observations? Unsupervised learning refers to a diverse set of techniques for answering questions such as these. 
 
 Unsupervised learning is often performed as part of an *exploratory data analysis*. The goal of unsupervised learning is to find hidden patterns in unlabeled data. Furthermore, it can be hard to assess the results obtained from unsupervised learning methods, since there is no universally accepted mechanism for performing cross-validation or validating results on an independent data set. The reason for this difference is simple. If we fit a predictive model using a supervised learning technique, then it is possible to check our work by seeing how well our model predicts the response Y on observations not used in fitting the model. However, **in unsupervised learning, there is no way to check our work because we don’t know the true answer**: the problem is unsupervised.

 There are three particular types of unsupervised learning:
 
- **Clustering :** a broad class of methods for discovering unknown subgroups in data.

- **Principal Components Analysis(PCA) :** a tool used for data visualization or data pre-processing before supervised techniques are applied.

- **Association rules mining :** a technique used in insight generation to discover patterns of transactions.

## Clustering

 **Clustering** refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. When we cluster the observations of a data set, we seek to partition them into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other. Of course, to make this concrete, we must define what it means for two or more observations to be similar or different. Indeed, this is often a domain-specific consideration that must be made based on knowledge of the data being studied.

![image.png](attachment:image.png)

<div class="alert alert-block alert-info">
<b>Reference :</b>  Photo is taken by Dataiku Blog.
</div>

***

 An application example of clustering arises in marketing. We may have access to a large number of measurements (e.g. median household income, occupation, distance from nearest urban area, and so forth) for a large number of people. Our goal is to perform market segmentation by identifying subgroups of people who might be more receptive to a particular form of advertising, or more likely to purchase a particular product. The task of performing market segmentation amounts to clustering the people in the data set.

 Since clustering is popular in many fields, there exist a great number of clustering methods. In this tutorial we focus on perhaps the two best-known clustering approaches: **K-means clustering and hierarchical clustering**; DBSCAN, a density-based method, is covered afterwards. In K-means clustering, we seek to partition the observations into a pre-specified number of clusters. On the other hand, in hierarchical clustering, we do not know in advance how many clusters we want; in fact, we end up with a tree-like visual representation of the observations, called a *dendrogram*, that allows us to view at once the clusterings obtained for each possible number of clusters, from 1 to n. There are advantages and disadvantages to each of these clustering approaches.

## K-Means Clustering (Theory - Model- Tuning)

### Theory

 **K-means clustering** is a simple and elegant approach for partitioning a data set into K distinct, non-overlapping clusters. To perform K-means clustering, we must first specify the desired number of clusters K; then the K-means algorithm will assign each observation to exactly one of the K clusters.

![Screen%20Shot%202021-09-26%20at%2008.34.08.png](attachment:Screen%20Shot%202021-09-26%20at%2008.34.08.png)

<div class="alert alert-block alert-info">
<b>Reference 1:</b>  Image is taken from *Introduction to Statistical Learning* book.
</div>

***

Figure shows the results obtained from performing K-means clustering on a simulated example consisting of 150 observations in two dimensions, using three different values of K.

The K-means clustering procedure results from a simple and intuitive mathematical problem. We begin by defining some notation. Let C1, . . . , CK denote sets containing the indices of the observations in each cluster. These sets satisfy two properties:

1. C1 ∪ C2 ∪ ... ∪ CK = {1,...,n}. In other words, each observation belongs to at least one of the K clusters.
2. Ck ∩ Ck′ = ∅ for all k ≠ k′. In other words, the clusters are non-overlapping: no observation belongs to more than one cluster.


For instance, if the ith observation is in the kth cluster, then i ∈ Ck. The idea behind K-means clustering is that a good clustering is one for which the **within-cluster variation is as small as possible**. The within-cluster variation for cluster Ck is a measure W(Ck) of the amount by which the observations within a cluster differ from each other.

![Screen%20Shot%202021-09-26%20at%2008.48.13.png](attachment:Screen%20Shot%202021-09-26%20at%2008.48.13.png)

 This formula says that we want to partition the observations into K clusters such that the total within-cluster variation, summed over all K clusters, is as small as possible.
 Solving this problem seems like a reasonable idea, but in order to make it actionable we need to define the within-cluster variation. There are many possible ways to define this concept, but by far the most common choice involves squared *Euclidean distance*. That is, we define:
 
 
![Screen%20Shot%202021-09-26%20at%2008.50.50.png](attachment:Screen%20Shot%202021-09-26%20at%2008.50.50.png)

where |Ck| denotes the number of observations in the kth cluster. In other words, the within-cluster variation for the kth cluster is the sum of all of the pairwise squared Euclidean distances between the observations in the kth cluster, divided by the total number of observations in the kth cluster.
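As a small numeric check of this definition, the sketch below computes W(Ck) for one toy cluster with NumPy and compares it with twice the sum of squared distances to the cluster centroid, an identity given in *An Introduction to Statistical Learning*. The data values and the function name `within_cluster_variation` are made up for illustration.

```python
import numpy as np

def within_cluster_variation(points):
    """Sum of all pairwise squared Euclidean distances, divided by the cluster size."""
    diffs = points[:, None, :] - points[None, :, :]   # shape (m, m, p)
    pairwise_sq = (diffs ** 2).sum(axis=-1)           # squared distance for every ordered pair (i, i')
    return pairwise_sq.sum() / len(points)

# Toy cluster with 4 observations and 2 features (made-up values)
cluster = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [2.0, 2.0]])

w = within_cluster_variation(cluster)
centroid = cluster.mean(axis=0)
twice_ss_to_centroid = 2 * ((cluster - centroid) ** 2).sum()

print(w, twice_ss_to_centroid)  # the two values agree
```

The agreement of the two numbers is why minimizing the pairwise definition is equivalent to pulling observations toward their cluster mean.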

 Now, we would like to find an algorithm to solve this optimization problem. This is actually a very difficult problem to solve precisely, since there are almost K^n ways to partition n observations into K clusters.
 
 **K-Means Clustering Algorithm**
 
***

**1.** Randomly assign a number, from 1 to K, to each of the observations.

    These serve as initial cluster assignments for the observations.


**2.** Iterate until the cluster assignments stop changing:

    (a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.

    (b) Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).

 In Step 2(b), reallocating the observations can only improve the objective. This means that as the algorithm runs, the clustering obtained will continually improve until the result no longer changes; the value of the objective will never increase. When the result no longer changes, a **local optimum** has been reached.
 
 K-means clustering derives its name from the fact that in Step 2(a), the cluster centroids are computed as the mean of the observations assigned to each cluster.
 
 Because the K-means algorithm finds a local rather than a global optimum, the results obtained will depend on the initial (random) cluster assignment of each observation in Step 1 of the algorithm. For this reason, it is important to run the algorithm multiple times from different random initial configurations. Then one selects the best solution, that is, the one for which the objective is smallest.
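The steps above, including the advice to restart from several random configurations, can be sketched in plain NumPy as follows. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_once`, the synthetic data, and the way an empty cluster is re-seeded are assumptions made for this example. The objective reported here is the total squared distance to the assigned centroids, which is half of the pairwise form of the within-cluster variation.

```python
import numpy as np

def kmeans_once(X, k, rng, max_iter=100):
    """One run of the algorithm above, starting from random cluster assignments."""
    labels = rng.integers(0, k, size=len(X))           # Step 1: random initial assignment
    for _ in range(max_iter):
        # Step 2(a): centroid = feature-wise mean of each cluster.
        # An empty cluster is re-seeded with a random observation (a choice made for this sketch).
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # Step 2(b): reassign each observation to its closest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):          # assignments stopped changing
            break
        labels = new_labels
    # Total squared distance of each point to its centroid (last computed centroids)
    objective = dists[np.arange(len(X)), labels].sum()
    return labels, objective

rng = np.random.default_rng(0)
# Synthetic data: three separated blobs (made up for illustration)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

# Run from 10 random initial configurations and keep the run with the smallest objective
runs = [kmeans_once(X, k=3, rng=rng) for _ in range(10)]
best_labels, best_objective = min(runs, key=lambda run: run[1])
print(best_objective)
```

In scikit-learn, the `n_init` parameter of `KMeans` plays the role of the restart loop.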

 As we have seen, to perform K-means clustering, we must decide how many clusters we expect in the data. The problem of selecting K is far from simple; two common heuristics for it are described next.

***

**How to find optimal number of clusters (K)?**

There are some methods to find optimal number of clusters (K):

**1.** Elbow Curve Method

    The elbow method runs K-means clustering on the dataset for a range of values of K (say 1 to 10). For each value of K, we calculate the average distance to the centroid across all data points. Then we plot these values against K and find the point where the average distance from the centroid falls suddenly (“Elbow”).


**2.** Silhouette Analysis

    The silhouette coefficient is a measure of how similar a data point is within-cluster (cohesion) compared to other clusters (separation).
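To make cohesion and separation concrete, here is a minimal NumPy sketch of the usual silhouette definition, which the text above does not spell out: for point i, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to the points of any other cluster, and s(i) = (b - a) / max(a, b). The function name `silhouette_values` and the toy data are illustrative assumptions.

```python
import numpy as np

def silhouette_values(X, labels):
    """Silhouette coefficient s(i) = (b - a) / max(a, b) for every observation."""
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                       # exclude the point itself
        if not same.any():                    # singleton cluster: score is left at 0
            continue
        a = dist[i, same].mean()              # cohesion: mean distance within own cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])  # separation
        scores[i] = (b - a) / max(a, b)
    return scores

# Two tight, well-separated pairs of points (made-up values)
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_values(X, np.array([0, 0, 1, 1]))   # sensible grouping
bad = silhouette_values(X, np.array([0, 1, 0, 1]))    # grouping that mixes the pairs
print(good.mean(), bad.mean())
```

Values close to 1 indicate points that sit well inside their cluster, and negative values suggest a point is closer to another cluster. The mean over all points is commonly compared across candidate values of K.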



***

**Distance Measures for Unsupervised Learning**

 Distance measures play an important role in unsupervised learning. Different distance measures must be chosen and used depending on the types of the data. As such, it is important to know how to implement and calculate a range of different popular distance measures and the intuitions for the resulting scores.

 A distance measure is an objective score that summarizes the relative difference between two objects in a problem domain. Most commonly, the two objects are rows of data that describe a subject (such as a person, car, or house), or an event (such as a purchase, a claim, or a diagnosis).

 Most commonly used distance measures in unsupervised learning are as follows:

- **Hamming Distance**

    Hamming distance calculates the distance between two binary vectors, also referred to as binary strings or bitstrings for short.


- **Euclidean Distance**

    Euclidean distance calculates the distance between two real-valued vectors.


- **Manhattan Distance**

    The Manhattan distance, also called the Taxicab distance or the City Block distance, calculates the distance between two real-valued vectors.


- **Minkowski Distance**

    Minkowski distance calculates the distance between two real-valued vectors.
    
<div class="alert alert-block alert-success">
<b>Do you want to learn details?:</b> If you want to learn more, please check the post of Machine Learning Mastery.  
</div>

[Take me to the post](https://machinelearningmastery.com/distance-measures-for-machine-learning/)
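The four measures listed above can be written in a few lines of NumPy. This is a sketch with made-up vectors; the Hamming distance is computed here as the fraction of positions that differ (some definitions use the raw count instead). Note that Minkowski distance has an order parameter p and reduces to Manhattan distance for p = 1 and to Euclidean distance for p = 2.

```python
import numpy as np

def hamming(a, b):
    return np.mean(a != b)                        # fraction of positions that differ

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def minkowski(a, b, p):
    return np.sum(np.abs(a - b) ** p) ** (1 / p)  # p=1: Manhattan, p=2: Euclidean

# Made-up example vectors
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 3.0])
bits_a = np.array([0, 1, 1, 0])
bits_b = np.array([1, 1, 0, 0])

print(hamming(bits_a, bits_b))   # 0.5
print(euclidean(u, v))           # 5.0
print(manhattan(u, v))           # 7.0
print(minkowski(u, v, p=1), minkowski(u, v, p=2))
```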

### Exploratory Data Analysis - Preprocessing

For a real-world example, we will use the *Bank Marketing Data Set*. The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution.

It can be downloaded from [here](https://www.kaggle.com/berkayalan/bank-marketing-data-set).

We will understand the dataset first.

In [None]:
import os
os.getcwd()

In [None]:
#Load bank_marketing_data_csv here
df =

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.describe().T

First, let's try to understand the age distribution of customers.

In [None]:
# Plot a histogram of the customers' age distribution (assumes the column is named "age").
px.histogram(df, x="age")

We also want to see if there is a relationship between age and the loan status of customers.

In [None]:
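# Plot the "Age Distribution of Customers" using nbins=30, color the distribution plot by loan
# (assumes the columns are named "age" and "loan")
px.histogram(df, x="age", color="loan", nbins=30, title="Age Distribution of Customers")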

Now we will try to understand the marital status distribution of customers.

In [None]:
# Plot histogram plot of marital status of customers.

The pdays column is the number of days that passed after the client was last contacted in a previous campaign. A value of 999 means the client was not previously contacted. We will look at that.

In [None]:
# Plot histogram plot of pdays of customers
px.histogram(df, x='pdays')

In [None]:
# Count the total number of clients for each unique pdays value.
pd.DataFrame(df["pdays"].value_counts())

As we can see, there are many clients who were not previously contacted. Now we will look only at the previously contacted clients.

In [None]:
# plot the histogram plot for clients that are Contacted previously (exclude pdays=999)


Now we will look at the distribution of durations by contact type.

In [None]:
# plot histogram of duration by contact


Now we will look at the loan status of customers.

In [None]:
# Plot barchart for personal loan of customers
title = "Personal Loan of customers"

In [None]:
#Plot barchart Housing Loan of Customers
title = "Housing Loan of customers"


Now we will look at the jobs of customers.

In [None]:
plt.figure(figsize=(12,6),dpi=200)
sns.countplot(data=df,x="job",order=df.job.value_counts().index)
plt.title("Jobs of Customers")
plt.xticks(rotation=90)
plt.show()

Now we will look at the education of customers.

In [None]:
plt.figure(figsize=(12,6),dpi=200)
sns.countplot(data=df,x="education",order=df.education.value_counts().index)
plt.title("Education of Customers")
plt.xticks(rotation=90)
plt.show()

In [None]:
df.head()

***

 To calculate distances for K-Means clustering, all features must be numeric. To achieve this, we will encode the categorical columns as **dummy variables**.
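As a quick illustration of what dummy encoding does, the sketch below applies `pd.get_dummies` to a tiny made-up table (the column names and values are hypothetical, not taken from the bank data). Numeric columns are left as they are, and each category of a text column becomes its own indicator column.

```python
import pandas as pd

# Tiny made-up table with one numeric and one categorical column
toy = pd.DataFrame({"age": [30, 45, 52], "marital": ["single", "married", "single"]})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())  # ['age', 'marital_married', 'marital_single']
print(encoded)
```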

In [None]:
df.shape

In [None]:
df_dummies = pd.get_dummies(df)

In [None]:
df_dummies.head()

In [None]:
df_dummies.shape

As we can see, it created a new column for each category of each categorical feature.

Because K-Means relies on a distance metric, features measured on larger scales would dominate the distances, so we will apply **scaling**.
[link to StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
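To see what standard scaling computes, here is a NumPy sketch on made-up numbers: each column has its mean subtracted and is divided by its standard deviation (the population standard deviation, which is what `StandardScaler` uses by default). After scaling, both columns contribute on a comparable scale to any distance.

```python
import numpy as np

# Made-up features on very different scales (e.g. an age and an account balance)
X = np.array([[20.0, 1000.0],
              [30.0, 3000.0],
              [40.0, 5000.0]])

# Column-wise standardisation: zero mean, unit variance (ddof=0)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```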

In [None]:
# init a Standard scaler object
scaler = StandardScaler()

In [None]:
# apply scaling onto our encoded dummy dataframe
df_scaled = scaler.fit_transform(df_dummies)

In [None]:
df_scaled[0]

### Model Kmeans

[Kmeans](https://scikit-learn.org/stable/modules/clustering.html#k-means)

In [None]:
df_scaled[0]

In [None]:
k_means_model = KMeans(n_clusters=2, random_state=88)  # n_clusters is the K value

In [None]:
k_means_model.fit(df_scaled)

In [None]:
cluster_labels = k_means_model.predict(df_scaled) #we can also do that with fit_predict() method.

In [None]:
cluster_labels

Now we will merge cluster labels with our main data set.

In [None]:
df_dummies["Cluster"] = cluster_labels

In [None]:
df_dummies.head()

Now we can look at correlations of each feature with clusters that we assigned.

In [None]:
plt.figure(figsize=(12,6),dpi=200)
df_dummies.corr()["Cluster"].iloc[:-1].sort_values().plot(kind="bar")
plt.title("Correlation between Clusters and Features")
plt.show()

***

**How to find optimal number of clusters (K)?**

There are some methods to find optimal number of clusters (K):

**1. Elbow Curve Method**

The elbow method runs K-means clustering on the dataset for a range of values of K (say 1 to 10). For each value of K, we calculate the sum of squared distances (SSD) from each data point to its assigned centroid. Then we plot these values against K and look for the point where the SSD stops falling sharply and the curve flattens (“Elbow”). We want high similarity inside clusters (max) and <text style="color:#6cef66">low similarity between clusters (min)</text>; the elbow marks the K beyond which adding clusters brings only a small further reduction in the <text style="color:#6ce333">sum of squared distances (SSD)</text>.

In [None]:
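ssd = []

# Sum of squared distances (SSD) from each point to its cluster center, for K = 2, ..., 9
for k in range(2, 10):
    model = KMeans(n_clusters=k, random_state=88)
    model.fit(df_scaled)
    ssd.append(model.inertia_)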



In [None]:
ssd

In [None]:
plt.plot(range(2,10),ssd,"o--")
plt.title("Elbow Range")
plt.show()

In [None]:
pd.Series(ssd).diff()

## Hierarchical Clustering (Theory - Model)

### Theory

 One potential disadvantage of K-means clustering is that it requires us to pre-specify the number of clusters K. **Hierarchical clustering** is an alternative approach which does not require that we commit to a particular choice of K. Hierarchical clustering has an added advantage over K-means clustering in that it results in an attractive tree-based representation of the observations, called a *dendrogram*.

![Screen%20Shot%202021-09-30%20at%2014.58.50.png](attachment:Screen%20Shot%202021-09-30%20at%2014.58.50.png)

<div class="alert alert-block alert-info">
<b>Reference 3:</b>  Image is taken from *Introduction to Statistical Learning* book.
</div>

***

**Types of Hierarchical Clustering**

1. Agglomerative clustering: Also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). It produces a structure that is more informative than the unstructured set of clusters returned by flat clustering, and it does not require us to prespecify the number of clusters. Bottom-up algorithms treat **each data point as a singleton cluster at the outset and then successively merge (agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all the data**.

2. Divisive clustering: Also known as the top-down approach. This algorithm also does not require us to prespecify the number of clusters. Top-down clustering requires a method for **splitting a cluster that contains the whole data set and proceeds by splitting clusters recursively until every data point is in its own singleton cluster**.

![image.png](attachment:image.png)

<div class="alert alert-block alert-info">
<b>Reference 2:</b>  Image is taken from *Great Learning* website.
</div>

 Each leaf of the dendrogram represents one of the observations in a data set. As we move up the tree, some leaves begin to fuse into branches. These correspond to observations that are *similar to each other*. As we move higher up the tree, branches themselves fuse, either with leaves or other branches. The earlier (lower in the tree) fusions occur, the more similar the groups of observations are to each other. On the other hand, observations that fuse later (near the top of the tree) can be quite different.

 Thus, observations that fuse at the very bottom of the tree are quite similar to each other, whereas observations that fuse close to the top of the tree will tend to be quite different. Y axis shows the distance.

![Screen%20Shot%202021-09-30%20at%2015.01.39.png](attachment:Screen%20Shot%202021-09-30%20at%2015.01.39.png)

<div class="alert alert-block alert-info">
<b>Reference 4:</b>  Image is taken from *Introduction to Statistical Learning* book.
</div>

Left: a dendrogram generated using Euclidean distance and complete linkage. Observations 5 and 7 are quite similar to each other, as are observations 1 and 6. However, observation 9 is no more similar to observation 2 than it is to observations 8, 5, and 7, even though observations 9 and 2 are close together in terms of horizontal distance. This is because observations 2,8,5, and 7 all fuse with observation 9 at the same height, approximately 1.8. 

Right: the raw data used to generate the dendrogram can be used to confirm that indeed, observation 9 is no more similar to observation 2 than it is to observations 8, 5, and 7.

***

![Screen%20Shot%202021-10-11%20at%2020.45.59.png](attachment:Screen%20Shot%202021-10-11%20at%2020.45.59.png)

<div class="alert alert-block alert-info">
<b>Reference 5:</b>  Image is taken from *Machine Learning A-Z™: Hands-On Python & R In Data Science* course.
</div>

**Note:** To put it mathematically, there are *2^(n−1)* possible reorderings of the dendrogram, where n is the number of leaves. This is because at each of the n − 1 points where fusions occur, the positions of the two fused branches could be swapped without affecting the meaning of the dendrogram. Therefore, we cannot draw conclusions about the similarity of two observations based on their proximity along the horizontal axis.

 The dendrogram is built with a simple algorithm.
 
 **Hierarchical Clustering Algorithm**
 
***

1. Compare data points to find most similar data points to each other.

2. Merge these to create a cluster.

3. Compare clusters to find most similar clusters and merge again.

4. Repeat until all data points are in a single cluster.
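The four steps above can be sketched naively in NumPy. This version assumes complete linkage (the distance between two clusters is the largest pairwise distance between their members) and stops once `n_clusters` clusters remain instead of merging all the way to one; the function name `agglomerate` and the toy points are illustrative. It is far slower than library implementations and is only meant to show the logic.

```python
import numpy as np

def agglomerate(X, n_clusters):
    """Naive bottom-up clustering with complete linkage."""
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    clusters = [[i] for i in range(len(X))]            # every point starts as its own cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete linkage: largest distance between any two members
                d = dist[np.ix_(clusters[a], clusters[b])].max()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]         # merge the two most similar clusters
        del clusters[b]
    return clusters

# Made-up points: two close pairs and one isolated point
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
print(agglomerate(X, n_clusters=3))   # [[0, 1], [2, 3], [4]]
```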

### Model

For a real-world example, we will use the *Auto-mpg* dataset. The data contains technical specifications of cars. The dataset comes from the UCI Machine Learning Repository.

It can be downloaded from [here](https://www.kaggle.com/uciml/autompg-dataset).

We will understand the dataset first.

In [None]:
# Load auto-mpg dataset here
df = pd.read_csv("")

In [None]:
df.shape

In [None]:
df.head()

In [None]:
# count the number of unique values for origin


In [None]:
df.describe()

***

 To calculate distances for hierarchical clustering, all features must be numeric. The *Origin* column is categorical, so we will encode it as **dummy variables**.

In [None]:
df_with_dummies = pd.get_dummies(df.drop("car name",axis=1))

In [None]:
df_with_dummies.head()

As we can see, it created a new column for each category of each categorical feature.

Because we are dealing with a distance metric, we will apply **Min-Max scaling**. It rescales every feature to the range between 0 and 1.
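Min-Max scaling can be written directly in NumPy, which makes the formula explicit: each column has its minimum subtracted and is divided by its range. The numbers below are made up for illustration.

```python
import numpy as np

# Made-up features on different scales
X = np.array([[10.0, 1.0],
              [20.0, 3.0],
              [30.0, 5.0]])

col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_minmax = (X - col_min) / (col_max - col_min)   # every column now spans [0, 1]

print(X_minmax)
```

Dummy columns already take only the values 0 and 1, so this scaling leaves them unchanged as long as both values occur.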

In [None]:
# initialise MinMax scaler
scaler = MinMaxScaler()

In [None]:
## fit transform the dummy dataframe
df_scaled = scaler.fit_transform(df_with_dummies)

In [None]:
df_scaled[:5]

In [None]:
df_scaled  = pd.DataFrame(df_scaled,columns=df_with_dummies.columns)

In [None]:
df_scaled.head()

In [None]:
plt.figure(figsize=(20,9))
sns.heatmap(df_scaled)
plt.show()

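The heatmap has one row per observation, so it is as long as the data set. Dummy variables take only the values 0 and 1, so their columns show only the two extreme colors (black and white).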

Now we will draw a **cluster heatmap** using seaborn's *clustermap()* function.

In [None]:
# clustermap creates its own figure, so the size is passed directly
sns.clustermap(df_scaled, row_cluster=False, figsize=(20, 9))
plt.show()

In order to understand the relationships between features, we will make a heatmap of correlations.

In [None]:
plt.figure(figsize=(15,9))
sns.heatmap(df_scaled.corr())
plt.title("Correlation between Features")
plt.show()

Now we will perform the clustering and assign the labels by using **sklearn**.

To understand Agglomerative Clustering, please refer to the [AgglomerativeClustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html) documentation.

In [None]:
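# Initialise Agglomerative Clustering Class.
# use n_clusters = 5, distance metric = euclidean, linkage=complete. distance_threshold is also important.
# Euclidean distance is the default metric (the keyword is `metric` in scikit-learn >= 1.2, `affinity` in older versions).
hier_model = AgglomerativeClustering(n_clusters=5, linkage="complete")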

If you want to investigate more, please check out the documentation of [Agglomerative Clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html).

In [None]:
# Fit the model on the scaled data and get one cluster label per row
cluster_labels = hier_model.fit_predict(df_scaled)

In [None]:
cluster_labels[:20]

In [None]:
# Visualise using a scatter plot of weight against mpg, color the plot by the cluster_labels. What do you observe?



In [None]:
# Visualise using a scatter plot of horsepower against mpg, color the plot by the cluster_labels. What do you observe?


Now we will create the dendrogram of the model with the help of SciPy's **linkage()** function, which performs hierarchical/agglomerative clustering. To learn more, you can read the [official explanation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html) in the SciPy documentation.
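Before applying `linkage()` to the real data, it may help to read its output on a tiny example. In the sketch below (four made-up one-dimensional points, complete linkage), each row of the linkage matrix describes one merge: the indices of the two clusters merged, the distance between them, and the number of original observations in the new cluster. Indices of n or above refer to clusters created in earlier rows. `fcluster` then cuts the tree into a chosen number of flat clusters.

```python
import numpy as np
from scipy.cluster import hierarchy

# Four made-up points on a line: two close pairs
points = np.array([[0.0], [1.0], [5.0], [6.0]])

Z = hierarchy.linkage(points, method="complete", metric="euclidean")
print(Z)   # 3 merges for 4 points; the last row joins the two pairs

# Cut the tree so that at most 2 flat clusters remain
labels = hierarchy.fcluster(Z, t=2, criterion="maxclust")
print(labels)
```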

In [None]:
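# Build the merge history from the scaled observations (complete linkage, Euclidean distance)
linkage_matrix = hierarchy.linkage(df_scaled, method="complete")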

In [None]:
linkage_matrix

In [None]:
linkage_df = pd.DataFrame(linkage_matrix,columns=["First Point","Second Point","Distance Between Points","How many points are there in the cluster?"])
linkage_df.head()

In [None]:
plt.figure(figsize=(20,10))
dendrogram = hierarchy.dendrogram(linkage_matrix,truncate_mode="level",p=5)
plt.xticks(rotation=90)
plt.grid(False)
plt.show()

### Extra Model

For a real-world example, we will use the *Mall Customer Segmentation Data*. This data set was created solely for learning customer segmentation concepts.

It can be downloaded from [here](https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python).

We will understand the dataset first.

In [None]:
# Load Mall Customers dataset
df = 

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.describe()

To keep the analysis simple and easy to plot in two dimensions, we will work only with the **Annual Income (k$)** and **Spending Score (1-100)** columns.

In [None]:
X = df.iloc[:, [3, 4]].values

Now we will create a dendrogram to find the optimal number of clusters.

In [None]:
plt.figure(figsize=(20,9))
dendrogram = hierarchy.dendrogram(hierarchy.linkage(X, method = 'ward'),leaf_font_size=10)
plt.title('Dendrogram of Mall Customers')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.grid(False)
plt.xticks(rotation=90)
plt.show()

Now we will train the Hierarchical Clustering model on the dataset.

In [None]:
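# Use the setting: n_clusters = 5, linkage = 'ward' (Ward linkage works with Euclidean distance).
# AgglomerativeClustering is deterministic and has no random_state parameter.
hierearchical = AgglomerativeClustering(n_clusters=5, linkage="ward")
# Fit the hierarchical cluster and get one label per customer
y_hierearchical = hierearchical.fit_predict(X)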

In [None]:
y_hierearchical

In [None]:
# Visualise the Clusters from the Ward Linkage Model
plt.figure(figsize=(15,9))
plt.scatter(X[y_hierearchical == 0, 0], X[y_hierearchical == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_hierearchical == 1, 0], X[y_hierearchical == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_hierearchical == 2, 0], X[y_hierearchical == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_hierearchical == 3, 0], X[y_hierearchical == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_hierearchical == 4, 0], X[y_hierearchical == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.title('Clusters of  Mall Customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.grid(False)
plt.legend()
plt.show()

## DBSCAN (Density-based spatial clustering)  (Theory - Model)

### Theory

At their core, clustering methods follow the same approach: first we calculate similarities between data points, and then we use them to group the points into clusters. Now we will focus on the Density-based spatial clustering of applications with noise (DBSCAN) clustering method.

 What’s nice about DBSCAN is that we don’t have to specify the number of clusters to use it. All we need is a function to calculate the distance between values and some guidance for what amount of distance is considered “close”. DBSCAN can also produce more reasonable results than K-means when clusters have irregular, non-spherical shapes.

![image.png](attachment:image.png)

<div class="alert alert-block alert-info">
<b>Reference 6:</b>  Image is taken from *Towards Data Science* website.
</div>


 Density-Based Clustering refers to unsupervised learning methods that identify distinctive groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density.

 **Density-Based Spatial Clustering of Applications with Noise (DBSCAN)** is a base algorithm for density-based clustering. It can discover clusters of different shapes and sizes in large amounts of data that contain noise and outliers.
 

 The DBSCAN algorithm uses two parameters:

- *eps (ε)*: A distance measure that will be used to locate the points in the neighborhood of any point. For each instance, the algorithm counts how many instances are located within a small distance ε (epsilon) from it. This region is called the instance’s ε-neighborhood.

- *minPts*: The minimum number of points (a threshold) clustered together for a region to be considered dense; scikit-learn calls this parameter `min_samples`. If an instance has at least minPts instances in its ε-neighborhood (including itself), then it is considered a **core instance**. In other words, core instances are those that are located in dense regions. All instances in the neighborhood of a core instance belong to the same cluster. This neighborhood may include other core instances; therefore, a long sequence of neighboring core instances forms a single cluster.

Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.
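The roles described above can be checked with a short NumPy sketch that labels each point as a core instance, a non-core point lying in the neighborhood of a core instance (commonly called a border point, a term not used in the text above), or an anomaly. The function name `classify_points`, the toy data, and the parameter values are assumptions made for illustration; this is not a full DBSCAN implementation because it does not assign cluster labels.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise' under the rules above."""
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    neighbors = dist <= eps                                   # each point is in its own neighborhood
    is_core = neighbors.sum(axis=1) >= min_pts                # dense enough neighborhood
    near_core = (neighbors & is_core[None, :]).any(axis=1)    # has a core instance within eps
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

# Made-up data: a dense square, one point on its edge, one far-away point
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2.2, 1], [8, 8]], dtype=float)
kinds = classify_points(X, eps=1.5, min_pts=4)
print(kinds)   # four core points, then one border point, then one noise point
```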

To check out more, please visit [this](https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/) online visualisation tool.

### Model

For a real world example, we will use *Wholesale customers* dataset. The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.

It can be downloaded from [here](https://archive.ics.uci.edu/ml/datasets/Wholesale+customers).

Attribute Information:

    1) FRESH: annual spending (m.u.) on fresh products (Continuous);
    2) MILK: annual spending (m.u.) on milk products (Continuous);
    3) GROCERY: annual spending (m.u.) on grocery products (Continuous);
    4) FROZEN: annual spending (m.u.) on frozen products (Continuous)
    5) DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
    6) DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous);
    7) CHANNEL: customers' channel - Horeca (Hotel/Restaurant/Café) or Retail channel (Nominal)
    8) REGION: customers' region - Lisbon, Oporto or Other (Nominal)

We will understand the dataset first.

In [None]:
# Load Wholesale customers dataset here.
df = 

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.describe()

Now we will create a scatterplot showing the relation between MILK and GROCERY spending, colored by Channel column.

In [None]:
# Plot scatter between Milk and Grocery spending, color by Channel


In [None]:
sns.histplot(df,x='Milk',hue='Channel',multiple="stack")
plt.grid(False)
plt.show()

Now we will create an annotated clustermap of the correlations between spending on different categories.

In [None]:
print('Correlation Between Spending Categories')
sns.clustermap(df.drop(['Region','Channel'],axis=1).corr(),annot=True)
plt.grid(False)
plt.show()

We will create a variety of models testing different epsilon values.

In [None]:
# initialise standard scaler, apply scaling onto the dataframe.
scaler = StandardScaler()
scaled_X = scaler.fit_transform(df)

We test different epsilon values generated with `np.linspace` ([refer here](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html)): 50 evenly spaced values from 0.001 to 3.

In [None]:
outlier_percent = []

for eps in np.linspace(0.001, 3, 50):  # 50 evenly spaced epsilon values from 0.001 to 3
    
    # Create Model
    dbscan = DBSCAN(eps=eps,min_samples=2*scaled_X.shape[1])
    dbscan.fit(scaled_X)
   
     
    # Log percentage of points that are outliers
    perc_outliers = 100 * np.sum(dbscan.labels_ == -1) / len(dbscan.labels_)
    
    outlier_percent.append(perc_outliers)

In [None]:
# Plot Percentage of Points Classified as Outliers against different epsilon values


Based on the plot above, we will retrain a DBSCAN model with the most reasonable epsilon value. Refer to the DBSCAN [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html).

In [None]:
# DB scan model
dbscan = 
dbscan.fit()

In [None]:
# Get the labels
df['Labels'] = dbscan.labels_

In [None]:
df.head()