## **Hierarchical clustering**

- `Hierarchical clustering` starts with each item as its own cluster, then
successively merges the closest pair of items/clusters until all items belong to
one cluster.

- The graphical representation of this agglomeration is called a `Dendrogram`

#### The various linkages (the methods of computing the distances between clusters) are the following:

1. `complete linkage` - The distance between two clusters is defined as the maximum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2. It tends to produce more compact clusters

1. `single linkage` - The distance between two clusters is defined as the minimum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2. It tends to produce long, “loose” clusters

1. `average linkage` - The distance between two clusters is defined as the
average distance between the elements in cluster 1 and the elements in
cluster 2.

1. `centroid linkage` - The distance between two clusters is defined as the
distance between the centroid of cluster 1 (the mean vector of its points, with
one entry for each of the p variables) and the centroid of cluster 2.

1. `Ward’s minimum variance method` - At each step it merges the pair of
clusters whose merger gives the smallest increase in total within-cluster
variance.
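
The first four definitions above can be checked by hand on two tiny clusters. The sketch below is only an illustration: the two clusters and the variable names are made up for this example, and Euclidean distance is assumed.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-D points (not part of the toy data used later)
cluster_1 = np.array([[0.0, 0.0], [0.0, 2.0]])
cluster_2 = np.array([[3.0, 0.0], [3.0, 2.0]])

# All pairwise Euclidean distances between elements of cluster 1 and cluster 2
pairwise = cdist(cluster_1, cluster_2)

complete_dist = pairwise.max()   # complete linkage: largest pairwise distance
single_dist = pairwise.min()     # single linkage: smallest pairwise distance
average_dist = pairwise.mean()   # average linkage: mean pairwise distance

# centroid linkage: distance between the two mean vectors
centroid_dist = np.linalg.norm(cluster_1.mean(axis=0) - cluster_2.mean(axis=0))

print("complete:", complete_dist)
print("single:  ", single_dist)
print("average: ", average_dist)
print("centroid:", centroid_dist)
```

For these points the single-linkage distance is the smallest and the complete-linkage distance is the largest, with the average in between.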

## Procedure

##### Hierarchical clustering is a method where the data points are clustered using a bottom-up approach.
- In the initial state, each data point is considered its own cluster, and the closest clusters are then combined based on distance (generally Euclidean distance)
- This process of combining continues until all the data points fall into one cluster. The visual representation of this kind of agglomeration is called a "**Dendrogram**"
- Based on the dendrogram, the user can decide on the number of clusters for partitioning/clustering
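
The bottom-up procedure described above can be sketched as a naive loop. This is a minimal illustration, not how SciPy implements it: the function name `naive_agglomerative`, its `n_clusters` parameter, and the four-point data set are all hypothetical, and single linkage with Euclidean distance is assumed.

```python
import numpy as np

def naive_agglomerative(points, n_clusters=1):
    """Repeatedly merge the two closest clusters (single linkage).

    Returns the remaining clusters (lists of point indices) and the merge history.
    """
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest single-linkage distance
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    np.linalg.norm(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters, merges

# Hypothetical data: two tight pairs that are far apart from each other
toy = [[0, 0], [0, 1], [5, 5], [5, 6]]
clusters, merges = naive_agglomerative(toy, n_clusters=2)
print(clusters)
for left, right, dist in merges:
    print(left, "+", right, "at distance", dist)
```

The loop is cubic or worse in the number of points, so it is only meant to make the merging order visible; for real data a library routine such as `scipy.cluster.hierarchy.linkage` is the practical choice.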

In [None]:
# Let's consider a toy example: import the SciPy clustering helpers and Matplotlib
from scipy.cluster.hierarchy import dendrogram, linkage  
from matplotlib import pyplot as plt

In [None]:
import numpy as np

# Ten 2-D points; each row is one observation (x, y)
X = np.array([
    [5, 3],
    [10, 15],
    [15, 12],
    [24, 10],
    [30, 30],
    [85, 70],
    [71, 80],
    [60, 78],
    [70, 55],
    [80, 91],
])

In [None]:
# Scatter plot of the points: first column on the x-axis, second on the y-axis
plt.figure(figsize=(10, 7))
plt.scatter(X[:, 0], X[:, 1])
plt.show()

In [None]:
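# Build the merge tree with single linkage (Euclidean distance is the default metric)
linked = linkage(X, 'single')
# Label the points 1..10 in the order they appear in X
labelList = list(range(1, 11))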

In [None]:
labelList

In [None]:
plt.figure(figsize=(10, 7))
dendrogram(
    linked,
    orientation='top',
    labels=labelList,
    distance_sort='descending',
)
plt.show()  
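
Once a number of clusters has been chosen from the dendrogram, the tree can be cut into flat clusters. The sketch below assumes two clusters is the choice for this toy data and uses SciPy's `fcluster` with the `maxclust` criterion; the variable name `flat_labels` is introduced only for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Same toy data as above, repeated so this cell runs on its own
X = np.array([
    [5, 3], [10, 15], [15, 12], [24, 10], [30, 30],
    [85, 70], [71, 80], [60, 78], [70, 55], [80, 91],
])

linked = linkage(X, 'single')

# Cut the tree so that at most 2 flat clusters remain (assumed choice)
flat_labels = fcluster(linked, t=2, criterion='maxclust')

# Print each point label (1..10) next to its assigned cluster id
for point_label, cluster_id in zip(range(1, 11), flat_labels):
    print(point_label, "->", cluster_id)
```

With this cut, the first five points end up in one cluster and the last five in the other, which matches the two groups visible in the scatter plot.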