# Hierarchical clustering

## Hierarchical versus Partitional

In [17]:
from IPython.display import HTML, display
display(HTML("<table><tr><td><img src='img/partitional_clustering.png'></td><td><img src='img/hierarquical_clustering.png'></td></tr></table>"))

## Agglomerative versus Divisive

### Agglomerative
Agglomerative (“bottom up”): each observation starts in its own cluster, and **pairs of clusters are merged** as one
moves up the hierarchy. 

`compute the proximity matrix, if necessary.
repeat:
    merge the closest two clusters.
    update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
until only one cluster remains.`
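The loop above can be sketched in plain Python. This is a minimal, unoptimized sketch rather than a reference implementation: it assumes Euclidean distance and single-link proximity, and it recomputes proximities on every pass instead of storing and updating a proximity matrix. The names `single_link` and `agglomerative` and the sample `points` are illustrative only.

```python
from math import dist  # Euclidean distance, Python 3.8+

def single_link(c1, c2, points):
    """Proximity of two clusters = distance between their closest pair of points."""
    return min(dist(points[i], points[j]) for i in c1 for j in c2)

def agglomerative(points, proximity=single_link):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[i] for i in range(len(points))]  # each observation starts alone
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters. Proximities are recomputed here
        # (assumption made for brevity) rather than read from a stored matrix.
        d, a, b = min(
            (proximity(clusters[a], clusters[b], points), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Illustrative data: two tight pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
merges = agglomerative(points)
for left, right, d in merges:
    print(left, right, round(d, 3))
```

Each entry of `merges` records the two clusters that were joined and the proximity at which they were joined, which is the information a dendrogram displays. For real data, a library routine such as `scipy.cluster.hierarchy.linkage` would normally be preferred over a hand-written loop like this one.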

In [18]:
display(HTML("<table><tr><td><img src='img/agglomerative_1.png'></td><td><img src='img/agglomerative_2.png'></td></tr><tr><td><img src='img/agglomerative_3.png'></td><td><img src='img/agglomerative_4.png'></td></tr><tr><td><img src='img/agglomerative_5.png'></td></tr></table>"))

### Divisive
Divisive (“top down”): all observations start in one cluster, and **splits are performed** recursively as one moves down
the hierarchy.

## Defining Proximity between Clusters

1. **Single link or MIN**: defines cluster proximity as the **proximity** between the closest two points that are in different clusters.

2. **Complete link or MAX**: takes the proximity between the **farthest** two points in different clusters to be the cluster proximity.

3. **Average**: defines cluster proximity to be the **average pairwise** proximity over all pairs of points from different clusters.

4. **Centroids**: the cluster proximity is commonly defined as the proximity between cluster centroids.

[Extra] **Ward’s**: measures the proximity between two clusters in terms of the increase in the SSE that results from merging the two clusters.
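The definitions above can be compared side by side on a small example. The following is a sketch that assumes Euclidean distance as the proximity measure and represents each cluster as a list of coordinate tuples; all function names and the clusters `A` and `B` are illustrative.

```python
from math import dist  # Euclidean distance, Python 3.8+
from itertools import product

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(dist(p, c) ** 2 for p in cluster)

def single_link(A, B):
    """MIN: distance between the closest two points in different clusters."""
    return min(dist(p, q) for p, q in product(A, B))

def complete_link(A, B):
    """MAX: distance between the farthest two points in different clusters."""
    return max(dist(p, q) for p, q in product(A, B))

def average_link(A, B):
    """Average of the distances over all pairs of points from different clusters."""
    return sum(dist(p, q) for p, q in product(A, B)) / (len(A) * len(B))

def centroid_link(A, B):
    """Distance between the two cluster centroids."""
    return dist(centroid(A), centroid(B))

def ward_increase(A, B):
    """Ward's: increase in SSE caused by merging the two clusters."""
    return sse(A + B) - sse(A) - sse(B)

# Two illustrative clusters of two points each.
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (3.0, 2.0)]
for f in (single_link, complete_link, average_link, centroid_link, ward_increase):
    print(f.__name__, round(f(A, B), 4))
```

For any pair of clusters the single-link value is never larger than the average, and the average is never larger than the complete-link value. Ward's value is on a different scale because it is measured in squared distances.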

In [19]:
display(HTML("<table><tr><td> 1 <img src='img/distance_min.png'></td><td> 2 <img src='img/distance_max.png'></td></tr><tr><td> 3 <img src='img/distance_avg.png'></td><td> 4 <img src='img/distance_centroids.png'></td></tr></table>"))



# DBSCAN
Density-Based Spatial Clustering of Applications with Noise

Given a set of points in some space, it **groups points that are closely packed** (points with many
nearby neighbors), marking as *outliers points that lie alone in low-density regions*.

- Core points: A point is a core point if there are **at least MinPts points within a distance of Eps** of it, where MinPts and Eps are user-specified parameters.

- Border points: A border point is not a core point, but **falls within the neighborhood of a core point.**

- Noise points: A noise point is any point that is neither a core point nor a border point. 
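The three definitions translate almost directly into code. The sketch below assumes Euclidean distance and counts a point as a member of its own Eps neighborhood, which is a common convention but an assumption here; `classify_points`, `eps`, `min_pts`, and the sample data are illustrative names.

```python
from math import dist  # Euclidean distance, Python 3.8+

def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    # Eps neighborhood of each point (assumption: the point itself is included).
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    # Core points have at least min_pts points in their neighborhood.
    core = {i for i, nbrs in enumerate(neighbors) if len(nbrs) >= min_pts}
    labels = []
    for i, nbrs in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nbrs):
            labels.append("border")  # not core, but inside a core point's neighborhood
        else:
            labels.append("noise")   # neither core nor border
    return labels

# Illustrative data: a dense square, one point on its edge, one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (8, 8)]
labels = classify_points(points, eps=1.5, min_pts=4)
print(labels)  # ['core', 'core', 'core', 'core', 'border', 'noise']
```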

![dbscan](img/dbscan.png)

## DBSCAN Algorithm
1. Start with an **arbitrary** point that has not been visited and retrieve its neighborhood using the Eps parameter.

2. If this point has at least MinPts points within its Eps neighborhood, cluster formation starts. Otherwise the point is labeled as noise, although it may later be found within the Eps neighborhood of a core point and thus become part of a cluster.

3. If a point is found to be a core point, then all points within its Eps neighborhood are also part of the cluster. Each of these points is added and, if it is itself a core point, its own Eps neighborhood is added as well.

4. The process restarts with a new unvisited point, which either starts a new cluster or is labeled as noise, until every point has been visited.
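The four steps can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: it uses Euclidean distance, a brute-force neighborhood search, and counts a point as part of its own Eps neighborhood. The names `region_query`, `dbscan`, `NOISE`, and the sample data are illustrative.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label used for noise points (assumed convention)

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks noise points."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):            # step 1: pick an unvisited point
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:        # step 2: not a core point
            labels[i] = NOISE               # may become a border point later
            continue
        cluster_id += 1                     # step 2: cluster formation starts
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:                        # step 3: expand the cluster
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id      # former noise becomes a border point
            if labels[j] is not None:
                continue                    # already assigned, do not expand again
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts: # j is core: add its neighborhood too
                queue.extend(j_neighbors)
    return labels                           # step 4 is the outer loop continuing

# Illustrative data: a border point listed first, two dense squares, one outlier.
points = [(2.2, 1), (0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=4)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The first point is initially labeled as noise because it is visited before any core point, and it is relabeled once the cluster around `(1, 1)` reaches it. For real workloads, scikit-learn's `sklearn.cluster.DBSCAN` (with parameters `eps` and `min_samples`) offers a tested implementation.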

Visualizing DBSCAN: https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/

# Clustering Evaluation
Evaluating the performance of a clustering algorithm is **not as trivial** as counting the number of errors or computing the precision
and recall of a supervised classification algorithm.

- Adjusted Rand index
- Mutual Information based scores
- Homogeneity, completeness and V-measure
- Silhouette Coefficient

## Silhouette Coefficient
The silhouette value is a measure of how similar a sample is to its own cluster (**cohesion**) compared to other clusters
(**separation**).

The silhouette ranges from −1 to +1.
-  **High value** = the clustering configuration is **appropriate**.
-  **Low value** = the clustering configuration may have **too many or too few** clusters.

The Silhouette Coefficient is defined **for each sample** and is composed of two scores:
- **a**: The mean distance between a sample and all other points **in the same cluster**.
- **b**: The mean distance between a sample and all points in the **next nearest cluster**.

The Silhouette Coefficient $s$ for a single sample is given by: $s = \frac{b - a}{\max(a, b)}$

The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering (a ≪ b). Scores around
zero indicate overlapping clusters.
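The per-sample definition can be checked with a short sketch. It assumes Euclidean distance, at least two clusters, and a score of 0 for samples in single-point clusters (a convention assumed here, since **a** is undefined in that case); `silhouette_samples_sketch` and the sample data are illustrative names.

```python
from math import dist  # Euclidean distance, Python 3.8+

def silhouette_samples_sketch(points, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for every sample."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for single-point clusters
            continue
        # a: mean distance to the other points of the same cluster
        # (the sum includes dist(p, p) = 0, so divide by len(own) - 1).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster.
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return scores

# Illustrative data: two well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_samples_sketch(points, [0, 0, 1, 1])  # matches the geometry
bad = silhouette_samples_sketch(points, [0, 1, 0, 1])   # mixes the two groups
print(sum(good) / len(good), sum(bad) / len(bad))
```

The labeling that follows the geometry yields scores close to +1, while the labeling that mixes the two groups yields negative scores. scikit-learn provides `sklearn.metrics.silhouette_score` and `sklearn.metrics.silhouette_samples` for use on real data.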