One way to judge whether a clustering is 'good' or 'bad' is WCSS (Within-Cluster Sum of Squares): the sum of the squared distances between each point and the centroid of the cluster it belongs to.
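
As a minimal sketch of that definition, the snippet below computes WCSS for two hand-made clusterings. The helper names `centroid` and `wcss` and the toy points are assumptions made for illustration, not part of any library.

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def wcss(clusters):
    """Sum, over all clusters, of squared distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(dist(p, c) ** 2 for p in points)
    return total

# Hypothetical toy data: each inner list is one cluster of 2-D points.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (0, 6)], [(10, 0), (10, 6)]]

print(wcss(tight))  # 4.0
print(wcss(loose))  # 36.0
```

The more densely packed clustering gets the smaller WCSS, but note that the value says nothing about how far apart the two clusters are.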


Ideally, we want each cluster to be densely packed with its data points, and different clusters to be far apart and well separated from each other.

The first requirement can be measured by WCSS, but the second one cannot. To capture both, there are two more metrics we can use.

They are: 
1. Dunn Index
2. Silhouette Score

## **Dunn Index**

- It is a metric for evaluating clustering algorithms
- The objective of Dunn index is to identify clusters that are:
 - compact with a small variance between members of the cluster
 - and well separated.

- Dunn Index is denoted by **‘D’** and is given as:

$D = \frac{\min_{i \neq j}\ distance(i,j)}{\max_{k}\ distance^{'}(k)}$

where:
- $distance(i,j)$ → distance between the farthest points of the clusters $C_i$ and $C_j$ (one point taken from each cluster) → **Inter-Cluster distance**

- ${distance^{'}(k)}$ → distance between the farthest points within the $k^{th}$ cluster → **Intra-Cluster distance**

<img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/008/125/original/Screenshot_2022-07-29_at_3.53.47_PM.png?1659089763' height = '500' width = '800'>


<img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/008/126/original/Screenshot_2022-07-29_at_3.56.07_PM.png?1659089899' height = '500' width = '800'>


- If the Dunn index is high, it implies that the clusters are well separated and the points within each cluster are tightly packed.
- To get the inter-cluster distance **$distance(i,j)$**, we have to compute the distance for every pair of points with one point from $C_i$ and the other from $C_j$.
- Similarly, to calculate **$distance'(k)$**, we have to iterate through every pair of points within the $k^{th}$ cluster.

<img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/008/127/original/Screenshot_2022-07-29_at_3.57.44_PM.png?1659089997' height = '500' width = '800'>
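
The pairwise computation described above can be sketched as follows. This is an illustrative implementation that follows the definitions used in these notes (farthest pair of points for both distances); the function names and toy clusters are assumptions. The classical Dunn index measures the inter-cluster distance between the closest pair of points instead, which would amount to changing `max` to `min` inside `inter_cluster_distance`.

```python
from itertools import combinations, product
from math import dist

def inter_cluster_distance(ci, cj):
    """Distance between the farthest pair of points, one taken from each cluster."""
    return max(dist(p, q) for p, q in product(ci, cj))

def intra_cluster_distance(ck):
    """Distance between the farthest pair of points inside one cluster."""
    # default=0.0 covers a cluster that contains a single point.
    return max((dist(p, q) for p, q in combinations(ck, 2)), default=0.0)

def dunn_index(clusters):
    """Smallest inter-cluster distance divided by largest intra-cluster distance."""
    numerator = min(inter_cluster_distance(ci, cj)
                    for ci, cj in combinations(clusters, 2))
    denominator = max(intra_cluster_distance(ck) for ck in clusters)
    # Assumes at least one cluster has two distinct points, so denominator > 0.
    return numerator / denominator

# Hypothetical toy data: compact, distant clusters vs. spread-out, interleaved ones.
compact = [[(0, 0), (1, 0)], [(10, 0), (11, 0)]]
spread = [[(0, 0), (5, 0)], [(4, 0), (9, 0)]]

print(dunn_index(compact))  # 11.0
print(dunn_index(spread))   # 1.8
```

Because every pair of points is visited, the cost grows quadratically with the number of points, which is why the Dunn index becomes expensive on large datasets.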




## **Silhouette Score**

- The silhouette score of a point **measures how close that point lies to the points of its own cluster, compared with the points of the nearest other cluster**.

- It provides information about clustering quality, which can be used to decide whether the current clustering needs further refinement.

- An instance’s silhouette coefficient is equal to $\Large\frac{(b - a)}{\max(a, b)}$, where:

 - $a$ is the mean intra-cluster distance (i.e., the mean distance to the other instances in the same cluster)

 - $b$ is the mean nearest-cluster distance (i.e., the mean distance to the instances of the next closest cluster, defined as the cluster that minimizes $b$, excluding the instance’s own cluster)

<img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/008/308/original/Screenshot_2022-08-02_at_6.47.39_PM.png?1659445799' height = '500' width = '800'>
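
To make $a$ and $b$ concrete, here is a small sketch that evaluates the formula above for a single instance. The function name `silhouette_coefficient` and the 1-D toy points are assumptions chosen so the numbers can be checked by hand.

```python
from math import dist

def silhouette_coefficient(points, labels, idx):
    """Silhouette coefficient (b - a) / max(a, b) for the instance at position idx.

    Assumes at least two clusters and that the instance's own cluster
    contains at least one other point.
    """
    own = labels[idx]
    # Group the distances from the chosen instance by the other point's cluster.
    distances = {}
    for j, (p, label) in enumerate(zip(points, labels)):
        if j != idx:
            distances.setdefault(label, []).append(dist(points[idx], p))
    # a: mean distance to the other instances in the same cluster.
    a = sum(distances[own]) / len(distances[own])
    # b: smallest mean distance to the instances of any other cluster.
    b = min(sum(d) / len(d) for label, d in distances.items() if label != own)
    return (b - a) / max(a, b)

# Hypothetical 1-D data: cluster 0 = {0, 2}, cluster 1 = {10, 12}.
points = [(0,), (2,), (10,), (12,)]
labels = [0, 0, 1, 1]

# For the point at 0: a = 2 and b = (10 + 12) / 2 = 11, so s = 9 / 11.
s = silhouette_coefficient(points, labels, 0)
print(round(s, 2))  # 0.82
```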

>**Instructor Note**:
Ask the learners about the range of Silhouette score.

**Q. What's the range of silhouette score?**

- The range of Silhouette score is [-1, 1].

- The score is $1$ when $b > a$ and $a = 0$

  - In this case the points within the same cluster are very close together, whereas the points in the other clusters are at some distance.

<img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/008/311/original/Screenshot_2022-08-02_at_7.05.22_PM.png?1659446843' height = '500' width = '800'>

- The score is $-1$ when $a > b$ and $b = 0$

  - In this case the other points in the current instance's own cluster are at some distance from it, while the points of another cluster overlap with it.

<img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/008/310/original/Screenshot_2022-08-02_at_7.02.19_PM.png?1659446661' height = '500' width = '800'>

- Ideally we want the Silhouette score to be close to $1$, because we want the inter-cluster distances to be greater than the intra-cluster distances.

- The silhouette score, which is the mean silhouette coefficient over all the instances, is a more precise (but also more computationally expensive) alternative to **Inertia (within-cluster sum of squares, also called SSE)** and the Elbow curve.
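
The two extreme values follow from substituting into the coefficient formula. Writing $s$ for the coefficient (a label introduced here for convenience), a quick check gives:

$a = 0,\ b > 0 \ \Rightarrow\ s = \frac{b - 0}{\max(0, b)} = \frac{b}{b} = 1$

$b = 0,\ a > 0 \ \Rightarrow\ s = \frac{0 - a}{\max(a, 0)} = \frac{-a}{a} = -1$

Since $a$ and $b$ are mean distances, both are non-negative, so $|b - a| \le \max(a, b)$ and every coefficient stays within $[-1, 1]$.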

#### **Interpreting Silhouette scores**

- A coefficient close to $+1$ means that the instance is well inside its own cluster and far from other clusters.

- A coefficient close to $0$ means that it is close to a cluster boundary.

- A coefficient close to $-1$ means that the instance may have been assigned to the wrong cluster.
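
The sketch below illustrates this interpretation and the mean silhouette score over all instances. It scores the same points under two labelings, one of which deliberately assigns a point to the wrong cluster. The function name `silhouette_coefficients`, the toy data, and both labelings are assumptions made for illustration.

```python
from math import dist
from statistics import mean

def silhouette_coefficients(points, labels):
    """One silhouette coefficient per instance.

    Assumes at least two clusters, each containing at least two points.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from instance i to every other instance, grouped by cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        a = mean(by_cluster[labels[i]])
        b = min(mean(d) for label, d in by_cluster.items() if label != labels[i])
        scores.append((b - a) / max(a, b))
    return scores

# Hypothetical 1-D data with two obvious groups.
points = [(0,), (1,), (2,), (10,), (11,), (12,)]
good = [0, 0, 0, 1, 1, 1]
bad = [0, 0, 1, 1, 1, 1]  # the point at 2 is assigned to the far cluster

good_scores = silhouette_coefficients(points, good)
bad_scores = silhouette_coefficients(points, bad)

# Mean silhouette coefficient = silhouette score of the whole clustering.
print(mean(good_scores), mean(bad_scores))
# The misassigned point gets a negative coefficient.
print(bad_scores[2])
```

With the sensible labeling every coefficient is close to $+1$, while the misassigned point drags the mean score down and stands out with a negative value. In practice, scikit-learn's `silhouette_samples` and `silhouette_score` in `sklearn.metrics` compute the per-instance coefficients and their mean, respectively.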

