# Metrics

## Silhouette Score

The Silhouette Score is the mean Silhouette Coefficient over all observations. The coefficient of each observation is calculated from its mean intra-cluster distance and its mean nearest-cluster distance. The score lies between -1 and 1; the higher the score, the more well-defined and distinct the clusters are.

***a***: mean distance between the observation and all other data points in the ***same cluster***. This distance is also called the ***mean intra-cluster distance***.

***b***: mean distance between the observation and all data points of the ***nearest cluster*** that the observation does not belong to. This distance is also called the ***mean nearest-cluster distance***.

The Silhouette Coefficient of a single observation is then:
$$S = \frac{b-a}{\max(a, b)}$$
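The definitions of ***a*** and ***b*** can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `mean_silhouette` is hypothetical, Euclidean distance is assumed, at least two clusters are assumed, and singleton clusters are given a coefficient of 0 by convention.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all observations (Euclidean distance assumed)."""
    # Group the observations by cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Assumption: an observation alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from the point to itself contributes 0 to the sum).
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
print(mean_silhouette(points, labels))
```

The two clusters in this toy dataset are tight and far apart, so the printed score is close to 1. For real workloads, `sklearn.metrics.silhouette_score` is the usual choice.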

## Calinski-Harabasz Index
The Calinski-Harabasz Index is calculated from the between-cluster dispersion and the within-cluster dispersion in order to measure how distinct the groups are. Like the Silhouette Score, the higher the score, the more well-defined the clusters are. The score has no upper bound, so there is no universally 'acceptable' or 'good' value; it is most useful for comparing different clusterings of the same dataset.

1. Calculate inter-cluster dispersion: BGSS (between-group sum of squares)
$$BGSS = \sum_{k=1}^K{n_k \cdot ||C_k-C||^2}$$
- $n_k$ : the number of observations in cluster $k$
- $C_k$ : the centroid of cluster $k$
- $C$ : the centroid of the dataset (barycenter)
- $K$ : the number of clusters

2. Calculate intra-cluster dispersion: WGSS (within-group sum of squares).
$$WGSS_k = \sum_{i=1}^{n_k}{||X_{ik}-C_k||^2}$$
- $n_k$ : the number of observations in cluster $k$
- $X_{ik}$ : the $i$-th observation of cluster $k$
- $C_k$ : the centroid of cluster $k$

Then sum all individual within group sums of squares: 
$$WGSS = \sum_{k=1}^{K}{WGSS_k}$$

3. Calculate Calinski-Harabasz Index:
$$CH = \frac{\frac{BGSS}{K-1}}{\frac{WGSS}{N-K}} = \frac{BGSS}{WGSS} \cdot \frac{N-K}{K-1}$$
- $BGSS$ : between-group sum of squares (between-group dispersion)
- $WGSS$ : within-group sum of squares (within-group dispersion)
- $N$ : total number of observations
- $K$ : total number of clusters
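
The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function name `calinski_harabasz_index` is hypothetical, squared Euclidean distance is used for both dispersions, and the degenerate cases (a single cluster, or a within-group dispersion of zero) are not handled.

```python
def calinski_harabasz_index(points, labels):
    """Ratio of between-group to within-group dispersion, scaled by degrees of freedom."""
    n_obs = len(points)
    dim = len(points[0])

    # Group the observations by cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    n_clusters = len(clusters)

    def centroid(members):
        return [sum(m[d] for m in members) / len(members) for d in range(dim)]

    def squared_distance(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    barycenter = centroid(points)  # C: centroid of the whole dataset
    bgss = 0.0
    wgss = 0.0
    for members in clusters.values():
        c_k = centroid(members)
        # Step 1: each cluster contributes n_k * ||C_k - C||^2 to BGSS.
        bgss += len(members) * squared_distance(c_k, barycenter)
        # Step 2: each cluster contributes its own WGSS_k to WGSS.
        wgss += sum(squared_distance(x, c_k) for x in members)

    # Step 3: CH = (BGSS / (K - 1)) / (WGSS / (N - K)).
    return (bgss / (n_clusters - 1)) / (wgss / (n_obs - n_clusters))


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
print(calinski_harabasz_index(points, labels))
```

In this toy dataset BGSS is 100 and WGSS is 1, with $N = 4$ and $K = 2$, so the index is $(100/1)/(1/2) = 200$. For real workloads, `sklearn.metrics.calinski_harabasz_score` is the usual choice.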

## Davies-Bouldin Index
The Davies-Bouldin Index is the average similarity of each cluster with its most similar cluster. Unlike the previous two metrics, this score measures how similar your clusters are to each other, meaning that the lower the score, the better the separation between your clusters. The minimum possible score is 0.

1. Calculate intra-cluster dispersion:
$$S_i = \Bigg\{\frac{1}{n_i} \sum_{j=1}^{n_i} ||X_j - C_i||^q \Bigg\}^\frac{1}{q}$$
- $i$ : particular identified cluster
- $n_i$ : number of vectors (observations) in cluster $i$
- $X_j$ : $j$-th vector (observation) in cluster $i$
- $C_i$ : centroid of cluster $i$
- $q$ : hyperparameter. Usually $q$ is set to 2, which makes $S_i$ the root-mean-square distance to the centroid

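2. Calculate separation measure:
$$M_{ij} = \Bigg\{\sum_{k=1}^{n} |a_{ki} - a_{kj}|^p \Bigg\}^\frac{1}{p} = ||A_i - A_j||_p$$
- $A_i$, $A_j$ : centroids of clusters $i$ and $j$ (the centroid written $C_i$ in step 1)
- $a_{ki}$ : $k$-th component of the $n$-dimensional centroid $A_i$
- $a_{kj}$ : $k$-th component of the $n$-dimensional centroid $A_j$
- $n$ : number of dimensions (features) of the data, not to be confused with the total number of clusters $N$
- $p$ : another hyperparameter. Usually $p$ is set to 2, which gives the Euclidean distance between centroids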

3. Calculate similarity between clusters:
$$R_{ij} = \frac{S_i + S_j}{M_{ij}}$$
- $S_i$ : intra-cluster dispersion of cluster $i$
- $S_j$ : intra-cluster dispersion of cluster $j$
- $M_{ij}$ : distance between centroids of clusters $i$ and $j$

4. Find most similar cluster for each cluster $i$:
$$R_i \equiv \max_{j \neq i} R_{ij}$$

5. Calculate Davies-Bouldin Index:
$$\bar{R} = \frac{1}{N}\sum_{i=1}^{N}R_i$$
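
The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function name `davies_bouldin_index` is hypothetical, the distance inside the dispersion term is taken to be Euclidean, both hyperparameters default to 2, and clusters with coinciding centroids (which would give a zero separation) are not handled.

```python
import math


def davies_bouldin_index(points, labels, q=2, p=2):
    """Average, over clusters, of the similarity to the most similar other cluster."""
    dim = len(points[0])

    # Group the observations by cluster label.
    clusters = {}
    for x, label in zip(points, labels):
        clusters.setdefault(label, []).append(x)

    centroids = []
    dispersions = []
    for members in clusters.values():
        c = [sum(x[d] for x in members) / len(members) for d in range(dim)]
        # Step 1: S_i, the q-th root of the mean q-th power of distances to the centroid.
        s = (sum(math.dist(x, c) ** q for x in members) / len(members)) ** (1 / q)
        centroids.append(c)
        dispersions.append(s)

    def separation(u, v):
        # Step 2: M_ij, the Minkowski distance of order p between two centroids.
        return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

    n_clusters = len(centroids)
    most_similar = []
    for i in range(n_clusters):
        # Steps 3 and 4: R_ij for every other cluster j, keeping the maximum as R_i.
        most_similar.append(max(
            (dispersions[i] + dispersions[j]) / separation(centroids[i], centroids[j])
            for j in range(n_clusters)
            if j != i
        ))
    # Step 5: average R_i over all clusters.
    return sum(most_similar) / n_clusters


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
print(davies_bouldin_index(points, labels))
```

In this toy dataset each cluster has dispersion 0.5 and the centroids are 10 apart, so $R_{12} = (0.5 + 0.5)/10 = 0.1$ and the index is 0.1. For real workloads, `sklearn.metrics.davies_bouldin_score` is the usual choice.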