# Silhouette Score for Clustering Evaluation

The **Silhouette score** is a widely used metric for evaluating the quality of the clusters produced by a clustering algorithm. It measures how similar a point is to its own cluster compared to other clusters. The higher the Silhouette score, the better defined the clusters are.

## Key Concept

The Silhouette score combines both cohesion (how close points in a cluster are to each other) and separation (how well-separated the clusters are). For each data point, the score compares the average distance to other points within the same cluster (cohesion) to the average distance to points in the nearest cluster (separation). The Silhouette score is calculated for all points in the dataset, and the overall score is the average of individual point scores.

### Silhouette Score Equation

The **Silhouette score** for an individual point \( i \) is defined as:

\begin{equation}
s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}
\end{equation}

Where:
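- $a(i)$ is the **average distance** between point $i$ and all other points in the same cluster (the point itself is excluded). This measures cohesion.
- $b(i)$ is the **minimum average distance** from point $i$ to another cluster: for every cluster that $i$ does not belong to, take the mean distance from $i$ to the points of that cluster, then keep the smallest of these means. This measures separation.
- $\max(a(i), b(i))$ is used to normalize the score, ensuring that it falls between -1 and 1. By convention, $s(i) = 0$ when point $i$ is the only member of its cluster.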

### Intuition Behind the Formula

- **Cohesion** $a(i)$ measures how well the point \( i \) fits within its cluster. A lower value of $a(i)$ means that the point is close to other points in the cluster.
- **Separation** $b(i)$ measures how well-separated point $i$ is from the other clusters. A higher value of $b(i)$ means that the point is far from the nearest neighboring cluster.

- The **Silhouette score** $s(i)$ can take values between -1 and 1:
  - A score close to **1** indicates that the point is well clustered (both close to its own cluster and far from others).
  - A score close to **0** indicates that the point lies on or near the boundary between two clusters.
  - A score close to **-1** indicates that the point may have been assigned to the wrong cluster.
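
The three regimes listed above follow directly from the formula. The sketch below is only an illustration: the helper name `silhouette_from_ab` and the $(a, b)$ pairs are hypothetical values chosen to land in each regime, not numbers taken from a dataset.

```python
def silhouette_from_ab(a: float, b: float) -> float:
    """Silhouette value s = (b - a) / max(a, b) for a single point.

    a: mean distance to the other points of the point's own cluster.
    b: mean distance to the points of the nearest other cluster.
    """
    denom = max(a, b)
    # Assumed convention: return 0 when both distances are 0 (ratio undefined).
    return 0.0 if denom == 0 else (b - a) / denom


# Hypothetical (a, b) pairs, one per regime.
well_clustered = silhouette_from_ab(a=1.0, b=10.0)  # close to own cluster, far from others
on_boundary = silhouette_from_ab(a=5.0, b=5.0)      # equally close to both clusters
misassigned = silhouette_from_ab(a=10.0, b=1.0)     # closer to the neighbouring cluster

print(well_clustered, on_boundary, misassigned)
```

With these inputs the three values come out as 0.9, 0.0 and -0.9, matching the "well clustered", "boundary" and "wrong cluster" readings.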

### Silhouette Score for the Entire Dataset

The **Silhouette score for the entire dataset** is the average of the Silhouette scores of all individual points:

\begin{equation}
\text{Silhouette Score} = \frac{1}{n} \sum_{i=1}^{n} s(i)
\end{equation}

Where:
- \( n \) is the total number of data points.
- \( s(i) \) is the individual Silhouette score for point \( i \).

This overall score provides an indication of how well the clustering algorithm performed.

### Interpretation of the Silhouette Score

- **A high Silhouette score** (close to 1) means the clustering configuration is very good, with distinct, well-separated clusters.
- **A low Silhouette score** (close to 0) indicates that many data points lie on or near the boundary between two clusters, so the clusters overlap or are only weakly separated.
- **A negative Silhouette score** (close to -1) implies that many points are, on average, closer to a neighboring cluster than to their own, so they are likely assigned to the wrong clusters.

## Strengths of the Silhouette Score

- **Interpretability**: The Silhouette score provides a simple and intuitive measure of clustering quality.
- **Cluster Validation**: It can be used to assess the validity of different clustering solutions (e.g., comparing the Silhouette scores of different values of \( k \) in K-Means clustering).
- **Range of Values**: The score is bounded between -1 and 1, making it easy to interpret.
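
As a rough illustration of the cluster-validation use, the sketch below scores two candidate labelings of the nine example points used later in this notebook and keeps the one with the higher mean silhouette. The function `mean_silhouette` and the hand-written labelings are assumptions made for this example (in practice the labels would come from a clustering algorithm such as `KMeans`), and the vectorized formulation assumes every cluster has at least two points.

```python
import numpy as np


def mean_silhouette(X, labels):
    """Mean silhouette value (Euclidean distance); assumes no singleton clusters."""
    X = np.asarray(X, dtype=float)
    ids, inv, counts = np.unique(labels, return_inverse=True, return_counts=True)
    rows = np.arange(len(X))

    # Pairwise distance matrix, shape (n, n).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    # dist_sums[i, c] = total distance from point i to all points of cluster c.
    onehot = np.eye(len(ids))[inv]
    dist_sums = D @ onehot

    # a(i): mean distance to the other members of the own cluster
    # (the self-distance is 0, so only the divisor changes).
    a = dist_sums[rows, inv] / (counts[inv] - 1)

    # b(i): smallest mean distance to any other cluster.
    cluster_means = dist_sums / counts
    cluster_means[rows, inv] = np.inf  # exclude the own cluster from the minimum
    b = cluster_means.min(axis=1)

    return float(np.mean((b - a) / np.maximum(a, b)))


X = np.array([[1, 2], [2, 3], [3, 4], [8, 8], [9, 9], [8, 9],
              [15, 14], [16, 15], [15, 16]])

# Hypothetical candidate labelings, written by hand for illustration.
candidates = {
    2: np.array([0, 0, 0, 0, 0, 0, 1, 1, 1]),  # first two groups merged
    3: np.array([0, 0, 0, 1, 1, 1, 2, 2, 2]),  # one cluster per group
}
scores = {k: mean_silhouette(X, lab) for k, lab in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Here the three-cluster labeling scores higher than the merged two-cluster one, which is the kind of comparison used when choosing \( k \).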

## Limitations of the Silhouette Score

- **Favors Convex Clusters**: Because it is built on average pairwise distances, the Silhouette score tends to be higher for convex, roughly spherical clusters. It can rate a correct clustering poorly when the clusters are elongated, nested, or otherwise non-globular.
- **Sensitive to Outliers**: The score can be sensitive to outliers or noise, as these points inflate the average distances used in the cohesion and separation calculations.
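
The convexity limitation can be seen on a small synthetic case. The sketch below is an assumption-laden illustration, not part of the original notebook: it builds two concentric rings (radii 1 and 5, 40 points each), labels each ring as its own cluster, and computes per-point silhouette values. The shortcut used for $b$ is only valid because there are exactly two clusters.

```python
import numpy as np

# Two concentric rings: a non-convex structure with an obvious "correct" labeling.
theta = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
ring = np.column_stack([np.cos(theta), np.sin(theta)])
X = np.vstack([1.0 * ring, 5.0 * ring])   # inner ring (radius 1), outer ring (radius 5)
labels = np.array([0] * 40 + [1] * 40)    # each ring is its own cluster

# Pairwise Euclidean distances and a mask of same-cluster pairs.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
same = labels[:, None] == labels[None, :]

# a(i): mean distance to the other points of the same ring (self-distance is 0).
a = (D * same).sum(axis=1) / (same.sum(axis=1) - 1)
# b(i): mean distance to the other ring (valid only with exactly two clusters).
b = (D * ~same).sum(axis=1) / (~same).sum(axis=1)
s = (b - a) / np.maximum(a, b)

inner_mean = s[labels == 0].mean()
outer_mean = s[labels == 1].mean()
print(f"inner ring: {inner_mean:.2f}, outer ring: {outer_mean:.2f}, overall: {s.mean():.2f}")
```

Points on the outer ring are, on average, farther from each other than from the inner ring, so their silhouette values are negative even though the labeling matches the structure of the data.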

## Conclusion

The **Silhouette score** is an effective and widely used metric for evaluating the quality of clustering. It provides a balance between cohesion and separation, and it can help determine the optimal number of clusters in a dataset. While it is useful for many clustering tasks, it may not perform well with irregularly shaped clusters or datasets with significant noise.


In [33]:
# silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
# silhouette_score = (b-a)/max(b,a)
# a - cohesion
# b - separation

In [2]:
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
import numpy as np

def calculate_silhouette_score(data, labels):
    """
    Custom function to explain the silhouette score (simplified).

    This version deviates from the standard definition in two ways, so its
    result generally differs from sklearn.metrics.silhouette_score:
      - a averages over the whole cluster, including the point itself (distance 0);
      - b averages over all points outside the cluster, rather than taking the
        smallest per-cluster mean distance.

    Parameters:
        data (numpy.ndarray): The dataset (2D array).
        labels (numpy.ndarray): Cluster labels for each data point.

    Returns:
        float: The simplified silhouette score.
    """
    n_samples = len(data)
    silhouette_values = []

    for i in range(n_samples):
        current_cluster = labels[i]

        # Points in the same cluster (this includes point i itself)
        same_cluster = data[labels == current_cluster]

        # Points in all other clusters, pooled together
        other_clusters = data[labels != current_cluster]

        # Simplified cohesion (a): mean distance to same-cluster points, self included
        a = np.mean(np.linalg.norm(same_cluster - data[i], axis=1))

        # Simplified separation (b): mean distance to every point outside the cluster
        b = np.mean(np.linalg.norm(other_clusters - data[i], axis=1))

        # Silhouette value for the point
        silhouette_values.append((b - a) / max(a, b))

    # Average silhouette score
    return np.mean(silhouette_values)

# Example dataset (2D points for visualization)
data = np.array([
    [1, 2], [2, 3], [3, 4],   # Cluster 1
    [8, 8], [9, 9], [8, 9],   # Cluster 2
    [15, 14], [16, 15], [15, 16] # Cluster 3
])

In [3]:
data

array([[ 1,  2],
       [ 2,  3],
       [ 3,  4],
       [ 8,  8],
       [ 9,  9],
       [ 8,  9],
       [15, 14],
       [16, 15],
       [15, 16]])

In [4]:
# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42).fit(data)
labels = kmeans.labels_


In [5]:
# Compute silhouette score using sklearn
sil_score = silhouette_score(data, labels)
print(f"Silhouette Score using sklearn: {sil_score:.2f}")

Silhouette Score using sklearn: 0.82


In [6]:
labels

array([2, 2, 2, 0, 0, 0, 1, 1, 1], dtype=int32)

In [8]:
# Points in the same cluster
current_cluster = 2
same_cluster = data[labels == current_cluster]
# Points in other clusters
other_clusters = data[labels != current_cluster]

In [10]:
same_cluster,other_clusters

(array([[1, 2],
        [2, 3],
        [3, 4]]),
 array([[ 8,  8],
        [ 9,  9],
        [ 8,  9],
        [15, 14],
        [16, 15],
        [15, 16]]))

In [11]:
same_cluster-data[0]

array([[0, 0],
       [1, 1],
       [2, 2]])

In [14]:
np.linalg.norm(same_cluster-data[0],axis=1)

array([0.        , 1.41421356, 2.82842712])

In [15]:
np.sqrt(2)

1.4142135623730951

In [18]:
data

array([[ 1,  2],
       [ 2,  3],
       [ 3,  4],
       [ 8,  8],
       [ 9,  9],
       [ 8,  9],
       [15, 14],
       [16, 15],
       [15, 16]])

In [19]:
# axis=0: reduce along rows, giving one result per column (feature)
# axis=1: reduce along columns, giving one result per row (sample)
data.sum(axis=1)

array([ 3,  5,  7, 16, 18, 17, 29, 31, 31])

In [27]:
# Compute silhouette score using custom function
custom_score = calculate_silhouette_score(data, labels)
print(f"Silhouette Score using custom function: {custom_score:.2f}")

Silhouette Score using custom function: 0.91
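
The custom function reports 0.91 while scikit-learn reports 0.82. The gap comes from how the custom function computes its two averages: it includes the point itself in $a$ and pools all other clusters together for $b$. The sketch below follows the definition from the equation section more closely. The function name `silhouette_score_by_definition` is hypothetical, the labels are copied from the `labels` output shown earlier, and treating singleton clusters as scoring 0 is an assumed convention.

```python
import numpy as np


def silhouette_score_by_definition(data, labels):
    """Mean silhouette value using a(i) and b(i) as defined in the equation section."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    values = []

    for i in range(len(data)):
        own = labels[i]
        dist_i = np.linalg.norm(data - data[i], axis=1)  # distances from point i to every point

        same = labels == own
        n_same = same.sum()
        if n_same == 1:
            values.append(0.0)  # assumed convention for a singleton cluster
            continue

        # a(i): mean distance to the OTHER members of the cluster.
        # The self-distance is 0, so it only matters for the divisor.
        a = dist_i[same].sum() / (n_same - 1)

        # b(i): smallest mean distance to any single other cluster.
        b = min(dist_i[labels == other].mean() for other in cluster_ids if other != own)

        values.append((b - a) / max(a, b))

    return float(np.mean(values))


data = np.array([[1, 2], [2, 3], [3, 4], [8, 8], [9, 9], [8, 9],
                 [15, 14], [16, 15], [15, 16]])
labels = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])  # copied from the notebook output

score = silhouette_score_by_definition(data, labels)
print(f"Silhouette score by definition: {score:.2f}")
```

On this dataset the sketch gives roughly 0.82, in line with the scikit-learn value.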


In [38]:
# Explanation with numeric example
example_point = data[0]  # First data point
example_cluster = labels[0]  # Cluster of the first data point
same_cluster_points = data[labels == example_cluster]
other_cluster_points = data[labels != example_cluster]

# Simplified a: mean distance to same-cluster points, including the point itself (distance 0)
a_example = np.mean(np.linalg.norm(same_cluster_points - example_point, axis=1))
# Simplified b: mean distance to all points outside the cluster, not only the nearest cluster
b_example = np.mean(np.linalg.norm(other_cluster_points - example_point, axis=1))

print("\nExplanation with Numeric Example:")
print(f"Point: {example_point}")
print(f"same cluster points :\n{same_cluster_points}")
print(f"other cluster points :\n{other_cluster_points}")
print(f'same cluster difference :\n{same_cluster_points-example_point}')
print(f'other cluster difference :\n{other_cluster_points-example_point}')

print('*'*50)
print(f'norm of the vectors  a : \n{np.linalg.norm(same_cluster_points-example_point,axis=1)}')
print(f'norm of the vectors  b : \n{np.linalg.norm(other_cluster_points-example_point,axis=1)}')
print(f'cohesion = {np.mean(np.linalg.norm(same_cluster_points-example_point,axis=1))}')
print(f'separation = {np.mean(np.linalg.norm(other_cluster_points-example_point,axis=1))}')

print('*'*50)
print(f"Average intra-cluster distance (a): {a_example:.2f}")
print(f"Average other-cluster distance (b): {b_example:.2f}")
print(f"Silhouette value for the point: {(b_example - a_example) / max(a_example, b_example):.2f}")


Explanation with Numeric Example:
Point: [1 2]
same cluster points :
[[1 2]
 [2 3]
 [3 4]]
other cluster points :
[[ 8  8]
 [ 9  9]
 [ 8  9]
 [15 14]
 [16 15]
 [15 16]]
same cluster difference :
[[0 0]
 [1 1]
 [2 2]]
other cluster difference :
[[ 7  6]
 [ 8  7]
 [ 7  7]
 [14 12]
 [15 13]
 [14 14]]
**************************************************
norm of the vectors  a : 
[0.         1.41421356 2.82842712]
norm of the vectors  b : 
[ 9.21954446 10.63014581  9.89949494 18.43908891 19.84943324 19.79898987]
cohesion = 1.4142135623730951
separation = 14.639449539287922
**************************************************
Average intra-cluster distance (a): 1.41
Average other-cluster distance (b): 14.64
Silhouette value for the point: 0.90


# Davies-Bouldin Score

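The **Davies-Bouldin Score** evaluates clustering quality by measuring, for each cluster, its similarity to the cluster it resembles most, and then averaging these values. Similarity here is the ratio of within-cluster scatter to the distance between centroids. Lower scores indicate better clustering, and the minimum possible value is 0.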

The formula is:

\begin{equation}
DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d_{ij}} \right)
\end{equation}

Where:
- $k$: The number of clusters.
- $\sigma_i$: Scatter within cluster $i$, defined as the average distance of all points in cluster $i$ to the cluster centroid $c_i$:
  \begin{equation}
  \sigma_i = \frac{1}{|C_i|} \sum_{x \in C_i} \text{dist}(x, c_i)
  \end{equation}
  Here, $|C_i|$ is the number of points in cluster $i$.

- $d_{ij}$: Distance between centroids of clusters $i$ and $j$, typically computed as:
  \begin{equation}
  d_{ij} = \text{dist}(c_i, c_j)
  \end{equation}

**Steps to Compute Davies-Bouldin Score**:
1. For each cluster $i$, calculate its scatter $\sigma_i$.
2. For each pair of clusters $(i, j)$, compute the **similarity**:
   \begin{equation}
   R_{ij} = \frac{\sigma_i + \sigma_j}{d_{ij}}
   \end{equation}
3. For each cluster $i$, find the maximum $R_{ij}$ across all other clusters $j \neq i$.
4. Average these maximum values over all clusters to get $DB$:
   \begin{equation}
   DB = \frac{1}{k} \sum_{i=1}^k \max_{j \neq i} R_{ij}
   \end{equation}

**Interpretation**:
- **Lower $DB$**: Clusters are compact (low $\sigma$) and well-separated (high $d_{ij}$).
- **Higher $DB$**: Indicates poor clustering (overlapping or dispersed clusters).
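
To make the interpretation concrete, the sketch below evaluates the pairwise ratio $R_{ij}$ for two hypothetical situations. The helper name `pair_similarity` and the scatter and distance values are made up for illustration only.

```python
def pair_similarity(scatter_i: float, scatter_j: float, centroid_distance: float) -> float:
    """Davies-Bouldin pairwise ratio R_ij = (sigma_i + sigma_j) / d_ij."""
    return (scatter_i + scatter_j) / centroid_distance


# Hypothetical values: compact clusters that are far apart ...
compact_far = pair_similarity(scatter_i=0.5, scatter_j=0.5, centroid_distance=10.0)
# ... versus dispersed clusters whose centroids are close together.
dispersed_near = pair_similarity(scatter_i=2.0, scatter_j=2.0, centroid_distance=3.0)

print(f"compact and far apart: {compact_far:.2f}")
print(f"dispersed and close:   {dispersed_near:.2f}")
```

The first pair contributes a small ratio (0.10) and the second a large one (about 1.33), which is why a lower $DB$ corresponds to compact, well-separated clusters.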


In [39]:
from sklearn.metrics import davies_bouldin_score
from sklearn.cluster import KMeans
import numpy as np

def calculate_davies_bouldin_score(data, labels):
    """
    Custom function to explain the Davies-Bouldin score.

    Parameters:
        data (numpy.ndarray): The dataset (2D array).
        labels (list or numpy.ndarray): Cluster labels for each data point.

    Returns:
        float: The Davies-Bouldin score.
    """
    cluster_ids = np.unique(labels)  # Actual label values; they need not be 0..k-1
    n_clusters = len(cluster_ids)
    cluster_means = []  # Centroids of clusters
    intra_cluster_distances = []  # Scatter within clusters

    # Compute cluster centroids and intra-cluster distances
    for cluster in cluster_ids:
        cluster_points = data[labels == cluster]
        centroid = np.mean(cluster_points, axis=0)
        cluster_means.append(centroid)
        scatter = np.mean(np.linalg.norm(cluster_points - centroid, axis=1))
        intra_cluster_distances.append(scatter)

    # Compute pairwise Davies-Bouldin components
    db_values = []
    for i in range(n_clusters):
        max_ratio = 0
        for j in range(n_clusters):
            if i != j:
                inter_cluster_distance = np.linalg.norm(cluster_means[i] - cluster_means[j])
                ratio = (intra_cluster_distances[i] + intra_cluster_distances[j]) / inter_cluster_distance
                max_ratio = max(max_ratio, ratio)
        db_values.append(max_ratio)

    # Davies-Bouldin score is the average of the maximum ratios
    return np.mean(db_values)

# Example dataset (2D points for visualization)
data = np.array([
    [1, 2], [2, 3], [3, 4],   # Cluster 1
    [8, 8], [9, 9], [8, 9],   # Cluster 2
    [15, 14], [16, 15], [15, 16] # Cluster 3
])

# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42).fit(data)
labels = kmeans.labels_

# Compute Davies-Bouldin score using sklearn
db_score = davies_bouldin_score(data, labels)
print(f"Davies-Bouldin Score using sklearn: {db_score:.2f}")

# Compute Davies-Bouldin score using custom function
custom_db_score = calculate_davies_bouldin_score(data, labels)
print(f"Davies-Bouldin Score using custom function: {custom_db_score:.2f}")

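# Explanation with numeric example
# Note: "Cluster 1" and "Cluster 2" below are the clusters with KMeans labels 0 and 1,
# which need not match the order of the groups in the data array.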
print("\nExplanation with Numeric Example:")
cluster_1_points = data[labels == 0]
cluster_2_points = data[labels == 1]
centroid_1 = np.mean(cluster_1_points, axis=0)
centroid_2 = np.mean(cluster_2_points, axis=0)

# Intra-cluster distances (scatter)
scatter_1 = np.mean(np.linalg.norm(cluster_1_points - centroid_1, axis=1))
scatter_2 = np.mean(np.linalg.norm(cluster_2_points - centroid_2, axis=1))

# Inter-cluster distance
inter_cluster_distance = np.linalg.norm(centroid_1 - centroid_2)

# Davies-Bouldin ratio for cluster 1 and 2
db_ratio = (scatter_1 + scatter_2) / inter_cluster_distance

print(f"Cluster 1 Centroid: {centroid_1}")
print(f"Cluster 2 Centroid: {centroid_2}")
print(f"Cluster 1 Scatter: {scatter_1:.2f}")
print(f"Cluster 2 Scatter: {scatter_2:.2f}")
print(f"Inter-Cluster Distance: {inter_cluster_distance:.2f}")
print(f"Davies-Bouldin Ratio (Cluster 1 & 2): {db_ratio:.2f}")

Davies-Bouldin Score using sklearn: 0.18
Davies-Bouldin Score using custom function: 0.18

Explanation with Numeric Example:
Cluster 1 Centroid: [8.33333333 8.66666667]
Cluster 2 Centroid: [15.33333333 15.        ]
Cluster 1 Scatter: 0.65
Cluster 2 Scatter: 0.92
Inter-Cluster Distance: 9.44
Davies-Bouldin Ratio (Cluster 1 & 2): 0.17


**Steps for Multiple Clusters**:

* **Compute centroids for each cluster**: Find the centroid of each cluster by taking the mean of all points in the cluster.

* **Calculate scatter ($\sigma_i$) for each cluster**: For each cluster, compute the average distance of all points in that cluster to the cluster's centroid.

* **Compute pairwise centroid distances ($d_{ij}$)**: For each pair of clusters $i$ and $j$, calculate the distance between their centroids.

* **Compute $R_{ij}$ for all cluster pairs**: For each pair of clusters $i$ and $j$, compute:
  \begin{equation}
  R_{ij} = \frac{\sigma_i + \sigma_j}{d_{ij}}
  \end{equation}
  This measures the similarity between clusters $i$ and $j$.

* **Find the maximum $R_{ij}$ for each cluster**: For each cluster $i$, find the maximum $R_{ij}$ across all other clusters $j$. This represents how "bad" the separation is for cluster $i$.

* **Compute the average of the maximum $R_{ij}$**: Take the average of the maximum $R_{ij}$ values across all clusters to get the Davies-Bouldin Score:
  \begin{equation}
  DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} R_{ij}
  \end{equation}
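
The steps for multiple clusters can be sketched as a single matrix computation. This is a minimal illustration under assumptions: the function name `davies_bouldin_matrix` is hypothetical, Euclidean distance is used throughout, and the labels are the ones printed earlier in the notebook.

```python
import numpy as np


def davies_bouldin_matrix(X, labels):
    """Return the full R matrix and the Davies-Bouldin score."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)

    # Centroid and scatter (mean distance to centroid) of every cluster.
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[idx], axis=1).mean()
        for idx, c in enumerate(ids)
    ])

    # Pairwise centroid distances d_ij; the diagonal is set to infinity so that
    # R_ii becomes 0 and never wins the maximum.
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)

    # R_ij = (sigma_i + sigma_j) / d_ij for all pairs at once.
    R = (scatter[:, None] + scatter[None, :]) / d

    # Worst (largest) ratio per cluster, averaged over clusters.
    return R, float(R.max(axis=1).mean())


X = np.array([[1, 2], [2, 3], [3, 4], [8, 8], [9, 9], [8, 9],
              [15, 14], [16, 15], [15, 16]])
labels = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])  # copied from the notebook output

R, db = davies_bouldin_matrix(X, labels)
print(np.round(R, 3))
print(f"Davies-Bouldin score: {db:.2f}")
```

Row $i$ of `R` holds the ratios of cluster $i$ against every other cluster, so the per-cluster maximum is a row-wise `max`. On this dataset the result is about 0.18, consistent with the scikit-learn output shown above.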

**Silhouette Score: measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).**

**Davies-Bouldin (DB) Score: evaluates the average similarity between each cluster and its most similar cluster.**

# Calinski-Harabasz (CH) Score for Clustering Evaluation

The **Calinski-Harabasz index** (CH score) is a metric used to evaluate the quality of clustering results. It measures how well-separated and compact the clusters are. A higher CH score generally indicates better clustering.

## Key Concept

The CH score is based on the dispersion of data points within clusters and between clusters. The goal is to have well-separated clusters (large between-cluster dispersion) and compact clusters (small within-cluster dispersion). The score is calculated as the ratio of between-cluster dispersion to within-cluster dispersion.

### Between-Cluster Dispersion

The **between-cluster dispersion matrix** \( B_k \) quantifies the spread of the cluster centroids around the global centroid (the mean of all data points). Each cluster contributes the squared distance between its centroid and the global centroid, weighted by the number of points in the cluster. The farther the cluster centroids are from the global centroid, the higher the between-cluster dispersion.

### Within-Cluster Dispersion

The **within-cluster dispersion matrix** \( W_k \) measures the compactness of each cluster. Its trace is the sum of squared distances between each data point and the centroid of its cluster. Smaller within-cluster dispersion indicates that the data points are tightly grouped around their respective centroids.

### Calinski-Harabasz Index Equation

The Calinski-Harabasz index (CH score) is given by the following formula:

\[
CH = \frac{\text{tr}(B_k) / (k - 1)}{\text{tr}(W_k) / (n - k)}
\]

Where:
- \( B_k \) is the **between-cluster dispersion matrix**.
- \( W_k \) is the **within-cluster dispersion matrix**.
- \( n \) is the total number of data points.
- \( k \) is the number of clusters.
- \( \text{tr}() \) represents the trace of the matrix (sum of diagonal elements).

### Intuition Behind the Formula

- The numerator is the between-cluster dispersion divided by its degrees of freedom, \( k - 1 \).
- The denominator is the within-cluster dispersion divided by its degrees of freedom, \( n - k \).
- A higher CH score means that the clusters are well-separated (larger between-cluster dispersion) and compact (smaller within-cluster dispersion).

## Calculation Steps

1. **Compute the Centroids**:
   - Calculate the centroid for each cluster. The centroid is the mean of all data points in the cluster.
   - Compute the global centroid (mean of all data points in the dataset).

2. **Calculate the Within-Cluster Dispersion**:
   - For each cluster, calculate the sum of squared distances between each point and the cluster's centroid.
   - Sum these values across all clusters to get the trace of the within-cluster dispersion matrix, \( \text{tr}(W_k) \).

3. **Calculate the Between-Cluster Dispersion**:
   - For each cluster, calculate the squared distance between the cluster's centroid and the global centroid, and multiply it by the number of points in the cluster.
   - Sum these weighted values across all clusters to get the trace of the between-cluster dispersion matrix, \( \text{tr}(B_k) \).

4. **Compute the CH Score**:
   - Use the formula mentioned above to compute the CH score by calculating the trace of the dispersion matrices and plugging them into the formula.
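
Because only the traces enter the formula, the calculation steps can be sketched with scalar sums of squares instead of full matrices. The function name `ch_from_traces` is hypothetical, and the dataset and hand-written labels below are the nine example points from the earlier sections, used here as an assumption for illustration. The sketch also checks the identity that the between and within sums of squares add up to the total sum of squares.

```python
import numpy as np


def ch_from_traces(X, labels):
    """Calinski-Harabasz score computed from tr(B_k) and tr(W_k) only."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    ids = np.unique(labels)
    k = len(ids)
    global_centroid = X.mean(axis=0)

    tr_W = 0.0  # within-cluster sum of squared distances
    tr_B = 0.0  # size-weighted squared distances of centroids to the global centroid
    for c in ids:
        pts = X[labels == c]
        centroid = pts.mean(axis=0)
        tr_W += ((pts - centroid) ** 2).sum()
        tr_B += len(pts) * ((centroid - global_centroid) ** 2).sum()

    ch = (tr_B / (k - 1)) / (tr_W / (n - k))
    return ch, tr_B, tr_W


X = np.array([[1, 2], [2, 3], [3, 4], [8, 8], [9, 9], [8, 9],
              [15, 14], [16, 15], [15, 16]])
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # hand-written labels, one per group

ch, tr_B, tr_W = ch_from_traces(X, labels)
total_ss = ((X - X.mean(axis=0)) ** 2).sum()

print(f"tr(B_k) = {tr_B:.3f}, tr(W_k) = {tr_W:.3f}, total = {total_ss:.3f}")
print(f"CH score = {ch:.2f}")
```

A result computed this way can be cross-checked against `sklearn.metrics.calinski_harabasz_score(X, labels)` when scikit-learn is available.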



In [None]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def compute_ch_score(X, labels):
    """
    Compute the Calinski-Harabasz (CH) score manually.

    Parameters:
        X : ndarray, shape (n_samples, n_features)
            The data points.
        labels : ndarray, shape (n_samples,)
            The cluster labels for each data point.

    Returns:
        ch_score : float
            The Calinski-Harabasz (CH) score.
    """
    # Number of samples and number of clusters
    n_samples, n_features = X.shape
    k = len(np.unique(labels))

    # Compute the overall centroid (mean of all data points)
    global_centroid = np.mean(X, axis=0)

    # Initialize matrices to calculate within and between cluster dispersion
    W_k = np.zeros((n_features, n_features))  # Within-cluster dispersion matrix
    B_k = np.zeros((n_features, n_features))  # Between-cluster dispersion matrix

    # Iterate through each cluster (use the actual label values, not 0..k-1)
    for label in np.unique(labels):
        # Select points in the current cluster
        cluster_points = X[labels == label]
        cluster_size = len(cluster_points)

        # Compute the centroid of the current cluster
        cluster_centroid = np.mean(cluster_points, axis=0)

        # Within-cluster dispersion: scatter matrix of points around the cluster centroid
        centered = cluster_points - cluster_centroid
        W_k += np.dot(centered.T, centered)

        # Between-cluster dispersion: size-weighted outer product of the centroid offset.
        # np.outer is needed here; np.dot on 1-D vectors returns a scalar, which would be
        # added to every entry of B_k and inflate the trace by a factor of n_features.
        centroid_offset = cluster_centroid - global_centroid
        B_k += cluster_size * np.outer(centroid_offset, centroid_offset)

    # Compute the trace of both dispersion matrices
    tr_W_k = np.trace(W_k)
    tr_B_k = np.trace(B_k)

    # Compute the Calinski-Harabasz Score
    ch_score = (tr_B_k / (k - 1)) / (tr_W_k / (n_samples - k))

    return ch_score

# Create a synthetic dataset with clusters
X, y = make_blobs(n_samples=100, centers=3, random_state=42)

# Perform KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Compute the Calinski-Harabasz Score manually
ch_score_manual = compute_ch_score(X, kmeans.labels_)

print(f"Manual Calinski-Harabasz Score: {ch_score_manual}")
