# <center style='color:tan'> Internal Cluster Validation: `Silhouette` Score, `Davies-Bouldin (DB)` Score, `Calinski-Harabasz (CH)` Score </center>

# 1. Import required libraries

In [1]:
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score
from sklearn.metrics import calinski_harabasz_score

import pandas as pd
import numpy as np

# 2. Create dataset 

In [2]:
features, _ = datasets.make_classification(n_samples=150, n_features=4, random_state=42)

##### `Internal cluster validation` is applicable when ground-truth labels are unavailable. `make_classification` does return class labels, but we discard them (the `_` above) and keep only the features.

In [3]:
features.shape # (samples, features)

(150, 4)

# 3. Perform preprocessing

In [4]:
scaler = StandardScaler()
scaled = scaler.fit_transform(features)

# 4. Create a dataframe

In [5]:
df = pd.DataFrame(scaled)
df.head()

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.829921 | -0.17824 | -0.833476 | 0.501655 |
| 1 | -1.743379 | -1.32268 | 1.694849 | -2.25514 |
| 2 | -0.683461 | -0.688386 | 0.65883 | -1.004322 |
| 3 | 0.120082 | 1.337376 | -0.075615 | 1.037536 |
| 4 | 0.281618 | -0.579304 | -0.299945 | -0.197034 |


In [6]:
df.shape

(150, 4)

# 5. Perform K-Means clustering with 3 clusters

In [7]:
kmeans1 = KMeans(n_clusters=3, n_init='auto', random_state=42)
kmeans1.fit(df)
preds1 = kmeans1.labels_

# 6. Perform K-Means clustering with 4 clusters

In [8]:
kmeans2 = KMeans(n_clusters=4, n_init='auto', random_state=42)
kmeans2.fit(df)
preds2 = kmeans2.labels_

# 7. Add two new columns to our dataframe

In [9]:
df['Prediction1'] = preds1
df['Prediction2'] = preds2
df.head()

|   | 0 | 1 | 2 | 3 | Prediction1 | Prediction2 |
|---|---|---|---|---|---|---|
| 0 | 0.829921 | -0.17824 | -0.833476 | 0.501655 | 1 | 1 |
| 1 | -1.743379 | -1.32268 | 1.694849 | -2.25514 | 2 | 2 |
| 2 | -0.683461 | -0.688386 | 0.65883 | -1.004322 | 2 | 2 |
| 3 | 0.120082 | 1.337376 | -0.075615 | 1.037536 | 1 | 3 |
| 4 | 0.281618 | -0.579304 | -0.299945 | -0.197034 | 1 | 1 |


In [10]:
print(df['Prediction1'].unique())
print(df['Prediction2'].unique())

[1 2 0]
[1 2 3 0]


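The two lines above list the unique cluster labels assigned by `K-Means` with 3 and 4 clusters, respectively. The label values are arbitrary identifiers: label `1` in the 3-cluster run does not have to correspond to label `1` in the 4-cluster run.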

# 8. Separate features and labels

In [11]:
df_f = df.iloc[:, :-2]
df_l1 = df['Prediction1']
df_l2 = df['Prediction2']

# 9. Calculate `Silhouette` score

In [12]:
print('Silhouette score for 3-clusters:', round(silhouette_score(df_f, df_l1), 5))
print('Silhouette score for 4-clusters:', round(silhouette_score(df_f, df_l2), 5))

Silhouette score for 3-clusters: 0.36634
Silhouette score for 4-clusters: 0.39327


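## A higher value of `Silhouette` score indicates better clustering.

The score ranges from -1 to 1. Here the 4-cluster solution (0.39327) scores slightly higher than the 3-cluster solution (0.36634).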

# 10. Calculate `Davies-Bouldin (DB)` score 

In [13]:
print('Davies-Bouldin (DB) score for 3-clusters:', round(davies_bouldin_score(df_f, df_l1), 5))
print('Davies-Bouldin (DB) score for 4-clusters:', round(davies_bouldin_score(df_f, df_l2), 5))

Davies-Bouldin (DB) score for 3-clusters: 0.90118
Davies-Bouldin (DB) score for 4-clusters: 0.87543


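## A lower value of `Davies-Bouldin (DB)` score indicates better clustering.

The minimum possible value is 0. Here the 4-cluster solution (0.87543) scores lower than the 3-cluster solution (0.90118).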

# 11. Calculate `Calinski-Harabasz (CH)` score

In [14]:
print('Calinski-Harabasz (CH) score for 3-clusters:', round(calinski_harabasz_score(df_f, df_l1), 5))
print('Calinski-Harabasz (CH) score for 4-clusters:', round(calinski_harabasz_score(df_f, df_l2), 5))

Calinski-Harabasz (CH) score for 3-clusters: 99.92173
Calinski-Harabasz (CH) score for 4-clusters: 125.73525


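## A higher value of `Calinski-Harabasz (CH)` score indicates better clustering.

Here the 4-cluster solution (125.73525) scores higher than the 3-cluster solution (99.92173), so all three metrics favor 4 clusters over 3 for this dataset.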