# Clustering with K-Means and K-Medoids

In this notebook, we cluster a synthetic dataset with the K-Means and K-Medoids algorithms. We estimate the number of clusters with the Elbow Method, the Silhouette Score, the Calinski-Harabasz index, and the Gap Statistic, then visualize the clustering results, compare clustering quality, and plot decision boundaries.

---


## 1. Import Required Libraries

We begin by importing the necessary libraries for clustering, generating synthetic data, and visualizing the results.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy.spatial.distance import cdist


## 2. Generate Synthetic Data

We generate a synthetic 2-D dataset of four Gaussian clusters (100 points each, standard deviation 0.5) to use as input for clustering.


In [None]:
np.random.seed(42)

# Generate synthetic data for clustering
X = np.vstack([
    np.random.normal(loc=[2, 2], scale=0.5, size=(100, 2)),  # Cluster 1
    np.random.normal(loc=[6, 6], scale=0.5, size=(100, 2)),  # Cluster 2
    np.random.normal(loc=[10, 2], scale=0.5, size=(100, 2)),  # Cluster 3
    np.random.normal(loc=[6, -3], scale=0.5, size=(100, 2))   # Cluster 4
])

plt.scatter(X[:, 0], X[:, 1], s=30, alpha=0.6)
plt.title("Generated Data for Clustering")
plt.show()
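
If ground-truth cluster labels are wanted for later checks, a similar dataset can be produced with scikit-learn's `make_blobs`. This is an alternative sketch, not the cell used above. The names `X_blobs` and `y_true` are illustrative, and the sampled points will differ from `X` because the random draws are made differently.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same four centers and spread as the cell above
centers = np.array([[2, 2], [6, 6], [10, 2], [6, -3]])

# 400 samples are split evenly across the four centers
X_blobs, y_true = make_blobs(
    n_samples=400, centers=centers, cluster_std=0.5, random_state=42
)

print(X_blobs.shape)        # data matrix shape
print(np.bincount(y_true))  # number of points per true cluster
```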


## 3. Determine Optimal K Using the Elbow Method

The Elbow Method is used to find the optimal number of clusters by plotting the distortion (within-cluster sum of squared distances) as a function of K.


In [None]:
distortions = []
k_values = range(1, 11)
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    # inertia_ is the within-cluster sum of squared distances to the nearest centroid
    distortions.append(kmeans.inertia_)

plt.plot(k_values, distortions, marker='o')
plt.xlabel('Number of clusters, K')
plt.ylabel('Distortion')
plt.title('Elbow Method for Optimal K')
plt.show()
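
To make the distortion concrete, the sketch below recomputes the within-cluster sum of squared distances by hand on a small toy dataset and compares it with the `inertia_` attribute that scikit-learn reports. The data and the names `X_demo`, `km`, and `wcss_manual` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two small, well-separated blobs (toy data, not the notebook's X)
X_demo = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(20, 2)),
    rng.normal(loc=[5, 5], scale=0.3, size=(20, 2)),
])

km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X_demo)

# Squared distance from every point to the centroid of its own cluster
assigned_centers = km.cluster_centers_[km.labels_]
wcss_manual = np.sum((X_demo - assigned_centers) ** 2)

# The two values should agree up to floating-point error
print(wcss_manual, km.inertia_)
```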


## 4. Determine Optimal K Using Silhouette Score

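The Silhouette Score assesses clustering quality by comparing each point's cohesion (mean distance to the other points in its own cluster) with its separation (mean distance to the points in the nearest other cluster). It ranges from -1 to 1, and a higher score indicates better-defined clusters. The score is undefined for a single cluster, so the search starts at K=2.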


In [None]:
silhouette_scores = []
k_values = range(2, 11)
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X)
    silhouette_scores.append(silhouette_score(X, labels))

plt.plot(k_values, silhouette_scores, marker='o')
plt.xlabel('Number of clusters, K')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for Optimal K')
plt.show()
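
As a worked example of the definition, the sketch below computes the silhouette value of a single point by hand, using s = (b - a) / max(a, b), and compares it with scikit-learn's `silhouette_samples`. The toy data and the names `X_demo`, `labels_demo`, `a`, `b`, and `s_manual` are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_samples

# Toy data: two tight groups on a line (not the notebook's X)
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
labels_demo = np.array([0, 0, 0, 1, 1])

D = cdist(X_demo, X_demo)  # pairwise Euclidean distances
i = 0                      # point whose silhouette we compute by hand

same = labels_demo == labels_demo[i]
same[i] = False                    # exclude the point itself
a = D[i, same].mean()              # cohesion: mean distance within its own cluster
b = D[i, labels_demo == 1].mean()  # separation: mean distance to the only other cluster
s_manual = (b - a) / max(a, b)

print(s_manual, silhouette_samples(X_demo, labels_demo)[i])
```

Here a = 1.5 and b = 10.5, so the point's silhouette is 9 / 10.5, close to 1 because the point sits far from the other cluster.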


## 5. Determine Optimal K Using the Calinski-Harabasz (CH) Index

The Calinski-Harabasz (CH) index assesses clustering quality as the ratio of between-cluster dispersion to within-cluster dispersion, each scaled by its degrees of freedom. A higher CH index indicates better-defined clusters.

In [None]:
from sklearn.metrics import calinski_harabasz_score

ch_scores = []
k_values = range(2, 11)

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X)
    ch_scores.append(calinski_harabasz_score(X, labels))

plt.plot(k_values, ch_scores, marker='o')
plt.xlabel('Number of clusters, K')
plt.ylabel('Calinski-Harabasz Index')
plt.title('Calinski-Harabasz Index for Optimal K')
plt.show()
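
The index can also be computed directly from its definition, CH = [B / (k - 1)] / [W / (n - k)], where B is the between-cluster dispersion, W the within-cluster dispersion, n the number of points, and k the number of clusters. The sketch below does this on toy data and compares the result with `calinski_harabasz_score`. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
# Three toy blobs (not the notebook's X)
X_demo = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(30, 2)),
    rng.normal(loc=[4, 4], scale=0.5, size=(30, 2)),
    rng.normal(loc=[8, 0], scale=0.5, size=(30, 2)),
])
labels_demo = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X_demo)

n, k = X_demo.shape[0], len(np.unique(labels_demo))
overall_mean = X_demo.mean(axis=0)

between = 0.0  # B: size-weighted squared distance of cluster centroids to the overall mean
within = 0.0   # W: squared distance of points to their own cluster centroid
for c in np.unique(labels_demo):
    pts = X_demo[labels_demo == c]
    centroid = pts.mean(axis=0)
    between += len(pts) * np.sum((centroid - overall_mean) ** 2)
    within += np.sum((pts - centroid) ** 2)

ch_manual = (between / (k - 1)) / (within / (n - k))
print(ch_manual, calinski_harabasz_score(X_demo, labels_demo))
```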


## 6. Gap Statistic Method for K-Selection

We'll define the Gap Statistic method. The steps are as follows:

1. Perform K-Means clustering on the real data for each k (number of clusters).
2. For each k, calculate the WCSS (Within-Cluster Sum of Squares).
3. Generate random uniform data (as the null hypothesis) and perform K-Means on this data.
4. Compute the gap statistic for each k, the difference between log(WCSS) on the random data and log(WCSS) on the real data, and select the value of k that maximizes the gap.

In [None]:
def compute_wcss(X, k):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    return kmeans.inertia_

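def gap_statistic(X, k_max, n_refs=10, seed=42):
    rng = np.random.default_rng(seed)
    # Sample reference data uniformly within each feature's own range
    low, high = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        # WCSS for real data
        wcss_real = compute_wcss(X, k)

        # log(WCSS) for several uniform reference datasets (null hypothesis);
        # averaging over n_refs draws reduces the noise of a single sample
        log_wcss_refs = [
            np.log(compute_wcss(rng.uniform(low, high, size=X.shape), k))
            for _ in range(n_refs)
        ]

        # Gap statistic: mean reference log(WCSS) minus real log(WCSS)
        gap = np.mean(log_wcss_refs) - np.log(wcss_real)
        gaps.append(gap)

    return gaps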

# Compute the Gap statistic for different values of k
k_max = 10
gaps = gap_statistic(X, k_max)

# Plot the Gap statistic
plt.plot(range(1, k_max + 1), gaps, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Gap Statistic')
plt.title('Gap Statistic for K-Selection')
plt.show()
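
Picking the k with the largest gap is a simplification. A commonly cited selection rule for the gap statistic also uses a simulation error term s(k) and chooses the smallest k with Gap(k) >= Gap(k+1) - s(k+1). The sketch below is one possible way to do that, not the notebook's method. The helper names `gap_with_se` and `select_k`, the toy data, and the fallback to the argmax are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_with_se(X, k_max, n_refs=10, seed=0):
    """Return (gaps, s), where s[k-1] is the simulation error term for k clusters."""
    rng = np.random.default_rng(seed)
    low, high = X.min(axis=0), X.max(axis=0)
    gaps, s = [], []
    for k in range(1, k_max + 1):
        log_w = np.log(KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_)
        # log(WCSS) on uniform reference datasets drawn inside the data's bounding box
        log_w_refs = np.array([
            np.log(KMeans(n_clusters=k, random_state=0, n_init=10)
                   .fit(rng.uniform(low, high, size=X.shape)).inertia_)
            for _ in range(n_refs)
        ])
        gaps.append(log_w_refs.mean() - log_w)
        s.append(log_w_refs.std() * np.sqrt(1 + 1 / n_refs))
    return np.array(gaps), np.array(s)

def select_k(gaps, s):
    """Smallest k with Gap(k) >= Gap(k+1) - s(k+1); fall back to the argmax (assumption)."""
    for k in range(1, len(gaps)):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k
    return int(np.argmax(gaps)) + 1

# Toy data with the same four centers as the notebook (drawn independently of X)
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(100, 2))
    for c in [[2, 2], [6, 6], [10, 2], [6, -3]]
])

gaps, s = gap_with_se(X_demo, k_max=6)
best_k = select_k(gaps, s)
print(best_k)
```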


## 7. Apply K-Means Clustering

The Elbow and Silhouette methods indicate K=4, which matches the four clusters used to generate the data. We apply K-Means with K=4 and visualize the results together with the cluster centroids.


In [None]:
optimal_k = 4  # Chosen based on Elbow and Silhouette methods
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis', s=30, alpha=0.6)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='X', label='Centroids')
plt.title("K-Means Clustering")
plt.legend()
plt.show()


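## 8. Apply K-Medoids Clustering

Similarly, we apply K-Medoids clustering and visualize the results with the medoids. The implementation below alternates between assigning each point to its nearest medoid and re-selecting each cluster's medoid, in the style of K-Means. This is a simpler heuristic than the swap-based search of the full PAM algorithm.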


In [None]:
# Number of clusters
optimal_k = 4

# Randomly initialize medoids
np.random.seed(42)
medoids = X[np.random.choice(X.shape[0], optimal_k, replace=False)]
prev_medoids = None

# Repeat until medoids don't change
while not np.array_equal(medoids, prev_medoids):
    prev_medoids = medoids.copy()

    # Compute distances from each point to each medoid
    distances = cdist(X, medoids, 'euclidean')
    labels = np.argmin(distances, axis=1)  # Assign points to nearest medoid

    # Update medoids by choosing the point that minimizes the total distance within each cluster
    for i in range(optimal_k):
        cluster_points = X[labels == i]
        if len(cluster_points) > 0:
            medoid_idx = np.argmin(np.sum(cdist(cluster_points, cluster_points, 'euclidean'), axis=1))
            medoids[i] = cluster_points[medoid_idx]

# Plot the results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30, alpha=0.6)
plt.scatter(medoids[:, 0], medoids[:, 1], c='red', marker='X', label='Medoids')
plt.title("K-Medoids Clustering")
plt.legend()
plt.show()
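
One reason to prefer a medoid over a centroid is that it is always an actual data point and is less sensitive to outliers. The sketch below illustrates this on a toy cluster with one extreme point, using the same "smallest total distance" rule as the update step above. The array `pts` is made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy cluster with one outlier (not the notebook's X)
pts = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0], [100.0, 1.0]])

# Centroid: the coordinate-wise mean, which need not be a data point
centroid = pts.mean(axis=0)

# Medoid: the member with the smallest total distance to all other members
total_dist = cdist(pts, pts, 'euclidean').sum(axis=1)
medoid = pts[np.argmin(total_dist)]

print("centroid:", centroid)  # pulled toward the outlier
print("medoid:  ", medoid)    # stays inside the dense group
```

The centroid lands at x = 22, far from every point in the dense group, while the medoid is the point at x = 3.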


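## 9. Compare K-Means and K-Medoids

We report the objective that each algorithm minimizes: inertia (sum of squared distances to the nearest centroid) for K-Means and cost (sum of unsquared distances to the nearest medoid) for K-Medoids. Because one is squared and the other is not, the two numbers are on different scales and should not be compared directly.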


In [None]:
print(f"K-Means Inertia: {kmeans.inertia_}")
kmedoids_cost = sum(np.min(cdist(X, medoids, 'euclidean'), axis=1))
print(f"K-Medoids Cost: {kmedoids_cost}")
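
To put both algorithms on the same footing, one option is to evaluate the same two quantities for each set of centers. The helper below is a minimal sketch of that idea. The name `clustering_costs` and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np
from scipy.spatial.distance import cdist

def clustering_costs(X, centers):
    """Return (sum of distances, sum of squared distances) to the nearest center."""
    nearest = cdist(X, centers, 'euclidean').min(axis=1)
    return nearest.sum(), (nearest ** 2).sum()

# Toy check: points at distance 1 and 2 from a single center at the origin
X_demo = np.array([[1.0, 0.0], [0.0, 2.0]])
centers_demo = np.array([[0.0, 0.0]])

sum_dist, sum_sq_dist = clustering_costs(X_demo, centers_demo)
print(sum_dist, sum_sq_dist)
```

With the notebook's variables, `clustering_costs(X, kmeans.cluster_centers_)` and `clustering_costs(X, medoids)` would return two pairs that can be compared entry by entry.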

## 10. Evaluate Cluster Quality

We evaluate the clustering performance of both algorithms using the Silhouette Score.


In [None]:
def evaluate_clustering(X, labels, method_name):
    silhouette_avg = silhouette_score(X, labels)
    print(f"Silhouette Score for {method_name}: {silhouette_avg:.4f}")

evaluate_clustering(X, kmeans_labels, "K-Means")
evaluate_clustering(X, labels, "K-Medoids")


## 11. Visualizing Cluster Boundaries

Finally, we plot the decision boundaries of the clusters over the data, which shows how each algorithm partitions the plane.


In [None]:
def plot_decision_boundaries(X, labels, centers, title):
    h = .02  # Step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Assign every mesh point to its nearest center (centroid or medoid), so the
    # regions reflect the fitted clustering rather than a new K-Means fit on the grid
    grid = np.c_[xx.ravel(), yy.ravel()]
    Z = np.argmin(cdist(grid, centers, 'euclidean'), axis=1).reshape(xx.shape)

    # Use the same colormap for the regions and the points
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30, alpha=0.6)
    plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', label='Centers')
    plt.title(title)
    plt.legend()
    plt.show()

plot_decision_boundaries(X, kmeans_labels, kmeans.cluster_centers_, "K-Means Decision Boundaries")
plot_decision_boundaries(X, labels, medoids, "K-Medoids Decision Boundaries")


## Summary

In this notebook, we explored clustering with both K-Means and K-Medoids. We used the Elbow Method, the Silhouette Score, the Calinski-Harabasz index, and the Gap Statistic to choose the number of clusters. We visualized the clustering results, reported each algorithm's objective (inertia for K-Means, cost for K-Medoids), and evaluated cluster quality with the Silhouette Score. Finally, we plotted the decision boundaries to show how the algorithms partition the data into distinct clusters.
