# K-Means Clustering from Scratch

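This notebook implements the K-means clustering algorithm from first principles with NumPy, then applies it to synthetic 2D blobs, the Iris dataset, and a subset of the digits dataset.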

**Key Concepts:**
- Centroid initialization (random, k-means++)
- Iterative assignment and update steps
- Convergence criteria
- Within-cluster sum of squares (WCSS)
- Elbow method for optimal K


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs, load_iris, load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

print("=" * 80)
print("K-MEANS CLUSTERING FROM SCRATCH")
print("=" * 80)

## Mathematical Foundation

**Objective:**
Minimize the within-cluster sum of squares (WCSS):
$$WCSS = \sum_{i=1}^{K} \sum_{x \in C_i} ||x - \mu_i||^2$$

where:
- $K$ = number of clusters
- $C_i$ = cluster $i$
- $\mu_i$ = centroid of cluster $i$
- $||x - \mu_i||$ = Euclidean distance between point $x$ and centroid $\mu_i$
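
As a small worked example of the objective, the sketch below computes WCSS by hand for four toy points split into two clusters. The data and variable names are illustrative assumptions, chosen so the arithmetic is easy to check.

```python
import numpy as np

# Four 2D points in two clusters (toy data, chosen for easy arithmetic)
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])

wcss = 0.0
for i in np.unique(labels):
    members = X[labels == i]
    mu = members.mean(axis=0)             # centroid of cluster i
    wcss += np.sum((members - mu) ** 2)   # squared distances to that centroid

print(wcss)  # 10.0
```

Cluster 0 has centroid (1, 0) and contributes 1 + 1 = 2; cluster 1 has centroid (10, 2) and contributes 4 + 4 = 8.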

**Algorithm:**
1. **Initialize** $K$ centroids randomly or using k-means++
2. **Assignment Step**: Assign each point to nearest centroid
   $$C_i = \{x : ||x - \mu_i|| \leq ||x - \mu_j||, \forall j\}$$
3. **Update Step**: Recompute centroids as mean of assigned points
   $$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$
4. **Repeat** steps 2-3 until convergence, i.e. until the centroids move less than a tolerance or the iteration limit is reached
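
The assignment and update steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used later in the notebook: `lloyd_step` is a hypothetical helper, the four points are toy data, and an empty cluster simply keeps its previous centroid.

```python
import numpy as np

def lloyd_step(X, centroids):
    """One assignment + update pass (hypothetical helper for illustration)."""
    # Pairwise squared distances, shape (n_samples, K), via broadcasting
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)                        # assignment step
    new_centroids = centroids.copy()
    for i in range(len(centroids)):
        members = X[labels == i]
        if len(members) > 0:                          # empty cluster keeps its old centroid
            new_centroids[i] = members.mean(axis=0)   # update step
    return labels, new_centroids

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])        # deliberately poor starting guess

for _ in range(10):
    labels, new_centroids = lloyd_step(X, centroids)
    if np.allclose(new_centroids, centroids):         # converged: centroids stopped moving
        break
    centroids = new_centroids

print(labels)      # [0 0 1 1]
print(centroids)   # rows (0, 0.5) and (5.5, 5)
```

Minimizing the squared distance gives the same assignment as minimizing the distance itself, so the square root can be skipped in the assignment step.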

## K-Means Implementation
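
The `kmeans++` branch of the class in this section recomputes every point-to-centroid distance inside nested Python loops. For comparison, here is a sketch of the same $D^2$ sampling idea in vectorized form. The helper name `kmeanspp_seed`, the use of `np.random.default_rng`, and the toy data are assumptions made for this example only.

```python
import numpy as np

def kmeanspp_seed(X, k, rng):
    """Pick k rows of X as seeds using D^2 sampling (illustrative helper)."""
    n = X.shape[0]
    seeds = [X[rng.integers(n)]]                   # first seed: uniform at random
    d2 = np.sum((X - seeds[0]) ** 2, axis=1)       # squared distance to nearest seed so far
    for _ in range(1, k):
        idx = rng.choice(n, p=d2 / d2.sum())       # P(x) proportional to D(x)^2
        seeds.append(X[idx])
        # Only the newest seed can shrink a point's nearest-seed distance
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(seeds)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
seeds = kmeanspp_seed(X, 3, rng)
print(seeds.shape)  # (3, 2)
```

Keeping a running minimum in `d2` avoids recomputing distances to all earlier seeds on every round. The sketch assumes the data contains at least `k` distinct points, since otherwise `d2.sum()` can reach zero.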

In [None]:
class KMeansScratch:
    """
    K-Means clustering implemented from scratch.
    
    Parameters:
    -----------
    n_clusters : int, default=3
        Number of clusters
    max_iters : int, default=100
        Maximum number of iterations
    init_method : str, default='random'
        Initialization method: 'random' or 'kmeans++'
    tol : float, default=1e-4
        Convergence tolerance on the total absolute centroid shift
    random_state : int, default=None
        Random seed for reproducibility
    """
    
    def __init__(self, n_clusters=3, max_iters=100, init_method='random', 
                 tol=1e-4, random_state=None):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.init_method = init_method
        self.tol = tol
        self.random_state = random_state
        
        # Model attributes
        self.centroids = None
        self.labels_ = None
        self.inertia_ = None  # WCSS
        self.n_iter_ = 0
        
        # History for visualization
        self.centroid_history = []
        self.inertia_history = []
        
    def _initialize_centroids(self, X):
        """
        Initialize centroids using random or k-means++ method.
        """
        if self.random_state is not None:
            np.random.seed(self.random_state)
        
        n_samples = X.shape[0]
        
        if self.init_method == 'random':
            # Randomly select K samples as initial centroids
            indices = np.random.choice(n_samples, self.n_clusters, replace=False)
            centroids = X[indices]
            
        elif self.init_method == 'kmeans++':
            # K-means++ initialization
            centroids = []
            
            # Choose first centroid randomly
            first_idx = np.random.randint(n_samples)
            centroids.append(X[first_idx])
            
            # Choose remaining centroids
            for _ in range(1, self.n_clusters):
                # Calculate distances to nearest centroid
                distances = np.array([
                    min([self._euclidean_distance(x, c) for c in centroids])
                    for x in X
                ])
                
                # Choose next centroid with probability proportional to distance²
                probabilities = distances ** 2
                probabilities /= probabilities.sum()
                next_idx = np.random.choice(n_samples, p=probabilities)
                centroids.append(X[next_idx])
            
            centroids = np.array(centroids)
        
        return centroids
    
    def _euclidean_distance(self, x1, x2):
        """
        Calculate Euclidean distance between two points.
        """
        return np.sqrt(np.sum((x1 - x2) ** 2))
    
    def _assign_clusters(self, X, centroids):
        """
        Assign each sample to the nearest centroid.
        """
        labels = []
        for x in X:
            # Calculate distances to all centroids
            distances = [self._euclidean_distance(x, c) for c in centroids]
            # Assign to closest centroid
            label = np.argmin(distances)
            labels.append(label)
        return np.array(labels)
    
    def _update_centroids(self, X, labels):
        """
        Update centroids as mean of assigned points.
        """
        centroids = np.zeros((self.n_clusters, X.shape[1]))
        for k in range(self.n_clusters):
            cluster_points = X[labels == k]
            if len(cluster_points) > 0:
                centroids[k] = cluster_points.mean(axis=0)
            else:
                # If cluster is empty, reinitialize randomly
                centroids[k] = X[np.random.randint(len(X))]
        return centroids
    
    def _calculate_inertia(self, X, labels, centroids):
        """
        Calculate within-cluster sum of squares (WCSS).
        """
        inertia = 0
        for k in range(self.n_clusters):
            cluster_points = X[labels == k]
            if len(cluster_points) > 0:
                inertia += np.sum((cluster_points - centroids[k]) ** 2)
        return inertia
    
    def fit(self, X):
        """
        Fit K-means clustering.
        
        Parameters:
        -----------
        X : array-like, shape (n_samples, n_features)
            Training data
        """
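        X = np.asarray(X, dtype=float)

        # Reset history so repeated calls to fit() do not accumulate old runs
        self.centroid_history = []
        self.inertia_history = []

        # Initialize centroids
        self.centroids = self._initialize_centroids(X)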
        
        # Iterative optimization
        for iteration in range(self.max_iters):
            # Assignment step
            labels = self._assign_clusters(X, self.centroids)
            
            # Calculate inertia
            inertia = self._calculate_inertia(X, labels, self.centroids)
            self.inertia_history.append(inertia)
            
            # Store centroid history
            self.centroid_history.append(self.centroids.copy())
            
            # Update step
            new_centroids = self._update_centroids(X, labels)
            
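            # Check convergence: total absolute shift across all centroids.
            # Store the updated centroids before breaking so the final model
            # uses the most recent update.
            centroid_shift = np.sum(np.abs(new_centroids - self.centroids))
            self.centroids = new_centroids
            self.n_iter_ = iteration + 1
            if centroid_shift < self.tol:
                break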
        
        # Final assignment
        self.labels_ = self._assign_clusters(X, self.centroids)
        self.inertia_ = self._calculate_inertia(X, self.labels_, self.centroids)
        
        return self
    
    def predict(self, X):
        """
        Predict cluster labels for new data.
        """
        return self._assign_clusters(X, self.centroids)
    
    def fit_predict(self, X):
        """
        Fit and return cluster labels.
        """
        self.fit(X)
        return self.labels_

print("\n✓ KMeansScratch class defined")
print("  - Random and K-means++ initialization")
print("  - Iterative assignment and update")
print("  - Convergence tracking")

## Example 1: Simple 2D Synthetic Data

In [None]:
# Generate synthetic data with clear clusters
X_blob, y_true = make_blobs(
    n_samples=300,
    n_features=2,
    centers=4,
    cluster_std=0.6,
    random_state=42
)

print("Synthetic 2D Dataset:")
print(f"  Samples: {len(X_blob)}")
print(f"  Features: {X_blob.shape[1]}")
print(f"  True clusters: {len(np.unique(y_true))}")

# Visualize data
plt.figure(figsize=(10, 6))
plt.scatter(X_blob[:, 0], X_blob[:, 1], c=y_true, cmap='viridis', 
            s=50, alpha=0.6, edgecolors='black')
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('True Cluster Assignments', fontsize=14, fontweight='bold')
plt.colorbar(label='Cluster')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Fit K-means with random initialization
kmeans_random = KMeansScratch(
    n_clusters=4,
    max_iters=100,
    init_method='random',
    random_state=42
)

labels_random = kmeans_random.fit_predict(X_blob)

print("\nK-Means (Random Initialization):")
print(f"  Converged in {kmeans_random.n_iter_} iterations")
print(f"  Final inertia (WCSS): {kmeans_random.inertia_:.2f}")
print(f"  Cluster sizes: {np.bincount(labels_random)}")

In [None]:
# Fit K-means with k-means++ initialization
kmeans_pp = KMeansScratch(
    n_clusters=4,
    max_iters=100,
    init_method='kmeans++',
    random_state=42
)

labels_pp = kmeans_pp.fit_predict(X_blob)

print("\nK-Means (K-means++ Initialization):")
print(f"  Converged in {kmeans_pp.n_iter_} iterations")
print(f"  Final inertia (WCSS): {kmeans_pp.inertia_:.2f}")
print(f"  Cluster sizes: {np.bincount(labels_pp)}")

In [None]:
# Compare results
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# True clusters
axes[0].scatter(X_blob[:, 0], X_blob[:, 1], c=y_true, cmap='viridis',
               s=50, alpha=0.6, edgecolors='black')
axes[0].set_xlabel('Feature 1', fontsize=11)
axes[0].set_ylabel('Feature 2', fontsize=11)
axes[0].set_title('True Clusters', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Random initialization
axes[1].scatter(X_blob[:, 0], X_blob[:, 1], c=labels_random, cmap='viridis',
               s=50, alpha=0.6, edgecolors='black')
axes[1].scatter(kmeans_random.centroids[:, 0], kmeans_random.centroids[:, 1],
               c='red', s=200, marker='X', edgecolors='black', linewidth=2,
               label='Centroids')
axes[1].set_xlabel('Feature 1', fontsize=11)
axes[1].set_ylabel('Feature 2', fontsize=11)
axes[1].set_title(f'Random Init (WCSS: {kmeans_random.inertia_:.1f})', 
                 fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# K-means++ initialization
axes[2].scatter(X_blob[:, 0], X_blob[:, 1], c=labels_pp, cmap='viridis',
               s=50, alpha=0.6, edgecolors='black')
axes[2].scatter(kmeans_pp.centroids[:, 0], kmeans_pp.centroids[:, 1],
               c='red', s=200, marker='X', edgecolors='black', linewidth=2,
               label='Centroids')
axes[2].set_xlabel('Feature 1', fontsize=11)
axes[2].set_ylabel('Feature 2', fontsize=11)
axes[2].set_title(f'K-means++ Init (WCSS: {kmeans_pp.inertia_:.1f})', 
                 fontsize=12, fontweight='bold')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Plot convergence
fig, ax = plt.subplots(figsize=(10, 6))

ax.plot(kmeans_random.inertia_history, marker='o', linewidth=2, 
        label='Random Init', markersize=4)
ax.plot(kmeans_pp.inertia_history, marker='s', linewidth=2, 
        label='K-means++ Init', markersize=4)
ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('Inertia (WCSS)', fontsize=12)
ax.set_title('Convergence Comparison', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n✓ K-means++ typically converges in fewer iterations and often reaches a lower WCSS than random initialization")

## Visualize Clustering Process

In [None]:
# Show first few iterations
iterations_to_show = min(6, len(kmeans_pp.centroid_history))
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx in range(iterations_to_show):
    centroids = kmeans_pp.centroid_history[idx]
    labels = kmeans_pp._assign_clusters(X_blob, centroids)
    
    axes[idx].scatter(X_blob[:, 0], X_blob[:, 1], c=labels, cmap='viridis',
                     s=50, alpha=0.6, edgecolors='black')
    axes[idx].scatter(centroids[:, 0], centroids[:, 1],
                     c='red', s=200, marker='X', edgecolors='black', linewidth=2)
    axes[idx].set_xlabel('Feature 1', fontsize=10)
    axes[idx].set_ylabel('Feature 2', fontsize=10)
    axes[idx].set_title(f'Iteration {idx}', fontsize=11, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)

# Hide any unused panels when fewer than six iterations were recorded
for ax in axes[iterations_to_show:]:
    ax.axis('off')

plt.suptitle('K-Means Clustering Process', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

## Elbow Method - Finding Optimal K
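
The elbow is usually read off the inertia curve by eye. As a rough illustration, one simple heuristic (an assumption for this example, not something the notebook's code uses) is to pick the K where the curve bends most sharply, measured by the largest second difference. The inertia values below are made up for the example and are not output from this notebook.

```python
import numpy as np

# Hypothetical inertia values for K = 1..7 (made up for illustration)
k_values = np.arange(1, 8)
inertias = np.array([1000.0, 700.0, 400.0, 100.0, 80.0, 65.0, 55.0])

# Second difference at each interior K: large where a steep drop turns flat
second_diff = inertias[:-2] - 2 * inertias[1:-1] + inertias[2:]
elbow_k = k_values[1:-1][np.argmax(second_diff)]

print(elbow_k)  # 4
```

This heuristic is sensitive to noise in the curve, so it works best as a sanity check alongside the plot and the silhouette score rather than as a replacement for them.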

In [None]:
print("\n" + "=" * 80)
print("ELBOW METHOD - FINDING OPTIMAL K")
print("=" * 80)

# Test different values of K
k_values = range(2, 11)
inertias = []
silhouette_scores = []

for k in k_values:
    kmeans = KMeansScratch(
        n_clusters=k,
        max_iters=100,
        init_method='kmeans++',
        random_state=42
    )
    labels = kmeans.fit_predict(X_blob)
    
    inertias.append(kmeans.inertia_)
    
    # Silhouette score is defined only for K >= 2, which k_values guarantees
    silhouette = silhouette_score(X_blob, labels)
    silhouette_scores.append(silhouette)
    
    print(f"K={k:2d} | Inertia: {kmeans.inertia_:8.2f} | Silhouette: {silhouette:.4f}")

In [None]:
# Plot elbow curve and silhouette scores
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Elbow curve
axes[0].plot(k_values, inertias, marker='o', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Clusters (K)', fontsize=12)
axes[0].set_ylabel('Inertia (WCSS)', fontsize=12)
axes[0].set_title('Elbow Method', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
axes[0].axvline(x=4, color='red', linestyle='--', linewidth=2, label='Elbow at K=4')
axes[0].legend()

# Silhouette scores
axes[1].plot(k_values, silhouette_scores, marker='s', linewidth=2, 
            markersize=8, color='green')
axes[1].set_xlabel('Number of Clusters (K)', fontsize=12)
axes[1].set_ylabel('Silhouette Score', fontsize=12)
axes[1].set_title('Silhouette Analysis', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
best_k = k_values[np.argmax(silhouette_scores)]
axes[1].axvline(x=best_k, color='red', linestyle='--', linewidth=2, 
               label=f'Best K={best_k}')
axes[1].legend()

plt.tight_layout()
plt.show()

print("\n✓ Elbow method: Look for 'elbow' in inertia curve")
print("✓ Silhouette score: Higher is better (range -1 to 1)")

## Example 2: Iris Dataset

In [None]:
# Load iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
feature_names = iris.feature_names

print("\nIris Dataset:")
print(f"  Samples: {len(X_iris)}")
print(f"  Features: {X_iris.shape[1]}")
print(f"  True classes: {len(np.unique(y_iris))}")
print(f"\nFeatures: {feature_names}")

# Standardize features
scaler = StandardScaler()
X_iris_scaled = scaler.fit_transform(X_iris)

In [None]:
# Fit K-means
kmeans_iris = KMeansScratch(
    n_clusters=3,
    max_iters=100,
    init_method='kmeans++',
    random_state=42
)

labels_iris = kmeans_iris.fit_predict(X_iris_scaled)

print("\nK-Means Results (K=3):")
print(f"  Converged in {kmeans_iris.n_iter_} iterations")
print(f"  Final inertia: {kmeans_iris.inertia_:.4f}")
print(f"  Cluster sizes: {np.bincount(labels_iris)}")

# Evaluation metrics
silhouette = silhouette_score(X_iris_scaled, labels_iris)
ari = adjusted_rand_score(y_iris, labels_iris)

print(f"\nEvaluation Metrics:")
print(f"  Silhouette Score: {silhouette:.4f}")
print(f"  Adjusted Rand Index: {ari:.4f}")

In [None]:
# Visualize with PCA (2D projection)
pca = PCA(n_components=2)
X_iris_pca = pca.fit_transform(X_iris_scaled)
centroids_pca = pca.transform(kmeans_iris.centroids)

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# True labels
axes[0].scatter(X_iris_pca[:, 0], X_iris_pca[:, 1], c=y_iris, cmap='viridis',
               s=50, alpha=0.6, edgecolors='black')
axes[0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} var)', fontsize=11)
axes[0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} var)', fontsize=11)
axes[0].set_title('True Species Labels', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# K-means clusters
axes[1].scatter(X_iris_pca[:, 0], X_iris_pca[:, 1], c=labels_iris, cmap='viridis',
               s=50, alpha=0.6, edgecolors='black')
axes[1].scatter(centroids_pca[:, 0], centroids_pca[:, 1],
               c='red', s=200, marker='X', edgecolors='black', linewidth=2,
               label='Centroids')
axes[1].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} var)', fontsize=11)
axes[1].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} var)', fontsize=11)
axes[1].set_title(f'K-Means Clusters (Silhouette: {silhouette:.3f})', 
                 fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Example 3: High-Dimensional Data (Digits)

In [None]:
# Load digits dataset (subset for speed)
digits = load_digits()
X_digits = digits.data[:500]  # Use subset
y_digits = digits.target[:500]

print("\nDigits Dataset (Subset):")
print(f"  Samples: {len(X_digits)}")
print(f"  Features: {X_digits.shape[1]} (8x8 pixel images)")
print(f"  True classes: {len(np.unique(y_digits))}")

# Standardize
scaler_digits = StandardScaler()
X_digits_scaled = scaler_digits.fit_transform(X_digits)

In [None]:
# Elbow method for digits
k_values_digits = range(2, 11)
inertias_digits = []
silhouettes_digits = []

print("\nTesting different K values:")
for k in k_values_digits:
    kmeans = KMeansScratch(
        n_clusters=k,
        max_iters=50,
        init_method='kmeans++',
        random_state=42
    )
    labels = kmeans.fit_predict(X_digits_scaled)
    
    inertias_digits.append(kmeans.inertia_)
    silhouette = silhouette_score(X_digits_scaled, labels)
    silhouettes_digits.append(silhouette)
    
    print(f"K={k:2d} | Inertia: {kmeans.inertia_:10.2f} | Silhouette: {silhouette:.4f}")

In [None]:
# Plot elbow curve
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

axes[0].plot(k_values_digits, inertias_digits, marker='o', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Clusters (K)', fontsize=12)
axes[0].set_ylabel('Inertia (WCSS)', fontsize=12)
axes[0].set_title('Elbow Method - Digits Dataset', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

axes[1].plot(k_values_digits, silhouettes_digits, marker='s', linewidth=2, 
            markersize=8, color='green')
axes[1].set_xlabel('Number of Clusters (K)', fontsize=12)
axes[1].set_ylabel('Silhouette Score', fontsize=12)
axes[1].set_title('Silhouette Analysis - Digits Dataset', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Fit with K=10 (true number of digits)
kmeans_digits = KMeansScratch(
    n_clusters=10,
    max_iters=50,
    init_method='kmeans++',
    random_state=42
)

labels_digits = kmeans_digits.fit_predict(X_digits_scaled)

print(f"\nK-Means Results (K=10):")
print(f"  Converged in {kmeans_digits.n_iter_} iterations")
print(f"  Cluster sizes: {np.bincount(labels_digits)}")

# Evaluation
silhouette_digits = silhouette_score(X_digits_scaled, labels_digits)
ari_digits = adjusted_rand_score(y_digits, labels_digits)

print(f"\nEvaluation:")
print(f"  Silhouette Score: {silhouette_digits:.4f}")
print(f"  Adjusted Rand Index: {ari_digits:.4f}")

In [None]:
# Visualize cluster centers as images
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
axes = axes.ravel()

# Inverse transform centroids to original scale
centroids_original = scaler_digits.inverse_transform(kmeans_digits.centroids)

for idx in range(10):
    centroid_image = centroids_original[idx].reshape(8, 8)
    axes[idx].imshow(centroid_image, cmap='gray')
    axes[idx].set_title(f'Cluster {idx}', fontsize=10)
    axes[idx].axis('off')

plt.suptitle('Cluster Centroids (Average Digit Images)', 
            fontsize=14, fontweight='bold', y=0.98)
plt.tight_layout()
plt.show()

## Comparison with Different K Values

In [None]:
# Compare different K values on 2D data
k_test = [2, 3, 4, 6]
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()

for idx, k in enumerate(k_test):
    kmeans = KMeansScratch(
        n_clusters=k,
        max_iters=100,
        init_method='kmeans++',
        random_state=42
    )
    labels = kmeans.fit_predict(X_blob)
    silhouette = silhouette_score(X_blob, labels)
    
    axes[idx].scatter(X_blob[:, 0], X_blob[:, 1], c=labels, cmap='viridis',
                     s=50, alpha=0.6, edgecolors='black')
    axes[idx].scatter(kmeans.centroids[:, 0], kmeans.centroids[:, 1],
                     c='red', s=200, marker='X', edgecolors='black', linewidth=2)
    axes[idx].set_xlabel('Feature 1', fontsize=11)
    axes[idx].set_ylabel('Feature 2', fontsize=11)
    axes[idx].set_title(f'K={k} (Silhouette: {silhouette:.3f})', 
                       fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('Effect of K on Clustering', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

## Summary

**Key Concepts:**

1. **Algorithm**:
   - Initialize K centroids
   - Assign points to nearest centroid
   - Update centroids as cluster mean
   - Repeat until convergence

2. **Initialization**:
   - Random: Simple, but may lead to poor local optima
   - K-means++: Spreads the initial centroids apart, which typically gives faster convergence and a lower final WCSS

3. **Choosing K**:
   - Elbow method: Look for "elbow" in WCSS curve
   - Silhouette score: Measures cluster cohesion and separation (range -1 to 1, higher is better)
   - Domain knowledge: Use prior information

4. **Evaluation**:
   - Inertia (WCSS): Lower is better for a fixed K; it tends to decrease as K grows, so it cannot rank different K values on its own
   - Silhouette score: Higher is better
   - Adjusted Rand Index: Compare with true labels (if available)

5. **Limitations**:
   - Assumes roughly spherical clusters of similar size
   - Sensitive to outliers and to feature scaling
   - Must specify K in advance
   - May converge to a local optimum, depending on initialization
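
A common way to reduce the risk of a poor local optimum is to run the algorithm several times from different random starts and keep the run with the lowest inertia. The sketch below illustrates this with a compact, self-contained K-means function. The helper `lloyd`, the choice of 10 restarts, and the toy data are assumptions for the example and are separate from the `KMeansScratch` class.

```python
import numpy as np

def lloyd(X, k, rng, max_iters=100):
    """Plain K-means with random init; returns (inertia, labels, centroids)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Mean of each cluster; an empty cluster keeps its previous centroid
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Final assignment and inertia for the returned centroids
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return d2.min(axis=1).sum(), labels, centroids

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0, 0], [4, 0], [0, 4], [4, 4])])

runs = [lloyd(X, 4, rng) for _ in range(10)]   # 10 restarts is an arbitrary choice
inertias = [r[0] for r in runs]
best_inertia, best_labels, best_centroids = min(runs, key=lambda r: r[0])

print(f"worst run: {max(inertias):.2f}   best run: {best_inertia:.2f}")
```

Restarts multiply the cost by the number of runs, so they are often combined with a better initialization such as k-means++ rather than used instead of it.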
