# 02 - K-Means Clustering (Hands-on)

Objectives:
- Understand assumptions: spherical, similarly sized clusters; numeric and scaled features; Euclidean distance
- Preprocess with standardization and choose k using elbow and silhouette methods
- Fit K-Means with k-means++ init and multiple initializations
- Visualize results using PCA projection and inspect cluster characteristics

Assumptions:
- Clusters are roughly spherical, with similar size and density
- Features are numeric and comparable (scale matters)
- Euclidean distance is a meaningful measure of similarity once features are scaled

Cautions/Data Prep:
- Always standardize/normalize features prior to K-Means
- Remove irrelevant or categorical features (or encode appropriately)
- Try several k values; results can vary with the random initialization, so use k-means++ and multiple runs (`n_init`)


In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import make_blobs, load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples, adjusted_rand_score, normalized_mutual_info_score

sns.set(style='whitegrid', context='notebook')
np.random.seed(42)

## 1) Synthetic dataset (numeric, well-separated)
Create a small synthetic dataset (600 samples, 4 features) with 4 clusters of slightly different spread. We will standardize features before clustering.

In [None]:
X, y_true = make_blobs(n_samples=600, centers=4, n_features=4, cluster_std=[1.0, 1.2, 0.8, 1.1], random_state=42)
df = pd.DataFrame(X, columns=[f"x{i+1}" for i in range(X.shape[1])])
df.head()

Standardize features so each dimension contributes equally to Euclidean distance.

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.values)
X_scaled[:3]
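
As a quick sanity check on the scaling step, the sketch below verifies that each standardized column ends up with mean close to 0 and standard deviation close to 1. It regenerates comparable blobs so that it runs on its own; the names `X_demo` and `X_demo_scaled` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Regenerate comparable data so this block is self-contained (illustrative names).
X_demo, _ = make_blobs(n_samples=600, centers=4, n_features=4, random_state=42)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# StandardScaler divides by the population standard deviation (ddof=0),
# which is also what np.std computes by default.
col_means = X_demo_scaled.mean(axis=0)
col_stds = X_demo_scaled.std(axis=0)
print(np.round(col_means, 6), np.round(col_stds, 6))
```

If a column's standard deviation is far from 1 after scaling, the scaler was probably fit on different data than it was applied to.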

## 2) Choose k with Elbow Method (inertia/WCSS)
Plot inertia across k and look for an "elbow" where marginal gains diminish.

In [None]:
ks = range(2, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, max_iter=300, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

plt.figure(figsize=(6,4))
plt.plot(list(ks), inertias, marker='o')
plt.title('Elbow Method (Inertia vs k)')
plt.xlabel('k')
plt.ylabel('Inertia (WCSS)')
plt.xticks(list(ks))
plt.show()
inertias
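
To make "inertia" concrete, the sketch below recomputes it by hand as the sum of squared Euclidean distances from each sample to its assigned centroid, then compares the result with `inertia_`. The toy data and the names `X_toy`, `km_toy`, and `wcss_manual` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Small toy dataset (illustrative, not the notebook's data).
X_toy, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)
km_toy = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(X_toy)

# Look up the centroid assigned to each sample, then sum the squared distances.
assigned_centers = km_toy.cluster_centers_[km_toy.labels_]
wcss_manual = float(np.sum((X_toy - assigned_centers) ** 2))

print(wcss_manual, km_toy.inertia_)
```

Because this quantity can only go down as k grows, the elbow plot is read for the point of diminishing returns rather than for a minimum.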

## 3) Silhouette analysis
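The silhouette score ranges from -1 to 1 and measures how close each sample is to its own cluster (cohesion) relative to the nearest other cluster (separation). Higher is better; compare the average score across k.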

In [None]:
sil_scores = []
for k in ks:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, max_iter=300, random_state=42)
    labels = km.fit_predict(X_scaled)
    sil = silhouette_score(X_scaled, labels)
    sil_scores.append(sil)

plt.figure(figsize=(6,4))
plt.plot(list(ks), sil_scores, marker='o', color='green')
plt.title('Silhouette Score vs k')
plt.xlabel('k')
plt.ylabel('Average Silhouette Score')
plt.xticks(list(ks))
plt.show()
sil_scores
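
For a single sample, the silhouette is `(b - a) / max(a, b)`, where `a` is the mean distance to the other members of its own cluster and `b` is the mean distance to the members of the nearest other cluster. The sketch below works this out by hand on a tiny made-up 1-D dataset and compares it with `silhouette_samples`; the data values and variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny hand-made 1-D dataset with two obvious groups (illustrative values).
X_tiny = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_tiny = np.array([0, 0, 1, 1])

i = 0  # index of the sample to inspect
dists = np.abs(X_tiny[:, 0] - X_tiny[i, 0])

same = labels_tiny == labels_tiny[i]
same[i] = False  # a sample is not counted in its own cohesion term
a = dists[same].mean()                            # cohesion: mean distance within own cluster
b = dists[labels_tiny != labels_tiny[i]].mean()   # separation: only one other cluster here
s_manual = (b - a) / max(a, b)

s_sklearn = silhouette_samples(X_tiny, labels_tiny)[i]
print(s_manual, s_sklearn)
```

Here `a = 1` and `b = 10.5`, so the sample scores `9.5 / 10.5`, close to the upper bound of 1.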

Pick the k with the highest average silhouette. If several k values score nearly the same, prefer the smaller k for simplicity; the cell below takes the plain maximum.

In [None]:
best_k = int(ks[int(np.argmax(sil_scores))])
best_k
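
The "prefer the smaller k when scores are close" rule can be sketched as a small helper. The function `pick_k_with_tolerance`, its `tol` parameter, and the example scores are hypothetical and only illustrate one possible way to encode that tie-break.

```python
import numpy as np

def pick_k_with_tolerance(ks, scores, tol=0.01):
    """Return the smallest k whose score is within `tol` of the best score.

    Hypothetical helper for illustration; assumes `ks` is sorted ascending.
    """
    ks = list(ks)
    scores = np.asarray(scores, dtype=float)
    close_enough = scores >= scores.max() - tol
    # np.argmax on a boolean array returns the index of the first True.
    return ks[int(np.argmax(close_enough))]

example_ks = range(2, 7)
example_scores = [0.41, 0.58, 0.585, 0.50, 0.44]  # made-up silhouette values
chosen = pick_k_with_tolerance(example_ks, example_scores, tol=0.01)
print(chosen)  # prints 3: k=3 is within 0.01 of the best score (k=4)
```

With `tol=0.0` the helper reduces to the plain maximum used in the cell above.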

## 4) Fit final K-Means and visualize (PCA 2D)
Use PCA for 2D visualization; K-Means is fit in scaled space. We transform cluster centers through the same PCA for plotting.

In [None]:
kmeans = KMeans(n_clusters=best_k, init='k-means++', n_init=10, max_iter=300, random_state=42)
labels = kmeans.fit_predict(X_scaled)
avg_sil = silhouette_score(X_scaled, labels)
print({'k': best_k, 'inertia': kmeans.inertia_, 'silhouette': round(avg_sil, 4)})

pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)
centers_pca = pca.transform(kmeans.cluster_centers_)

plt.figure(figsize=(6,5))
palette = sns.color_palette('tab10', n_colors=best_k)
for i in range(best_k):
    plt.scatter(X_pca[labels==i,0], X_pca[labels==i,1], s=15, color=palette[i], label=f'Cluster {i}')
plt.scatter(centers_pca[:,0], centers_pca[:,1], c='black', s=120, marker='X', label='Centroids')
plt.title('K-Means Clusters (PCA 2D)')
plt.legend(loc='best', fontsize=8)
plt.tight_layout()
plt.show()

pd.Series(labels).value_counts().sort_index().rename('cluster_counts')
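
A 2D PCA plot only shows part of the structure, so it can help to check how much variance the two components retain before reading too much into apparent overlap. The sketch below does this on regenerated blobs; `X_demo`, `pca_demo`, and `retained` are illustrative names, and the check itself is a suggestion rather than a step from the notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Regenerate comparable data so this block is self-contained (illustrative names).
X_demo, _ = make_blobs(n_samples=600, centers=4, n_features=4, random_state=42)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=2).fit(X_demo_scaled)

# Fraction of total variance captured by the two plotted components.
retained = float(pca_demo.explained_variance_ratio_.sum())
print(f"Variance retained by 2 components: {retained:.1%}")
```

When the retained fraction is low, clusters that look merged in the plot may still be well separated in the full feature space.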

The per-sample silhouette distribution helps spot poorly assigned points (values near 0 or negative).

In [None]:
sil_samples = silhouette_samples(X_scaled, labels)
plt.figure(figsize=(6,4))
sns.histplot(sil_samples, bins=30, kde=True)
plt.title('Silhouette score distribution (per sample)')
plt.xlabel('Silhouette score')
plt.tight_layout()
plt.show()
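
Beyond the histogram, it can be useful to count the low-scoring samples per cluster. The sketch below does this on deliberately overlapping toy blobs; the `threshold` value and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Overlapping toy blobs so that some points sit near cluster boundaries.
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=2,
                       cluster_std=2.0, random_state=1)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X_demo)
sil_demo = silhouette_samples(X_demo, labels_demo)

threshold = 0.0  # illustrative cutoff for "poorly assigned"
for c in np.unique(labels_demo):
    in_cluster = labels_demo == c
    n_low = int(np.sum(sil_demo[in_cluster] <= threshold))
    print(f"cluster {c}: {int(in_cluster.sum())} points, {n_low} at or below {threshold}")
```

A cluster where many members fall below the cutoff is a candidate for merging with a neighbor or for revisiting the choice of k.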

## 5) Multiple initializations
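K-Means can converge to different solutions from different initializations. Inspect the distribution of inertia across seeds. Each fit below already keeps the best of `n_init=10` k-means++ runs, so the spread across seeds should be small.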

In [None]:
seeds = range(20)
inertias_multi = []
for s in seeds:
    km = KMeans(n_clusters=best_k, init='k-means++', n_init=10, max_iter=300, random_state=s)
    km.fit(X_scaled)
    inertias_multi.append(km.inertia_)
plt.figure(figsize=(6,4))
sns.boxplot(x=inertias_multi)
plt.title('Inertia across random seeds')
plt.xlabel('Inertia')
plt.tight_layout()
plt.show()
np.min(inertias_multi), np.median(inertias_multi), np.max(inertias_multi)
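
To see the effect of initialization more directly, one option is to run a single initialization per seed (`n_init=1`) and compare `init='random'` with `init='k-means++'`. The sketch below does this on toy blobs; the helper `inertias_over_seeds` and the data are hypothetical, and the size of the spread will depend on the dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (illustrative, not the notebook's data).
X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=42)

def inertias_over_seeds(init, seeds=range(20)):
    """Fit one single-initialization K-Means per seed and collect the inertias."""
    return np.array([
        KMeans(n_clusters=4, init=init, n_init=1, random_state=s).fit(X_demo).inertia_
        for s in seeds
    ])

inertia_random = inertias_over_seeds('random')
inertia_pp = inertias_over_seeds('k-means++')
print('random   : min', inertia_random.min(), 'max', inertia_random.max())
print('k-means++: min', inertia_pp.min(), 'max', inertia_pp.max())
```

A wide gap between min and max for a given `init` indicates that some seeds got stuck in a worse local optimum, which is what `n_init > 1` guards against.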

## Exercises
Complete the tasks below. Instructor solutions are hidden/collapsed.

In [None]:
# Exercise 1: No scaling vs scaling
# TODO: Run K-Means on the original (unscaled) df values for k=best_k and compare inertia and silhouette to the scaled version.
# Hint: use KMeans(...).fit_predict(df.values) and silhouette_score(df.values, labels_unscaled)
...

In [None]:
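# Solution 1 (hidden)
km_unscaled = KMeans(n_clusters=best_k, init='k-means++', n_init=10, random_state=42)
labels_unscaled = km_unscaled.fit_predict(df.values)
sil_unscaled = silhouette_score(df.values, labels_unscaled)
# Inertia is measured in squared feature units, so the two values are not on the same scale.
print({'scaled_inertia': round(kmeans.inertia_, 2), 'unscaled_inertia': round(km_unscaled.inertia_, 2)})
print({'scaled_silhouette': round(avg_sil, 4), 'unscaled_silhouette': round(sil_unscaled, 4)})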

In [None]:
# Exercise 2: Try different k and justify your choice
# TODO: For k in [3,4,5,6], compute silhouette scores and plot them. Which k would you pick and why?
...

In [None]:
# Solution 2 (hidden)
ks_try = [3,4,5,6]
sil_try = []
for kk in ks_try:
    km = KMeans(n_clusters=kk, init='k-means++', n_init=10, random_state=42)
    sil_try.append(silhouette_score(X_scaled, km.fit_predict(X_scaled)))
plt.figure(figsize=(5,3))
plt.plot(ks_try, sil_try, marker='o', color='purple')
plt.title('Silhouette for selected k')
plt.xlabel('k')
plt.ylabel('Silhouette')
plt.tight_layout()
plt.show()
list(zip(ks_try, [round(s,4) for s in sil_try]))

In [None]:
# Exercise 3: Iris dataset (unsupervised) and evaluation vs true labels
# TODO: Load iris, drop labels, standardize, run K-Means with k=3, and compute ARI and NMI vs true labels.
...

In [None]:
# Solution 3 (hidden)
iris = load_iris()
Xi = iris.data
yi = iris.target
Xi_scaled = StandardScaler().fit_transform(Xi)
km_i = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
labels_i = km_i.fit_predict(Xi_scaled)
ari = adjusted_rand_score(yi, labels_i)
nmi = normalized_mutual_info_score(yi, labels_i)
print({'ARI': round(ari, 4), 'NMI': round(nmi, 4)})
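
K-Means cluster ids are arbitrary, which is why the solution compares against the true labels with ARI and NMI rather than plain accuracy. The sketch below shows that both scores ignore how the clusters are numbered; the label lists are made up for illustration.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Same grouping of six samples, but with the cluster ids renamed (made-up labels).
y_ref = [0, 0, 1, 1, 2, 2]
y_renamed = [2, 2, 0, 0, 1, 1]

ari_demo = adjusted_rand_score(y_ref, y_renamed)
nmi_demo = normalized_mutual_info_score(y_ref, y_renamed)
print(ari_demo, nmi_demo)  # both are 1 (up to floating-point error) for identical groupings
```

A naive accuracy computed on these two lists would be 0, even though the partitions are identical.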

## Wrap-up checklist
- [ ] Standardize/normalize features
- [ ] Explore several k values (elbow + silhouette)
- [ ] Use k-means++ init and multiple initializations
- [ ] Inspect per-sample silhouette and cluster sizes
- [ ] Visualize with PCA when dimensionality > 2
