# K-Means Clustering and Cluster Evaluation

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Explain unsupervised learning and the K-Means algorithm step by step
2. Use `KMeans` from scikit-learn with key parameters (`n_clusters`, `init`, `n_init`, `max_iter`)
3. Evaluate clustering quality using **inertia** and the **Elbow method**
4. Compute and interpret the **Silhouette score**
5. Recognize the limitations of K-Means

## Prerequisites

- Python fundamentals and NumPy/Pandas basics
- Matplotlib for plotting
- Conceptual understanding of supervised learning (to contrast with unsupervised)

## Table of Contents

1. [Theory: Unsupervised Learning and Clustering](#1-theory-unsupervised-learning-and-clustering)
2. [The K-Means Algorithm](#2-the-k-means-algorithm)
3. [K-Means in scikit-learn](#3-k-means-in-scikit-learn)
4. [Inertia and the Elbow Method](#4-inertia-and-the-elbow-method)
5. [Silhouette Score](#5-silhouette-score)
6. [Silhouette Plots for Different k](#6-silhouette-plots-for-different-k)
7. [Limitations of K-Means](#7-limitations-of-k-means)
8. [Common Mistakes](#8-common-mistakes)
9. [Exercise](#9-exercise)

---

## 1. Theory: Unsupervised Learning and Clustering

In **supervised learning**, we have labeled data (features + target). In **unsupervised learning**, we have only features -- no labels. The goal is to discover hidden structure in the data.

**Clustering** is a key unsupervised task: group data points so that points in the same cluster are more similar to each other than to points in other clusters.

Applications include:
- Customer segmentation
- Document grouping
- Image compression
- Anomaly detection

---

## 2. The K-Means Algorithm

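K-Means partitions data into **k** clusters by iteratively reducing the within-cluster sum of squares (inertia). It converges to a local minimum of this objective, which is not guaranteed to be the global one.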

**Algorithm steps:**
1. Choose k (number of clusters).
2. Randomly initialize k **centroids**.
3. **Assign** each point to the nearest centroid.
4. **Update** each centroid to be the mean of all points assigned to it.
5. Repeat steps 3-4 until centroids stop moving (or `max_iter` is reached).

**Objective function (inertia):**

$$J = \sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} \|\mathbf{x} - \boldsymbol{\mu}_i\|^2$$

where $C_i$ is cluster $i$ and $\boldsymbol{\mu}_i$ is its centroid.
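
The steps and objective above can be sketched directly in NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `lloyd_kmeans`, the toy data, and the choice to initialize from randomly picked data points are all assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-Means sketch (hypothetical helper, for illustration only)."""
    rng = np.random.default_rng(seed)
    # Step 2: use k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster simply keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and objective J (inertia) for the returned centroids
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    inertia = dists[np.arange(len(X)), labels].sum()
    return labels, centroids, inertia

# Toy data: two well-separated groups of 2-D points
toy_rng = np.random.default_rng(0)
X_toy = np.vstack([toy_rng.normal(0, 0.5, (50, 2)), toy_rng.normal(5, 0.5, (50, 2))])
toy_labels, toy_centroids, toy_inertia = lloyd_kmeans(X_toy, k=2)
print("Centroids:\n", toy_centroids)
print(f"Inertia J: {toy_inertia:.2f}")
```

Each pass through the loop can only lower or keep $J$, which is why the procedure converges, though possibly to a local minimum that depends on the starting centroids.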

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs, make_moons
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, silhouette_samples

sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

print("Imports complete.")

---

## 3. K-Means in scikit-learn

Key parameters:
- **`n_clusters`**: number of clusters (k)
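- **`init`**: initialization method -- `'k-means++'` (default; picks initial centroids that tend to be far apart) or `'random'`
- **`n_init`**: number of runs with different centroid seeds; the run with the lowest inertia is kept. The default was 10 through scikit-learn 1.3 and is `'auto'` from 1.4 (1 run for `'k-means++'`, 10 for `'random'`)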
- **`max_iter`**: max iterations per run (default=300)
- **`random_state`**: for reproducibility
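
As a quick orientation, the sketch below sets each of these parameters explicitly and inspects the attributes available after fitting. The data and the variable names (`X_demo`, `km_demo`) are assumptions chosen for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Small illustrative dataset (not the one used in the rest of this notebook)
X_demo, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

km_demo = KMeans(n_clusters=3, init="k-means++", n_init=10, max_iter=300, random_state=0)
km_demo.fit(X_demo)

print("cluster_centers_ shape:", km_demo.cluster_centers_.shape)  # (n_clusters, n_features)
print("first labels_:", km_demo.labels_[:10])                     # cluster index per sample
print("inertia_:", round(km_demo.inertia_, 2))                    # objective of the best run
print("n_iter_:", km_demo.n_iter_)                                # iterations of the best run

# predict() assigns points to the nearest fitted centroid
new_labels = km_demo.predict(X_demo[:5])
print("predicted:", new_labels)
```

Passing `n_init` explicitly keeps the behavior the same across scikit-learn versions.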

In [None]:
# Generate synthetic blob data
X_blobs, y_true = make_blobs(
    n_samples=300, centers=4, cluster_std=0.8, random_state=42
)

plt.scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_true, cmap="viridis", s=30, edgecolors="k")
plt.title("Synthetic Blobs (true labels)")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

In [None]:
# Run KMeans
kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, max_iter=300, random_state=42)
labels = kmeans.fit_predict(X_blobs)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_true, cmap="viridis", s=30, edgecolors="k")
axes[0].set_title("True Labels")

axes[1].scatter(X_blobs[:, 0], X_blobs[:, 1], c=labels, cmap="viridis", s=30, edgecolors="k")
axes[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                c="red", marker="X", s=200, edgecolors="k", label="Centroids")
axes[1].set_title("KMeans Labels (k=4)")
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"Inertia: {kmeans.inertia_:.2f}")
print(f"Number of iterations: {kmeans.n_iter_}")

---

## 4. Inertia and the Elbow Method

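**Inertia** (within-cluster sum of squares) shrinks as k grows: the best achievable value never increases and reaches 0 when every point is its own cluster, so minimizing inertia alone cannot choose k. The **Elbow method** instead looks for the k beyond which adding clusters gives diminishing returns -- the "elbow" of the curve.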
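
To make the definition concrete, the following sketch recomputes inertia by hand from the fitted centroids and labels and compares it with the `inertia_` attribute. The dataset and variable names (`X_in`, `km_in`) are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_in, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.7, random_state=0)
km_in = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_in)

# Look up the centroid assigned to each sample: shape (n_samples, n_features)
assigned = km_in.cluster_centers_[km_in.labels_]

# Sum of squared distances from every sample to its own centroid
manual_inertia = ((X_in - assigned) ** 2).sum()

print(f"manual: {manual_inertia:.4f}   sklearn: {km_in.inertia_:.4f}")
```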

In [None]:
inertias = []
k_range = range(1, 11)

for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_blobs)
    inertias.append(km.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(k_range, inertias, "bo-", linewidth=2)
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.title("Elbow Method")
plt.xticks(list(k_range))
plt.grid(True)
plt.tight_layout()
plt.show()

print("For these blobs the elbow should appear at k=4, matching the true number of clusters.")

---

## 5. Silhouette Score

The **silhouette score** measures how similar a point is to its own cluster compared to other clusters.

For each sample $i$:
- $a(i)$: mean distance to other points in the same cluster
- $b(i)$: mean distance to points in the nearest neighboring cluster

$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$

- $s(i) \approx 1$: well-clustered
- $s(i) \approx 0$: on the boundary
- $s(i) < 0$: possibly assigned to the wrong cluster

The overall score is the mean of $s(i)$ across all samples; like each $s(i)$, it lies between $-1$ and $1$.
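
The formula can be checked by hand for a single sample. The sketch below computes $a(i)$, $b(i)$, and $s(i)$ from a pairwise distance matrix and compares the result with `silhouette_samples`. The data, the sample index `i = 0`, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples

X_s, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.7, random_state=1)
labels_s = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X_s)

D = pairwise_distances(X_s)  # Euclidean distances between all pairs of samples
i = 0                        # the sample we inspect
own = labels_s[i]

# a(i): mean distance to the other members of the sample's own cluster
same = labels_s == own
same[i] = False              # exclude the sample itself
a_i = D[i, same].mean()

# b(i): smallest mean distance to the members of any other cluster
b_i = min(D[i, labels_s == c].mean() for c in np.unique(labels_s) if c != own)

s_i = (b_i - a_i) / max(a_i, b_i)
reference = silhouette_samples(X_s, labels_s)[i]
print(f"manual s(i) = {s_i:.6f}   silhouette_samples = {reference:.6f}")
```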

In [None]:
sil_scores = []
k_range_sil = range(2, 11)  # silhouette needs at least 2 clusters

for k in k_range_sil:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_k = km.fit_predict(X_blobs)
    score = silhouette_score(X_blobs, labels_k)
    sil_scores.append(score)
    print(f"k={k}: Silhouette Score = {score:.4f}")

plt.figure(figsize=(8, 5))
plt.plot(list(k_range_sil), sil_scores, "go-", linewidth=2)
plt.xlabel("Number of clusters (k)")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score vs. k")
plt.xticks(list(k_range_sil))
plt.grid(True)
plt.tight_layout()
plt.show()

---

## 6. Silhouette Plots for Different k

A silhouette plot shows the silhouette coefficient for each sample, grouped by cluster and sorted within each cluster. Well-formed clusters appear as "knives" of similar thickness (similar cluster sizes) that extend past the average score line.
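
The same information can be read numerically. As a rough companion to the plot, the sketch below reports each cluster's size (the thickness of its knife) and its mean and maximum silhouette (how far the knife reaches). The dataset and names such as `cluster_summary` are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X_sp, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.8, random_state=0)
labels_sp = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_sp)

sil_sp = silhouette_samples(X_sp, labels_sp)  # one coefficient per sample
avg_sp = silhouette_score(X_sp, labels_sp)    # mean of the per-sample coefficients

cluster_summary = []
for c in np.unique(labels_sp):
    vals = sil_sp[labels_sp == c]
    cluster_summary.append((int(c), len(vals), float(vals.mean()), float(vals.max())))
    print(f"cluster {c}: size={len(vals):3d}  mean={vals.mean():.3f}  max={vals.max():.3f}")

print(f"average silhouette: {avg_sp:.3f}")
```

A cluster whose maximum stays below the average line, or whose size differs sharply from the others, is a hint that k may be poorly chosen.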

In [None]:
import matplotlib.cm as cm

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, k in enumerate([3, 4, 5]):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_k = km.fit_predict(X_blobs)
    sil_vals = silhouette_samples(X_blobs, labels_k)
    avg_score = silhouette_score(X_blobs, labels_k)

    ax = axes[idx]
    y_lower = 10
    for i in range(k):
        cluster_sil = np.sort(sil_vals[labels_k == i])
        y_upper = y_lower + len(cluster_sil)
        color = cm.nipy_spectral(float(i) / k)
        ax.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_sil,
                         facecolor=color, alpha=0.7)
        ax.text(-0.05, y_lower + 0.5 * len(cluster_sil), str(i))
        y_lower = y_upper + 10

    ax.axvline(x=avg_score, color="red", linestyle="--", label=f"Avg: {avg_score:.3f}")
    ax.set_title(f"Silhouette Plot (k={k})")
    ax.set_xlabel("Silhouette Coefficient")
    ax.set_ylabel("Cluster")
    ax.set_yticks([])  # y positions are sample indices, so tick values carry no meaning
    ax.legend(loc="upper right")

plt.tight_layout()
plt.show()
print("For these blobs, k=4 should show the most uniform silhouette widths, supporting it as the best choice.")

---

## 7. Limitations of K-Means

K-Means has several important limitations:

1. **Assumes spherical (globular) clusters** -- struggles with elongated or irregular shapes.
2. **Sensitive to initialization** -- `k-means++` helps but does not guarantee the global optimum.
3. **Must specify k in advance** -- the elbow method helps, but it is not always clear-cut.
4. **Sensitive to outliers** -- a single outlier can pull a centroid significantly.
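
Outlier sensitivity (point 4) follows from the centroid being a mean. The sketch below uses a single cluster (k=1) so the effect is easy to isolate: adding one extreme point shifts the centroid noticeably. The data, the outlier location `(100, 100)`, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

out_rng = np.random.default_rng(0)
X_clean = out_rng.normal(0, 1, (50, 2))            # one compact group around the origin
X_outlier = np.vstack([X_clean, [[100.0, 100.0]]])  # same group plus a single extreme point

# With k=1 the fitted centroid is simply the mean of the data
c_clean = KMeans(n_clusters=1, n_init=1, random_state=0).fit(X_clean).cluster_centers_[0]
c_outlier = KMeans(n_clusters=1, n_init=1, random_state=0).fit(X_outlier).cluster_centers_[0]

shift = np.linalg.norm(c_outlier - c_clean)
print("centroid without outlier:", np.round(c_clean, 3))
print("centroid with outlier:   ", np.round(c_outlier, 3))
print(f"shift caused by one point: {shift:.2f}")
```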

In [None]:
# Demonstration: KMeans fails on non-globular data
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=42)

km_moons = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_moons = km_moons.fit_predict(X_moons)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap="viridis", s=30, edgecolors="k")
axes[0].set_title("True Labels (moons)")

axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=labels_moons, cmap="viridis", s=30, edgecolors="k")
axes[1].set_title("KMeans Labels (k=2) -- FAILS on moons")

plt.tight_layout()
plt.show()
print("KMeans splits the moons with a straight boundary: nearest-centroid assignment only suits convex, roughly spherical clusters.")
print("We will see that DBSCAN handles this correctly in Notebook 04.")

---

## 8. Common Mistakes

| Mistake | Why It Matters |
|---|---|
| **Not scaling features** | Features with larger ranges dominate the Euclidean distance used by KMeans. Scale features (e.g., with `StandardScaler`) before clustering when their units or ranges differ. |
| **Choosing k without evaluation** | Always use the Elbow method and/or Silhouette score. Do not guess k. |
| **Applying KMeans to non-globular data** | KMeans assumes spherical clusters. Use DBSCAN or hierarchical clustering for non-convex shapes. |
| **Ignoring `n_init`** | A single run may converge to a poor local minimum. Set `n_init` explicitly (e.g., `n_init=10`), since the default depends on the scikit-learn version. |
| **Interpreting cluster labels as meaningful** | KMeans labels (0, 1, 2...) are arbitrary -- they do not correspond to any ordering or ground truth. |
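
The first mistake in the table is easy to reproduce. In the sketch below, one feature separates two groups while a second feature is pure noise on a much larger scale; the adjusted Rand index (ARI) against the known groups is compared with and without scaling. The synthetic data and all variable names are assumptions made for this illustration. ARI is used because it ignores how the arbitrary cluster labels are numbered.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

mix_rng = np.random.default_rng(0)
true_groups = np.repeat([0, 1], 100)

# Feature 1 separates the groups; feature 2 is noise with a far larger range
informative = np.where(true_groups == 0, 0.0, 5.0) + mix_rng.normal(0, 0.5, 200)
noisy_large = mix_rng.normal(0, 1000, 200)
X_mixed = np.column_stack([informative, noisy_large])

# Without scaling, distances are dominated by the noisy feature
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_mixed)

# After standardization both features contribute on a comparable scale
X_mixed_scaled = StandardScaler().fit_transform(X_mixed)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_mixed_scaled)

ari_raw = adjusted_rand_score(true_groups, raw_labels)
ari_scaled = adjusted_rand_score(true_groups, scaled_labels)
print(f"ARI without scaling: {ari_raw:.3f}")
print(f"ARI with scaling:    {ari_scaled:.3f}")
```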

---

## 9. Exercise

**Task:** Generate a dataset using `make_blobs` with `n_samples=500, centers=5, cluster_std=1.0, random_state=42`. Scale the data with `StandardScaler`. Run KMeans for k = 2 to 10. Plot both the Elbow curve (inertia) and the Silhouette score curve side by side. Identify the best k.

**Bonus:** Create a silhouette plot for the best k.

In [None]:
# YOUR CODE HERE

# 1. Generate blobs
# 2. Scale data
# 3. Loop over k = 2..10, compute inertia and silhouette score
# 4. Plot both curves side by side
# 5. Print the best k