# DBSCAN and Density-Based Clustering

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Explain the concepts of density-based clustering and the roles of core, border, and noise points
2. Use DBSCAN from scikit-learn with the key parameters `eps` and `min_samples`
3. Demonstrate how DBSCAN handles arbitrary cluster shapes where K-Means fails
4. Tune DBSCAN parameters using the k-distance plot
5. Visualize noise points and understand when DBSCAN is appropriate

## Prerequisites

- Notebook 02 (K-Means and Cluster Evaluation)
- Understanding of distance metrics (Notebook 01)
- NumPy, Matplotlib basics

## Table of Contents

1. [Theory: Density-Based Clustering](#1-theory-density-based-clustering)
2. [Core, Border, and Noise Points](#2-core-border-and-noise-points)
3. [DBSCAN Parameters](#3-dbscan-parameters)
4. [DBSCAN vs. K-Means on Moons Data](#4-dbscan-vs-k-means-on-moons-data)
5. [Parameter Tuning: k-Distance Plot](#5-parameter-tuning-k-distance-plot)
6. [Visualizing Noise Points](#6-visualizing-noise-points)
7. [Brief Mention: HDBSCAN (Optional)](#7-brief-mention-hdbscan-optional)
8. [Common Mistakes](#8-common-mistakes)
9. [Exercise](#9-exercise)

---

## 1. Theory: Density-Based Clustering

Unlike K-Means (which favors compact, roughly spherical clusters) or hierarchical clustering (which merges or splits clusters based on linkage distance), **density-based** methods define clusters as regions of high density separated by regions of low density.

**DBSCAN** (Density-Based Spatial Clustering of Applications with Noise) was proposed by Ester et al. in 1996. Its key advantages:

- Finds clusters of **arbitrary shape** (not just spherical)
- **Does not require specifying k** (the number of clusters) in advance
- Naturally identifies **noise/outlier** points
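- Deterministic for core and noise points (given the same parameters and data); a border point within `eps` of core points from two clusters can land in either one, depending on data order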

---

## 2. Core, Border, and Noise Points

DBSCAN classifies every point into one of three categories:

- **Core point:** has at least `min_samples` points within a radius of `eps` (scikit-learn counts the point itself)
- **Border point:** is within `eps` of a core point but has fewer than `min_samples` points in its own neighborhood
- **Noise point:** is neither core nor border -- it is an outlier
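
The three categories can be recovered from a fitted scikit-learn `DBSCAN` object. The following is a minimal sketch, not part of the original notebook cells: it relies on the `core_sample_indices_` and `labels_` attributes, and the three far-away points appended to the data are illustrative values chosen so that some noise is likely to appear.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
# Three far-away points (illustrative values) that should end up as noise
X = np.vstack([X, [[3.0, 3.0], [-3.0, -3.0], [3.0, -3.0]]])

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True   # indices of the core samples
is_noise = db.labels_ == -1               # noise is labeled -1
is_border = ~is_core & ~is_noise          # in a cluster, but not core

print(f"core={is_core.sum()}, border={is_border.sum()}, noise={is_noise.sum()}")
```

Every point falls into exactly one of the three masks, so the three counts add up to the number of samples.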

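Clusters are formed by connecting core points that are within `eps` of each other, directly or through a chain of other core points. A border point joins the cluster of a core point that has it in its `eps`-neighborhood; if core points from more than one cluster qualify, the assignment depends on the order in which the points are processed.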
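
To make the cluster-growing step concrete, here is a naive O(n^2) sketch of the procedure in NumPy. It is an illustration under simplifying assumptions (Euclidean distance, a full pairwise distance matrix, neighborhoods that include the point itself), not scikit-learn's implementation; the function name `dbscan_sketch` is hypothetical.

```python
from collections import deque

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons


def dbscan_sketch(X, eps, min_samples):
    """Naive DBSCAN: returns integer labels, with -1 marking noise."""
    n = len(X)
    # Full pairwise Euclidean distance matrix (fine for small n only)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(row <= eps) for row in dists]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        # Only an unlabeled core point can start a new cluster
        if labels[i] != -1 or not is_core[i]:
            continue
        labels[i] = cluster_id
        queue = deque([i])
        while queue:
            p = queue.popleft()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    queue.append(q)
        cluster_id += 1
    return labels


X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
labels_sketch = dbscan_sketch(X, eps=0.2, min_samples=5)
labels_sk = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_sketch = len(set(labels_sketch.tolist()) - {-1})
n_sk = len(set(labels_sk.tolist()) - {-1})
print(f"sketch: {n_sketch} clusters, scikit-learn: {n_sk} clusters")
```

Points that are never reached from any core point keep the initial label `-1`, which is how noise falls out of the procedure without a separate step.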

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_moons, make_blobs, make_circles
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score

sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

print("Imports complete.")

---

## 3. DBSCAN Parameters

| Parameter | Description | Effect |
|---|---|---|
| **`eps`** | Radius of the neighborhood | Smaller eps = more noise and more (fragmented) clusters; larger eps = fewer, merged clusters |
| **`min_samples`** | Min points in a neighborhood (including the point itself) to be a core point | Larger values = stricter core definition, more noise points |

There is no "k" to set -- the number of clusters is determined by the data and the parameters.
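
The effect of `min_samples` can be checked with a small sweep at a fixed `eps`. This is a sketch on the same moons settings used in this notebook; the candidate values `[3, 5, 10, 20]` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

candidates = [3, 5, 10, 20]  # arbitrary illustrative values
results = {}
for min_samples in candidates:
    labels = DBSCAN(eps=0.2, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    results[min_samples] = (n_clusters, n_noise)
    print(f"min_samples={min_samples:2d}: clusters={n_clusters}, noise={n_noise}")
```

With `eps` held fixed, raising `min_samples` can only shrink the set of core points, so the noise count never decreases along the sweep.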

In [None]:
# Generate moons dataset
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=42)

plt.scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap="viridis", s=30, edgecolors="k")
plt.title("Moons Dataset (true labels)")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

---

## 4. DBSCAN vs. K-Means on Moons Data

K-Means assigns each point to its nearest centroid, so with k=2 it splits the plane along a straight line and cuts across both crescents. DBSCAN follows the dense regions instead and can recover the two crescent-shaped clusters.

In [None]:
# K-Means on moons
km_moons = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_km = km_moons.fit_predict(X_moons)

# DBSCAN on moons
db_moons = DBSCAN(eps=0.2, min_samples=5)
labels_db = db_moons.fit_predict(X_moons)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap="viridis", s=30, edgecolors="k")
axes[0].set_title("True Labels")

axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=labels_km, cmap="viridis", s=30, edgecolors="k")
axes[1].set_title("K-Means (k=2) -- FAILS")

# Plot clustered points by label and noise points (-1) in red
noise_db = labels_db == -1
axes[2].scatter(X_moons[~noise_db, 0], X_moons[~noise_db, 1], c=labels_db[~noise_db], cmap="viridis", s=30, edgecolors="k")
axes[2].scatter(X_moons[noise_db, 0], X_moons[noise_db, 1], c="red", marker="x", s=60)
axes[2].set_title(f"DBSCAN (eps=0.2, min_samples=5) -- {len(set(labels_db)) - (1 if -1 in labels_db else 0)} clusters")

plt.tight_layout()
plt.show()
print("DBSCAN correctly identifies the two moon shapes. K-Means splits them incorrectly.")
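
The visual comparison can be backed by a number. The sketch below scores both clusterings against the true labels with the adjusted Rand index (`adjusted_rand_score`), a metric not used elsewhere in this notebook. It treats any DBSCAN noise label (`-1`) as just another group, so noise points count against the score.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

labels_km = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
labels_db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# ARI is 1.0 for a perfect match and close to 0 for random labelings
ari_km = adjusted_rand_score(y_true, labels_km)
ari_db = adjusted_rand_score(y_true, labels_db)

print(f"K-Means ARI: {ari_km:.3f}")
print(f"DBSCAN  ARI: {ari_db:.3f}")
```

This kind of check is only possible here because the synthetic data comes with ground-truth labels; on real data an internal measure or domain inspection would be needed instead.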

In [None]:
# Another example: concentric circles
X_circles, y_circles = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)

km_circles = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_km_c = km_circles.fit_predict(X_circles)

db_circles = DBSCAN(eps=0.2, min_samples=5)
labels_db_c = db_circles.fit_predict(X_circles)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

axes[0].scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, cmap="viridis", s=30, edgecolors="k")
axes[0].set_title("True Labels (circles)")

axes[1].scatter(X_circles[:, 0], X_circles[:, 1], c=labels_km_c, cmap="viridis", s=30, edgecolors="k")
axes[1].set_title("K-Means -- FAILS")

n_clusters_c = len(set(labels_db_c)) - (1 if -1 in labels_db_c else 0)
axes[2].scatter(X_circles[:, 0], X_circles[:, 1], c=labels_db_c, cmap="viridis", s=30, edgecolors="k")
axes[2].set_title(f"DBSCAN -- {n_clusters_c} clusters")

plt.tight_layout()
plt.show()

---

## 5. Parameter Tuning: k-Distance Plot

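The **k-distance plot** helps choose `eps`. For each point, compute the distance to its k-th nearest neighbor (where k = `min_samples`), then sort these distances and plot them. Points in dense regions have small k-distances and form the flat part of the curve; the sharp rise at the end corresponds to sparse points. The "elbow" where the curve bends upward suggests a good value for `eps`.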

In [None]:
# k-distance plot for the moons data
min_samples = 5

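# Querying the training points returns each point as its own first neighbor
# (distance 0), which matches DBSCAN's min_samples counting the point itself.
nn = NearestNeighbors(n_neighbors=min_samples)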
nn.fit(X_moons)
distances, indices = nn.kneighbors(X_moons)

# Sort the distance to the k-th neighbor (last column)
k_distances = np.sort(distances[:, -1])

plt.figure(figsize=(10, 5))
plt.plot(k_distances, linewidth=2)
plt.axhline(y=0.2, color="r", linestyle="--", label="eps = 0.2")
plt.xlabel("Points (sorted by k-distance)")
plt.ylabel(f"Distance to {min_samples}-th nearest neighbor")
plt.title("k-Distance Plot for eps Selection")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

print("If the curve bends sharply upward near the dashed line, eps = 0.2 is a reasonable choice.")
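
Reading the elbow by eye is subjective. One possible heuristic, sketched below, picks the point on the sorted k-distance curve that lies farthest below the straight line joining its two ends, after scaling both axes to [0, 1]. The helper name `suggest_eps` is hypothetical, and the heuristic is an assumption rather than a standard part of DBSCAN; treat its output as a starting point to inspect, not a final answer.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors


def suggest_eps(X, min_samples=5):
    """Return (eps_candidate, index) at the elbow of the sorted k-distance curve."""
    nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
    distances, _ = nn.kneighbors(X)  # each point is its own first neighbor
    k_dist = np.sort(distances[:, -1])

    # Scale both axes to [0, 1] so the chord runs from (0, 0) to (1, 1).
    # Assumes the k-distances are not all identical (no division by zero).
    x = np.linspace(0.0, 1.0, len(k_dist))
    y = (k_dist - k_dist[0]) / (k_dist[-1] - k_dist[0])

    # For a curve below the chord, the gap to the chord is proportional to x - y
    elbow_idx = int(np.argmax(x - y))
    return float(k_dist[elbow_idx]), elbow_idx


X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
eps_hat, elbow_idx = suggest_eps(X, min_samples=5)
print(f"Suggested eps: {eps_hat:.3f} (sorted index {elbow_idx} of {len(X)})")
```

A value somewhat above the suggested elbow is often a safer choice than one below it, since too small an `eps` fragments clusters and inflates the noise count.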

In [None]:
# Demonstrate effect of different eps values
fig, axes = plt.subplots(1, 4, figsize=(20, 4))

for ax, eps in zip(axes, [0.05, 0.1, 0.2, 0.5]):
    db = DBSCAN(eps=eps, min_samples=5)
    labels_eps = db.fit_predict(X_moons)
    n_clusters = len(set(labels_eps)) - (1 if -1 in labels_eps else 0)
    n_noise = np.sum(labels_eps == -1)
    ax.scatter(X_moons[:, 0], X_moons[:, 1], c=labels_eps, cmap="viridis", s=20, edgecolors="k")
    ax.set_title(f"eps={eps}\nclusters={n_clusters}, noise={n_noise}")

plt.suptitle("Effect of eps on DBSCAN Clustering", fontsize=14, y=1.05)
plt.tight_layout()
plt.show()

---

## 6. Visualizing Noise Points

DBSCAN labels noise points as `-1`. Below we highlight them explicitly.

In [None]:
# Add some outliers to the moons data
np.random.seed(42)
outliers = np.random.uniform(low=-1.5, high=2.5, size=(20, 2))
X_noisy = np.vstack([X_moons, outliers])

db_noisy = DBSCAN(eps=0.2, min_samples=5)
labels_noisy = db_noisy.fit_predict(X_noisy)

mask_noise = labels_noisy == -1
mask_cluster = ~mask_noise

plt.figure(figsize=(10, 6))
plt.scatter(X_noisy[mask_cluster, 0], X_noisy[mask_cluster, 1],
            c=labels_noisy[mask_cluster], cmap="viridis", s=30, edgecolors="k", label="Clustered")
plt.scatter(X_noisy[mask_noise, 0], X_noisy[mask_noise, 1],
            c="red", marker="x", s=80, linewidths=2, label="Noise")
plt.title("DBSCAN with Noise Points Highlighted")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.tight_layout()
plt.show()

n_clusters = len(set(labels_noisy)) - (1 if -1 in labels_noisy else 0)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {np.sum(mask_noise)}")

---

## 7. Brief Mention: HDBSCAN (Optional)

**HDBSCAN** (Hierarchical DBSCAN) extends DBSCAN by:
- Eliminating the need to choose `eps` -- it varies the density threshold automatically
- Building a hierarchy of clusters and extracting the most stable ones
- Handling clusters of varying densities

HDBSCAN is available via the `hdbscan` package or in recent versions of scikit-learn (1.3+).

In [None]:
# HDBSCAN demo (optional -- will skip gracefully if not installed)
try:
    from sklearn.cluster import HDBSCAN as SklearnHDBSCAN
    hdb = SklearnHDBSCAN(min_cluster_size=10)
    labels_hdb = hdb.fit_predict(X_noisy)
    n_clusters_hdb = len(set(labels_hdb)) - (1 if -1 in labels_hdb else 0)
    
    plt.figure(figsize=(10, 6))
    mask_noise_h = labels_hdb == -1
    plt.scatter(X_noisy[~mask_noise_h, 0], X_noisy[~mask_noise_h, 1],
                c=labels_hdb[~mask_noise_h], cmap="viridis", s=30, edgecolors="k", label="Clustered")
    plt.scatter(X_noisy[mask_noise_h, 0], X_noisy[mask_noise_h, 1],
                c="red", marker="x", s=80, linewidths=2, label="Noise")
    plt.title(f"HDBSCAN: {n_clusters_hdb} clusters found")
    plt.legend()
    plt.tight_layout()
    plt.show()
except ImportError:
    print("sklearn.cluster.HDBSCAN is not available in this environment.")
    print("Upgrade with: pip install -U 'scikit-learn>=1.3'")
except Exception as e:
    print(f"HDBSCAN could not run: {e}")

---

## 8. Common Mistakes

| Mistake | Why It Matters |
|---|---|
| **Wrong `eps` / `min_samples`** | Too small eps = everything is noise. Too large eps = one giant cluster. Use the k-distance plot. |
| **Not scaling features** | DBSCAN uses distance-based neighborhoods. Unscaled features distort the density estimation. |
| **Expecting DBSCAN to work like K-Means** | DBSCAN does not require k and does not produce centroids. It finds dense regions and labels outliers as noise (-1). |
| **Ignoring noise points** | Points labeled -1 are noise. Do not silently drop them or force them into clusters. Analyze them as potential outliers. |
| **Using DBSCAN on data with varying densities** | Standard DBSCAN uses a single eps for all clusters. If clusters have very different densities, consider HDBSCAN. |
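
The scaling mistake is easy to reproduce. In the sketch below the second feature of the moons data is multiplied by 1000 to mimic a unit mismatch (a made-up factor for illustration), and DBSCAN is run with the same `eps` before and after `StandardScaler`. The helper `summarize` is hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X_raw = X * np.array([1.0, 1000.0])            # made-up unit mismatch
X_scaled = StandardScaler().fit_transform(X_raw)


def summarize(data, eps=0.2, min_samples=5):
    """Return (number of clusters, number of noise points)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(data)
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


clusters_raw, noise_raw = summarize(X_raw)
clusters_scaled, noise_scaled = summarize(X_scaled)
print(f"unscaled: clusters={clusters_raw}, noise={noise_raw}")
print(f"scaled:   clusters={clusters_scaled}, noise={noise_scaled}")
```

Scaling changes all pairwise distances, so `eps` should be re-chosen on the scaled data (for example with the k-distance plot) rather than carried over from the raw features.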

---

## 9. Exercise

**Task:** Generate a dataset with `make_blobs(n_samples=400, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=42)`. Scale the data. Use the k-distance plot to choose `eps` (with `min_samples=5`). Run DBSCAN and visualize the result, highlighting noise points in red.

Bonus: Compare the DBSCAN result with K-Means (k=3) on the same data.

In [None]:
# YOUR CODE HERE

# 1. Generate blobs with varying cluster_std and scale
# 2. Compute k-distance plot and choose eps
# 3. Run DBSCAN and visualize clusters + noise
# 4. (Bonus) Run KMeans(k=3) and compare