# 028: DBSCAN - Density-Based Spatial Clustering with Noise

## 🎯 Learning Objectives

By the end of this notebook, you will:

1. **Understand density-based clustering**: Core points, border points, noise, reachability concepts
2. **Master DBSCAN parameters**: eps (neighborhood radius), min_samples (density threshold), distance metrics
3. **Implement from scratch**: Build DBSCAN algorithm with region queries and cluster expansion
4. **Handle arbitrary shapes**: Cluster non-spherical data that defeats K-Means and hierarchical methods
5. **Detect outliers automatically**: Label noise points as -1 without forcing into clusters
6. **Apply to real problems**: Geospatial defect clustering, anomaly detection, wafer map hotspots
7. **Compare with K-Means**: Understand when density-based clustering outperforms centroid-based methods

---

## 📊 DBSCAN Workflow

```mermaid
graph TD
    A[üì• Data Points N samples] --> B[‚öôÔ∏è Set Parameters: eps, min_samples]
    B --> C[üîç Mark Point Types]
    
    C --> D{For each point p}
    D --> E[üåê Find eps-neighborhood N_eps p]
    E --> F{N_eps >= min_samples?}
    
    F -->|Yes| G[‚úÖ Core Point]
    F -->|No| H[‚è∏Ô∏è Border or Noise tentative]
    
    G --> I[üå± Start New Cluster if unvisited]
    I --> J[‚ôªÔ∏è Expand Cluster: Add all density-reachable points]
    J --> K[üîó Recursive neighborhood search]
    
    K --> L{More unvisited points?}
    L -->|Yes| D
    L -->|No| M[üìã Final Classification]
    
    H --> N{Point in eps-neighborhood of core?}
    N -->|Yes| O[üî∂ Border Point assign to cluster]
    N -->|No| P[‚ùå Noise label=-1]
    
    M --> Q[üéØ Clusters + Outliers]
    Q --> R[üìà Evaluate: Silhouette, noise ratio]
    
    style G fill:#e1f5e1
    style O fill:#fff4e1
    style P fill:#ffe1e1
    style Q fill:#e1f0ff
```

---

## 🔍 DBSCAN vs K-Means vs Hierarchical

| **Criterion** | **DBSCAN** | **K-Means** | **Hierarchical** |
|--------------|-----------|------------|-----------------|
| **Requires K upfront** | ❌ No (discovers automatically) | ✅ Yes | ❌ No (cut dendrogram) |
| **Cluster shape** | Arbitrary (density-based) | Convex, roughly spherical | Linkage-dependent |
| **Handles outliers** | ✅ Excellent (labels as -1) | ❌ Poor (forces assignment) | ❌ Poor |
| **Scalability** | Medium: O(n log n) on average with a spatial index, O(n²) worst case | Excellent: O(nkt) | Poor: O(n²) or worse |
| **Density variation** | Poor (single eps for all) | N/A | N/A |
| **Parameter sensitivity** | High (eps, min_samples) | Medium (K, init) | Low (linkage choice) |
| **Deterministic** | ✅ Mostly (core and noise labels are fixed; a border point reachable from two clusters goes to whichever is expanded first) | ❌ No (random init) | ✅ Yes |
| **Best for** | Geospatial, arbitrary shapes, outliers | Large data, spherical clusters | Small data, taxonomy |

---

## 🏭 Real-World Applications

### Post-Silicon Validation
- **Wafer Map Defect Clustering**: Identify spatial defect clusters (hotspots) vs random failures, label noise as -1
- **Parametric Outlier Detection**: Cluster normal devices, automatically flag anomalies as noise
- **Spatial Yield Patterns**: Discover irregular yield zones (not circular like K-Means assumes)
- **Multi-Die Proximity Analysis**: Group die that fail together spatially (indicative of process tool issues)

### General AI/ML
- **Geospatial Analysis**: Cluster GPS coordinates (crime hotspots, customer locations, earthquake epicenters)
- **Anomaly Detection**: Network intrusion detection (normal traffic clusters, attacks = noise)
- **Image Segmentation**: Cluster pixels by color/texture, handle irregular object shapes
- **Time Series Clustering**: Group similar temporal patterns, ignore sporadic anomalies

---

## 📚 Mathematical Foundation

### Core DBSCAN Concepts

#### 1. Eps-Neighborhood
For a point $p$ and radius $\varepsilon$ (eps):
$$
N_{\varepsilon}(p) = \{q \in D : \text{dist}(p, q) \leq \varepsilon\}
$$
- All points within distance $\varepsilon$ from $p$
- Typically uses Euclidean distance, but can use Manhattan, Haversine (geospatial), etc.
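
The neighborhood definition can be sketched in a few lines. This is a minimal illustration, assuming a small in-memory array; the helper name `eps_neighborhood` and the toy coordinates are hypothetical, and `scipy.spatial.distance.cdist` is used only as a convenient way to swap metrics.

```python
import numpy as np
from scipy.spatial.distance import cdist

def eps_neighborhood(X, p_idx, eps, metric="euclidean"):
    """Return indices q with dist(X[p_idx], X[q]) <= eps (p itself is included)."""
    d = cdist(X[p_idx:p_idx + 1], X, metric=metric)[0]
    return np.flatnonzero(d <= eps)

# Hypothetical toy data: three nearby points and one far away
X_toy = np.array([[0.0, 0.0], [0.3, 0.0], [0.0, 0.4], [3.0, 3.0]])

nbrs_euclid = eps_neighborhood(X_toy, 0, eps=0.5)                         # Euclidean
nbrs_manhattan = eps_neighborhood(X_toy, 1, eps=0.5, metric="cityblock")  # Manhattan
print(nbrs_euclid, nbrs_manhattan)
```

Note that the same eps selects different neighbors under different metrics, which is why eps always has to be chosen together with the metric.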

#### 2. Point Classifications

**Core Point**: Has at least `min_samples` points in its $\varepsilon$-neighborhood (including itself)
$$
|N_{\varepsilon}(p)| \geq \text{min\_samples}
$$

**Border Point**: Not a core point, but is in the $\varepsilon$-neighborhood of a core point
$$
|N_{\varepsilon}(p)| < \text{min\_samples} \text{ AND } \exists \text{ core point } q : p \in N_{\varepsilon}(q)
$$

**Noise Point**: Neither core nor border (isolated, low-density region)
$$
|N_{\varepsilon}(p)| < \text{min\_samples} \text{ AND } \nexists \text{ core point } q : p \in N_{\varepsilon}(q)
$$
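
As a rough sketch of the three definitions above, the classification can be computed directly from a boolean "within eps" matrix. The function name `classify_points` and the toy points are illustrative assumptions, and the dense matrix makes this suitable for small inputs only.

```python
import numpy as np
from scipy.spatial.distance import cdist

def classify_points(X, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' following the definitions above."""
    within_eps = cdist(X, X) <= eps                   # within_eps[i, j]: j is in N_eps(i)
    is_core = within_eps.sum(axis=1) >= min_samples   # the count includes the point itself
    # Border: not core, but inside the eps-neighborhood of at least one core point
    near_core = within_eps[:, is_core].any(axis=1)
    kinds = np.full(len(X), "noise", dtype=object)
    kinds[near_core & ~is_core] = "border"
    kinds[is_core] = "core"
    return kinds

X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                  [0.45, 0.0],   # close to the dense square, but with too few neighbors
                  [5.0, 5.0]])   # isolated
kinds = classify_points(X_toy, eps=0.4, min_samples=4)
print(kinds)
```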

#### 3. Density Reachability

**Directly Density-Reachable**: Point $q$ is directly density-reachable from $p$ if:
1. $p$ is a core point
2. $q \in N_{\varepsilon}(p)$

**Density-Reachable**: Point $q$ is density-reachable from $p$ if there exists a chain:
$$
p = p_1, p_2, \ldots, p_n = q
$$
where each $p_{i+1}$ is directly density-reachable from $p_i$.

**Density-Connected**: Points $p$ and $q$ are density-connected if there exists a point $o$ such that both $p$ and $q$ are density-reachable from $o$.

#### 4. Cluster Definition

A **cluster** is a maximal set of density-connected points. Equivalently, for any core point $o$ in the cluster:
$$
C = \{p \in D : p \text{ is density-reachable from } o\}
$$
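
One way to see the cluster definition in action is to treat core points as graph nodes, connect two core points when they lie within eps of each other, and read clusters off the connected components. The sketch below is an alternative formulation for illustration, not the expansion procedure used later in this notebook; the function name `dbscan_via_components` and the toy data are assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

def dbscan_via_components(X, eps, min_samples):
    """Cluster labels from connected components of the core-point graph (-1 = noise)."""
    within_eps = cdist(X, X) <= eps
    is_core = within_eps.sum(axis=1) >= min_samples
    core_idx = np.flatnonzero(is_core)
    labels = np.full(len(X), -1, dtype=int)
    if core_idx.size == 0:
        return labels
    # Two core points within eps of each other are directly density-reachable
    # from one another, so each connected component is one cluster's core set.
    core_graph = csr_matrix(within_eps[np.ix_(core_idx, core_idx)].astype(np.int8))
    _, component = connected_components(core_graph, directed=False)
    labels[core_idx] = component
    # A border point joins the cluster of a core point that reaches it
    # (here the first one found; the definitions allow any of them).
    for i in np.flatnonzero(~is_core):
        reaching = core_idx[within_eps[i, core_idx]]
        if reaching.size:
            labels[i] = labels[reaching[0]]
    return labels

X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [2.0, 2.0], [2.1, 2.0], [2.0, 2.1],
                  [10.0, 10.0]])
toy_labels = dbscan_via_components(X_toy, eps=0.2, min_samples=3)
print(toy_labels)
```

The border-point loop makes the one source of ambiguity explicit: a border point within eps of core points from two different clusters could be assigned to either.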

### DBSCAN Algorithm (Ester et al. 1996)

**Input:** Dataset $D$, parameters $\varepsilon$ (eps), min_samples  
**Output:** Cluster labels for each point (0, 1, 2, ... or -1 for noise)

```
1. Initialize all points as unvisited
2. For each unvisited point p:
   a. Mark p as visited
   b. Find neighbors N = N_ε(p)
   c. If |N| < min_samples:
      - Mark p as NOISE (tentatively)
   d. Else:
      - p is a CORE point
      - Create new cluster C
      - Add p to C
      - For each point q in N:
         i. If q is unvisited:
            - Mark q as visited
            - Find neighbors N' = N_ε(q)
            - If |N'| >= min_samples:
               - Add N' to N (expand neighborhood)
         ii. If q not in any cluster:
             - Add q to C (this is where a tentative NOISE point becomes a border point)
3. Points still marked NOISE after the loop are not reachable from any core point: label them -1 (outliers)
```

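**Complexity:**
- Naive: O(n²) time, since every region query scans all n points
- With spatial index (KD-tree, Ball tree): O(n log n) on average in low dimensions; degrades toward O(n²) in high dimensions or when eps is large
- Memory: O(n) for labels and visited flags; a precomputed pairwise distance matrix adds O(n²)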

### Parameter Selection

#### Epsilon (eps) - Neighborhood Radius
- **Too small**: Most points become noise, many tiny clusters
- **Too large**: All points merge into one cluster
- **Heuristic**: k-distance plot (plot sorted distance to k-th nearest neighbor), look for "elbow"
- **Domain knowledge**: For wafer maps, eps = 5-10mm (typical die spacing)

**k-distance formula** (for k = min_samples):
$$
\text{k-dist}(p) = \text{distance to } k\text{-th nearest neighbor of } p
$$

Sort all k-dist values ascending and plot them. The k-dist value at the sharp increase (the elbow) is a good candidate for eps.

#### Min_samples - Minimum Density Threshold
- **Rule of thumb**: min_samples = $2 \times d$ (where $d$ = number of dimensions), and at least $d + 1$
- **2D data**: min_samples = 4-5
- **Higher dimensions**: min_samples = 6-10 as a starting point; increase for noisy or very large datasets
- **Trade-off**: 
  - Higher min_samples → fewer core points, so more noise and usually fewer clusters
  - Lower min_samples → more core points, so less noise and usually more, smaller clusters

**Post-Silicon Typical Values:**
- **Wafer maps (2D spatial)**: eps=5-10mm, min_samples=4-6
- **Parametric space (10-50D)**: eps=1-2 (normalized), min_samples=10-20
- **Test time clustering (1D)**: eps=2-5 seconds, min_samples=3-5
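
The min_samples trade-off can be checked empirically with a small sweep. The sketch below uses an illustrative eps and a synthetic moons dataset (both assumptions, not recommended values). For a fixed eps, raising min_samples can only shrink the set of core points, so the noise count never decreases; the cluster count has no such guarantee.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X_sweep, _ = make_moons(n_samples=300, noise=0.08, random_state=0)
X_sweep = StandardScaler().fit_transform(X_sweep)

eps_fixed = 0.3  # illustrative value, held constant across the sweep
noise_counts = []
for ms in (3, 5, 10, 20):
    sweep_labels = DBSCAN(eps=eps_fixed, min_samples=ms).fit_predict(X_sweep)
    n_clusters = len(set(sweep_labels) - {-1})
    n_noise = int(np.sum(sweep_labels == -1))
    noise_counts.append(n_noise)
    print(f"min_samples={ms:>2}: {n_clusters} clusters, {n_noise} noise points")
```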

### Distance Metrics

| **Metric** | **Formula** | **Use Case** |
|-----------|------------|-------------|
| **Euclidean** | $\sqrt{\sum (x_i - y_i)^2}$ | Continuous features, spatial data |
| **Manhattan** | $\sum |x_i - y_i|$ | Grid-like spaces, sparse data |
| **Haversine** | $2r \arcsin\sqrt{\sin^2(\frac{\Delta\phi}{2}) + \cos\phi_1\cos\phi_2\sin^2(\frac{\Delta\lambda}{2})}$ | Geographic coordinates (lat/lon) |
| **Cosine** | $1 - \frac{\sum x_i y_i}{\sqrt{\sum x_i^2}\sqrt{\sum y_i^2}}$ | Text vectors, high-dim sparse |
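
For the Haversine row, a minimal sketch with scikit-learn might look like the following. The coordinates are hypothetical. scikit-learn's `haversine` metric expects `[latitude, longitude]` in radians and returns distances on the unit sphere, so eps in kilometers is divided by the Earth's radius.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

# Hypothetical (lat, lon) points in degrees: three in Paris, three in Berlin, one near Reykjavik
coords_deg = np.array([
    [48.8566, 2.3522], [48.8570, 2.3530], [48.8560, 2.3510],
    [52.5200, 13.4050], [52.5205, 13.4060], [52.5195, 13.4040],
    [64.1466, -21.9426],
])

eps_km = 1.0
geo_db = DBSCAN(
    eps=eps_km / EARTH_RADIUS_KM,  # convert kilometers to radians on the unit sphere
    min_samples=3,
    metric="haversine",
    algorithm="ball_tree",         # the KD-tree does not support haversine
)
geo_labels = geo_db.fit_predict(np.radians(coords_deg))  # columns: [lat, lon] in radians
print(geo_labels)
```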

---

## 📦 Required Libraries

### 📝 What's Happening: Import Dependencies

**Purpose:** Load libraries for DBSCAN implementation, spatial indexing, and distance computations.

**Key Points:**
- **sklearn.cluster.DBSCAN**: Production-ready implementation with KD-tree optimization
- **sklearn.neighbors.NearestNeighbors**: Efficient radius queries for eps-neighborhood
- **scipy.spatial.distance**: Distance metrics (Euclidean, Manhattan, etc.)
- **matplotlib/seaborn**: Visualize clusters with different colors, noise in black
- **NumPy**: Distance computations and array operations

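**Why This Matters:** DBSCAN requires efficient neighbor queries (find all points within eps radius). With `algorithm='auto'`, sklearn can use a KD-tree or Ball tree, which answers a radius query in roughly O(log n) on average for low-dimensional data, versus O(n) for a naive scan. The speedup grows with dataset size and can reach orders of magnitude for 10K+ points, though the exact factor depends on dimensionality and eps.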
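
To make the radius query concrete, here is a small sketch using `NearestNeighbors.radius_neighbors` with two search strategies on hypothetical uniform data. It only checks that both strategies return the same neighborhoods; it does not measure speed.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(2000, 2))  # hypothetical 2D points
eps_demo = 0.25

tree_nn = NearestNeighbors(radius=eps_demo, algorithm="kd_tree").fit(X_demo)
brute_nn = NearestNeighbors(radius=eps_demo, algorithm="brute").fit(X_demo)

# For each query point, radius_neighbors returns the indices of points within `radius`.
# Because the query points are passed explicitly, each point appears in its own result.
tree_idx = tree_nn.radius_neighbors(X_demo, return_distance=False)
brute_idx = brute_nn.radius_neighbors(X_demo, return_distance=False)

same = all(set(a) == set(b) for a, b in zip(tree_idx, brute_idx))
mean_size = np.mean([len(a) for a in tree_idx])
print("Same neighborhoods from both search strategies:", same)
print("Mean neighborhood size (including the point itself):", mean_size)
```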

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons, make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, adjusted_rand_score
from scipy.spatial.distance import pdist, squareform

# Set random seed
np.random.seed(42)

# Plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")
print("\nKey Modules:")
print("  • sklearn.cluster.DBSCAN: Production implementation")
print("  • sklearn.neighbors.NearestNeighbors: Efficient radius queries")
print("  • make_moons: Generate non-spherical test data")
print("  • StandardScaler: Feature normalization (affects eps)")
print("\nDBSCAN Parameters:")
print("  • eps: Neighborhood radius (ε)")
print("  • min_samples: Minimum density threshold")
print("  • metric: Distance metric (euclidean, manhattan, etc.)")

---

## 🔨 Implementation From Scratch: DBSCAN Algorithm

### 📝 What's Happening: Building DBSCAN from Ground Up

**Purpose:** Implement DBSCAN from scratch to understand core/border/noise classification and cluster expansion logic.

**Key Points:**
- **Region Query**: Find all points within eps radius using distance matrix (naive O(n) per query)
- **Core Point Detection**: Mark points with >= min_samples neighbors as core points
- **Cluster Expansion**: Recursively add density-reachable points via BFS/DFS
- **Border Assignment**: Non-core points in core neighborhoods become border points
- **Noise Labeling**: Isolated points (not reachable from any core) labeled as -1

**Why This Matters:** Understanding expansion logic reveals why DBSCAN discovers arbitrary shapes (follows density contours) vs K-Means (spherical boundaries). In wafer defect clustering, DBSCAN traces irregular hotspot patterns that K-Means would split incorrectly.

**Post-Silicon Context:** For wafer map spatial defects, DBSCAN correctly identifies elongated scratch patterns or crescent-shaped edge failures that K-Means would fragment into multiple circular clusters. From-scratch implementation clarifies why eps must match physical die spacing (typically 5-10mm).

In [None]:
class DBSCANFromScratch:
    """
    DBSCAN clustering implementation from scratch.
    
    Uses naive O(n²) distance computation for simplicity.
    Production code should use KD-trees for O(n log n).
    """
    
    def __init__(self, eps=0.5, min_samples=5):
        """
        Parameters:
        -----------
        eps : float
            Maximum distance between two points to be neighbors
        min_samples : int
            Minimum points in neighborhood to be core point
        """
        self.eps = eps
        self.min_samples = min_samples
        self.labels_ = None
        self.core_sample_indices_ = None
    
    def fit(self, X):
        """
        Perform DBSCAN clustering.
        
        Parameters:
        -----------
        X : ndarray of shape (n_samples, n_features)
            Training data
        """
        n_samples = X.shape[0]
        
        # Compute pairwise distance matrix
        dist_matrix = squareform(pdist(X, metric='euclidean'))
        
        # Initialize labels: -1 = noise, 0+ = cluster IDs
        labels = np.full(n_samples, -1, dtype=int)
        
        # Track visited points
        visited = np.zeros(n_samples, dtype=bool)
        
        # Track core points
        core_points = []
        
        # Current cluster ID
        cluster_id = 0
        
        # Process each point
        for point_idx in range(n_samples):
            if visited[point_idx]:
                continue
            
            visited[point_idx] = True
            
            # Find eps-neighborhood
            neighbors = self._region_query(dist_matrix, point_idx)
            
            # Check if core point
            if len(neighbors) < self.min_samples:
                # Tentatively mark as noise (may become border later)
                labels[point_idx] = -1
            else:
                # Core point: start new cluster
                core_points.append(point_idx)
                labels = self._expand_cluster(X, dist_matrix, labels, point_idx, 
                                             neighbors, cluster_id, visited)
                cluster_id += 1
        
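        self.labels_ = labels
        # Every point with a dense enough neighborhood is a core point, not only the
        # seeds that started a cluster, so derive the indices from the distance matrix.
        neighbor_counts = (dist_matrix <= self.eps).sum(axis=1)
        self.core_sample_indices_ = np.flatnonzero(neighbor_counts >= self.min_samples)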
        
        return self
    
    def _region_query(self, dist_matrix, point_idx):
        """
        Find all points within eps distance of point_idx.
        
        Returns:
        --------
        neighbors : list
            Indices of points in eps-neighborhood
        """
        neighbors = np.where(dist_matrix[point_idx] <= self.eps)[0]
        return neighbors.tolist()
    
    def _expand_cluster(self, X, dist_matrix, labels, point_idx, neighbors, cluster_id, visited):
        """
        Expand cluster by adding all density-reachable points.
        
        Uses queue-based approach (BFS).
        """
        # Assign core point to cluster
        labels[point_idx] = cluster_id
        
        # Queue for processing neighbors
        seeds = neighbors.copy()
        
        while len(seeds) > 0:
            current_point = seeds.pop(0)
            
            if not visited[current_point]:
                visited[current_point] = True
                
                # Find neighbors of current point
                current_neighbors = self._region_query(dist_matrix, current_point)
                
                # If current point is also core, add its neighbors to queue
                if len(current_neighbors) >= self.min_samples:
                    seeds.extend(current_neighbors)
            
            # Assign to cluster if not already assigned
            if labels[current_point] == -1:
                labels[current_point] = cluster_id
        
        return labels
    
    def fit_predict(self, X):
        """
        Fit and return cluster labels.
        """
        self.fit(X)
        return self.labels_

print("✅ DBSCAN implemented from scratch!")
print("\nKey Methods:")
print("  • fit(X) - Perform clustering, store labels in labels_ (-1 for noise)")
print("  • fit_predict(X) - Fit and return the labels")
print("  • _region_query() - Find eps-neighborhood (all points within eps)")
print("  • _expand_cluster() - BFS expansion, add density-reachable points")
print("\nAlgorithm Flow:")
print("  1. For each unvisited point:")
print("     a. Find eps-neighborhood")
print("     b. If >= min_samples neighbors → core point, start cluster")
print("     c. Expand cluster via a BFS queue")
print("  2. Border points: Assigned to the first cluster whose expansion reaches them")
print("  3. Noise: Points not reachable from any core → label -1")
print("\nComplexity:")
print("  • Naive: O(n²) - compute all pairwise distances")
print("  • With KD-tree: O(n log n) - efficient neighbor queries")

### 📝 What's Happening: Testing on Non-Spherical Data (Moons)

**Purpose:** Validate from-scratch DBSCAN on crescent-shaped data where K-Means fails spectacularly.

**Key Points:**
- **make_moons Dataset**: Two interleaving crescent shapes (moons) - non-spherical, non-convex
- **DBSCAN Success**: Correctly identifies 2 moons + noise outliers (label -1)
- **K-Means Failure**: Would split moons incorrectly due to spherical cluster assumption
- **Parameter Tuning**: eps=0.3, min_samples=5 (determined by trial or k-distance plot)
- **Noise Handling**: Outlier points automatically labeled as -1 (black in visualization)

**Why This Matters:** Real-world clusters rarely form perfect circles. DBSCAN discovers arbitrary shapes by following density contours. In semiconductor defect analysis, scratch patterns, edge failures, and hotspots have irregular shapes that DBSCAN captures correctly.

**Post-Silicon Context:** Wafer map defects include elongated scratches (line-shaped), crescent-shaped edge exclusions, and irregular hotspots. DBSCAN accurately clusters these patterns while K-Means would create spurious circular boundaries crossing the true defect regions.

In [None]:
# Generate two moons dataset (non-spherical, non-convex)
X_moons, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

print("📊 Moons Dataset Generated:")
print(f"  • Shape: {X_moons.shape}")
print("  • True clusters: 2 moons")
print("  • Noise level: 0.05 (standard deviation of Gaussian noise added to each point)")

# Standardize features (DBSCAN sensitive to scale)
scaler = StandardScaler()
X_moons_scaled = scaler.fit_transform(X_moons)

# Train from-scratch DBSCAN
dbscan_scratch = DBSCANFromScratch(eps=0.3, min_samples=5)
dbscan_scratch.fit(X_moons_scaled)

# Count clusters and noise
n_clusters_scratch = len(set(dbscan_scratch.labels_)) - (1 if -1 in dbscan_scratch.labels_ else 0)
n_noise_scratch = list(dbscan_scratch.labels_).count(-1)

print(f"\n✅ DBSCAN From-Scratch Complete!")
print(f"  • Clusters found: {n_clusters_scratch}")
print(f"  • Noise points: {n_noise_scratch} ({n_noise_scratch/len(X_moons)*100:.1f}%)")
print(f"  • Core points: {len(dbscan_scratch.core_sample_indices_)}")
print(f"  • Border + noise: {len(X_moons) - len(dbscan_scratch.core_sample_indices_)}")

# Compare with true labels (if noise removed)
non_noise_mask = dbscan_scratch.labels_ != -1
if non_noise_mask.sum() > 0:
    ari = adjusted_rand_score(y_true[non_noise_mask], dbscan_scratch.labels_[non_noise_mask])
    print(f"  • ARI (non-noise points): {ari:.4f}")

# Visualize results
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Ground truth
axes[0].scatter(X_moons_scaled[:, 0], X_moons_scaled[:, 1], c=y_true, cmap='viridis',
                alpha=0.6, edgecolors='k', s=60)
axes[0].set_title("Ground Truth (2 Moons)", fontsize=14, fontweight='bold')
axes[0].set_xlabel("Feature 1 (scaled)")
axes[0].set_ylabel("Feature 2 (scaled)")

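# DBSCAN clusters, with noise points (label -1) drawn separately in black
noise_mask = dbscan_scratch.labels_ == -1
axes[1].scatter(X_moons_scaled[~noise_mask, 0], X_moons_scaled[~noise_mask, 1],
                c=dbscan_scratch.labels_[~noise_mask], cmap='viridis',
                alpha=0.6, edgecolors='k', s=60)
axes[1].scatter(X_moons_scaled[noise_mask, 0], X_moons_scaled[noise_mask, 1],
                c='black', marker='x', s=60, label='Noise')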
# Mark core points with larger size
core_mask = np.zeros(len(X_moons), dtype=bool)
core_mask[dbscan_scratch.core_sample_indices_] = True
axes[1].scatter(X_moons_scaled[core_mask, 0], X_moons_scaled[core_mask, 1],
                c=dbscan_scratch.labels_[core_mask], cmap='viridis',
                edgecolors='black', s=120, linewidths=2, alpha=0.8, marker='o', label='Core')
axes[1].set_title(f"DBSCAN From-Scratch ({n_clusters_scratch} clusters, {n_noise_scratch} noise)", 
                  fontsize=14, fontweight='bold')
axes[1].set_xlabel("Feature 1 (scaled)")
axes[1].set_ylabel("Feature 2 (scaled)")
axes[1].legend()

# For comparison: K-Means (will fail)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)  # explicit n_init keeps behavior stable across sklearn versions
kmeans_labels = kmeans.fit_predict(X_moons_scaled)
axes[2].scatter(X_moons_scaled[:, 0], X_moons_scaled[:, 1], c=kmeans_labels, cmap='viridis',
                alpha=0.6, edgecolors='k', s=60)
axes[2].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                marker='X', s=300, c='red', edgecolors='black', linewidths=2, label='Centroids')
axes[2].set_title("K-Means (FAILS - spherical assumption)", fontsize=14, fontweight='bold')
axes[2].set_xlabel("Feature 1 (scaled)")
axes[2].set_ylabel("Feature 2 (scaled)")
axes[2].legend()

plt.tight_layout()
plt.show()

print("\n🔍 Interpretation:")
print("  • DBSCAN: Correctly separates 2 moons by following density contours")
print("  • K-Means: Fails - draws straight line boundary, splits moons incorrectly")
print("  • Core points (large circles): High-density regions forming cluster centers")
print("  • Border points (small circles): Low-density edges of clusters")
print("  • Noise (black): Outliers not assigned to any cluster")
print("\n💡 Post-Silicon Analogy:")
print("  • Moon shapes = Irregular wafer defect patterns (scratches, edge exclusions)")
print("  • DBSCAN traces defect contours accurately")
print("  • K-Means creates artificial circular boundaries crossing defect regions")

---

## 🎯 Parameter Tuning: k-Distance Plot for Optimal Eps

### 📝 What's Happening: Finding Optimal Eps Automatically

**Purpose:** Use k-distance plot (sorted distance to k-th nearest neighbor) to identify optimal eps value without trial-and-error.

**Key Points:**
- **k-Distance Calculation**: For each point, find distance to k-th nearest neighbor (k = min_samples)
- **Sort and Plot**: Sort distances ascending, plot index vs distance
- **Elbow Detection**: Sharp increase (elbow) indicates optimal eps - points beyond are outliers
- **Interpretation**: Elbow = transition from dense regions (clusters) to sparse regions (noise)
- **Automation**: Can use knee detection algorithms (kneed library) for programmatic eps selection

**Why This Matters:** Manual eps tuning is tedious and subjective. k-distance plot provides data-driven eps selection. For post-silicon applications with varying die density or test distributions, k-distance plot adapts automatically to data characteristics.

**Post-Silicon Context:** Wafer maps have varying defect densities across lots (high-yield vs low-yield wafers). k-distance plot automatically adjusts eps: tight clusters for high-yield (small eps), looser for low-yield (larger eps). Eliminates manual retuning per lot.
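
For programmatic elbow detection, one simple option is to pick the point on the sorted k-distance curve that lies farthest below the straight line joining its endpoints. The sketch below is a hand-rolled geometric heuristic under that assumption; it is not the `kneed` library's algorithm, and the helper name `knee_by_chord_distance` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def knee_by_chord_distance(sorted_values):
    """Index of the point farthest below the line joining the curve's endpoints."""
    y = np.asarray(sorted_values, dtype=float)
    if y[-1] == y[0]:
        return 0  # flat curve: no knee to find
    x = np.arange(len(y), dtype=float)
    # Normalize both axes to [0, 1] so the result does not depend on units
    x_n = x / x[-1]
    y_n = (y - y[0]) / (y[-1] - y[0])
    # For an ascending curve that stays flat and then rises sharply,
    # the knee maximizes the gap between the diagonal and the curve.
    return int(np.argmax(x_n - y_n))

X_knee, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X_knee = StandardScaler().fit_transform(X_knee)

k_knee = 5
dists, _ = NearestNeighbors(n_neighbors=k_knee).fit(X_knee).kneighbors(X_knee)
k_dist_sorted = np.sort(dists[:, -1])

knee_idx = knee_by_chord_distance(k_dist_sorted)
print(f"Knee at index {knee_idx}, candidate eps = {k_dist_sorted[knee_idx]:.3f}")
```

Any automatic choice like this is best treated as a starting point and checked against the plot, since curves without a clear elbow still return some index.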

In [None]:
# Compute k-distance plot for eps selection
k = 5  # Same as min_samples

# Fit NearestNeighbors to find k-th nearest neighbor distances
neighbors_model = NearestNeighbors(n_neighbors=k)
neighbors_model.fit(X_moons_scaled)

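# Get distances to the k nearest neighbors of each point. Because the query set is
# the training set, each point is returned as its own nearest neighbor (distance 0),
# so the last column counts the point itself, matching how min_samples is defined.
distances, indices = neighbors_model.kneighbors(X_moons_scaled)
k_distances = distances[:, -1]  # Distance to k-th neighbor (last column)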

# Sort distances
k_distances_sorted = np.sort(k_distances)

print("📊 k-Distance Plot Analysis:")
print(f"  • k (min_samples): {k}")
print(f"  • Min k-distance: {k_distances_sorted[0]:.4f}")
print(f"  • Max k-distance: {k_distances_sorted[-1]:.4f}")
print(f"  • Median k-distance: {np.median(k_distances_sorted):.4f}")

# Plot k-distance plot
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(k_distances_sorted, linewidth=2, color='steelblue')
plt.axhline(y=0.3, color='red', linestyle='--', linewidth=2, alpha=0.7, label='Suggested eps=0.3')
plt.xlabel("Points (sorted by distance)", fontsize=12)
plt.ylabel(f"{k}-Distance (to {k}-th nearest neighbor)", fontsize=12)
plt.title("k-Distance Plot for Eps Selection", fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.legend()

# Zoomed view (first 90% of points)
plt.subplot(1, 2, 2)
cutoff = int(0.9 * len(k_distances_sorted))
plt.plot(k_distances_sorted[:cutoff], linewidth=2, color='steelblue')
plt.axhline(y=0.3, color='red', linestyle='--', linewidth=2, alpha=0.7, label='Suggested eps=0.3')
plt.xlabel("Points (sorted, first 90%)", fontsize=12)
plt.ylabel(f"{k}-Distance", fontsize=12)
plt.title("k-Distance Plot (Zoomed - First 90%)", fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.legend()

plt.tight_layout()
plt.show()

# Identify elbow (simple heuristic: largest jump between consecutive sorted distances).
# This is crude: the largest jump often sits in the far right tail, among a few
# extreme outliers, so always confirm the suggestion against the plot.
distance_diffs = np.diff(k_distances_sorted)
elbow_idx = np.argmax(distance_diffs)
suggested_eps = k_distances_sorted[elbow_idx]

print(f"\n🎯 Eps Selection Guidance:")
print(f"  • Elbow detected at index: {elbow_idx}")
print(f"  • Suggested eps (elbow): {suggested_eps:.4f}")
print(f"  • Manual eps used: 0.3 (differs from suggestion by {abs(0.3 - suggested_eps):.4f})")
print(f"\n🔍 Interpretation:")
print("  • Flat region (left): Dense clusters, points close together")
print("  • Sharp rise (right): Outliers, far from clusters")
print("  • Elbow: Candidate eps separating clusters from noise")
print("\n💡 How to Use k-Distance Plot:")
print("  1. Compute k-distances for all points (k = min_samples)")
print("  2. Sort distances ascending")
print("  3. Plot: Look for sharp increase (elbow)")
print("  4. Set eps slightly above elbow value")
print("  5. If no clear elbow: Try different min_samples or domain knowledge")

# Test different eps values
eps_values = [0.2, 0.3, 0.4, 0.5]
print(f"\n🧪 Testing Different Eps Values:")
print(f"{'Eps':<8} {'Clusters':<10} {'Noise %':<12} {'Silhouette'}")
print("-" * 50)

for eps_val in eps_values:
    dbscan_test = DBSCANFromScratch(eps=eps_val, min_samples=k)
    dbscan_test.fit(X_moons_scaled)
    
    n_clusters = len(set(dbscan_test.labels_)) - (1 if -1 in dbscan_test.labels_ else 0)
    n_noise = list(dbscan_test.labels_).count(-1)
    noise_pct = n_noise / len(X_moons) * 100
    
    # Silhouette score (exclude noise); undefined unless at least 2 clusters remain,
    # so report NaN instead of a misleading 0.0 in that case
    if n_clusters > 1:
        non_noise_mask = dbscan_test.labels_ != -1
        silhouette = silhouette_score(X_moons_scaled[non_noise_mask],
                                      dbscan_test.labels_[non_noise_mask])
    else:
        silhouette = float('nan')
    
    print(f"{eps_val:<8.2f} {n_clusters:<10} {noise_pct:<12.1f} {silhouette:.4f}")

print("\n✅ Reading the table (eps=0.3 is the value used above):")
print("  • A good eps recovers the 2 moons with only a small noise fraction")
print("  • Too low: Fragments clusters, excessive noise")
print("  • Too high: Merges clusters, loses separation")

---

## 🏭 Production Implementation: sklearn.cluster.DBSCAN

### 📝 What's Happening: sklearn DBSCAN with KD-Tree Optimization

**Purpose:** Use production-grade sklearn DBSCAN, whose optimized spatial indexing can give a 10-100× speedup on large, low-dimensional datasets.

**Key Points:**
- **sklearn.cluster.DBSCAN**: Industry-standard implementation with KD-tree/Ball tree neighbor search, O(n log n) on average
- **Algorithm Parameter**: Choose 'auto', 'ball_tree', 'kd_tree', or 'brute' for neighbor search
- **Metric Options**: Supports 20+ distance metrics (euclidean, manhattan, haversine, etc.)
- **Leaf Size Tuning**: Affects tree construction and query speed (default 30, worth tuning for 10K+ points)
- **Memory Efficiency**: Avoids the full O(n²) distance matrix of the from-scratch version, though all neighborhoods are still held in memory, so a large eps on dense data can be costly

**Why This Matters:** From-scratch DBSCAN is O(n²) in both time and memory (a 50K × 50K float64 distance matrix alone needs about 20 GB), so it becomes impractical on large datasets. sklearn's spatial indexing brings the average complexity down to O(n log n), which makes near-real-time clustering feasible. For semiconductor applications with 50K+ devices or wafer die, the speedup can reach roughly 100×.

**Post-Silicon Context:** Clustering 50K die on 200 wafers (illustrative estimates, not measurements):
- **From-scratch**: 50K² = 2.5B distance computations → 20+ minutes
- **sklearn with KD-tree**: 50K × log₂(50K) ≈ 800K tree operations → 10-15 seconds
- Speedup enables real-time wafer map analysis during test execution
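
Rather than relying on estimates, the difference can be measured on your own hardware. The sketch below times `algorithm='brute'` against `algorithm='kd_tree'` on hypothetical synthetic coordinates (a uniform background plus two dense blobs) and checks that both produce the same clustering. Absolute timings will vary by machine, and the gap generally widens as n grows.

```python
import time
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Hypothetical stand-in for die coordinates: uniform background plus two dense blobs
X_big = np.vstack([
    rng.uniform(0, 100, size=(4000, 2)),
    rng.normal(loc=(30, 30), scale=1.5, size=(500, 2)),
    rng.normal(loc=(70, 60), scale=1.5, size=(500, 2)),
])

results = {}
for algorithm in ("brute", "kd_tree"):
    start = time.perf_counter()
    run_labels = DBSCAN(eps=1.0, min_samples=8, algorithm=algorithm).fit_predict(X_big)
    elapsed = time.perf_counter() - start
    results[algorithm] = run_labels
    print(f"{algorithm:>8}: {elapsed:.3f} s, {len(set(run_labels) - {-1})} clusters")

# The search strategy should change speed, not the clustering itself
ari_between_runs = adjusted_rand_score(results["brute"], results["kd_tree"])
print("ARI between the two runs:", ari_between_runs)
```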

In [None]:
# sklearn DBSCAN with optimized spatial indexing
dbscan_sklearn = DBSCAN(eps=0.3, min_samples=5, algorithm='auto', metric='euclidean')
labels_sklearn = dbscan_sklearn.fit_predict(X_moons_scaled)

# Extract metrics
n_clusters_sklearn = len(set(labels_sklearn)) - (1 if -1 in labels_sklearn else 0)
n_noise_sklearn = list(labels_sklearn).count(-1)
core_samples_mask = np.zeros_like(labels_sklearn, dtype=bool)
core_samples_mask[dbscan_sklearn.core_sample_indices_] = True

print("✅ sklearn DBSCAN Complete!")
print(f"  • Clusters found: {n_clusters_sklearn}")
print(f"  • Noise points: {n_noise_sklearn} ({n_noise_sklearn/len(X_moons)*100:.1f}%)")
print(f"  • Core points: {len(dbscan_sklearn.core_sample_indices_)}")
print(f"  • Components: {dbscan_sklearn.components_.shape}")

# Compare with from-scratch
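print(f"\n🔍 From-Scratch vs sklearn Comparison:")
print(f"{'Metric':<20} {'From-Scratch':<15} {'sklearn':<15} {'Match?'}")
print("-" * 60)
print(f"{'Clusters':<20} {n_clusters_scratch:<15} {n_clusters_sklearn:<15} {'✅' if n_clusters_scratch == n_clusters_sklearn else '❌'}")
print(f"{'Noise points':<20} {n_noise_scratch:<15} {n_noise_sklearn:<15} {'✅' if n_noise_scratch == n_noise_sklearn else '❌'}")

# Check label agreement. Comparing raw labels assumes both implementations number
# clusters in the same order (both scan points by index), so also report ARI,
# which does not depend on how cluster IDs are numbered.
label_agreement = np.sum(dbscan_scratch.labels_ == labels_sklearn) / len(X_moons) * 100
label_ari = adjusted_rand_score(dbscan_scratch.labels_, labels_sklearn)
agreement_str = f"{label_agreement:.1f}%"
print(f"{'Label agreement':<20} {'N/A':<15} {agreement_str:<15} {'✅' if label_agreement > 95 else '⚠️'}")
print(f"{'ARI':<20} {'N/A':<15} {label_ari:<15.4f} {'✅' if label_ari > 0.95 else '⚠️'}")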

# Visualize sklearn results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# From-scratch
axes[0].scatter(X_moons_scaled[:, 0], X_moons_scaled[:, 1], c=dbscan_scratch.labels_,
                cmap='viridis', alpha=0.6, edgecolors='k', s=60)
axes[0].set_title(f"From-Scratch ({n_clusters_scratch} clusters)", fontsize=14, fontweight='bold')
axes[0].set_xlabel("Feature 1")
axes[0].set_ylabel("Feature 2")

# sklearn
axes[1].scatter(X_moons_scaled[:, 0], X_moons_scaled[:, 1], c=labels_sklearn,
                cmap='viridis', alpha=0.6, edgecolors='k', s=60)
# Mark core points
axes[1].scatter(X_moons_scaled[core_samples_mask, 0], X_moons_scaled[core_samples_mask, 1],
                c=labels_sklearn[core_samples_mask], cmap='viridis',
                edgecolors='black', s=120, linewidths=2, alpha=0.8, marker='o', label='Core')
axes[1].set_title(f"sklearn ({n_clusters_sklearn} clusters, KD-tree optimized)", fontsize=14, fontweight='bold')
axes[1].set_xlabel("Feature 1")
axes[1].set_ylabel("Feature 2")
axes[1].legend()

plt.tight_layout()
plt.show()

print("\n✅ Validation Summary:")
if label_agreement > 95:
    print(f"  • From-scratch and sklearn agree on {label_agreement:.1f}% of labels")
    print("  • Algorithm correctness verified")
else:
    print("  • Labels differ on more than 5% of points: check cluster numbering and border-point handling")

print("\n⚡ Performance Comparison (rough, unmeasured estimates for 50K points):")
print("  • From-Scratch: ~1200 seconds (naive O(n²) distance matrix)")
print("  • sklearn (brute): ~900 seconds (optimized loops, but still O(n²))")
print("  • sklearn (KD-tree): ~12 seconds (O(n log n) neighbor queries)")
print("  • Speedup: ~100× faster (critical for real-time wafer analysis)")

# Demonstrate different metrics
print("\n🌍 Distance Metric Demonstration:")
metrics_to_test = ['euclidean', 'manhattan', 'chebyshev']

for metric in metrics_to_test:
    dbscan_metric = DBSCAN(eps=0.3, min_samples=5, metric=metric)
    labels_metric = dbscan_metric.fit_predict(X_moons_scaled)
    n_clusters_metric = len(set(labels_metric)) - (1 if -1 in labels_metric else 0)
    n_noise_metric = list(labels_metric).count(-1)
    
    print(f"  • {metric.capitalize():<12}: {n_clusters_metric} clusters, {n_noise_metric} noise ({n_noise_metric/len(X_moons)*100:.1f}%)")

print("\n💡 Metric Selection Guidelines:")
print("  • Euclidean: General-purpose (as-the-crow-flies distance)")
print("  • Manhattan: Grid-like spaces, sparse data (city block distance)")
print("  • Haversine: Geographic coordinates (lat/lon on sphere)")
print("  • Cosine: Text/document vectors (angle-based similarity)")

---

## 🏭 Real-World Application: Wafer Map Spatial Defect Clustering

### Post-Silicon Validation Use Case

**Business Problem:** Semiconductor wafer testing produces spatial defect maps showing failed die locations. Engineers need to:
1. Distinguish systematic defects (clusters, patterns) from random failures (noise)
2. Identify hotspot locations for root cause investigation (process tool, contamination)
3. Quantify defect cluster characteristics (size, density, shape) for yield impact analysis
4. Prioritize failure analysis efforts on clustered (systematic) vs isolated (random) failures

**DBSCAN Solution:** Cluster defect die locations `(die_x, die_y)` to automatically identify hotspots while labeling random failures as noise (-1). No need to specify number of defect patterns upfront.

### 📝 What's Happening: Wafer Defect Pattern Discovery

**Purpose:** Apply DBSCAN to a simulated wafer defect map (130 defect die: three systematic patterns plus random failures) to separate systematic from random failures.

**Key Points:**
- **Defect Patterns**: Simulate 3 systematic defect clusters (scratch, hotspot, edge cluster) plus 25 random failures (about 19% of defect die)
- **Spatial Coordinates**: Die (x, y) positions on a 300mm wafer
- **Automatic Discovery**: DBSCAN should recover the 3 systematic clusters without K being specified upfront
- **Noise Labeling**: Isolated random failures are labeled -1 rather than forced into clusters as in K-Means; a random failure that lands within eps of a cluster can still be absorbed into it
- **Business Value**: Systematic defects (clusters) get priority FA investigation ($50K-200K per analysis), random failures (noise) are deprioritized

**Why This Matters:** Manual defect classification takes 30-60 minutes per wafer map; DBSCAN provides instant systematic vs random separation. For 1000 wafers/day fabs, automated clustering saves 500+ engineering hours/day and catches systematic excursions hours faster.

**Post-Silicon Context:** Real wafer defect patterns include:
- **Scratch clusters**: Linear arrangements (equipment damage)
- **Hotspots**: Circular high-density regions (contamination particles)
- **Edge clusters**: Crescent-shaped edge exclusions (process uniformity)
- **Random failures**: Scattered isolated die (intrinsic yield loss)

DBSCAN can separate all of these pattern types from isolated failures, whereas K-Means forces every random failure into its nearest cluster, creating false systematic classifications.
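
To make that contrast concrete, here is a minimal, self-contained sketch on toy data. The two tight blobs, the four hand-placed isolated points, and the `eps`/`min_samples` values are illustrative assumptions, not the wafer data used in this notebook. It scores both algorithms against ground truth with the adjusted Rand index (ARI).

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): two tight blobs plus four isolated points
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(30, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.3, size=(30, 2))
isolated = np.array([[5.0, 0.0], [0.0, 10.0], [10.0, 0.0], [5.0, 5.0]])

X_toy = np.vstack([blob_a, blob_b, isolated])
y_true = np.array([0] * 30 + [1] * 30 + [-1] * 4)  # -1 marks true noise

db_labels = DBSCAN(eps=1.0, min_samples=4).fit_predict(X_toy)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_toy)

print("DBSCAN noise points :", int(np.sum(db_labels == -1)))
print("K-Means noise points:", int(np.sum(km_labels == -1)))  # K-Means has no noise label
print("ARI DBSCAN :", round(adjusted_rand_score(y_true, db_labels), 3))
print("ARI K-Means:", round(adjusted_rand_score(y_true, km_labels), 3))
```

ARI is invariant to how cluster ids are numbered, so it is a reasonable way to check the "systematic vs random" split against simulated ground truth instead of relying on visual inspection alone.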

In [None]:
# Generate realistic wafer defect map
np.random.seed(42)
wafer_radius = 150  # mm (300mm wafer)

# Systematic defect pattern 1: Scratch (linear cluster)
scratch_x = np.random.uniform(-80, 80, 40)
scratch_y = 0.5 * scratch_x + np.random.normal(0, 5, 40)  # Linear with small noise

# Systematic defect pattern 2: Hotspot (circular cluster, high density)
hotspot_center = (60, -60)
hotspot_angles = np.random.uniform(0, 2*np.pi, 35)
hotspot_radii = np.random.exponential(10, 35)  # Dense center, sparse edges
hotspot_x = hotspot_center[0] + hotspot_radii * np.cos(hotspot_angles)
hotspot_y = hotspot_center[1] + hotspot_radii * np.sin(hotspot_angles)

# Systematic defect pattern 3: Edge cluster (crescent shape)
edge_angles = np.random.uniform(np.pi/4, np.pi/2, 30)  # 45° to 90°: upper half of quadrant 1
edge_radii = np.random.uniform(130, 145, 30)  # Near edge
edge_x = edge_radii * np.cos(edge_angles)
edge_y = edge_radii * np.sin(edge_angles)

# Random failures (noise): Scattered across wafer
n_random = 25
random_angles = np.random.uniform(0, 2*np.pi, n_random)
random_radii = np.sqrt(np.random.uniform(0, 1, n_random)) * wafer_radius * 0.9
random_x = random_radii * np.cos(random_angles)
random_y = random_radii * np.sin(random_angles)

# Combine all defect die
defect_x = np.concatenate([scratch_x, hotspot_x, edge_x, random_x])
defect_y = np.concatenate([scratch_y, hotspot_y, edge_y, random_y])
true_labels = np.concatenate([
    np.zeros(len(scratch_x)),      # Cluster 0: scratch
    np.ones(len(hotspot_x)),       # Cluster 1: hotspot
    np.full(len(edge_x), 2),       # Cluster 2: edge
    np.full(len(random_x), -1)     # -1: random noise
])

X_defects = np.column_stack([defect_x, defect_y])

print("📊 Wafer Defect Map Generated:")
print(f"  • Total defect die: {len(X_defects)}")
print(f"  • Scratch cluster: {len(scratch_x)} die (linear pattern)")
print(f"  • Hotspot cluster: {len(hotspot_x)} die (circular, high density)")
print(f"  • Edge cluster: {len(edge_x)} die (crescent shape)")
print(f"  • Random failures: {len(random_x)} die (isolated, no pattern)")

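# Determine eps using k-distance plot
# kneighbors() on the fitted data returns each point as its own nearest neighbor
# (distance 0), so the last column is the distance to the 3rd *other* point.
# This matches sklearn's min_samples=4, which also counts the point itself.
k = 4
neighbors_defects = NearestNeighbors(n_neighbors=k)
neighbors_defects.fit(X_defects)
distances_defects, _ = neighbors_defects.kneighbors(X_defects)
k_distances_defects = np.sort(distances_defects[:, -1])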

# Plot k-distance for defect data
plt.figure(figsize=(10, 4))
plt.plot(k_distances_defects, linewidth=2, color='steelblue')
plt.axhline(y=15, color='red', linestyle='--', linewidth=2, alpha=0.7, label='Suggested eps=15mm')
plt.xlabel("Defect Die (sorted by distance)")
plt.ylabel("4-Distance (mm)")
plt.title("k-Distance Plot for Wafer Defect Data", fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

print(f"\n🎯 Eps Selection for Wafer Defects:")
print(f"  • Suggested eps: 15mm (from k-distance elbow)")
print(f"  • Physical interpretation: ~2-3 die spacing (typical 5-7mm/die)")
print(f"  • min_samples: 4 (2D data, rule of thumb 2×dim)")

# Apply DBSCAN to wafer defects
dbscan_defects = DBSCAN(eps=15, min_samples=4, metric='euclidean')
defect_labels = dbscan_defects.fit_predict(X_defects)

n_clusters_defects = len(set(defect_labels)) - (1 if -1 in defect_labels else 0)
n_noise_defects = list(defect_labels).count(-1)

print(f"\n✅ Wafer Defect Clustering Complete!")
print(f"  • Systematic defect clusters found: {n_clusters_defects}")
print(f"  • Random failures (noise): {n_noise_defects} ({n_noise_defects/len(X_defects)*100:.1f}%)")
print(f"  • Core defect die: {len(dbscan_defects.core_sample_indices_)}")

# Cluster characterization
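print(f"\n📋 Cluster Characterization:")
eps_defects = dbscan_defects.eps  # radius used for this fit (mm), not the eps from earlier cells
for cluster_id in range(n_clusters_defects):
    cluster_mask = defect_labels == cluster_id
    cluster_size = np.sum(cluster_mask)
    # Crude density proxy: cluster size divided by the area of one eps-disk
    cluster_density = cluster_size / (np.pi * (eps_defects ** 2))  # die per mm²
    
    print(f"  • Cluster {cluster_id}:")
    print(f"    - Size: {cluster_size} die")
    print(f"    - Density: {cluster_density:.4f} die/mm² (per eps-disk area)")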
    print(f"    - Center: ({defect_x[cluster_mask].mean():.1f}, {defect_y[cluster_mask].mean():.1f})")

# Visualize wafer map with clusters
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Ground truth
axes[0].scatter(defect_x, defect_y, c=true_labels, cmap='viridis', 
                s=80, alpha=0.7, edgecolors='k', linewidth=1)
axes[0].add_patch(plt.Circle((0, 0), wafer_radius, fill=False, edgecolor='gray', linewidth=2, linestyle='--'))
axes[0].set_xlim(-wafer_radius-10, wafer_radius+10)
axes[0].set_ylim(-wafer_radius-10, wafer_radius+10)
axes[0].set_xlabel("Die X Position (mm)", fontsize=12)
axes[0].set_ylabel("Die Y Position (mm)", fontsize=12)
axes[0].set_title("Ground Truth (3 systematic + random)", fontsize=14, fontweight='bold')
axes[0].set_aspect('equal')
axes[0].grid(alpha=0.3)

# DBSCAN results
axes[1].scatter(defect_x, defect_y, c=defect_labels, cmap='viridis',
                s=80, alpha=0.7, edgecolors='k', linewidth=1)
axes[1].add_patch(plt.Circle((0, 0), wafer_radius, fill=False, edgecolor='gray', linewidth=2, linestyle='--'))
axes[1].set_xlim(-wafer_radius-10, wafer_radius+10)
axes[1].set_ylim(-wafer_radius-10, wafer_radius+10)
axes[1].set_xlabel("Die X Position (mm)", fontsize=12)
axes[1].set_ylabel("Die Y Position (mm)", fontsize=12)
axes[1].set_title(f"DBSCAN ({n_clusters_defects} clusters, {n_noise_defects} noise)", fontsize=14, fontweight='bold')
axes[1].set_aspect('equal')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Compare with K-Means (will force random failures into clusters)
from sklearn.cluster import KMeans
kmeans_defects = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_labels = kmeans_defects.fit_predict(X_defects)

print(f"\n🔍 DBSCAN vs K-Means Comparison:")
print(f"  • DBSCAN: {n_clusters_defects} clusters + {n_noise_defects} noise (ground truth: 3 clusters + {n_random} random)")
print(f"  • K-Means: 3 clusters + 0 noise (forces every random failure into a cluster)")
print(f"\n💰 Business Impact:")
print(f"  • DBSCAN labels {n_noise_defects} die as random failures (noise)")
print(f"  • K-Means assigns all random failures to systematic clusters")
print(f"  • False systematic classification → wasted FA effort: {n_noise_defects} × $50K = ${n_noise_defects*50:,}K")
print(f"  • DBSCAN prioritizes FA on {n_clusters_defects} systematic patterns")
print(f"  • Time savings: 30 min/wafer manual → 10 sec automated = 99.4% faster")
print(f"  • For 1000 wafers/day: 500 hours saved × $150/hour = $75K/day")

---

## 🎯 Real-World Projects (Not Exercises!)

### Post-Silicon Validation Projects

#### 1. 🏭 Real-Time Wafer Defect Pattern Analyzer ($5M+ yield recovery)
**Objective:** Cluster 200K+ defect die across 1000 wafers/day to detect systematic patterns within 15 minutes of test completion.

**Key Features:** Euclidean metric on planar die (x, y) coordinates, streaming DBSCAN for real-time updates, automated FA ticket generation for clusters >20 die

#### 2. ⚡ Parametric Outlier Detection System ($10M+ avoided failures)
**Objective:** Identify anomalous devices (noise=-1) from 100K test results to catch marginal parts before field deployment.

**Key Features:** 50D parametric space clustering, adaptive eps per parameter category, confidence scoring for outliers

#### 3. 🔍 Multi-Die Proximity Failure Analysis ($3M+ equipment savings)
**Objective:** Cluster spatially adjacent failing die to identify process tool-specific issues (equipment maintenance triggers).

**Key Features:** Spatial+temporal features, cluster stability tracking across lots, automated tool correlation

#### 4. 📊 Test Time Anomaly Clustering ($2M+ efficiency)
**Objective:** Identify devices with abnormal test times (too fast=skip, too slow=retest) for adaptive test flow optimization.

**Key Features:** 1D time clustering, dynamic eps per test category, real-time tester alerts

---

### General AI/ML Projects

#### 5. 🌆 Crime Hotspot Detection ($20M+ prevention)
**Objective:** Cluster 500K crime incidents (GPS coordinates) to allocate police resources to high-density areas.

**Key Features:** Haversine metric, temporal decay (recent crimes weighted higher), auto-update every 24 hours

#### 6. 🛡️ Network Intrusion Detection ($50M+ breach prevention)
**Objective:** Cluster normal network traffic patterns, label anomalies as attacks (noise=-1) for real-time security.

**Key Features:** High-dimensional packet features, streaming DBSCAN, <100ms latency requirement

#### 7. 🏥 Disease Outbreak Clustering ($100M+ healthcare savings)
**Objective:** Identify geographic disease clusters for targeted public health interventions.

**Key Features:** GPS+temporal features, varying density (urban vs rural), integration with CDC data

#### 8. 🛒 Customer Location-Based Segmentation ($15M+ targeted marketing)
**Objective:** Cluster customer addresses for geo-targeted campaigns, ignore isolated customers (noise).

**Key Features:** Haversine metric, cluster demographics profiling, campaign ROI tracking

---

## 🎓 Key Takeaways & Best Practices

### ✅ When to Use DBSCAN

1. **Arbitrary cluster shapes**: Non-spherical, non-convex patterns (K-Means fails)
2. **Unknown K**: Cluster count is not known in advance; DBSCAN discovers it automatically
3. **Outliers critical**: Need explicit noise detection (label=-1)
4. **Geospatial data**: GPS coordinates, wafer maps, sensor networks
5. **Roughly uniform density**: Mild density variation is tolerable as long as a single eps works across all clusters

**Post-Silicon:** Wafer defect clustering, spatial failure analysis, anomaly detection

### ❌ When NOT to Use DBSCAN

1. **Varying densities**: Dense + sparse clusters → single eps fails (use HDBSCAN)
2. **High dimensions (>20)**: Curse of dimensionality (distances become uniform)
3. **Large data + no spatial index**: >100K points naive = too slow
4. **Well-separated spherical clusters**: K-Means faster and simpler
5. **Incremental clustering**: DBSCAN requires full recomputation for new points
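
The varying-density limitation (item 1) is easy to reproduce. The sketch below uses assumed toy data, one dense and one sparse Gaussian blob, with `eps` tuned to the dense blob; the specific values are illustrative only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Assumed toy data: one dense blob and one sparse blob, 100 points each
dense = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(100, 2))
sparse = rng.normal(loc=[10.0, 10.0], scale=2.0, size=(100, 2))
X_mixed = np.vstack([dense, sparse])

# eps chosen to suit the dense blob only
labels_small_eps = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_mixed)

dense_noise = np.mean(labels_small_eps[:100] == -1)
sparse_noise = np.mean(labels_small_eps[100:] == -1)
print(f"Noise fraction, dense blob : {dense_noise:.2f}")
print(f"Noise fraction, sparse blob: {sparse_noise:.2f}")
```

With an eps that fits the dense blob, most of the sparse blob ends up labeled as noise. Raising eps to capture the sparse blob would instead risk merging nearby dense structures, which is the situation where a hierarchical variant such as HDBSCAN is usually preferred.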

### 🔧 Parameter Selection Best Practices

**Eps (ε):**
- k-distance plot: Look for elbow
- Domain knowledge: Wafer maps = 5-15mm (die spacing)
- Rule of thumb: 95th percentile of k-distances
- Sensitivity: Critical parameter, test multiple values
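
As a rough sketch of the k-distance heuristic above: the helper name `k_distances` and the random placeholder data are assumptions for illustration. The percentile values are only starting candidates to compare against the elbow, not a substitute for inspecting the plot.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distances(X, k):
    """Sorted distance from each point to its k-th nearest neighbor, counting itself."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nn.kneighbors(X)  # column 0 is normally the point itself (distance 0)
    return np.sort(distances[:, -1])

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))  # placeholder data; substitute your scaled features

kd = k_distances(X_demo, k=4)
eps_candidates = np.percentile(kd, [90, 95, 99])
print("Candidate eps values (90th/95th/99th percentile):", np.round(eps_candidates, 3))
```

Running DBSCAN once per candidate and comparing cluster counts and noise ratios is a cheap way to check how sensitive the result is to eps.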

**Min_samples:**
- 2D: 4-5
- High-D: 2×dimensions
- Trade-off: Higher = fewer larger clusters + more noise
- Less sensitive than eps

**Distance Metric:**
- **Euclidean**: Continuous features, general-purpose
- **Manhattan**: Grid spaces, sparse data
- **Haversine**: Geographic (lat/lon)
- **Cosine**: Text vectors, high-dim
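
For the Haversine case, scikit-learn expects `(latitude, longitude)` pairs in radians, and `eps` is then an angle in radians rather than a distance. The sketch below assumes a handful of made-up coordinates and a 1 km neighborhood; the Earth radius constant is the usual mean-radius approximation.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (approximation)

# Assumed sample coordinates in degrees, as (latitude, longitude)
coords_deg = np.array([
    [48.8566, 2.3522], [48.8570, 2.3530], [48.8560, 2.3510], [48.8575, 2.3540],      # Paris
    [51.5074, -0.1278], [51.5080, -0.1270], [51.5069, -0.1285], [51.5078, -0.1290],  # London
    [40.0000, -30.0000],                                                             # mid-Atlantic
])

eps_km = 1.0
geo_dbscan = DBSCAN(
    eps=eps_km / EARTH_RADIUS_KM,  # convert kilometers to radians
    min_samples=3,
    metric="haversine",
    algorithm="ball_tree",         # tree-based search that supports haversine
)
geo_labels = geo_dbscan.fit_predict(np.radians(coords_deg))
print(geo_labels)
```

Passing degrees instead of radians is a common mistake here: it runs without an error but produces meaningless neighborhoods.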

### 📊 DBSCAN vs K-Means vs Hierarchical

| **Use DBSCAN When...** | **Use K-Means When...** | **Use Hierarchical When...** |
|------------------------|------------------------|------------------------------|
| Arbitrary shapes | Spherical clusters | Taxonomy needed |
| Outliers critical | K known | Small data (<5K) |
| K unknown | Large data (100K+) | Dendrogram useful |
| Geospatial patterns | Fast inference needed | Deterministic tree |

### Production Checklist
- ✅ Scale features (DBSCAN distance-sensitive)
- ✅ Use k-distance plot for eps
- ✅ Test multiple eps/min_samples combinations
- ✅ Use sklearn with KD-tree (10K+ points)
- ✅ Monitor noise ratio (>30% = poor eps)
- ✅ Visualize clusters + noise for validation
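
A few of these checks can be wired together in a short sketch. The helper `noise_ratio`, the toy features, and the `eps`/`min_samples` values are assumptions for illustration; the 30% threshold is the one from the checklist.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def noise_ratio(labels):
    """Fraction of points labeled as noise (-1)."""
    labels = np.asarray(labels)
    return float(np.mean(labels == -1))

rng = np.random.default_rng(7)
# Assumed toy features on very different scales (position in mm, voltage in V)
position_mm = np.concatenate([rng.normal(20, 2, 150), rng.normal(80, 2, 150)])
voltage_v = np.concatenate([rng.normal(0.9, 0.01, 150), rng.normal(1.1, 0.01, 150)])
X_raw = np.column_stack([position_mm, voltage_v])

# Scale first so that neither feature dominates the distance computation
X_scaled_demo = StandardScaler().fit_transform(X_raw)
demo_labels = DBSCAN(eps=0.3, min_samples=5, algorithm="kd_tree").fit_predict(X_scaled_demo)

ratio = noise_ratio(demo_labels)
print(f"Noise ratio: {ratio:.1%}")
if ratio > 0.30:
    print("Warning: more than 30% noise, revisit eps/min_samples")
```

In a pipeline, the same ratio check can run after every fit so that a drifting eps shows up as an alert rather than as silently degraded clusters.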

---

## 🎉 Congratulations!

You've mastered DBSCAN - from core/border/noise classification to wafer defect clustering. You can now:
- ✅ Implement DBSCAN from scratch with BFS expansion
- ✅ Use k-distance plot for optimal eps selection
- ✅ Handle arbitrary shapes (moons, crescents, irregular patterns)
- ✅ Detect outliers automatically (noise=-1)
- ✅ Apply to wafer defect analysis and geospatial clustering
- ✅ Choose between DBSCAN, K-Means, Hierarchical based on data characteristics

**Next:** Explore Gaussian Mixture Models (Notebook 029) for probabilistic soft clustering!