# DBSCAN Clustering from Scratch

A from-scratch NumPy implementation of the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.

**Key Concepts:**
- Density-based clustering
- Core points, border points, and noise
- Arbitrary cluster shapes
- No need to specify number of clusters
- Outlier detection


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs, make_moons, make_circles
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, adjusted_rand_score
from collections import deque

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

print("=" * 80)
print("DBSCAN CLUSTERING FROM SCRATCH")
print("=" * 80)

## Mathematical Foundation

**DBSCAN Parameters:**
- $\epsilon$ (eps): Maximum distance between two points to be considered neighbors
- $MinPts$: Minimum number of points in an $\epsilon$-neighborhood, counting the point itself, for that point to be a core point

**Key Definitions:**

**1. $\epsilon$-neighborhood:**
$$N_{\epsilon}(p) = \{q \in D : dist(p, q) \leq \epsilon\}$$

**2. Core Point:**
- Point $p$ is a core point if $|N_{\epsilon}(p)| \geq MinPts$

**3. Border Point:**
- Point $p$ lies in the $\epsilon$-neighborhood of at least one core point
- But $|N_{\epsilon}(p)| < MinPts$, so $p$ is not a core point itself

**4. Noise Point:**
- A point that is neither a core point nor a border point

**5. Directly Density-Reachable:**
- Point $q$ is directly density-reachable from $p$ if:
  - $p$ is a core point
  - $q \in N_{\epsilon}(p)$

**6. Density-Reachable:**
- Point $q$ is density-reachable from $p$ if there exists a chain of points $p_1, \ldots, p_n$ where:
  - $p_1 = p$ and $p_n = q$
  - $p_{i+1}$ is directly density-reachable from $p_i$ for every $i = 1, \ldots, n-1$

**7. Density-Connected:**
- Points $p$ and $q$ are density-connected if there exists a point $o$ such that:
  - Both $p$ and $q$ are density-reachable from $o$
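
The definitions above can be checked on a tiny dataset. The sketch below is only an illustration, not the implementation used later in this notebook: `classify_points` is a hypothetical helper that builds every $\epsilon$-neighborhood with Euclidean distance (counting the point itself) and then applies the core, border, and noise rules.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border', or 'noise' (hypothetical helper)."""
    n = len(X)
    # N_eps(p) for every point p; p is always a member of its own neighborhood
    neighborhoods = [
        np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps) for i in range(n)
    ]
    is_core = np.array([len(nb) >= min_pts for nb in neighborhoods])

    kinds = []
    for i in range(n):
        if is_core[i]:
            kinds.append('core')
        elif is_core[neighborhoods[i]].any():
            # Not core, but inside the eps-neighborhood of some core point
            kinds.append('border')
        else:
            kinds.append('noise')
    return kinds

# Four points on a line plus one far away
X_demo = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [1.9, 0.0], [10.0, 10.0]])
kinds_demo = classify_points(X_demo, eps=1.0, min_pts=3)
print(kinds_demo)  # ['core', 'core', 'core', 'border', 'noise']
```

The fourth point has only two points in its neighborhood (itself and the third point), so it is not core, but it is reachable from the third point and therefore a border point.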

## DBSCAN Implementation

In [None]:
class DBSCANScratch:
    """
    DBSCAN clustering implemented from scratch.
    
    Parameters:
    -----------
    eps : float, default=0.5
        Maximum distance between two points for them to be neighbors
    min_samples : int, default=5
        Minimum number of points within eps (including the point itself)
        for a point to be a core point
    metric : str, default='euclidean'
        Distance metric to use: 'euclidean' or 'manhattan'
    """
    
    def __init__(self, eps=0.5, min_samples=5, metric='euclidean'):
        self.eps = eps
        self.min_samples = min_samples
        self.metric = metric
        
        # Results
        self.labels_ = None
        self.core_sample_indices_ = None
        self.components_ = None
        
        # Statistics
        self.n_clusters_ = 0
        self.n_noise_ = 0
        
    def _euclidean_distance(self, x1, x2):
        """
        Calculate Euclidean distance between two points.
        """
        return np.sqrt(np.sum((x1 - x2) ** 2))
    
    def _manhattan_distance(self, x1, x2):
        """
        Calculate Manhattan distance between two points.
        """
        return np.sum(np.abs(x1 - x2))
    
    def _distance(self, x1, x2):
        """
        Calculate distance between two points based on metric.
        """
        if self.metric == 'euclidean':
            return self._euclidean_distance(x1, x2)
        elif self.metric == 'manhattan':
            return self._manhattan_distance(x1, x2)
        else:
            raise ValueError(f"Unknown metric: {self.metric}")
    
    def _get_neighbors(self, X, point_idx):
        """
        Find all neighbors within eps distance of a point.
        
        Returns:
        --------
        neighbors : list
            Indices of neighboring points
        """
        neighbors = []
        for idx, point in enumerate(X):
            if self._distance(X[point_idx], point) <= self.eps:
                neighbors.append(idx)
        return neighbors
    
    def _expand_cluster(self, X, labels, point_idx, neighbors, cluster_id):
        """
        Expand cluster by adding density-reachable points.
        
        Uses breadth-first search to find all density-reachable points.
        
        Returns:
        --------
        core_points : list
            Indices of core points found during expansion (seed excluded)
        """
        # Assign the seed point to the cluster
        labels[point_idx] = cluster_id
        core_points = []
        
        # Use queue for BFS
        queue = deque(neighbors)
        
        while queue:
            current_point = queue.popleft()
            
            # A former noise point becomes a border point of this cluster.
            # It is not a core point, so its neighbors are not expanded.
            if labels[current_point] == -1:
                labels[current_point] = cluster_id
                continue
            
            # If already assigned to a cluster, skip
            if labels[current_point] != 0:
                continue
            
            # Assign to cluster
            labels[current_point] = cluster_id
            
            # Find neighbors of current point
            current_neighbors = self._get_neighbors(X, current_point)
            
            # If it's a core point, record it and add its neighbors to queue
            if len(current_neighbors) >= self.min_samples:
                core_points.append(current_point)
                for neighbor in current_neighbors:
                    # Enqueue unvisited points (0) and noise points (-1);
                    # the latter are border points of this cluster
                    if labels[neighbor] <= 0:
                        queue.append(neighbor)
        
        return core_points
    
    def fit(self, X):
        """
        Perform DBSCAN clustering.
        
        Parameters:
        -----------
        X : array-like, shape (n_samples, n_features)
            Training data
        """
        X = np.asarray(X)
        n_samples = X.shape[0]
        
        # Initialize all points as unvisited (0)
        # -1 will represent noise
        # Positive integers represent cluster IDs
        labels = np.zeros(n_samples, dtype=int)
        
        cluster_id = 0
        core_samples = []
        
        # Process each point
        for point_idx in range(n_samples):
            # Skip if already processed
            if labels[point_idx] != 0:
                continue
            
            # Find neighbors
            neighbors = self._get_neighbors(X, point_idx)
            
            # Check if it's a core point
            if len(neighbors) < self.min_samples:
                # Mark as noise (for now)
                labels[point_idx] = -1
            else:
                # Start new cluster; the seed and every core point reached
                # during expansion are recorded as core samples
                cluster_id += 1
                core_samples.append(point_idx)
                core_samples.extend(
                    self._expand_cluster(X, labels, point_idx, neighbors, cluster_id)
                )
        
        # Store results
        self.labels_ = labels
        self.core_sample_indices_ = np.array(sorted(core_samples), dtype=int)
        self.n_clusters_ = cluster_id
        self.n_noise_ = int(np.sum(labels == -1))
        
        # Store core samples
        if len(core_samples) > 0:
            self.components_ = X[self.core_sample_indices_]
        
        return self
    
    def fit_predict(self, X):
        """
        Fit and return cluster labels.
        """
        self.fit(X)
        return self.labels_
    
    def get_cluster_info(self):
        """
        Get information about clusters.
        """
        info = {
            'n_clusters': self.n_clusters_,
            'n_noise': self.n_noise_,
            'n_core_samples': len(self.core_sample_indices_) if self.core_sample_indices_ is not None else 0
        }
        
        if self.labels_ is not None:
            for cluster_id in range(1, self.n_clusters_ + 1):
                cluster_size = np.sum(self.labels_ == cluster_id)
                info[f'cluster_{cluster_id}_size'] = cluster_size
        
        return info

print("\n✓ DBSCANScratch class defined")
print("  - Density-based clustering")
print("  - Finds arbitrary-shaped clusters")
print("  - Identifies noise/outliers")
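
`_get_neighbors` scans every point in a Python loop, which is easy to read but slow for larger datasets. As a rough sketch of an alternative, and not part of `DBSCANScratch`, the hypothetical helper `eps_neighborhoods` below computes all pairwise distances at once with NumPy broadcasting. It assumes the full `(n, n, d)` difference array fits in memory.

```python
import numpy as np

def eps_neighborhoods(X, eps, metric='euclidean'):
    """Return one index array per point: its eps-neighborhood (hypothetical helper)."""
    diff = X[:, None, :] - X[None, :, :]      # shape (n, n, d)
    if metric == 'euclidean':
        dist = np.sqrt((diff ** 2).sum(axis=-1))
    elif metric == 'manhattan':
        dist = np.abs(diff).sum(axis=-1)
    else:
        raise ValueError(f"Unknown metric: {metric}")
    # Each row of the boolean matrix marks the neighbors of one point
    return [np.flatnonzero(row) for row in dist <= eps]

X_demo = np.array([[0.0, 0.0], [0.3, 0.4], [3.0, 4.0]])
for i, nb in enumerate(eps_neighborhoods(X_demo, eps=0.6)):
    print(i, nb.tolist())
```

With `eps=0.6` the first two points (Euclidean distance 0.5) are neighbors of each other, while the third point only has itself. Under the Manhattan metric the same pair is 0.7 apart, so every point is alone.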

## Example 1: Simple 2D Data with Clear Clusters

In [None]:
# Generate synthetic data
X_blob, y_true = make_blobs(
    n_samples=300,
    n_features=2,
    centers=3,
    cluster_std=0.5,
    random_state=42
)

# Add some noise points
np.random.seed(42)
noise = np.random.uniform(X_blob.min(), X_blob.max(), size=(20, 2))
X_blob = np.vstack([X_blob, noise])
y_true = np.hstack([y_true, -np.ones(20)])

print("Synthetic Dataset with Noise:")
print(f"  Total samples: {len(X_blob)}")
print(f"  Features: {X_blob.shape[1]}")
print(f"  True clusters: 3")
print(f"  Added noise points: 20")

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_blob[:, 0], X_blob[:, 1], c=y_true, cmap='viridis',
           s=50, alpha=0.6, edgecolors='black')
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Data with Noise Points', fontsize=14, fontweight='bold')
plt.colorbar(label='True Cluster')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Fit DBSCAN
dbscan = DBSCANScratch(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_blob)

# Get cluster info
info = dbscan.get_cluster_info()

print("\nDBSCAN Results:")
print(f"  Number of clusters: {info['n_clusters']}")
print(f"  Number of noise points: {info['n_noise']}")
print(f"  Number of core samples: {info['n_core_samples']}")
print(f"\nCluster sizes:")
for cluster_id in range(1, info['n_clusters'] + 1):
    print(f"  Cluster {cluster_id}: {info[f'cluster_{cluster_id}_size']} points")

In [None]:
# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# True labels
axes[0].scatter(X_blob[:, 0], X_blob[:, 1], c=y_true, cmap='viridis',
               s=50, alpha=0.6, edgecolors='black')
axes[0].set_xlabel('Feature 1', fontsize=11)
axes[0].set_ylabel('Feature 2', fontsize=11)
axes[0].set_title('True Labels', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# DBSCAN results
# Use a different color for noise points:
# row 0 is red (RGBA), rows 1..n_clusters are viridis colors
colors = np.vstack([[1.0, 0.0, 0.0, 1.0],
                    plt.cm.viridis(np.linspace(0, 1, dbscan.n_clusters_))])
point_colors = [colors[label] if label >= 0 else colors[0] for label in labels]

axes[1].scatter(X_blob[:, 0], X_blob[:, 1], c=point_colors,
               s=50, alpha=0.6, edgecolors='black')

# Highlight core samples
if len(dbscan.core_sample_indices_) > 0:
    axes[1].scatter(X_blob[dbscan.core_sample_indices_, 0],
                   X_blob[dbscan.core_sample_indices_, 1],
                   s=100, facecolors='none', edgecolors='blue',
                   linewidth=2, label='Core samples')

axes[1].set_xlabel('Feature 1', fontsize=11)
axes[1].set_ylabel('Feature 2', fontsize=11)
axes[1].set_title(f'DBSCAN (eps={dbscan.eps}, min_samples={dbscan.min_samples})',
                 fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✓ Red points are noise")
print("✓ Blue circles indicate core samples")

## Example 2: Non-Convex Clusters (Moons Dataset)

In [None]:
# Generate moons dataset
X_moons, y_moons_true = make_moons(n_samples=300, noise=0.05, random_state=42)

print("\nMoons Dataset:")
print(f"  Samples: {len(X_moons)}")
print(f"  Features: {X_moons.shape[1]}")
print(f"  True clusters: 2 (non-convex)")

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons_true, cmap='viridis',
           s=50, alpha=0.6, edgecolors='black')
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Non-Convex Moons Dataset', fontsize=14, fontweight='bold')
plt.colorbar(label='True Cluster')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Fit DBSCAN
dbscan_moons = DBSCANScratch(eps=0.2, min_samples=5)
labels_moons = dbscan_moons.fit_predict(X_moons)

info_moons = dbscan_moons.get_cluster_info()

print("\nDBSCAN on Moons:")
print(f"  Number of clusters: {info_moons['n_clusters']}")
print(f"  Number of noise points: {info_moons['n_noise']}")

# Compare with K-means (for contrast)
from sklearn.cluster import KMeans
kmeans_moons = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_kmeans_moons = kmeans_moons.fit_predict(X_moons)

In [None]:
# Compare DBSCAN vs K-means on non-convex data
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# True labels
axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons_true, cmap='viridis',
               s=50, alpha=0.6, edgecolors='black')
axes[0].set_xlabel('Feature 1', fontsize=11)
axes[0].set_ylabel('Feature 2', fontsize=11)
axes[0].set_title('True Labels', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# K-means
axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=labels_kmeans_moons, cmap='viridis',
               s=50, alpha=0.6, edgecolors='black')
axes[1].scatter(kmeans_moons.cluster_centers_[:, 0],
               kmeans_moons.cluster_centers_[:, 1],
               c='red', s=200, marker='X', edgecolors='black', linewidth=2)
axes[1].set_xlabel('Feature 1', fontsize=11)
axes[1].set_ylabel('Feature 2', fontsize=11)
axes[1].set_title('K-Means (Fails on Non-Convex)', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3)

# DBSCAN
# Row 0 is red (RGBA) for noise, rows 1..n_clusters are viridis colors
colors = np.vstack([[1.0, 0.0, 0.0, 1.0],
                    plt.cm.viridis(np.linspace(0, 1, dbscan_moons.n_clusters_))])
point_colors = [colors[label] if label >= 0 else colors[0] for label in labels_moons]

axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=point_colors,
               s=50, alpha=0.6, edgecolors='black')
axes[2].set_xlabel('Feature 1', fontsize=11)
axes[2].set_ylabel('Feature 2', fontsize=11)
axes[2].set_title('DBSCAN (Handles Non-Convex)', fontsize=12, fontweight='bold')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✓ DBSCAN successfully identifies non-convex clusters")
print("✓ K-means struggles with non-spherical shapes")

## Example 3: Nested Circles

In [None]:
# Generate circles dataset
X_circles, y_circles_true = make_circles(
    n_samples=300, factor=0.5, noise=0.05, random_state=42
)

print("\nNested Circles Dataset:")
print(f"  Samples: {len(X_circles)}")
print(f"  Features: {X_circles.shape[1]}")
print(f"  True clusters: 2 (nested)")

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles_true, cmap='viridis',
           s=50, alpha=0.6, edgecolors='black')
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Nested Circles Dataset', fontsize=14, fontweight='bold')
plt.colorbar(label='True Cluster')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Fit DBSCAN
dbscan_circles = DBSCANScratch(eps=0.15, min_samples=5)
labels_circles = dbscan_circles.fit_predict(X_circles)

info_circles = dbscan_circles.get_cluster_info()

print("\nDBSCAN on Circles:")
print(f"  Number of clusters: {info_circles['n_clusters']}")
print(f"  Number of noise points: {info_circles['n_noise']}")

# Visualize
plt.figure(figsize=(10, 6))
# Row 0 is red (RGBA) for noise, rows 1..n_clusters are viridis colors
colors = np.vstack([[1.0, 0.0, 0.0, 1.0],
                    plt.cm.viridis(np.linspace(0, 1, dbscan_circles.n_clusters_))])
point_colors = [colors[label] if label >= 0 else colors[0] for label in labels_circles]

plt.scatter(X_circles[:, 0], X_circles[:, 1], c=point_colors,
           s=50, alpha=0.6, edgecolors='black')
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title(f'DBSCAN on Nested Circles (Found {dbscan_circles.n_clusters_} clusters)',
         fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Parameter Sensitivity Analysis

In [None]:
print("\n" + "=" * 80)
print("PARAMETER SENSITIVITY ANALYSIS")
print("=" * 80)

# Test different eps values
eps_values = [0.1, 0.3, 0.5, 0.7, 0.9]
results_eps = []

print("\nVarying eps (min_samples=5):")
for eps in eps_values:
    dbscan_test = DBSCANScratch(eps=eps, min_samples=5)
    dbscan_test.fit(X_blob)
    info = dbscan_test.get_cluster_info()
    
    results_eps.append({
        'eps': eps,
        'n_clusters': info['n_clusters'],
        'n_noise': info['n_noise']
    })
    
    print(f"  eps={eps:.1f} | Clusters: {info['n_clusters']} | Noise: {info['n_noise']}")

# Test different min_samples values
min_samples_values = [3, 5, 10, 15, 20]
results_min_samples = []

print("\nVarying min_samples (eps=0.5):")
for min_samp in min_samples_values:
    dbscan_test = DBSCANScratch(eps=0.5, min_samples=min_samp)
    dbscan_test.fit(X_blob)
    info = dbscan_test.get_cluster_info()
    
    results_min_samples.append({
        'min_samples': min_samp,
        'n_clusters': info['n_clusters'],
        'n_noise': info['n_noise']
    })
    
    print(f"  min_samples={min_samp:2d} | Clusters: {info['n_clusters']} | Noise: {info['n_noise']}")

In [None]:
# Visualize parameter effects
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Effect of eps
eps_df = pd.DataFrame(results_eps)
ax1 = axes[0]
ax1.plot(eps_df['eps'], eps_df['n_clusters'], marker='o', linewidth=2,
        markersize=8, label='Number of clusters')
ax1.set_xlabel('eps (neighborhood radius)', fontsize=12)
ax1.set_ylabel('Number of Clusters', fontsize=12, color='C0')
ax1.tick_params(axis='y', labelcolor='C0')
ax1.grid(True, alpha=0.3)

ax1_twin = ax1.twinx()
ax1_twin.plot(eps_df['eps'], eps_df['n_noise'], marker='s', linewidth=2,
             markersize=8, color='red', label='Noise points')
ax1_twin.set_ylabel('Number of Noise Points', fontsize=12, color='red')
ax1_twin.tick_params(axis='y', labelcolor='red')

ax1.set_title('Effect of eps Parameter', fontsize=14, fontweight='bold')
ax1.legend(loc='upper left')
ax1_twin.legend(loc='upper right')

# Effect of min_samples
min_samp_df = pd.DataFrame(results_min_samples)
ax2 = axes[1]
ax2.plot(min_samp_df['min_samples'], min_samp_df['n_clusters'], marker='o',
        linewidth=2, markersize=8, label='Number of clusters')
ax2.set_xlabel('min_samples (core point threshold)', fontsize=12)
ax2.set_ylabel('Number of Clusters', fontsize=12, color='C0')
ax2.tick_params(axis='y', labelcolor='C0')
ax2.grid(True, alpha=0.3)

ax2_twin = ax2.twinx()
ax2_twin.plot(min_samp_df['min_samples'], min_samp_df['n_noise'], marker='s',
             linewidth=2, markersize=8, color='red', label='Noise points')
ax2_twin.set_ylabel('Number of Noise Points', fontsize=12, color='red')
ax2_twin.tick_params(axis='y', labelcolor='red')

ax2.set_title('Effect of min_samples Parameter', fontsize=14, fontweight='bold')
ax2.legend(loc='upper left')
ax2_twin.legend(loc='upper right')

plt.tight_layout()
plt.show()

print("\n✓ Larger eps → fewer, larger clusters")
print("✓ Larger min_samples → more points classified as noise")
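
The sweeps above show how sensitive the result is to `eps`. A commonly used heuristic for picking a starting value is the sorted k-distance curve: for each point, take the distance to its k-th nearest point and look for a knee in the sorted values. The sketch below is one way to compute it. `k_distances` is a hypothetical helper, and it assumes the convention used in this notebook that a neighborhood includes the point itself.

```python
import numpy as np

def k_distances(X, min_samples):
    """Sorted distance from each point to its (min_samples - 1)-th nearest other point."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting a row, column 0 is the point itself (distance 0), so column
    # min_samples - 1 is the smallest eps for which that point is a core point
    kth = np.sort(dist, axis=1)[:, min_samples - 1]
    return np.sort(kth)

rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 0.3, size=(100, 2)),   # one dense blob
    rng.uniform(-5, 5, size=(10, 2)),      # sparse background points
])
kd = k_distances(X_demo, min_samples=5)
print(f"median k-distance: {np.median(kd):.3f}, max: {kd.max():.3f}")
# To inspect the knee visually: plt.plot(kd); plt.ylabel('k-distance')
```

Reading the knee off the curve is still a judgment call. Treat the value as a starting point for a sweep like the one above rather than a final answer.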

## Visualize Different Parameter Combinations

In [None]:
# Show effect of different parameter combinations
param_combinations = [
    (0.3, 5),
    (0.5, 5),
    (0.7, 5),
    (0.5, 3),
    (0.5, 10),
    (0.5, 15)
]

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, (eps, min_samp) in enumerate(param_combinations):
    dbscan_test = DBSCANScratch(eps=eps, min_samples=min_samp)
    labels_test = dbscan_test.fit_predict(X_blob)
    
    # Row 0 is red (RGBA) for noise, rows 1..n_clusters are viridis colors
    colors = np.vstack([[1.0, 0.0, 0.0, 1.0],
                        plt.cm.viridis(np.linspace(0, 1, dbscan_test.n_clusters_))])
    point_colors = [colors[label] if label >= 0 else colors[0] for label in labels_test]
    
    axes[idx].scatter(X_blob[:, 0], X_blob[:, 1], c=point_colors,
                     s=50, alpha=0.6, edgecolors='black')
    axes[idx].set_xlabel('Feature 1', fontsize=10)
    axes[idx].set_ylabel('Feature 2', fontsize=10)
    axes[idx].set_title(f'eps={eps}, min_samples={min_samp}\n'
                       f'Clusters: {dbscan_test.n_clusters_}, Noise: {dbscan_test.n_noise_}',
                       fontsize=11, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('DBSCAN with Different Parameters', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

## Outlier Detection

In [None]:
print("\n" + "=" * 80)
print("OUTLIER DETECTION")
print("=" * 80)

# Generate data with clear outliers
X_outlier, _ = make_blobs(n_samples=200, n_features=2, centers=2,
                         cluster_std=0.5, random_state=42)

# Add outliers
np.random.seed(42)
outliers = np.random.uniform(X_outlier.min() - 2, X_outlier.max() + 2, size=(30, 2))
X_outlier = np.vstack([X_outlier, outliers])

print(f"\nData: 200 normal points + 30 outliers")

# Fit DBSCAN
dbscan_outlier = DBSCANScratch(eps=0.5, min_samples=5)
labels_outlier = dbscan_outlier.fit_predict(X_outlier)

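# Identify outliers
is_outlier = labels_outlier == -1
n_outliers_detected = np.sum(is_outlier)

# The injected outliers are the last 30 rows of X_outlier
n_injected_flagged = np.sum(is_outlier[-30:])

print(f"\nDBSCAN flagged {n_outliers_detected} points as noise")
print(f"Injected outliers flagged: {n_injected_flagged}/30 "
      f"({n_injected_flagged / 30 * 100:.1f}%)")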
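
A raw count of noise points does not show whether the flagged points are the injected ones: some injected points can land inside a blob, and sparse blob points can be flagged as noise. When the injected rows are known, precision and recall give a clearer picture. The helper below is a small sketch with hypothetical names and toy labels, not results from the data above.

```python
import numpy as np

def noise_precision_recall(labels, is_injected):
    """Precision and recall of the noise label (-1) against known injected outliers."""
    flagged = np.asarray(labels) == -1
    is_injected = np.asarray(is_injected, dtype=bool)
    true_pos = np.sum(flagged & is_injected)
    # Guard against division by zero when nothing is flagged or injected
    precision = true_pos / flagged.sum() if flagged.any() else 0.0
    recall = true_pos / is_injected.sum() if is_injected.any() else 0.0
    return float(precision), float(recall)

# Toy example: 3 points flagged as noise, 3 injected, 2 in common
labels_demo = np.array([1, 1, -1, 2, -1, -1, 2, 1])
injected_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0], dtype=bool)
precision, recall = noise_precision_recall(labels_demo, injected_demo)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.67, recall=0.67
```

For the dataset above, the mask would mark the last 30 rows of `X_outlier`, since that is where the injected points were stacked.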

In [None]:
# Visualize outlier detection
fig, ax = plt.subplots(figsize=(10, 8))

# Plot normal points
normal_mask = ~is_outlier
ax.scatter(X_outlier[normal_mask, 0], X_outlier[normal_mask, 1],
          c=labels_outlier[normal_mask], cmap='viridis',
          s=50, alpha=0.6, edgecolors='black', label='Normal points')

# Plot outliers
ax.scatter(X_outlier[is_outlier, 0], X_outlier[is_outlier, 1],
          c='red', s=100, marker='x', linewidth=2,
          label=f'Outliers ({n_outliers_detected})')

ax.set_xlabel('Feature 1', fontsize=12)
ax.set_ylabel('Feature 2', fontsize=12)
ax.set_title('DBSCAN for Outlier Detection', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n✓ DBSCAN naturally identifies outliers as noise points")

## Comparison: DBSCAN vs K-Means vs GMM

In [None]:
print("\n" + "=" * 80)
print("ALGORITHM COMPARISON")
print("=" * 80)

# Use moons dataset for comparison
from sklearn.mixture import GaussianMixture

# Fit all three algorithms
kmeans_compare = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_kmeans_compare = kmeans_compare.fit_predict(X_moons)

gmm_compare = GaussianMixture(n_components=2, random_state=42)
labels_gmm_compare = gmm_compare.fit_predict(X_moons)

dbscan_compare = DBSCANScratch(eps=0.2, min_samples=5)
labels_dbscan_compare = dbscan_compare.fit_predict(X_moons)

# Visualize comparison
fig, axes = plt.subplots(1, 4, figsize=(20, 5))

# True labels
axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons_true, cmap='viridis',
               s=50, alpha=0.6, edgecolors='black')
axes[0].set_xlabel('Feature 1', fontsize=10)
axes[0].set_ylabel('Feature 2', fontsize=10)
axes[0].set_title('True Labels', fontsize=11, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# K-means
axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=labels_kmeans_compare, cmap='viridis',
               s=50, alpha=0.6, edgecolors='black')
axes[1].set_xlabel('Feature 1', fontsize=10)
axes[1].set_ylabel('Feature 2', fontsize=10)
axes[1].set_title('K-Means\n(Spherical clusters)', fontsize=11, fontweight='bold')
axes[1].grid(True, alpha=0.3)

# GMM
axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=labels_gmm_compare, cmap='viridis',
               s=50, alpha=0.6, edgecolors='black')
axes[2].set_xlabel('Feature 1', fontsize=10)
axes[2].set_ylabel('Feature 2', fontsize=10)
axes[2].set_title('GMM\n(Elliptical clusters)', fontsize=11, fontweight='bold')
axes[2].grid(True, alpha=0.3)

# DBSCAN
# Row 0 is red (RGBA) for noise, rows 1..n_clusters are viridis colors
colors = np.vstack([[1.0, 0.0, 0.0, 1.0],
                    plt.cm.viridis(np.linspace(0, 1, dbscan_compare.n_clusters_))])
point_colors = [colors[label] if label >= 0 else colors[0] for label in labels_dbscan_compare]
axes[3].scatter(X_moons[:, 0], X_moons[:, 1], c=point_colors,
               s=50, alpha=0.6, edgecolors='black')
axes[3].set_xlabel('Feature 1', fontsize=10)
axes[3].set_ylabel('Feature 2', fontsize=10)
axes[3].set_title('DBSCAN\n(Arbitrary shapes)', fontsize=11, fontweight='bold')
axes[3].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Summary

**DBSCAN (Density-Based Spatial Clustering of Applications with Noise):**

1. **Core Concepts**:
   - Core points: ≥ min_samples points within eps (counting the point itself)
   - Border points: In neighborhood of core point but not core itself
   - Noise points: Neither core nor border

2. **Algorithm**:
   - Start with arbitrary unvisited point
   - If core point, expand cluster via density-reachability
   - If not core point, mark as noise (may become border later)
   - Repeat for all unvisited points

3. **Parameters**:
   - **eps**: Neighborhood radius
     - Larger → fewer, larger clusters
     - Smaller → more clusters, more noise
   - **min_samples**: Core point threshold
     - Larger → stricter definition of cluster, more noise
     - Smaller → more permissive, fewer noise points

4. **Advantages**:
   - Finds arbitrary-shaped clusters
   - No need to specify number of clusters
   - Robust to outliers (identifies them as noise)
   - Core-point cluster assignments do not depend on the order in which points are visited

5. **Limitations**:
   - Sensitive to parameters (eps, min_samples)
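   - Struggles when clusters have very different densities, because a single eps applies to all of them
   - Degrades in high dimensions, where distances become less discriminative (curse of dimensionality)
   - O(n²) distance computations in this brute-force implementation (can be improved with spatial indexing)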

6. **When to Use DBSCAN**:
   - Unknown number of clusters
   - Non-convex cluster shapes
   - Presence of noise/outliers
   - Need outlier detection
   - Spatial data analysis

7. **Comparison with Other Methods**:
   - **vs K-Means**: Better for non-spherical clusters, handles noise
   - **vs GMM**: No probabilistic model or soft assignments, but no assumption of elliptical clusters
   - **vs Hierarchical**: Usually cheaper to run, but doesn't provide a dendrogram
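
The spatial indexing mentioned under Limitations can be sketched with a uniform grid whose cell width equals eps: any point within eps of a query must lie in the query's own cell or an adjacent one. The code below is a minimal illustration under that assumption, using Euclidean distance. `build_grid` and `grid_neighbors` are hypothetical helpers, not part of `DBSCANScratch`.

```python
import numpy as np
from collections import defaultdict
from itertools import product

def build_grid(X, eps):
    """Map each grid cell (tuple of ints) to the indices of the points inside it."""
    grid = defaultdict(list)
    cells = np.floor(X / eps).astype(int)
    for idx, cell in enumerate(cells):
        grid[tuple(cell)].append(idx)
    return grid, cells

def grid_neighbors(X, grid, cells, point_idx, eps):
    """Indices within eps of X[point_idx], checking only the 3^d adjacent cells."""
    candidates = []
    for offset in product((-1, 0, 1), repeat=X.shape[1]):
        candidates.extend(grid.get(tuple(cells[point_idx] + offset), []))
    candidates = np.array(candidates)
    dist = np.linalg.norm(X[candidates] - X[point_idx], axis=1)
    return np.sort(candidates[dist <= eps])

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 2))
grid, cells = build_grid(X_demo, eps=0.5)

# Compare against a brute-force scan for one query point
fast = grid_neighbors(X_demo, grid, cells, 0, eps=0.5)
brute = np.flatnonzero(np.linalg.norm(X_demo - X_demo[0], axis=1) <= 0.5)
print(np.array_equal(fast, brute))
```

Each query now touches only the points in nearby cells instead of all n points. The number of cells to visit grows as 3^d, so this approach is only attractive in low dimensions. Tree-based indexes are the usual choice beyond that.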
