In [None]:
%store -r

print("Project configuration:")
print(f"SLUG = {SLUG}")
print(f"DATA_DIR = {DATA_DIR}")
print(f"DATASET_KEY = {DATASET_KEY}")
print(f"FIG_DIR = {FIG_DIR}")
print(f"REP_DIR = {REP_DIR}")
print(f"NOTEBOOK_DIR = {NOTEBOOK_DIR}")

missing_vars = [var for var in ['SLUG', 'DATA_DIR', 'FIG_DIR', 'REP_DIR', 'NOTEBOOK_DIR', 'DATASET_KEY'] if var not in globals()]
print(f"Vars not found in globals: {missing_vars}")

# Set default values if variables are not found in store or are empty
if missing_vars or not SLUG:  # Missing from the store, or an empty string
    print("Stored configuration is missing or empty, initializing everything explicitly")
    from pathlib import Path  # Needed here because the imports cell runs later
    SLUG = 'customer-segmentation'
    DATASET_KEY = 'vjchoudhary7/customer-segmentation-tutorial-in-python'
    GIT_ROOT = Path.cwd().parent.parent
    DATA_DIR = GIT_ROOT / 'data' / SLUG
    FIG_DIR = GIT_ROOT / 'figures' / SLUG
    REP_DIR = GIT_ROOT / 'reports' / SLUG
    NOTEBOOK_DIR = GIT_ROOT / 'notebooks' / SLUG


Project configuration:
SLUG = customer-segmentation
DATA_DIR = /Users/ravisharma/workdir/eda_practice/data/customer-segmentation
DATASET_KEY = vjchoudhary7/customer-segmentation-tutorial-in-python
FIG_DIR = /Users/ravisharma/workdir/eda_practice/figures/customer-segmentation
REP_DIR = /Users/ravisharma/workdir/eda_practice/reports/customer-segmentation
NOTEBOOK_DIR = /Users/ravisharma/workdir/eda_practice/notebooks/customer-segmentation
Vars not found in globals: []


In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from IPython.display import display

In [None]:
# Load the dataset

base_df = pd.DataFrame()

CSV_PATH = Path(DATA_DIR) / "Mall_Customers.csv"
if not CSV_PATH.exists():  # exists is a method; without the call the check is never true
    print(f"CSV {CSV_PATH} does not exist. base_df will remain empty.")
else:
    base_df = pd.read_csv(CSV_PATH)
    print(f"CSV {CSV_PATH} loaded successfully.")

base_df.head()

CSV /Users/ravisharma/workdir/eda_practice/data/customer-segmentation/Mall_Customers.csv loaded successfully.


Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


In [None]:
base_df.describe()

Unnamed: 0,CustomerID,Age,Annual Income (k$),Spending Score (1-100)
count,200.0,200.0,200.0,200.0
mean,100.5,38.85,60.56,50.2
std,57.879185,13.969007,26.264721,25.823522
min,1.0,18.0,15.0,1.0
25%,50.75,28.75,41.5,34.75
50%,100.5,36.0,61.5,50.0
75%,150.25,49.0,78.0,73.0
max,200.0,70.0,137.0,99.0


## 📊 DBSCAN: Density-Based Clustering for Outlier Detection

### 📋 Code Breakdown
```python
# DBSCAN
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=2, metric='euclidean')
outliers = dbscan.fit_predict(base_df[['Age']])
print(outliers)
```

**Line-by-line explanation:**
1. **Import DBSCAN** from sklearn's clustering module (a clustering algorithm that can also be used for outlier detection)
2. **Create DBSCAN instance** with `eps=0.5`, `min_samples=2`, and Euclidean distance. Ages are whole numbers, so with `eps=0.5` only customers with identical ages count as neighbors
3. **Fit and predict** on the Age column (returns cluster labels: 0, 1, 2... for clusters, -1 for outliers)
4. **Print cluster labels**, where -1 indicates noise/outliers
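
To make the effect of `eps=0.5` on whole-number ages concrete, here is a minimal sketch on a hypothetical list of ages (not the mall dataset). With that radius only identical ages are neighbors, so each repeated age becomes its own cluster and each unique age is labeled `-1`. The second call uses `eps=1.5`, an illustrative value chosen only to show adjacent ages linking up.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical ages for illustration; not taken from Mall_Customers.csv
ages = np.array([[19], [21], [20], [23], [31], [21], [19], [40]])

# eps=0.5: integer ages are at least 1 apart, so only duplicates are neighbors
tight_labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(ages)
print(tight_labels)

# eps=1.5: ages that differ by 1 now link into a single cluster
loose_labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(ages)
print(loose_labels)
```

In the first run the two 19s and the two 21s form separate clusters; in the second, 19, 20, and 21 merge while 23, 31, and 40 stay noise.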

### 📚 Essential Documentation & Resources

#### **Official Documentation:**
- **[Scikit-learn DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)** - Official API reference
- **[Scikit-learn Clustering Guide](https://scikit-learn.org/stable/modules/clustering.html#dbscan)** - Comprehensive clustering overview
- **[Original Paper: "A Density-Based Algorithm for Discovering Clusters" by Ester et al. (1996)](https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf)** - Foundational research paper

#### **Helpful Blogs & Tutorials:**
- **[Towards Data Science: DBSCAN Explained](https://towardsdatascience.com/dbscan-algorithm-complete-guide-and-application-with-python-scikit-learn-d690cbae4c5d)**
- **[Machine Learning Mastery: DBSCAN Clustering](https://machinelearningmastery.com/dbscan-clustering-algorithm/)**
- **[Analytics Vidhya: DBSCAN Clustering Guide](https://www.analyticsvidhya.com/blog/2020/09/how-dbscan-clustering-works/)**

#### **Advanced Resources:**
- **[DBSCAN Parameter Selection Study](https://iopscience.iop.org/article/10.1088/1742-6596/1168/2/022022)**
- **[Density-Based Clustering Comparison](https://link.springer.com/article/10.1007/s10115-016-0964-9)**
- **[DBSCAN for Anomaly Detection Applications](https://ieeexplore.ieee.org/document/8844706)**

### 🔍 How DBSCAN Works for Outlier Detection

#### **Core Algorithm Concept:**
1. **Density-Based Clustering**: Groups points that lie in high-density areas
2. **Epsilon Neighborhood**: Points within `eps` distance of each other are neighbors
3. **Core Points**: Points with at least `min_samples` points in their epsilon neighborhood (in scikit-learn the count includes the point itself)
4. **Border Points**: Non-core points within `eps` of a core point
5. **Noise Points**: Points that are neither core nor border → **OUTLIERS**
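
The three point types can be recovered from a fitted scikit-learn model. The sketch below uses a small made-up 1-D array and illustrative parameter values; `core_sample_indices_` and `labels_` are attributes of the fitted `DBSCAN` estimator, while the `roles` array is a helper introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up 1-D data: three close points, one straggler, one isolated point
X = np.array([[1.0], [1.2], [1.4], [2.0], [5.0]])
db = DBSCAN(eps=0.7, min_samples=3).fit(X)

# Core points are listed explicitly by the fitted model
is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True

# Non-core points are noise if labeled -1, otherwise border points
roles = np.where(is_core, "core", np.where(db.labels_ == -1, "noise", "border"))
for value, label, role in zip(X[:, 0], db.labels_, roles):
    print(f"x={value:.1f}  label={label:>2}  role={role}")
```

Here 2.0 has only two points within `eps` (itself and 1.4), so it is not core, but it sits within `eps` of the core point 1.4 and therefore joins that cluster as a border point.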

#### **Mathematical Foundation:**

```python
# DBSCAN Algorithm Steps:

# 1. For each point p:
#    - Find all points within eps distance (epsilon neighborhood)
#    - If neighborhood has ≥ min_samples points: p is CORE point

# 2. Form clusters:
#    - Core points within eps of each other belong to the same cluster
#    - A border point joins the cluster of a core point that reaches it
#      (if several clusters reach it, the result depends on processing order)

# 3. Label remaining points as NOISE (outliers):
#    - Points not core and not within eps of any core point = -1 (outliers)

# Key Parameters:
# eps (epsilon): Maximum distance for two points to count as neighbors
# min_samples: Minimum points (the point itself included) within eps for a core point
```
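
The steps above can be sketched as a small brute-force implementation. This is an illustrative version written for readability, not scikit-learn's actual code: `simple_dbscan` is a hypothetical name, it assumes a 2-D array of shape `(n_samples, n_features)`, and it computes all pairwise distances, so it only suits small inputs.

```python
from collections import deque

import numpy as np


def simple_dbscan(X, eps, min_samples):
    """Brute-force DBSCAN sketch; returns one label per row, -1 for noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)

    # Step 1: eps-neighborhoods (each point counts itself) and core points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        # Only an unlabeled core point can start a new cluster
        if labels[i] != -1 or not is_core[i]:
            continue
        labels[i] = cluster_id
        queue = deque([i])
        # Step 2: grow the cluster outward through core points
        while queue:
            p = queue.popleft()
            if not is_core[p]:
                continue  # Border points join the cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    queue.append(q)
        cluster_id += 1

    # Step 3: anything still labeled -1 is noise
    return labels


X_demo = np.array([[1.0], [1.2], [1.4], [2.0], [5.0]])
demo_labels = simple_dbscan(X_demo, eps=0.7, min_samples=3)
print(demo_labels)
```

A border point reachable from two clusters is claimed by whichever cluster is expanded first, which is the order dependence mentioned for border points.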

#### **Visual Intuition:**
- **Dense regions**: Become clusters (labeled 0, 1, 2, ...)
- **Sparse isolated points**: Become noise/outliers (labeled -1)
- **Border points**: Sit on the edge of a cluster and are assigned to a cluster whose core point reaches them
- **Core points**: Points in the interior of dense regions

### 📊 Output Interpretation

Your output will be an array like: `[0, 0, -1, 1, 1, -1, 0, ...]`

**Interpretation:**
- **0, 1, 2, ...**: Cluster labels (normal points in dense regions)
- **-1**: Noise/outliers (isolated points in sparse regions)

**Practical Usage:**
```python
# Get cluster labels and outlier analysis
cluster_labels = dbscan.fit_predict(base_df[['Age']])
outlier_mask = cluster_labels == -1
outlier_indices = np.where(outlier_mask)[0]
outlier_customers = base_df.iloc[outlier_indices]

# Cluster analysis
n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
n_outliers = np.sum(outlier_mask)

print(f"Number of clusters: {n_clusters}")
print(f"Number of outliers: {n_outliers}")
print(f"Outlier percentage: {n_outliers/len(base_df)*100:.1f}%")

# Analyze each cluster
for cluster_id in set(cluster_labels):
    if cluster_id != -1:  # Skip outliers
        cluster_points = base_df[cluster_labels == cluster_id]
        print(f"\nCluster {cluster_id}:")
        print(f"  Size: {len(cluster_points)}")
        print(f"  Age range: {cluster_points['Age'].min():.1f} - {cluster_points['Age'].max():.1f}")
        print(f"  Age mean: {cluster_points['Age'].mean():.1f}")

# Show outlier details
if n_outliers > 0:
    print(f"\nOutlier Details:")
    outlier_details = pd.DataFrame({
        'Customer_Index': outlier_indices,
        'Age': base_df.iloc[outlier_indices]['Age'].values
    })
    print(outlier_details.sort_values('Age'))
```

**DBSCAN-Specific Insights:**
- **Dense age groups**: Reveals groups of customers with similar ages, provided `eps` is large enough to link neighboring ages
- **Automatic outlier detection**: No need to specify contamination rate
- **Cluster characteristics**: Each cluster can be read as a candidate customer age segment
- **Outliers**: Customers whose ages are isolated from every dense age group

### ⚖️ DBSCAN vs Other Outlier Detection Methods

| **Method** | **Strengths** | **Weaknesses** | **Best Use Case** |
|------------|---------------|----------------|-------------------|
| **Standard Z-Score** | ✅ Simple, fast<br/>✅ Interpretable<br/>✅ Global outliers | ❌ Assumes normality<br/>❌ Misses local outliers<br/>❌ Univariate only | Normally distributed, global outliers |
| **Modified Z-Score** | ✅ Robust to outliers<br/>✅ No normality assumption<br/>✅ Interpretable | ❌ Still global approach<br/>❌ Univariate only<br/>❌ Misses local patterns | Robust univariate outlier detection |
| **Isolation Forest** | ✅ Multivariate<br/>✅ No assumptions<br/>✅ Scalable<br/>✅ Global patterns | ❌ Parameter sensitive<br/>❌ Poor with local outliers<br/>❌ Less interpretable | Large datasets, global anomalies |
| **Local Outlier Factor** | ✅ Local outliers<br/>✅ Density-aware<br/>✅ Interpretable scores<br/>✅ Handles clusters | ❌ Sensitive to k parameter<br/>❌ O(n²) complexity<br/>❌ High-dimensional issues | Clustered data, local anomalies |
| **DBSCAN** | ✅ **Natural clustering + outliers**<br/>✅ **No contamination needed**<br/>✅ **Arbitrary cluster shapes**<br/>✅ **Robust to noise**<br/>✅ **Discovers data structure** | ❌ **Very parameter sensitive**<br/>❌ **Struggles with varying densities**<br/>❌ **High-dimensional curse**<br/>❌ **Difficult parameter tuning** | **Unknown cluster structure, natural groupings** |
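
As a rough illustration of the global versus density-based distinction in the table, the sketch below compares a plain z-score rule with DBSCAN on made-up bimodal ages. The data, the `|z| > 3` cutoff, and the DBSCAN parameters are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up ages: one group near 20, one near 60, and a lone customer aged 40
ages = np.array([19, 20, 21, 20, 59, 60, 61, 60, 40], dtype=float)

# Global rule: flag values more than 3 standard deviations from the mean
z_scores = (ages - ages.mean()) / ages.std()
z_outliers = np.abs(z_scores) > 3

# Density rule: flag points that belong to no dense group
labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(ages.reshape(-1, 1))
dbscan_outliers = labels == -1

print("z-score outliers:", ages[z_outliers])
print("DBSCAN outliers: ", ages[dbscan_outliers])
```

The customer aged 40 sits exactly at the mean, so the z-score rule sees nothing unusual, while DBSCAN marks that customer as noise because no other age lies within `eps`.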

### 🎯 Detailed Comparison

#### **DBSCAN Unique Strengths:**
1. **Simultaneous Clustering & Outlier Detection**: One algorithm, dual purpose
2. **No Pre-specified Cluster Count**: Automatically determines number of clusters
3. **Arbitrary Cluster Shapes**: Not limited to spherical clusters like k-means
4. **Natural Outlier Definition**: Points that don't belong to any dense region
5. **Robust to Noise**: Designed specifically to handle noisy data
6. **No Contamination Parameter**: Doesn't require knowing expected outlier percentage

#### **DBSCAN Weaknesses:**
1. **Parameter Sensitivity**: Results are highly dependent on `eps` and `min_samples`
2. **Varying Densities**: Struggles when clusters have different densities, because one global `eps` applies to all of them
3. **High Dimensionality**: Distances become less informative with many features (curse of dimensionality)
4. **Parameter Selection**: Heuristics such as the k-distance plot help, but no rule gives optimal parameters for every dataset
5. **Border Point Ambiguity**: A border point within `eps` of two clusters is assigned to whichever is processed first
6. **Scalability**: Roughly O(n log n) on average with spatial indexing, O(n²) in the worst case or without an index
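
The varying-density weakness can be seen on a tiny made-up 1-D example: two tight groups close together and one loose group far away. The values and the two `eps` settings are assumptions picked to make the trade-off visible.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups (spacing 0.1) and one loose group (spacing 2.0)
X = np.array([0.0, 0.1, 0.2, 1.5, 1.6, 1.7, 10.0, 12.0, 14.0]).reshape(-1, 1)

results = {}
for eps in (0.5, 2.5):
    results[eps] = DBSCAN(eps=eps, min_samples=2).fit_predict(X)
    print(f"eps={eps}: {results[eps]}")
```

With `eps=0.5` the two tight groups are separated but the loose group is labeled noise; with `eps=2.5` the loose group is found but the tight groups merge. No single `eps` recovers all three groups.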

### 🚀 Advanced Usage and Parameter Tuning

#### **Parameter Selection Strategies:**

```python
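def find_optimal_eps(X, min_samples=2, plot=True):
    """
    Plot the k-distance curve used to pick eps (elbow method).

    Returns the k-distances sorted in descending order; the suggested eps
    is only drawn on the plot.
    """
    from sklearn.neighbors import NearestNeighbors
    
    # Calculate k-distances. kneighbors(X) on the training data counts each point
    # as its own nearest neighbor, which matches how DBSCAN counts min_samples.
    k = min_samples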
    nbrs = NearestNeighbors(n_neighbors=k).fit(X)
    distances, indices = nbrs.kneighbors(X)
    
    # Sort distances to k-th neighbor
    k_distances = distances[:, k-1]
    k_distances_sorted = np.sort(k_distances)[::-1]
    
    if plot:
        plt.figure(figsize=(10, 6))
        plt.plot(range(len(k_distances_sorted)), k_distances_sorted, 'b-')
        plt.xlabel('Points (sorted by distance)')
        plt.ylabel(f'{k}-NN Distance')
        plt.title('K-Distance Plot for Eps Selection')
        plt.grid(True)
        
        # Try to find elbow automatically
        # Simple elbow detection using second derivative
        if len(k_distances_sorted) > 10:
            second_deriv = np.diff(k_distances_sorted, 2)
            elbow_idx = np.argmax(second_deriv) + 2
            suggested_eps = k_distances_sorted[elbow_idx]
            plt.axhline(y=suggested_eps, color='r', linestyle='--', 
                       label=f'Suggested eps: {suggested_eps:.2f}')
            plt.legend()
        
        plt.show()
    
    return k_distances_sorted

def evaluate_dbscan_parameters(X, eps_range, min_samples_range):
    """
    Evaluate different DBSCAN parameters
    """
    results = []
    
    for eps in eps_range:
        for min_samples in min_samples_range:
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            labels = dbscan.fit_predict(X)
            
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            n_outliers = np.sum(labels == -1)
            outlier_ratio = n_outliers / len(X)
            
            # Calculate silhouette score if we have clusters
            if n_clusters > 1 and n_outliers < len(X):
                from sklearn.metrics import silhouette_score
                # Remove outliers for silhouette calculation
                mask = labels != -1
                if np.sum(mask) > 1:
                    silhouette = silhouette_score(X[mask], labels[mask])
                else:
                    silhouette = -1
            else:
                silhouette = -1
            
            results.append({
                'eps': eps,
                'min_samples': min_samples,
                'n_clusters': n_clusters,
                'n_outliers': n_outliers,
                'outlier_ratio': outlier_ratio,
                'silhouette_score': silhouette
            })
    
    return pd.DataFrame(results)

# Apply parameter optimization
k_distances = find_optimal_eps(base_df[['Age']], min_samples=3)

# Test parameter ranges
eps_range = np.arange(0.5, 5.0, 0.5)
min_samples_range = [2, 3, 4, 5]

param_results = evaluate_dbscan_parameters(base_df[['Age']], eps_range, min_samples_range)

# Find best parameters (highest silhouette score among reasonable cluster counts)
candidates = param_results.loc[
    (param_results['n_clusters'] > 0) & 
    (param_results['n_clusters'] < 10) &
    (param_results['silhouette_score'] > 0)
].sort_values('silhouette_score', ascending=False)

if candidates.empty:
    print("No parameter combination met the criteria; widen eps_range or min_samples_range.")
else:
    best_params = candidates.iloc[0]
    print("Best DBSCAN Parameters:")
    print(f"eps: {best_params['eps']}")
    # The selected row is a float Series, so cast counts back to int for display
    print(f"min_samples: {int(best_params['min_samples'])}")
    print(f"Results: {int(best_params['n_clusters'])} clusters, {int(best_params['n_outliers'])} outliers")
```
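
One detail of the k-distance calculation is easy to miss: when `kneighbors` is called with the same data the model was fitted on, each point is returned as its own nearest neighbor at distance 0. The sketch below, on three made-up points, contrasts that with calling `kneighbors()` without arguments, which excludes the query point itself.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0], [1.0], [3.0]])
nbrs = NearestNeighbors(n_neighbors=2).fit(X)

# Passing X explicitly: column 0 is each point's distance to itself (0.0)
with_self, _ = nbrs.kneighbors(X)
print(with_self)

# Passing nothing: the training points are queried and self-matches are excluded
without_self, _ = nbrs.kneighbors()
print(without_self)
```

Since scikit-learn's `min_samples` also counts the point itself, taking column `k-1` of `kneighbors(X)` with `k = min_samples` is consistent with DBSCAN's core-point test.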

#### **Enhanced DBSCAN Implementation:**

```python
def enhanced_dbscan_analysis(X, eps=None, min_samples=None):
    """
    Enhanced DBSCAN with automatic parameter selection and detailed analysis
    """
    
    # Auto-select parameters if not provided
    if eps is None or min_samples is None:
        # Rule of thumb: min_samples = 2 * dimensions
        if min_samples is None:
            min_samples = max(2, 2 * X.shape[1])
        
        if eps is None:
            # Use k-distance method
            from sklearn.neighbors import NearestNeighbors
            k = min_samples
            nbrs = NearestNeighbors(n_neighbors=k).fit(X)
            distances, _ = nbrs.kneighbors(X)
            k_distances = np.sort(distances[:, k-1])[::-1]
            
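            # Simple elbow detection
            if len(k_distances) > 10:
                second_deriv = np.diff(k_distances, 2)
                elbow_idx = np.argmax(second_deriv) + 2
                eps = k_distances[elbow_idx]
            else:
                eps = np.mean(k_distances)
            
            # DBSCAN requires eps > 0; duplicate values can put the elbow at 0,
            # so fall back to the smallest positive k-distance (1.0 if none, an arbitrary default)
            if eps <= 0:
                positive = k_distances[k_distances > 0]
                eps = positive.min() if positive.size else 1.0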
    
    # Apply DBSCAN
    dbscan = DBSCAN(eps=eps, min_samples=min_samples, metric='euclidean')
    labels = dbscan.fit_predict(X)
    
    # Detailed analysis
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_outliers = np.sum(labels == -1)
    
    analysis = {
        'parameters': {'eps': eps, 'min_samples': min_samples},
        'results': {
            'n_clusters': n_clusters,
            'n_outliers': n_outliers,
            'outlier_ratio': n_outliers / len(X),
            'cluster_labels': labels
        },
        'cluster_details': {}
    }
    
    # Analyze each cluster
    for cluster_id in set(labels):
        if cluster_id != -1:
            cluster_mask = labels == cluster_id
            cluster_data = X[cluster_mask]
            
            analysis['cluster_details'][cluster_id] = {
                'size': np.sum(cluster_mask),
                'mean': np.mean(cluster_data, axis=0),
                'std': np.std(cluster_data, axis=0),
                'min': np.min(cluster_data, axis=0),
                'max': np.max(cluster_data, axis=0)
            }
    
    return analysis, dbscan

# Apply enhanced DBSCAN
analysis, dbscan_model = enhanced_dbscan_analysis(base_df[['Age']].values)

print("Enhanced DBSCAN Analysis:")
print(f"Parameters used: eps={analysis['parameters']['eps']:.2f}, min_samples={analysis['parameters']['min_samples']}")
print(f"Found {analysis['results']['n_clusters']} clusters and {analysis['results']['n_outliers']} outliers")
print(f"Outlier percentage: {analysis['results']['outlier_ratio']*100:.1f}%")

for cluster_id, details in analysis['cluster_details'].items():
    print(f"\nCluster {cluster_id}:")
    print(f"  Size: {details['size']}")
    print(f"  Age range: {details['min'][0]:.1f} - {details['max'][0]:.1f}")
    print(f"  Age mean ± std: {details['mean'][0]:.1f} ± {details['std'][0]:.1f}")
```

### 🎯 When to Use DBSCAN for Outlier Detection

**✅ Use DBSCAN when:**
- **Unknown cluster structure** - want to discover natural groupings
- **Arbitrary cluster shapes** - not limited to spherical clusters
- **Simultaneous clustering + outlier detection** needed
- **No prior knowledge** of outlier percentage
- **Robust noise handling** required
- **Moderate dataset size** (<10,000 points)

**❌ Don't use DBSCAN when:**
- **High-dimensional data** (>10 features) - curse of dimensionality
- **Clusters with very different densities** - a single `eps` cannot fit both tight and loose groups
- **Parameter tuning difficult** - unclear how to set eps/min_samples
- **Only outlier detection needed** - simpler methods may be better
- **Very large datasets** - scalability issues
- **Scoring new data on the fly** - scikit-learn's `DBSCAN` has no `predict` method, so new points require refitting or a custom assignment rule
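
Because scikit-learn's `DBSCAN` offers no `predict` method, one common workaround is to assign a new point to the cluster of its nearest core sample when that sample lies within `eps`, and to call it noise otherwise. The helper below is a hypothetical sketch of that rule (`assign_new_points` is not a scikit-learn function) and assumes the fitted model has at least one core sample.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def assign_new_points(fitted_dbscan, X_new):
    """Label new points by their nearest core sample; -1 if farther than eps."""
    core_points = fitted_dbscan.components_
    core_labels = fitted_dbscan.labels_[fitted_dbscan.core_sample_indices_]

    nn = NearestNeighbors(n_neighbors=1).fit(core_points)
    distances, indices = nn.kneighbors(X_new)

    new_labels = core_labels[indices[:, 0]].copy()
    new_labels[distances[:, 0] > fitted_dbscan.eps] = -1
    return new_labels


# Made-up training data with two well-separated groups
X_train = np.array([[1.0], [1.2], [1.4], [8.0], [8.2], [8.4]])
db = DBSCAN(eps=0.5, min_samples=2).fit(X_train)

X_new = np.array([[1.1], [8.3], [5.0]])
new_labels = assign_new_points(db, X_new)
print(new_labels)
```

This keeps the fitted clusters fixed, so it cannot discover new clusters or promote border points; refitting remains necessary when the data distribution changes.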

### 🏆 Recommendation for Your Customer Segmentation

For customer segmentation analysis, **DBSCAN is valuable** because:

1. **Natural customer segments**: Discovers age-based customer groups automatically
2. **Outlier identification**: Finds customers who don't fit standard age segments  
3. **No assumptions**: Doesn't assume specific number of customer segments
4. **Business insights**: Clusters represent actionable customer segments

**A reasonable starting point for your case:**
```python
# Recommended DBSCAN setup for customer age analysis
def customer_dbscan_analysis(customer_data):
    """
    DBSCAN optimized for customer segmentation
    """
    
    # Conservative parameters for business data
    # Start with rule-of-thumb and refine
    min_samples = 3  # At least 3 customers to form a segment
    
    # Use k-distance plot to find eps
    from sklearn.neighbors import NearestNeighbors
    nbrs = NearestNeighbors(n_neighbors=min_samples).fit(customer_data[['Age']])
    distances, _ = nbrs.kneighbors(customer_data[['Age']])
    k_distances = np.sort(distances[:, min_samples-1])[::-1]
    
    # Take 90th percentile as eps (conservative approach)
    eps = np.percentile(k_distances, 90)
    
    dbscan = DBSCAN(eps=eps, min_samples=min_samples, metric='euclidean')
    
    return dbscan, eps

# Apply to your customer data
optimal_dbscan, suggested_eps = customer_dbscan_analysis(base_df)
cluster_labels = optimal_dbscan.fit_predict(base_df[['Age']])

print(f"🎯 DBSCAN Customer Segmentation Results:")
print(f"Using eps={suggested_eps:.2f}, min_samples=3")

n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
n_outliers = np.sum(cluster_labels == -1)

print(f"Discovered {n_clusters} customer age segments")
print(f"Identified {n_outliers} outlier customers ({n_outliers/len(base_df)*100:.1f}%)")

# Business interpretation
if n_outliers > 0:
    outlier_customers = base_df[cluster_labels == -1]
    print(f"\nOutlier customers have ages: {sorted(outlier_customers['Age'].values)}")
    print("These customers may need special attention or represent niche segments!")
```

### 🎯 Summary: DBSCAN in Your Outlier Detection Arsenal

**Perfect Complementary Approach:**
1. **Z-Score**: Global statistical outliers
2. **Modified Z-Score**: Robust global outliers  
3. **Isolation Forest**: Multivariate global anomalies
4. **LOF**: Local density-based outliers
5. **DBSCAN**: **Clustering-based outliers + customer segmentation**

**Use DBSCAN specifically when you want to:**
- **Discover natural customer segments** while finding outliers
- **No prior assumptions** about number of customer groups
- **Find customers who don't belong to any major segment**
- **Combine clustering and outlier detection** in one step

DBSCAN is primarily a **clustering algorithm that identifies outliers as a byproduct**, which makes it a good fit for customer segmentation where you want both insight into customer groups and identification of unusual customers. 🎯

**Key Takeaway**: DBSCAN defines outliers as "customers who don't belong to any dense customer segment", a definition that maps directly onto business questions, provided `eps` and `min_samples` are tuned to the data.


In [None]:
# DBSCAN
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=2, metric='euclidean')
outliers = dbscan.fit_predict(base_df[['Age']])
print(outliers)


[ 0  1  2  3  4  5  6  3 -1  7  8  6  9 10 11  5  6  2 12  6  6 13 14  4
 15 16 17  6 18  3 19  1 20 21 22  1 23  7 24  2 25 10 26  4 22 10 27 28
 16  4 22 29  4 30 27 31 32 -1 28 20 33  0  8 15 34 21 35 36  0 37 33 31
 19 19 30 38 17 18  3 22 39 40  8 14  1 26 -1  5 41 27 36 21 26 18 37 10
 31 28 26  2  3 22  8 38 22  1 42 15 36 42 25  0 40  0 21  0 34 22 32 27
 28 40 18 43  3  4 35 18 30 40 31 43 13  4  2 16 44 37  0  6 39 37 45 37
 13 45 26 37 41 41 35 43 44 40 31 28 11  7 41  7 -1 16  0  4 27 24 23 29
 24 37 18 45 24 24 12  7  9 28 30  6 11 37 14 16 46  7 15 45 46 24 41 37
 29 40 31  6 17 37 37  7]


## 🎯 DBSCAN Parameter Tuning: Comprehensive Guide

### 📊 Core Parameters and Data-Driven Selection

#### **1. `eps` (Epsilon) - The Most Critical Parameter**

**What it controls:** Maximum distance between points to be considered neighbors

**Data-driven selection methods:**

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

def k_distance_plot_analysis(X, k_range=None, plot=True):
    """
    Comprehensive k-distance analysis for eps selection
    """
    
    if k_range is None:
        # Test multiple k values to understand data structure
        k_range = [2, 3, 4, 5, 6]
    
    eps_suggestions = {}
    
    for k in k_range:
        # Calculate k-distances
        nbrs = NearestNeighbors(n_neighbors=k).fit(X)
        distances, indices = nbrs.kneighbors(X)
        
        # Get k-th nearest neighbor distances
        k_distances = distances[:, k-1]
        k_distances_sorted = np.sort(k_distances)[::-1]
        
        # Multiple methods for elbow detection
        
        # Method 1: Maximum curvature (second derivative)
        if len(k_distances_sorted) > 10:
            # Smooth the curve first
            from scipy.ndimage import gaussian_filter1d
            smoothed = gaussian_filter1d(k_distances_sorted, sigma=2)
            
            # Calculate second derivative
            second_deriv = np.diff(smoothed, 2)
            elbow_idx = np.argmax(second_deriv) + 2
            eps_curvature = k_distances_sorted[elbow_idx]
        else:
            eps_curvature = np.median(k_distances_sorted)
        
        # Method 2: Percentile-based (conservative)
        eps_90th = np.percentile(k_distances_sorted, 90)
        eps_95th = np.percentile(k_distances_sorted, 95)
        
        # Method 3: Mean + std (statistical)
        eps_mean_std = np.mean(k_distances_sorted) + np.std(k_distances_sorted)
        
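        # Method 4: Largest gap method
        # Distances are sorted in descending order, so negate the differences to get drops
        gaps = -np.diff(k_distances_sorted)
        largest_gap_idx = np.argmax(gaps)
        eps_gap = k_distances_sorted[largest_gap_idx + 1]  # Value just below the largest drop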
        
        eps_suggestions[k] = {
            'curvature': eps_curvature,
            '90th_percentile': eps_90th,
            '95th_percentile': eps_95th,
            'mean_plus_std': eps_mean_std,
            'largest_gap': eps_gap,
            'distances': k_distances_sorted
        }
        
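        if plot:
            if k == k_range[0]:
                plt.figure(figsize=(12, 8))  # One figure shared by every k
            n_cols = 2
            n_rows = int(np.ceil(len(k_range) / n_cols))  # Grid large enough for all k values
            plt.subplot(n_rows, n_cols, list(k_range).index(k) + 1)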
            plt.plot(range(len(k_distances_sorted)), k_distances_sorted, 'b-', linewidth=2)
            plt.axhline(y=eps_curvature, color='r', linestyle='--', 
                       label=f'Curvature: {eps_curvature:.3f}')
            plt.axhline(y=eps_90th, color='g', linestyle='--', 
                       label=f'90th percentile: {eps_90th:.3f}')
            plt.axhline(y=eps_gap, color='orange', linestyle='--', 
                       label=f'Largest gap: {eps_gap:.3f}')
            
            plt.xlabel('Points (sorted by distance)')
            plt.ylabel(f'{k}-NN Distance')
            plt.title(f'K-Distance Plot (k={k})')
            plt.legend()
            plt.grid(True, alpha=0.3)
    
    if plot:
        plt.tight_layout()
        plt.show()
    
    return eps_suggestions

def analyze_data_characteristics(X):
    """
    Analyze data characteristics to inform parameter selection
    """
    
    n_samples, n_features = X.shape
    
    # Calculate basic statistics
    stats = {
        'n_samples': n_samples,
        'n_features': n_features,
        'data_range': np.ptp(X, axis=0),  # Peak-to-peak range
        'data_std': np.std(X, axis=0),
        'data_mean': np.mean(X, axis=0)
    }
    
    # Estimate data density
    from sklearn.neighbors import NearestNeighbors
    nbrs = NearestNeighbors(n_neighbors=min(10, n_samples-1)).fit(X)
    distances, _ = nbrs.kneighbors(X)
    avg_density = np.mean(distances[:, -1])  # Mean distance to the farthest queried neighbor (self included); larger means sparser
    
    stats['estimated_density'] = avg_density
    
    # Dimensionality considerations
    if n_features == 1:
        stats['dimensionality_type'] = 'univariate'
        stats['suggested_min_samples'] = max(3, int(np.sqrt(n_samples) * 0.1))
    elif n_features <= 3:
        stats['dimensionality_type'] = 'low_dimensional'
        stats['suggested_min_samples'] = max(4, int(np.sqrt(n_samples) * 0.15))
    elif n_features <= 10:
        stats['dimensionality_type'] = 'medium_dimensional'
        stats['suggested_min_samples'] = max(2 * n_features, int(np.sqrt(n_samples) * 0.2))
    else:
        stats['dimensionality_type'] = 'high_dimensional'
        stats['suggested_min_samples'] = max(2 * n_features, int(np.sqrt(n_samples) * 0.3))
        print("⚠️ Warning: DBSCAN may not perform well with high-dimensional data")
    
    # Sample size considerations
    if n_samples < 50:
        stats['sample_size_category'] = 'very_small'
        stats['eps_adjustment'] = 'increase'  # Need larger eps for connectivity
    elif n_samples < 200:
        stats['sample_size_category'] = 'small'
        stats['eps_adjustment'] = 'moderate'
    elif n_samples < 1000:
        stats['sample_size_category'] = 'medium'
        stats['eps_adjustment'] = 'standard'
    else:
        stats['sample_size_category'] = 'large'
        stats['eps_adjustment'] = 'decrease'  # Can use smaller eps
    
    return stats

# Apply comprehensive analysis to your data
print("🔍 Analyzing data characteristics...")
data_stats = analyze_data_characteristics(base_df[['Age']].values)

print("Data Characteristics:")
for key, value in data_stats.items():
    print(f"  {key}: {value}")

print(f"\n📊 K-distance analysis...")
eps_analysis = k_distance_plot_analysis(base_df[['Age']].values)

# Synthesize recommendations
print(f"\n💡 Eps Recommendations:")
for k, suggestions in eps_analysis.items():
    print(f"  k={k}:")
    for method, value in suggestions.items():
        if method != 'distances':
            print(f"    {method}: {value:.3f}")
```

#### **2. `min_samples` - Minimum Cluster Size**

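**What it controls:** Minimum number of points (the point itself included) that must lie within `eps` of a point for it to count as a core point. It sets a density threshold, which in practice also acts as a rough lower bound on cluster size.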
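
A minimal sketch of how the count works in scikit-learn, using three made-up points and an illustrative `eps`: a pair of nearby points forms a cluster with `min_samples=2` because each point counts itself, and stops being dense enough with `min_samples=3`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two nearby points and one distant point
X = np.array([[0.0], [0.1], [5.0]])

# min_samples=2: each of the pair sees 2 points within eps (itself + the other)
pair_labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(pair_labels)

# min_samples=3: no point has 3 points within eps, so everything is noise
strict_labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(strict_labels)
```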

**Heuristic Rules for min_samples:**

```python
def suggest_min_samples(X, domain_knowledge=None):
    """
    Suggest min_samples based on data characteristics and domain knowledge
    """
    
    n_samples, n_features = X.shape
    
    suggestions = {}
    
    # Rule 1: Classic heuristic (2 * dimensions)
    suggestions['classic_heuristic'] = 2 * n_features
    
    # Rule 2: Statistical rule based on sample size
    if n_samples < 100:
        suggestions['sample_size_based'] = max(2, int(n_samples * 0.02))  # 2% of data
    elif n_samples < 500:
        suggestions['sample_size_based'] = max(3, int(n_samples * 0.015))  # 1.5% of data
    else:
        suggestions['sample_size_based'] = max(4, int(n_samples * 0.01))   # 1% of data
    
    # Rule 3: Square root rule
    suggestions['sqrt_rule'] = max(2, int(np.sqrt(n_samples) * 0.2))
    
    # Rule 4: Domain-specific adjustments
    if domain_knowledge:
        if 'noise_level' in domain_knowledge:
            noise_level = domain_knowledge['noise_level']  # 'low', 'medium', 'high'
            if noise_level == 'low':
                suggestions['noise_adjusted'] = min(suggestions.values())  # More sensitive
            elif noise_level == 'medium':
                suggestions['noise_adjusted'] = int(np.median(list(suggestions.values())))
            else:  # high noise
                suggestions['noise_adjusted'] = max(suggestions.values())  # Less sensitive
        
        if 'min_cluster_size' in domain_knowledge:
            # Business requirement for minimum meaningful cluster size
            suggestions['business_requirement'] = domain_knowledge['min_cluster_size']
    
    # Rule 5: Data density consideration
    from sklearn.neighbors import NearestNeighbors
    nbrs = NearestNeighbors(n_neighbors=min(10, n_samples-1)).fit(X)
    distances, _ = nbrs.kneighbors(X)
    
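    # If data is very sparse, need higher min_samples
    # Note: kneighbors(X) on the fitted data includes each point itself at distance 0,
    # so the last column is the distance to the (n_neighbors - 1)-th other point
    avg_10th_distance = np.mean(distances[:, -1])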
    data_range = np.ptp(X)
    relative_sparsity = avg_10th_distance / data_range
    
    if relative_sparsity > 0.1:  # Sparse data
        suggestions['density_adjusted'] = max(suggestions.values()) + 1
    else:  # Dense data
        suggestions['density_adjusted'] = min(max(2, min(suggestions.values()) - 1), 
                                            max(suggestions.values()))
    
    # Final recommendation (conservative approach)
    final_suggestion = int(np.median(list(suggestions.values())))
    
    print("Min_samples suggestions:")
    for method, value in suggestions.items():
        print(f"  {method}: {value}")
    print(f"Final recommendation: {final_suggestion}")
    
    return final_suggestion, suggestions

# Apply min_samples analysis
domain_knowledge = {
    'noise_level': 'medium',  # Customer data typically has moderate noise
    'min_cluster_size': 3     # Want at least 3 customers per segment
}

optimal_min_samples, min_samples_analysis = suggest_min_samples(
    base_df[['Age']].values, 
    domain_knowledge
)
```

### 🧪 DBSCAN-Specific Validation Methodologies

#### **1. Parameter Grid Search with Multiple Metrics**

```python
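# Imports used by this block (harmless if already imported earlier in the notebook)
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def comprehensive_dbscan_validation(X, eps_range=None, min_samples_range=None):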
    """
    Comprehensive validation of DBSCAN parameters using multiple metrics
    """
    
    if eps_range is None:
        # Auto-generate eps range based on k-distance analysis
        eps_analysis = k_distance_plot_analysis(X, plot=False)
        eps_suggestions = []
        for k_suggestions in eps_analysis.values():
            eps_suggestions.extend([
                k_suggestions['curvature'],
                k_suggestions['90th_percentile'],
                k_suggestions['largest_gap']
            ])
        
        eps_min = min(eps_suggestions) * 0.5
        eps_max = max(eps_suggestions) * 2.0
        eps_range = np.linspace(eps_min, eps_max, 15)
    
    if min_samples_range is None:
        min_samples_range = [2, 3, 4, 5, 6, 8, 10]
    
    results = []
    
    for eps in eps_range:
        for min_samples in min_samples_range:
            try:
                # Fit DBSCAN
                dbscan = DBSCAN(eps=eps, min_samples=min_samples, metric='euclidean')
                labels = dbscan.fit_predict(X)
                
                # Basic metrics
                n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
                n_outliers = np.sum(labels == -1)
                outlier_ratio = n_outliers / len(X)
                
                # Advanced metrics
                metrics = {
                    'eps': eps,
                    'min_samples': min_samples,
                    'n_clusters': n_clusters,
                    'n_outliers': n_outliers,
                    'outlier_ratio': outlier_ratio
                }
                
                # Silhouette Score (only if we have clusters and not all points are outliers)
                if n_clusters > 1 and n_outliers < len(X) * 0.9:
                    mask = labels != -1
                    if np.sum(mask) > 1 and len(set(labels[mask])) > 1:
                        silhouette = silhouette_score(X[mask], labels[mask])
                        metrics['silhouette_score'] = silhouette
                    else:
                        metrics['silhouette_score'] = -1
                else:
                    metrics['silhouette_score'] = -1
                
                # Calinski-Harabasz Score (variance ratio)
                if n_clusters > 1 and n_outliers < len(X) * 0.9:
                    mask = labels != -1
                    if np.sum(mask) > n_clusters and len(set(labels[mask])) > 1:
                        from sklearn.metrics import calinski_harabasz_score
                        ch_score = calinski_harabasz_score(X[mask], labels[mask])
                        metrics['calinski_harabasz_score'] = ch_score
                    else:
                        metrics['calinski_harabasz_score'] = 0
                else:
                    metrics['calinski_harabasz_score'] = 0
                
                # Davies-Bouldin Score (lower is better)
                if n_clusters > 1 and n_outliers < len(X) * 0.9:
                    mask = labels != -1
                    if np.sum(mask) > n_clusters and len(set(labels[mask])) > 1:
                        from sklearn.metrics import davies_bouldin_score
                        db_score = davies_bouldin_score(X[mask], labels[mask])
                        metrics['davies_bouldin_score'] = db_score
                    else:
                        metrics['davies_bouldin_score'] = float('inf')
                else:
                    metrics['davies_bouldin_score'] = float('inf')
                
                # Cluster size statistics
                if n_clusters > 0:
                    cluster_sizes = []
                    for cluster_id in set(labels):
                        if cluster_id != -1:
                            cluster_sizes.append(np.sum(labels == cluster_id))
                    
                    metrics['min_cluster_size'] = min(cluster_sizes) if cluster_sizes else 0
                    metrics['max_cluster_size'] = max(cluster_sizes) if cluster_sizes else 0
                    metrics['avg_cluster_size'] = np.mean(cluster_sizes) if cluster_sizes else 0
                    metrics['cluster_size_std'] = np.std(cluster_sizes) if cluster_sizes else 0
                else:
                    metrics.update({
                        'min_cluster_size': 0,
                        'max_cluster_size': 0,
                        'avg_cluster_size': 0,
                        'cluster_size_std': 0
                    })
                
                # Custom quality score (weighted combination of metrics)
                quality_score = 0
                
                # Prefer reasonable number of clusters
                if 2 <= n_clusters <= min(10, len(X) // 10):
                    quality_score += 2
                elif n_clusters == 1:
                    quality_score += 0.5
                elif n_clusters == 0:
                    quality_score -= 2
                else:
                    quality_score -= 1
                
                # Prefer reasonable outlier ratio
                if 0.01 <= outlier_ratio <= 0.2:
                    quality_score += 2
                elif outlier_ratio <= 0.3:
                    quality_score += 1
                elif outlier_ratio > 0.5:
                    quality_score -= 2
                
                # Add silhouette score contribution
                if metrics['silhouette_score'] > 0:
                    quality_score += metrics['silhouette_score'] * 2
                
                # Penalize extreme parameter values
                # (the fixed 0.1 and 10 cutoffs depend on the scale of X; adjust them if the data is rescaled)
                if eps < 0.1 or eps > 10:
                    quality_score -= 1
                if min_samples > len(X) // 5:
                    quality_score -= 1
                
                metrics['quality_score'] = quality_score
                
                results.append(metrics)
                
            except Exception as e:
                print(f"Error with eps={eps:.3f}, min_samples={min_samples}: {e}")
                continue
    
    return pd.DataFrame(results)

# Apply comprehensive validation
print("🧪 Running comprehensive parameter validation...")
validation_results = comprehensive_dbscan_validation(base_df[['Age']].values)

# Find best parameters
best_results = validation_results[
    (validation_results['n_clusters'] > 0) &
    (validation_results['n_clusters'] <= 8) &
    (validation_results['outlier_ratio'] <= 0.3) &
    (validation_results['silhouette_score'] > 0)
].sort_values('quality_score', ascending=False)

if len(best_results) > 0:
    best_params = best_results.iloc[0]
    print(f"\n🏆 Best Parameters Found:")
    print(f"eps: {best_params['eps']:.3f}")
    print(f"min_samples: {int(best_params['min_samples'])}")
    print(f"Results: {int(best_params['n_clusters'])} clusters, {int(best_params['n_outliers'])} outliers")
    print(f"Quality Score: {best_params['quality_score']:.3f}")
    print(f"Silhouette Score: {best_params['silhouette_score']:.3f}")
else:
    print("⚠️ No suitable parameters found. Consider adjusting parameter ranges.")
```

#### **2. Stability Analysis**

```python
def dbscan_stability_analysis(X, eps, min_samples, n_trials=10, sample_fraction=0.8):
    """
    Analyze DBSCAN stability across different subsamples
    """
    
    n_samples = len(X)
    sample_size = int(n_samples * sample_fraction)
    
    stability_metrics = {
        'cluster_counts': [],
        'outlier_ratios': [],
        'cluster_assignments': [],
        'silhouette_scores': []
    }
    
    for trial in range(n_trials):
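        # Random subsample (a per-trial Generator keeps runs reproducible
        # without resetting NumPy's global random state)
        rng = np.random.default_rng(trial)
        sample_indices = rng.choice(n_samples, sample_size, replace=False)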
        X_sample = X[sample_indices]
        
        # Apply DBSCAN
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(X_sample)
        
        # Record metrics
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        outlier_ratio = np.sum(labels == -1) / len(X_sample)
        
        stability_metrics['cluster_counts'].append(n_clusters)
        stability_metrics['outlier_ratios'].append(outlier_ratio)
        
        # Silhouette score
        if n_clusters > 1 and outlier_ratio < 0.9:
            mask = labels != -1
            if np.sum(mask) > 1 and len(set(labels[mask])) > 1:
                silhouette = silhouette_score(X_sample[mask], labels[mask])
                stability_metrics['silhouette_scores'].append(silhouette)
            else:
                stability_metrics['silhouette_scores'].append(-1)
        else:
            stability_metrics['silhouette_scores'].append(-1)
        
        # Store cluster assignments for consensus analysis
        full_labels = np.full(n_samples, -2)  # -2 for not sampled
        full_labels[sample_indices] = labels
        stability_metrics['cluster_assignments'].append(full_labels)
    
    # Calculate stability statistics
    cluster_count_cv = np.std(stability_metrics['cluster_counts']) / np.mean(stability_metrics['cluster_counts']) if np.mean(stability_metrics['cluster_counts']) > 0 else float('inf')
    outlier_ratio_cv = np.std(stability_metrics['outlier_ratios']) / np.mean(stability_metrics['outlier_ratios']) if np.mean(stability_metrics['outlier_ratios']) > 0 else float('inf')
    
    valid_silhouettes = [s for s in stability_metrics['silhouette_scores'] if s > -1]
    avg_silhouette = np.mean(valid_silhouettes) if valid_silhouettes else -1
    silhouette_cv = np.std(valid_silhouettes) / np.mean(valid_silhouettes) if valid_silhouettes and np.mean(valid_silhouettes) > 0 else float('inf')
    
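    # Consensus outlier detection
    # Each point is drawn in only some trials (-2 marks "not sampled"), so the
    # vote is taken relative to the number of trials in which it actually appeared
    outlier_consensus = np.zeros(n_samples)
    times_sampled = np.zeros(n_samples)
    for assignment in stability_metrics['cluster_assignments']:
        outlier_consensus += (assignment == -1).astype(int)
        times_sampled += (assignment != -2).astype(int)
    
    consensus_outliers = (times_sampled > 0) & (outlier_consensus >= times_sampled * 0.5)  # Majority vote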
    
    results = {
        'cluster_count_stability': cluster_count_cv,
        'outlier_ratio_stability': outlier_ratio_cv,
        'avg_silhouette': avg_silhouette,
        'silhouette_stability': silhouette_cv,
        'consensus_outliers': consensus_outliers,
        'stability_metrics': stability_metrics
    }
    
    print(f"Stability Analysis (eps={eps:.3f}, min_samples={min_samples}):")
    print(f"  Cluster count CV: {cluster_count_cv:.3f} (lower is more stable)")
    print(f"  Outlier ratio CV: {outlier_ratio_cv:.3f} (lower is more stable)")
    print(f"  Average silhouette: {avg_silhouette:.3f}")
    print(f"  Silhouette CV: {silhouette_cv:.3f} (lower is more stable)")
    print(f"  Consensus outliers: {np.sum(consensus_outliers)} points")
    
    return results

# Test stability of best parameters
if len(best_results) > 0:
    stability_results = dbscan_stability_analysis(
        base_df[['Age']].values,
        best_params['eps'],
        int(best_params['min_samples'])
    )
```

#### **3. Business Logic Validation**

```python
def business_validation_dbscan(X, labels, domain_constraints, feature_names=None):
    """
    Validate DBSCAN results against business logic
    """
    
    if feature_names is None:
        feature_names = [f'feature_{i}' for i in range(X.shape[1])]
    
    validation_results = {}
    
    # Overall cluster analysis
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_outliers = np.sum(labels == -1)
    outlier_ratio = n_outliers / len(X)
    
    validation_results['cluster_summary'] = {
        'n_clusters': n_clusters,
        'n_outliers': n_outliers,
        'outlier_ratio': outlier_ratio
    }
    
    # Check against business constraints
    expected_clusters = domain_constraints.get('expected_clusters', None)
    max_outlier_ratio = domain_constraints.get('max_outlier_ratio', 0.2)
    min_cluster_size = domain_constraints.get('min_cluster_size', 3)
    
    # Validation flags
    validation_results['validations'] = {}
    
    if expected_clusters:
        cluster_count_ok = abs(n_clusters - expected_clusters) <= 2
        validation_results['validations']['cluster_count_reasonable'] = cluster_count_ok
    
    outlier_ratio_ok = outlier_ratio <= max_outlier_ratio
    validation_results['validations']['outlier_ratio_acceptable'] = outlier_ratio_ok
    
    # Cluster size validation
    cluster_sizes = []
    cluster_details = {}
    
    for cluster_id in set(labels):
        if cluster_id != -1:
            cluster_mask = labels == cluster_id
            cluster_data = X[cluster_mask]
            cluster_size = np.sum(cluster_mask)
            cluster_sizes.append(cluster_size)
            
            cluster_details[cluster_id] = {
                'size': cluster_size,
                'percentage': cluster_size / len(X) * 100,
                'centroid': np.mean(cluster_data, axis=0),
                'spread': np.std(cluster_data, axis=0)
            }
    
    min_size_ok = all(size >= min_cluster_size for size in cluster_sizes)
    validation_results['validations']['min_cluster_size_met'] = min_size_ok
    
    # Feature-specific validations (for customer age analysis)
    if 'Age' in feature_names or len(feature_names) == 1:
        age_idx = feature_names.index('Age') if 'Age' in feature_names else 0
        
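        # Age bounds are used by both the outlier and the cluster checks below,
        # so define them up front (otherwise the cluster loop raises NameError
        # whenever n_outliers == 0)
        min_reasonable_age = domain_constraints.get('min_reasonable_age', 16)
        max_reasonable_age = domain_constraints.get('max_reasonable_age', 80)
        
        # Outlier age analysis
        if n_outliers > 0:
            outlier_ages = X[labels == -1, age_idx]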
            
            reasonable_outliers = outlier_ages[
                (outlier_ages >= min_reasonable_age) & 
                (outlier_ages <= max_reasonable_age)
            ]
            
            validation_results['outlier_analysis'] = {
                'outlier_ages': sorted(outlier_ages),
                'reasonable_outliers_ratio': len(reasonable_outliers) / len(outlier_ages),
                'age_range': (outlier_ages.min(), outlier_ages.max())
            }
        
        # Cluster age analysis
        for cluster_id, details in cluster_details.items():
            age_mean = details['centroid'][age_idx]
            age_std = details['spread'][age_idx]
            
            # Check if cluster represents a reasonable age group
            cluster_reasonable = (
                age_mean >= min_reasonable_age and 
                age_mean <= max_reasonable_age and
                age_std < 15  # Age groups shouldn't be too spread out
            )
            
            cluster_details[cluster_id]['age_reasonable'] = cluster_reasonable
            cluster_details[cluster_id]['age_stats'] = {
                'mean': age_mean,
                'std': age_std
            }
    
    validation_results['cluster_details'] = cluster_details
    
    # Overall validation score
    validation_score = sum(validation_results['validations'].values()) / len(validation_results['validations'])
    validation_results['overall_validation_score'] = validation_score
    
    return validation_results

# Apply business validation
if len(best_results) > 0:
    # Run DBSCAN with best parameters
    final_dbscan = DBSCAN(eps=best_params['eps'], min_samples=int(best_params['min_samples']))
    final_labels = final_dbscan.fit_predict(base_df[['Age']].values)
    
    # Define business constraints for customer segmentation
    business_constraints = {
        'expected_clusters': 4,  # Target segment count; the check allows ±2, i.e. 2-6 segments
        'max_outlier_ratio': 0.15,  # Max 15% outliers acceptable
        'min_cluster_size': 5,  # Each segment should have at least 5 customers
        'min_reasonable_age': 18,
        'max_reasonable_age': 70
    }
    
    business_validation = business_validation_dbscan(
        base_df[['Age']].values,
        final_labels,
        business_constraints,
        ['Age']
    )
    
    print(f"\n📋 Business Validation Results:")
    print(f"Overall validation score: {business_validation['overall_validation_score']:.2f}")
    
    for validation, passed in business_validation['validations'].items():
        status = "✅" if passed else "❌"
        print(f"  {status} {validation}: {passed}")
    
    if 'outlier_analysis' in business_validation:
        print(f"\nOutlier Analysis:")
        print(f"  Outlier ages: {business_validation['outlier_analysis']['outlier_ages']}")
        print(f"  Reasonable outliers: {business_validation['outlier_analysis']['reasonable_outliers_ratio']:.2f}")
```

### 🎯 Complete DBSCAN Parameter Tuning Pipeline

```python
def complete_dbscan_tuning_pipeline(X, feature_names=None, domain_knowledge=None):
    """
    Complete pipeline for DBSCAN parameter optimization and validation
    """
    
    print("🔍 Step 1: Analyzing data characteristics...")
    data_stats = analyze_data_characteristics(X)
    
    print(f"\n📊 Step 2: K-distance analysis for eps selection...")
    eps_analysis = k_distance_plot_analysis(X, plot=True)
    
    print(f"\n🎯 Step 3: Min_samples suggestions...")
    optimal_min_samples, min_samples_analysis = suggest_min_samples(X, domain_knowledge)
    
    print(f"\n🧪 Step 4: Comprehensive parameter validation...")
    validation_results = comprehensive_dbscan_validation(X)
    
    # Filter and rank results
    good_results = validation_results[
        (validation_results['n_clusters'] > 0) &
        (validation_results['n_clusters'] <= 10) &
        (validation_results['outlier_ratio'] <= 0.4) &
        (validation_results['silhouette_score'] > 0)
    ].sort_values('quality_score', ascending=False)
    
    if len(good_results) == 0:
        print("⚠️ No good parameter combinations found. Relaxing constraints...")
        good_results = validation_results[
            (validation_results['n_clusters'] > 0)
        ].sort_values('quality_score', ascending=False)
    
    if len(good_results) > 0:
        # Test top 3 candidates for stability
        top_candidates = good_results.head(3)
        
        print(f"\n🔬 Step 5: Stability analysis of top candidates...")
        stability_results = []
        
        for _, candidate in top_candidates.iterrows():
            print(f"\nTesting eps={candidate['eps']:.3f}, min_samples={int(candidate['min_samples'])}")
            stability = dbscan_stability_analysis(
                X, candidate['eps'], int(candidate['min_samples']), n_trials=5
            )
            stability['params'] = candidate
            stability_results.append(stability)
        
        # Choose most stable candidate
        best_stability = min(stability_results, 
                           key=lambda x: x['cluster_count_stability'] + x['outlier_ratio_stability'])
        
        final_params = {
            'eps': best_stability['params']['eps'],
            'min_samples': int(best_stability['params']['min_samples'])
        }
        
        print(f"\n🏆 Final Recommended Parameters:")
        print(f"eps: {final_params['eps']:.3f}")
        print(f"min_samples: {final_params['min_samples']}")
        
        # Create final model and validate
        final_dbscan = DBSCAN(**final_params)
        final_labels = final_dbscan.fit_predict(X)
        
        print(f"\n📋 Step 6: Business validation...")
        if domain_knowledge and 'business_constraints' in domain_knowledge:
            business_validation = business_validation_dbscan(
                X, final_labels, domain_knowledge['business_constraints'], feature_names
            )
            
            print(f"Business validation score: {business_validation['overall_validation_score']:.2f}")
        
        return {
            'recommended_params': final_params,
            'final_model': final_dbscan,
            'final_labels': final_labels,
            'validation_results': validation_results,
            'stability_analysis': best_stability,
            'data_analysis': data_stats
        }
    else:
        print("❌ No suitable parameters found. Consider:")
        print("  - Checking data preprocessing")
        print("  - Using different distance metrics")
        print("  - Considering alternative clustering methods")
        return None

# Apply complete pipeline
domain_knowledge_full = {
    'noise_level': 'medium',
    'min_cluster_size': 3,
    'business_constraints': {
        'expected_clusters': 4,
        'max_outlier_ratio': 0.15,
        'min_cluster_size': 5,
        'min_reasonable_age': 18,
        'max_reasonable_age': 70
    }
}

optimal_dbscan_results = complete_dbscan_tuning_pipeline(
    base_df[['Age']].values,
    feature_names=['Age'],
    domain_knowledge=domain_knowledge_full
)
```

### 📝 DBSCAN Parameter Validation Checklist

#### **✅ Parameters are Well-Tuned When:**

1. **Eps Parameter Validation:**
   - Clear elbow in k-distance plot
   - Reasonable number of clusters (2-10 for most business cases)
   - Neither many tiny clusters nor a single giant cluster
   - Stable across different k values in k-distance analysis

2. **Min_samples Parameter Validation:**
   - Clusters have meaningful minimum size for business context
   - Not too sensitive to noise (stable results)
   - Balanced between over-clustering and under-clustering

3. **Overall Model Validation:**
   - High silhouette score (> 0.3)
   - Reasonable outlier ratio (typically 5-20%)
   - Stable results across subsamples
   - Business logic validation passes
   - Cluster sizes make business sense

#### **🚨 Red Flags (Poor Tuning):**

- **No clear clusters**: eps too small or too large
- **Everything is noise**: eps too small or min_samples too large  
- **Single giant cluster**: eps too large
- **Highly unstable**: Different results on subsamples
- **Unreasonable outliers**: Outliers don't make business sense
- **Poor separation**: Clusters overlap significantly
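
Two of these red flags can be checked mechanically from a label array. The helper below is a rough sketch: the function name `flag_dbscan_red_flags` and both thresholds are assumptions for illustration, not values taken from the checklist.

```python
import numpy as np

def flag_dbscan_red_flags(labels, max_outlier_ratio=0.5, giant_cluster_ratio=0.9):
    """Return warning strings for a DBSCAN label array (-1 = noise).

    Both thresholds are illustrative assumptions; tune them to the use case.
    """
    labels = np.asarray(labels)
    cluster_ids = [c for c in np.unique(labels) if c != -1]
    outlier_ratio = np.mean(labels == -1)
    flags = []

    # No cluster at all: every point was labeled noise
    if len(cluster_ids) == 0:
        flags.append("everything is noise: eps too small or min_samples too large")
        return flags

    # Clusters exist, but most points are still noise
    if outlier_ratio > max_outlier_ratio:
        flags.append("mostly noise: try a larger eps or a smaller min_samples")

    # One cluster that swallows nearly the whole dataset
    largest = max(int(np.sum(labels == c)) for c in cluster_ids)
    if len(cluster_ids) == 1 and largest / len(labels) >= giant_cluster_ratio:
        flags.append("single giant cluster: eps too large")

    return flags

# Example usage on hand-written label arrays
print(flag_dbscan_red_flags([-1, -1, -1]))
print(flag_dbscan_red_flags([0] * 19 + [-1]))
```

Instability and poor separation need more than the labels themselves (subsampling and a separation metric), so they are left out of this sketch.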

### 🎯 Specific Recommendations for Customer Segmentation

```python
def customer_segmentation_dbscan_optimizer(customer_data):
    """
    Specialized DBSCAN optimizer for customer segmentation
    """
    
    print("🎯 Customer Segmentation DBSCAN Optimization")
    
    # Customer-specific parameter ranges
    age_range = customer_data['Age'].max() - customer_data['Age'].min()
    
    # Conservative eps range (5-15% of age range)
    eps_range = np.linspace(age_range * 0.05, age_range * 0.15, 10)
    
    # Business-appropriate min_samples
    n_customers = len(customer_data)
    if n_customers < 100:
        min_samples_range = [2, 3, 4]
    elif n_customers < 300:
        min_samples_range = [3, 4, 5, 6]
    else:
        min_samples_range = [4, 5, 6, 8, 10]
    
    print(f"Testing eps range: {eps_range.min():.2f} - {eps_range.max():.2f}")
    print(f"Testing min_samples: {min_samples_range}")
    
    # Run validation
    results = comprehensive_dbscan_validation(
        customer_data[['Age']].values,
        eps_range,
        min_samples_range
    )
    
    # Business-focused filtering
    business_results = results[
        (results['n_clusters'] >= 2) &  # At least 2 customer segments
        (results['n_clusters'] <= 6) &  # Not too many segments
        (results['outlier_ratio'] <= 0.2) &  # Max 20% outliers
        (results['min_cluster_size'] >= 3)  # Each segment has at least 3 customers
    ].copy()  # Copy so the score column added below does not trigger SettingWithCopyWarning
    
    if len(business_results) > 0:
        # Prefer solutions with good silhouette and reasonable cluster count
        business_results['business_score'] = (
            business_results['silhouette_score'] * 2 +
            (1 / business_results['n_clusters']) * 0.5 +  # Prefer fewer clusters
            (1 - business_results['outlier_ratio']) * 1  # Prefer fewer outliers
        )
        
        best_business = business_results.sort_values('business_score', ascending=False).iloc[0]
        
        print(f"\n🎯 Customer Segmentation Results:")
        print(f"Optimal eps: {best_business['eps']:.2f}")
        print(f"Optimal min_samples: {int(best_business['min_samples'])}")
        print(f"Customer segments: {int(best_business['n_clusters'])}")
        print(f"Outlier customers: {int(best_business['n_outliers'])} ({best_business['outlier_ratio']*100:.1f}%)")
        print(f"Silhouette score: {best_business['silhouette_score']:.3f}")
        
        # Apply final model
        final_dbscan = DBSCAN(eps=best_business['eps'], min_samples=int(best_business['min_samples']))
        labels = final_dbscan.fit_predict(customer_data[['Age']].values)
        
        # Segment analysis
        print(f"\n📊 Customer Segment Analysis:")
        for cluster_id in sorted(set(labels)):
            if cluster_id == -1:
                segment_customers = customer_data[labels == cluster_id]
                print(f"Outlier customers: {len(segment_customers)} customers")
                if len(segment_customers) > 0:
                    ages = segment_customers['Age'].values
                    print(f"  Ages: {sorted(ages)}")
            else:
                segment_customers = customer_data[labels == cluster_id]
                ages = segment_customers['Age']
                print(f"Segment {cluster_id}: {len(segment_customers)} customers")
                print(f"  Age range: {ages.min():.0f} - {ages.max():.0f}")
                print(f"  Average age: {ages.mean():.1f} ± {ages.std():.1f}")
        
        return final_dbscan, labels, best_business
    else:
        print("⚠️ No suitable parameters found for customer segmentation")
        return None, None, None

# Apply customer-specific optimization
customer_dbscan, customer_labels, customer_results = customer_segmentation_dbscan_optimizer(base_df)
```

### 🎯 Summary: DBSCAN Parameter Tuning Best Practices

#### **🔧 Parameter Selection Rules:**

**For `eps`:**
1. **Use k-distance plots** - look for elbow point
2. **Test multiple k values** (2, 3, 4, 5) for robustness
3. **Consider data density** - sparse data needs larger eps
4. **Domain scaling** - eps should be meaningful in your domain units
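
As a compact illustration of the k-distance idea, the sketch below computes the sorted distance to each point's k-th nearest neighbor. The helper name `sorted_k_distances` and the toy values are assumptions; the only library call is scikit-learn's `NearestNeighbors`, whose `kneighbors()` without an argument excludes each point from its own neighbor list.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sorted_k_distances(X, k=4):
    """Distance from each point to its k-th nearest other point, sorted ascending."""
    nbrs = NearestNeighbors(n_neighbors=k).fit(X)
    # Calling kneighbors() with no argument queries the training points and
    # does not count a point as its own neighbor
    distances, _ = nbrs.kneighbors()
    return np.sort(distances[:, -1])

# Hypothetical toy data: four close values and one isolated value
toy_points = np.array([[0.0], [1.0], [2.0], [3.0], [50.0]])
toy_kdist = sorted_k_distances(toy_points, k=2)
print(toy_kdist)
```

Plotting the returned array against its index gives the k-distance curve; the sharp rise at the end (caused here by the isolated value) marks the region where a candidate `eps` would sit.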

**For `min_samples`:**
1. **Start with 2 × dimensions** (classic heuristic)
2. **Consider noise level** - higher noise needs larger min_samples
3. **Business constraints** - minimum meaningful cluster size
4. **Sample size** - 1-2% of dataset size as starting point

#### **🧪 Validation Methodology:**

1. **Multi-metric evaluation** - silhouette, Calinski-Harabasz, Davies-Bouldin
2. **Stability analysis** - consistent results across subsamples
3. **Business validation** - clusters make domain sense
4. **Parameter sensitivity** - robust to small parameter changes
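
Parameter sensitivity can be probed by nudging `eps` and measuring how much the labeling changes. The sketch below uses the adjusted Rand index for that comparison, which is one possible choice rather than the only one; the helper name `eps_sensitivity`, the 10% step, and the toy data are assumptions. Noise points (label -1) are treated as one more group in the comparison.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

def eps_sensitivity(X, eps, min_samples, rel_step=0.1):
    """Compare the labeling at eps with labelings at eps * (1 ± rel_step).

    Returns {perturbed_eps: adjusted Rand index}; values near 1.0 mean the
    clustering barely changes when eps is nudged.
    """
    base_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    scores = {}
    for factor in (1 - rel_step, 1 + rel_step):
        perturbed_eps = eps * factor
        perturbed_labels = DBSCAN(eps=perturbed_eps, min_samples=min_samples).fit_predict(X)
        scores[round(perturbed_eps, 6)] = adjusted_rand_score(base_labels, perturbed_labels)
    return scores

# Hypothetical toy data: two well-separated groups of ages
toy_groups = np.array([[20], [21], [22], [23], [50], [51], [52], [53]], dtype=float)
sensitivity = eps_sensitivity(toy_groups, eps=1.5, min_samples=2)
print(sensitivity)
```

Scores that drop well below 1.0 for a small step suggest the chosen `eps` sits near a boundary between two different clusterings.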

#### **🎯 For Customer Segmentation:**

- **eps**: 5-15% of age range (typically 2-8 years for age data)
- **min_samples**: 3-6 customers per segment
- **Expected clusters**: 2-6 customer segments
- **Max outliers**: 10-20% of customers
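
The `eps` rule of thumb is simple arithmetic on the observed age range. The ages below are hypothetical placeholders spanning 18 to 70; substitute the real `Age` column when applying this.

```python
import numpy as np

# Hypothetical ages spanning 18 to 70 (replace with the real Age column)
example_ages = np.array([18, 25, 33, 47, 58, 70])

age_range = int(example_ages.max() - example_ages.min())
eps_low, eps_high = 0.05 * age_range, 0.15 * age_range

print(f"age range = {age_range} years, "
      f"candidate eps between {eps_low:.1f} and {eps_high:.1f} years")
```

For a 52-year range this gives roughly 2.6 to 7.8 years, which is where the "2-8 years" guideline comes from.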

The key is **iterative refinement** - start with data-driven estimates, validate with multiple metrics, and adjust based on business requirements! 🎯