In [None]:
%store -r

print("Project configuration:")
print(f"SLUG = {SLUG}")
print(f"DATA_DIR = {DATA_DIR}")
print(f"DATASET_KEY = {DATASET_KEY}")
print(f"FIG_DIR = {FIG_DIR}")
print(f"REP_DIR = {REP_DIR}")
print(f"NOTEBOOK_DIR = {NOTEBOOK_DIR}")

missing_vars = [var for var in ['SLUG', 'DATA_DIR', 'FIG_DIR', 'REP_DIR', 'NOTEBOOK_DIR', 'DATASET_KEY'] if var not in globals()]
print(f"Vars not found in globals: {missing_vars}")

# Set default values if variables are not found in store or are empty
if not SLUG:  # Check if empty string
    from pathlib import Path  # imported here because the shared import cell runs later

    print(f"{SLUG=} is empty, initializing everything explicitly")
    SLUG = 'customer-segmentation'
    DATASET_KEY = 'vjchoudhary7/customer-segmentation-tutorial-in-python'
    GIT_ROOT = Path.cwd().parent.parent
    DATA_DIR = GIT_ROOT / 'data' / SLUG
    FIG_DIR = GIT_ROOT / 'figures' / SLUG
    REP_DIR = GIT_ROOT / 'reports' / SLUG
    NOTEBOOK_DIR = GIT_ROOT / 'notebooks' / SLUG


Project configuration:
SLUG = customer-segmentation
DATA_DIR = /Users/ravisharma/workdir/eda_practice/data/customer-segmentation
DATASET_KEY = vjchoudhary7/customer-segmentation-tutorial-in-python
FIG_DIR = /Users/ravisharma/workdir/eda_practice/figures/customer-segmentation
REP_DIR = /Users/ravisharma/workdir/eda_practice/reports/customer-segmentation
NOTEBOOK_DIR = /Users/ravisharma/workdir/eda_practice/notebooks/customer-segmentation
Vars not found in globals: []


In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from IPython.display import display

In [None]:
# Load the data from the local CSV

base_df = pd.DataFrame()

CSV_PATH = Path(DATA_DIR) / "Mall_Customers.csv"
if not CSV_PATH.exists():  # exists is a method; without the call the check is always False
    print(f"CSV {CSV_PATH} does not exist. base_df will remain empty.")
else:
    base_df = pd.read_csv(CSV_PATH)
    print(f"CSV {CSV_PATH} loaded successfully.")

base_df.head()

CSV /Users/ravisharma/workdir/eda_practice/data/customer-segmentation/Mall_Customers.csv loaded successfully.


Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


In [None]:
base_df.describe()

Unnamed: 0,CustomerID,Age,Annual Income (k$),Spending Score (1-100)
count,200.0,200.0,200.0,200.0
mean,100.5,38.85,60.56,50.2
std,57.879185,13.969007,26.264721,25.823522
min,1.0,18.0,15.0,1.0
25%,50.75,28.75,41.5,34.75
50%,100.5,36.0,61.5,50.0
75%,150.25,49.0,78.0,73.0
max,200.0,70.0,137.0,99.0


## 🎯 One-Class SVM (Support Vector Machine) - Comprehensive Analysis

### 📝 **Code Breakdown**

```python
# One-Class SVM
from sklearn.svm import OneClassSVM
one_class_svm = OneClassSVM(nu=0.1, kernel='rbf', gamma='auto')
outliers = one_class_svm.fit_predict(base_df[['Age']])
print(outliers)
```

**Line-by-line explanation:**

1. **Import**: `from sklearn.svm import OneClassSVM` - Imports the One-Class SVM implementation from scikit-learn
2. **Model Creation**: `OneClassSVM(nu=0.1, kernel='rbf', gamma='auto')` - Creates model with specific parameters:
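   - `nu=0.1`: Upper bound on the fraction of training points flagged as outliers (about 10%) and lower bound on the fraction of support vectors
   - `kernel='rbf'`: Radial Basis Function kernel for non-linear decision boundaries
   - `gamma='auto'`: Sets the RBF kernel coefficient to `1/n_features`, which is 1.0 for the single `Age` feature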
3. **Fit & Predict**: `fit_predict(base_df[['Age']])` - Trains model and predicts outliers in one step
4. **Output**: Array of 1s (normal) and -1s (outliers)

### 📚 **Essential Documentation & Resources**

#### **Official Documentation:**
- **[Scikit-learn OneClassSVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html)** - Complete API reference
- **[Scikit-learn Outlier Detection Guide](https://scikit-learn.org/stable/modules/outlier_detection.html)** - Overview of outlier detection methods
- **[Scikit-learn SVM User Guide](https://scikit-learn.org/stable/modules/svm.html)** - Comprehensive SVM theory and usage

#### **Research Papers:**
- **[Original One-Class SVM Paper](http://www.jmlr.org/papers/volume2/scholkopf01a/scholkopf01a.pdf)** - Schölkopf et al. (2001) "Estimating the Support of a High-Dimensional Distribution"
- **[SVDD Paper](https://www.jmlr.org/papers/volume5/tax04a/tax04a.pdf)** - Tax & Duin (2004) "Support Vector Data Description"

#### **Helpful Blog Posts & Tutorials:**
- **[Towards Data Science: One-Class SVM](https://towardsdatascience.com/outlier-detection-with-one-class-svms-5403a1a1878c)** - Practical tutorial with examples
- **[Machine Learning Mastery: One-Class SVM](https://machinelearningmastery.com/one-class-classification-algorithms/)** - Comprehensive guide to one-class classification
- **[Analytics Vidhya: Anomaly Detection](https://www.analyticsvidhya.com/blog/2019/02/outlier-detection-python-pyod/)** - Comparison of outlier detection methods

### 🔍 **Algorithm Theory & How It Works**

**Core Concept:**
One-Class SVM learns a decision boundary that encapsulates the "normal" data points in a high-dimensional feature space, treating anything outside this boundary as an outlier.

**Mathematical Foundation:**
1. **Kernel Transformation**: Maps data to higher-dimensional space using RBF kernel: `K(x,x') = exp(-γ||x-x'||²)`
2. **Optimization Problem**: Finds hyperplane that separates data from origin with maximum margin
3. **Decision Function**: `f(x) = Σᵢ αᵢ K(xᵢ,x) - ρ` where αᵢ are the learned dual coefficients (training points xᵢ with αᵢ > 0 are the support vectors) and ρ is the offset
4. **Classification**: Points with `f(x) ≥ 0` are normal, `f(x) < 0` are outliers
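
The decision function above can be checked numerically. The following is a minimal sketch on synthetic data (the array `X_demo` and the value `gamma_demo` are illustrative assumptions, not part of the notebook's dataset). It rebuilds `f(x)` from the fitted attributes `support_vectors_`, `dual_coef_`, and `intercept_`, where scikit-learn stores `-ρ` in `intercept_`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import OneClassSVM

# Synthetic 1-D data standing in for a feature such as Age (assumption, not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=40.0, scale=14.0, size=(200, 1))

gamma_demo = 0.01  # explicit float so the kernel can be recomputed by hand
demo_model = OneClassSVM(nu=0.1, kernel="rbf", gamma=gamma_demo).fit(X_demo)

# K(x_i, x) between every support vector x_i and every sample x
K = rbf_kernel(demo_model.support_vectors_, X_demo, gamma=gamma_demo)

# f(x) = sum_i alpha_i * K(x_i, x) - rho
# dual_coef_ holds the alpha_i values, intercept_ holds -rho
manual_scores = (demo_model.dual_coef_ @ K + demo_model.intercept_).ravel()

library_scores = demo_model.decision_function(X_demo)
print("max abs difference:", np.max(np.abs(manual_scores - library_scores)))
```

If the two score arrays agree up to floating-point error, the formula and the library's `decision_function` describe the same quantity.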

**Visual Intuition:**
- Imagine fitting a "bubble" around your normal data points
- The SVM finds the optimal bubble boundary that contains most data
- Points outside the bubble are considered outliers

### 📊 **Output Interpretation & Practical Usage**

**Understanding the Output:**
```python
# Example output: [1 1 -1 1 1 1 -1 1 ...]
# 1  = Normal point (inside the decision boundary)
# -1 = Outlier point (outside the decision boundary)
```

**Practical Application Code:**
```python
# Get outlier indices and values
outlier_indices = np.where(outliers == -1)[0]
normal_indices = np.where(outliers == 1)[0]

print(f"Found {len(outlier_indices)} outliers out of {len(base_df)} customers")
print(f"Outlier percentage: {len(outlier_indices)/len(base_df)*100:.1f}%")

# Examine outlier customers
outlier_customers = base_df.iloc[outlier_indices]
normal_customers = base_df.iloc[normal_indices]

print(f"\nOutlier Age Statistics:")
print(f"Age range: {outlier_customers['Age'].min()} - {outlier_customers['Age'].max()}")
print(f"Mean age: {outlier_customers['Age'].mean():.1f}")
print(f"\nNormal Age Statistics:")
print(f"Age range: {normal_customers['Age'].min()} - {normal_customers['Age'].max()}")
print(f"Mean age: {normal_customers['Age'].mean():.1f}")

# Get decision scores for ranking
decision_scores = one_class_svm.decision_function(base_df[['Age']])
# More negative scores = more outlier-like
print(f"\nMost outlier-like customers (lowest scores):")
most_outlier_idx = np.argsort(decision_scores.flatten())[:5]
print(base_df.iloc[most_outlier_idx][['Age']])
```

**Business Interpretation for Customer Segmentation:**
- **Outlier customers** may represent:
  - Unique customer segments (very young/old customers)
  - Data entry errors
  - Special cases requiring different marketing strategies
  - VIP customers with unusual behavior patterns

### ⚖️ **Strengths vs Weaknesses**

```mermaid
graph TB
    subgraph "One-Class SVM Characteristics"
        A["🎯 One-Class SVM"] --> B["Strengths"]
        A --> C["Weaknesses"]
        
        B --> B1["Non-linear boundaries<br/>(RBF kernel)"]
        B --> B2["No assumption about<br/>data distribution"]
        B --> B3["Robust to<br/>high dimensions"]
        B --> B4["Theoretically<br/>well-founded"]
        B --> B5["Flexible kernel<br/>choices"]
        
        C --> C1["Complex parameter<br/>tuning (nu, gamma)"]
        C --> C2["Computationally<br/>expensive O(n²) to O(n³)"]
        C --> C3["Black box<br/>(hard to interpret)"]
        C --> C4["Sensitive to<br/>kernel parameters"]
        C --> C5["Memory intensive<br/>for large datasets"]
    end
    
    style A fill:#e1f5fe
    style B fill:#e8f5e8
    style C fill:#fce4ec
```

**Strengths:**
- ✅ **Non-linear boundaries** (RBF kernel)
- ✅ **No assumption about data distribution**
- ✅ **Robust to high dimensions**
- ✅ **Theoretically well-founded**
- ✅ **Flexible kernel choices**

**Weaknesses:**
- ❌ **Complex parameter tuning** (nu, gamma)
- ❌ **Computationally expensive** (training scales between O(n²) and O(n³) in the number of samples)
- ❌ **Black box** (hard to interpret)
- ❌ **Sensitive to kernel parameters**
- ❌ **Memory intensive** for large datasets

### 📈 **Detailed Comparison with Other Outlier Detection Methods**

| **Method** | **Type** | **Assumptions** | **Complexity** | **Interpretability** | **Best For** | **Limitations** |
|------------|----------|-----------------|-----------------|---------------------|--------------|-----------------|
| **One-Class SVM** | Non-parametric | None (kernel-based) | High O(n²) to O(n³) | Low | Complex boundaries, high-dim | Parameter tuning, computation |
| **Standard Z-Score** | Parametric | Normal distribution | Low O(n) | High | Quick screening | Assumes normality |
| **Modified Z-Score** | Robust | Symmetric distribution | Low O(n) | High | Robust univariate | Univariate only |
| **Isolation Forest** | Tree-based | None | Medium O(n log n) | Medium | Large datasets, fast | Less precise boundaries |
| **LOF** | Density-based | Local density varies | High O(n²) | Medium | Local outliers | Neighborhood sensitive |
| **DBSCAN** | Clustering | Density clusters exist | Medium O(n log n) | High | Cluster-based outliers | Parameter sensitive |
| **Elliptic Envelope** | Parametric | Gaussian/elliptical | Low O(n) | High | Multivariate Gaussian | Assumes elliptical shape |
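
To make the first two rows of the table concrete, here is a small sketch comparing the standard and modified Z-score on a toy array. The helper names, the toy values, and the thresholds (3.0 and 3.5, with the usual 0.6745 scaling constant for the median absolute deviation) are illustrative assumptions rather than anything taken from this notebook.

```python
import numpy as np

def zscore_flags(values, threshold=3.0):
    """Flag points whose standard Z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

def modified_zscore_flags(values, threshold=3.5):
    """Flag points using the median and the median absolute deviation (MAD)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # No spread around the median: nothing can be ranked, so flag nothing
        return np.zeros(values.shape, dtype=bool)
    modified_z = 0.6745 * (values - median) / mad
    return np.abs(modified_z) > threshold

toy_values = [21, 22, 23, 24, 120]  # one obviously extreme value
print("standard Z-score flags:", zscore_flags(toy_values))
print("modified Z-score flags:", modified_zscore_flags(toy_values))
```

In this toy case the extreme value inflates the mean and standard deviation enough to hide itself from the standard Z-score, while the median-based version still flags it. That is the sense in which the table calls the modified Z-score "robust".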

### 🎯 **When to Use One-Class SVM**

**✅ Ideal Scenarios:**
- **Complex decision boundaries** needed (non-linear patterns)
- **No distributional assumptions** can be made
- **High-dimensional data** with complex relationships
- **Theoretical rigor** is important for your application
- **Small to medium datasets** where computation isn't limiting

**❌ Avoid When:**
- **Large datasets** (>10,000 points) due to computational cost
- **Simple linear patterns** (Z-score would be sufficient)
- **Real-time applications** requiring fast predictions
- **Interpretability is crucial** for business decisions
- **Limited computational resources**

### 🔧 **Parameter Tuning Guide**

**Key Parameters:**

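1. **`nu` (0 < nu ≤ 1)**: Upper bound on the fraction of training points treated as outliers and lower bound on the fraction of support vectors; in practice, roughly the expected outlier fraction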
   - **0.05**: Very conservative (5% outliers)
   - **0.1**: Moderate (10% outliers) - **current setting**
   - **0.2**: Liberal (20% outliers)
   - **Rule**: Start with domain knowledge of expected outlier rate

2. **`kernel`**: Type of kernel function
   - **'rbf'**: Radial Basis Function - **current setting** (most common)
   - **'linear'**: Linear kernel (simpler boundaries)
   - **'poly'**: Polynomial kernel (specific polynomial patterns)
   - **'sigmoid'**: Sigmoid kernel (neural network-like)

3. **`gamma`**: RBF kernel coefficient
   - **'auto'**: 1/n_features - **current setting**
   - **'scale'**: 1/(n_features × X.var()), the scikit-learn default since version 0.22
   - **Float**: Custom value (higher = more complex boundaries)
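
The two string options for `gamma` can be computed by hand. This is a minimal sketch on a synthetic, Age-like column (`X_age_like` is a hypothetical stand-in, not the notebook's data). It shows how far apart `'auto'` and `'scale'` are on unscaled values, and that passing the hand-computed float reproduces `gamma='scale'`.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical stand-in for an unscaled Age column
rng = np.random.default_rng(42)
X_age_like = rng.normal(loc=39.0, scale=14.0, size=(200, 1))

n_features = X_age_like.shape[1]
gamma_auto = 1.0 / n_features                          # what gamma='auto' resolves to
gamma_scale = 1.0 / (n_features * X_age_like.var())    # what gamma='scale' resolves to
print(f"gamma='auto'  -> {gamma_auto:.4f}")
print(f"gamma='scale' -> {gamma_scale:.4f}")

# Fitting with the string option and with the hand-computed float should agree
model_scale = OneClassSVM(nu=0.1, kernel="rbf", gamma="scale").fit(X_age_like)
model_manual = OneClassSVM(nu=0.1, kernel="rbf", gamma=gamma_scale).fit(X_age_like)
same_scores = np.allclose(
    model_scale.decision_function(X_age_like),
    model_manual.decision_function(X_age_like),
)
print("string and float gamma give the same scores:", same_scores)
```

On a column with a variance in the hundreds, `'auto'` is a couple of orders of magnitude larger than `'scale'`, which is why the choice matters more on raw features than on standardized ones.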

**Optimization Code:**
```python
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import silhouette_score

def optimize_one_class_svm(X, param_grid=None):
    if param_grid is None:
        param_grid = {
            'nu': [0.05, 0.1, 0.15, 0.2],
            'gamma': ['auto', 'scale', 0.001, 0.01, 0.1, 1],
            'kernel': ['rbf']  # Focus on RBF for simplicity
        }
    
    best_score = -np.inf
    best_params = None
    results = []
    
    for params in ParameterGrid(param_grid):
        try:
            # OneClassSVM takes no random_state argument; passing one raises a TypeError
            model = OneClassSVM(**params)
            predictions = model.fit_predict(X)
            
            # Skip if all points classified as outliers or all as normal
            if len(np.unique(predictions)) < 2:
                continue
                
            # Calculate silhouette score as quality metric
            score = silhouette_score(X, predictions)
            
            # Count outliers
            n_outliers = np.sum(predictions == -1)
            outlier_ratio = n_outliers / len(X)
            
            results.append({
                'params': params,
                'score': score,
                'n_outliers': n_outliers,
                'outlier_ratio': outlier_ratio
            })
            
            if score > best_score:
                best_score = score
                best_params = params
                
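        except Exception as e:
            # Report failures instead of silently discarding the parameter set
            print(f"Skipping {params}: {e}")
            continue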
    
    return best_params, best_score, results

# Apply optimization
best_params, best_score, all_results = optimize_one_class_svm(base_df[['Age']].values)
print(f"Best parameters: {best_params}")
print(f"Best silhouette score: {best_score:.3f}")
```

### 💡 **Recommendations for Your Customer Segmentation**

**Current Settings Analysis:**
- `nu=0.1` (10% outliers) - **Reasonable** for customer data
- `kernel='rbf'` - **Good choice** for non-linear patterns  
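- `gamma='auto'` - resolves to `1/n_features = 1.0` here, which is **large** for unscaled `Age` (variance ≈ 195) and tends to give an overly tight boundary; it is also not the scikit-learn default, which is `'scale'`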

**Suggested Improvements:**
1. **Try `gamma='scale'`** - Often performs better than 'auto'
2. **Test different `nu` values** - 0.05-0.15 range for business data
3. **Consider multivariate analysis** - Use multiple features (Age + Income + Spending)
4. **Validate results** - Check if outliers make business sense

**Enhanced Implementation:**
```python
# Multi-feature One-Class SVM for better customer insights
features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
X_multi = base_df[features].values

# Standardize features (important for SVM)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_multi)

# Apply One-Class SVM
enhanced_svm = OneClassSVM(nu=0.08, kernel='rbf', gamma='scale')
multi_outliers = enhanced_svm.fit_predict(X_scaled)

print(f"Multi-feature outlier detection:")
print(f"Outliers found: {np.sum(multi_outliers == -1)} ({np.sum(multi_outliers == -1)/len(base_df)*100:.1f}%)")
```

## 🎯 **Summary: One-Class SVM for Customer Outlier Detection**

**Key Takeaways:**
1. **Powerful non-linear** outlier detection without distributional assumptions
2. **Current parameters** (nu=0.1, rbf, gamma='auto') are only a starting point; on unscaled data, compare `gamma='auto'` against `gamma='scale'` before trusting the flagged points
3. **Best for complex patterns** but computationally expensive for large datasets
4. **Consider multi-feature analysis** for richer customer insights
5. **Validate business relevance** of detected outliers

One-Class SVM excels at finding complex outlier patterns that simpler methods might miss, making it valuable for sophisticated customer segmentation where you suspect non-linear relationships in your data! 🎯

In [None]:
# One-Class SVM
from sklearn.svm import OneClassSVM
one_class_svm = OneClassSVM(nu=0.1, kernel='rbf', gamma='auto')
outliers = one_class_svm.fit_predict(base_df[['Age']])
print(outliers)


[-1  1 -1  1 -1  1  1  1 -1 -1  1  1  1 -1  1  1  1 -1  1  1  1  1  1 -1
  1  1  1  1 -1  1  1  1 -1 -1  1  1 -1 -1  1 -1 -1 -1  1 -1  1 -1  1  1
  1 -1  1  1 -1 -1  1 -1 -1  1  1 -1 -1 -1  1  1  1 -1 -1 -1 -1 -1 -1 -1
  1  1 -1 -1  1 -1  1  1  1 -1  1  1  1  1  1  1 -1  1 -1 -1  1 -1 -1 -1
 -1  1  1 -1  1  1  1 -1  1  1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1  1  1 -1  1
  1 -1 -1  1  1 -1 -1 -1 -1 -1 -1  1  1 -1 -1  1  1 -1 -1  1  1 -1 -1 -1
  1 -1  1 -1 -1 -1 -1  1  1 -1 -1  1  1 -1 -1 -1  1  1 -1 -1  1  1 -1  1
  1 -1 -1 -1  1  1  1 -1  1  1 -1  1  1 -1  1  1 -1 -1  1 -1 -1  1 -1 -1
  1 -1 -1  1  1 -1 -1 -1]


## 🎯 One-Class SVM Parameter Tuning: Comprehensive Guide

### 📊 Core Parameters and Data-Driven Selection

#### **1. `nu` - Expected Outlier Fraction**

**What it controls:** Upper bound on the fraction of training errors and lower bound on the fraction of support vectors
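
The support-vector half of this statement is easy to observe directly. The sketch below uses synthetic, standardized data (`X_raw` and `X_std` are illustrative assumptions) and reports, for a few `nu` values, the fraction of support vectors and the fraction of training points labelled `-1`. The flagged fraction is usually close to `nu`, but numerical tolerance in the solver can move it slightly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Synthetic single feature, standardized before fitting (assumption, not the real dataset)
rng = np.random.default_rng(7)
X_raw = rng.normal(loc=39.0, scale=14.0, size=(300, 1))
X_std = StandardScaler().fit_transform(X_raw)

nu_report = {}
for nu in (0.05, 0.1, 0.2):
    model = OneClassSVM(nu=nu, kernel="rbf", gamma="scale").fit(X_std)
    sv_fraction = len(model.support_) / len(X_std)
    flagged_fraction = float(np.mean(model.predict(X_std) == -1))
    nu_report[nu] = (sv_fraction, flagged_fraction)
    print(f"nu={nu:.2f}  support vectors={sv_fraction:.3f}  flagged={flagged_fraction:.3f}")
```

Reading the two columns side by side gives a quick sanity check on a fitted model: a support-vector fraction far above `nu` suggests the kernel is too narrow for the data.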

**Data-driven selection methods:**

```python
import numpy as np
import pandas as pd
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from scipy import stats
import matplotlib.pyplot as plt

def estimate_nu_for_one_class_svm(X, methods=['iqr', 'zscore', 'mahalanobis', 'domain']):
    """
    Estimate nu parameter using multiple statistical methods
    """
    
    nu_estimates = {}
    
    # Method 1: IQR-based estimation (univariate)
    if 'iqr' in methods:
        for i in range(X.shape[1]):
            data = X[:, i]
            Q1 = np.percentile(data, 25)
            Q3 = np.percentile(data, 75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            iqr_outliers = np.sum((data < lower_bound) | (data > upper_bound))
            nu_estimates[f'iqr_feature_{i}'] = iqr_outliers / len(data)
    
    # Method 2: Z-score based (univariate)
    if 'zscore' in methods:
        for i in range(X.shape[1]):
            data = X[:, i]
            z_scores = np.abs(stats.zscore(data))
            zscore_outliers = np.sum(z_scores > 2.5)
            nu_estimates[f'zscore_feature_{i}'] = zscore_outliers / len(data)
    
    # Method 3: Mahalanobis distance (multivariate)
    if 'mahalanobis' in methods and X.shape[1] > 1:
        try:
            mean = np.mean(X, axis=0)
            cov = np.cov(X.T)
            inv_cov = np.linalg.pinv(cov)
            
            mahal_distances = []
            for i in range(len(X)):
                diff = X[i] - mean
                mahal_dist = np.sqrt(diff.T @ inv_cov @ diff)
                mahal_distances.append(mahal_dist)
            
            mahal_distances = np.array(mahal_distances)
            
            # Use chi-squared distribution for p dimensions
            p = X.shape[1]
            chi2_95 = stats.chi2.ppf(0.95, p)
            chi2_99 = stats.chi2.ppf(0.99, p)
            
            outliers_95 = np.sum(mahal_distances**2 > chi2_95)
            outliers_99 = np.sum(mahal_distances**2 > chi2_99)
            
            nu_estimates['mahalanobis_95'] = outliers_95 / len(X)
            nu_estimates['mahalanobis_99'] = outliers_99 / len(X)
            
        except np.linalg.LinAlgError:
            pass
    
    # Method 4: Domain-specific heuristics
    if 'domain' in methods:
        n_samples = len(X)
        
        # Conservative estimates based on data type
        if n_samples < 100:
            nu_estimates['domain_small_dataset'] = 0.15
        elif n_samples < 1000:
            nu_estimates['domain_medium_dataset'] = 0.10
        else:
            nu_estimates['domain_large_dataset'] = 0.05
        
        # Feature-based heuristics
        if X.shape[1] == 1:
            nu_estimates['domain_univariate'] = 0.08
        elif X.shape[1] <= 5:
            nu_estimates['domain_low_dim'] = 0.10
        else:
            nu_estimates['domain_high_dim'] = 0.12
    
    # Calculate statistics
    estimates = [v for v in nu_estimates.values() if 0 < v <= 0.5]
    
    if estimates:
        conservative_nu = min(estimates)
        liberal_nu = max(estimates)
        median_nu = np.median(estimates)
        robust_nu = np.clip(median_nu, 0.01, 0.3)
    else:
        conservative_nu = liberal_nu = median_nu = robust_nu = 0.1
    
    print("Nu Estimates:")
    for method, estimate in nu_estimates.items():
        print(f"  {method}: {estimate:.3f}")
    
    print(f"\nConservative nu: {conservative_nu:.3f}")
    print(f"Liberal nu: {liberal_nu:.3f}")
    print(f"Median nu: {median_nu:.3f}")
    print(f"Recommended nu: {robust_nu:.3f}")
    
    return {
        'conservative': conservative_nu,
        'liberal': liberal_nu,
        'median': median_nu,
        'recommended': robust_nu,
        'all_estimates': nu_estimates
    }

# Apply nu estimation
nu_analysis = estimate_nu_for_one_class_svm(base_df[['Age']].values)
```

**Rules for Nu:**
- **Financial fraud**: 0.01-0.05 (very rare outliers)
- **Customer behavior**: 0.05-0.15 (moderate outliers)
- **Sensor data**: 0.02-0.08 (equipment failures)
- **Medical diagnostics**: 0.01-0.10 (rare conditions)
- **Quality control**: 0.03-0.10 (defects)

#### **2. `gamma` - RBF Kernel Coefficient**

**What it controls:** Influence of each training example (higher = more complex decision boundary)

```python
def estimate_gamma_for_rbf_kernel(X, method='comprehensive'):
    """
    Estimate gamma parameter based on data characteristics
    """
    
    gamma_estimates = {}
    
    # Method 1: Default sklearn approaches
    n_features = X.shape[1]
    gamma_estimates['auto'] = 1.0 / n_features
    gamma_estimates['scale'] = 1.0 / (n_features * X.var())
    
    # Method 2: Data variance analysis
    overall_variance = np.var(X)
    gamma_estimates['inverse_variance'] = 1.0 / overall_variance
    
    # Method 3: Distance-based estimation
    n_samples = min(1000, len(X))
    sample_indices = np.random.choice(len(X), n_samples, replace=False)
    X_sample = X[sample_indices]
    
    # Calculate pairwise distances
    from sklearn.metrics.pairwise import euclidean_distances
    distances = euclidean_distances(X_sample)
    
    # Remove diagonal (zero distances)
    mask = np.ones(distances.shape, dtype=bool)
    np.fill_diagonal(mask, False)
    distances_flat = distances[mask]
    
    # Use statistics of distances
    mean_distance = np.mean(distances_flat)
    median_distance = np.median(distances_flat)
    
    gamma_estimates['inverse_mean_distance'] = 1.0 / (mean_distance ** 2)
    gamma_estimates['inverse_median_distance'] = 1.0 / (median_distance ** 2)
    
    # Method 4: Percentile-based
    distance_90th = np.percentile(distances_flat, 90)
    distance_75th = np.percentile(distances_flat, 75)
    
    gamma_estimates['inverse_90th_percentile'] = 1.0 / (distance_90th ** 2)
    gamma_estimates['inverse_75th_percentile'] = 1.0 / (distance_75th ** 2)
    
    # Method 5: Rule of thumb based on data dimensionality
    if n_features == 1:
        gamma_estimates['rule_of_thumb'] = 1.0
    elif n_features <= 5:
        gamma_estimates['rule_of_thumb'] = 0.1
    elif n_features <= 20:
        gamma_estimates['rule_of_thumb'] = 0.01
    else:
        gamma_estimates['rule_of_thumb'] = 0.001
    
    # Filter extreme values
    valid_gammas = {k: v for k, v in gamma_estimates.items() 
                   if 1e-6 <= v <= 1e3}
    
    if valid_gammas:
        gamma_values = list(valid_gammas.values())
        median_gamma = np.median(gamma_values)
        conservative_gamma = min(gamma_values)
        liberal_gamma = max(gamma_values)
    else:
        median_gamma = conservative_gamma = liberal_gamma = 'auto'
    
    print("Gamma Estimates:")
    for method, estimate in gamma_estimates.items():
        if isinstance(estimate, str):
            print(f"  {method}: {estimate}")
        else:
            print(f"  {method}: {estimate:.6f}")
    
    print(f"\nRecommended gamma: {median_gamma}")
    
    return {
        'recommended': median_gamma,
        'conservative': conservative_gamma,
        'liberal': liberal_gamma,
        'all_estimates': valid_gammas
    }

# Apply gamma estimation
gamma_analysis = estimate_gamma_for_rbf_kernel(base_df[['Age']].values)
```

**Gamma Selection Guidelines:**
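- **Small gamma (0.001-0.01)**: Smooth decision boundary, underfitting risk
- **Medium gamma (0.01-1.0)**: Balanced complexity
- **Large gamma (1.0-100)**: Complex boundary, overfitting risk
- **'auto'**: 1/n_features, independent of the scale of the data
- **'scale'**: Variance-adjusted (1/(n_features × variance)); the scikit-learn default since version 0.22
- **Scale dependence**: These numeric ranges assume standardized features; on raw `Age` (variance ≈ 195), even gamma = 1.0 behaves like a very large value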
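
One rough way to see these guidelines in action is to track the fraction of support vectors as `gamma` grows. This sketch uses synthetic two-dimensional data on unit scale (`X_2d` is an illustrative assumption) and treats the support-vector fraction as a simple proxy for boundary complexity, not as a formal measure.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic 2-D data that is already on unit scale (assumption, not the real dataset)
rng = np.random.default_rng(0)
X_2d = rng.normal(size=(200, 2))

nu_demo = 0.1
sv_fraction_by_gamma = {}
for gamma in (0.01, 1.0, 100.0):
    model = OneClassSVM(nu=nu_demo, kernel="rbf", gamma=gamma).fit(X_2d)
    # More support vectors means the boundary is shaped by more individual points
    sv_fraction_by_gamma[gamma] = len(model.support_) / len(X_2d)
    print(f"gamma={gamma:>6}: support vector fraction = {sv_fraction_by_gamma[gamma]:.3f}")
```

A support-vector fraction that climbs well above `nu` as `gamma` increases is a sign that the model is starting to wrap individual points rather than describe the bulk of the data.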

### 🧪 One-Class SVM Validation Methodologies

#### **1. Cross-Validation for Outlier Detection**

```python
def cross_validate_one_class_svm(X, param_grid, cv_folds=5, scoring_methods=['silhouette', 'stability']):
    """
    Cross-validation specifically designed for One-Class SVM
    """
    
    from sklearn.model_selection import KFold
    from sklearn.metrics import silhouette_score
    
    results = []
    
    # Create parameter combinations
    from sklearn.model_selection import ParameterGrid
    
    for params in ParameterGrid(param_grid):
        fold_results = []
        
        kf = KFold(n_splits=cv_folds, shuffle=True, random_state=42)
        
        for fold_idx, (train_idx, val_idx) in enumerate(kf.split(X)):
            X_train, X_val = X[train_idx], X[val_idx]
            
            try:
                # Fit on training data
                # OneClassSVM takes no random_state argument; passing one raises a TypeError
                model = OneClassSVM(**params)
                model.fit(X_train)
                
                # Predict on both training and validation
                train_pred = model.predict(X_train)
                val_pred = model.predict(X_val)
                
                # Calculate metrics
                fold_metrics = {}
                
                # 1. Silhouette score (if we have both classes)
                if 'silhouette' in scoring_methods:
                    if len(np.unique(val_pred)) > 1:
                        sil_score = silhouette_score(X_val, val_pred)
                        fold_metrics['silhouette'] = sil_score
                    else:
                        fold_metrics['silhouette'] = -1  # Penalty for no outliers
                
                # 2. Stability score (consistency between train/val outlier ratios)
                if 'stability' in scoring_methods:
                    train_outlier_ratio = np.sum(train_pred == -1) / len(train_pred)
                    val_outlier_ratio = np.sum(val_pred == -1) / len(val_pred)
                    stability = 1 - abs(train_outlier_ratio - val_outlier_ratio)
                    fold_metrics['stability'] = stability
                
                # 3. Support vector ratio (model complexity indicator)
                n_support_vectors = len(model.support_vectors_)
                sv_ratio = n_support_vectors / len(X_train)
                fold_metrics['sv_ratio'] = sv_ratio
                
                # 4. Decision function statistics
                train_scores = model.decision_function(X_train)
                val_scores = model.decision_function(X_val)
                
                fold_metrics['mean_train_score'] = np.mean(train_scores)
                fold_metrics['mean_val_score'] = np.mean(val_scores)
                fold_metrics['score_consistency'] = 1 - abs(np.mean(train_scores) - np.mean(val_scores)) / (abs(np.mean(train_scores)) + 1e-6)
                
                fold_results.append(fold_metrics)
                
            except Exception as e:
                print(f"Error in fold {fold_idx} with params {params}: {e}")
                continue
        
        if fold_results:
            # Aggregate fold results
            avg_metrics = {}
            for metric in fold_results[0].keys():
                values = [fold[metric] for fold in fold_results if metric in fold]
                if values:
                    avg_metrics[f'mean_{metric}'] = np.mean(values)
                    avg_metrics[f'std_{metric}'] = np.std(values)
            
            # Calculate composite score
            composite_score = 0
            score_components = 0
            
            if 'mean_silhouette' in avg_metrics and avg_metrics['mean_silhouette'] > -1:
                composite_score += avg_metrics['mean_silhouette'] * 0.4
                score_components += 0.4
            
            if 'mean_stability' in avg_metrics:
                composite_score += avg_metrics['mean_stability'] * 0.3
                score_components += 0.3
            
            if 'mean_score_consistency' in avg_metrics:
                composite_score += avg_metrics['mean_score_consistency'] * 0.2
                score_components += 0.2
            
            # Penalty for too many or too few support vectors
            if 'mean_sv_ratio' in avg_metrics:
                sv_penalty = 0
                if avg_metrics['mean_sv_ratio'] > 0.8:  # Too many SVs
                    sv_penalty = (avg_metrics['mean_sv_ratio'] - 0.8) * 0.5
                elif avg_metrics['mean_sv_ratio'] < 0.1:  # Too few SVs
                    sv_penalty = (0.1 - avg_metrics['mean_sv_ratio']) * 0.5
                composite_score -= sv_penalty
                score_components += 0.1
            
            if score_components > 0:
                composite_score /= score_components
            
            result = {
                'params': params,
                'composite_score': composite_score,
                **avg_metrics
            }
            results.append(result)
    
    return pd.DataFrame(results)

# Define parameter grid for optimization
param_grid = {
    'nu': [0.05, 0.08, 0.1, 0.12, 0.15],
    'gamma': ['auto', 'scale', 0.001, 0.01, 0.1, 1.0],
    'kernel': ['rbf']
}

print("🔄 Running One-Class SVM cross-validation...")
cv_results = cross_validate_one_class_svm(base_df[['Age']].values, param_grid)

if len(cv_results) > 0:
    # Find best parameters
    best_result = cv_results.loc[cv_results['composite_score'].idxmax()]
    
    print(f"\n🏆 Best Parameters:")
    print(f"Parameters: {best_result['params']}")
    print(f"Composite Score: {best_result['composite_score']:.4f}")
    # Format each metric only when it is present, so a missing key prints "N/A" instead of raising
    for label, key in [("Mean Silhouette", "mean_silhouette"),
                       ("Mean Stability", "mean_stability"),
                       ("Support Vector Ratio", "mean_sv_ratio")]:
        value = best_result.get(key)
        print(f"{label}: {value:.4f}" if value is not None else f"{label}: N/A")
else:
    print("❌ No valid parameter combinations found")
```

#### **2. Outlier Stability Analysis**

```python
def analyze_one_class_svm_stability(X, nu, gamma, kernel='rbf', n_trials=10):
    """
    Analyze stability of One-Class SVM outlier detection across multiple runs
    """
    
    stability_metrics = {
        'outlier_ratios': [],
        'outlier_sets': [],
        'decision_scores': [],
        'support_vector_counts': []
    }
    
    for trial in range(n_trials):
        # Draw a bootstrap sample, seeded per trial so the analysis is reproducible
        rng = np.random.default_rng(trial)
        n_samples = len(X)
        bootstrap_indices = rng.choice(n_samples, n_samples, replace=True)
        X_bootstrap = X[bootstrap_indices]
        
        # OneClassSVM is deterministic and takes no random_state argument
        model = OneClassSVM(nu=nu, gamma=gamma, kernel=kernel)
        predictions = model.fit_predict(X_bootstrap)
        scores = model.decision_function(X_bootstrap)
        
        # Map back to original indices
        outlier_mask = predictions == -1
        original_outlier_indices = set(bootstrap_indices[outlier_mask])
        
        # Record metrics
        outlier_ratio = np.sum(predictions == -1) / len(predictions)
        n_support_vectors = len(model.support_vectors_)
        
        stability_metrics['outlier_ratios'].append(outlier_ratio)
        stability_metrics['outlier_sets'].append(original_outlier_indices)
        stability_metrics['decision_scores'].append(scores)
        stability_metrics['support_vector_counts'].append(n_support_vectors)
    
    # Calculate stability statistics
    outlier_ratio_cv = np.std(stability_metrics['outlier_ratios']) / np.mean(stability_metrics['outlier_ratios'])
    
    # Outlier consensus analysis
    all_outliers = set()
    for outlier_set in stability_metrics['outlier_sets']:
        all_outliers.update(outlier_set)
    
    # Count how many times each point was detected as outlier
    outlier_counts = {}
    for outlier_set in stability_metrics['outlier_sets']:
        for outlier_idx in outlier_set:
            outlier_counts[outlier_idx] = outlier_counts.get(outlier_idx, 0) + 1
    
    # Consensus outliers (detected in majority of runs)
    consensus_threshold = n_trials * 0.6  # 60% consensus
    consensus_outliers = [idx for idx, count in outlier_counts.items() 
                         if count >= consensus_threshold]
    
    # Support vector stability
    sv_count_cv = np.std(stability_metrics['support_vector_counts']) / np.mean(stability_metrics['support_vector_counts'])
    
    stability_results = {
        'outlier_ratio_cv': outlier_ratio_cv,
        'sv_count_cv': sv_count_cv,
        'consensus_outliers': consensus_outliers,
        'outlier_detection_counts': outlier_counts,
        'mean_outlier_ratio': np.mean(stability_metrics['outlier_ratios']),
        'outlier_ratio_range': (min(stability_metrics['outlier_ratios']), 
                              max(stability_metrics['outlier_ratios'])),
        'mean_sv_count': np.mean(stability_metrics['support_vector_counts'])
    }
    
    print(f"Stability Analysis (nu={nu}, gamma={gamma}):")
    print(f"  Outlier ratio CV: {outlier_ratio_cv:.4f} (lower is more stable)")
    print(f"  Support vector CV: {sv_count_cv:.4f} (lower is more stable)")
    print(f"  Consensus outliers: {len(consensus_outliers)} points")
    print(f"  Mean outlier ratio: {stability_results['mean_outlier_ratio']:.3f}")
    print(f"  Outlier ratio range: {stability_results['outlier_ratio_range'][0]:.3f} - {stability_results['outlier_ratio_range'][1]:.3f}")
    
    return stability_results

# Test stability with best parameters
if len(cv_results) > 0:
    best_params = best_result['params']
    stability_results = analyze_one_class_svm_stability(
        base_df[['Age']].values,
        best_params['nu'],
        best_params['gamma']
    )
```
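
Because a point is drawn into only about 63% of bootstrap resamples, raw detection counts measured against `n_trials` understate how consistently it is flagged. The sketch below shows one possible way to normalize detections by appearances. The function name `bootstrap_outlier_frequency`, the synthetic data, and the 0.6 cutoff are illustrative assumptions, not part of the analysis above.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def bootstrap_outlier_frequency(X, nu=0.05, gamma="scale", n_trials=30, seed=0):
    """Fraction of bootstrap appearances in which each point is flagged (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_samples = len(X)
    appearances = np.zeros(n_samples)
    detections = np.zeros(n_samples)

    for _ in range(n_trials):
        idx = rng.choice(n_samples, n_samples, replace=True)
        preds = OneClassSVM(nu=nu, gamma=gamma, kernel="rbf").fit_predict(X[idx])

        # Count each original point at most once per trial
        present = np.unique(idx)
        flagged = np.unique(idx[preds == -1])
        appearances[present] += 1
        detections[flagged] += 1

    # Points that were never drawn get NaN rather than a misleading 0
    freq = np.full(n_samples, np.nan)
    seen = appearances > 0
    freq[seen] = detections[seen] / appearances[seen]
    return freq

# Synthetic one-dimensional data with two extreme values (illustrative only)
rng = np.random.default_rng(42)
X_demo = np.concatenate([rng.normal(40, 5, 200), [95.0, 110.0]]).reshape(-1, 1)

freq = bootstrap_outlier_frequency(X_demo)
consensus_idx = np.where(freq >= 0.6)[0]  # 0.6 is an assumed cutoff
print(f"Consensus outliers (normalized): {len(consensus_idx)}")
```

With this normalization, a 60% cutoff means "flagged in 60% of the resamples that contained the point", which is closer to the intent of a majority vote.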

#### **3. Business Validation and Interpretability**

```python
def validate_one_class_svm_business_logic(X, feature_names, model, outlier_predictions):
    """
    Validate One-Class SVM results from business perspective
    """
    
    outlier_indices = np.where(outlier_predictions == -1)[0]
    normal_indices = np.where(outlier_predictions == 1)[0]
    
    validation_results = {}
    
    # 1. Feature distribution analysis
    print("📊 Business Validation Results:")
    print("=" * 50)
    
    for i, feature_name in enumerate(feature_names):
        outlier_values = X[outlier_indices, i]
        normal_values = X[normal_indices, i]
        
        if len(outlier_values) > 0 and len(normal_values) > 0:
            # Statistical tests
            from scipy.stats import mannwhitneyu
            stat, p_value = mannwhitneyu(outlier_values, normal_values, alternative='two-sided')
            
            # Robust effect size: median difference scaled by the pooled median
            # absolute deviation (a rough non-parametric analogue of Cohen's d)
            outlier_median = np.median(outlier_values)
            normal_median = np.median(normal_values)
            pooled_values = np.concatenate([outlier_values, normal_values])
            pooled_mad = np.median(np.abs(pooled_values - np.median(pooled_values)))
            effect_size = abs(outlier_median - normal_median) / (pooled_mad + 1e-6)
            
            validation_results[feature_name] = {
                'outlier_median': outlier_median,
                'normal_median': normal_median,
                'difference': outlier_median - normal_median,
                'effect_size': effect_size,
                'mannwhitney_p': p_value,
                'significant_difference': p_value < 0.05
            }
            
            print(f"\n{feature_name}:")
            print(f"  Outlier median: {outlier_median:.2f}")
            print(f"  Normal median: {normal_median:.2f}")
            print(f"  Difference: {outlier_median - normal_median:.2f}")
            print(f"  Effect size: {effect_size:.3f}")
            print(f"  Statistical significance: {'Yes' if p_value < 0.05 else 'No'} (p={p_value:.4f})")
    
    # 2. Decision boundary analysis
    decision_scores = model.decision_function(X)
    
    print(f"\n🎯 Decision Boundary Analysis:")
    print(f"  Decision score range: {decision_scores.min():.3f} to {decision_scores.max():.3f}")
    print(f"  Mean score (outliers): {decision_scores[outlier_indices].mean():.3f}")
    print(f"  Mean score (normal): {decision_scores[normal_indices].mean():.3f}")
    
    # 3. Model complexity assessment
    n_support_vectors = len(model.support_vectors_)
    sv_ratio = n_support_vectors / len(X)
    
    print(f"\n🔧 Model Complexity:")
    print(f"  Support vectors: {n_support_vectors} ({sv_ratio:.1%} of data)")
    
    if sv_ratio > 0.8:
        print("  ⚠️  Warning: Very high support vector ratio - model may be overfitting")
    elif sv_ratio < 0.1:
        print("  ⚠️  Warning: Very low support vector ratio - model may be underfitting")
    else:
        print("  ✅ Support vector ratio looks reasonable")
    
    # 4. Outlier characteristics summary
    print(f"\n📈 Outlier Summary:")
    print(f"  Total outliers: {len(outlier_indices)} ({len(outlier_indices)/len(X)*100:.1f}%)")
    print(f"  Target outlier fraction (nu, an upper bound): {model.nu*100:.1f}%")
    
    ratio_difference = abs(len(outlier_indices)/len(X) - model.nu)
    if ratio_difference > 0.05:
        print(f"  ⚠️  Warning: Large difference between expected and actual outlier ratio")
    else:
        print(f"  ✅ Outlier ratio close to expected value")
    
    return validation_results

# Apply business validation
if len(cv_results) > 0:
    best_model = OneClassSVM(**best_result['params'])  # OneClassSVM takes no random_state
    best_predictions = best_model.fit_predict(base_df[['Age']].values)
    
    business_validation = validate_one_class_svm_business_logic(
        base_df[['Age']].values,
        ['Age'],
        best_model,
        best_predictions
    )
```

### 📝 One-Class SVM Parameter Validation Checklist

#### **✅ Parameters are Well-Tuned When:**

1. **Nu Parameter Validation:**
   - Actual outlier ratio within ±0.05 (5 percentage points) of `nu`
   - Outliers are statistically different from normal points
   - Business logic confirms outliers make sense

2. **Gamma Parameter Validation:**
   - High cross-validation stability
   - Reasonable support vector ratio (10%-80%)
   - Good silhouette score (>0.3)

3. **Overall Model Validation:**
   - Consistent results across bootstrap resamples (the solver itself is deterministic)
   - Statistical significance in feature differences
   - Support vector count stable across runs

#### **🚨 Red Flags (Poor Tuning):**

- **Unstable outlier detection**: Different outliers across runs
- **Extreme support vector ratios**: <5% or >90%
- **No statistical difference**: Outliers statistically similar to normal points
- **Business contradiction**: Outliers don't make domain sense
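
The measurable red flags above can be collected into a small helper. This is a minimal sketch: the function name `tuning_red_flags` and the `cv_limit` cutoff for instability are assumptions, while the 5%/90% support vector limits, the 0.05 significance level, and the 0.05 ratio tolerance come from the checklist and code above. The business check still needs a human reviewer.

```python
def tuning_red_flags(outlier_ratio, nu, sv_ratio, p_value, outlier_ratio_cv,
                     cv_limit=0.5):
    """Return red-flag messages; an empty list means none were triggered."""
    flags = []
    if outlier_ratio_cv > cv_limit:  # cv_limit is an assumed cutoff
        flags.append("Unstable outlier detection")
    if sv_ratio < 0.05 or sv_ratio > 0.90:
        flags.append("Extreme support vector ratio")
    if p_value >= 0.05:
        flags.append("No statistical difference")
    if abs(outlier_ratio - nu) > 0.05:
        flags.append("Outlier ratio far from nu")
    return flags

# Made-up metric values for illustration
healthy = tuning_red_flags(outlier_ratio=0.06, nu=0.05, sv_ratio=0.12,
                           p_value=0.001, outlier_ratio_cv=0.1)
poor = tuning_red_flags(outlier_ratio=0.20, nu=0.05, sv_ratio=0.95,
                        p_value=0.40, outlier_ratio_cv=0.8)
print("Healthy run flags:", healthy)
print("Poor run flags:", poor)
```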

### 🎯 Complete One-Class SVM Optimization Pipeline

```python
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

def optimize_one_class_svm_complete_pipeline(X, feature_names=None):
    """
    Complete pipeline for One-Class SVM optimization and validation
    """
    
    if feature_names is None:
        feature_names = [f'Feature_{i}' for i in range(X.shape[1])]
    
    print("🔍 Step 1: Data preprocessing...")
    # Standardize features for SVM
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    print("📊 Step 2: Parameter estimation...")
    nu_analysis = estimate_nu_for_one_class_svm(X_scaled)
    gamma_analysis = estimate_gamma_for_rbf_kernel(X_scaled)
    
    print("🔄 Step 3: Cross-validation optimization...")
    param_grid = {
        'nu': [nu_analysis['conservative'], nu_analysis['median'], nu_analysis['liberal']],
        'gamma': [gamma_analysis['conservative'], gamma_analysis['recommended'], gamma_analysis['liberal']],
        'kernel': ['rbf']
    }
    
    cv_results = cross_validate_one_class_svm(X_scaled, param_grid)
    
    if len(cv_results) > 0:
        best_result = cv_results.loc[cv_results['composite_score'].idxmax()]
        
        print("🧪 Step 4: Stability analysis...")
        stability_results = analyze_one_class_svm_stability(
            X_scaled, 
            best_result['params']['nu'], 
            best_result['params']['gamma']
        )
        
        print("💼 Step 5: Business validation...")
        final_model = OneClassSVM(**best_result['params'])
        final_predictions = final_model.fit_predict(X_scaled)
        
        # Pass the scaled data: the model was fit on X_scaled, so its decision
        # scores are only meaningful there (medians are in standardized units)
        business_validation = validate_one_class_svm_business_logic(
            X_scaled, feature_names, final_model, final_predictions
        )
        
        print(f"\n🏆 Final Recommended Parameters:")
        print(f"nu: {best_result['params']['nu']}")
        print(f"gamma: {best_result['params']['gamma']}")
        print(f"kernel: {best_result['params']['kernel']}")
        print(f"Composite Score: {best_result['composite_score']:.4f}")
        print(f"Stability (outlier ratio CV): {stability_results['outlier_ratio_cv']:.4f}")
        
        return {
            'recommended_params': best_result['params'],
            'final_model': final_model,
            'scaler': scaler,
            'cv_results': cv_results,
            'stability_results': stability_results,
            'business_validation': business_validation
        }
    else:
        print("❌ No suitable parameters found")
        return None

# Apply complete pipeline
optimal_svm_results = optimize_one_class_svm_complete_pipeline(
    base_df[['Age']].values,
    feature_names=['Age']
)
```
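
The pipeline returns the fitted `scaler` alongside `final_model` because new observations must pass through the same scaling before they are scored. The sketch below illustrates that step on synthetic ages; the data, parameter values, and variable names are assumptions for illustration, not output from the pipeline above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Synthetic training ages (illustrative only)
rng = np.random.default_rng(0)
ages_train = rng.normal(40, 10, size=(200, 1))

# Fit the scaler and the model on the training data
scaler = StandardScaler().fit(ages_train)
model = OneClassSVM(nu=0.05, gamma="scale", kernel="rbf")
model.fit(scaler.transform(ages_train))

# Score new observations: reuse the fitted scaler, never refit it
new_ages = np.array([[38.0], [120.0]])
new_scaled = scaler.transform(new_ages)
labels = model.predict(new_scaled)            # 1 = normal, -1 = outlier
scores = model.decision_function(new_scaled)  # lower = more anomalous

for age, label, score in zip(new_ages.ravel(), labels, scores):
    print(f"Age {age:.0f}: label={label}, score={score:.3f}")
```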

### 🎯 Summary: One-Class SVM Parameter Tuning Best Practices

#### **🔧 Parameter Selection Strategy:**

**For `nu`:**
1. **Start with statistical estimates** - IQR, Z-score, Mahalanobis distance
2. **Consider domain knowledge** - typical outlier rates in your field
3. **Validate with business logic** - check if outliers make sense
4. **Test range around estimates** - ±0.05 from initial estimate
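
As a rough illustration of steps 1 and 4, the sketch below derives a starting `nu` from the IQR rule and builds a small range around it. The helper names `iqr_outlier_fraction` and `candidate_nu_values` and the clipping bounds are assumptions for this example, not the estimation function used earlier in this notebook.

```python
import numpy as np

def iqr_outlier_fraction(x, k=1.5):
    """Share of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return float(np.mean((x < lower) | (x > upper)))

def candidate_nu_values(x, delta=0.05, floor=0.01, ceil=0.5):
    """Candidate nu values: the IQR estimate plus/minus delta, clipped to a valid range."""
    base = iqr_outlier_fraction(x)
    candidates = [base - delta, base, base + delta]
    # nu must lie in (0, 1]; the floor and ceil bounds here are assumed, not prescribed
    return sorted({round(float(np.clip(c, floor, ceil)), 3) for c in candidates})

# Toy data: 20 ordinary values and one extreme value
values = list(range(1, 21)) + [100]
fraction = iqr_outlier_fraction(values)
nu_grid = candidate_nu_values(values)
print(f"IQR outlier fraction: {fraction:.4f}")
print(f"Candidate nu values: {nu_grid}")
```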

**For `gamma`:**
1. **Use data-driven estimates** - distance-based or variance-based
2. **Start with 'scale' or 'auto'** - good defaults
3. **Cross-validate thoroughly** - gamma is very sensitive
4. **Check support vector ratio** - roughly 10%-80% as a rule of thumb (it cannot fall below `nu`)
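
To make the gamma defaults concrete: in recent scikit-learn versions, `gamma='scale'` resolves to `1 / (n_features * X.var())` and `gamma='auto'` to `1 / n_features`. The sketch below computes both and sweeps a few multiples of the `'scale'` value to show how the support vector ratio responds. The synthetic data and the multipliers are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic two-feature data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))

n_features = X_demo.shape[1]
gamma_scale = 1.0 / (n_features * X_demo.var())  # what gamma='scale' resolves to
gamma_auto = 1.0 / n_features                    # what gamma='auto' resolves to
print(f"gamma='scale' -> {gamma_scale:.4f}, gamma='auto' -> {gamma_auto:.4f}")

# Sweep gamma around the 'scale' value and record the support vector ratio
sv_ratios = {}
for factor in (0.1, 1.0, 10.0, 100.0):
    gamma = gamma_scale * factor
    model = OneClassSVM(nu=0.05, gamma=gamma, kernel="rbf").fit(X_demo)
    sv_ratios[gamma] = len(model.support_vectors_) / len(X_demo)
    print(f"gamma={gamma:.4f}: support vector ratio = {sv_ratios[gamma]:.1%}")
```

Larger gamma values make the kernel more local, which typically raises the support vector ratio; a ratio approaching 100% suggests the boundary is wrapping individual points.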

#### **🧪 Validation Methodology:**

1. **Cross-validation with composite scoring** - stability + silhouette + consistency
2. **Bootstrap stability analysis** - consistent outlier detection
3. **Statistical validation** - outliers statistically different
4. **Business logic validation** - outliers make domain sense
5. **Model complexity checks** - reasonable support vector ratio

The key to successful One-Class SVM tuning is **combining statistical rigor with business validation** - parameters should optimize both mathematical metrics AND produce interpretable, actionable results! 🎯