## 🔍 Grid Search: Exhaustive Exploration

### What is Grid Search?

**Grid Search** systematically evaluates every possible combination of hyperparameters in a predefined grid. It's the most thorough but also most computationally expensive approach.

### Mathematical Formulation

Given:
- Hyperparameters: $\lambda_1, \lambda_2, ..., \lambda_k$
- Value sets: $V_1, V_2, ..., V_k$
- Cross-validation folds: $K$

**Search space size**: $|V_1| \times |V_2| \times ... \times |V_k|$

**Total evaluations**: $K \times \prod_{i=1}^{k} |V_i|$

**Example**:
- 3 hyperparameters: n_estimators ∈ {50, 100, 150}, max_depth ∈ {5, 10, 15}, min_samples_split ∈ {2, 5, 10}
- Grid size: 3 × 3 × 3 = 27 configurations
- With 5-fold CV: 27 × 5 = **135 model trainings**
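
The count in this example can be checked by enumerating the grid directly. The following is a minimal sketch using only the standard library; the helper name `count_grid` is illustrative and not part of scikit-learn.

```python
from itertools import product
from math import prod

# Grid from the example above
param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 5, 10],
}


def count_grid(param_grid, n_folds):
    """Return (number of configurations, total model trainings)."""
    n_configs = prod(len(values) for values in param_grid.values())
    return n_configs, n_configs * n_folds


n_configs, total_fits = count_grid(param_grid, n_folds=5)

# Enumerating the Cartesian product yields the same number of configurations
all_configs = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]

print(n_configs, total_fits, len(all_configs))
```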

### When to Use Grid Search

| **Use Grid Search When** | **Avoid Grid Search When** |
|--------------------------|----------------------------|
| ✅ Small search space (<100 configs) | ❌ Large search space (>1000 configs) |
| ✅ Discrete hyperparameters | ❌ Continuous hyperparameters |
| ✅ Known promising ranges | ❌ No prior knowledge |
| ✅ Computational budget allows | ❌ Limited time/resources |
| ✅ Need reproducibility | ❌ Exploratory tuning |
| ✅ Final optimization (narrow range) | ❌ Initial exploration (wide range) |

### Advantages

1. **Exhaustive**: Guaranteed to find best configuration in the grid
2. **Reproducible**: The same grid always yields the same candidate set, and the same result when the model and CV splits use fixed seeds
3. **Parallelizable**: All configurations independent
4. **Simple**: Easy to understand and implement

### Disadvantages

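1. **Exponential cost**: Grid size multiplies by the number of candidate values with each added hyperparameter ($v^k$ configurations for $k$ hyperparameters with $v$ values each)
2. **Inefficient**: Wastes time on unpromising regions
3. **Discrete only**: Must discretize continuous parameters
4. **Curse of dimensionality**: Quickly becomes impractical beyond roughly 5 hyperparameters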
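
To see how quickly a full grid grows, assume every hyperparameter gets the same number of candidate values $v$; the grid then has $v^k$ configurations for $k$ hyperparameters. A small sketch (the function name `grid_size` is illustrative):

```python
def grid_size(values_per_param, n_params):
    """Configurations in a full grid with the same number of values on every axis."""
    return values_per_param ** n_params


for k in range(1, 9):
    print(f"{k} hyperparameters x 3 values each -> {grid_size(3, k):>5} configs")
```

With 5-fold CV, each of these counts is multiplied by 5 again to get the number of model trainings.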

### Grid Search Algorithm

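```python
from itertools import product

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold


def grid_search(estimator, param_grid, X, y, n_splits=5):
    """Exhaustive search over param_grid; X and y are NumPy arrays."""
    best_score, best_params = -np.inf, None
    names = list(param_grid)
    folds = list(KFold(n_splits=n_splits, shuffle=True, random_state=42).split(X))

    for values in product(*(param_grid[name] for name in names)):
        config = dict(zip(names, values))
        scores = []
        for train_idx, val_idx in folds:
            # Train a fresh copy of the model on this fold's training split
            model = clone(estimator).set_params(**config)
            model.fit(X[train_idx], y[train_idx])
            # Evaluate with the estimator's default metric
            scores.append(model.score(X[val_idx], y[val_idx]))

        avg_score = np.mean(scores)
        if avg_score > best_score:
            best_score, best_params = avg_score, config

    return best_params, best_score
```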

### Practical Example: Random Forest Tuning

**Scenario**: Tune Random Forest for semiconductor yield prediction

**Hyperparameters**:
- `n_estimators`: Number of trees
- `max_depth`: Maximum tree depth
- `min_samples_split`: Minimum samples to split node
- `max_features`: Features to consider for split

**Grid definition**:
```python
param_grid = {
    'n_estimators': [50, 100, 200],        # 3 values
    'max_depth': [5, 10, 15, None],        # 4 values
    'min_samples_split': [2, 5, 10],       # 3 values
    'max_features': ['sqrt', 'log2']       # 2 values
}
# Total: 3 × 4 × 3 × 2 = 72 configurations
```

**With 5-fold CV**: 72 × 5 = **360 model trainings**

**Time estimate**:
- Training time per model: 30 seconds
- Total time: 360 × 30s = 10,800s = **3 hours**

### Semiconductor-Specific Grid Design

#### Yield Prediction (Random Forest)
```python
param_grid = {
    'n_estimators': [100, 200, 300],       # More trees → better, diminishing returns
    'max_depth': [10, 15, 20],             # Prevent overfitting to specific wafers
    'min_samples_leaf': [5, 10, 20],       # Ensure statistical significance
    'max_features': [0.3, 0.5, 0.7]        # Feature subset for diversity
}
# 3 × 3 × 3 × 3 = 81 configs
```

#### Test Time Prediction (Gradient Boosting)
```python
param_grid = {
    'n_estimators': [50, 100, 150],        # Boosting iterations
    'learning_rate': [0.01, 0.05, 0.1],    # Step size (log scale)
    'max_depth': [3, 5, 7],                # Tree depth (shallow for boosting)
    'subsample': [0.8, 0.9, 1.0]           # Row sampling for diversity
}
# 3 × 3 × 3 × 3 = 81 configs
```

### Coarse-to-Fine Grid Search Strategy

**Problem**: Don't know good ranges initially

**Solution**: Two-stage grid search

#### Stage 1: Coarse Grid (Wide Range, Few Values)
```python
coarse_grid = {
    'n_estimators': [50, 200, 500],        # Wide range, sparse
    'max_depth': [3, 10, 20],
    'learning_rate': [0.001, 0.01, 0.1]
}
# 3 × 3 × 3 = 27 configs (fast exploration)
```

**Result**: Best config has n_estimators=200, max_depth=10, learning_rate=0.01

#### Stage 2: Fine Grid (Narrow Range, More Values)
```python
fine_grid = {
    'n_estimators': [150, 175, 200, 225, 250],   # Narrow range, dense
    'max_depth': [8, 9, 10, 11, 12],
    'learning_rate': [0.008, 0.01, 0.012]
}
# 5 × 5 × 3 = 75 configs (refined search)
```

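**Total**: 27 + 75 = 102 configs, fewer than the 5 × 5 × 5 = 125 of a single five-value-per-parameter grid over the wide range, and with much finer resolution around the best region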
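
The fine grid can be derived from the coarse-stage winner rather than typed by hand. The sketch below is one way to do that, not a prescribed method: the helper `refine_axis` and the half-widths (50, 2, 0.002) are assumptions chosen to reproduce the fine grid shown above, and `n_points` is assumed to be at least 2.

```python
from math import prod


def refine_axis(center, half_width, n_points, as_int=False):
    """Evenly spaced values in [center - half_width, center + half_width]."""
    step = 2 * half_width / (n_points - 1)
    values = [center - half_width + i * step for i in range(n_points)]
    return [int(round(v)) for v in values] if as_int else [round(v, 6) for v in values]


# Best configuration reported by the coarse stage
coarse_best = {"n_estimators": 200, "max_depth": 10, "learning_rate": 0.01}

fine_grid = {
    "n_estimators": refine_axis(coarse_best["n_estimators"], 50, 5, as_int=True),
    "max_depth": refine_axis(coarse_best["max_depth"], 2, 5, as_int=True),
    "learning_rate": refine_axis(coarse_best["learning_rate"], 0.002, 3),
}

n_fine = prod(len(v) for v in fine_grid.values())
print(fine_grid)
print("fine configs:", n_fine, "| total with 27 coarse configs:", 27 + n_fine)
```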

### Grid Search with Nested CV (Unbiased Evaluation)

**Problem**: GridSearchCV reports optimistic `best_score_` (data snooping)

**Solution**: Nested CV (see notebook 043)
- Outer loop: Estimates true performance
- Inner loop (Grid Search): Tunes hyperparameters

**Implementation**:
```python
outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    
    # Inner loop: Grid Search for hyperparameter tuning
    grid_search = GridSearchCV(model, param_grid, cv=inner_cv)
    grid_search.fit(X_train, y_train)
    
    # Outer loop: Evaluate on test set (unbiased)
    score = grid_search.best_estimator_.score(X_test, y_test)
    outer_scores.append(score)

# Report: mean(outer_scores) is unbiased estimate
```
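
A compact, self-contained variant of the same idea passes the `GridSearchCV` object itself to `cross_val_score`, which refits the whole search on every outer training split. This is a sketch under stated assumptions: the synthetic data, the small grid, and the choice of `RandomForestClassifier` are placeholders for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in data (assumption: replace with the real X, y)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: GridSearchCV tunes hyperparameters on each outer training split
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=inner_cv
)

# Outer loop: cross_val_score clones and refits the whole search per outer fold
outer_scores = cross_val_score(grid_search, X, y, cv=outer_cv)

print(f"Nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

Each outer score comes from data the inner search never saw, which is why their mean is a fairer estimate than `best_score_`.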

### Parallel Grid Search

**Speedup**: Use multiple CPU cores

```python
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,      # Use all available cores
    verbose=2       # Show progress
)
```

**Speedup calculation**:
- 8 cores → up to ~8× speedup (near-linear scaling is the best case; scheduling overhead and memory limits usually reduce it)
- 72 configs × 5 folds = 360 trainings
- Serial: 3 hours → Parallel (8 cores): ~22.5 minutes at ideal scaling
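
The arithmetic behind this estimate can be written as a small helper. It is an idealized sketch: it assumes every fit takes the same time, fits are spread evenly over workers, and there is no scheduling or memory overhead. The name `estimate_wall_time` is illustrative.

```python
from math import ceil


def estimate_wall_time(n_configs, n_folds, seconds_per_fit, n_workers=1):
    """Idealized wall-clock seconds: fits are split evenly, with no overhead."""
    total_fits = n_configs * n_folds
    fits_per_worker = ceil(total_fits / n_workers)
    return fits_per_worker * seconds_per_fit


serial = estimate_wall_time(72, 5, 30)
parallel = estimate_wall_time(72, 5, 30, n_workers=8)

print(f"serial:   {serial / 3600:.1f} h")
print(f"8 cores:  {parallel / 60:.1f} min")
```

Real runs usually land somewhat above the parallel figure, so treat it as a lower bound when planning.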

### Common Grid Search Pitfalls

#### ❌ Pitfall 1: Too Fine Initially
**Problem**: Start with dense grid over wide range → thousands of configs
**Solution**: Coarse-to-fine strategy

#### ❌ Pitfall 2: Uniform Grid for Log-Scale Parameters
**Problem**: learning_rate ∈ [0.001, 0.002, 0.003, ..., 0.1] → waste time on similar values
**Solution**: Log-scale grid [0.001, 0.01, 0.1] or use Random Search
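
The difference between the two grids is easy to quantify. In this sketch, `log_grid` and `linear_grid` are illustrative helpers (not library functions) that build the log-spaced and evenly spaced candidate lists for the same range:

```python
def log_grid(low_exp, high_exp, base=10):
    """One value per power of `base`, from base**low_exp to base**high_exp inclusive."""
    return [float(base) ** e for e in range(low_exp, high_exp + 1)]


def linear_grid(low, high, step):
    """Evenly spaced values from low to high inclusive."""
    n = int(round((high - low) / step)) + 1
    return [round(low + i * step, 10) for i in range(n)]


log_values = log_grid(-3, -1)                    # three decades, three values
linear_values = linear_grid(0.001, 0.1, 0.001)   # evenly spaced grid over the same range

print(log_values)
print(len(linear_values), "linear values vs", len(log_values), "log-spaced values")
```

Most of the evenly spaced values sit in the top decade, where neighbouring learning rates behave almost identically.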

#### ❌ Pitfall 3: Not Parallelizing
**Problem**: Run serially on multi-core machine
**Solution**: Set `n_jobs=-1`

#### ❌ Pitfall 4: Reporting Inner CV Score
**Problem**: Report `best_score_` (optimistic bias)
**Solution**: Use nested CV for unbiased estimate

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
import matplotlib.pyplot as plt
import seaborn as sns
import time
from itertools import product

np.random.seed(42)

class GridSearchAnalyzer:
    """
    Comprehensive Grid Search implementation with visualization.
    """
    
    def __init__(self, estimator, param_grid, cv=5, scoring='accuracy', n_jobs=-1):
        """
        Args:
            estimator: Sklearn model
            param_grid: Dictionary of hyperparameter grids
            cv: Cross-validation strategy
            scoring: Metric to optimize
            n_jobs: Number of parallel jobs (-1 = all cores)
        """
        self.estimator = estimator
        self.param_grid = param_grid
        self.cv = cv
        self.scoring = scoring
        self.n_jobs = n_jobs
        self.grid_search = None
        self.results_df = None
    
    def fit(self, X, y):
        """
        Perform grid search with timing and analysis.
        """
        print("="*80)
        print("GRID SEARCH: EXHAUSTIVE HYPERPARAMETER OPTIMIZATION")
        print("="*80)
        
        # Calculate search space size
        n_configs = np.prod([len(v) for v in self.param_grid.values()])
        n_folds = self.cv if isinstance(self.cv, int) else self.cv.get_n_splits()
        total_fits = n_configs * n_folds
        
        print(f"\nSearch Space:")
        for param, values in self.param_grid.items():
            print(f"  {param}: {values} ({len(values)} values)")
        print(f"\nTotal configurations: {n_configs}")
        print(f"Cross-validation folds: {n_folds}")
        print(f"Total model trainings: {total_fits}")
        print(f"Parallel jobs: {self.n_jobs}")
        
        # Perform grid search
        print(f"\nStarting Grid Search...")
        start_time = time.time()
        
        self.grid_search = GridSearchCV(
            estimator=self.estimator,
            param_grid=self.param_grid,
            cv=self.cv,
            scoring=self.scoring,
            n_jobs=self.n_jobs,
            verbose=0,
            return_train_score=True
        )
        
        self.grid_search.fit(X, y)
        
        elapsed_time = time.time() - start_time
        
        # Extract results
        results = pd.DataFrame(self.grid_search.cv_results_)
        
        print(f"\n✅ Grid Search completed in {elapsed_time:.2f} seconds")
        print(f"   Time per configuration: {elapsed_time / n_configs:.2f} seconds")
        print(f"   Time per fit: {elapsed_time / total_fits:.2f} seconds")
        
        print(f"\nBest Configuration:")
        for param, value in self.grid_search.best_params_.items():
            print(f"  {param}: {value}")
        
        print(f"\nBest Score: {self.grid_search.best_score_:.6f}")
        print(f"Best Train Score: {results.loc[self.grid_search.best_index_, 'mean_train_score']:.6f}")
        print(f"Overfitting gap: {results.loc[self.grid_search.best_index_, 'mean_train_score'] - self.grid_search.best_score_:.6f}")
        
        # Store results for visualization
        self.results_df = results
        
        return self
    
    def plot_results(self, figsize=(16, 12)):
        """
        Visualize grid search results with multiple plots.
        """
        if self.results_df is None:
            raise ValueError("Must call fit() before plot_results()")
        
        results = self.results_df
        
        fig, axes = plt.subplots(2, 2, figsize=figsize)
        
        # Plot 1: Score distribution
        axes[0, 0].hist(results['mean_test_score'], bins=30, edgecolor='black', alpha=0.7)
        axes[0, 0].axvline(self.grid_search.best_score_, color='red', 
                          linestyle='--', linewidth=2, label=f'Best: {self.grid_search.best_score_:.4f}')
        axes[0, 0].set_xlabel('Mean Test Score', fontsize=11, fontweight='bold')
        axes[0, 0].set_ylabel('Frequency', fontsize=11, fontweight='bold')
        axes[0, 0].set_title('Score Distribution Across All Configurations', 
                            fontsize=12, fontweight='bold')
        axes[0, 0].legend()
        axes[0, 0].grid(alpha=0.3)
        
        # Plot 2: Top 10 configurations
        top_10 = results.nsmallest(10, 'rank_test_score')[['params', 'mean_test_score', 'std_test_score']]
        config_labels = [f"Config {i+1}" for i in range(len(top_10))]
        
        axes[0, 1].barh(config_labels, top_10['mean_test_score'], 
                       xerr=top_10['std_test_score'], alpha=0.7, edgecolor='black')
        axes[0, 1].set_xlabel('Mean Test Score', fontsize=11, fontweight='bold')
        axes[0, 1].set_ylabel('Configuration', fontsize=11, fontweight='bold')
        axes[0, 1].set_title('Top 10 Configurations\n(with standard deviation)', 
                            fontsize=12, fontweight='bold')
        axes[0, 1].grid(alpha=0.3, axis='x')
        axes[0, 1].invert_yaxis()
        
        # Plot 3: Overfitting analysis (train vs test)
        axes[1, 0].scatter(results['mean_train_score'], results['mean_test_score'], 
                          alpha=0.6, s=50)
        
        # Add diagonal line (perfect generalization)
        min_score = min(results['mean_train_score'].min(), results['mean_test_score'].min())
        max_score = max(results['mean_train_score'].max(), results['mean_test_score'].max())
        axes[1, 0].plot([min_score, max_score], [min_score, max_score], 
                       'r--', linewidth=2, label='Perfect Generalization')
        
        # Highlight best configuration
        best_idx = self.grid_search.best_index_
        axes[1, 0].scatter(results.loc[best_idx, 'mean_train_score'], 
                          results.loc[best_idx, 'mean_test_score'],
                          color='red', s=200, marker='*', 
                          edgecolor='black', linewidth=2, label='Best Config')
        
        axes[1, 0].set_xlabel('Mean Train Score', fontsize=11, fontweight='bold')
        axes[1, 0].set_ylabel('Mean Test Score', fontsize=11, fontweight='bold')
        axes[1, 0].set_title('Overfitting Analysis\n(Points below line = overfitting)', 
                            fontsize=12, fontweight='bold')
        axes[1, 0].legend()
        axes[1, 0].grid(alpha=0.3)
        
        # Plot 4: Hyperparameter importance (if possible to visualize)
        # For simplicity, show score vs first hyperparameter
        first_param = list(self.param_grid.keys())[0]
        
        if len(self.param_grid[first_param]) > 1:
            # Group by first parameter
            param_scores = results.groupby(f'param_{first_param}')['mean_test_score'].agg(['mean', 'std'])
            
            axes[1, 1].bar(range(len(param_scores)), param_scores['mean'], 
                          yerr=param_scores['std'], alpha=0.7, edgecolor='black')
            axes[1, 1].set_xticks(range(len(param_scores)))
            axes[1, 1].set_xticklabels(param_scores.index, rotation=45, ha='right')
            axes[1, 1].set_xlabel(first_param, fontsize=11, fontweight='bold')
            axes[1, 1].set_ylabel('Mean Test Score', fontsize=11, fontweight='bold')
            axes[1, 1].set_title(f'Performance vs {first_param}\n(averaged over other params)', 
                                fontsize=12, fontweight='bold')
            axes[1, 1].grid(alpha=0.3, axis='y')
        else:
            axes[1, 1].text(0.5, 0.5, 'Not enough\nhyperparameter values\nfor visualization', 
                           ha='center', va='center', fontsize=14, transform=axes[1, 1].transAxes)
            axes[1, 1].set_title('Hyperparameter Analysis', fontsize=12, fontweight='bold')
        
        plt.tight_layout()
        plt.show()
    
    def get_top_configs(self, n=5):
        """
        Return top N configurations.
        """
        if self.results_df is None:
            raise ValueError("Must call fit() before get_top_configs()")
        
        top_n = self.results_df.nsmallest(n, 'rank_test_score')
        
        print(f"\nTop {n} Configurations:")
        print("="*80)
        for i, (idx, row) in enumerate(top_n.iterrows(), 1):
            print(f"\nRank {i}:")
            print(f"  Score: {row['mean_test_score']:.6f} ± {row['std_test_score']:.6f}")
            print(f"  Params: {row['params']}")
        
        return top_n[['params', 'mean_test_score', 'std_test_score']]


# Example usage
if __name__ == "__main__":
    print("\nEXAMPLE: Grid Search for Semiconductor Defect Detection\n")
    
    # Generate imbalanced semiconductor defect data
    X, y = make_classification(
        n_samples=2000,
        n_features=15,
        n_informative=12,
        n_redundant=3,
        n_classes=2,
        weights=[0.95, 0.05],  # 5% defect rate (imbalanced)
        random_state=42
    )
    
    print(f"Dataset: {len(y)} devices")
    print(f"Features: 15 parametric measurements")
    print(f"Target: Defect detection (imbalanced)")
    print(f"Class distribution: {(y==0).sum()} good ({(y==0).sum()/len(y)*100:.1f}%), {(y==1).sum()} defective ({(y==1).sum()/len(y)*100:.1f}%)")
    
    # Define parameter grid
    param_grid = {
        'n_estimators': [50, 100, 150],
        'max_depth': [5, 10, 15],
        'min_samples_split': [2, 5, 10],
        'class_weight': ['balanced', None]
    }
    
    # Create analyzer
    analyzer = GridSearchAnalyzer(
        estimator=RandomForestClassifier(random_state=42),
        param_grid=param_grid,
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
        scoring=make_scorer(f1_score),  # F1 for imbalanced data
        n_jobs=-1
    )
    
    # Fit and analyze
    analyzer.fit(X, y)
    
    # Get top configurations
    top_configs = analyzer.get_top_configs(n=5)
    
    # Visualize results
    analyzer.plot_results()
    
    print("\n✅ Grid Search analysis complete!")

## 🎲 Random Search: Efficient Sampling

### What is Random Search?

**Random Search** samples hyperparameter configurations randomly from specified distributions. Instead of trying every combination (Grid Search), it evaluates a fixed number of random samples.

### Why Random Search Often Beats Grid Search

**Key insight**: Not all hyperparameters are equally important.

Consider a 2D grid:
- Hyperparameter A: Critical (large impact on performance)
- Hyperparameter B: Minor (small impact on performance)

**Grid Search (9 points)**:
```
B₃  •   •   •
B₂  •   •   •
B₁  •   •   •
    A₁  A₂  A₃
```
- Tests 3 unique values of A
- Tests 3 unique values of B

**Random Search (9 points)**:
```
B   •     •
      • •   •
    •   •   •
      •
    A (continuous)
```
- Tests ~9 unique values of A (if uniformly sampled)
- Tests ~9 unique values of B

**Result**: Random Search explores more values of the important hyperparameter A!
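
The same comparison can be reproduced numerically. This sketch places 9 points on a 3 × 3 grid and draws 9 random points from the unit square, then counts the distinct values tried for hyperparameter A; the unit-square ranges and the seed are arbitrary assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Grid Search: 3 x 3 grid on the unit square
axis = [0.0, 0.5, 1.0]
grid_points = [(a, b) for a in axis for b in axis]

# Random Search: 9 points drawn uniformly from the same square
random_points = [(rng.random(), rng.random()) for _ in range(9)]

unique_a_grid = len({a for a, _ in grid_points})
unique_a_random = len({a for a, _ in random_points})

print(f"Grid Search:   {len(grid_points)} points, {unique_a_grid} distinct values of A")
print(f"Random Search: {len(random_points)} points, {unique_a_random} distinct values of A")
```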

### Mathematical Formulation

Given:
- Hyperparameter distributions: $\lambda_1 \sim D_1, \lambda_2 \sim D_2, ..., \lambda_k \sim D_k$
- Number of iterations: $n$

**Algorithm**:
1. For iteration $i = 1$ to $n$:
   - Sample $\lambda_1^{(i)} \sim D_1, \lambda_2^{(i)} \sim D_2, ..., \lambda_k^{(i)} \sim D_k$
   - Evaluate configuration $(\lambda_1^{(i)}, \lambda_2^{(i)}, ..., \lambda_k^{(i)})$ using CV
2. Return best configuration
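
The two steps above can be sketched in plain Python. Everything here is illustrative: `toy_score` is a hypothetical stand-in for a cross-validated score, and `samplers` maps each hyperparameter to a function that draws one value from its distribution.

```python
import random


def toy_score(config):
    """Hypothetical stand-in for a cross-validated score (higher is better)."""
    return -(config["learning_rate"] - 0.05) ** 2 - 0.0001 * (config["max_depth"] - 6) ** 2


def random_search(score_fn, samplers, n_iter, seed=42):
    """Sample n_iter configurations independently and keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_iter):
        config = {name: draw(rng) for name, draw in samplers.items()}
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


samplers = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, -1),  # log-uniform on [0.001, 0.1]
    "max_depth": lambda rng: rng.randint(2, 10),             # integers 2..10 inclusive
}

best_config, best_score = random_search(toy_score, samplers, n_iter=50)
print(best_config, best_score)
```

Because the generator is seeded, repeated calls with the same arguments return the same configuration.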

**Probability of finding good configuration**:

If the top 5% of configurations have acceptable performance, and each evaluated configuration is treated as an independent draw from the search space (only a rough approximation for grid points):
- Grid Search (27 configs): 27 × 0.05 = 1.35 expected good configs
- Random Search (100 configs): 100 × 0.05 = 5 expected good configs

**Probability of finding at least one good config**:
- Random Search: $1 - (0.95)^{100} \approx 99.4\%$
- Grid Search: $1 - (0.95)^{27} \approx 75.0\%$
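
The formula $1 - (1 - p)^n$ is straightforward to evaluate for other budgets, and inverting it gives the number of draws needed for a target confidence. This sketch assumes independent random draws, so it describes Random Search more faithfully than a fixed grid; the function names are illustrative.

```python
from math import ceil, log


def p_at_least_one(n_trials, good_fraction):
    """P(at least one of n independent random draws lands in the good region)."""
    return 1 - (1 - good_fraction) ** n_trials


def trials_needed(good_fraction, confidence):
    """Smallest n with P(at least one good draw) >= confidence."""
    return ceil(log(1 - confidence) / log(1 - good_fraction))


for n in (27, 60, 100):
    print(f"n = {n:>3}: P(at least one top-5% config) = {p_at_least_one(n, 0.05):.3f}")

print("draws needed for 95% confidence:", trials_needed(0.05, 0.95))
```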

### Distribution Selection

#### Uniform Distribution
- **Use for**: Discrete parameters with equal importance
- **Example**: max_depth ∈ [5, 20], all values equally likely
- **Implementation**: `scipy.stats.randint(5, 21)`

#### Log-Uniform Distribution
- **Use for**: Parameters spanning multiple orders of magnitude
- **Example**: learning_rate ∈ [0.0001, 0.1] (1000× range)
- **Why**: Equal probability per order of magnitude
- **Implementation**: `scipy.stats.loguniform(0.0001, 0.1)`

**Visualization**:
```
Uniform [0.0001, 0.1]:
  ~90% of samples in [0.01, 0.1] ❌ (the two lower decades are barely explored)

Log-Uniform [0.0001, 0.1]:
  33% in [0.0001, 0.001]
  33% in [0.001, 0.01]
  33% in [0.01, 0.1] ✅ (explore all scales)
```
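
These proportions can be checked empirically. The sketch below uses only the standard library and draws log-uniform samples by sampling the exponent uniformly; the sample size and seed are arbitrary assumptions.

```python
import random
from math import log10

rng = random.Random(42)
low, high, n = 1e-4, 1e-1, 20_000

uniform_samples = [rng.uniform(low, high) for _ in range(n)]
# Log-uniform: sample the exponent uniformly, then exponentiate
loguniform_samples = [10 ** rng.uniform(log10(low), log10(high)) for _ in range(n)]


def fraction_in(samples, lo, hi):
    """Share of samples falling in [lo, hi)."""
    return sum(lo <= s < hi for s in samples) / len(samples)


decades = [(1e-4, 1e-3), (1e-3, 1e-2), (1e-2, 1e-1)]
for lo, hi in decades:
    print(f"[{lo:g}, {hi:g}): uniform {fraction_in(uniform_samples, lo, hi):.3f}, "
          f"log-uniform {fraction_in(loguniform_samples, lo, hi):.3f}")
```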

#### Categorical Distribution
- **Use for**: Discrete choices without ordering
- **Example**: kernel ∈ {linear, rbf, poly, sigmoid}
- **Implementation**: List of options

### When to Use Random Search

| **Use Random Search When** | **Prefer Grid Search When** |
|---------------------------|----------------------------|
| ✅ Large search space (>100 configs) | ❌ Small search space (<100 configs) |
| ✅ Continuous hyperparameters | ❌ Only discrete hyperparameters |
| ✅ No prior knowledge | ❌ Known good ranges |
| ✅ Limited computational budget | ❌ Unlimited budget |
| ✅ Exploratory tuning | ❌ Final optimization |
| ✅ Many hyperparameters (>5) | ❌ Few hyperparameters (≤3) |

### Advantages

1. **Efficient**: Explores more hyperparameter values with same budget
2. **Flexible**: Handles continuous distributions natively
3. **Parallelizable**: All samples independent
4. **Anytime**: Can stop early and use best so far
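5. **Scales well**: The evaluation budget is set by $n$, not by the number of hyperparameters, so cost does not grow exponentially with dimensionality (coverage of the space still gets sparser)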

### Disadvantages

1. **Non-exhaustive**: May miss optimal configuration
2. **Non-deterministic**: Different runs give different results (use random_state)
3. **Inefficient for low-dim**: Grid Search better for 1-2 hyperparameters
4. **No exploitation**: Doesn't learn from previous evaluations

### Random Search vs Grid Search: Empirical Comparison

**Scenario**: Tune Random Forest with 5 hyperparameters, 100 configurations budget

**Grid Search**:
- 5 hyperparameters, ~2.5 values each: 2.5⁵ ≈ 98 configs (too few values per dimension)
- Problem: Misses good regions between grid points

**Random Search**:
- Sample 100 random configurations
- Explores full space uniformly
- Higher chance of finding good configuration

**Research (Bergstra & Bengio, 2012)**:
- Random Search found models as good as or better than Grid Search using a fraction of the computation
- Especially effective when few hyperparameters are important

### Semiconductor-Specific Distributions

#### Yield Prediction (XGBoost)
```python
import scipy.stats

param_distributions = {
    'n_estimators': scipy.stats.randint(100, 500),      # Discrete uniform, 100..499 (upper bound exclusive)
    'max_depth': scipy.stats.randint(3, 15),
    'learning_rate': scipy.stats.loguniform(0.01, 0.3), # Log-uniform
    'subsample': scipy.stats.uniform(0.6, 0.4),         # Uniform [0.6, 1.0]
    'colsample_bytree': scipy.stats.uniform(0.6, 0.4),
    'min_child_weight': scipy.stats.randint(1, 10)
}
# Samples from rich 6D space
```

#### Test Time Prediction (Gradient Boosting)
```python
param_distributions = {
    'n_estimators': scipy.stats.randint(50, 300),
    'learning_rate': scipy.stats.loguniform(0.001, 0.1),  # Wide log range
    'max_depth': scipy.stats.randint(2, 10),
    'min_samples_split': scipy.stats.randint(2, 20),
    'max_features': scipy.stats.uniform(0.3, 0.7)         # [0.3, 1.0]
}
```

### Practical Guidelines

#### Number of Iterations

**Rule of thumb**: $n \geq 10 \times k$ where $k$ = number of hyperparameters

**Examples**:
- 3 hyperparameters: $n \geq 30$ iterations
- 5 hyperparameters: $n \geq 50$ iterations
- 10 hyperparameters: $n \geq 100$ iterations

**Budget-based**:
- Limited budget (1 hour): $n$ = 20–50
- Moderate budget (overnight): $n$ = 100–200
- Large budget (weekend): $n$ = 500–1000

**Convergence check**: Plot best score vs iteration. If plateaued → stop.
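
A convergence check needs the best-so-far curve in evaluation order (not sorted by rank, which would make the curve flat by construction). The sketch below is one possible stopping heuristic; the helper names, the window, and the tolerance are assumptions, and the score list is made up for illustration.

```python
from itertools import accumulate


def best_so_far(scores):
    """Running maximum of scores, in the order the configurations were evaluated."""
    return list(accumulate(scores, max))


def has_plateaued(scores, window, tol=1e-3):
    """True if the best score improved by less than tol over the last `window` evaluations."""
    curve = best_so_far(scores)
    if len(curve) <= window:
        return False  # not enough history to judge
    return curve[-1] - curve[-1 - window] < tol


# Hypothetical CV scores in evaluation order
scores = [0.71, 0.74, 0.69, 0.80, 0.78, 0.80, 0.79, 0.77, 0.80, 0.76]

print(best_so_far(scores))
print("plateaued:", has_plateaued(scores, window=5))
```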

#### Choosing Distributions

1. **Discrete with known range**: `scipy.stats.randint(low, high+1)`
2. **Continuous bounded**: `scipy.stats.uniform(low, high-low)`
3. **Log-scale (e.g., learning rates)**: `scipy.stats.loguniform(low, high)`
4. **Categorical**: Python list `['option1', 'option2', 'option3']`
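
The argument conventions of `scipy.stats` are easy to get wrong, so it helps to sample from each distribution and inspect the observed range. A minimal sketch, assuming SciPy 1.4 or newer (where `loguniform` is available); the parameter names and ranges are examples only.

```python
from scipy.stats import loguniform, randint, uniform

# Argument conventions:
#   randint(low, high)    -> integers low..high-1 (upper bound exclusive)
#   uniform(loc, scale)   -> floats in [loc, loc + scale]
#   loguniform(a, b)      -> floats in [a, b], uniform in log space
distributions = {
    "max_depth": randint(5, 21),               # 5..20
    "subsample": uniform(0.6, 0.4),            # [0.6, 1.0]
    "learning_rate": loguniform(1e-4, 1e-1),   # [0.0001, 0.1]
}

samples = {
    name: dist.rvs(size=1000, random_state=42)
    for name, dist in distributions.items()
}

for name, values in samples.items():
    print(f"{name:>14}: min={values.min():.4g}, max={values.max():.4g}")
```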

#### Random vs Grid: Decision Framework

**Use Grid Search if**:
- Search space small (<100 configs)
- Need reproducibility (exact same grid)
- Final fine-tuning (narrow range)

**Use Random Search if**:
- Search space large (>100 configs)
- Continuous hyperparameters
- Initial exploration (wide range)
- Many hyperparameters (>5)

**Use both (Sequential)**:
1. Random Search (100-200 samples): Explore broadly
2. Identify promising region
3. Grid Search (fine grid): Refine locally

### Common Random Search Pitfalls

#### ❌ Pitfall 1: Too Few Iterations
**Problem**: 10 samples for 5 hyperparameters → sparse coverage
**Solution**: Use $n \geq 10k$ rule

#### ❌ Pitfall 2: Wrong Distribution
**Problem**: Uniform [0.0001, 0.1] for learning_rate → ~90% of samples fall in [0.01, 0.1]
**Solution**: Use loguniform for log-scale parameters

#### ❌ Pitfall 3: Not Setting random_state
**Problem**: Cannot reproduce results
**Solution**: Always set a fixed `random_state` (e.g. `random_state=42`)

#### ❌ Pitfall 4: Stopping Too Early
**Problem**: Stop after 20 iterations, miss better configs at iteration 50
**Solution**: Monitor convergence, extend if improving

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, r2_score
from scipy.stats import randint, uniform, loguniform
import matplotlib.pyplot as plt
import time

np.random.seed(42)

class RandomSearchAnalyzer:
    """
    Random Search implementation with convergence analysis.
    """
    
    def __init__(self, estimator, param_distributions, n_iter=100, cv=5, 
                 scoring='r2', n_jobs=-1, random_state=42):
        """
        Args:
            estimator: Sklearn model
            param_distributions: Dictionary of distributions
            n_iter: Number of random samples
            cv: Cross-validation strategy
            scoring: Metric to optimize
            n_jobs: Parallel jobs
            random_state: Random seed
        """
        self.estimator = estimator
        self.param_distributions = param_distributions
        self.n_iter = n_iter
        self.cv = cv
        self.scoring = scoring
        self.n_jobs = n_jobs
        self.random_state = random_state
        self.random_search = None
        self.results_df = None
    
    def fit(self, X, y):
        """
        Perform random search with analysis.
        """
        print("="*80)
        print("RANDOM SEARCH: EFFICIENT HYPERPARAMETER OPTIMIZATION")
        print("="*80)
        
        print(f"\nSearch Configuration:")
        print(f"  Number of iterations: {self.n_iter}")
        print(f"  Cross-validation folds: {self.cv if isinstance(self.cv, int) else self.cv.get_n_splits()}")
        print(f"  Scoring metric: {self.scoring}")
        print(f"  Parallel jobs: {self.n_jobs}")
        print(f"  Random state: {self.random_state}")
        
        print(f"\nParameter Distributions:")
        for param, dist in self.param_distributions.items():
            if hasattr(dist, 'dist'):
                print(f"  {param}: {dist.dist.name} distribution")
            elif isinstance(dist, list):
                print(f"  {param}: Categorical {dist}")
            else:
                print(f"  {param}: {type(dist).__name__}")
        
        n_folds = self.cv if isinstance(self.cv, int) else self.cv.get_n_splits()
        total_fits = self.n_iter * n_folds
        print(f"\nTotal model trainings: {total_fits}")
        
        # Perform random search
        print(f"\nStarting Random Search...")
        start_time = time.time()
        
        self.random_search = RandomizedSearchCV(
            estimator=self.estimator,
            param_distributions=self.param_distributions,
            n_iter=self.n_iter,
            cv=self.cv,
            scoring=self.scoring,
            n_jobs=self.n_jobs,
            verbose=0,
            random_state=self.random_state,
            return_train_score=True
        )
        
        self.random_search.fit(X, y)
        
        elapsed_time = time.time() - start_time
        
        # Extract results
        results = pd.DataFrame(self.random_search.cv_results_)
        
        print(f"\n✅ Random Search completed in {elapsed_time:.2f} seconds")
        print(f"   Time per iteration: {elapsed_time / self.n_iter:.2f} seconds")
        print(f"   Time per fit: {elapsed_time / total_fits:.2f} seconds")
        
        print(f"\nBest Configuration:")
        for param, value in self.random_search.best_params_.items():
            print(f"  {param}: {value}")
        
        print(f"\nBest Score: {self.random_search.best_score_:.6f}")
        print(f"Best Train Score: {results.loc[self.random_search.best_index_, 'mean_train_score']:.6f}")
        print(f"Overfitting gap: {results.loc[self.random_search.best_index_, 'mean_train_score'] - self.random_search.best_score_:.6f}")
        
        # Convergence analysis: cv_results_ rows are in sampling order,
        # so a running maximum gives the best score found so far at each iteration
        best_scores = results['mean_test_score'].cummax()
        
        print(f"\nConvergence Analysis:")
        for checkpoint in (10, 50):
            if len(best_scores) >= checkpoint:
                print(f"  Best score after {checkpoint} iterations: {best_scores.iloc[checkpoint - 1]:.6f}")
        print(f"  Best score after {self.n_iter} iterations: {best_scores.iloc[-1]:.6f}")
        
        # Check if converged
        last_20_pct = max(1, int(0.2 * len(best_scores)))
        improvement_last_20 = best_scores.iloc[-1] - best_scores.iloc[-last_20_pct]
        print(f"  Improvement in last 20% iterations: {improvement_last_20:.6f}")
        
        if improvement_last_20 < 0.001:
            print(f"  ✅ Converged (minimal improvement in last 20%)")
        else:
            print(f"  ⚠️  Still improving - consider increasing n_iter")
        
        self.results_df = results
        
        return self
    
    def plot_results(self, figsize=(16, 12)):
        """
        Visualize random search results.
        """
        if self.results_df is None:
            raise ValueError("Must call fit() before plot_results()")
        
        results = self.results_df.copy()
        results_sorted = results.sort_values('rank_test_score').reset_index(drop=True)
        
        fig, axes = plt.subplots(2, 2, figsize=figsize)
        
        # Plot 1: Convergence over iterations (rows of cv_results_ are in sampling order)
        best_scores = results['mean_test_score'].cummax()
        iterations = np.arange(1, len(best_scores) + 1)
        
        axes[0, 0].plot(iterations, best_scores, linewidth=2, color='blue')
        axes[0, 0].scatter(iterations, best_scores, alpha=0.5, s=30, color='blue')
        axes[0, 0].axhline(self.random_search.best_score_, color='red', 
                          linestyle='--', linewidth=2, label=f'Final Best: {self.random_search.best_score_:.4f}')
        axes[0, 0].set_xlabel('Iteration (sampling order)', fontsize=11, fontweight='bold')
        axes[0, 0].set_ylabel('Best Score So Far', fontsize=11, fontweight='bold')
        axes[0, 0].set_title('Convergence Analysis\n(Shows how quickly we find good configs)', 
                            fontsize=12, fontweight='bold')
        axes[0, 0].legend()
        axes[0, 0].grid(alpha=0.3)
        
        # Plot 2: Score distribution
        axes[0, 1].hist(results['mean_test_score'], bins=30, edgecolor='black', alpha=0.7)
        axes[0, 1].axvline(self.random_search.best_score_, color='red', 
                          linestyle='--', linewidth=2, label=f'Best: {self.random_search.best_score_:.4f}')
        axes[0, 1].set_xlabel('Mean Test Score', fontsize=11, fontweight='bold')
        axes[0, 1].set_ylabel('Frequency', fontsize=11, fontweight='bold')
        axes[0, 1].set_title('Score Distribution\n(Shows exploration of search space)', 
                            fontsize=12, fontweight='bold')
        axes[0, 1].legend()
        axes[0, 1].grid(alpha=0.3)
        
        # Plot 3: Overfitting analysis
        axes[1, 0].scatter(results['mean_train_score'], results['mean_test_score'], 
                          alpha=0.6, s=50)
        
        # Diagonal line
        min_score = min(results['mean_train_score'].min(), results['mean_test_score'].min())
        max_score = max(results['mean_train_score'].max(), results['mean_test_score'].max())
        axes[1, 0].plot([min_score, max_score], [min_score, max_score], 
                       'r--', linewidth=2, label='Perfect Generalization')
        
        # Best config
        best_idx = self.random_search.best_index_
        axes[1, 0].scatter(results.loc[best_idx, 'mean_train_score'], 
                          results.loc[best_idx, 'mean_test_score'],
                          color='red', s=200, marker='*', 
                          edgecolor='black', linewidth=2, label='Best Config')
        
        axes[1, 0].set_xlabel('Mean Train Score', fontsize=11, fontweight='bold')
        axes[1, 0].set_ylabel('Mean Test Score', fontsize=11, fontweight='bold')
        axes[1, 0].set_title('Overfitting Analysis\n(Points below line = overfitting)', 
                            fontsize=12, fontweight='bold')
        axes[1, 0].legend()
        axes[1, 0].grid(alpha=0.3)
        
        # Plot 4: Top 10 configurations
        top_10 = results_sorted.head(10)[['mean_test_score', 'std_test_score']]
        config_labels = [f"Config {i+1}" for i in range(len(top_10))]
        
        axes[1, 1].barh(config_labels, top_10['mean_test_score'], 
                       xerr=top_10['std_test_score'], alpha=0.7, edgecolor='black')
        axes[1, 1].set_xlabel('Mean Test Score', fontsize=11, fontweight='bold')
        axes[1, 1].set_ylabel('Configuration', fontsize=11, fontweight='bold')
        axes[1, 1].set_title('Top 10 Configurations\n(with standard deviation)', 
                            fontsize=12, fontweight='bold')
        axes[1, 1].grid(alpha=0.3, axis='x')
        axes[1, 1].invert_yaxis()
        
        plt.tight_layout()
        plt.show()
    
    def compare_with_grid(self, grid_results_df):
        """
        Compare Random Search with Grid Search results.
        """
        print("\n" + "="*80)
        print("RANDOM SEARCH vs GRID SEARCH COMPARISON")
        print("="*80)
        
        print(f"\nRandom Search:")
        print(f"  Best score: {self.random_search.best_score_:.6f}")
        print(f"  Iterations: {self.n_iter}")
        print(f"  Best params: {self.random_search.best_params_}")
        
        print(f"\nGrid Search:")
        grid_best_score = grid_results_df['mean_test_score'].max()
        grid_best_idx = grid_results_df['mean_test_score'].idxmax()
        grid_best_params = grid_results_df.loc[grid_best_idx, 'params']
        print(f"  Best score: {grid_best_score:.6f}")
        print(f"  Configurations: {len(grid_results_df)}")
        print(f"  Best params: {grid_best_params}")
        
        print(f"\nComparison:")
        if self.random_search.best_score_ > grid_best_score:
            print(f"  ✅ Random Search WINS by {self.random_search.best_score_ - grid_best_score:.6f}")
        elif self.random_search.best_score_ < grid_best_score:
            print(f"  ❌ Grid Search wins by {grid_best_score - self.random_search.best_score_:.6f}")
        else:
            print(f"  ⚖️  TIE (both found same best score)")


# Example usage
if __name__ == "__main__":
    print("\nEXAMPLE: Random Search for Semiconductor Test Time Prediction\n")
    
    # Generate test time data
    X, y = make_regression(
        n_samples=2000,
        n_features=8,
        n_informative=6,
        noise=10.0,
        random_state=42
    )
    
    print(f"Dataset: {len(y)} device measurements")
    print(f"Features: 8 parametric measurements")
    print(f"Target: Test time (continuous)")
    
    # Define parameter distributions
    param_distributions = {
        'n_estimators': randint(50, 300),                   # Discrete uniform [50, 300)
        'learning_rate': loguniform(0.001, 0.3),           # Log-uniform [0.001, 0.3)
        'max_depth': randint(2, 15),                       # Discrete uniform [2, 15)
        'min_samples_split': randint(2, 20),
        'min_samples_leaf': randint(1, 10),
        'subsample': uniform(0.6, 0.4),                    # Uniform [0.6, 1.0)
        'max_features': uniform(0.3, 0.7)                  # Uniform [0.3, 1.0)
    }
    
    # Create analyzer
    analyzer = RandomSearchAnalyzer(
        estimator=GradientBoostingRegressor(random_state=42),
        param_distributions=param_distributions,
        n_iter=100,
        cv=5,
        scoring='r2',
        n_jobs=-1,
        random_state=42
    )
    
    # Fit and analyze
    analyzer.fit(X, y)
    
    # Visualize
    analyzer.plot_results()
    
    print("\n✅ Random Search analysis complete!")

## 🧠 Bayesian Optimization: Smart Search with Priors

### What is Bayesian Optimization?

**Bayesian Optimization** is an intelligent search strategy that builds a probabilistic model of the objective function and uses it to select promising hyperparameters to evaluate next. Unlike Grid/Random Search, it **learns from previous evaluations** and focuses on promising regions.

### The Core Idea: Surrogate Model + Acquisition Function

**Problem**: Evaluating hyperparameters is expensive (requires training model with CV)

**Solution**: Build cheap surrogate model of expensive objective function

**Algorithm**:
1. **Surrogate Model**: Gaussian Process models $f(\lambda) \sim GP(\mu, k)$
   - Predicts performance and uncertainty for any hyperparameter configuration
2. **Acquisition Function**: Balances exploration (high uncertainty) vs exploitation (high predicted performance)
3. **Next Sample**: Choose hyperparameters maximizing acquisition function
4. **Update**: Train model, update surrogate, repeat

### Mathematical Formulation

**Objective**: Find $\lambda^* = \arg\max_\lambda f(\lambda)$ where $f(\lambda)$ is expensive to evaluate

**Gaussian Process (GP)**:
- Mean function: $\mu(\lambda)$ = expected performance
- Covariance function: $k(\lambda, \lambda')$ = correlation between configurations
- After $n$ evaluations: $\{(\lambda_1, y_1), ..., (\lambda_n, y_n)\}$
- Posterior: $f(\lambda) | D_n \sim \mathcal{N}(\mu_n(\lambda), \sigma_n^2(\lambda))$

**Key properties** (for noise-free observations; with noisy CV scores they hold only approximately):
- $\mu_n(\lambda_i) = y_i$ (interpolates observed points)
- $\sigma_n(\lambda_i) = 0$ (no uncertainty at observed points)
- Far from observations → high uncertainty
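
These properties can be checked numerically. The following is a minimal NumPy sketch of a zero-mean GP posterior with an RBF kernel; the helper names (`rbf_kernel`, `gp_posterior`), the unit length scale, the small jitter term, and the three sample scores are assumptions made for illustration, not part of any library.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel k(a, b) with unit prior variance."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-0.5 * ((a - b) / length_scale) ** 2)


def gp_posterior(x_train, y_train, x_query, length_scale=1.0, jitter=1e-10):
    """Posterior mean and standard deviation of a zero-mean GP."""
    K = rbf_kernel(x_train, x_train, length_scale) + jitter * np.eye(len(x_train))
    K_s = rbf_kernel(x_query, x_train, length_scale)
    mean = K_s @ np.linalg.solve(K, y_train)
    v = np.linalg.solve(K, K_s.T)
    var = 1.0 - np.sum(K_s * v.T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))


# Hypothetical scores observed at three 1-D configurations
x_obs = np.array([0.0, 1.0, 3.0])
y_obs = np.array([0.80, 0.85, 0.70])

mean_obs, std_obs = gp_posterior(x_obs, y_obs, x_obs)
mean_far, std_far = gp_posterior(x_obs, y_obs, np.array([10.0]))

print("At observed points: mean =", mean_obs, "std =", std_obs)
print("Far from the data:  mean =", mean_far, "std =", std_far)
```

At the observed points the posterior mean reproduces the scores and the standard deviation collapses towards zero, while far from the data the standard deviation returns to the prior value of 1.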

### Acquisition Functions

#### 1. **Expected Improvement (EI)**

**Definition**: Expected improvement over current best $f^+ = \max_{i=1..n} y_i$

$$EI(\lambda) = \mathbb{E}[\max(0, f(\lambda) - f^+)]$$

**Closed form** (if $f(\lambda) \sim \mathcal{N}(\mu, \sigma^2)$):

$$EI(\lambda) = \begin{cases}
(\mu - f^+) \Phi(Z) + \sigma \phi(Z) & \text{if } \sigma > 0 \\
0 & \text{if } \sigma = 0
\end{cases}$$

where $Z = \frac{\mu - f^+}{\sigma}$, $\Phi$ = CDF, $\phi$ = PDF of standard normal

**Behavior**:
- High $\mu$ (exploitation) → High EI
- High $\sigma$ (exploration) → High EI
- Balanced trade-off
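
The closed form translates almost directly into code. This is a minimal sketch assuming the posterior mean and standard deviation are already available as arrays; the function name `expected_improvement` and the sample numbers are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm


def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for maximization; returns 0 where sigma == 0."""
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    sigma = np.atleast_1d(np.asarray(sigma, dtype=float))
    ei = np.zeros_like(mu)
    positive = sigma > 0
    z = (mu[positive] - f_best) / sigma[positive]
    ei[positive] = (mu[positive] - f_best) * norm.cdf(z) + sigma[positive] * norm.pdf(z)
    return ei


# Hypothetical posterior at three candidate configurations
mu = np.array([0.80, 0.90, 0.85])
sigma = np.array([0.00, 0.05, 0.20])
f_best = 0.88  # best score observed so far

ei = expected_improvement(mu, sigma, f_best)
print("EI per candidate:", ei)
print("Next candidate index:", int(np.argmax(ei)))
```

Note that a candidate with a mean below the current best can still have a positive EI when its uncertainty is large, which is how EI keeps exploring.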

#### 2. **Probability of Improvement (PI)**

**Definition**: Probability that $f(\lambda) > f^+$

$$PI(\lambda) = P(f(\lambda) > f^+) = \Phi\left(\frac{\mu - f^+}{\sigma}\right)$$

**Behavior**: More exploitative than EI (focuses on high mean)

#### 3. **Upper Confidence Bound (UCB)**

**Definition**: Optimistic estimate with exploration parameter $\kappa$

$$UCB(\lambda) = \mu(\lambda) + \kappa \cdot \sigma(\lambda)$$

**Behavior**:
- $\kappa$ large → Exploration (high uncertainty preferred)
- $\kappa$ small → Exploitation (high mean preferred)
- Typical: $\kappa \in [1, 3]$
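
PI and UCB are equally short to write down. The sketch below assumes the same array inputs as the formulas above; the function names and the two hypothetical candidates are illustrative only.

```python
import numpy as np
from scipy.stats import norm


def probability_of_improvement(mu, sigma, f_best):
    """P(f > f_best) under a normal posterior; degenerate when sigma == 0."""
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    sigma = np.atleast_1d(np.asarray(sigma, dtype=float))
    pi = (mu > f_best).astype(float)  # sigma == 0: improvement is certain or impossible
    positive = sigma > 0
    pi[positive] = norm.cdf((mu[positive] - f_best) / sigma[positive])
    return pi


def upper_confidence_bound(mu, sigma, kappa=2.0):
    """Optimistic estimate mu + kappa * sigma."""
    return np.asarray(mu, dtype=float) + kappa * np.asarray(sigma, dtype=float)


# Candidate A: high mean, low uncertainty. Candidate B: lower mean, high uncertainty.
mu = np.array([0.90, 0.85])
sigma = np.array([0.01, 0.10])

pick_small_kappa = int(np.argmax(upper_confidence_bound(mu, sigma, kappa=0.1)))
pick_large_kappa = int(np.argmax(upper_confidence_bound(mu, sigma, kappa=3.0)))
print("kappa=0.1 picks candidate", pick_small_kappa)
print("kappa=3.0 picks candidate", pick_large_kappa)
print("PI vs f_best=0.88:", probability_of_improvement(mu, sigma, 0.88))
```

With a small $\kappa$ the high-mean candidate wins; with a large $\kappa$ the uncertain candidate is preferred.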

### Why Bayesian Optimization Works

**Scenario**: 100 configurations to explore

**Random Search**:
- Evaluates 100 random configurations
- No learning from previous evaluations
- Uniform exploration

**Bayesian Optimization**:
- Iteration 1-10: Explore broadly (high uncertainty everywhere)
- Iteration 11-50: Focus on promising regions (low performance areas ignored)
- Iteration 51-100: Exploit best regions (refine around peak)

**Result**: Bayesian Optimization finds better configurations with fewer evaluations

**Research** (Snoek et al., 2012):
- Bayesian Optimization achieved same performance as Random Search with **5× fewer evaluations**
- Especially effective for expensive models (deep learning, ensembles)
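
The explore-then-exploit behaviour described above can be sketched as a small loop: fit a GP to the evaluations so far, score a grid of candidates with Expected Improvement, evaluate the best one, and repeat. This is a minimal illustration, not a production optimizer. The toy objective, the function names, the candidate grid, and the Matern kernel settings are all assumptions; it uses scikit-learn's `GaussianProcessRegressor` as the surrogate.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def toy_objective(x):
    """Hypothetical stand-in for an expensive CV score; maximum at x = 2."""
    return -(x - 2.0) ** 2


def bayes_opt_1d(objective, low, high, n_initial=4, n_iter=12, seed=0):
    rng = np.random.default_rng(seed)
    candidates = np.linspace(low, high, 201).reshape(-1, 1)

    # Phase 1: random initial points so the GP has something to fit
    X = rng.uniform(low, high, size=(n_initial, 1))
    y = np.array([objective(x[0]) for x in X])

    # Phase 2: surrogate + acquisition loop
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(
            kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True, random_state=seed
        )
        gp.fit(X, y)
        mu, sigma = gp.predict(candidates, return_std=True)

        # Expected Improvement over the best score seen so far
        f_best = y.max()
        ei = np.zeros_like(mu)
        ok = sigma > 1e-12
        z = (mu[ok] - f_best) / sigma[ok]
        ei[ok] = (mu[ok] - f_best) * norm.cdf(z) + sigma[ok] * norm.pdf(z)

        # Evaluate the candidate with the highest EI and add it to the data
        x_next = candidates[np.argmax(ei)]
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next[0]))

    best = int(np.argmax(y))
    return float(X[best, 0]), float(y[best])


x_best, y_best = bayes_opt_1d(toy_objective, low=-5.0, high=5.0)
print(f"Best x = {x_best:.3f}, score = {y_best:.4f} after 16 evaluations")
```

Real libraries optimize the acquisition function continuously rather than over a fixed grid, and handle several dimensions and mixed parameter types.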

### When to Use Bayesian Optimization

| **Use Bayesian Optimization When** | **Avoid Bayesian Optimization When** |
|------------------------------------|-------------------------------------|
| ✅ Expensive model training (>1 min per config) | ❌ Fast model training (<10 sec per config) |
| ✅ Small iteration budget (<100 evals) | ❌ Large iteration budget (>1000 evals) |
| ✅ Continuous hyperparameters | ❌ Only categorical hyperparameters |
| ✅ Low-dimensional space (≤20 dims) | ❌ Very high-dimensional (>50 dims) |
| ✅ Smooth objective function | ❌ Highly noisy objective |
| ✅ Need sample efficiency | ❌ Parallelization is priority |

### Advantages

1. **Sample efficient**: Finds good configs with fewer evaluations
2. **Principled**: Balances exploration/exploitation via probability theory
3. **Handles continuous**: Natural for continuous hyperparameters
4. **Uncertainty quantification**: Provides confidence in predictions
5. **Works with constraints**: Can handle validity constraints

### Disadvantages

1. **Sequential**: Hard to parallelize (needs previous results)
2. **GP overhead**: Gaussian Process training scales $O(n^3)$ with evaluations
3. **Dimensionality**: Struggles with >20 hyperparameters
4. **Requires tuning**: Acquisition function, GP kernel choices matter
5. **Overkill for fast models**: Random Search sufficient if training is fast

### Bayesian Optimization Libraries

#### 1. **scikit-optimize (skopt)** ✅ Recommended
```python
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical

search_spaces = {
    'n_estimators': Integer(50, 300),
    'learning_rate': Real(0.001, 0.3, prior='log-uniform'),
    'max_depth': Integer(3, 15),
    'subsample': Real(0.6, 1.0)
}

opt = BayesSearchCV(
    estimator=model,
    search_spaces=search_spaces,
    n_iter=50,  # Only 50 iterations!
    cv=5,
    random_state=42
)
```

#### 2. **Optuna** (Advanced)
- More flexible than skopt
- Better parallelization support
- Pruning for early stopping

#### 3. **Hyperopt**
- Tree-structured Parzen Estimator (TPE) instead of GP
- Good for categorical/conditional hyperparameters

### Semiconductor-Specific Applications

#### Yield Prediction (XGBoost) - Expensive Model
```python
search_spaces = {
    'n_estimators': Integer(100, 500),
    'learning_rate': Real(0.01, 0.3, prior='log-uniform'),
    'max_depth': Integer(3, 15),
    'min_child_weight': Integer(1, 10),
    'subsample': Real(0.6, 1.0),
    'colsample_bytree': Real(0.6, 1.0),
    'gamma': Real(0.0, 5.0)
}
# Budget: ~50 Bayesian iterations, versus 100+ for Random Search
```

**Why Bayesian Opt**: XGBoost training on 10K+ wafers takes 2-5 minutes per config
- Random Search (100 configs): 200-500 minutes = **3.3-8.3 hours**
- Bayesian Opt (50 configs): 100-250 minutes = **1.7-4.2 hours** with better results

### Practical Guidelines

#### Number of Iterations

**Rule of thumb**: $n \geq 10 \times d$ where $d$ = number of hyperparameters (dimensions)

**Examples**:
- 3 hyperparameters: $n \geq 30$ iterations
- 5 hyperparameters: $n \geq 50$ iterations
- 10 hyperparameters: $n \geq 100$ iterations

**Budget-based**:
- Limited budget (2 hours): Bayesian Opt with 20-30 iterations
- Moderate budget (overnight): Bayesian Opt with 50-100 iterations
- Large budget (weekend): Random Search (more parallelizable)
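
Both guidelines reduce to simple arithmetic. The helpers below are a hypothetical sketch (the names `recommended_iterations` and `affordable_iterations` are assumptions) for checking whether the rule-of-thumb iteration count fits a time budget.

```python
import math


def recommended_iterations(n_hyperparameters, per_dim=10):
    """Rule of thumb: about 10 evaluations per tuned hyperparameter."""
    return per_dim * n_hyperparameters


def affordable_iterations(budget_minutes, minutes_per_config):
    """How many full evaluations fit in the time budget."""
    return math.floor(budget_minutes / minutes_per_config)


for d in (3, 5, 10):
    print(f"{d} hyperparameters -> at least {recommended_iterations(d)} iterations")

# Example: 2-hour budget at 5 minutes per configuration
n_affordable = affordable_iterations(budget_minutes=120, minutes_per_config=5)
print(f"2-hour budget at 5 min/config -> {n_affordable} iterations")
```

If the affordable count falls well below the recommended one, fixing the less important hyperparameters to reduce $d$ is usually the first thing to try.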

#### Initialization Strategy

**Problem**: Gaussian Process needs initial points

**Solution**: Random initialization (5-10 points) before Bayesian Optimization

```python
BayesSearchCV(
    estimator=model,
    search_spaces=search_spaces,
    n_iter=50,
    optimizer_kwargs={'n_initial_points': 10}  # First 10 are random
)
```

**Why**: GP is unreliable with <5 observations

#### Choosing Acquisition Function

**Default**: Expected Improvement (EI)
- Good balance of exploration/exploitation
- Most widely used

**Use UCB if**:
- Want more control (tune $\kappa$)
- Need more exploration initially

**Use PI if**:
- Want pure exploitation
- Already have good baseline

### Bayesian Optimization vs Random Search: Head-to-Head

**Scenario**: Tune Neural Network (expensive: 5 min/config)

| **Method** | **Iterations** | **Best Accuracy** | **Total Time** |
|------------|----------------|-------------------|----------------|
| Random Search | 100 | 91.2% | 500 min (8.3 hrs) |
| Random Search | 50 | 89.8% | 250 min (4.2 hrs) |
| **Bayesian Opt** | **50** | **91.5%** | **250 min (4.2 hrs)** |
| **Bayesian Opt** | **30** | **91.3%** | **150 min (2.5 hrs)** |

**Conclusion**: In this scenario, Bayesian Opt beats Random Search's 100-iteration result (91.2%) with 50-70% fewer evaluations (50 or 30 instead of 100)

### Common Bayesian Optimization Pitfalls

#### ❌ Pitfall 1: Using for Fast Models
**Problem**: GP overhead dominates when model training takes <10 seconds
**Solution**: Use Random Search for fast models

#### ❌ Pitfall 2: Too Many Dimensions
**Problem**: GP struggles with >20 hyperparameters
**Solution**: Reduce dimensionality (fix less important hyperparameters)

#### ❌ Pitfall 3: Not Enough Initial Points
**Problem**: GP unreliable with <5 observations
**Solution**: Set `optimizer_kwargs={'n_initial_points': 10}` in `BayesSearchCV` (20% of a 50-iteration budget)

#### ❌ Pitfall 4: Categorical Hyperparameters Only
**Problem**: Bayesian Opt designed for continuous spaces
**Solution**: Use Random Search or Tree-structured methods (Hyperopt)

In [None]:
# Note: This cell demonstrates Bayesian Optimization using scikit-optimize
# Install: pip install scikit-optimize (if not already installed)

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import time

np.random.seed(42)

# Simplified Bayesian Optimization demonstration
# (In practice, use skopt.BayesSearchCV or optuna)

class SimpleBayesianOptimizer:
    """
    Simplified Bayesian Optimization for demonstration.
    
    In production, use: skopt.BayesSearchCV or optuna
    This implementation shows the core concepts.
    """
    
    def __init__(self, param_bounds, n_iter=50, n_initial_random=10, random_state=42):
        """
        Args:
            param_bounds: Dict of {param: (low, high)}
            n_iter: Total iterations
            n_initial_random: Initial random exploration
            random_state: Random seed
        """
        self.param_bounds = param_bounds
        self.n_iter = n_iter
        self.n_initial_random = n_initial_random
        self.random_state = random_state
        
        self.param_names = list(param_bounds.keys())
        self.bounds = np.array([param_bounds[k] for k in self.param_names])
        
        self.X_samples = []  # Configurations evaluated
        self.y_samples = []  # Scores obtained
        
        np.random.seed(random_state)
    
    def _random_sample(self):
        """Generate random configuration."""
        config = {}
        for i, param in enumerate(self.param_names):
            low, high = self.bounds[i]
            # Simple uniform sampling (could use log-scale for some params)
            config[param] = np.random.uniform(low, high)
        return config
    
    def _suggest_next(self):
        """
        Suggest next configuration to try.
        
        Simplified: random sampling during the initial phase, then
        Gaussian perturbation around the best configuration so far.
        Real implementation would use GP + acquisition function.
        """
        if len(self.X_samples) < self.n_initial_random:
            # Initial random phase
            return self._random_sample()
        else:
            # Bayesian phase (simplified: use best region + noise)
            # Real implementation: GP posterior + Expected Improvement
            best_idx = np.argmax(self.y_samples)
            best_config = self.X_samples[best_idx]
            
            # Sample near best with some noise
            config = {}
            for param in self.param_names:
                low, high = self.param_bounds[param]
                best_val = best_config[param]
                
                # Add Gaussian noise proportional to range
                noise_scale = (high - low) * 0.2  # 20% of range
                new_val = np.clip(
                    best_val + np.random.normal(0, noise_scale),
                    low, high
                )
                config[param] = new_val
            
            return config
    
    def optimize(self, objective_func):
        """
        Run Bayesian Optimization.
        
        Args:
            objective_func: Function that takes config dict and returns score
        
        Returns:
            Best config and history
        """
        print("="*80)
        print("BAYESIAN OPTIMIZATION (Simplified Demonstration)")
        print("="*80)
        print(f"\nConfiguration:")
        print(f"  Total iterations: {self.n_iter}")
        print(f"  Initial random: {self.n_initial_random}")
        print(f"  Bayesian iterations: {self.n_iter - self.n_initial_random}")
        
        print(f"\nParameter Bounds:")
        for param, (low, high) in self.param_bounds.items():
            print(f"  {param}: [{low:.4f}, {high:.4f}]")
        
        print(f"\nStarting optimization...")
        start_time = time.time()
        
        for i in range(self.n_iter):
            # Get next configuration
            config = self._suggest_next()
            
            # Evaluate
            score = objective_func(config)
            
            # Store
            self.X_samples.append(config)
            self.y_samples.append(score)
            
            # Print progress
            if i < self.n_initial_random:
                phase = "Random"
            else:
                phase = "Bayesian"
            
            best_so_far = max(self.y_samples)
            
            if (i + 1) % 10 == 0 or i == 0:
                print(f"  Iter {i+1:3d} ({phase:8s}): Score={score:.6f}, Best={best_so_far:.6f}")
        
        elapsed = time.time() - start_time
        
        # Results
        best_idx = np.argmax(self.y_samples)
        best_config = self.X_samples[best_idx]
        best_score = self.y_samples[best_idx]
        
        print(f"\n✅ Optimization completed in {elapsed:.2f} seconds")
        print(f"\nBest Configuration:")
        for param, value in best_config.items():
            print(f"  {param}: {value:.6f}")
        print(f"\nBest Score: {best_score:.6f}")
        
        return {
            'best_config': best_config,
            'best_score': best_score,
            'X_samples': self.X_samples,
            'y_samples': self.y_samples,
            'elapsed_time': elapsed
        }
    
    def plot_convergence(self):
        """Plot convergence over iterations."""
        if not self.y_samples:
            raise ValueError("Must run optimize() first")
        
        iterations = np.arange(1, len(self.y_samples) + 1)
        scores = np.array(self.y_samples)
        best_scores = np.maximum.accumulate(scores)
        
        fig, axes = plt.subplots(1, 2, figsize=(14, 5))
        
        # Plot 1: Score per iteration
        colors = ['blue'] * self.n_initial_random + ['green'] * (len(scores) - self.n_initial_random)
        axes[0].scatter(iterations, scores, c=colors, alpha=0.6, s=50)
        axes[0].plot(iterations, best_scores, 'r-', linewidth=2, label='Best So Far')
        axes[0].axvline(self.n_initial_random, color='orange', linestyle='--', 
                       linewidth=2, label=f'Random→Bayesian (iter {self.n_initial_random})')
        axes[0].set_xlabel('Iteration', fontsize=11, fontweight='bold')
        axes[0].set_ylabel('Score', fontsize=11, fontweight='bold')
        axes[0].set_title('Convergence Over Iterations\n(Blue=Random, Green=Bayesian)', 
                         fontsize=12, fontweight='bold')
        axes[0].legend()
        axes[0].grid(alpha=0.3)
        
        # Plot 2: Improvement rate
        improvements = np.diff(best_scores)
        improvement_iters = iterations[1:]
        
        axes[1].bar(improvement_iters, improvements, alpha=0.7, edgecolor='black')
        axes[1].axvline(self.n_initial_random, color='orange', linestyle='--', 
                       linewidth=2, label='Random→Bayesian')
        axes[1].set_xlabel('Iteration', fontsize=11, fontweight='bold')
        axes[1].set_ylabel('Improvement', fontsize=11, fontweight='bold')
        axes[1].set_title('Improvement Per Iteration\n(Shows where Bayesian finds better configs)', 
                         fontsize=12, fontweight='bold')
        axes[1].legend()
        axes[1].grid(alpha=0.3, axis='y')
        
        plt.tight_layout()
        plt.show()


# Example usage with actual sklearn model
if __name__ == "__main__":
    print("\nEXAMPLE: Bayesian Optimization for Random Forest Tuning\n")
    
    # Generate regression data
    X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)
    
    print(f"Dataset: {len(y)} samples, {X.shape[1]} features")
    
    # Define objective function
    def objective(config):
        """
        Evaluate Random Forest with given configuration.
        Returns mean cross-validated R² (higher is better).
        """
        model = RandomForestRegressor(
            n_estimators=int(config['n_estimators']),
            max_depth=int(config['max_depth']) if config['max_depth'] < 30 else None,
            min_samples_split=int(config['min_samples_split']),
            min_samples_leaf=int(config['min_samples_leaf']),
            random_state=42,
            n_jobs=-1
        )
        
        # Use 3-fold CV for speed (would use 5 in production)
        scores = cross_val_score(model, X, y, cv=3, scoring='r2', n_jobs=-1)
        return scores.mean()
    
    # Define parameter bounds
    param_bounds = {
        'n_estimators': (50, 200),
        'max_depth': (5, 30),
        'min_samples_split': (2, 20),
        'min_samples_leaf': (1, 10)
    }
    
    # Run Bayesian Optimization
    optimizer = SimpleBayesianOptimizer(
        param_bounds=param_bounds,
        n_iter=40,
        n_initial_random=10,
        random_state=42
    )
    
    results = optimizer.optimize(objective)
    
    # Visualize convergence
    optimizer.plot_convergence()
    
    print("\n" + "="*80)
    print("KEY INSIGHTS")
    print("="*80)
    print("\n1. Initial Random Phase (iterations 1-10):")
    print("   - Explores parameter space broadly")
    print("   - Establishes the baseline a surrogate model would be fitted to")
    
    print("\n2. Bayesian Phase (iterations 11-40):")
    print("   - Focuses on promising regions (green points)")
    print("   - This demo samples near the best config with Gaussian noise")
    print("   - A full implementation would use GP posterior + Expected Improvement")
    
    print("\n3. Sample Efficiency:")
    print(f"   - Found best score {results['best_score']:.6f} in {results['elapsed_time']:.1f}s")
    print("   - Random Search may need more iterations to reach the same score")
    
    print("\n✅ Bayesian Optimization demonstration complete!")
    print("\nNote: For production, use skopt.BayesSearchCV or optuna for better:")
    print("  - Gaussian Process implementation")
    print("  - Acquisition function optimization")
    print("  - Parallelization support")
    print("  - Categorical hyperparameter handling")

## ⏱️ Early Stopping: Prevent Overfitting and Save Time

### What is Early Stopping?

**Early Stopping** is a technique to stop training when performance on a validation set stops improving, preventing overfitting and reducing computation time. While not strictly hyperparameter tuning, it's a critical technique that works synergistically with tuning.

### How Early Stopping Works

**Training process without early stopping**:
```
Epoch 1:   Train loss = 0.500, Val loss = 0.520
Epoch 10:  Train loss = 0.300, Val loss = 0.350
Epoch 50:  Train loss = 0.100, Val loss = 0.180  ← Best validation
Epoch 100: Train loss = 0.050, Val loss = 0.220  ← Overfitting!
Epoch 200: Train loss = 0.010, Val loss = 0.350  ← Severe overfitting
```

**Training with early stopping (patience=10)**:
```
Epoch 1:   Train loss = 0.500, Val loss = 0.520
Epoch 10:  Train loss = 0.300, Val loss = 0.350
Epoch 50:  Train loss = 0.100, Val loss = 0.180  ← Best validation
Epoch 60:  Train loss = 0.080, Val loss = 0.185  ← No improvement for 10 epochs
→ STOP and restore weights from epoch 50 ✅
```

**Result**:
- Skipped 140 of 200 epochs (70% less training, roughly 3.3× faster)
- Better generalization (val loss 0.180 vs 0.350)

### Mathematical Formulation

**Stopping criterion**: Stop when no improvement for $p$ consecutive epochs

$$\text{Stop if } \forall i \in [t-p+1, t]: L_{val}^{(i)} \geq L_{val}^{(t-p)}$$

where:
- $L_{val}^{(i)}$ = validation loss at epoch $i$
- $p$ = patience (number of epochs to wait)
- $t$ = current epoch

**Best model**: Restore weights from epoch $t^* = \arg\min_{i \leq t} L_{val}^{(i)}$
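
The criterion can be sketched in a few lines of plain Python. The function name `early_stopping_epoch` and the synthetic loss curve are assumptions for illustration; the curve is built to mirror the example above (best at epoch 50, patience of 10).

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Synthetic curve: improves steadily until epoch 50, then stays worse
val_losses = [1.0 - 0.01 * i for i in range(1, 51)] + [0.55] * 150

stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=10)
print(f"Stopped at epoch {stop_epoch}, restoring weights from epoch {best_epoch}")
```

The returned `best_epoch` plays the role of $t^*$: it is the epoch whose weights would be restored.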

### Key Parameters

#### 1. **Patience**
- **Definition**: Number of epochs to wait for improvement
- **Too small (patience=5)**: Stops too early, misses better minima
- **Too large (patience=50)**: Defeats purpose, wastes computation
- **Typical**: patience=10-20 for most problems

#### 2. **Min Delta**
- **Definition**: Minimum change to qualify as improvement
- **Purpose**: Ignore tiny fluctuations
- **Example**: min_delta=0.001 → improvement must be >0.001
- **Typical**: min_delta=0.0001 (small enough to not miss improvements)

#### 3. **Baseline**
- **Definition**: Minimum validation metric to exceed
- **Purpose**: Ensure model is learning something useful
- **Example**: baseline=0.50 for binary classification (better than random)

### Application to Different Models

#### Iterative Models (Gradient Boosting, Neural Networks)
✅ **Natural fit**: Stop adding trees/epochs when val performance plateaus

**Gradient Boosting**:
```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=1000,        # Large max (will stop early)
    validation_fraction=0.2,  # Hold out 20% for early stopping
    n_iter_no_change=10,      # Patience
    tol=0.0001,               # Min delta
    random_state=42
)
model.fit(X_train, y_train)
print(f"Stopped at {model.n_estimators_} trees")  # Often <<1000
```

**XGBoost**:
```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=1000,
    early_stopping_rounds=10,  # Patience
    random_state=42
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)
print(f"Best iteration: {model.best_iteration}")
```

**Neural Networks (Keras)**:
```python
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,
    min_delta=0.0001,
    restore_best_weights=True
)

model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=200,
    callbacks=[early_stop]
)
```

#### Non-Iterative Models (Random Forest, SVM)
❌ **Not directly applicable**: No iterative training process

**Workaround**: Use validation curve to find optimal hyperparameter
```python
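import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve

# Candidate values, kept in a variable so they can be indexed afterwards
param_range = [10, 50, 100, 150, 200, 250, 300]

train_scores, val_scores = validation_curve(
    RandomForestRegressor(random_state=42),
    X, y,
    param_name='n_estimators',
    param_range=param_range,
    cv=5
)

# Pick the value with the highest mean validation score
optimal_n = param_range[np.argmax(val_scores.mean(axis=1))]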
```

### Integration with Cross-Validation

**Problem**: Early stopping needs its own validation set, but each CV split trains on its whole training portion and the held-out fold must stay untouched for scoring

**Solution 1: Nested Validation (Recommended)**
```python
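import xgboost as xgb
from sklearn.model_selection import train_test_split

# Outer CV loop (outer_cv is any splitter, e.g. KFold(n_splits=5))
for train_idx, test_idx in outer_cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    
    # Split train into train/val for early stopping
    X_train_inner, X_val, y_train_inner, y_val = train_test_split(
        X_train, y_train, test_size=0.2, random_state=42
    )
    
    # Train with early stopping (set in the constructor, as in the XGBoost example above)
    model = xgb.XGBRegressor(n_estimators=1000, early_stopping_rounds=10)
    model.fit(
        X_train_inner, y_train_inner,
        eval_set=[(X_val, y_val)],
        verbose=False
    )
    
    # Evaluate on outer test set
    score = model.score(X_test, y_test)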
```

**Solution 2: Cross-Validated Early Stopping**
```python
# Use built-in CV for early stopping
model = GradientBoostingRegressor(
    n_estimators=1000,
    validation_fraction=0.2,  # 20% of each fold's training split
    n_iter_no_change=10
)

cv_scores = cross_val_score(model, X, y, cv=5)
```

### Semiconductor-Specific Applications

#### Test Time Prediction (Gradient Boosting)
```python
model = GradientBoostingRegressor(
    n_estimators=500,          # Max trees
    learning_rate=0.1,
    max_depth=5,
    validation_fraction=0.2,   # Early stopping validation
    n_iter_no_change=15,       # Patient (test time has noise)
    tol=0.5,                   # Min loss improvement (squared-error units, ms² if y is in ms)
    random_state=42
)

model.fit(X_train, y_train)
print(f"Stopped at {model.n_estimators_} trees (saved {500 - model.n_estimators_} iterations)")
```

**Typical result**: Stops at ~150-200 trees (60-70% fewer trees than the 500 maximum)

#### Yield Prediction (XGBoost)
```python
model = xgb.XGBRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=8,
    early_stopping_rounds=20,  # More patience (yield has spatial noise)
    eval_metric='rmse'
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)

print(f"Best iteration: {model.best_iteration} (saved {1000 - model.best_iteration} iterations)")
```

### Time Savings Analysis

**Scenario**: Gradient Boosting with hyperparameter tuning

**Without early stopping**:
- Grid Search: 27 configs × 5 folds × 500 trees × 0.1s = **6,750 seconds (1.9 hours)**

**With early stopping (average stop at 200 trees)**:
- Grid Search: 27 configs × 5 folds × 200 trees × 0.1s = **2,700 seconds (45 minutes)**
- **Speedup: 2.5× (60% time reduction)**
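
The same arithmetic as a small sketch, useful for plugging in other grid sizes. The helper name `tuning_time_seconds` is an assumption, and the 0.1 s per tree figure is the one used in the scenario above.

```python
def tuning_time_seconds(n_configs, n_folds, trees_per_fit, seconds_per_tree):
    """Total training time for a grid search, assuming cost scales with tree count."""
    return n_configs * n_folds * trees_per_fit * seconds_per_tree


full = tuning_time_seconds(27, 5, 500, 0.1)   # no early stopping
early = tuning_time_seconds(27, 5, 200, 0.1)  # average stop at 200 trees

speedup = full / early
reduction = 1 - early / full

print(f"Without early stopping: {full:,.0f} s ({full / 3600:.1f} h)")
print(f"With early stopping:    {early:,.0f} s ({early / 60:.0f} min)")
print(f"Speedup: {speedup:.1f}x ({reduction:.0%} time reduction)")
```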

### Common Early Stopping Pitfalls

#### ❌ Pitfall 1: Patience Too Small
**Problem**: Stops at local minimum, misses global minimum
**Example**: Patience=3, stops at epoch 20, but performance improves again at epoch 40
**Solution**: Use patience=10-20 (at least 10% of max epochs)

#### ❌ Pitfall 2: Not Restoring Best Weights
**Problem**: Returns weights from stopped epoch (worse than best)
**Solution**: Set `restore_best_weights=True` (Keras) or use `best_iteration` (XGBoost)

#### ❌ Pitfall 3: Validation Set Too Small
**Problem**: Noisy validation metric triggers early stop
**Solution**: Use at least 20% of training data for validation (or use CV)

#### ❌ Pitfall 4: Wrong Metric
**Problem**: Monitor training loss instead of validation loss
**Solution**: Always monitor validation metric (`val_loss`, not `loss`)

### Advanced: Learning Rate Scheduling with Early Stopping

**Idea**: Reduce learning rate when validation performance plateaus

**Keras example**:
```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,           # Reduce LR by 50%
        patience=5,           # After 5 epochs of no improvement
        min_lr=1e-6
    ),
    EarlyStopping(
        monitor='val_loss',
        patience=15,          # More patience with LR reduction
        restore_best_weights=True
    )
]

model.fit(X_train, y_train, validation_split=0.2, 
         epochs=200, callbacks=callbacks)
```

**Result**: Often improves final performance modestly (on the order of 1-2%) vs a fixed LR, though the gain depends on the problem

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
import time

np.random.seed(42)

def demonstrate_early_stopping():
    """
    Demonstrate early stopping with Gradient Boosting.
    """
    print("="*80)
    print("EARLY STOPPING DEMONSTRATION")
    print("="*80)
    
    # Generate data
    X, y = make_regression(n_samples=1500, n_features=10, noise=15.0, random_state=42)
    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
    
    print(f"\nDataset split:")
    print(f"  Training: {len(y_train)} samples")
    print(f"  Validation: {len(y_val)} samples (for early stopping)")
    print(f"  Test: {len(y_test)} samples (for final evaluation)")
    
    # Train WITHOUT early stopping
    print("\n" + "-"*80)
    print("[1] Training WITHOUT Early Stopping")
    print("-"*80)
    
    start_time = time.time()
    
    model_no_early_stop = GradientBoostingRegressor(
        n_estimators=300,
        learning_rate=0.1,
        max_depth=4,
        random_state=42,
        verbose=0
    )
    
    model_no_early_stop.fit(X_train, y_train)
    
    time_no_early_stop = time.time() - start_time
    
    # Track performance at each stage
    train_scores_no_es = []
    val_scores_no_es = []
    
    for i, (train_pred, val_pred) in enumerate(zip(
        model_no_early_stop.staged_predict(X_train),
        model_no_early_stop.staged_predict(X_val)
    )):
        train_scores_no_es.append(r2_score(y_train, train_pred))
        val_scores_no_es.append(r2_score(y_val, val_pred))
    
    final_val_score_no_es = val_scores_no_es[-1]
    test_score_no_es = model_no_early_stop.score(X_test, y_test)
    
    print(f"Training time: {time_no_early_stop:.2f} seconds")
    print(f"Total trees: {model_no_early_stop.n_estimators}")
    print(f"Final validation R²: {final_val_score_no_es:.6f}")
    print(f"Test R²: {test_score_no_es:.6f}")
    
    # Find best iteration (retrospectively)
    best_iter_no_es = np.argmax(val_scores_no_es) + 1
    best_val_score_no_es = val_scores_no_es[best_iter_no_es - 1]
    print(f"\nRetrospective analysis:")
    print(f"  Best validation R² at iteration {best_iter_no_es}: {best_val_score_no_es:.6f}")
    print(f"  ⚠️  Overfitting: trained {300 - best_iter_no_es} unnecessary iterations!")
    
    # Train WITH early stopping
    print("\n" + "-"*80)
    print("[2] Training WITH Early Stopping (patience=15)")
    print("-"*80)
    
    start_time = time.time()
    
    model_early_stop = GradientBoostingRegressor(
        n_estimators=300,
        learning_rate=0.1,
        max_depth=4,
        validation_fraction=0.2,  # Uses 20% of training for internal validation
        n_iter_no_change=15,      # Patience
        tol=0.0001,               # Minimum improvement
        random_state=42,
        verbose=0
    )
    
    model_early_stop.fit(X_train, y_train)
    
    time_early_stop = time.time() - start_time
    
    actual_trees = model_early_stop.n_estimators_
    val_score_es = model_early_stop.score(X_val, y_val)
    test_score_es = model_early_stop.score(X_test, y_test)
    
    print(f"Training time: {time_early_stop:.2f} seconds")
    print(f"Total trees: {actual_trees} (stopped early!)")
    print(f"Validation R²: {val_score_es:.6f}")
    print(f"Test R²: {test_score_es:.6f}")
    
    # Comparison
    print("\n" + "="*80)
    print("COMPARISON")
    print("="*80)
    
    time_saved = time_no_early_stop - time_early_stop
    time_saved_pct = (time_saved / time_no_early_stop) * 100
    trees_saved = 300 - actual_trees
    trees_saved_pct = (trees_saved / 300) * 100
    
    print(f"\n{'Metric':<30} {'No Early Stop':<20} {'Early Stop':<20} {'Improvement':<20}")
    print("-"*90)
    print(f"{'Training time (s)':<30} {time_no_early_stop:<20.2f} {time_early_stop:<20.2f} {time_saved_pct:<20.1f}% faster")
    print(f"{'Trees trained':<30} {300:<20} {actual_trees:<20} {trees_saved_pct:<20.1f}% fewer")
    print(f"{'Validation R²':<30} {final_val_score_no_es:<20.6f} {val_score_es:<20.6f} {'Better ✅' if val_score_es > final_val_score_no_es else 'Worse':<20}")
    print(f"{'Test R²':<30} {test_score_no_es:<20.6f} {test_score_es:<20.6f} {'Better ✅' if test_score_es > test_score_no_es else 'Worse':<20}")
    
    print("\n✅ Early stopping provides:")
    print(f"   - {time_saved_pct:.1f}% faster training")
    print(f"   - {trees_saved} fewer trees ({trees_saved_pct:.1f}% reduction)")
    print(f"   - Better or similar generalization")
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot 1: Learning curves without early stopping
    iterations = np.arange(1, len(train_scores_no_es) + 1)
    
    axes[0].plot(iterations, train_scores_no_es, label='Train', linewidth=2, color='blue')
    axes[0].plot(iterations, val_scores_no_es, label='Validation', linewidth=2, color='green')
    axes[0].axvline(best_iter_no_es, color='red', linestyle='--', linewidth=2, 
                   label=f'Best Val (iter {best_iter_no_es})')
    axes[0].axvline(actual_trees, color='orange', linestyle='--', linewidth=2,
                   label=f'Early Stop (iter {actual_trees})')
    
    # Shade overfitting region
    axes[0].axvspan(best_iter_no_es, 300, alpha=0.2, color='red', label='Overfitting Region')
    
    axes[0].set_xlabel('Number of Trees (Iterations)', fontsize=11, fontweight='bold')
    axes[0].set_ylabel('R² Score', fontsize=11, fontweight='bold')
    axes[0].set_title('Learning Curves: Without Early Stopping\n(Red region = wasted computation)', 
                     fontsize=12, fontweight='bold')
    axes[0].legend()
    axes[0].grid(alpha=0.3)
    
    # Plot 2: Performance comparison
    methods = ['No Early Stop\n(300 trees)', f'Early Stop\n({actual_trees} trees)']
    val_scores = [final_val_score_no_es, val_score_es]
    test_scores = [test_score_no_es, test_score_es]
    
    x = np.arange(len(methods))
    width = 0.35
    
    bars1 = axes[1].bar(x - width/2, val_scores, width, label='Validation', 
                       alpha=0.7, edgecolor='black')
    bars2 = axes[1].bar(x + width/2, test_scores, width, label='Test', 
                       alpha=0.7, edgecolor='black')
    
    axes[1].set_ylabel('R² Score', fontsize=11, fontweight='bold')
    axes[1].set_title('Performance Comparison\n(Early stopping achieves same/better with less time)', 
                     fontsize=12, fontweight='bold')
    axes[1].set_xticks(x)
    axes[1].set_xticklabels(methods)
    axes[1].legend()
    axes[1].grid(alpha=0.3, axis='y')
    
    # Add value labels
    for bars in [bars1, bars2]:
        for bar in bars:
            height = bar.get_height()
            axes[1].text(bar.get_x() + bar.get_width()/2., height + 0.01,
                        f'{height:.4f}', ha='center', va='bottom', fontsize=9)
    
    plt.tight_layout()
    plt.show()
    
    return {
        'time_saved_pct': time_saved_pct,
        'trees_saved': trees_saved,
        'best_iter': best_iter_no_es,
        'early_stop_iter': actual_trees
    }


# Run demonstration
if __name__ == "__main__":
    results = demonstrate_early_stopping()
    
    print("\n" + "="*80)
    print("KEY TAKEAWAYS")
    print("="*80)
    print("\n1. Early stopping automatically finds optimal number of iterations")
    print(f"   - Retrospective best: {results['best_iter']} trees")
    print(f"   - Early stop found: {results['early_stop_iter']} trees")
    print(f"   - Difference: {abs(results['best_iter'] - results['early_stop_iter'])} trees")
    
    print(f"\n2. Time savings: {results['time_saved_pct']:.1f}%")
    print(f"   - Saved {results['trees_saved']} tree trainings")
    print("   - Savings repeat for every configuration evaluated during hyperparameter tuning")
    
    print("\n3. Better or equal generalization")
    print("   - Prevents overfitting by stopping at validation optimum")
    print("   - Test performance maintained or improved")
    
    print("\n4. Minimal configuration")
    print("   - Just set: validation_fraction, n_iter_no_change, tol")
    print("   - Works out-of-the-box for most problems")
    
    print("\n✅ Early stopping demonstration complete!")

## 🔪 Successive Halving: Resource-Efficient Search

### What is Successive Halving?

**Successive Halving** is a hyperparameter optimization strategy that allocates more resources (data, iterations) to promising configurations and quickly discards poor ones. It's like a tournament where weak candidates are eliminated early.

### The Core Idea: Adaptive Resource Allocation

**Traditional approach (Grid/Random Search)**:
- Evaluate all N configurations with full budget (e.g., 1000 samples)
- Every config gets equal resources, even obviously bad ones

**Successive Halving**:
- Start with N configurations and small budget (e.g., 100 samples)
- Evaluate all N configurations
- Keep top 50%, discard rest
- Double the budget (200 samples)
- Evaluate remaining configurations
- Repeat until 1 winner

### Algorithm

**Input**:
- N = number of configurations
- R = maximum resource (samples, epochs, etc.)
- η = reduction factor (typically 3)
- k = number of elimination rounds, $k = \lceil \log_\eta N \rceil$

**Process**:
```
Round 1: N configs with R/η^k samples
  → Keep top N/η configs

Round 2: N/η configs with R/η^(k-1) samples
  → Keep top N/η^2 configs

Round 3: N/η^2 configs with R/η^(k-2) samples
  → Keep top N/η^3 configs

...

Final: 1 config with R samples (full budget)
```
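The rounds above can be sketched as a short loop. This is a minimal illustration, not a library implementation: `successive_halving`, `noisy_score`, and the toy search space are hypothetical names and data, and the per-round budget follows the R/η^(k-i) rule from the process description.

```python
import math
import random


def successive_halving(configs, evaluate, max_resource, eta=3):
    """Return the surviving config and a log of (n_configs, resource) per round."""
    # Smallest k with eta**k >= len(configs): the number of elimination rounds
    k = 0
    while eta ** k < len(configs):
        k += 1
    survivors = list(configs)
    log = []
    for i in range(k + 1):
        resource = max(1, max_resource // eta ** (k - i))
        log.append((len(survivors), resource))
        if len(survivors) == 1:
            break  # final round: the last survivor gets the full budget
        ranked = sorted(survivors, key=lambda c: evaluate(c, resource), reverse=True)
        survivors = ranked[:max(1, math.ceil(len(survivors) / eta))]
    return survivors[0], log


# Toy problem (hypothetical): configs are numbers, the ideal value is 0.3,
# and the score becomes less noisy as the resource grows.
rng = random.Random(0)


def noisy_score(config, resource):
    return -abs(config - 0.3) + rng.gauss(0, 1 / math.sqrt(resource))


candidates = [i / 26 for i in range(27)]
winner, schedule = successive_halving(candidates, noisy_score, max_resource=1000)
print(schedule)
print(f"winner={winner:.3f}")
```

Because early rounds are noisy, the winner is not guaranteed to be the true optimum; with a noise-free `evaluate` the loop always returns the best candidate.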

**Example** (N=27, η=3, R=1000 samples, so k=3):
```
Round 1: 27 configs × 37 samples   = 999 samples   → Keep top 9
Round 2:  9 configs × 111 samples  = 999 samples   → Keep top 3
Round 3:  3 configs × 333 samples  = 999 samples   → Keep top 1
Round 4:  1 config  × 1000 samples = 1000 samples  → Winner
-------------------------------------------------------------
Total:                              = 3997 samples

vs Random Search: 27 configs × 1000 samples = 27,000 samples
Speedup: ≈6.8×
```

### Mathematical Analysis

**Resource usage**: $\sum_{i=0}^{k} \frac{N}{\eta^i} \cdot \frac{R}{\eta^{k-i}}$

Every term equals $N \cdot R / \eta^k$, so the sum simplifies to: $(k+1) \cdot N \cdot R / \eta^k$

**For η=3**:
- Random Search: $N \cdot R$
- Successive Halving: $(k+1) \cdot N \cdot R / 3^k$

**Speedup**: $3^k / (k+1)$

**Example** (N=27, k=3):
- Speedup: $3^3 / (3+1) = 27 / 4 = 6.75\times$
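As a sanity check, the resource-usage sum can be evaluated term by term with exact fractions. This is a small sketch; `halving_cost` is a hypothetical helper name, and the inputs are the N=27, R=1000, η=3, k=3 values used in this section.

```python
from fractions import Fraction


def halving_cost(n_configs, max_resource, eta, k):
    """Sum over rounds i=0..k of (N / eta^i) * (R / eta^(k-i)), computed exactly."""
    return sum(
        Fraction(n_configs, eta ** i) * Fraction(max_resource, eta ** (k - i))
        for i in range(k + 1)
    )


N, R, eta, k = 27, 1000, 3, 3
cost = halving_cost(N, R, eta, k)
closed_form = Fraction((k + 1) * N * R, eta ** k)  # every term equals N*R/eta^k
speedup = Fraction(N * R) / cost                   # versus N configs at full budget
print(cost, closed_form, float(speedup))
```

Each of the k+1 rounds costs the same amount, so the total grows with the number of rounds rather than with the number of configurations.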

### Implementation in sklearn

```python
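from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (enables the import below)
from sklearn.model_selection import HalvingRandomSearchCV

# `model`, `param_dist`, `X`, and `y` are assumed to be defined earlier
search = HalvingRandomSearchCV(
    estimator=model,
    param_distributions=param_dist,
    n_candidates=27,       # Must be an int when min_resources='exhaust'
    factor=3,              # Reduction factor (η)
    resource='n_samples',  # Budget to grow: 'n_samples' or an integer estimator parameter
    max_resources='auto',  # 'auto' = all samples when resource='n_samples'
    min_resources='exhaust',  # Pick the first-round budget so the last round uses max_resources
    aggressive_elimination=False,  # If True, replay the first round when resources are too scarce
    cv=5,
    random_state=42
)

search.fit(X, y)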
```
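For an end-to-end run, the following sketch fits a small search on synthetic data and prints the schedule scikit-learn actually used. The dataset, the two tuned parameters, `n_candidates=9`, and `min_resources=60` are illustrative assumptions chosen to keep the run fast.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV

# Synthetic regression data (assumption: stands in for real training data)
X, y = make_regression(n_samples=600, n_features=10, noise=10.0, random_state=0)

search = HalvingRandomSearchCV(
    estimator=GradientBoostingRegressor(n_estimators=30, random_state=0),
    param_distributions={
        "max_depth": randint(2, 6),
        "learning_rate": loguniform(0.01, 0.3),
    },
    n_candidates=9,        # configurations sampled for the first round
    factor=3,              # keep roughly one third each round
    resource="n_samples",  # grow the training subset between rounds
    min_resources=60,      # first-round budget per configuration
    cv=3,
    random_state=0,
)
search.fit(X, y)

# Fitted attributes record the candidates and budget used in each round
for n_cand, n_res in zip(search.n_candidates_, search.n_resources_):
    print(f"{n_cand} candidates x {n_res} samples")
print("best params:", search.best_params_)
```

Inspecting `n_candidates_` and `n_resources_` after fitting is a quick way to confirm that the schedule matches what you intended before launching a long search.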

### Resource Types

#### 1. **n_samples** (Default)
- Start with small subset of training data
- Increase data size for promising configs
- **Use when**: Model training time scales with data size

**Example**: Random Forest
```python
# Round 1: Train on 1000 samples
# Round 2: Train on 3000 samples  
# Round 3: Train on 9000 samples
```

#### 2. **n_estimators** (Iterative models)
- Start with few trees/epochs
- Increase for promising configs
- **Use when**: Model is iterative (boosting, bagging)

**Example**: Gradient Boosting
```python
# Round 1: 50 trees
# Round 2: 150 trees
# Round 3: 450 trees
```

### Advantages

1. **Sample efficient**: Uses several times fewer resources than evaluating every config at full budget
2. **Early elimination**: Doesn't waste time on bad configs
3. **Adaptive**: Allocates more resources to promising candidates
4. **Parallelizable**: Each round can run in parallel

### Disadvantages

1. **Non-exhaustive**: May eliminate good configs early (if unlucky on small budget)
2. **Sensitive to η**: Wrong reduction factor can hurt performance
3. **Requires iterative model or subsampling**: Not all models support partial training
4. **Early performance ≠ final performance**: Good on 100 samples does not guarantee good on 10K samples

### HalvingGridSearchCV vs HalvingRandomSearchCV

**HalvingGridSearchCV**:
- Exhaustive grid + successive halving
- Use for small grids (<100 configs)

**HalvingRandomSearchCV**:
- Random sampling + successive halving
- Use for large search spaces (>100 configs)

### Semiconductor-Specific Applications

#### Yield Prediction (XGBoost with 10K wafers)

**Challenge**: Training on 10K wafers takes 3 minutes per config

**Random Search** (50 configs):
- 50 configs × 3 min = **150 minutes (2.5 hours)**

**Successive Halving** (N=54, η=3):
```
Round 1: 54 configs × 1K wafers (~20s)  = 18 min → Keep 18
Round 2: 18 configs × 3K wafers (~60s)  = 18 min → Keep 6
Round 3:  6 configs × 9K wafers (~2.5m) = 15 min → Keep 2
Round 4:  2 configs × 10K wafers (3m)   =  6 min → Winner
--------------------------------------------------------
Total: 57 minutes (62% less time)
```

**Result**: Screens more configs (54 vs 50) in 38% of the time.
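The totals in this comparison follow directly from the per-round figures; the short check below recomputes them from the numbers listed in the round table above.

```python
# (configs, seconds per config) for each round, taken from the table above
rounds = [(54, 20), (18, 60), (6, 150), (2, 180)]

halving_minutes = sum(n * secs for n, secs in rounds) / 60
random_minutes = 50 * 3  # 50 configs at 3 minutes each
fraction = halving_minutes / random_minutes

print(f"Successive Halving: {halving_minutes:.0f} min")
print(f"Random Search:      {random_minutes} min")
print(f"Fraction of time:   {fraction:.0%}")
```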

### Common Pitfalls

#### ❌ Pitfall 1: Early Performance Misleading
**Problem**: Config performs well on 100 samples but poorly on 10K
**Example**: Overfitting model (deep tree) does great on small data
**Solution**: Use `min_resources` high enough to be representative

#### ❌ Pitfall 2: η Too Large
**Problem**: η=5 eliminates 80% after round 1, which is too aggressive
**Solution**: Use η=2 or η=3 (standard choices)

#### ❌ Pitfall 3: Non-Iterative Model
**Problem**: Models without an integer budget parameter (e.g., SVM, k-NN) cannot use `n_estimators` or a similar parameter as the resource
**Solution**: Use the `n_samples` resource instead. Tree ensembles such as Random Forest do accept `resource='n_estimators'`, provided `n_estimators` is not also in the parameter grid

#### ❌ Pitfall 4: Deterministic Models with Small Samples
**Problem**: Decision trees may give the same result on a small sample
**Solution**: Ensure `min_resources` is large enough for meaningful differences

### Hyperband: Enhanced Successive Halving

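**Problem**: Successive Halving fixes the trade-off between the number of configurations and the starting budget per configuration up front. Too small a starting budget eliminates slow starters; too large a starting budget leaves few configurations to compare.

**Hyperband solution**: Run Successive Halving several times ("brackets"), each with a different trade-off, and keep the best result overall

**Algorithm**: Try different budget allocations:
- Strategy 1: Many configs, small budget (explore)
- Strategy 2: Medium configs, medium budget (balance)
- Strategy 3: Few configs, large budget (exploit)

**Implementation**: scikit-learn does not provide Hyperband. `HalvingRandomSearchCV` runs a single Successive Halving bracket; libraries such as Optuna and Ray Tune offer Hyperband-style schedulers.
```python
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV

# One Successive Halving bracket (not Hyperband)
search = HalvingRandomSearchCV(
    estimator=model,
    param_distributions=param_dist,
    factor=3,
    max_resources='auto',  # Cap the budget at the full dataset size
    cv=5
)
```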
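The bracket idea can be sketched independently of any library. The helper below computes one commonly described Hyperband schedule; `hyperband_brackets` is a hypothetical name, and the exact rounding of the initial configuration counts is an assumption that varies between descriptions of the algorithm.

```python
def hyperband_brackets(max_resource, eta=3):
    """List the brackets as (s, initial configs, starting resource per config)."""
    # s_max = floor(log_eta(max_resource)), found with integers to avoid float error
    s_max = 0
    while eta ** (s_max + 1) <= max_resource:
        s_max += 1
    brackets = []
    for s in range(s_max, -1, -1):
        numerator = (s_max + 1) * eta ** s
        n_configs = -(-numerator // (s + 1))        # ceiling division
        start_resource = max_resource // eta ** s   # budget in the bracket's first round
        brackets.append((s, n_configs, start_resource))
    return brackets


for s, n_configs, start_resource in hyperband_brackets(81, eta=3):
    print(f"bracket s={s}: {n_configs} configs starting at {start_resource} units")
```

The first bracket is the most exploratory (many configs, tiny budget) and the last is plain random search at full budget; each bracket is then run as one Successive Halving pass.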

### When to Use Successive Halving

| **Use Successive Halving When** | **Use Random Search When** |
|----------------------------------|---------------------------|
| ✅ Training time >1 minute per config | ❌ Training time <10 seconds |
| ✅ Many configurations (>50) | ❌ Few configurations (<20) |
| ✅ Model supports subsampling/iterations | ❌ Model requires full data |
| ✅ Large dataset (>10K samples) | ❌ Small dataset (<1K samples) |
| ✅ Limited computational budget | ❌ Unlimited budget |

### Practical Recommendations

#### 1. **Start Simple**
- Try Random Search first (baseline)
- If >1 hour runtime → Try Successive Halving

#### 2. **Choose η Wisely**
- η=3: Standard choice (eliminate 67% each round)
- η=2: Conservative (eliminate 50%, more rounds)
- η=4: Aggressive (eliminate 75%, fewer rounds)

#### 3. **Set min_resources**
- Too small: Early performance unreliable
- Too large: Defeats speedup purpose
- Typical: 10% of max_resources

#### 4. **Monitor Early Rounds**
- Check if eliminated configs had potential
- Adjust η or min_resources if needed
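To compare reduction factors before committing to one, a small helper can count how many elimination rounds each choice implies. This is an illustrative sketch; `elimination_rounds` is a hypothetical name and the 54-config search is an example value.

```python
def elimination_rounds(n_configs, eta):
    """Count rounds needed to cut n_configs down to one, keeping ceil(n/eta) each time."""
    rounds = 0
    n = n_configs
    while n > 1:
        n = -(-n // eta)  # ceiling division: survivors of this round
        rounds += 1
    return rounds


for eta in (2, 3, 4):
    eliminated = 1 - 1 / eta
    print(f"eta={eta}: eliminates {eliminated:.0%} per round, "
          f"{elimination_rounds(54, eta)} rounds for 54 configs")
```

Fewer rounds mean less total compute but also fewer chances for a slow-starting configuration to recover.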

## 🔬 Complete Example: Semiconductor Yield Prediction with Comprehensive Tuning

### Problem Statement

A semiconductor fab needs to predict wafer-level yield from parametric test data. The model must:
1. Achieve R² > 0.85 (actionable for manufacturing)
2. Train in <30 minutes (weekly retraining cycle)
3. Generalize to new wafers (not overfit to spatial patterns)

### Dataset

- **Samples**: 5,000 devices from 50 wafers (100 devices/wafer)
- **Features**: 13 parametric measurements (Vdd, Idd, frequency, power, temperature, etc.), including 3 derived features
- **Target**: Device-level yield score (0-100%)
- **Challenge**: Group structure (wafer-level correlation)

### Approach: Multi-Strategy Comparison

We'll compare 4 tuning strategies:
1. **Grid Search (Baseline)**: Exhaustive search
2. **Random Search**: Efficient sampling
3. **Random Search + Early Stopping**: Fast iteration
4. **Refined Search**: Bayesian-inspired grid refinement around the best Random Search result

### Strategy Selection Flowchart

```mermaid
graph TD
    A[Need to tune hyperparameters] --> B{Training time<br/>per config?}
    
    B -->|<10 seconds| C[Grid Search or Random Search]
    B -->|10s - 1 min| D[Random Search]
    B -->|>1 minute| E[Bayesian Optimization]
    
    C --> F{Search space size?}
    F -->|<100 configs| G[Grid Search ✅]
    F -->|>100 configs| H[Random Search ✅]
    
    D --> I{Model iterative?}
    I -->|Yes| J[Add Early Stopping ✅]
    I -->|No| K[Random Search ✅]
    
    E --> L{Budget?}
    L -->|<50 evals| M[Bayesian Opt ✅]
    L -->|>100 evals| N[Random Search ✅]
    
    J --> O{Dataset large?<br/>>10K samples}
    O -->|Yes| P[Add Successive Halving ✅]
    O -->|No| Q[Early Stop only ✅]
```
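The flowchart can also be written as a plain function, which is handy for documenting the choice in a pipeline. This is a sketch: `recommend_strategy` and its argument names are hypothetical, and the cases the chart leaves unlabeled (exactly 100 configs, budgets of 50-100 evaluations) are assigned to Random Search as an assumption.

```python
def recommend_strategy(seconds_per_config, n_configs=0, iterative=False,
                       n_samples=0, eval_budget=0):
    """Walk the flowchart and return the recommended tuning strategy."""
    if seconds_per_config < 10:
        # Cheap evaluations: grid search is affordable for small spaces
        return "Grid Search" if n_configs < 100 else "Random Search"
    if seconds_per_config <= 60:
        if not iterative:
            return "Random Search"
        if n_samples > 10_000:
            return "Random Search + Early Stopping + Successive Halving"
        return "Random Search + Early Stopping"
    # Expensive evaluations (>1 minute each)
    # Assumption: budgets the chart leaves unlabeled (50-100 evals) go to Random Search
    return "Bayesian Optimization" if eval_budget < 50 else "Random Search"


print(recommend_strategy(5, n_configs=36))
print(recommend_strategy(30, iterative=True, n_samples=50_000))
print(recommend_strategy(180, eval_budget=30))
```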

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, GroupKFold, cross_val_score
from sklearn.metrics import r2_score, mean_absolute_error
from scipy.stats import randint, uniform, loguniform
import matplotlib.pyplot as plt
import time

np.random.seed(42)

print("="*80)
print("COMPLETE EXAMPLE: SEMICONDUCTOR YIELD PREDICTION")
print("Comparing 4 Hyperparameter Tuning Strategies")
print("="*80)

# Generate synthetic wafer yield data
def generate_wafer_data(n_wafers=50, devices_per_wafer=100):
    """Generate semiconductor wafer data with spatial correlation."""
    data = []
    
    for wafer_id in range(n_wafers):
        # Wafer-level process variation
        wafer_offset = np.random.normal(0, 5)
        
        for device_id in range(devices_per_wafer):
            # Device features (parametric measurements)
            vdd = np.random.normal(1.0, 0.05)
            idd = np.random.normal(100, 10)
            frequency = np.random.uniform(1.5, 3.5)
            power = idd * vdd
            temperature = np.random.normal(25, 3)
            
            # Additional features
            leakage = np.random.exponential(5)
            threshold_voltage = np.random.normal(0.4, 0.05)
            resistance = np.random.normal(100, 15)
            capacitance = np.random.normal(50, 8)
            noise_margin = np.random.uniform(0.2, 0.5)
            
            # Derived features
            power_efficiency = power / frequency
            thermal_stress = temperature * power
            electrical_balance = vdd / threshold_voltage
            
            # Target: Yield score with wafer-level correlation
            yield_score = (
                70 +  # Baseline
                10 * (1.0 - vdd) +  # Lower voltage better
                0.05 * (100 - idd) +  # Lower current better
                3 * frequency +  # Higher frequency better
                -0.2 * power +  # Lower power better
                -0.3 * (temperature - 25) +  # Lower temperature better (relative to 25°C)
                wafer_offset +  # Wafer-level effect
                np.random.normal(0, 2)  # Device noise
            )
            
            yield_score = np.clip(yield_score, 0, 100)
            
            data.append({
                'wafer_id': wafer_id,
                'vdd': vdd,
                'idd': idd,
                'frequency': frequency,
                'power': power,
                'temperature': temperature,
                'leakage': leakage,
                'threshold_voltage': threshold_voltage,
                'resistance': resistance,
                'capacitance': capacitance,
                'noise_margin': noise_margin,
                'power_efficiency': power_efficiency,
                'thermal_stress': thermal_stress,
                'electrical_balance': electrical_balance,
                'yield_score': yield_score
            })
    
    return pd.DataFrame(data)

# Generate data
print("\n[1] Generating Data...")
df = generate_wafer_data(n_wafers=50, devices_per_wafer=100)

print(f"✅ Generated {len(df)} devices from {df['wafer_id'].nunique()} wafers")
print(f"   Features: {df.shape[1] - 2} parametric measurements")
print(f"   Target: yield_score (mean={df['yield_score'].mean():.1f}%, std={df['yield_score'].std():.1f}%)")

# Prepare data
feature_cols = [c for c in df.columns if c not in ['wafer_id', 'yield_score']]
X = df[feature_cols].values
y = df['yield_score'].values
groups = df['wafer_id'].values

print(f"\n[2] Data prepared:")
print(f"   X shape: {X.shape}")
print(f"   y shape: {y.shape}")
print(f"   groups (wafers): {len(np.unique(groups))}")

# Define parameter grids/distributions
param_grid_small = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5, 7],
    'min_samples_split': [5, 10]
}  # 3*2*3*2 = 36 configs

param_dist = {
    'n_estimators': randint(100, 400),
    'learning_rate': loguniform(0.01, 0.2),
    'max_depth': randint(3, 10),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'subsample': uniform(0.7, 0.3)
}  # Continuous distributions

# Strategy 1: Grid Search
print("\n" + "="*80)
print("[STRATEGY 1] Grid Search (Baseline)")
print("="*80)

start_time = time.time()

grid_search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid_small,
    cv=GroupKFold(n_splits=5),  # Group by wafer
    scoring='r2',
    n_jobs=-1,
    verbose=0
)

grid_search.fit(X, y, groups=groups)

grid_time = time.time() - start_time

print(f"⏱️  Time: {grid_time:.2f} seconds")
print(f"🔍 Configurations evaluated: {len(grid_search.cv_results_['params'])}")
print(f"🏆 Best R²: {grid_search.best_score_:.6f}")
print(f"📊 Best params: {grid_search.best_params_}")

# Strategy 2: Random Search
print("\n" + "="*80)
print("[STRATEGY 2] Random Search")
print("="*80)

start_time = time.time()

random_search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_dist,
    n_iter=40,  # Comparable number of configs to the grid (40 vs 36)
    cv=GroupKFold(n_splits=5),
    scoring='r2',
    n_jobs=-1,
    verbose=0,
    random_state=42
)

random_search.fit(X, y, groups=groups)

random_time = time.time() - start_time

print(f"⏱️  Time: {random_time:.2f} seconds")
print(f"🔍 Configurations evaluated: {random_search.n_iter}")
print(f"🏆 Best R²: {random_search.best_score_:.6f}")
print(f"📊 Best params: {random_search.best_params_}")

# Strategy 3: Random Search with Early Stopping
print("\n" + "="*80)
print("[STRATEGY 3] Random Search + Early Stopping")
print("="*80)

# Modify param dist to include early stopping
param_dist_es = param_dist.copy()
param_dist_es['n_estimators'] = [500]  # Large max
param_dist_es['validation_fraction'] = [0.2]
param_dist_es['n_iter_no_change'] = [15]

start_time = time.time()

random_search_es = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_dist_es,
    n_iter=30,  # Fewer configs since early stopping saves time
    cv=GroupKFold(n_splits=5),
    scoring='r2',
    n_jobs=-1,
    verbose=0,
    random_state=42
)

random_search_es.fit(X, y, groups=groups)

random_es_time = time.time() - start_time

print(f"⏱️  Time: {random_es_time:.2f} seconds")
print(f"🔍 Configurations evaluated: {random_search_es.n_iter}")
print(f"🏆 Best R²: {random_search_es.best_score_:.6f}")
print(f"📊 Best params: {random_search_es.best_params_}")

# Strategy 4: Manual Bayesian-inspired (best region refinement)
print("\n" + "="*80)
print("[STRATEGY 4] Refined Search (Best Region from Random)")
print("="*80)

# Extract best region from random search
best_params = random_search.best_params_
refined_grid = {
    'n_estimators': [
        int(best_params['n_estimators'] * 0.8),
        best_params['n_estimators'],
        int(best_params['n_estimators'] * 1.2)
    ],
    'learning_rate': [
        best_params['learning_rate'] * 0.8,
        best_params['learning_rate'],
        best_params['learning_rate'] * 1.2
    ],
    'max_depth': [
        max(2, best_params['max_depth'] - 1),
        best_params['max_depth'],
        best_params['max_depth'] + 1
    ]
}  # 3*3*3 = 27 configs

start_time = time.time()

refined_search = GridSearchCV(
    GradientBoostingRegressor(
        min_samples_split=best_params['min_samples_split'],
        min_samples_leaf=best_params['min_samples_leaf'],
        subsample=best_params['subsample'],
        random_state=42
    ),
    refined_grid,
    cv=GroupKFold(n_splits=5),
    scoring='r2',
    n_jobs=-1,
    verbose=0
)

refined_search.fit(X, y, groups=groups)

refined_time = time.time() - start_time

print(f"⏱️  Time: {refined_time:.2f} seconds")
print(f"🔍 Configurations evaluated: {len(refined_search.cv_results_['params'])}")
print(f"🏆 Best R²: {refined_search.best_score_:.6f}")
print(f"📊 Best params: {refined_search.best_params_}")

# Final Comparison
print("\n" + "="*80)
print("FINAL COMPARISON")
print("="*80)

results_summary = pd.DataFrame({
    'Strategy': ['Grid Search', 'Random Search', 'Random + Early Stop', 'Refined Search'],
    'Time (s)': [grid_time, random_time, random_es_time, refined_time],
    'Configs': [36, 40, 30, 27],
    'Best R²': [
        grid_search.best_score_,
        random_search.best_score_,
        random_search_es.best_score_,
        refined_search.best_score_
    ]
})

print("\n" + results_summary.to_string(index=False))

# Find best overall
best_idx = results_summary['Best R²'].idxmax()
best_strategy = results_summary.loc[best_idx, 'Strategy']
best_score = results_summary.loc[best_idx, 'Best R²']
best_time = results_summary.loc[best_idx, 'Time (s)']

print(f"\n🏆 WINNER: {best_strategy}")
print(f"   R² = {best_score:.6f}")
print(f"   Time = {best_time:.2f}s")

# Calculate efficiency metrics
results_summary['Time/Config (s)'] = results_summary['Time (s)'] / results_summary['Configs']
results_summary['R²/Time'] = results_summary['Best R²'] / results_summary['Time (s)']

print(f"\n⚡ Efficiency Analysis:")
print(results_summary[['Strategy', 'Time/Config (s)', 'R²/Time']].to_string(index=False))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Performance vs Time
strategies = results_summary['Strategy']
colors = ['blue', 'green', 'orange', 'red']

axes[0].scatter(results_summary['Time (s)'], results_summary['Best R²'], 
               s=200, c=colors, alpha=0.7, edgecolors='black', linewidth=2)

for i, strategy in enumerate(strategies):
    axes[0].annotate(strategy, 
                    (results_summary.loc[i, 'Time (s)'], results_summary.loc[i, 'Best R²']),
                    xytext=(10, 10), textcoords='offset points', fontsize=9)

axes[0].set_xlabel('Time (seconds)', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Best R² Score', fontsize=11, fontweight='bold')
axes[0].set_title('Performance vs Computation Time\n(Upper-left is best: high score, low time)', 
                 fontsize=12, fontweight='bold')
axes[0].grid(alpha=0.3)

# Plot 2: Best score by strategy
axes[1].bar(strategies, results_summary['Best R²'], color=colors, alpha=0.7, edgecolor='black')
axes[1].set_ylabel('Best R² Score', fontsize=11, fontweight='bold')
axes[1].set_title('Best Score by Strategy', fontsize=12, fontweight='bold')
# Rotate the existing tick labels instead of overwriting them
plt.setp(axes[1].get_xticklabels(), rotation=45, ha='right')
axes[1].grid(alpha=0.3, axis='y')

# Add value labels
for i, score in enumerate(results_summary['Best R²']):
    axes[1].text(i, score + 0.002, f'{score:.4f}', 
                ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("\n✅ Complete example finished!")

## 💡 Real-World Project Ideas

Apply hyperparameter tuning to solve real business problems. These project templates balance post-silicon validation (semiconductor industry) with general AI/ML applications.

---

### 🔌 POST-SILICON VALIDATION PROJECTS

#### **Project 1: Wafer-Level Yield Predictor with Spatial Features**

**Objective:** Predict device yield from parametric test data and spatial coordinates (die_x, die_y) to identify problematic wafer regions.

**Why Hyperparameter Tuning Matters:**
- Yield prediction requires balancing spatial patterns vs device-level noise
- Overfitting to training wafers → poor generalization to production
- Critical hyperparameters: tree depth (spatial granularity), learning rate (convergence)

**Recommended Approach:**
- **Model:** Gradient Boosting or Random Forest (handles non-linear spatial patterns)
- **Key Hyperparameters:**
  - `max_depth`: Controls spatial resolution (3-7 for wafer zones, 8-12 for die-level)
  - `n_estimators`: Prevents underfitting (200-500 for 5000+ devices)
  - `min_samples_leaf`: Smooths spatial noise (10-50 devices minimum)
  - `learning_rate`: Balances convergence vs overfitting (0.01-0.1)
- **Tuning Strategy:** Random Search (6+ hyperparameters) → Bayesian Opt refinement
- **Validation:** GroupKFold by wafer_id (prevents spatial leakage)

**Success Metrics:**
- **R² > 0.85**: Accurate enough for manufacturing decisions
- **MAE < 3%**: Absolute yield prediction error within tolerance
- **Generalization test**: Train on first 80% of wafers, test on last 20% (temporal split)

**Business Value:** $500K-$2M annual savings by optimizing test flows based on spatial yield patterns.

---

#### **Project 2: Test Time Optimization with Multi-Objective Tuning**

**Objective:** Minimize test time while maintaining defect detection accuracy (minimize false negatives).

**Why Hyperparameter Tuning Matters:**
- Trade-off: faster test time vs detection rate
- Cost of missed defects >> cost of extra test time
- Need to optimize for **both** speed and accuracy simultaneously

**Recommended Approach:**
- **Model:** XGBoost Classifier (predicts device pass/fail)
- **Key Hyperparameters:**
  - `n_estimators`: More trees = better accuracy but slower inference
  - `max_depth`: Deeper trees = better detection but slower training
  - `scale_pos_weight`: Handles imbalanced data (defects ~1-5%)
  - `subsample`, `colsample_bytree`: Speed up training without sacrificing accuracy
- **Tuning Strategy:** 
  - **Primary:** Bayesian Opt with custom objective: `0.95*F1 - 0.05*inference_time`
  - **Fallback:** Random Search with early stopping (time constraint <1 hour)
- **Validation:** Stratified K-Fold (preserve defect rate)
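The custom objective above mixes a unitless F1 score with a time measurement, so the time term needs a scale. The sketch below normalizes inference time by the 10 ms target from the success metrics; that normalization and the name `tuning_objective` are assumptions, not part of the original formula.

```python
def tuning_objective(f1, inference_ms, time_budget_ms=10.0):
    """0.95 * F1 - 0.05 * normalized inference time (higher is better)."""
    # Assumption: express time as a fraction of the 10 ms per-device budget
    normalized_time = inference_ms / time_budget_ms
    return 0.95 * f1 - 0.05 * normalized_time


# Hypothetical candidates: (F1, inference time in ms)
candidates = {"shallow": (0.88, 2.0), "medium": (0.91, 5.0), "deep": (0.92, 12.0)}
scores = {name: tuning_objective(f1, ms) for name, (f1, ms) in candidates.items()}
best = max(scores, key=scores.get)
print(scores)
print("best:", best)
```

A Bayesian optimizer or random search would maximize this value; the 0.95/0.05 weights keep detection quality dominant while penalizing slow models.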

**Success Metrics:**
- **F1-score > 0.90**: Balances precision and recall for defect detection
- **Inference time < 10ms per device**: Real-time test compatibility
- **False negative rate < 2%**: Critical defects must not escape

**Business Value:** 30-50% test time reduction × 10M devices/year = $2-5M annual savings.

---

#### **Project 3: Adaptive Binning with Cost-Sensitive Hyperparameters**

**Objective:** Classify devices into performance bins (Premium, Standard, Discount) to maximize revenue.

**Why Hyperparameter Tuning Matters:**
- Misclassification costs vary by bin: binning a Premium device as Discount loses $15/device
- Need to optimize for **revenue**, not just accuracy
- Different bins have different feature importance (frequency for Premium, power for Discount)

**Recommended Approach:**
- **Model:** Multi-class Random Forest or Gradient Boosting
- **Key Hyperparameters:**
  - `class_weight`: Custom weights based on revenue impact (Premium=3, Standard=1, Discount=1); scikit-learn's Gradient Boosting has no `class_weight`, so pass `sample_weight` to `fit` instead
  - `max_features`: Controls feature diversity per bin (0.3-0.8 range)
  - `min_samples_split`: Prevents overfitting to noisy bin boundaries
- **Tuning Strategy:**
  - **Primary:** Grid Search with custom scorer: `revenue_per_device`
  - **Metric:** `sum(bin_value * correct_classifications) - sum(misclassification_cost)`
- **Validation:** Stratified K-Fold (preserve bin distribution)
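The revenue metric can be sketched as a plain function of true and predicted bins. All dollar figures below are hypothetical placeholders except the $15 Premium-to-Discount loss quoted above, and `revenue_per_device` is an illustrative name.

```python
# Hypothetical revenue per correctly binned device
BIN_VALUE = {"Premium": 60.0, "Standard": 45.0, "Discount": 30.0}
# Cost per (true bin, predicted bin); unlisted mistakes use DEFAULT_COST (assumption)
MISCLASSIFICATION_COST = {("Premium", "Discount"): 15.0}
DEFAULT_COST = 5.0


def revenue_per_device(y_true, y_pred):
    """sum(bin value of correct predictions) - sum(misclassification costs), per device."""
    total = 0.0
    for true_bin, pred_bin in zip(y_true, y_pred):
        if true_bin == pred_bin:
            total += BIN_VALUE[true_bin]
        else:
            total -= MISCLASSIFICATION_COST.get((true_bin, pred_bin), DEFAULT_COST)
    return total / len(y_true)


y_true = ["Premium", "Premium", "Standard", "Discount"]
y_pred = ["Premium", "Discount", "Standard", "Standard"]
print(revenue_per_device(y_true, y_pred))
```

To use it in Grid Search, the function can be wrapped with `sklearn.metrics.make_scorer(revenue_per_device)` and passed as `scoring`.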

**Success Metrics:**
- **Revenue/device > $45**: Baseline is $40 with rule-based binning
- **Premium bin precision > 95%**: Avoid sending Discount devices to Premium market
- **Overall accuracy > 80%**: Maintain reasonable bin purity

**Business Value:** $5/device improvement × 10M devices/year = $50M annual revenue increase.

---

#### **Project 4: Outlier Detection with Isolation Forest Tuning**

**Objective:** Detect anomalous parametric test results indicating equipment malfunction or process drift.

**Why Hyperparameter Tuning Matters:**
- Too sensitive → false alarms (production line stops unnecessarily)
- Too lenient → miss equipment failures (yield loss)
- Contamination rate unknown (need to tune threshold)

**Recommended Approach:**
- **Model:** Isolation Forest (unsupervised anomaly detection)
- **Key Hyperparameters:**
  - `contamination`: Expected anomaly rate (0.01-0.1, tune with validation set)
  - `n_estimators`: More trees = stable anomaly scores (100-500)
  - `max_samples`: Subsample size affects sensitivity (256-1024 typical)
  - `max_features`: Feature subset per tree (1.0 for all features, 0.5 for diversity)
- **Tuning Strategy:**
  - **Primary:** Grid Search with labeled anomaly validation set (if available)
  - **Fallback:** Random Search + visual inspection of anomaly scores
- **Validation:** Time-series split (train on Week 1-8, validate on Week 9-10)
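
When a labeled validation set exists, choosing the alarm threshold amounts to sweeping candidate cut-offs over the anomaly scores. The sketch below assumes higher scores mean more anomalous (with scikit-learn's `IsolationForest`, where lower `score_samples` values are more abnormal, the scores would be negated first) and uses the 70% precision target from the success metrics; the function name and toy data are illustrative.

```python
def pick_threshold(scores, labels, min_precision=0.70):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds that meet min_precision, or None if no threshold qualifies.

    scores: anomaly scores, higher = more anomalous (assumed convention).
    labels: 1 for a confirmed anomaly, 0 otherwise.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        return None
    best = None
    for threshold in sorted(set(scores)):
        flagged = [lab for s, lab in zip(scores, labels) if s >= threshold]
        true_positives = sum(flagged)
        precision = true_positives / len(flagged)
        recall = true_positives / total_positives
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best


scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]
print(pick_threshold(scores, labels))  # (0.3, 0.75, 1.0)
```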

**Success Metrics:**
- **Precision > 70%**: 7/10 alarms are true equipment issues
- **Recall > 85%**: Catch 85% of actual equipment failures
- **Alarm rate < 5 per day**: Manageable for engineering team

**Business Value:** Prevent $100K-$500K yield loss per missed equipment failure (10-20 failures/year).

---

### 🌍 GENERAL AI/ML PROJECTS

#### **Project 5: Customer Churn Prediction with Imbalanced Data**

**Objective:** Predict which customers will cancel subscription next month to enable proactive retention.

**Why Hyperparameter Tuning Matters:**
- Churn rate typically 2-10% (highly imbalanced)
- Cost of retention ($50 incentive) << cost of losing customer ($500 LTV)
- Need to balance precision (avoid wasting incentives) vs recall (catch churners)

**Recommended Approach:**
- **Model:** XGBoost or Random Forest Classifier
- **Key Hyperparameters:**
  - `scale_pos_weight`: Handle imbalance (set to `(1-churn_rate)/churn_rate`)
  - `max_depth`: Prevent overfitting to rare churn patterns (3-7)
  - `learning_rate`: Slow learning for stable churn signals (0.01-0.1)
  - `subsample`: Reduces overfitting (0.6-0.9)
- **Tuning Strategy:** Random Search + early stopping (prioritize F2-score: recall > precision)
- **Validation:** Time-series split (train on Month 1-6, validate on Month 7-8)
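
Two quantities from the list above are easy to compute by hand: the `scale_pos_weight` ratio and the F2-score. The following sketch uses only the standard formulas; the function names are illustrative.

```python
def scale_pos_weight(positive_rate):
    """Ratio of negatives to positives: (1 - rate) / rate."""
    if not 0.0 < positive_rate < 1.0:
        raise ValueError("positive_rate must be strictly between 0 and 1")
    return (1.0 - positive_rate) / positive_rate


def fbeta(tp, fp, fn, beta=2.0):
    """F-beta from confusion counts; beta > 1 weights recall above precision."""
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


# A 5% churn rate gives a weight of about 19 on the positive class.
print(round(scale_pos_weight(0.05), 2))

# Precision 0.40, recall 0.80: F2 rewards the high recall more than F1 does.
print(round(fbeta(tp=40, fp=60, fn=10, beta=2.0), 4))
print(round(fbeta(tp=40, fp=60, fn=10, beta=1.0), 4))
```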

**Success Metrics:**
- **F2-score > 0.65**: Emphasizes recall (catch churners)
- **Precision > 40%**: Avoid excessive false alarms
- **Retention ROI > 3:1**: $3 saved per $1 spent on incentives

**Business Value:** Retain 500 customers/month × $500 LTV = $250K monthly revenue saved.

---

#### **Project 6: Stock Price Movement Prediction (Multi-Step Time Series)**

**Objective:** Predict next-day stock movement direction (Up/Down) using technical indicators.

**Why Hyperparameter Tuning Matters:**
- Financial data is noisy (signal-to-noise ratio ~0.1)
- Overfitting to historical patterns → poor out-of-sample performance
- Need robust features and conservative hyperparameters

**Recommended Approach:**
- **Model:** Gradient Boosting or LSTM (if using deep learning)
- **Key Hyperparameters (Gradient Boosting):**
  - `max_depth`: Shallow trees prevent overfitting (2-4)
  - `learning_rate`: Very slow learning (0.001-0.05)
  - `n_estimators`: Many weak learners (500-2000)
  - `subsample`: Strong regularization (0.5-0.7)
- **Tuning Strategy:**
  - **Primary:** Successive Halving on 5+ years of data (fast initial screening)
  - **Refinement:** Bayesian Opt with walk-forward validation
- **Validation:** Walk-forward (train on Year 1, test on Month 13, retrain, repeat)
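
Walk-forward validation can be sketched as an index generator. This version assumes a fixed-length rolling training window that slides forward by one test block at a time; an expanding window is an equally common choice, and the text does not say which is intended.

```python
def walk_forward_splits(n_samples, train_size, test_size, step=None):
    """Yield (train_indices, test_indices) for a rolling walk-forward scheme.

    Every test block starts right after its training window, so the model
    is never scored on data older than what it was trained on.
    """
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n_samples:
        train_end = start + train_size
        yield (list(range(start, train_end)),
               list(range(train_end, train_end + test_size)))
        start += step


for train_idx, test_idx in walk_forward_splits(10, train_size=4, test_size=2):
    print(train_idx, test_idx)
```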

**Success Metrics:**
- **Accuracy > 52%**: Profitable after transaction costs
- **Sharpe ratio > 1.0**: Risk-adjusted return benchmark
- **Max drawdown < 20%**: Risk management constraint

**Business Value:** 52% accuracy on $1M portfolio → $20K-$50K annual alpha (conservative estimate).

---

#### **Project 7: Medical Diagnosis Assistant (Multi-Label Classification)**

**Objective:** Predict multiple diseases simultaneously from patient symptoms and lab results.

**Why Hyperparameter Tuning Matters:**
- Multi-label complexity: patient can have 0-5+ conditions
- Class imbalance: rare diseases have <1% prevalence
- Life-critical: false negatives (missed diagnosis) are extremely costly

**Recommended Approach:**
- **Model:** Multi-label Random Forest or OneVsRest XGBoost
- **Key Hyperparameters:**
  - `class_weight`: Custom per disease (rare diseases get higher weight)
  - `n_estimators`: High for stable predictions (300-1000)
  - `max_depth`: Moderate to capture disease interactions (5-10)
  - `min_samples_leaf`: Higher for rare diseases (20-100 to prevent overfitting)
- **Tuning Strategy:**
  - **Primary:** Random Search with custom multi-label F1 scorer
  - **Per-disease tuning:** Separate threshold optimization for each condition
- **Validation:** Stratified K-Fold for each disease independently

**Success Metrics:**
- **Macro F1 > 0.70**: Average across all diseases
- **Recall > 90% for critical diseases**: Cancer, heart disease, stroke
- **Precision > 60% overall**: Avoid excessive false alarms

**Business Value:** Assist 10,000 diagnoses/year, catch 50 missed conditions → save $2M in malpractice + improved outcomes.

---

#### **Project 8: Real-Time Fraud Detection with Streaming Data**

**Objective:** Detect fraudulent transactions in real-time (latency <100ms) with concept drift handling.

**Why Hyperparameter Tuning Matters:**
- Fraud patterns evolve weekly (concept drift)
- False positives frustrate customers (card declined)
- Latency constraint requires lean models
- Need to retrain frequently with new fraud patterns

**Recommended Approach:**
- **Model:** Random Forest or LightGBM (fast inference)
- **Key Hyperparameters:**
  - `n_estimators`: Balance accuracy vs latency (50-200)
  - `max_depth`: Shallow for speed (3-6)
  - `min_samples_leaf`: High for stability (50-200)
  - `class_weight`: Handle imbalance (fraud rate ~0.1-1%)
- **Tuning Strategy:**
  - **Development:** Bayesian Opt on 6-month historical data
  - **Production:** Weekly retraining with Random Search (time budget 2 hours)
  - **Monitoring:** Automated hyperparameter adjustment if performance degrades
- **Validation:** Time-series split with sliding window (train on Week 1-8, test on Week 9)

**Success Metrics:**
- **Recall > 85%**: Catch majority of fraud
- **Precision > 40%**: 4/10 alarms are true fraud (acceptable false positive rate)
- **Latency < 100ms**: Real-time approval/decline decision
- **Concept drift detection**: Automatic retraining when F1 drops >5%
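
The drift trigger in the last bullet can be expressed as a small check. This sketch reads "F1 drops >5%" as a relative drop against a baseline F1 and averages the most recent scores to reduce noise; both choices are assumptions, not something the text specifies.

```python
def needs_retraining(baseline_f1, recent_f1_scores, max_relative_drop=0.05):
    """Return True when the mean recent F1 fell more than max_relative_drop
    below the baseline (relative drop, assumed interpretation)."""
    if baseline_f1 <= 0:
        raise ValueError("baseline_f1 must be positive")
    if not recent_f1_scores:
        return False  # Nothing observed yet, so no evidence of drift
    current_f1 = sum(recent_f1_scores) / len(recent_f1_scores)
    relative_drop = (baseline_f1 - current_f1) / baseline_f1
    return relative_drop > max_relative_drop


print(needs_retraining(0.60, [0.58, 0.59]))  # 2.5% drop -> False
print(needs_retraining(0.60, [0.55, 0.56]))  # 7.5% drop -> True
```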

**Business Value:** Prevent $5M fraud loss/year, less about $500K in false-positive customer friction → net $4.5M savings.

---

## 🎯 Project Selection Framework

**Choose your project based on:**

| **Criterion**                     | **Best Project Choice**                              |
|-----------------------------------|------------------------------------------------------|
| **Learning spatial or temporal patterns** | Project 1 (Wafer Yield), Project 6 (Stock Price) |
| **Cost-sensitive optimization**   | Project 3 (Binning), Project 5 (Churn)              |
| **Imbalanced data**               | Project 2 (Test Time), Project 5 (Churn), Project 7 (Medical), Project 8 (Fraud) |
| **Multi-objective tuning**        | Project 2 (Time + Accuracy), Project 3 (Revenue)    |
| **Concept drift handling**        | Project 6 (Stock), Project 8 (Fraud)                |
| **Real-time inference**           | Project 2 (Test Time), Project 8 (Fraud)            |
| **Multi-label classification**    | Project 7 (Medical Diagnosis)                       |
| **Unsupervised anomaly detection**| Project 4 (Outlier Detection)                       |

---

**Common Success Patterns:**

1. **Start with Random Search** - Fast baseline for all projects
2. **Use domain-specific validation** - GroupKFold for spatial data, TimeSeriesSplit for temporal
3. **Define custom metrics** - Revenue, cost-sensitive F1, latency-adjusted accuracy
4. **Iterate on tuning strategy** - Begin broad (Random), refine (Bayesian), deploy (fixed hyperparams with monitoring)
5. **Monitor in production** - Track performance drift, retrain with updated hyperparameters monthly/quarterly

All projects benefit from **systematic hyperparameter tuning** - the difference between 80% accuracy (unusable) and 90% accuracy (production-ready) often comes down to optimal hyperparameter selection.

## 🎓 Key Takeaways & Best Practices

### **Core Principles**

#### **1. Match Tuning Strategy to Problem Characteristics**

**Decision Framework:**

```mermaid
graph TD
    A[Start: Need Hyperparameter Tuning] --> B{Computational Budget?}
    B -->|Limited: <1 hour| C{Search Space Size?}
    B -->|Moderate: 1-8 hours| D{Search Space Size?}
    B -->|Large: >8 hours| E[Bayesian Optimization]
    
    C -->|Small: <50 configs| F[Grid Search]
    C -->|Large: >50 configs| G[Random Search + Early Stop]
    
    D -->|Small: <100 configs| H[Grid Search]
    D -->|Large: >100 configs| I[Random Search → Bayesian Opt]
    
    E --> J{Iterative Model?}
    J -->|Yes: GBM/XGBoost/NN| K[Successive Halving]
    J -->|No: RF/SVM| E
    
    F --> L[Coarse → Fine Grid]
    G --> M[Monitor Convergence]
    H --> L
    I --> N[2-Stage: Random 50% → Bayesian 50%]
    K --> O[Start: η=3, min_resources=0.1*data]
```
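
The flowchart can also be read as a small function. The sketch below follows the branches as drawn; the flowchart does not say what happens at exactly 50 or 100 configurations, so treating those as "large" is an assumption, and the function name is illustrative.

```python
GRID = "Grid Search (coarse to fine)"
RANDOM_EARLY_STOP = "Random Search + early stopping"
RANDOM_THEN_BAYES = "Random Search, then Bayesian Optimization"
BAYES = "Bayesian Optimization"
HALVING = "Successive Halving"


def recommend_strategy(budget_hours, n_configs, iterative_model=False):
    """Map budget, search-space size and model type to a tuning strategy."""
    if budget_hours < 1:                       # Limited budget
        return GRID if n_configs < 50 else RANDOM_EARLY_STOP
    if budget_hours <= 8:                      # Moderate budget
        return GRID if n_configs < 100 else RANDOM_THEN_BAYES
    # Large budget: iterative models can use Successive Halving
    return HALVING if iterative_model else BAYES


print(recommend_strategy(0.5, 27))
print(recommend_strategy(4, 500))
print(recommend_strategy(12, 500, iterative_model=True))
```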

**Summary Table:**

| **Scenario**                          | **Recommended Strategy**              | **Rationale**                                    |
|---------------------------------------|---------------------------------------|--------------------------------------------------|
| Small search space (<50 configs)      | Grid Search                           | Exhaustive = guaranteed best within the grid     |
| Large search space (>100 configs)     | Random Search                         | Often ~10× faster than grid, similar performance |
| Expensive evaluation (>10 min/config) | Bayesian Optimization                 | Smart sampling reduces total evaluations 5-10×   |
| Iterative models (GBM, XGBoost, NN)   | Early Stopping or Successive Halving  | Automatic stopping saves 40-70% time             |
| Multi-stage tuning (R&D → production) | Random (explore) → Bayesian (refine)  | Broad initial search, precision refinement       |
| Time-critical (<1 hour budget)        | Random Search + Early Stop            | Fast convergence, time-bounded                   |

---

#### **2. Validation Strategy is Critical**

**Common Validation Mistakes:**

❌ **Wrong:** Use a single train/validation split only → Noisy estimate that is easy to overfit  
✅ **Right:** Use cross-validation (K-Fold, Stratified, Time Series, Group)

❌ **Wrong:** Report the tuning CV score as the final performance → Optimistically biased estimate  
✅ **Right:** Nested CV (inner loop for tuning, outer loop for evaluation)

❌ **Wrong:** Ignore data structure (spatial, temporal, groups) in splits  
✅ **Right:** GroupKFold for spatial data, TimeSeriesSplit for time series

**Validation Guidelines:**

| **Data Characteristic**       | **Recommended CV Strategy**           | **Why**                                          |
|-------------------------------|---------------------------------------|--------------------------------------------------|
| **Independent samples**        | Stratified K-Fold (k=5)               | Preserves class distribution                     |
| **Spatial correlation**        | GroupKFold by spatial unit (wafer_id) | Prevents spatial leakage                         |
| **Time-series data**           | TimeSeriesSplit or walk-forward       | Respects temporal ordering                       |
| **Small dataset (<1000)**      | Leave-One-Out or k=10                 | Maximizes training data per fold                 |
| **Large dataset (>100K)**      | k=3 with stratification               | Faster, still robust                             |
| **Imbalanced classes**         | Stratified K-Fold + class_weight      | Ensures minority class in each fold              |
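
To make the "prevents spatial leakage" row concrete, here is a standard-library sketch of group-aware splitting: every sample from the same group (for example the same `wafer_id`) lands in the same fold. The greedy balancing is an illustrative choice, not a copy of any library's algorithm.

```python
from collections import defaultdict


def group_kfold(groups, n_splits=5):
    """Yield (train_indices, test_indices) so that no group is split
    between train and test."""
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)
    if len(by_group) < n_splits:
        raise ValueError("need at least n_splits distinct groups")

    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest samples.
    folds = [[] for _ in range(n_splits)]
    for group in sorted(by_group, key=lambda g: -len(by_group[g])):
        min(folds, key=len).extend(by_group[group])

    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


wafer_ids = ["w1", "w1", "w2", "w2", "w3", "w3", "w4"]
for train_idx, test_idx in group_kfold(wafer_ids, n_splits=2):
    print(train_idx, test_idx)
```

In practice scikit-learn's `GroupKFold(n_splits=5).split(X, y, groups=wafer_ids)` plays this role and can be passed as `cv=` to the search classes.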

---

#### **3. Define Realistic Search Spaces**

**Search Space Design Principles:**

✅ **Start with literature values** - Papers, library defaults, domain experts  
✅ **Use log-scale for multiplicative parameters** - `learning_rate`, `regularization`  
✅ **Use linear scale for additive parameters** - `n_estimators`, `max_depth`  
✅ **Include domain constraints** - E.g., `max_depth ≤ log2(n_samples)` to prevent overfitting

**Example Search Spaces:**

```python
from scipy.stats import loguniform, randint, uniform

# ❌ BAD: Unrealistic ranges
param_dist = {
    'learning_rate': uniform(0.0001, 1.0),  # Too wide, includes unusable values
    'n_estimators': randint(1, 10000),      # 1 tree is useless, 10K is overkill
    'max_depth': randint(1, 100)            # 100 depth will overfit
}

# ✅ GOOD: Realistic ranges based on domain knowledge
param_dist = {
    'learning_rate': loguniform(0.01, 0.3),  # Log-scale for multiplicative param
    'n_estimators': randint(100, 500),       # Sufficient for most problems
    'max_depth': randint(3, 10),             # Prevents overfitting (upper bound is exclusive)
    'min_samples_leaf': randint(10, 100),    # Regularization for noisy data
    'subsample': uniform(0.6, 0.4)           # uniform(loc, scale): samples [0.6, 1.0]
}
```
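
Why log-scale matters for a parameter like `learning_rate` can be seen by comparing where the samples fall. The sketch below uses only the standard library and a hypothetical helper `sample_loguniform`; it counts how many draws land below 0.05 under each scheme.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
n_draws = 10_000
log_draws = [sample_loguniform(0.01, 0.3, rng) for _ in range(n_draws)]
lin_draws = [rng.uniform(0.01, 0.3) for _ in range(n_draws)]

share_log = sum(x < 0.05 for x in log_draws) / n_draws
share_lin = sum(x < 0.05 for x in lin_draws) / n_draws
print(f"below 0.05: log-uniform {share_log:.2f}, uniform {share_lin:.2f}")
```

In expectation the log-uniform share is ln 5 / ln 30 ≈ 0.47, while the uniform share is 0.04 / 0.29 ≈ 0.14, so linear sampling spends most of its budget on the upper end of the range.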

**Semiconductor-Specific Constraints:**

- **Spatial models (wafer yield):** `max_depth ≤ 7` (prevents overfitting to single dies)
- **Test time prediction:** `n_estimators ≤ 300` (inference latency <50ms)
- **Imbalanced defect detection:** `scale_pos_weight = (1 - defect_rate) / defect_rate`

---

#### **4. Monitor Convergence and Stop Early**

**Convergence Criteria:**

```python
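def check_convergence(scores, window=20, threshold=0.001):
    """Return True if the best score gained less than `threshold` in the last `window` iterations."""
    if len(scores) <= window:
        return False  # Not enough earlier history to compare against
    
    best_recent = max(scores[-window:])
    best_before = max(scores[:-window])
    improvement = best_recent - best_before
    
    return improvement < threshold  # Absolute gain below threshold (default 0.001) → converged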
```

**When to Stop Tuning:**

| **Signal**                              | **Action**                                      |
|-----------------------------------------|-------------------------------------------------|
| No improvement in last 20 iterations    | Stop Random/Bayesian search                     |
| Validation score plateaus               | Early stopping triggered                        |
| Test set performance degrades           | Overfitting to validation set → use nested CV   |
| Time budget exhausted                   | Select best config so far, schedule refinement  |
| Diminishing returns (<0.5% improvement) | Move to next modeling step (feature engineering)|

---

### **Common Pitfalls and Solutions**

#### **Pitfall 1: Overfitting to Validation Set**

**Problem:** Tuning 100+ hyperparameter configs on the same validation set → model "memorizes" validation data.

**Solution:**
- Use **nested cross-validation**:
  - **Inner loop:** Tuning (select best hyperparameters)
  - **Outer loop:** Evaluation (unbiased performance estimate)
- Hold out **final test set** that is NEVER used during tuning
- **Rule of thumb:** If tuning >50 configs, use nested CV

**Code Pattern:**
```python
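from sklearn.model_selection import GridSearchCV, cross_val_score

# ❌ WRONG: Reporting the tuning score as the performance estimate
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
reported_score = grid_search.best_score_  # Biased: it is the best of all configs tried
# (Scoring best_model once on an untouched X_test, y_test is also valid.)

# ✅ RIGHT: Nested CV for unbiased estimate
outer_scores = cross_val_score(
    GridSearchCV(model, param_grid, cv=5),  # Inner CV: selects hyperparameters
    X, y, cv=5  # Outer CV: scores the whole tuning procedure
)
print(f"Unbiased R² estimate: {outer_scores.mean():.4f}")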
```

---

#### **Pitfall 2: Ignoring Computational Constraints**

**Problem:** Bayesian Optimization with `n_iter=1000` takes 3 days → not feasible for weekly retraining.

**Solution:**
- **Time-box tuning:** Set maximum wall-clock time (e.g., 4 hours)
- **Use Successive Halving:** Eliminates bad configs early (saves 60-80% time)
- **Parallelize:** Use `n_jobs=-1` to leverage all CPU cores
- **Coarse-to-fine:** Stage 1 (broad search, 30 min) → Stage 2 (refinement, 2 hours)
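
To see where Successive Halving saves time, it helps to write out the round schedule. The sketch below is an idealized version under stated assumptions: each round keeps roughly the top 1/η of candidates and multiplies the per-candidate resource by η, and "resource" is an abstract unit (for example a number of trees or a fraction of the data).

```python
def successive_halving_schedule(n_configs, min_resource, max_resource, eta=3):
    """Return [(n_candidates, resource_per_candidate), ...] for each round."""
    if n_configs < 1 or min_resource <= 0 or eta < 2:
        raise ValueError("need n_configs >= 1, min_resource > 0, eta >= 2")
    rounds = []
    n, resource = n_configs, min_resource
    while True:
        resource = min(resource, max_resource)
        rounds.append((n, resource))
        if n == 1 or resource >= max_resource:
            break
        n = max(1, n // eta)   # keep roughly the top 1/eta of candidates
        resource *= eta        # survivors get eta times more resource
    return rounds


rounds = successive_halving_schedule(n_configs=27, min_resource=1, max_resource=27)
total_cost = sum(n * r for n, r in rounds)
full_cost = 27 * 27  # every config trained at full resource
print(rounds)        # [(27, 1), (9, 3), (3, 9), (1, 27)]
print(total_cost, full_cost)  # 108 729
```

In this toy schedule the total is 108 resource units against 729 for training every configuration at full resource. Real savings depend on how well low-resource scores predict the final ranking.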

**Realistic Time Budgets:**

| **Project Phase**     | **Tuning Budget**  | **Strategy**                          |
|-----------------------|--------------------|---------------------------------------|
| **Research/POC**       | 8-24 hours         | Bayesian Opt (thorough exploration)   |
| **Development**        | 2-4 hours          | Random Search → Bayesian refinement   |
| **Production**         | <1 hour            | Grid Search on best region + caching  |
| **Weekly retraining**  | 30 minutes         | Early stopping on cached hyperparams  |
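
A time budget like those in the table can be enforced with a wall-clock deadline around a random search loop. This is a minimal sketch: `evaluate` and `sample_config` are hypothetical callables supplied by the caller, and the deadline is only checked between evaluations, so a long-running evaluation is not interrupted.

```python
import random
import time


def time_boxed_random_search(evaluate, sample_config, time_budget_s,
                             max_iter=1000, seed=0):
    """Random search that stops at max_iter or when the time budget is spent.

    Returns (best_config, best_score, n_evaluated).
    """
    rng = random.Random(seed)
    deadline = time.monotonic() + time_budget_s
    best_config, best_score, n_evaluated = None, float("-inf"), 0
    while n_evaluated < max_iter and time.monotonic() < deadline:
        config = sample_config(rng)
        score = evaluate(config)
        n_evaluated += 1
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score, n_evaluated


# Toy objective with its optimum at x = 0.3 (illustrative only).
best, score, n = time_boxed_random_search(
    evaluate=lambda cfg: -(cfg["x"] - 0.3) ** 2,
    sample_config=lambda rng: {"x": rng.random()},
    time_budget_s=5.0,
    max_iter=200,
)
print(best, score, n)
```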

---

#### **Pitfall 3: Not Using Domain Knowledge**

**Problem:** Treating hyperparameter tuning as a "black box" → misses critical constraints.

**Solution:**
- **Incorporate physics/business constraints:**
  - Semiconductor: Inference latency <100ms → `n_estimators ≤ 200`
  - Finance: Model must be explainable → `max_depth ≤ 5` (for regulatory approval)
  - Healthcare: False negatives are 10× worse than false positives → custom scoring
- **Feature importance as hyperparameter:**
  - High-correlation features → lower `max_features` (reduce redundancy)
  - Noisy features → higher `min_samples_leaf` (stronger regularization)

**Semiconductor Example:**
```python
import numpy as np

# Encode spatial constraint: max_depth should scale with wafer size
n_dies_per_wafer = 100
max_depth_limit = int(np.log2(n_dies_per_wafer))  # log2(100) ≈ 6.64 → 6

param_grid = {
    'max_depth': [3, max_depth_limit - 1, max_depth_limit],  # Don't exceed spatial resolution
    'min_samples_leaf': [10, 20, 50]  # Minimum devices per spatial bin
}
```

---

#### **Pitfall 4: Misaligned Metrics**

**Problem:** Optimizing R² when the business cares about "yield improvement >2%" → model achieves R²=0.90 but only 1.5% yield improvement.

**Solution:**
- **Use custom scoring functions:**

```python
from sklearn.model_selection import GridSearchCV

# ❌ DEFAULT: Maximizes R² (statistical metric)
grid_search = GridSearchCV(model, param_grid, scoring='r2')

# ✅ CUSTOM: Maximizes business value
# Illustrative only: a real metric should also penalize over-prediction.
def yield_improvement_scorer(estimator, X, y):
    y_pred = estimator.predict(X)
    baseline_yield = y.mean()
    predicted_yield = y_pred.mean()
    improvement = predicted_yield - baseline_yield
    return improvement  # Higher is better

# A callable with signature (estimator, X, y) is passed to `scoring` directly;
# make_scorer is only for metric functions of the form f(y_true, y_pred).
grid_search = GridSearchCV(
    model, param_grid,
    scoring=yield_improvement_scorer
)
```

**Common Business Metrics:**

| **Domain**                  | **Metric**                          | **Why**                                      |
|-----------------------------|-------------------------------------|----------------------------------------------|
| **Semiconductor yield**      | Revenue per wafer                   | Optimizes binning revenue, not just accuracy |
| **Fraud detection**          | Cost-weighted F1                    | False negatives cost $500, false positives $5|
| **Customer churn**           | Retention ROI                       | Balance incentive cost vs LTV                |
| **Medical diagnosis**        | Sensitivity (recall) for critical   | Missing cancer diagnosis is unacceptable     |
| **Test time optimization**   | F1 / inference_time                 | Multi-objective: accuracy + speed            |

---

#### **Pitfall 5: Lack of Reproducibility**

**Problem:** "Model had R²=0.92 in development, now it's 0.85 in production" → cannot debug.

**Solution:**
- **Set random seeds everywhere:**
```python
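import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

np.random.seed(42)
random_search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_dist,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),  # Fixes the CV splits
    random_state=42,  # Fixes which hyperparameter configs are sampled
    n_jobs=1  # Optional: with every seed fixed, results should not depend on n_jobs
)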
```
- **Log all hyperparameters and results:**
```python
# Use MLflow, Weights & Biases, or simple CSV logging
results_df = pd.DataFrame(random_search.cv_results_)
results_df.to_csv('tuning_history.csv')
```
- **Version control data and code:**
  - Data: Track SHA256 hash of training set
  - Code: Git commit ID for model training script
  - Model: Save with `joblib.dump(model, f'model_{git_commit}.pkl')`
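
The data hash and model naming above can be sketched with the standard library. The file name `training_set.csv`, the commit ID, and the `model_filename` helper are illustrative assumptions; the hash is read in chunks so large training files do not need to fit in memory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def model_filename(git_commit, data_hash):
    """Build a file name that ties a model to its code and data versions."""
    return f"model_{git_commit[:8]}_{data_hash[:8]}.pkl"


# Hypothetical usage:
# data_hash = sha256_of_file("training_set.csv")
# joblib.dump(model, model_filename("3f2a9c1d7e", data_hash))
print(model_filename("3f2a9c1d7e", "ba7816bf8f01cfea"))  # model_3f2a9c1d_ba7816bf.pkl
```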

---

### **Production Deployment Guidelines**

#### **Development → Production Pipeline**

```mermaid
graph LR
    A[Development<br/>Bayesian Opt<br/>8 hours] --> B[Validation<br/>Nested CV<br/>Unbiased score]
    B --> C[Staging<br/>Fixed hyperparams<br/>Monitor performance]
    C --> D{Performance OK?}
    D -->|Yes| E[Production<br/>Deploy model]
    D -->|No| F[Retune with<br/>production data]
    F --> B
    E --> G[Monitor Drift]
    G --> H{Performance degrades?}
    H -->|Yes| I[Scheduled Retraining<br/>Monthly/Quarterly]
    H -->|No| E
    I --> B
```

**Best Practices:**

1. **Cache optimal hyperparameters:** Store in config file (YAML/JSON)
   ```yaml
   model_config:
     n_estimators: 300
     learning_rate: 0.08
     max_depth: 6
     last_tuned: 2024-12-01
     performance_r2: 0.89
   ```

2. **Monitor performance drift:** Track metrics weekly
   - If R² drops >5% → retrigger hyperparameter tuning
   - If inference time increases >20% → optimize for speed

3. **A/B test hyperparameter changes:**
   - Deploy new model to 10% of traffic
   - Compare business metrics (yield, revenue, latency)
   - Roll out if improvement >2% and the confidence intervals of the two variants do not overlap

4. **Automate retraining:**
   - Schedule: Monthly for slow drift, weekly for fast drift (fraud, stock)
   - Budget: 1-2 hours for retraining + tuning
   - Strategy: Random Search on previous best region (warm start)
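
The A/B rollout rule in item 3 can be sketched as follows. The code assumes a higher metric is better, a positive control mean, and a normal-approximation 95% interval for each variant's mean (z = 1.96); requiring non-overlapping intervals is a conservative criterion, and the function names are illustrative.

```python
import math
import statistics


def mean_ci(values, z=1.96):
    """Normal-approximation confidence interval for the mean of `values`."""
    mean = statistics.fmean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean - half_width, mean + half_width


def should_roll_out(control, treatment, min_relative_gain=0.02):
    """Roll out only if the relative gain exceeds the minimum and the
    treatment interval lies entirely above the control interval."""
    _, control_high = mean_ci(control)
    treatment_low, _ = mean_ci(treatment)
    control_mean = statistics.fmean(control)
    gain = (statistics.fmean(treatment) - control_mean) / control_mean
    return gain > min_relative_gain and treatment_low > control_high


control = [100, 101, 99, 100, 102, 98] * 5
print(should_roll_out(control, [x + 5 for x in control]))  # 5% gain -> True
print(should_roll_out(control, [x + 1 for x in control]))  # 1% gain -> False
```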

---

### **Semiconductor-Specific Recommendations**

#### **Wafer Yield Prediction**
- **Validation:** GroupKFold by `wafer_id` (prevent spatial leakage)
- **Hyperparameters:** `max_depth ≤ 7`, `min_samples_leaf ≥ 10`
- **Metric:** R² with spatial correlation penalty
- **Retraining:** After every 1000 wafers or quarterly

#### **Test Time Optimization**
- **Validation:** TimeSeriesSplit (respect temporal ordering)
- **Hyperparameters:** Cap `n_estimators` so that inference time stays <100ms
- **Metric:** F1-score / inference_time_ms
- **Retraining:** After equipment changes or new test insertion

#### **Defect Detection (Imbalanced)**
- **Validation:** Stratified K-Fold (preserves the defect rate in each fold)
- **Hyperparameters:** Tune `scale_pos_weight = (1 - defect_rate) / defect_rate`
- **Metric:** F2-score (emphasize recall > precision)
- **Retraining:** Weekly (defect patterns evolve)

---

### **Resources for Further Learning**

**Papers:**
- Bergstra & Bengio (2012): "Random Search for Hyper-Parameter Optimization" - JMLR
- Snoek et al. (2012): "Practical Bayesian Optimization of Machine Learning Algorithms" - NeurIPS

**Libraries:**
- **scikit-learn:** `GridSearchCV`, `RandomizedSearchCV`, `HalvingRandomSearchCV`
- **Optuna:** Bayesian optimization (TPE sampler by default) with pruning of unpromising trials
- **Ray Tune:** Distributed hyperparameter tuning at scale
- **Weights & Biases:** Experiment tracking and hyperparameter sweeps

**Books:**
- "Hyperparameter Tuning in Practice" (O'Reilly, 2023)
- "Automated Machine Learning" (Springer, 2019) - Chapter on hyperparameter optimization

---

## ✅ Summary

**What We Learned:**

1. **Grid Search:** Exhaustive but expensive - use for small search spaces (<50 configs)
2. **Random Search:** 10× faster than grid, often achieves similar performance
3. **Bayesian Optimization:** Smart sampling reduces evaluations 5-10×, best for expensive functions
4. **Early Stopping:** Automatic stopping for iterative models (GBM, XGBoost, NN) saves 40-70% time
5. **Successive Halving:** Tournament-style elimination achieves 5-15× speedup for large search spaces

**When to Use Each:**

| **Strategy**              | **Best For**                               | **Speedup**       |
|---------------------------|--------------------------------------------|-------------------|
| Grid Search               | Small search space, need exhaustive search | Baseline          |
| Random Search             | Large search space, quick baseline         | 10×               |
| Bayesian Optimization     | Expensive evaluations (>10 min/config)     | 5-10×             |
| Early Stopping            | Iterative models (GBM, XGBoost, NN)        | 2-5×              |
| Successive Halving        | Very large search space (>100 configs)     | 5-15×             |
| **Combined (Random+Bayes)**| Production systems (best performance)      | 10-20× (2-stage)  |

**Key Principle:** *Match tuning strategy to problem characteristics (search space size, computational budget, data structure).*

---

**Next Steps:**

- **Practice:** Apply to your domain (semiconductor, finance, healthcare, etc.)
- **Experiment:** Try different strategies on same problem, compare results
- **Monitor:** Track performance in production, retune when metrics degrade
- **Automate:** Build pipelines for scheduled retraining with hyperparameter tuning

**Happy Tuning! 🚀**