## 🔢 K-Fold Cross-Validation: The Foundation

**K-Fold CV** is the most common cross-validation strategy. It works by:

1. **Split data into K equal-sized folds** (typically K=5 or K=10)
2. **For each fold k = 1 to K:**
   - Use fold k as **validation set**
   - Use remaining K-1 folds as **training set**
   - Train model and compute metric on validation fold
3. **Average metrics across all K folds** → robust estimate
4. **Report mean ± standard deviation**

### Mathematical Formulation

Let $D$ be the full dataset with $n$ samples. Split into K folds: $D_1, D_2, ..., D_K$ (each has $\approx n/K$ samples).

For each fold $k$:
- **Training set**: $D_{train}^{(k)} = D \setminus D_k$ (all folds except k)
- **Validation set**: $D_{val}^{(k)} = D_k$
- Train model $f_k$ on $D_{train}^{(k)}$
- Compute metric: $M_k = \text{metric}(f_k, D_{val}^{(k)})$

**Final estimate**:
$$
M_{CV} = \frac{1}{K} \sum_{k=1}^{K} M_k \quad \text{(mean)}
$$

$$
\sigma_{CV} = \sqrt{\frac{1}{K-1} \sum_{k=1}^{K} (M_k - M_{CV})^2} \quad \text{(std)}
$$

**Report as**: $M_{CV} \pm \sigma_{CV}$ (e.g., "Accuracy = 0.90 ± 0.03")
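The formulas above can be sketched as an explicit loop. This is a minimal illustration, assuming synthetic data from `make_classification` and a `LogisticRegression` model (neither comes from the original text); the point is only to show where $D \setminus D_k$, $M_k$, $M_{CV}$ and $\sigma_{CV}$ appear in code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption); any (X, y) arrays would do.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)    # f_k, refit from scratch each fold
    model.fit(X[train_idx], y[train_idx])        # train on D \ D_k
    fold_scores.append(model.score(X[val_idx], y[val_idx]))  # M_k on D_k

fold_scores = np.array(fold_scores)
m_cv = fold_scores.mean()             # M_CV
sigma_cv = fold_scores.std(ddof=1)    # sigma_CV with the K-1 denominator
print(f"Accuracy = {m_cv:.3f} ± {sigma_cv:.3f}")
```

Note that `ddof=1` is needed to match the $K-1$ denominator; NumPy's default (`ddof=0`) divides by $K$.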

### Key Properties

#### Training Set Size
Each fold trains on $(K-1)/K$ of data:
- K=5: 80% training (4000 samples from 5000)
- K=10: 90% training (4500 samples from 5000)
- K=n (LOO): ~100% training (n-1 samples)

**Trade-off**: Larger K → more training data per fold (less pessimistic bias in the estimate), but more models to train and, because the training sets overlap heavily, a potentially higher-variance estimate

#### Computational Cost
K-Fold requires training **K models**:
- K=5: 5× cost of single train/test
- K=10: 10× cost
- K=n (LOO): n× cost (prohibitive for large datasets)

**Typical choice**: K=5 for faster iteration, K=10 for final evaluation

#### Variance of Estimate
Standard error of the mean:
$$
SE = \frac{\sigma_{CV}}{\sqrt{K}}
$$

- Larger K → smaller SE (more precise estimate)
- But folds are not independent (overlap in training sets) → SE underestimates true variance

### When to Use K-Fold

✅ **Use K-Fold when:**
- Data is **i.i.d.** (independent and identically distributed)
- Classes are **balanced** (or use Stratified K-Fold)
- No temporal/spatial ordering in data
- Need robust estimate with confidence intervals

❌ **Do NOT use K-Fold when:**
- Data has **temporal ordering** (time series) → use Time Series Split
- Data has **group structure** (multiple samples from same patient/wafer) → use Group K-Fold
- Classes are **extremely imbalanced** → use Stratified K-Fold
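To make the group-structure case concrete, here is a small sketch using scikit-learn's `GroupKFold`. The data, the `wafer_id` array and the sizes are made up for illustration; the only thing it demonstrates is that no wafer ends up on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_devices, n_wafers = 200, 10
X = rng.normal(size=(n_devices, 3))
y = rng.integers(0, 2, size=n_devices)
# Hypothetical grouping: 20 consecutive devices per wafer.
wafer_id = np.repeat(np.arange(n_wafers), n_devices // n_wafers)

gkf = GroupKFold(n_splits=5)
overlaps = []
for train_idx, val_idx in gkf.split(X, y, groups=wafer_id):
    shared = set(wafer_id[train_idx]) & set(wafer_id[val_idx])
    overlaps.append(len(shared))   # wafers present in both train and validation

print("Wafers shared between train and validation, per fold:", overlaps)
```

With plain `KFold` and shuffling, devices from the same wafer would usually land in both sets, which lets wafer-level effects leak into the validation score.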

### Semiconductor Example: Yield Prediction

**Scenario**: Predict device pass/fail from parametric test data (10,000 devices from 50 wafers)

**Single split approach:**
- Random 80/20 split: Accuracy = 92%
- **Problem**: Maybe test set happened to be from "easy" wafers?

**5-Fold CV approach:**
- Fold 1: Accuracy = 91%
- Fold 2: Accuracy = 93%
- Fold 3: Accuracy = 89%
- Fold 4: Accuracy = 92%
- Fold 5: Accuracy = 90%
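- **Mean ± Std**: 91% ± 1.6% (sample standard deviation, K-1 denominator)

**Interpretation**: 
- Expected accuracy is 91% (more reliable than the single-split 92%)
- Fold-to-fold variability is ±1.6% (performance is stable)
- Approximate 95% confidence interval for the mean (t-distribution, 4 degrees of freedom): 91% ± 2.78 × 1.6%/√5 ≈ [89.0%, 93.0%]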
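As a worked check of the arithmetic, the snippet below recomputes the summary statistics from the five fold accuracies listed above, using the K-1 standard deviation from the Mathematical Formulation section and a t-based interval for the mean. Treat the interval as approximate: fold scores are correlated, so it tends to be too narrow.

```python
import numpy as np
from scipy import stats

scores = np.array([0.91, 0.93, 0.89, 0.92, 0.90])   # fold accuracies from the example
k = len(scores)

mean = scores.mean()
std = scores.std(ddof=1)                 # sample std, K-1 denominator
se = std / np.sqrt(k)                    # standard error of the mean
t_crit = stats.t.ppf(0.975, df=k - 1)    # two-sided 95%, 4 degrees of freedom
ci_low, ci_high = mean - t_crit * se, mean + t_crit * se

print(f"mean = {mean:.3f}, std = {std:.4f}, SE = {se:.4f}")
print(f"95% CI for the mean: [{ci_low:.3f}, {ci_high:.3f}]")
```

Using `ddof=0` instead would give a standard deviation of about 0.014, which is the population formula rather than the one defined earlier.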

### Choosing K

| **K** | **Training Size** | **Bias** | **Variance** | **Computation** | **When to Use** |
|-------|-------------------|----------|--------------|-----------------|----------------|
| **K=3** | 67% | High | Low | 3× | Quick experiments, very large datasets |
| **K=5** | 80% | Medium | Medium | 5× | **Standard choice**, good bias-variance trade-off |
| **K=10** | 90% | Low | High | 10× | **Final evaluation**, low-bias estimate |
| **K=20** | 95% | Very Low | Very High | 20× | Large datasets, need low bias |
| **K=n (LOO)** | ~100% | Minimal | Maximum | n× | Small datasets (<1000), maximum data usage |

**Rule of thumb**: 
- Small dataset (<1000): K=10 or LOO
- Medium dataset (1K-100K): K=5 or K=10
- Large dataset (>100K): K=3 or K=5 (computational cost)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold, cross_val_score, cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from typing import Dict, List, Tuple

# Set random seed
np.random.seed(42)

class KFoldEvaluator:
    """
    Comprehensive K-Fold cross-validation evaluator.
    
    Provides detailed analysis of model performance across folds,
    including mean/std metrics, confidence intervals, and visualizations.
    """
    
    def __init__(self, n_splits: int = 5, random_state: int = 42, shuffle: bool = True):
        """
        Initialize K-Fold cross-validator.
        
        Args:
            n_splits: Number of folds (K)
            random_state: Random seed for reproducibility
            shuffle: Whether to shuffle data before splitting
        """
        self.n_splits = n_splits
        self.random_state = random_state
        self.shuffle = shuffle
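        # KFold raises a ValueError if random_state is set while shuffle=False,
        # so only pass the seed when shuffling is enabled.
        self.kfold = KFold(
            n_splits=n_splits,
            shuffle=shuffle,
            random_state=random_state if shuffle else None,
        )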
        self.results = {}
    
    def evaluate(self, model, X, y, scoring: str = 'accuracy') -> Dict:
        """
        Perform K-Fold cross-validation.
        
        Args:
            model: Sklearn-compatible model
            X: Features (n_samples, n_features)
            y: Target (n_samples,)
            scoring: Metric to compute ('accuracy', 'f1', 'roc_auc', etc.)
            
        Returns:
            Dictionary with scores and statistics
        """
        # Perform cross-validation
        cv_results = cross_validate(
            model, X, y, 
            cv=self.kfold, 
            scoring=scoring,
            return_train_score=True,
            n_jobs=-1
        )
        
        # Extract scores
        test_scores = cv_results['test_score']
        train_scores = cv_results['train_score']
        fit_times = cv_results['fit_time']
        
        # Compute statistics
        results = {
            'test_scores': test_scores,
            'train_scores': train_scores,
            'fit_times': fit_times,
            'test_mean': np.mean(test_scores),
            'test_std': np.std(test_scores, ddof=1),  # sample std (K-1 denominator), matching the formula and the t-based CI
            'test_min': np.min(test_scores),
            'test_max': np.max(test_scores),
            'train_mean': np.mean(train_scores),
            'train_std': np.std(train_scores, ddof=1),
            'mean_fit_time': np.mean(fit_times),
            'scoring': scoring
        }
        
        # Compute confidence interval (95%)
        # Using t-distribution for small sample size (K folds)
        from scipy.stats import t
        confidence = 0.95
        dof = len(test_scores) - 1  # Degrees of freedom
        t_critical = t.ppf((1 + confidence) / 2, dof)
        margin_of_error = t_critical * (results['test_std'] / np.sqrt(len(test_scores)))
        
        results['ci_lower'] = results['test_mean'] - margin_of_error
        results['ci_upper'] = results['test_mean'] + margin_of_error
        
        self.results = results
        return results
    
    def print_summary(self) -> None:
        """
        Print formatted summary of cross-validation results.
        """
        if not self.results:
            print("No results available. Run evaluate() first.")
            return
        
        r = self.results
        
        print("="*80)
        print(f"{self.n_splits}-FOLD CROSS-VALIDATION RESULTS")
        print("="*80)
        print(f"Metric: {r['scoring']}")
        print(f"Number of folds: {self.n_splits}")
        print(f"Shuffle: {self.shuffle}")
        
        print("\n" + "-"*80)
        print("TEST SET PERFORMANCE")
        print("-"*80)
        print(f"Mean:      {r['test_mean']:.6f}")
        print(f"Std Dev:   {r['test_std']:.6f}")
        print(f"Min:       {r['test_min']:.6f}")
        print(f"Max:       {r['test_max']:.6f}")
        print(f"Range:     {r['test_max'] - r['test_min']:.6f}")
        print(f"\n95% Confidence Interval: [{r['ci_lower']:.6f}, {r['ci_upper']:.6f}]")
        
        print("\n" + "-"*80)
        print("TRAIN SET PERFORMANCE (checking for overfitting)")
        print("-"*80)
        print(f"Mean:      {r['train_mean']:.6f}")
        print(f"Std Dev:   {r['train_std']:.6f}")
        print(f"\nTrain-Test Gap: {r['train_mean'] - r['test_mean']:.6f}")
        
        if r['train_mean'] - r['test_mean'] > 0.1:
            print("⚠️  WARNING: Large train-test gap suggests overfitting!")
        elif r['train_mean'] - r['test_mean'] < 0.02:
            print("✅ Good: Small train-test gap (low overfitting)")
        
        print("\n" + "-"*80)
        print("PER-FOLD SCORES")
        print("-"*80)
        for i, (train_score, test_score) in enumerate(zip(r['train_scores'], r['test_scores']), 1):
            print(f"Fold {i}: Train = {train_score:.6f}, Test = {test_score:.6f}, "
                  f"Gap = {train_score - test_score:.6f}")
        
        print("\n" + "-"*80)
        print("COMPUTATIONAL COST")
        print("-"*80)
        print(f"Mean fit time per fold: {r['mean_fit_time']:.4f} seconds")
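        # Sum of per-fold fit times; wall-clock time is lower when folds run in parallel (n_jobs=-1)
        print(f"Total fit time (sum over folds): {r['mean_fit_time'] * self.n_splits:.4f} seconds")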
        print("="*80)
    
    def plot_fold_comparison(self, figsize: Tuple[int, int] = (12, 5)) -> None:
        """
        Visualize performance across folds.
        
        Args:
            figsize: Figure size (width, height)
        """
        if not self.results:
            print("No results available. Run evaluate() first.")
            return
        
        r = self.results
        folds = np.arange(1, self.n_splits + 1)
        
        fig, axes = plt.subplots(1, 2, figsize=figsize)
        
        # Plot 1: Train vs Test scores per fold
        axes[0].plot(folds, r['train_scores'], 'o-', label='Train Score', 
                    color='green', linewidth=2, markersize=8)
        axes[0].plot(folds, r['test_scores'], 'o-', label='Test Score', 
                    color='blue', linewidth=2, markersize=8)
        axes[0].axhline(r['test_mean'], color='red', linestyle='--', linewidth=2, 
                       label=f'Test Mean = {r["test_mean"]:.4f}')
        axes[0].fill_between(folds, 
                            r['test_mean'] - r['test_std'], 
                            r['test_mean'] + r['test_std'], 
                            alpha=0.2, color='blue', label='± 1 Std Dev')
        
        axes[0].set_xlabel('Fold Number', fontsize=11, fontweight='bold')
        axes[0].set_ylabel(f'{r["scoring"].capitalize()}', fontsize=11, fontweight='bold')
        axes[0].set_title('Performance Across Folds', fontsize=12, fontweight='bold')
        axes[0].legend()
        axes[0].grid(alpha=0.3)
        axes[0].set_xticks(folds)
        
        # Plot 2: Box plot of test scores
        bp = axes[1].boxplot([r['test_scores']], labels=['Test Scores'], patch_artist=True)
        bp['boxes'][0].set_facecolor('lightblue')
        bp['medians'][0].set_color('red')
        bp['medians'][0].set_linewidth(2)
        
        # Add individual points
        axes[1].scatter([1] * len(r['test_scores']), r['test_scores'], 
                       alpha=0.6, s=50, color='blue', zorder=3)
        
        # Add mean line
        axes[1].axhline(r['test_mean'], color='green', linestyle='--', linewidth=2, 
                       label=f'Mean = {r["test_mean"]:.4f}')
        
        axes[1].set_ylabel(f'{r["scoring"].capitalize()}', fontsize=11, fontweight='bold')
        axes[1].set_title(f'Distribution of Test Scores\n(Std = {r["test_std"]:.4f})', 
                         fontsize=12, fontweight='bold')
        axes[1].legend()
        axes[1].grid(alpha=0.3, axis='y')
        
        plt.tight_layout()
        plt.show()


# Example usage demonstration
if __name__ == "__main__":
    print("EXAMPLE: K-Fold Cross-Validation on Semiconductor Yield Prediction")
    print("="*80)
    
    # Generate synthetic semiconductor data
    n_devices = 1000
    n_features = 5
    
    # Features: VDD, IDD, Freq, Temp, Radial_distance
    X = np.random.randn(n_devices, n_features)
    X[:, 0] = np.random.normal(1.8, 0.05, n_devices)  # VDD
    X[:, 1] = np.random.normal(50, 5, n_devices)      # IDD
    X[:, 2] = np.random.normal(2000, 100, n_devices)  # Freq
    X[:, 3] = np.random.normal(85, 5, n_devices)      # Temp
    X[:, 4] = np.random.uniform(0, 5, n_devices)      # Radial distance
    
    # Target: fail more likely at edge (high radial distance) and extreme parameters
    fail_prob = 0.05 + 0.1 * (X[:, 4] / 5.0) + 0.1 * (np.abs(X[:, 0] - 1.8) > 0.1)
    y = (np.random.random(n_devices) < fail_prob).astype(int)
    
    print(f"Dataset: {n_devices} devices, {n_features} features")
    print(f"Class distribution: {(1-y.mean())*100:.1f}% pass, {y.mean()*100:.1f}% fail\n")
    
    # Create model
    model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
    
    # Perform 5-Fold CV
    evaluator = KFoldEvaluator(n_splits=5, random_state=42)
    results = evaluator.evaluate(model, X, y, scoring='accuracy')
    
    # Print summary
    evaluator.print_summary()
    
    # Visualize
    evaluator.plot_fold_comparison()

## 🎯 Stratified K-Fold: Preserving Class Distribution

**Problem with standard K-Fold**: For **imbalanced datasets**, random splits can create folds with very different class distributions:

**Example**: 1000 samples with 10% positive class (100 positives, 900 negatives)
- K=5 random splits: Each fold should have ~20 positives, ~180 negatives
- **But random chance**: One fold might get 15 positives while another gets 25!
- This creates **high variance** in metrics across folds

**Stratified K-Fold solution**: Ensures each fold has **approximately the same class distribution** as the full dataset.

### How Stratified K-Fold Works

1. **Separate samples by class**: Positives and negatives
2. **Split each class into K folds independently**
3. **Combine corresponding folds**: Fold 1 = positives_fold1 + negatives_fold1
4. **Result**: Each fold has same positive/negative ratio as full dataset

### Mathematical Guarantee

For binary classification with positive class proportion $p$:

**Standard K-Fold**: Each fold's positive fraction is $\approx p \pm \sqrt{p(1-p)/n_k}$, where $n_k$ is the fold size (one standard deviation, binomial approximation)

**Stratified K-Fold**: Each fold's positive fraction is **exactly** $p$ (within 1 sample due to rounding)

**Example**:
- 1000 samples, 10% positive, K=5
- Standard K-Fold: Each fold could have 8-12% positives (variance)
- Stratified K-Fold: Each fold has exactly 10% positives (200 samples, 20 positive)
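The example can be checked directly. The sketch below builds a label vector with 100 positives out of 1000 and counts positives per validation fold for both splitters; the feature matrix is a dummy placeholder because only `y` matters for the split.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([1] * 100 + [0] * 900)   # 10% positive class
X = np.zeros((len(y), 1))             # dummy features; splitting ignores their values

def positives_per_fold(splitter):
    """Count positive samples in each validation fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]

plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold positives per fold:          ", plain)
print("StratifiedKFold positives per fold:", strat)
```

The stratified counts are identical across folds, while the plain `KFold` counts scatter around 20 depending on the seed.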

### When to Use Stratified K-Fold

✅ **Always use for classification problems**, especially when:
- **Imbalanced classes** (minority class < 20%)
- **Small datasets** (where random variation is high)
- **Multi-class problems** (ensures all classes in every fold)
- **Need low-variance estimates** (reduces fold-to-fold variability)

❌ **Do NOT use for:**
- **Regression problems** (no classes to stratify by)
- **Time series data** (breaks temporal ordering)

### Semiconductor Example: Defect Detection

**Scenario**: Predict device defects (2% defect rate, 10,000 devices)

**Standard 5-Fold**:
- Fold 1: 15 defects (0.75%) ← Too few!
- Fold 2: 25 defects (1.25%)
- Fold 3: 50 defects (2.5%) ← Too many!
- Fold 4: 40 defects (2.0%)
- Fold 5: 70 defects (3.5%) ← Very unbalanced!
- **Result**: High variance in metrics (Recall varies wildly)

**Stratified 5-Fold**:
- Fold 1: 40 defects (2.0%) ✓
- Fold 2: 40 defects (2.0%) ✓
- Fold 3: 40 defects (2.0%) ✓
- Fold 4: 40 defects (2.0%) ✓
- Fold 5: 40 defects (2.0%) ✓
- **Result**: Low variance, reliable estimates

### Multi-Class Stratification

Stratified K-Fold also works for **multi-class problems**:

**Example**: Device binning (4 bins: BIN1=30%, BIN2=40%, BIN3=20%, BIN4=10%)

Each fold maintains proportions:
- BIN1: 30% in every fold
- BIN2: 40% in every fold
- BIN3: 20% in every fold
- BIN4: 10% in every fold

This ensures **all bins are present** in every fold (important for rare classes!).
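A quick sketch of the multi-class case, assuming 1000 devices with the bin proportions from the example (bins encoded as integers 0-3 for BIN1-BIN4; the encoding is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 1000 devices: BIN1=30%, BIN2=40%, BIN3=20%, BIN4=10% (encoded 0..3)
y = np.repeat([0, 1, 2, 3], [300, 400, 200, 100])
X = np.zeros((len(y), 1))   # dummy features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = [np.bincount(y[val_idx], minlength=4).tolist()
               for _, val_idx in skf.split(X, y)]

for i, counts in enumerate(fold_counts, 1):
    print(f"Fold {i}: devices per bin = {counts}")
```

Each validation fold of 200 devices keeps the 30/40/20/10 split, so even the rarest bin appears in every fold.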

### Variance Reduction

**Empirical observation**: Stratified K-Fold typically reduces variance by **30-50%** compared to standard K-Fold for imbalanced problems.

**Example metrics**:
- Standard K-Fold: F1 = 0.75 ± 0.08 (high variance)
- Stratified K-Fold: F1 = 0.76 ± 0.04 (low variance, more reliable)

### Edge Case: Extremely Rare Classes

**Problem**: If minority class has fewer samples than K, impossible to put in every fold!

**Example**: 5 defects in 1000 devices, K=10
- Can't put 0.5 defects in each fold!
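- scikit-learn's `StratifiedKFold` emits a warning and leaves some folds with no defects at all (it raises an error only when K exceeds the size of every class)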

**Solutions**:
1. Reduce K (use K=5 or K=3 instead)
2. Use Leave-One-Out CV (K=n)
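3. Oversample the minority class inside each training fold (never before splitting, because copies of the same sample would then leak into the validation folds)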
4. Use Group K-Fold with wafer/lot grouping
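The edge case is easy to reproduce. This sketch uses the numbers from the example (5 defects in 1000 devices, K=10) and records any warnings scikit-learn emits while splitting; the dummy feature matrix is an assumption for illustration.

```python
import warnings
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([1] * 5 + [0] * 995)   # 5 defects in 1000 devices
X = np.zeros((len(y), 1))           # dummy features

skf = StratifiedKFold(n_splits=10)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    defects_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]

print("Defects per validation fold:", defects_per_fold)
print("Warnings raised while splitting:", len(caught))
```

Folds without any defect make metrics such as recall undefined for that fold, which is why reducing K is usually the first remedy to try.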

### Implementation Note

```python
from sklearn.model_selection import StratifiedKFold

# Automatically maintains class distribution
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, val_idx in skf.split(X, y):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    
    # y_train and y_val have same class distribution as full y
```

**sklearn default**: `cross_val_score(..., cv=5)` uses **StratifiedKFold** (without shuffling) automatically when the estimator is a classifier and `y` is binary or multiclass! 🎉

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold, KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from typing import Tuple

np.random.seed(42)

def compare_stratified_vs_regular_kfold(
    X: np.ndarray, 
    y: np.ndarray, 
    n_splits: int = 5,
    figsize: Tuple[int, int] = (14, 10)
) -> None:
    """
    Compare Stratified K-Fold vs Regular K-Fold on imbalanced data.
    
    Demonstrates:
    1. Class distribution across folds
    2. Performance variance
    3. Why stratification matters for imbalanced data
    
    Args:
        X: Features (n_samples, n_features)
        y: Target (n_samples,) - binary classification
        n_splits: Number of folds
        figsize: Figure size
    """
    # Create both splitters
    regular_kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    stratified_kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    
    # Track class distributions
    regular_train_dist = []
    regular_val_dist = []
    stratified_train_dist = []
    stratified_val_dist = []
    
    # Regular K-Fold analysis
    for train_idx, val_idx in regular_kf.split(X, y):
        regular_train_dist.append(y[train_idx].mean())
        regular_val_dist.append(y[val_idx].mean())
    
    # Stratified K-Fold analysis
    for train_idx, val_idx in stratified_kf.split(X, y):
        stratified_train_dist.append(y[train_idx].mean())
        stratified_val_dist.append(y[val_idx].mean())
    
    # Compute performance metrics
    model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
    
    regular_scores = cross_val_score(model, X, y, cv=regular_kf, scoring='f1')
    stratified_scores = cross_val_score(model, X, y, cv=stratified_kf, scoring='f1')
    
    # Print comparison
    print("="*80)
    print("STRATIFIED vs REGULAR K-FOLD COMPARISON")
    print("="*80)
    print(f"Dataset: {len(y)} samples")
    print(f"Overall positive class rate: {y.mean()*100:.2f}%")
    print(f"Number of folds: {n_splits}")
    
    print("\n" + "-"*80)
    print("CLASS DISTRIBUTION ACROSS FOLDS (Validation Sets)")
    print("-"*80)
    print(f"{'Fold':<10} {'Regular K-Fold':<20} {'Stratified K-Fold':<20}")
    print("-"*80)
    for i in range(n_splits):
        print(f"Fold {i+1:<5} {regular_val_dist[i]*100:>6.2f}% positive     "
              f"{stratified_val_dist[i]*100:>6.2f}% positive")
    
    print("\n" + "-"*80)
    print("VARIANCE IN CLASS DISTRIBUTION")
    print("-"*80)
    print(f"Regular K-Fold:    Std = {np.std(regular_val_dist)*100:.4f}%")
    print(f"Stratified K-Fold: Std = {np.std(stratified_val_dist)*100:.4f}%")
    print(f"Std reduction: {(1 - np.std(stratified_val_dist)/np.std(regular_val_dist))*100:.1f}%")
    
    print("\n" + "-"*80)
    print("PERFORMANCE METRICS (F1 Score)")
    print("-"*80)
    print(f"Regular K-Fold:    {regular_scores.mean():.6f} ± {regular_scores.std():.6f}")
    print(f"Stratified K-Fold: {stratified_scores.mean():.6f} ± {stratified_scores.std():.6f}")
    print(f"Std reduction: {(1 - stratified_scores.std()/regular_scores.std())*100:.1f}%")
    print("="*80)
    
    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=figsize)
    folds = np.arange(1, n_splits + 1)
    overall_rate = y.mean()
    
    # Plot 1: Regular K-Fold class distribution
    axes[0, 0].bar(folds, np.array(regular_val_dist)*100, alpha=0.7, 
                   color='orange', edgecolor='black')
    axes[0, 0].axhline(overall_rate*100, color='red', linestyle='--', 
                       linewidth=2, label=f'Overall: {overall_rate*100:.2f}%')
    axes[0, 0].set_xlabel('Fold Number', fontsize=11, fontweight='bold')
    axes[0, 0].set_ylabel('Positive Class %', fontsize=11, fontweight='bold')
    axes[0, 0].set_title(f'Regular K-Fold: Class Distribution\nStd = {np.std(regular_val_dist)*100:.4f}%', 
                         fontsize=12, fontweight='bold')
    axes[0, 0].legend()
    axes[0, 0].grid(alpha=0.3, axis='y')
    axes[0, 0].set_xticks(folds)
    
    # Plot 2: Stratified K-Fold class distribution
    axes[0, 1].bar(folds, np.array(stratified_val_dist)*100, alpha=0.7, 
                   color='green', edgecolor='black')
    axes[0, 1].axhline(overall_rate*100, color='red', linestyle='--', 
                       linewidth=2, label=f'Overall: {overall_rate*100:.2f}%')
    axes[0, 1].set_xlabel('Fold Number', fontsize=11, fontweight='bold')
    axes[0, 1].set_ylabel('Positive Class %', fontsize=11, fontweight='bold')
    axes[0, 1].set_title(f'Stratified K-Fold: Class Distribution\nStd = {np.std(stratified_val_dist)*100:.4f}%', 
                         fontsize=12, fontweight='bold')
    axes[0, 1].legend()
    axes[0, 1].grid(alpha=0.3, axis='y')
    axes[0, 1].set_xticks(folds)
    
    # Plot 3: Performance comparison (F1 scores)
    axes[1, 0].plot(folds, regular_scores, 'o-', color='orange', 
                    linewidth=2, markersize=8, label='Regular K-Fold')
    axes[1, 0].plot(folds, stratified_scores, 'o-', color='green', 
                    linewidth=2, markersize=8, label='Stratified K-Fold')
    axes[1, 0].set_xlabel('Fold Number', fontsize=11, fontweight='bold')
    axes[1, 0].set_ylabel('F1 Score', fontsize=11, fontweight='bold')
    axes[1, 0].set_title('Performance Across Folds', fontsize=12, fontweight='bold')
    axes[1, 0].legend()
    axes[1, 0].grid(alpha=0.3)
    axes[1, 0].set_xticks(folds)
    
    # Plot 4: Box plots of F1 scores
    bp = axes[1, 1].boxplot([regular_scores, stratified_scores], 
                            labels=['Regular K-Fold', 'Stratified K-Fold'],
                            patch_artist=True)
    bp['boxes'][0].set_facecolor('orange')
    bp['boxes'][1].set_facecolor('lightgreen')
    
    # Add scatter points
    for i, scores in enumerate([regular_scores, stratified_scores], 1):
        axes[1, 1].scatter([i] * len(scores), scores, alpha=0.6, s=50, 
                          color='blue', zorder=3)
    
    axes[1, 1].set_ylabel('F1 Score', fontsize=11, fontweight='bold')
    axes[1, 1].set_title('F1 Score Distribution', fontsize=12, fontweight='bold')
    axes[1, 1].grid(alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()


# Example usage
if __name__ == "__main__":
    print("\nDEMONSTRATION: Stratified K-Fold for Imbalanced Semiconductor Defect Detection\n")
    
    # Generate imbalanced dataset (5% defect rate)
    n_devices = 2000
    n_features = 5
    
    X = np.random.randn(n_devices, n_features)
    X[:, 0] = np.random.normal(1.8, 0.05, n_devices)  # VDD
    X[:, 1] = np.random.normal(50, 5, n_devices)      # IDD
    X[:, 2] = np.random.normal(2000, 100, n_devices)  # Freq
    X[:, 3] = np.random.normal(85, 5, n_devices)      # Temp
    X[:, 4] = np.random.uniform(0, 5, n_devices)      # Radial distance
    
    # Create imbalanced target (5% defect rate)
    fail_prob = 0.03 + 0.05 * (X[:, 4] / 5.0)  # Edge effect
    y = (np.random.random(n_devices) < fail_prob).astype(int)
    
    # Run comparison
    compare_stratified_vs_regular_kfold(X, y, n_splits=5)

## ⏰ Time Series Cross-Validation: Respecting Temporal Order

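**Critical Problem**: Standard K-Fold and Stratified K-Fold **ignore temporal order**: every fold except the last trains on data that comes after its validation block, and shuffling (`shuffle=True`) mixes past and future even further. This is **catastrophic for time series** because: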

1. **Data leakage**: Training on future data to predict past (impossible in production!)
2. **Unrealistic evaluation**: Model sees future patterns it won't have in deployment
3. **Overly optimistic metrics**: Performance much better than real-world

**Example of the problem**:
- You have test data from January-December 2024
- Standard K-Fold might train on June-December to predict January-May
- **In production**: Model deployed in January 2025 has NO access to future months!
- **Result**: Real performance much worse than CV suggests

### Time Series Split: Forward Chaining

**Solution**: Use **expanding window** or **rolling window** approach where validation is always **after** training:

```
Split 1:  [Train: Month 1-3]  →  [Val: Month 4]
Split 2:  [Train: Month 1-4]  →  [Val: Month 5]
Split 3:  [Train: Month 1-5]  →  [Val: Month 6]
Split 4:  [Train: Month 1-6]  →  [Val: Month 7]
...
```

**Key property**: Training data is always **before** validation data (no future leakage).
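The forward-chaining schedule above can be reproduced with scikit-learn's `TimeSeriesSplit`. This is a sketch assuming seven monthly samples, one row per month, with `test_size=1` so each validation window is a single month.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

months = np.arange(1, 8)   # months 1..7, already in time order
tscv = TimeSeriesSplit(n_splits=4, test_size=1)

splits = []
for i, (train_idx, val_idx) in enumerate(tscv.split(months), 1):
    train_months, val_months = months[train_idx], months[val_idx]
    splits.append((train_months.tolist(), val_months.tolist()))
    print(f"Split {i}: train months {train_months.tolist()} -> val months {val_months.tolist()}")

# Every training month precedes every validation month in the same split.
no_leak = all(max(tr) < min(va) for tr, va in splits)
print("No future leakage:", no_leak)
```

Because the splitter only looks at row positions, the data must be sorted by time before it is passed in.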

### Mathematical Formulation

Given time-ordered dataset $D = \{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$ where index represents time:

**For k-th split**:
- **Training set**: $D_{train}^{(k)} = \{(x_i, y_i) : i \leq t_k\}$ (all data up to time $t_k$)
- **Validation set**: $D_{val}^{(k)} = \{(x_i, y_i) : t_k < i \leq t_{k+1}\}$ (data after $t_k$ up to $t_{k+1}$)

Where $t_1 < t_2 < ... < t_{K+1}$ are the split time points ($K$ splits need $K+1$ boundaries).

### Two Variants

#### 1. Expanding Window (sklearn default)

Training set **grows** with each split:

```
Split 1:  [████░░░░░░]  →  Val
Split 2:  [█████░░░░░]  →  Val
Split 3:  [██████░░░░]  →  Val
Split 4:  [███████░░░]  →  Val
Split 5:  [████████░░]  →  Val
```

**Pros**:
- Uses all historical data (no waste)
- More training data → better model
- Reflects production scenario (retrain with all historical data)

**Cons**:
- Later folds have more training data (bias)
- Computational cost increases (larger training sets)

#### 2. Rolling Window (fixed size)

Training set **slides** with fixed size:

```
Split 1:  [████]░░░░░░  →  Val
Split 2:  ░[████]░░░░░  →  Val
Split 3:  ░░[████]░░░░  →  Val
Split 4:  ░░░[████]░░░  →  Val
Split 5:  ░░░░[████]░░  →  Val
```

**Pros**:
- Consistent training set size (fair comparison)
- Focuses on recent data (if older data less relevant)
- Faster training (fixed size)

**Cons**:
- Wastes early data (not used in later folds)
- May not reflect production (usually have all history)
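Both variants can be produced with `TimeSeriesSplit`; the rolling window is obtained by capping the training set with `max_train_size`. The sketch below assumes ten time-ordered samples and a window of four, chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

t = np.arange(10)   # ten time-ordered samples

expanding = TimeSeriesSplit(n_splits=5, test_size=1)
rolling = TimeSeriesSplit(n_splits=5, test_size=1, max_train_size=4)

expanding_sizes = [len(tr) for tr, _ in expanding.split(t)]
rolling_splits = [(tr.tolist(), va.tolist()) for tr, va in rolling.split(t)]
rolling_sizes = [len(tr) for tr, _ in rolling_splits]

print("Expanding train sizes:", expanding_sizes)
print("Rolling train sizes:  ", rolling_sizes)
for tr, va in rolling_splits:
    print(f"  train {tr} -> val {va}")
```

The rolling variant keeps only the most recent four samples before each validation point, so older observations drop out of later splits.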

### Post-Silicon Validation Example

**Scenario**: Test time prediction across production batches

- **Data**: 52 weeks of test data (Week 1-52, 2024)
- **Goal**: Deploy model in Week 1, 2025 → predict Week 2-52, 2025

**Wrong approach (Standard K-Fold)**:
```python
# WRONG: Randomly splits weeks, trains on future to predict past!
kfold = KFold(n_splits=5, shuffle=True)
```

Result: Train on Week 40-52 to predict Week 1-10 → **unrealistic!**

**Correct approach (Time Series Split)**:
```python
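# CORRECT: Always train on past to predict future
# (assumes one row per week; test_size=2 gives the two-week validation windows listed below)
tscv = TimeSeriesSplit(n_splits=5, test_size=2)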
```

Splits:
- Split 1: Train [Week 1-42]  → Val [Week 43-44]
- Split 2: Train [Week 1-44]  → Val [Week 45-46]
- Split 3: Train [Week 1-46]  → Val [Week 47-48]
- Split 4: Train [Week 1-48]  → Val [Week 49-50]
- Split 5: Train [Week 1-50]  → Val [Week 51-52]

**Interpretation**: Simulates deploying model at Week 42, 44, 46, 48, 50 and evaluating on next 2 weeks.
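The split boundaries listed above can be verified with a short sketch. It assumes one row per week and `test_size=2`; with the default `test_size`, scikit-learn would instead use validation windows of 52 // 6 = 8 weeks.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

weeks = np.arange(1, 53)   # weeks 1..52, one row per week (assumption)
tscv = TimeSeriesSplit(n_splits=5, test_size=2)

week_splits = []
for i, (train_idx, val_idx) in enumerate(tscv.split(weeks), 1):
    train_w, val_w = weeks[train_idx], weeks[val_idx]
    week_splits.append((int(train_w[0]), int(train_w[-1]), val_w.tolist()))
    print(f"Split {i}: Train [Week {train_w[0]}-{train_w[-1]}] -> Val {val_w.tolist()}")
```

With several devices per week the splitter would cut by row count rather than by week, so real data would first need aggregation or a custom splitter keyed on the week column.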

### When to Use Time Series Split

✅ **Always use for temporal data**:
- **Time series forecasting** (stock prices, demand, sensor data)
- **Sequential test data** (device tests ordered by time/batch)
- **Longitudinal studies** (patient outcomes over time)
- **Manufacturing data** (production runs, process drift)
- **Any data with temporal ordering**

❌ **Do NOT use for**:
- **i.i.d. data** (no temporal correlation) → use K-Fold
- **Small datasets** (not enough splits for reliable estimate)

### Special Considerations

#### Gap Between Train and Validation

Sometimes you need a **gap** to avoid leakage:

```
Train [Month 1-3]  →  [Gap: Month 4]  →  Val [Month 5]
```

**Why gap?**
- Avoid autocorrelation (today's value correlated with yesterday)
- Realistic: Model trained on Monday, deployed Wednesday (2-day gap)
- Example: Stock prediction (can't use today to predict tomorrow due to execution delay)

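**sklearn supports gaps natively**: `TimeSeriesSplit` accepts a `gap` argument (scikit-learn 0.24 and later) that drops that many samples between the end of each training set and the start of its validation set, so no custom splitter is needed for the basic case.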
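For reference, recent scikit-learn releases expose a `gap` argument on `TimeSeriesSplit`. The sketch below assumes ten daily samples and a one-sample gap; it checks that one day is always skipped between the last training day and the validation day.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

days = np.arange(1, 11)   # ten time-ordered samples, labelled day 1..10
tscv = TimeSeriesSplit(n_splits=3, test_size=1, gap=1)

distances = []
for train_idx, val_idx in tscv.split(days):
    last_train_day = int(days[train_idx][-1])
    val_day = int(days[val_idx][0])
    distances.append(val_day - last_train_day)
    print(f"train up to day {last_train_day}, skip 1 day, validate on day {val_day}")
```

A distance of 2 between the last training day and the validation day means exactly one sample was left out in between.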

#### Minimum Training Size

Early splits have **small training sets** ‚Üí poor model quality.

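**Solution**: `TimeSeriesSplit` has no minimum-training-size parameter; `max_train_size` only caps the window from above:
```python
TimeSeriesSplit(n_splits=5, max_train_size=None)  # Expanding window
TimeSeriesSplit(n_splits=5, max_train_size=1000)  # Rolling window (max 1000)
```

To guarantee enough training data, pick `n_splits` and `test_size` so that the first training set is large enough (it has `n_samples - n_splits * test_size` samples when `test_size` is given and no gap is used), or skip early splits with insufficient data.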
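Skipping early splits can be done with a thin wrapper around the splitter. `splits_with_min_train` below is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def splits_with_min_train(X, n_splits, min_train_size):
    """Yield only the time-series splits whose training set is large enough.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    for train_idx, val_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        if len(train_idx) >= min_train_size:
            yield train_idx, val_idx

X = np.arange(60).reshape(-1, 1)   # 60 time-ordered samples
kept = list(splits_with_min_train(X, n_splits=5, min_train_size=30))
train_sizes = [len(train_idx) for train_idx, _ in kept]
print("Training sizes of the splits that were kept:", train_sizes)
```

Dropping splits reduces the number of scores being averaged, so the variance estimate becomes less reliable; that trade-off should be weighed against training on too little data.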

#### Choosing Number of Splits

**Trade-off**:
- **More splits** (K=10): Better variance estimate, but each validation set is smaller
- **Fewer splits** (K=3): Larger validation sets, but higher variance estimate

**Rule of thumb**:
- Short time series (<100 points): K=3-5
- Medium (100-1000): K=5-10
- Long (>1000): K=10+

### Comparison: Standard K-Fold vs Time Series Split

| **Aspect** | **Standard K-Fold** | **Time Series Split** |
|-----------|-------------------|---------------------|
| **Data order** | Random shuffle | Temporal order preserved |
| **Train/Val relationship** | Random split | Train always before Val |
| **Training set size** | Constant (~(K-1)/K) | Growing (expanding window) |
| **Realistic for time series?** | ❌ No (data leakage) | ✅ Yes (no leakage) |
| **When to use** | i.i.d. data | Temporal data |
| **Metrics** | Overly optimistic | Realistic |

### Production Deployment Pattern

Time Series CV mimics **production retraining schedule**:

```
Week 1-52:  Train model on historical data  →  Deploy Week 53
Week 53 ends: Retrain with Week 1-53        →  Deploy Week 54
Week 54 ends: Retrain with Week 1-54        →  Deploy Week 55
...
```

Each CV fold simulates one retraining cycle. Average metric across folds = expected production performance.

### Semiconductor Example: Process Drift Detection

**Scenario**: Predict device yield across 100 production lots (ordered by time)

**Wrong (Standard K-Fold)**:
- Metric: Accuracy = 95%
- **Problem**: Trained on Lot 80-100 to predict Lot 1-20 (impossible!)

**Correct (Time Series Split, K=5)**:
- Split 1: Train [Lot 1-70]  → Val [Lot 71-74]  : Accuracy = 92%
- Split 2: Train [Lot 1-74]  → Val [Lot 75-78]  : Accuracy = 90%
- Split 3: Train [Lot 1-78]  → Val [Lot 79-82]  : Accuracy = 88%
- Split 4: Train [Lot 1-82]  → Val [Lot 83-86]  : Accuracy = 86%
- Split 5: Train [Lot 1-86]  → Val [Lot 87-90]  : Accuracy = 85%
- **Mean ± Std**: 88.2% ± 2.9%

**Key insight**: Performance **degrades over time** (92% → 85%) due to process drift!

This tells you:
1. **Realistic performance**: 88.2% (not 95%)
2. **Model staleness**: Accuracy drops 7 percentage points across the five validation windows (Lot 71-74 to Lot 87-90)
3. **Retraining schedule**: Retrain every 10-15 lots to maintain >90% accuracy
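The drift can be quantified from the per-split accuracies listed above. The sketch fits a straight line through them with `np.polyfit`; a linear trend is an assumption made for illustration, and five points are too few for a firm conclusion.

```python
import numpy as np

split_number = np.arange(1, 6)
split_accuracy = np.array([92.0, 90.0, 88.0, 86.0, 85.0])   # % accuracy per split

mean_acc = split_accuracy.mean()
std_acc = split_accuracy.std(ddof=1)   # sample std, K-1 denominator
slope, intercept = np.polyfit(split_number, split_accuracy, deg=1)

print(f"Mean ± Std: {mean_acc:.1f}% ± {std_acc:.1f}%")
print(f"Trend: {slope:.2f} percentage points per split")
```

A clearly negative slope across consecutive validation windows is the signal to look for; a flat slope with the same spread would point to noise rather than drift.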

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import TimeSeriesSplit, KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from typing import Tuple, List
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

class TimeSeriesValidator:
    """
    Time Series Cross-Validation with visualization and analysis.
    
    Implements expanding window strategy (sklearn TimeSeriesSplit)
    with detailed performance tracking and trend analysis.
    """
    
    def __init__(self, n_splits: int = 5, max_train_size: int = None, test_size: int = None):
        """
        Initialize Time Series cross-validator.
        
        Args:
            n_splits: Number of splits
            max_train_size: Maximum size of training set (None = expanding window)
            test_size: Size of validation set (None = auto)
        """
        self.n_splits = n_splits
        self.max_train_size = max_train_size
        self.test_size = test_size
        self.tscv = TimeSeriesSplit(
            n_splits=n_splits, 
            max_train_size=max_train_size,
            test_size=test_size
        )
        self.results = {}
    
    def evaluate(self, model, X, y, scoring='r2') -> dict:
        """
        Perform time series cross-validation.
        
        Args:
            model: Sklearn-compatible model
            X: Features (n_samples, n_features) - time-ordered
            y: Target (n_samples,) - time-ordered
            scoring: Metric to compute
            
        Returns:
            Dictionary with detailed results
        """
        train_scores = []
        test_scores = []
        train_sizes = []
        test_sizes = []
        split_indices = []
        
        for fold_idx, (train_idx, test_idx) in enumerate(self.tscv.split(X), 1):
            X_train, X_test = X[train_idx], X[test_idx]
            y_train, y_test = y[train_idx], y[test_idx]
            
            # Train model
            model.fit(X_train, y_train)
            
            # Compute scores
            if scoring == 'r2':
                train_score = model.score(X_train, y_train)
                test_score = model.score(X_test, y_test)
            else:
                from sklearn.metrics import get_scorer
                scorer = get_scorer(scoring)
                train_score = scorer(model, X_train, y_train)
                test_score = scorer(model, X_test, y_test)
            
            train_scores.append(train_score)
            test_scores.append(test_score)
            train_sizes.append(len(train_idx))
            test_sizes.append(len(test_idx))
            split_indices.append((train_idx, test_idx))
        
        # Store results
        self.results = {
            'train_scores': np.array(train_scores),
            'test_scores': np.array(test_scores),
            'train_sizes': np.array(train_sizes),
            'test_sizes': np.array(test_sizes),
            'split_indices': split_indices,
            'test_mean': np.mean(test_scores),
            'test_std': np.std(test_scores),
            'test_min': np.min(test_scores),
            'test_max': np.max(test_scores),
            'scoring': scoring,
            'n_splits': self.n_splits
        }
        
        # Detect trend in performance
        from scipy.stats import linregress
        folds = np.arange(1, self.n_splits + 1)
        slope, intercept, r_value, p_value, std_err = linregress(folds, test_scores)
        
        self.results['trend_slope'] = slope
        self.results['trend_pvalue'] = p_value
        self.results['trend_significant'] = p_value < 0.05
        
        return self.results
    
    def print_summary(self) -> None:
        """
        Print formatted summary with trend analysis.
        """
        if not self.results:
            print("No results available. Run evaluate() first.")
            return
        
        r = self.results
        
        print("="*80)
        print(f"TIME SERIES CROSS-VALIDATION ({self.n_splits} SPLITS)")
        print("="*80)
        print(f"Metric: {r['scoring']}")
        print(f"Strategy: {'Expanding window' if self.max_train_size is None else f'Rolling window (max train={self.max_train_size})'}")
        
        print("\n" + "-"*80)
        print("OVERALL PERFORMANCE")
        print("-"*80)
        print(f"Mean:      {r['test_mean']:.6f}")
        print(f"Std Dev:   {r['test_std']:.6f}")
        print(f"Min:       {r['test_min']:.6f}")
        print(f"Max:       {r['test_max']:.6f}")
        print(f"Range:     {r['test_max'] - r['test_min']:.6f}")
        
        print("\n" + "-"*80)
        print("PERFORMANCE TREND ANALYSIS")
        print("-"*80)
        print(f"Trend slope: {r['trend_slope']:.6f} per fold")
        print(f"P-value: {r['trend_pvalue']:.6f}")
        
        if r['trend_significant']:
            if r['trend_slope'] < 0:
                print("⚠️  SIGNIFICANT DOWNWARD TREND: Performance degrading over time!")
                print("    → Model staleness detected (consider retraining schedule)")
                print(f"    → Expected drop: {abs(r['trend_slope'] * self.n_splits):.4f} over {self.n_splits} folds")
            else:
                print("✅ SIGNIFICANT UPWARD TREND: Performance improving over time")
                print("    → More training data helps (expanding window working well)")
        else:
            print("✅ NO SIGNIFICANT TREND: Performance stable over time")
        
        print("\n" + "-"*80)
        print("PER-FOLD DETAILS")
        print("-"*80)
        print(f"{'Fold':<6} {'Train Size':<12} {'Test Size':<12} {'Train Score':<14} {'Test Score':<14} {'Gap':<10}")
        print("-"*80)
        
        for i in range(self.n_splits):
            gap = r['train_scores'][i] - r['test_scores'][i]
            print(f"{i+1:<6} {r['train_sizes'][i]:<12} {r['test_sizes'][i]:<12} "
                  f"{r['train_scores'][i]:<14.6f} {r['test_scores'][i]:<14.6f} {gap:<10.6f}")
        
        print("="*80)
    
    def plot_time_series_splits(self, figsize: Tuple[int, int] = (14, 10)) -> None:
        """
        Visualize time series splits and performance.
        
        Args:
            figsize: Figure size
        """
        if not self.results:
            print("No results available. Run evaluate() first.")
            return
        
        r = self.results
        fig, axes = plt.subplots(2, 2, figsize=figsize)
        
        # Plot 1: Visual representation of splits
        for i, (train_idx, test_idx) in enumerate(r['split_indices']):
            # Train data
            axes[0, 0].barh(i, len(train_idx), left=train_idx[0], 
                           color='green', alpha=0.6, edgecolor='black')
            # Test data
            axes[0, 0].barh(i, len(test_idx), left=test_idx[0], 
                           color='orange', alpha=0.6, edgecolor='black')
        
        axes[0, 0].set_xlabel('Sample Index (Time →)', fontsize=11, fontweight='bold')
        axes[0, 0].set_ylabel('Fold Number', fontsize=11, fontweight='bold')
        axes[0, 0].set_title('Time Series Split Visualization\n(Green=Train, Orange=Test)', 
                            fontsize=12, fontweight='bold')
        axes[0, 0].set_yticks(range(self.n_splits))
        axes[0, 0].set_yticklabels([f'Fold {i+1}' for i in range(self.n_splits)])
        axes[0, 0].grid(alpha=0.3, axis='x')
        
        # Plot 2: Performance across folds
        folds = np.arange(1, self.n_splits + 1)
        axes[0, 1].plot(folds, r['train_scores'], 'o-', color='green', 
                       linewidth=2, markersize=8, label='Train Score')
        axes[0, 1].plot(folds, r['test_scores'], 'o-', color='blue', 
                       linewidth=2, markersize=8, label='Test Score')
        axes[0, 1].axhline(r['test_mean'], color='red', linestyle='--', 
                          linewidth=2, label=f'Test Mean = {r["test_mean"]:.4f}')
        
        # Add trend line
        from scipy.stats import linregress
        slope, intercept, _, _, _ = linregress(folds, r['test_scores'])
        trend_line = slope * folds + intercept
        axes[0, 1].plot(folds, trend_line, 'r--', linewidth=1.5, alpha=0.7, 
                       label=f'Trend (slope={slope:.4f})')
        
        axes[0, 1].set_xlabel('Fold Number (Time →)', fontsize=11, fontweight='bold')
        axes[0, 1].set_ylabel(f'{r["scoring"].capitalize()}', fontsize=11, fontweight='bold')
        axes[0, 1].set_title('Performance Over Time', fontsize=12, fontweight='bold')
        axes[0, 1].legend()
        axes[0, 1].grid(alpha=0.3)
        axes[0, 1].set_xticks(folds)
        
        # Plot 3: Training set size evolution
        axes[1, 0].bar(folds, r['train_sizes'], alpha=0.7, color='green', 
                      edgecolor='black', label='Train Size')
        axes[1, 0].bar(folds, r['test_sizes'], bottom=r['train_sizes'], 
                      alpha=0.7, color='orange', edgecolor='black', label='Test Size')
        
        axes[1, 0].set_xlabel('Fold Number', fontsize=11, fontweight='bold')
        axes[1, 0].set_ylabel('Number of Samples', fontsize=11, fontweight='bold')
        axes[1, 0].set_title('Dataset Split Sizes', fontsize=12, fontweight='bold')
        axes[1, 0].legend()
        axes[1, 0].grid(alpha=0.3, axis='y')
        axes[1, 0].set_xticks(folds)
        
        # Plot 4: Train-test gap
        gap = r['train_scores'] - r['test_scores']
        colors = ['red' if g > 0.1 else 'green' for g in gap]
        axes[1, 1].bar(folds, gap, alpha=0.7, color=colors, edgecolor='black')
        axes[1, 1].axhline(0, color='black', linestyle='-', linewidth=1)
        axes[1, 1].axhline(0.1, color='red', linestyle='--', linewidth=1, 
                          label='Overfitting threshold (0.1)')
        
        axes[1, 1].set_xlabel('Fold Number', fontsize=11, fontweight='bold')
        axes[1, 1].set_ylabel('Train - Test Gap', fontsize=11, fontweight='bold')
        axes[1, 1].set_title('Overfitting Detection\n(Red = High gap)', 
                            fontsize=12, fontweight='bold')
        axes[1, 1].legend()
        axes[1, 1].grid(alpha=0.3, axis='y')
        axes[1, 1].set_xticks(folds)
        
        plt.tight_layout()
        plt.show()


def compare_kfold_vs_timeseries(X, y, n_splits=5, figsize=(14, 5)):
    """
    Compare Standard K-Fold (WRONG) vs Time Series Split (CORRECT) for temporal data.
    
    Demonstrates why K-Fold fails on time series and Time Series Split succeeds.
    
    Args:
        X: Features (time-ordered)
        y: Target (time-ordered)
        n_splits: Number of folds
        figsize: Figure size
    """
    model = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42)
    
    # Standard K-Fold (WRONG for time series)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    kfold_scores = cross_val_score(model, X, y, cv=kfold, scoring='r2')
    
    # Time Series Split (CORRECT)
    tscv = TimeSeriesSplit(n_splits=n_splits)
    tscv_scores = cross_val_score(model, X, y, cv=tscv, scoring='r2')
    
    print("="*80)
    print("COMPARISON: K-FOLD vs TIME SERIES SPLIT")
    print("="*80)
    print(f"Dataset size: {len(y)} samples (time-ordered)")
    print(f"Number of splits: {n_splits}")
    
    print("\n" + "-"*80)
    print("RESULTS")
    print("-"*80)
    print(f"{'Method':<25} {'Mean R²':<15} {'Std R²':<15} {'Assessment':<30}")
    print("-"*80)
    print(f"{'K-Fold (WRONG)':<25} {kfold_scores.mean():<15.6f} {kfold_scores.std():<15.6f} "
          f"{'OVERLY OPTIMISTIC ❌':<30}")
    print(f"{'Time Series Split':<25} {tscv_scores.mean():<15.6f} {tscv_scores.std():<15.6f} "
          f"{'REALISTIC ✅':<30}")
    print("-"*80)
    print(f"\nOptimism bias: {(kfold_scores.mean() - tscv_scores.mean()):.6f}")
    print(f"Percentage overestimation: {((kfold_scores.mean() - tscv_scores.mean()) / tscv_scores.mean() * 100):.2f}%")
    print("="*80)
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=figsize)
    folds = np.arange(1, n_splits + 1)
    
    # Plot scores
    axes[0].plot(folds, kfold_scores, 'o-', color='red', linewidth=2, 
                markersize=8, label=f'K-Fold (Mean={kfold_scores.mean():.4f})')
    axes[0].plot(folds, tscv_scores, 'o-', color='green', linewidth=2, 
                markersize=8, label=f'Time Series Split (Mean={tscv_scores.mean():.4f})')
    
    axes[0].set_xlabel('Fold Number', fontsize=11, fontweight='bold')
    axes[0].set_ylabel('R² Score', fontsize=11, fontweight='bold')
    axes[0].set_title('Performance Comparison\n(K-Fold overestimates!)', 
                     fontsize=12, fontweight='bold')
    axes[0].legend()
    axes[0].grid(alpha=0.3)
    axes[0].set_xticks(folds)
    
    # Box plot (tick labels are set separately so this works across matplotlib versions)
    bp = axes[1].boxplot([kfold_scores, tscv_scores], patch_artist=True)
    axes[1].set_xticklabels(['K-Fold\n(WRONG)', 'Time Series\n(CORRECT)'])
    bp['boxes'][0].set_facecolor('lightcoral')
    bp['boxes'][1].set_facecolor('lightgreen')
    
    axes[1].set_ylabel('R² Score', fontsize=11, fontweight='bold')
    axes[1].set_title('Distribution Comparison', fontsize=12, fontweight='bold')
    axes[1].grid(alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()


# Example usage
if __name__ == "__main__":
    print("\nEXAMPLE: Time Series Cross-Validation for Semiconductor Test Time Prediction\n")
    
    # Generate time-ordered test time data (with trend)
    n_weeks = 52
    n_devices_per_week = 50
    n_samples = n_weeks * n_devices_per_week
    
    # Time-varying features (simulating process drift)
    time = np.arange(n_samples)
    complexity = 0.5 + 0.01 * (time / n_samples)  # Increasing complexity over time
    n_test_points = 20 + 10 * np.sin(2 * np.pi * time / (n_devices_per_week * 4))  # Seasonal
    
    X = np.column_stack([
        complexity + 0.1 * np.random.randn(n_samples),
        n_test_points + 5 * np.random.randn(n_samples),
        np.random.normal(2000, 200, n_samples),  # Frequency
    ])
    
    # Test time increases over time (process drift) + noise
    y = (10 + 30 * complexity + 0.5 * n_test_points + 
         5 * np.random.randn(n_samples))
    
    print(f"Dataset: {n_samples} samples ({n_weeks} weeks)")
    print(f"Features: complexity (trending up), n_test_points (seasonal), frequency")
    print(f"Target: test_time_ms (with process drift)\n")
    
    # Time Series CV
    validator = TimeSeriesValidator(n_splits=5)
    results = validator.evaluate(LinearRegression(), X, y, scoring='r2')
    validator.print_summary()
    validator.plot_time_series_splits()
    
    # Comparison
    print("\n\n")
    compare_kfold_vs_timeseries(X, y, n_splits=5)

## 🔁 Nested Cross-Validation: Unbiased Hyperparameter Tuning

**Critical Problem**: When you tune hyperparameters using CV and report those CV scores, you get **optimistically biased estimates**.

### The Hyperparameter Tuning Bias

**Naive approach** (WRONG):
```python
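# WRONG: Optimistic bias!
# Assumes model, param_grid, X and y are already defined
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X, y)
print(f"Best score: {grid_search.best_score_}")  # ← BIASED (too optimistic)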
```

**Problem**: `best_score_` is the highest mean validation score among all candidate configurations, measured on the same 5 folds that were used to pick the winner. This is **data snooping**: you peeked at validation performance to choose parameters!

**Result**: Reported score is higher than true generalization performance.

### Nested Cross-Validation Solution

**Idea**: Use **two nested CV loops**:
- **Outer loop**: Estimates true performance (unbiased)
- **Inner loop**: Tunes hyperparameters (on training set only)

```
Outer Fold 1:
    Inner CV on Train → Find best params → Evaluate on Outer Val → Score 1
Outer Fold 2:
    Inner CV on Train → Find best params → Evaluate on Outer Val → Score 2
...
Outer Fold K:
    Inner CV on Train → Find best params → Evaluate on Outer Val → Score K

Average scores → Unbiased performance estimate
```

### Mathematical Formulation

Given dataset $D$, split into K outer folds: $D = D_1 \cup D_2 \cup ... \cup D_K$

**For each outer fold k**:
1. **Outer training set**: $D_{train}^{outer} = D \setminus D_k$
2. **Inner CV** on $D_{train}^{outer}$:
   - Split into J inner folds
   - For each hyperparameter config $\theta$:
     - Compute inner CV score: $S_{inner}(\theta)$
   - Select best: $\theta_k^* = \arg\max_\theta S_{inner}(\theta)$
3. **Train final model** on $D_{train}^{outer}$ with $\theta_k^*$
4. **Evaluate** on $D_k$ → get $S_k^{outer}$

**Final unbiased estimate**:
$$
S_{nested} = \frac{1}{K} \sum_{k=1}^{K} S_k^{outer}
$$

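This is **unbiased** with respect to hyperparameter selection because the outer validation sets were never used to choose $\theta_k^*$. If anything, the estimate is slightly pessimistic, since each outer model is trained on only $(K-1)/K$ of the data.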
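
In scikit-learn, the two loops can be expressed compactly by passing a `GridSearchCV` object to `cross_val_score`: every outer fit then reruns the whole inner search on that fold's training data. The sketch below is illustrative only; the synthetic dataset, the decision tree and the one-parameter grid are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset and grid (illustrative assumptions)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
param_grid = {"max_depth": [2, 4, 8]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # J = 3
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # K = 5

# The tuner itself is the "model": each outer fit reruns the inner search
tuner = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=inner_cv)
outer_scores = cross_val_score(tuner, X, y, cv=outer_cv)

print(f"S_nested = {outer_scores.mean():.3f} (std {outer_scores.std(ddof=1):.3f})")
```

Here `outer_scores` holds the $S_k^{outer}$ values and their mean is $S_{nested}$. The compact form hides the per-fold best parameters, so a manual loop is still useful when you want to inspect them.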

### Why Nested CV is Necessary

**Experiment**: Compare naive CV vs nested CV

**Naive CV** (5-fold with hyperparameter tuning):
- Reports: Accuracy = 92%
- **But this used all data for hyperparameter selection!**

**Nested CV** (5×3: 5 outer, 3 inner):
- Reports: Accuracy = 88%
- **This is the true expected performance on new data**

**Bias**: 92% - 88% = 4 percentage points of optimistic bias!

The more hyperparameter configurations you compare, the larger this bias tends to be.
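
A small simulation makes the selection effect visible without training any model. It is a sketch under strong simplifying assumptions: every configuration has the same true accuracy (0.88), and each fold score is that value plus independent Gaussian noise. Any apparent gain from picking the best configuration is therefore pure selection bias.

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 0.88   # every config has the same true accuracy (assumption)
fold_noise = 0.03   # std of the per-fold estimation noise (assumption)
n_folds, max_configs = 5, 100

# Simulated mean CV score of each config: true score plus averaged fold noise
cv_means = true_score + rng.normal(0.0, fold_noise, size=(max_configs, n_folds)).mean(axis=1)

# Best score when only the first n configs are compared
best_by_n = {n: float(cv_means[:n].max()) for n in (1, 10, 100)}
for n, best in best_by_n.items():
    print(f"{n:>3} configs: best CV score = {best:.4f} (apparent gain {best - true_score:+.4f})")
```

Because the best of 100 noisy estimates can only match or exceed the best of the first 10, the reported "best score" creeps upward as the search widens, even though no configuration is actually better.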

### Computational Cost

**Nested CV is expensive**:
- Outer folds: K
- Inner folds: J
- Total models trained: K × J × (number of hyperparameter configs), plus K refits of the winning config on each outer training set

**Example**:
- K=5, J=3, 100 hyperparameter configs
- Total: 5 × 3 × 100 = **1,500 model trainings** (plus 5 refits)!

**Mitigation strategies**:
1. Use fewer outer folds (K=3) for final estimate
2. Use fewer inner folds (J=3) for hyperparameter tuning
3. Use RandomizedSearchCV instead of GridSearchCV (fewer configs)
4. Cache models if possible
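
To see how much each mitigation saves, the fit count can be tabulated before launching a search. The helper below is a hypothetical convenience function (not a scikit-learn API); it counts the K × J × configs search fits plus one refit per outer fold, and uses `ParameterGrid` only to count the configurations in an example grid.

```python
from sklearn.model_selection import ParameterGrid

def nested_cv_fit_count(n_outer: int, n_inner: int, n_configs: int) -> int:
    """Search fits (K * J * configs) plus one refit of the winner per outer fold."""
    return n_outer * n_inner * n_configs + n_outer

# Hypothetical grid: 3 * 3 * 3 = 27 configurations
param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 5, 10],
}
n_grid = len(ParameterGrid(param_grid))

full_grid = nested_cv_fit_count(5, 3, n_grid)    # exhaustive grid search
randomized = nested_cv_fit_count(5, 3, 10)       # e.g. RandomizedSearchCV(n_iter=10)
fewer_outer = nested_cv_fit_count(3, 3, n_grid)  # K reduced from 5 to 3

print(f"Full grid:        {full_grid} fits")
print(f"Randomized (10):  {randomized} fits")
print(f"Fewer outer (3):  {fewer_outer} fits")
```

In this example, sampling 10 configurations cuts the cost by more than half, while reducing the outer folds saves less and also makes the final estimate noisier.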

### When to Use Nested CV

✅ **Use nested CV when**:
- Publishing research (need unbiased estimates)
- Comparing multiple models fairly
- Reporting final performance to stakeholders
- Hyperparameter tuning is part of workflow

❌ **Skip nested CV when**:
- Just exploring/prototyping (too slow)
- Hyperparameters are fixed (no tuning)
- Dataset is huge (computational cost prohibitive)

### Nested CV vs Hold-Out Test Set

**Alternative approach**: Use 3-way split

```
Data → [Train: 60%] [Validation: 20%] [Test: 20%]

1. Tune hyperparameters: fit on Train, compare configs on Validation
2. Report final performance on Test (never used before)
```

**Nested CV advantages**:
- Uses all data (no held-out test set)
- More reliable estimate (averaged over folds)

**Hold-out advantages**:
- Much faster (1 test evaluation vs K)
- Simpler to implement
- Test set truly unseen

**Rule of thumb**:
- Small data (<10K): Use nested CV (can't afford to hold out 20%)
- Large data (>100K): Use hold-out test set (faster, simpler)
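
A 60/20/20 split can be produced with two calls to `train_test_split`. This is a minimal sketch for i.i.d. data, assuming toy arrays of 1,000 samples; temporal data would need a chronological split instead of a random one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(2000).reshape(1000, 2)  # toy features (assumption)
y = np.arange(1000) % 2               # toy binary labels (assumption)

# First carve off the 20% test set, then split the remaining 80% into 60/20
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)  # 0.25 * 80% = 20%

print(len(X_train), len(X_val), len(X_test))
```

The second call uses `test_size=0.25` because the validation set is taken from the remaining 80%, not from the full dataset.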

### Semiconductor Example

**Scenario**: Tune Random Forest for yield prediction (5,000 devices)

**Hyperparameters to tune**: `n_estimators`, `max_depth`, `min_samples_split` (10 configs)

**Naive CV** (WRONG):
```python
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_score_)  # Reports: 0.92 (optimistic!)
```

**Nested CV** (CORRECT):
```python
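# Assumes X, y (NumPy arrays) and param_grid are already defined
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Outer loop: 5-fold
# Inner loop: 3-fold for hyperparameter tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)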

nested_scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    
    # Inner CV for hyperparameter tuning
    grid = GridSearchCV(RandomForestClassifier(), param_grid, 
                        cv=inner_cv, n_jobs=-1)
    grid.fit(X_train, y_train)
    
    # Evaluate best model on outer test set
    score = grid.best_estimator_.score(X_test, y_test)
    nested_scores.append(score)

print(f"Nested CV: {np.mean(nested_scores):.4f} ± {np.std(nested_scores):.4f}")
# Reports: 0.88 ± 0.03 (unbiased!)
```

**Cost**: 5 outer × 3 inner × 10 configs = 150 model trainings (plus 5 refits of the best config)

### Reporting Guidelines

When using nested CV, report:

1. **Nested CV score** (unbiased estimate): "Accuracy = 0.88 ± 0.03"
2. **Best hyperparameters** (from each outer fold): Shows stability
3. **Final model**: Retrain on ALL data with most common best params

**Example report**:
```
Nested Cross-Validation Results (5 outer × 3 inner folds):
- Accuracy: 0.88 ± 0.03 (unbiased estimate)
- Best hyperparameters selected per fold:
  * Fold 1: n_estimators=100, max_depth=10
  * Fold 2: n_estimators=150, max_depth=10
  * Fold 3: n_estimators=100, max_depth=12
  * Fold 4: n_estimators=100, max_depth=10
  * Fold 5: n_estimators=100, max_depth=10
- Most common: n_estimators=100, max_depth=10
- Final model: Retrained on all data with n_estimators=100, max_depth=10
```
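
Picking the "most common" configuration is easy to automate once the best parameters of each outer fold are collected. The sketch below uses the five parameter sets from the example report; majority voting is just one possible rule, and the variable names are illustrative.

```python
from collections import Counter

# Best params per outer fold, copied from the example report above
best_params_per_fold = [
    {"n_estimators": 100, "max_depth": 10},
    {"n_estimators": 150, "max_depth": 10},
    {"n_estimators": 100, "max_depth": 12},
    {"n_estimators": 100, "max_depth": 10},
    {"n_estimators": 100, "max_depth": 10},
]

# Dicts are unhashable, so count a sorted tuple of their items instead
counts = Counter(tuple(sorted(p.items())) for p in best_params_per_fold)
most_common_items, n_folds_selected = counts.most_common(1)[0]
final_params = dict(most_common_items)

print(f"Most common: {final_params} "
      f"(chosen in {n_folds_selected} of {len(best_params_per_fold)} folds)")
```

If no configuration wins a clear majority, that instability is itself worth reporting, because it suggests the candidates perform similarly or that the search is noisy.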

### Practical Recommendations

For **production ML workflows**:

1. **Development phase**: Use simple CV or train/val split (fast iteration)
2. **Model selection**: Use nested CV to compare models fairly
3. **Final deployment**: 
   - Use nested CV to get unbiased performance estimate
   - Retrain on all data with best hyperparameters
   - Monitor production metrics (may differ from CV!)

In [None]:
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
from typing import Dict, List
import time

np.random.seed(42)

def nested_cross_validation(
    X, y,
    model_class,
    param_grid: Dict,
    outer_cv_splits: int = 5,
    inner_cv_splits: int = 3,
    scoring: str = 'accuracy'
) -> Dict:
    """
    Perform nested cross-validation for unbiased performance estimation.
    
    Args:
        X: Features
        y: Target
        model_class: Sklearn model class (e.g., RandomForestClassifier)
        param_grid: Hyperparameter grid for tuning
        outer_cv_splits: Number of outer folds
        inner_cv_splits: Number of inner folds
        scoring: Metric to optimize
        
    Returns:
        Dictionary with nested CV results
    """
    outer_cv = StratifiedKFold(n_splits=outer_cv_splits, shuffle=True, random_state=42)
    inner_cv = StratifiedKFold(n_splits=inner_cv_splits, shuffle=True, random_state=42)
    
    outer_scores = []
    best_params_per_fold = []
    inner_best_scores = []
    
    print("="*80)
    print(f"NESTED CROSS-VALIDATION: {outer_cv_splits} Outer × {inner_cv_splits} Inner Folds")
    print("="*80)
    
    start_time = time.time()
    
    for fold_idx, (train_idx, test_idx) in enumerate(outer_cv.split(X, y), 1):
        print(f"\nOuter Fold {fold_idx}/{outer_cv_splits}:")
        print("-" * 40)
        
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        
        # Inner CV for hyperparameter tuning
        grid_search = GridSearchCV(
            model_class(),
            param_grid,
            cv=inner_cv,
            scoring=scoring,
            n_jobs=-1,
            verbose=0
        )
        
        grid_search.fit(X_train, y_train)
        
        # Get best params and score from inner CV
        best_params = grid_search.best_params_
        inner_score = grid_search.best_score_
        
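        # Evaluate on outer test set (never used in inner CV).
        # GridSearchCV.score applies the same `scoring` metric as the inner CV,
        # so inner and outer scores are directly comparable
        # (best_estimator_.score would fall back to accuracy).
        outer_score = grid_search.score(X_test, y_test)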
        
        outer_scores.append(outer_score)
        best_params_per_fold.append(best_params)
        inner_best_scores.append(inner_score)
        
        print(f"  Best params: {best_params}")
        print(f"  Inner CV score: {inner_score:.6f}")
        print(f"  Outer test score: {outer_score:.6f}")
        print(f"  Gap (inner - outer): {inner_score - outer_score:.6f}")
    
    elapsed_time = time.time() - start_time
    
    # Compute statistics
    results = {
        'outer_scores': np.array(outer_scores),
        'inner_scores': np.array(inner_best_scores),
        'best_params_per_fold': best_params_per_fold,
        'mean_outer_score': np.mean(outer_scores),
        'std_outer_score': np.std(outer_scores),
        'mean_inner_score': np.mean(inner_best_scores),
        'std_inner_score': np.std(inner_best_scores),
        'optimism_bias': np.mean(inner_best_scores) - np.mean(outer_scores),
        'elapsed_time': elapsed_time,
        'outer_cv_splits': outer_cv_splits,
        'inner_cv_splits': inner_cv_splits
    }
    
    # Summary
    print("\n" + "="*80)
    print("NESTED CV SUMMARY")
    print("="*80)
    print(f"Outer CV (unbiased): {results['mean_outer_score']:.6f} ± {results['std_outer_score']:.6f}")
    print(f"Inner CV (biased):   {results['mean_inner_score']:.6f} ± {results['std_inner_score']:.6f}")
    print(f"Optimism bias:       {results['optimism_bias']:.6f}")
    print(f"Total time:          {elapsed_time:.2f} seconds")
    print("="*80)
    
    return results


def compare_naive_vs_nested_cv(
    X, y,
    model_class,
    param_grid: Dict,
    outer_cv_splits: int = 5,
    inner_cv_splits: int = 3,
    scoring: str = 'accuracy'
) -> None:
    """
    Compare naive CV (biased) vs nested CV (unbiased).
    
    Args:
        X: Features
        y: Target
        model_class: Sklearn model class
        param_grid: Hyperparameter grid
        outer_cv_splits: Outer folds
        inner_cv_splits: Inner folds
        scoring: Metric
    """
    print("\n" + "="*80)
    print("COMPARISON: NAIVE CV vs NESTED CV")
    print("="*80)
    
    # Naive CV (WRONG: uses all data for hyperparameter tuning)
    print("\n[1] Running NAIVE CV (biased)...")
    naive_start = time.time()
    
    grid_search = GridSearchCV(
        model_class(),
        param_grid,
        cv=outer_cv_splits,
        scoring=scoring,
        n_jobs=-1
    )
    grid_search.fit(X, y)
    
    naive_time = time.time() - naive_start
    naive_score = grid_search.best_score_
    naive_params = grid_search.best_params_
    
    print(f"  Best score (OPTIMISTIC): {naive_score:.6f}")
    print(f"  Best params: {naive_params}")
    print(f"  Time: {naive_time:.2f} seconds")
    
    # Nested CV (CORRECT: outer folds never used for hyperparameter tuning)
    print("\n[2] Running NESTED CV (unbiased)...")
    nested_results = nested_cross_validation(
        X, y, model_class, param_grid, 
        outer_cv_splits, inner_cv_splits, scoring
    )
    
    # Comparison
    print("\n" + "="*80)
    print("FINAL COMPARISON")
    print("="*80)
    print(f"{'Method':<20} {'Score':<20} {'Assessment':<30}")
    print("-"*80)
    print(f"{'Naive CV':<20} {naive_score:<20.6f} {'OPTIMISTIC BIAS ❌':<30}")
    print(f"{'Nested CV':<20} {nested_results['mean_outer_score']:<20.6f} {'UNBIASED ESTIMATE ✅':<30}")
    print("-"*80)
    print(f"\nOptimism bias: {naive_score - nested_results['mean_outer_score']:.6f}")
    print(f"Percentage overestimation: {((naive_score - nested_results['mean_outer_score']) / nested_results['mean_outer_score'] * 100):.2f}%")
    print(f"\nComputation time:")
    print(f"  Naive CV:  {naive_time:.2f} seconds")
    print(f"  Nested CV: {nested_results['elapsed_time']:.2f} seconds (×{nested_results['elapsed_time']/naive_time:.1f})")
    print("="*80)
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot 1: Score comparison
    methods = ['Naive CV\n(Biased)', 'Nested CV\n(Unbiased)']
    scores = [naive_score, nested_results['mean_outer_score']]
    colors = ['red', 'green']
    
    bars = axes[0].bar(methods, scores, color=colors, alpha=0.7, edgecolor='black')
    axes[0].set_ylabel('Score', fontsize=11, fontweight='bold')
    axes[0].set_title('Score Comparison\n(Naive CV overestimates!)', 
                     fontsize=12, fontweight='bold')
    axes[0].grid(alpha=0.3, axis='y')
    
    # Add value labels on bars
    for bar, score in zip(bars, scores):
        height = bar.get_height()
        axes[0].text(bar.get_x() + bar.get_width()/2., height + 0.005,
                    f'{score:.4f}', ha='center', va='bottom', fontweight='bold')
    
    # Plot 2: Inner vs Outer scores in nested CV
    folds = np.arange(1, outer_cv_splits + 1)
    axes[1].plot(folds, nested_results['inner_scores'], 'o-', 
                color='orange', linewidth=2, markersize=8, 
                label=f'Inner CV (Mean={nested_results["mean_inner_score"]:.4f})')
    axes[1].plot(folds, nested_results['outer_scores'], 'o-', 
                color='green', linewidth=2, markersize=8, 
                label=f'Outer CV (Mean={nested_results["mean_outer_score"]:.4f})')
    
    axes[1].set_xlabel('Fold Number', fontsize=11, fontweight='bold')
    axes[1].set_ylabel('Score', fontsize=11, fontweight='bold')
    axes[1].set_title('Nested CV: Inner vs Outer Scores\n(Gap shows optimism bias)', 
                     fontsize=12, fontweight='bold')
    axes[1].legend()
    axes[1].grid(alpha=0.3)
    axes[1].set_xticks(folds)
    
    plt.tight_layout()
    plt.show()


# Example usage
if __name__ == "__main__":
    print("\nEXAMPLE: Nested CV for Semiconductor Yield Prediction with Hyperparameter Tuning\n")
    
    # Generate synthetic semiconductor data
    X, y = make_classification(
        n_samples=2000,
        n_features=10,
        n_informative=8,
        n_redundant=2,
        n_classes=2,
        weights=[0.85, 0.15],  # Imbalanced (15% defect rate)
        random_state=42
    )
    
    print(f"Dataset: {len(y)} devices")
    print(f"Features: 10 parametric test measurements")
    print(f"Target: Pass/Fail (imbalanced: {(y==0).sum()} pass, {(y==1).sum()} fail)")
    print(f"Defect rate: {y.mean()*100:.1f}%")
    
    # Define hyperparameter grid
    param_grid = {
        'n_estimators': [50, 100, 150],
        'max_depth': [5, 10, 15],
        'min_samples_split': [2, 5, 10]
    }
    
    print(f"\nHyperparameter grid: {len(param_grid['n_estimators']) * len(param_grid['max_depth']) * len(param_grid['min_samples_split'])} configurations")
    
    # Run comparison
    compare_naive_vs_nested_cv(
        X, y,
        RandomForestClassifier,
        param_grid,
        outer_cv_splits=5,
        inner_cv_splits=3,
        scoring='f1'
    )

## 🎯 CV Strategy Selection Guide

### Decision Flowchart

```mermaid
graph TD
    A[Need to evaluate model] --> B{Data has<br/>temporal order?}

    B -->|Yes| C[Use Time Series Split]
    C --> C1{Process drift<br/>expected?}
    C1 -->|Yes| C2[Monitor performance<br/>trend across folds]
    C1 -->|No| C3[Standard Time Series CV]

    B -->|No| D{Data has<br/>group structure?}

    D -->|Yes| E[Use Group K-Fold]
    E --> E1[Keep groups together<br/>in same fold]

    D -->|No| F{Classification<br/>or Regression?}

    F -->|Classification| G{Classes<br/>balanced?}
    G -->|Yes| H[Use K-Fold]
    G -->|No| I[Use Stratified K-Fold]

    F -->|Regression| H

    H --> J{Need to tune<br/>hyperparameters?}
    I --> J

    J -->|Yes| K[Use Nested CV]
    J -->|No| L[Use Simple CV]

    K --> M[Report unbiased<br/>performance]
    L --> M
```

### Quick Reference Table

| **Data Characteristic** | **Recommended Strategy** | **Why** |
|------------------------|-------------------------|----------|
| **Temporal ordering** | Time Series Split | Prevents data leakage, realistic |
| **Imbalanced classes** | Stratified K-Fold | Maintains class distribution |
| **Group structure** (e.g., multiple samples per patient/wafer) | Group K-Fold | Prevents group leakage |
| **i.i.d. data, balanced** | K-Fold | Simple, standard |
| **Need hyperparameter tuning** | Nested CV | Unbiased performance |
| **Small dataset (<1000)** | K=10 or LOO | Maximizes training data |
| **Large dataset (>100K)** | K=3 or K=5 | Reduces computation |
| **Multi-class** | Stratified K-Fold | Ensures all classes in every fold |

### Semiconductor-Specific Guidelines

| **Scenario** | **CV Strategy** | **Specific Considerations** |
|--------------|----------------|---------------------------|
| **Wafer-level models** | Group K-Fold by wafer_id | Multiple dies from same wafer correlated |
| **Lot-based analysis** | Group K-Fold by lot_id | Manufacturing lots share process conditions |
| **Production time series** | Time Series Split | Process drift, equipment aging |
| **Spatial yield models** | Stratified K-Fold by yield bins | Maintain yield distribution |
| **Test time prediction** | Time Series Split | Test program evolves over time |
| **Defect detection** (rare) | Stratified K-Fold | Maintain low defect rate in each fold |
| **Multi-fab comparison** | Group K-Fold by fab_id | Fab-specific characteristics |

### Practical Recommendations

#### Development Phase (Fast Iteration)
- Use simple train/val split (80/20) or 3-Fold CV
- Focus on model development, not rigorous evaluation
- Accept higher variance for speed

#### Model Selection (Compare Algorithms)
- Use 5-Fold CV (or stratified/time series variant)
- Report mean ± std for each model
- Use statistical tests (paired t-test, McNemar) to compare

#### Final Evaluation (Production Readiness)
- Use 10-Fold CV or nested CV
- Report confidence intervals
- Include multiple metrics (not just accuracy)
- Document CV strategy in model card
    "\n",
    "#### Research/Publication\n",
    "- Use nested CV for hyperparameter tuning\n",
    "- Report both inner and outer CV scores\n",
    "- Use multiple random seeds to verify stability\n",
    "- Provide full reproducibility details\n",
    "\n",
    "### Common Mistakes to Avoid\n",
    "\n",
    "#### ‚ùå Mistake 1: Using K-Fold on Time Series\n",
    "**Problem**: Data leakage (training on future)\n",
    "**Solution**: Always use Time Series Split\n",
    "\n",
    "#### ‚ùå Mistake 2: Reporting Inner CV Score as Final Performance\n",
    "**Problem**: Optimistic bias from hyperparameter tuning\n",
    "**Solution**: Use nested CV and report outer CV score\n",
    "\n",
    "#### ‚ùå Mistake 3: Not Stratifying Imbalanced Classes\n",
    "**Problem**: High variance in metrics across folds\n",
    "**Solution**: Use Stratified K-Fold\n",
    "\n",
    "#### ‚ùå Mistake 4: Splitting Groups Across Folds\n",
    "**Problem**: Leakage from correlated samples (same wafer, patient)\n",
    "**Solution**: Use Group K-Fold\n",
    "\n",
    "#### ‚ùå Mistake 5: Not Checking for Process Drift\n",
    "**Problem**: Model performs well in CV but degrades in production\n",
    "**Solution**: Use Time Series Split and monitor trend\n",
    "\n",
    "### Computational Considerations\n",
    "\n",
    "#### Time Complexity\n",
    "- **K-Fold**: K √ó T (where T = model training time)\n",
    "- **Nested CV**: K_outer √ó K_inner √ó N_configs √ó T\n",
    "- **LOO**: n √ó T (where n = number of samples)\n",
    "\n",
    "#### Memory Requirements\n",
    "- K-Fold: Single model in memory\n",
    "- Nested CV: Single model (sequential)\n",
    "- Parallelization: Can run folds in parallel (increases memory)\n",
    "\n",
    "#### Speed Optimization Tips\n",
    "1. Use `n_jobs=-1` for parallel fold execution\n",
    "2. Reduce K for large datasets (K=3 or K=5)\n",
    "3. Use RandomizedSearchCV instead of GridSearchCV\n",
    "4. Cache computations when possible\n",
    "5. Use early stopping for iterative models\n",
    "\n",
    "### Integration with Production Workflow\n",
    "\n",
    "#### Model Card Documentation\n",
    "Include in model card:\n",
    "```yaml\n",
    "validation:\n",
    "  strategy: Stratified 5-Fold Cross-Validation\n",
    "  outer_folds: 5\n",
    "  inner_folds: 3  # if nested\n",
    "  metric: F1-Score\n",
    "  performance: 0.88 ¬± 0.03\n",
    "  confidence_interval: [0.82, 0.94] (95%)\n",
    "  hyperparameters: \n",
    "    tuning_method: GridSearchCV\n",
    "    best_params: {n_estimators: 100, max_depth: 10}\n",
    "  notes: Imbalanced dataset (15% positive class)\n",
    "```\n",
    "\n",
    "#### Monitoring in Production\n",
    "After deployment:\n",
    "1. Compare production metrics to CV estimates\n",
    "2. Alert if performance drops below CV - 2œÉ\n",
    "3. Re-run CV periodically on new data\n",
    "4. Retrain when CV performance degrades"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 4
}

## 👥 Group K-Fold: Handling Clustered Data

### The Problem: Group Leakage

Many real-world datasets have **group structure** where multiple samples belong to the same underlying entity:
- **Medical**: Multiple measurements from same patient
- **Semiconductor**: Multiple dies from same wafer
- **Finance**: Multiple transactions from same customer
- **Education**: Multiple test scores from same student

**Critical issue**: If samples from the same group appear in both training and test sets, the model learns group-specific patterns rather than generalizable patterns.

### How Group K-Fold Works

1. **Group identification**: Each sample has a group label (wafer_id, patient_id, etc.)
2. **Group-level splitting**: Groups (not samples) are divided into K folds
3. **No group leakage**: All samples from a group stay in the same fold
4. **Cross-validation**: Standard K-Fold procedure on group assignments

### Mathematical Formulation

Let $G = \{g_1, g_2, ..., g_m\}$ be the set of groups.

**Standard K-Fold** (WRONG):
- Randomly split samples → Group $g_i$ may appear in both train and test

**Group K-Fold** (CORRECT):
- Split groups into K folds: $G = F_1 \cup F_2 \cup ... \cup F_K$
- For fold $k$: Train on $\bigcup_{i \neq k} F_i$, test on $F_k$
- Guarantee: $F_i \cap F_j = \emptyset$ for $i \neq j$
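
The disjointness guarantee can be checked on a toy dataset. The following is a minimal sketch using scikit-learn's `GroupKFold`; the wafer count, dies per wafer, and the `wafer_ids` array are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 6 wafers with 4 dies each (24 samples).
wafer_ids = np.repeat(np.arange(6), 4)
X = np.arange(len(wafer_ids), dtype=float).reshape(-1, 1)
y = np.zeros(len(wafer_ids))

gkf = GroupKFold(n_splits=3)
test_groups_seen = []
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=wafer_ids), 1):
    train_groups = {int(g) for g in wafer_ids[train_idx]}
    test_groups = {int(g) for g in wafer_ids[test_idx]}
    # No wafer may contribute dies to both sides of the split.
    assert train_groups.isdisjoint(test_groups)
    test_groups_seen.extend(sorted(test_groups))
    print(f"Fold {fold}: test wafers = {sorted(test_groups)}")

# Every wafer is held out exactly once across the K folds.
assert sorted(test_groups_seen) == list(range(6))
```

The same two assertions make a cheap sanity check to run on real data before trusting any group-based CV score.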

### Semiconductor Example: Wafer-Level Yield Prediction

**Scenario**: Predict device yield from parametric tests

**Data structure**:
```
Wafer 001: 100 dies → Die A, Die B, Die C, ...
Wafer 002: 100 dies → Die A, Die B, Die C, ...
Wafer 003: 100 dies → Die A, Die B, Die C, ...
...
```

**Problem with Standard K-Fold**:
- Dies from Wafer 001 in training
- Dies from Wafer 001 in testing
- Model learns wafer-specific patterns (spatial correlations, fab conditions)
- **Production failure**: New wafer comes → model doesn't generalize

**Example**:
- Standard K-Fold: 95% accuracy (overly optimistic)
- Group K-Fold: 88% accuracy (realistic, generalizes to new wafers)

### When to Use Group K-Fold

| **Use Case** | **Group By** | **Why** |
|--------------|--------------|---------|
| ✅ **Medical data** | patient_id | Multiple visits/measurements per patient |
| ✅ **Semiconductor** | wafer_id, lot_id | Spatial/process correlations |
| ✅ **Finance** | customer_id | Customer-specific behavior patterns |
| ✅ **Image classification** | scene_id | Multiple images from same scene |
| ✅ **Time series** | entity_id | Multiple time points per entity |
| ❌ **i.i.d. samples** | (none) | Use standard K-Fold |
| ❌ **Single measurement per entity** | (none) | No group structure |

### Comparison: Standard vs Group K-Fold

| **Aspect** | **Standard K-Fold** | **Group K-Fold** |
|------------|---------------------|------------------|
| **Splitting** | Random samples | Random groups |
| **Leakage risk** | High (if groups exist) | None |
| **Performance estimate** | Optimistic | Realistic |
| **Production generalization** | Poor | Good |
| **Variance** | Lower | Higher (fewer "effective" samples) |
| **Fold sizes** | Balanced samples | May be imbalanced (group sizes vary) |

### Implementation Considerations

#### Unbalanced Folds
**Problem**: Groups have different sizes
```
Group A: 5 samples
Group B: 100 samples
Group C: 10 samples
```
- Fold 1 may have 110 samples (Groups B and C) while Fold 2 has only 5 (Group A)

**Solution**:
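- By default, `GroupKFold` places the largest groups first, each into the currently smallest fold, so fold sizes are as even as the group sizes allow; use `StratifiedGroupKFold` when class proportions must also be preserved across folds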
- Monitor fold size variance
- Consider using more folds (K=10 instead of K=5)
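
Monitoring fold sizes takes only a few lines. The sketch below assumes a hypothetical set of group sizes (the `group_sizes` dictionary is invented for illustration) and reports how uneven the resulting `GroupKFold` test folds are.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical, deliberately uneven group sizes (samples per group).
group_sizes = {"A": 5, "B": 100, "C": 10, "D": 40, "E": 60, "F": 25}
groups = np.concatenate([np.full(n, name) for name, n in group_sizes.items()])
X = np.zeros((len(groups), 1))  # features are irrelevant for the split itself

fold_sizes = []
for _, test_idx in GroupKFold(n_splits=3).split(X, groups=groups):
    fold_sizes.append(len(test_idx))
    members = sorted(str(g) for g in set(groups[test_idx]))
    print(f"test fold: {len(test_idx):3d} samples, groups = {members}")

# A ratio near 1.0 means well-balanced folds; large values deserve a closer look.
imbalance_ratio = max(fold_sizes) / min(fold_sizes)
print(f"largest / smallest test fold: {imbalance_ratio:.2f}")
```

If the ratio is large, per-fold scores carry very different weights of evidence, so a sample-weighted mean may be more informative than a plain average over folds.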

#### Small Number of Groups
**Problem**: If only 10 groups exist, K=5 means only 2 groups per fold
- High variance in estimates

**Solution**:
- Use Leave-One-Group-Out (LOGO)
- Collect more groups if possible
- Report uncertainty honestly
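
With few groups, Leave-One-Group-Out yields one fold per group. The following is a rough sketch using `LeaveOneGroupOut` with `cross_val_score`; the synthetic data, the `Ridge` model, and the choice of MAE as the metric are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical data: 10 groups of 30 samples, each group with its own offset.
n_groups, per_group = 10, 30
groups = np.repeat(np.arange(n_groups), per_group)
X = rng.normal(size=(n_groups * per_group, 3))
group_offset = rng.normal(0, 0.5, size=n_groups)[groups]
y = X @ np.array([2.0, 1.5, 1.0]) + group_offset + rng.normal(0, 0.1, size=len(groups))

# One fold per group: each group is held out exactly once.
scores = cross_val_score(
    Ridge(), X, y,
    groups=groups,
    cv=LeaveOneGroupOut(),
    scoring="neg_mean_absolute_error",
)
mae = -scores  # sklearn negates error metrics so that higher is better

# Report the spread alongside the mean; with 10 folds it is a coarse estimate.
print(f"folds: {len(mae)}, MAE = {mae.mean():.3f} ± {mae.std(ddof=1):.3f}")
```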

#### Hierarchical Groups
**Example**: Dies within wafers, wafers within lots
- Group by highest level (lot_id)
- Alternative: Nested CV (outer=lots, inner=wafers)
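
Grouping by the highest level automatically keeps the lower levels intact. The sketch below uses a hypothetical lot/wafer/die table (all column values are invented) to show that splitting on `lot_id` also prevents any wafer from straddling train and test.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical hierarchy: 4 lots x 3 wafers x 5 dies = 60 rows.
rows = [
    {"lot_id": lot, "wafer_id": f"L{lot}-W{w}", "die_id": d}
    for lot in range(4)
    for w in range(3)
    for d in range(5)
]
df = pd.DataFrame(rows)

n_checked = 0
for train_idx, test_idx in GroupKFold(n_splits=4).split(df, groups=df["lot_id"]):
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    # Lots are disjoint by construction...
    assert set(train["lot_id"]).isdisjoint(test["lot_id"])
    # ...and so are wafers, because every wafer belongs to exactly one lot.
    assert set(train["wafer_id"]).isdisjoint(test["wafer_id"])
    n_checked += 1

print(f"checked {n_checked} folds: no lot or wafer appears on both sides")
```

The reverse does not hold: grouping by `wafer_id` alone can still place wafers from the same lot on both sides of a split.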

### Semiconductor-Specific Patterns

#### Spatial Correlation (Wafer Map)
```
Wafer layout:
[Good] [Good] [Fail] [Fail]
[Good] [Good] [Fail] [Fail]
[Good] [Fail] [Fail] [Fail]
```
- Dies near each other have correlated outcomes
- Must group by wafer to avoid spatial leakage

#### Lot-Based Process Drift
```
Lot 001 (Week 1): High yield (98%)
Lot 002 (Week 2): Medium yield (95%)
Lot 003 (Week 3): Low yield (92%)
```
- Group by lot_id to test generalization across process conditions
- Mimics production: Model trained on past lots → predicts new lots
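
Note that `GroupKFold` keeps lots intact but ignores their order, so a fold may train on later lots than it tests on. If lots should also be evaluated chronologically, one option is to forward-chain over the ordered lot IDs and map the result back to sample indices. The sketch below is one way to do this, assuming lot IDs sort chronologically; the data and variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 6 lots in chronological order, 4 dies each.
lot_ids = np.repeat(np.arange(6), 4)
ordered_lots = np.unique(lot_ids)  # assumption: sorted lot IDs == time order

lot_splits = []
for train_pos, test_pos in TimeSeriesSplit(n_splits=3).split(ordered_lots):
    train_lots, test_lots = ordered_lots[train_pos], ordered_lots[test_pos]
    # Map lot-level splits back to row indices of the die-level data.
    train_idx = np.flatnonzero(np.isin(lot_ids, train_lots))
    test_idx = np.flatnonzero(np.isin(lot_ids, test_lots))
    lot_splits.append((train_idx, test_idx))
    print(f"train lots {train_lots.tolist()} -> test lots {test_lots.tolist()}")

# Every training lot precedes every test lot, and no lot is split.
for train_idx, test_idx in lot_splits:
    assert lot_ids[train_idx].max() < lot_ids[test_idx].min()
```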

#### Multi-Fab Analysis
```
Fab A: 1000 wafers
Fab B: 800 wafers
Fab C: 500 wafers
```
- Group by fab_id to test cross-fab generalization
- Critical for models deployed across multiple fabs

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error
import matplotlib.pyplot as plt
from typing import Dict, Tuple

np.random.seed(42)

class GroupCVEvaluator:
    """
    Cross-validation evaluator for grouped data.
    """
    
    def __init__(self, n_splits: int = 5, cv_type: str = 'group'):
        """
        Args:
            n_splits: Number of folds (ignored for LOGO)
            cv_type: 'group' (GroupKFold) or 'logo' (LeaveOneGroupOut)
        """
        self.n_splits = n_splits
        self.cv_type = cv_type
        
        if cv_type == 'group':
            self.cv = GroupKFold(n_splits=n_splits)
        elif cv_type == 'logo':
            self.cv = LeaveOneGroupOut()
        else:
            raise ValueError("cv_type must be 'group' or 'logo'")
    
    def evaluate(self, model, X, y, groups, scoring='r2') -> Dict:
        """
        Perform group cross-validation.
        
        Args:
            model: Sklearn model
            X: Features
            y: Target
            groups: Group labels for each sample
            scoring: 'r2' or 'mae'
            
        Returns:
            Dictionary with CV results
        """
        train_scores = []
        test_scores = []
        group_info = []
        
        print(f"\n{'='*80}")
        print(f"GROUP CROSS-VALIDATION: {self.cv_type.upper()}")
        print(f"{'='*80}")
        print(f"Total samples: {len(y)}")
        print(f"Total groups: {len(np.unique(groups))}")
        print(f"Samples per group (avg): {len(y) / len(np.unique(groups)):.1f}")
        
        for fold_idx, (train_idx, test_idx) in enumerate(self.cv.split(X, y, groups), 1):
            X_train, X_test = X[train_idx], X[test_idx]
            y_train, y_test = y[train_idx], y[test_idx]
            groups_train = groups[train_idx]
            groups_test = groups[test_idx]
            
            # Train model
            model.fit(X_train, y_train)
            
            # Evaluate (both metrics are "higher is better", so MAE is negated)
            if scoring == 'r2':
                train_score = model.score(X_train, y_train)
                test_score = model.score(X_test, y_test)
            elif scoring == 'mae':
                train_score = -mean_absolute_error(y_train, model.predict(X_train))
                test_score = -mean_absolute_error(y_test, model.predict(X_test))
            else:
                raise ValueError("scoring must be 'r2' or 'mae'")
            
            train_scores.append(train_score)
            test_scores.append(test_score)
            
            # Group info
            group_info.append({
                'fold': fold_idx,
                'train_samples': len(train_idx),
                'test_samples': len(test_idx),
                'train_groups': len(np.unique(groups_train)),
                'test_groups': len(np.unique(groups_test)),
                'train_groups_list': list(np.unique(groups_train)),
                'test_groups_list': list(np.unique(groups_test))
            })
            
            print(f"\nFold {fold_idx}:")
            print(f"  Train: {len(train_idx)} samples, {len(np.unique(groups_train))} groups")
            print(f"  Test:  {len(test_idx)} samples, {len(np.unique(groups_test))} groups")
            print(f"  Train score: {train_score:.6f}")
            print(f"  Test score:  {test_score:.6f}")
        
        results = {
            'train_scores': np.array(train_scores),
            'test_scores': np.array(test_scores),
            'group_info': group_info,
            'mean_train': np.mean(train_scores),
            'std_train': np.std(train_scores),
            'mean_test': np.mean(test_scores),
            'std_test': np.std(test_scores),
            'cv_type': self.cv_type,
            'n_splits': len(train_scores)
        }
        
        print(f"\n{'='*80}")
        print(f"SUMMARY")
        print(f"{'='*80}")
        print(f"Train: {results['mean_train']:.6f} ± {results['std_train']:.6f}")
        print(f"Test:  {results['mean_test']:.6f} ± {results['std_test']:.6f}")
        print(f"Generalization gap: {results['mean_train'] - results['mean_test']:.6f}")
        print(f"{'='*80}")
        
        return results
    
    def plot_results(self, results: Dict, comparison_results: Dict = None):
        """
        Visualize group CV results.
        
        Args:
            results: Group CV results
            comparison_results: Optional standard K-Fold results for comparison
        """
        if comparison_results:
            fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        else:
            fig, axes = plt.subplots(1, 2, figsize=(14, 5))
            axes = axes.reshape(1, 2)
        
        # Plot 1: Fold performance (Group CV)
        folds = np.arange(1, results['n_splits'] + 1)
        axes[0, 0].plot(folds, results['train_scores'], 'o-', 
                       color='blue', linewidth=2, markersize=8, label='Train')
        axes[0, 0].plot(folds, results['test_scores'], 'o-', 
                       color='green', linewidth=2, markersize=8, label='Test')
        axes[0, 0].axhline(results['mean_test'], color='green', 
                          linestyle='--', alpha=0.7, label=f'Mean Test ({results["mean_test"]:.4f})')
        axes[0, 0].fill_between(folds, 
                                results['mean_test'] - results['std_test'],
                                results['mean_test'] + results['std_test'],
                                alpha=0.2, color='green')
        axes[0, 0].set_xlabel('Fold Number', fontsize=11, fontweight='bold')
        axes[0, 0].set_ylabel('Score (R²)', fontsize=11, fontweight='bold')
        axes[0, 0].set_title(f'Group CV Performance\n({results["cv_type"].upper()})',
                            fontsize=12, fontweight='bold')
        axes[0, 0].legend()
        axes[0, 0].grid(alpha=0.3)
        axes[0, 0].set_xticks(folds)
        
        # Plot 2: Fold size distribution (Group CV)
        fold_nums = [info['fold'] for info in results['group_info']]
        train_samples = [info['train_samples'] for info in results['group_info']]
        test_samples = [info['test_samples'] for info in results['group_info']]
        
        x = np.arange(len(fold_nums))
        width = 0.35
        axes[0, 1].bar(x - width/2, train_samples, width, label='Train', 
                      color='blue', alpha=0.7, edgecolor='black')
        axes[0, 1].bar(x + width/2, test_samples, width, label='Test', 
                      color='green', alpha=0.7, edgecolor='black')
        axes[0, 1].set_xlabel('Fold Number', fontsize=11, fontweight='bold')
        axes[0, 1].set_ylabel('Number of Samples', fontsize=11, fontweight='bold')
        axes[0, 1].set_title('Fold Size Distribution\n(May be imbalanced with groups)',
                            fontsize=12, fontweight='bold')
        axes[0, 1].legend()
        axes[0, 1].set_xticks(x)
        axes[0, 1].set_xticklabels(fold_nums)
        axes[0, 1].grid(alpha=0.3, axis='y')
        
        if comparison_results:
            # Plot 3: Comparison - Performance
            methods = ['Standard K-Fold\n(Biased)', 'Group CV\n(Unbiased)']
            scores = [comparison_results['mean_test'], results['mean_test']]
            colors = ['red', 'green']

            bars = axes[1, 0].bar(methods, scores, color=colors, alpha=0.7, edgecolor='black')
            axes[1, 0].set_ylabel('Test Score (R²)', fontsize=11, fontweight='bold')
            axes[1, 0].set_title('Performance Comparison\n(Group CV shows realistic performance)',
                                fontsize=12, fontweight='bold')
            axes[1, 0].grid(alpha=0.3, axis='y')
            
            for bar, score in zip(bars, scores):
                height = bar.get_height()
                axes[1, 0].text(bar.get_x() + bar.get_width()/2., height + 0.01,
                               f'{score:.4f}', ha='center', va='bottom', fontweight='bold')
            
            # Add optimism bias annotation
            optimism = comparison_results['mean_test'] - results['mean_test']
            axes[1, 0].annotate(f'Optimism bias: {optimism:.4f}', 
                               xy=(0.5, max(scores) * 0.95), 
                               ha='center', fontsize=10, 
                               bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.5))
            
            # Plot 4: Distribution comparison
            data_to_plot = [comparison_results['test_scores'], results['test_scores']]
            bp = axes[1, 1].boxplot(data_to_plot, labels=methods, patch_artist=True)
            for patch, color in zip(bp['boxes'], colors):
                patch.set_facecolor(color)
                patch.set_alpha(0.7)
            axes[1, 1].set_ylabel('Test Score (R²)', fontsize=11, fontweight='bold')
            axes[1, 1].set_title('Score Distribution\n(Group CV has higher variance)',
                                fontsize=12, fontweight='bold')
            axes[1, 1].grid(alpha=0.3, axis='y')
        
        plt.tight_layout()
        plt.show()


def generate_wafer_data(n_wafers: int = 20, dies_per_wafer: int = 100) -> Tuple:
    """
    Generate synthetic semiconductor wafer data with spatial correlation.
    
    Args:
        n_wafers: Number of wafers
        dies_per_wafer: Number of dies per wafer
        
    Returns:
        X, y, groups (wafer IDs)
    """
    n_samples = n_wafers * dies_per_wafer
    
    X = []
    y = []
    groups = []
    
    for wafer_id in range(n_wafers):
        # Wafer-level effects (process variation)
        wafer_offset = np.random.normal(0, 0.15)
        
        for die_id in range(dies_per_wafer):
            # Die-level features (parametric tests)
            features = np.random.normal(0, 1, 5)
            
            # Target with wafer-level correlation
            target = (
                2.0 * features[0] +
                1.5 * features[1] +
                1.0 * features[2] +
                wafer_offset +  # Wafer-level effect!
                np.random.normal(0, 0.1)
            )
            
            X.append(features)
            y.append(target)
            groups.append(wafer_id)
    
    return np.array(X), np.array(y), np.array(groups)


# Example usage
if __name__ == "__main__":
    print("\nEXAMPLE: Group CV for Wafer-Level Yield Prediction\n")
    
    # Generate wafer data
    X, y, groups = generate_wafer_data(n_wafers=20, dies_per_wafer=100)
    
    print(f"Dataset: {len(y)} dies from {len(np.unique(groups))} wafers")
    print(f"Dies per wafer: {len(y) // len(np.unique(groups))}")
    print(f"Features: 5 parametric test measurements")
    print(f"Target: Yield-related metric")
    print(f"Key: Data has wafer-level spatial correlation")
    
    # Standard K-Fold (WRONG - suffers from group leakage)
    print("\n" + "="*80)
    print("[1] STANDARD K-FOLD (WRONG - Group Leakage!)")
    print("="*80)
    
    standard_cv = KFold(n_splits=5, shuffle=True, random_state=42)
    standard_scores = []
    
    for fold_idx, (train_idx, test_idx) in enumerate(standard_cv.split(X), 1):
        model = RandomForestRegressor(n_estimators=50, random_state=42)
        model.fit(X[train_idx], y[train_idx])
        score = model.score(X[test_idx], y[test_idx])
        standard_scores.append(score)
        
        # Check for group leakage
        train_groups = set(groups[train_idx])
        test_groups = set(groups[test_idx])
        overlap = train_groups & test_groups
        
        print(f"Fold {fold_idx}: R² = {score:.6f}, Group overlap = {len(overlap)} wafers ❌")
    
    standard_results = {
        'test_scores': np.array(standard_scores),
        'mean_test': np.mean(standard_scores),
        'std_test': np.std(standard_scores)
    }
    
    print(f"\nStandard K-Fold: {standard_results['mean_test']:.6f} ± {standard_results['std_test']:.6f}")
    print("WARNING: This is OPTIMISTIC due to group leakage!")
    
    # Group K-Fold (CORRECT - no group leakage)
    print("\n" + "="*80)
    print("[2] GROUP K-FOLD (CORRECT - No Group Leakage)")
    print("="*80)
    
    model = RandomForestRegressor(n_estimators=50, random_state=42)
    evaluator = GroupCVEvaluator(n_splits=5, cv_type='group')
    group_results = evaluator.evaluate(model, X, y, groups, scoring='r2')
    
    # Verify no group overlap
    print("\nVerifying no group leakage:")
    for info in group_results['group_info']:
        train_groups_set = set(info['train_groups_list'])
        test_groups_set = set(info['test_groups_list'])
        overlap = train_groups_set & test_groups_set
        print(f"  Fold {info['fold']}: Overlap = {len(overlap)} ✅")
    
    # Comparison
    print("\n" + "="*80)
    print("FINAL COMPARISON")
    print("="*80)
    print(f"Standard K-Fold: {standard_results['mean_test']:.6f} ± {standard_results['std_test']:.6f} ❌ (Optimistic)")
    print(f"Group K-Fold:    {group_results['mean_test']:.6f} ± {group_results['std_test']:.6f} ✅ (Realistic)")
    print(f"Optimism bias:   {standard_results['mean_test'] - group_results['mean_test']:.6f}")
    print(f"Production generalization: Group CV gives realistic estimate for NEW WAFERS")
    print("="*80)
    
    # Visualization
    evaluator.plot_results(group_results, standard_results)

## 🔬 Complete Example 1: Semiconductor Test Time Optimization

### Problem Statement
A semiconductor test engineer needs to predict final test time for new devices based on early parametric measurements. The goal is to optimize test scheduling and resource allocation.

### Dataset Characteristics
- **Temporal data**: 52 weeks of production data
- **Features**: Device complexity score, number of test points, operating frequency, power consumption, temperature
- **Target**: Test time in milliseconds
- **Challenge**: Process drift over time (equipment aging, test program updates)

### Why Standard K-Fold Fails
Standard K-Fold with shuffling mixes weeks across folds, so a model can end up training on Weeks 40-52 to predict Weeks 1-10. This is **impossible in production**, where only past data is available to predict the future.

### Solution: Time Series Cross-Validation
Use forward chaining with expanding window to realistically simulate production deployment.
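
To make the expanding window concrete, the sketch below prints the splits that scikit-learn's `TimeSeriesSplit` produces on a tiny, hypothetical series of 12 weekly samples (one row per week, already sorted by time).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical series: one sample per week, in chronological order.
weeks = np.arange(12)
X = weeks.reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=3).split(X))
for fold, (train_idx, test_idx) in enumerate(splits, 1):
    # Forward chaining: everything in train precedes everything in test.
    assert weeks[train_idx].max() < weeks[test_idx].min()
    print(
        f"Fold {fold}: train weeks {weeks[train_idx].min()}-{weeks[train_idx].max()}, "
        f"test weeks {weeks[test_idx].min()}-{weeks[test_idx].max()}"
    )

# Fold 1: train weeks 0-2, test weeks 3-5
# Fold 2: train weeks 0-5, test weeks 6-8
# Fold 3: train weeks 0-8, test weeks 9-11
```

The training window grows with each fold while the test window always lies strictly after it, which is why later folds usually see more data than earlier ones.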

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

# Generate realistic test time data with temporal patterns
def generate_test_time_data(n_weeks=52, devices_per_week=50):
    """Generate semiconductor test time data with process drift."""
    data = []
    
    for week in range(n_weeks):
        # Process drift: Equipment aging (test time increases over time)
        drift = 0.005 * week
        
        # Seasonal pattern: Quarterly maintenance cycles
        seasonal = 0.1 * np.sin(2 * np.pi * week / 13)
        
        for device in range(devices_per_week):
            # Device features
            complexity = np.random.uniform(50, 150)  # Complexity score
            n_test_points = np.random.randint(100, 500)  # Number of tests
            frequency = np.random.uniform(1.0, 3.5)  # GHz
            power = np.random.uniform(5, 25)  # Watts
            temperature = np.random.normal(25, 2)  # Celsius
            
            # Test time model
            base_time = (
                0.5 * complexity +
                0.3 * n_test_points +
                50 * frequency +
                10 * power +
                2 * temperature
            )
            
            # Add drift and seasonal effects
            test_time = base_time * (1 + drift + seasonal) + np.random.normal(0, 50)
            
            data.append({
                'week': week,
                'complexity': complexity,
                'n_test_points': n_test_points,
                'frequency': frequency,
                'power': power,
                'temperature': temperature,
                'test_time_ms': test_time
            })
    
    return pd.DataFrame(data)

# Generate data
print("="*80)
print("COMPLETE EXAMPLE: SEMICONDUCTOR TEST TIME PREDICTION")
print("="*80)
print("\n[1] Generating Data...")

df = generate_test_time_data(n_weeks=52, devices_per_week=50)

print(f"✅ Generated {len(df)} device measurements")
print(f"   Timespan: {df['week'].min()} to {df['week'].max()} weeks")
print(f"   Features: complexity, n_test_points, frequency, power, temperature")
print(f"   Target: test_time_ms (mean={df['test_time_ms'].mean():.1f} ms)")

# Prepare data
X = df[['complexity', 'n_test_points', 'frequency', 'power', 'temperature']].values
y = df['test_time_ms'].values

print("\n[2] Comparing Cross-Validation Strategies...")
print("-"*80)

# Strategy 1: Standard K-Fold (WRONG for time series)
print("\n📉 STRATEGY 1: Standard K-Fold (WRONG - Data Leakage!)")
print("-"*40)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
kfold_scores = []

for fold_idx, (train_idx, test_idx) in enumerate(kfold.split(X), 1):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X[train_idx])
    X_test_scaled = scaler.transform(X[test_idx])
    
    model = GradientBoostingRegressor(n_estimators=100, random_state=42)
    model.fit(X_train_scaled, y[train_idx])
    
    score = model.score(X_test_scaled, y[test_idx])
    mae = mean_absolute_error(y[test_idx], model.predict(X_test_scaled))
    
    kfold_scores.append(score)
    
    # Check temporal leakage
    train_weeks = df.iloc[train_idx]['week']
    test_weeks = df.iloc[test_idx]['week']
    print(f"  Fold {fold_idx}: R²={score:.4f}, MAE={mae:.1f} ms")
    print(f"    Train weeks: {train_weeks.min()}-{train_weeks.max()}")
    print(f"    Test weeks:  {test_weeks.min()}-{test_weeks.max()}")
    print(f"    ❌ Training on future data! (Week {train_weeks.max()} > Week {test_weeks.min()})")

kfold_mean = np.mean(kfold_scores)
kfold_std = np.std(kfold_scores)
print(f"\n  K-Fold Result: R² = {kfold_mean:.4f} ± {kfold_std:.4f}")
print(f"  ⚠️  OPTIMISTIC due to data leakage!")

# Strategy 2: Time Series Split (CORRECT)
print("\n📈 STRATEGY 2: Time Series Split (CORRECT - No Leakage)")
print("-"*40)

tscv = TimeSeriesSplit(n_splits=5)
ts_scores = []
ts_maes = []

for fold_idx, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X[train_idx])
    X_test_scaled = scaler.transform(X[test_idx])
    
    model = GradientBoostingRegressor(n_estimators=100, random_state=42)
    model.fit(X_train_scaled, y[train_idx])
    
    score = model.score(X_test_scaled, y[test_idx])
    mae = mean_absolute_error(y[test_idx], model.predict(X_test_scaled))
    
    ts_scores.append(score)
    ts_maes.append(mae)
    
    train_weeks = df.iloc[train_idx]['week']
    test_weeks = df.iloc[test_idx]['week']
    print(f"  Fold {fold_idx}: R²={score:.4f}, MAE={mae:.1f} ms")
    print(f"    Train weeks: {train_weeks.min()}-{train_weeks.max()}")
    print(f"    Test weeks:  {test_weeks.min()}-{test_weeks.max()}")
    # Fold boundaries can fall mid-week, so the last training week may equal the first test week
    print(f"    ✅ Train ≤ Test (Week {train_weeks.max()} ≤ Week {test_weeks.min()})")

ts_mean = np.mean(ts_scores)
ts_std = np.std(ts_scores)
print(f"\n  Time Series CV Result: R² = {ts_mean:.4f} ± {ts_std:.4f}")
print(f"  ✅ REALISTIC estimate for production!")

# Detect performance trend
slope, intercept, r_value, p_value, std_err = stats.linregress(range(1, 6), ts_scores)
print(f"\n  Performance trend: slope={slope:.6f}, p-value={p_value:.4f}")
if p_value < 0.05:
    if slope < 0:
        print(f"  ⚠️  Significant DOWNWARD trend detected!")
        print(f"     → Model staleness: Performance degrades {ts_scores[0]:.4f} → {ts_scores[-1]:.4f}")
        print(f"     → Recommendation: Retrain model periodically or use online learning")
    else:
        print(f"  ✅ Upward trend: More data improves performance")
else:
    print(f"  ✅ Stable performance across time")

# Final comparison
print("\n" + "="*80)
print("FINAL COMPARISON")
print("="*80)
print(f"{'Method':<25} {'R² Score':<20} {'Assessment':<30}")
print("-"*80)
print(f"{'Standard K-Fold':<25} {kfold_mean:.4f} ± {kfold_std:.4f}    {'OPTIMISTIC ❌':<30}")
print(f"{'Time Series Split':<25} {ts_mean:.4f} ± {ts_std:.4f}    {'REALISTIC ✅':<30}")
print("-"*80)
print(f"Optimism bias: {kfold_mean - ts_mean:.4f} ({(kfold_mean - ts_mean)/ts_mean*100:.1f}%)")
print(f"\nProduction Recommendation:")
print(f"  - Expected R²: {ts_mean:.4f} (use Time Series CV estimate)")
print(f"  - Expected MAE: {np.mean(ts_maes):.1f} ± {np.std(ts_maes):.1f} ms")
print(f"  - Monitor for model staleness (performance degrading over time)")
print(f"  - Retrain model every {52 // 5} weeks based on CV splits")
print("="*80)

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Score comparison
methods = ['Standard K-Fold\n(Biased)', 'Time Series Split\n(Unbiased)']
scores_mean = [kfold_mean, ts_mean]
scores_std = [kfold_std, ts_std]
colors = ['red', 'green']

bars = axes[0, 0].bar(methods, scores_mean, color=colors, alpha=0.7, edgecolor='black')
axes[0, 0].errorbar(range(len(methods)), scores_mean, yerr=scores_std, 
                    fmt='none', color='black', capsize=5)
axes[0, 0].set_ylabel('R² Score', fontsize=11, fontweight='bold')
axes[0, 0].set_title('Cross-Validation Strategy Comparison\n(Time Series Split shows realistic performance)',
                    fontsize=12, fontweight='bold')
axes[0, 0].grid(alpha=0.3, axis='y')

for bar, score in zip(bars, scores_mean):
    height = bar.get_height()
    axes[0, 0].text(bar.get_x() + bar.get_width()/2., height + 0.02,
                   f'{score:.4f}', ha='center', va='bottom', fontweight='bold')

# Plot 2: Performance trend over time
folds = np.arange(1, 6)
axes[0, 1].plot(folds, ts_scores, 'o-', color='green', linewidth=2, markersize=8, label='Actual')
axes[0, 1].plot(folds, slope * folds + intercept, '--', color='red', 
               linewidth=2, label=f'Trend (slope={slope:.4f})')
axes[0, 1].set_xlabel('Fold Number (Time →)', fontsize=11, fontweight='bold')
axes[0, 1].set_ylabel('R² Score', fontsize=11, fontweight='bold')
axes[0, 1].set_title('Time Series CV: Performance Over Time\n(Check for model staleness)',
                    fontsize=12, fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)
axes[0, 1].set_xticks(folds)

# Plot 3: MAE over time
axes[1, 0].plot(folds, ts_maes, 'o-', color='orange', linewidth=2, markersize=8)
axes[1, 0].set_xlabel('Fold Number (Time →)', fontsize=11, fontweight='bold')
axes[1, 0].set_ylabel('MAE (milliseconds)', fontsize=11, fontweight='bold')
axes[1, 0].set_title('Prediction Error Over Time\n(Monitor for drift)', 
                    fontsize=12, fontweight='bold')
axes[1, 0].grid(alpha=0.3)
axes[1, 0].set_xticks(folds)

# Plot 4: Distribution comparison
data_to_plot = [kfold_scores, ts_scores]
bp = axes[1, 1].boxplot(data_to_plot, labels=methods, patch_artist=True)
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)
axes[1, 1].set_ylabel('R² Score', fontsize=11, fontweight='bold')
axes[1, 1].set_title('Score Distribution\n(Higher variance in Time Series CV is normal)', 
                    fontsize=12, fontweight='bold')
axes[1, 1].grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n✅ Complete example finished!")

## 🎯 8 Real-World Project Ideas

### Post-Silicon Validation Projects

#### 1. **Wafer Yield Prediction with Spatial Cross-Validation**
**Objective**: Predict wafer-level yield using parametric test data while accounting for spatial correlations.

**Why This Matters**: Standard CV causes spatial leakage. Group K-Fold by wafer_id ensures model generalizes to new wafers, not just new dies on same wafer.

**Key Features**:
- Parametric measurements: Vdd, Idd, frequency, power
- Spatial features: die_x, die_y coordinates
- Process features: lot_id, fab_id, equipment_id
- Target: Yield percentage or pass/fail

**Implementation Hints**:
- Use `GroupKFold` with groups=wafer_id
- Alternative: Stratified Group K-Fold to maintain yield distribution
- Visualize wafer maps to understand spatial patterns
- Compare Group CV vs standard K-Fold to quantify spatial leakage
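
As a minimal sketch of the grouping hint above, the snippet below counts how many wafers appear on both sides of each split under shuffled `KFold` versus `GroupKFold`. The wafer layout (20 wafers, 50 dies each) and the placeholder feature matrix are assumptions made purely for illustration, not data from this notebook.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Hypothetical layout: 20 wafers with 50 dies each
n_wafers, dies_per_wafer = 20, 50
wafer_id = np.repeat(np.arange(n_wafers), dies_per_wafer)
X = np.zeros((wafer_id.size, 4))  # placeholder for Vdd, Idd, frequency, power


def shared_wafers_per_fold(cv, groups=None):
    """Count wafers that appear in both train and validation for each fold."""
    counts = []
    for train_idx, val_idx in cv.split(X, groups=groups):
        shared = np.intersect1d(wafer_id[train_idx], wafer_id[val_idx])
        counts.append(len(shared))
    return counts


kfold_shared = shared_wafers_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
group_shared = shared_wafers_per_fold(GroupKFold(n_splits=5), groups=wafer_id)

print("Shared wafers per fold, shuffled K-Fold:", kfold_shared)
print("Shared wafers per fold, Group K-Fold:  ", group_shared)
```

A non-zero count for a fold means dies from the same wafer sit in both train and validation, which is the spatial leakage this project sets out to quantify.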

**Success Metrics**:
- Group CV R² > 0.85 (realistic for new wafers)
- MAE < 2% yield (actionable for manufacturing decisions)
- No performance degradation when deployed to new wafers
- Spatial leakage quantified: Standard CV - Group CV

**Business Value**: $500K+ annual savings by predicting low-yield wafers early and adjusting process parameters.

---

#### 2. **Test Time Optimization with Temporal Cross-Validation**
**Objective**: Predict final test time to optimize test scheduling, accounting for process drift over time.

**Why This Matters**: Test programs evolve, equipment ages → model trained on old data may not work on new lots. Time Series CV reveals realistic performance.

**Key Features**:
- Device complexity score
- Number of test points
- Operating frequency, power, temperature
- Historical test time trends
- Equipment age (implicit via timestamp)

**Implementation Hints**:
- Use `TimeSeriesSplit` with n_splits=5-10
- Monitor performance trend across folds (check for staleness)
- Consider rolling window if equipment replaced periodically
- Include gap between train/test if test program updated in batches
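
The gap hint can be sketched with the `gap` parameter of `TimeSeriesSplit`. The sample count (120 time-ordered lots) and `gap=5` are illustrative assumptions; in practice the gap would be chosen to cover one test-program update batch.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples = 120                              # hypothetical: 120 time-ordered lots
X = np.arange(n_samples).reshape(-1, 1)

# gap=5 drops the 5 samples just before each test window from training
tscv = TimeSeriesSplit(n_splits=5, gap=5)
splits = list(tscv.split(X))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```

Each training window ends strictly before its test window starts, so the check from Pitfall 1 (`max(train_dates) > min(test_dates)`) can never fire for these splits.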

**Success Metrics**:
- Time Series CV MAE < 50ms (test scheduling precision)
- Performance trend stable or upward (model doesn't degrade)
- Detect when retraining needed (significant downward trend)
- 30% reduction in test resource idle time

**Business Value**: $200K+ annual savings via optimized test scheduling and reduced equipment downtime.

---

#### 3. **Multi-Fab Yield Model with Nested Cross-Validation**
**Objective**: Build yield prediction model that works across multiple fabs, with rigorous hyperparameter tuning.

**Why This Matters**: Naive hyperparameter tuning gives optimistic estimates. Nested CV provides unbiased performance for cross-fab deployment.

**Key Features**:
- Parametric test measurements (standardized across fabs)
- Fab-specific features: process node, equipment type
- Environmental: temperature, humidity
- Target: Yield or defect density

**Implementation Hints**:
- Outer loop: Group K-Fold by fab_id (generalization to new fabs)
- Inner loop: Stratified K-Fold for hyperparameter tuning
- Tune: Model type, feature engineering, threshold selection
- Report both inner (optimistic) and outer (realistic) scores

**Success Metrics**:
- Nested CV R² > 0.80 (realistic cross-fab performance)
- Inner-outer gap < 0.05 (low optimism bias)
- Best hyperparameters consistent across outer folds
- Model performs within 5% when deployed to new fab

**Business Value**: Enable standardized yield models across 3-5 fabs, saving $1M+ in duplicated development effort.

---

#### 4. **Parametric Outlier Detection with Stratified Cross-Validation**
**Objective**: Detect rare parametric outliers (0.5-2% rate) that indicate process excursions.

**Why This Matters**: Imbalanced data (98% normal, 2% outliers) → standard K-Fold creates variable class distributions across folds. Stratified K-Fold maintains consistent 2% rate.

**Key Features**:
- All parametric measurements from test data
- Statistical features: Z-scores, Mahalanobis distance
- Temporal features: Time since last outlier
- Spatial features: Neighboring die measurements

**Implementation Hints**:
- Use `StratifiedKFold` to maintain outlier rate in each fold
- Consider SMOTE or class weighting for extreme imbalance
- Use F1-score or AUPRC (not accuracy) due to imbalance
- Compare Stratified vs Regular K-Fold variance
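
To illustrate the stratification hint, the following sketch counts outliers per validation fold under plain and stratified splitting. The dataset (1000 samples with exactly 20 outliers, i.e. 2%) is synthetic and assumed only for this example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(0)
n = 1000
y = np.zeros(n, dtype=int)
y[rng.choice(n, size=20, replace=False)] = 1   # synthetic 2% outlier rate
X = np.zeros((n, 1))                           # features are irrelevant for the split itself


def outliers_per_fold(cv):
    """Number of outliers landing in each validation fold."""
    return [int(y[val_idx].sum()) for _, val_idx in cv.split(X, y)]


plain = outliers_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat = outliers_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("K-Fold outliers per fold:           ", plain)
print("Stratified K-Fold outliers per fold:", strat)
```

With 20 outliers and 5 folds, stratification places 4 in every validation fold, while the plain split is free to vary from fold to fold.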

**Success Metrics**:
- Stratified CV F1 > 0.85 with std < 0.03 (low variance)
- AUPRC > 0.90 (good precision-recall tradeoff)
- 50% reduction in variance vs regular K-Fold
- Catch 95% of outliers with <1% false positive rate

**Business Value**: $300K+ savings by detecting process excursions early, preventing scrap of entire lots.

---

### General AI/ML Projects

#### 5. **Customer Churn Prediction with Stratified Nested CV**
**Objective**: Predict customer churn (10-15% rate) with rigorous model selection and unbiased performance estimate.

**Why This Matters**: Imbalanced classes + hyperparameter tuning → double optimism bias. Use Stratified K-Fold + Nested CV.

**Key Features**:
- Customer demographics: age, location, tenure
- Usage patterns: login frequency, feature usage
- Support interactions: ticket count, resolution time
- Billing: payment history, plan changes

**Implementation Hints**:
- Outer loop: Stratified K-Fold (maintain churn rate)
- Inner loop: Stratified K-Fold (hyperparameter tuning)
- Tune: Model type, class weighting, threshold
- Report confidence intervals for churn rate impact

**Success Metrics**:
- Nested CV AUPRC > 0.75 (realistic performance)
- Inner-outer gap < 0.05 (low optimism)
- Churn rate consistent across folds (within 1%)
- 25% reduction in customer acquisition cost via retention

**Business Value**: $500K+ annual revenue retention by proactively targeting at-risk customers.

---

#### 6. **Stock Price Prediction with Rolling Window CV**
**Objective**: Predict next-day stock price movement using time series with non-stationary patterns.

**Why This Matters**: Financial markets have regime changes. Rolling window CV (fixed training size) better mimics production than expanding window.

**Key Features**:
- Technical indicators: Moving averages, RSI, MACD
- Fundamental: P/E ratio, earnings, volume
- Sentiment: News sentiment scores
- Market features: Index movements, sector performance

**Implementation Hints**:
- Use `TimeSeriesSplit` with `max_train_size` set (rolling window)
- Training window: 252 days (1 trading year)
- Test window: 21 days (1 month)
- Monitor performance trend to detect regime changes
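
A rolling window with the sizes listed above could look like the sketch below. The series length (756 days, roughly three trading years) and `n_splits=10` are assumptions for illustration; `max_train_size` and `test_size` are existing `TimeSeriesSplit` parameters.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_days = 756                                   # hypothetical: about three trading years
X = np.arange(n_days).reshape(-1, 1)

# Fixed-size training window (252 days) followed by a 21-day test window
rolling = TimeSeriesSplit(n_splits=10, max_train_size=252, test_size=21)
rolling_splits = list(rolling.split(X))

for fold, (train_idx, test_idx) in enumerate(rolling_splits, start=1):
    print(f"fold {fold}: train {train_idx[0]}..{train_idx[-1]} "
          f"({len(train_idx)} days), test {test_idx[0]}..{test_idx[-1]}")
```

Unlike the default expanding window, the training set here never grows, so older regimes fall out of the window as the split moves forward.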

**Success Metrics**:
- Time Series CV accuracy > 55% (statistically significant)
- Performance stable across folds (regime-invariant)
- Sharpe ratio > 1.5 in backtest
- Detect regime changes when performance drops >10%

**Business Value**: 15-20% annual returns above market benchmark via systematic trading strategy.

---

#### 7. **Medical Diagnosis with Patient-Level Group CV**
**Objective**: Predict disease diagnosis from medical images, ensuring model generalizes to new patients (not just new images from same patients).

**Why This Matters**: Multiple images per patient → standard CV causes patient leakage. Group K-Fold by patient_id ensures generalization.

**Key Features**:
- Image features: CNN embeddings, texture, shape
- Patient metadata: Age, sex, medical history
- Temporal: Disease progression stage
- Clinical: Lab results, vitals

**Implementation Hints**:
- Use `GroupKFold` with groups=patient_id
- Ensure train/test have no overlapping patients
- Consider Leave-One-Group-Out if few patients
- Stratify by diagnosis if possible (StratifiedGroupKFold)

**Success Metrics**:
- Group CV AUROC > 0.90 (clinically useful)
- No patient leakage (verify group separation)
- Performance within 5% when deployed to new hospital
- Sensitivity > 0.95 (catch most positive cases)

**Business Value**: Enable early diagnosis, reducing treatment costs by $10K+ per patient and improving outcomes.

---

#### 8. **Sales Forecasting with Hierarchical Time Series CV**
**Objective**: Forecast sales across multiple product categories and regions, accounting for temporal and group structure.

**Why This Matters**: Sales data has both temporal ordering and hierarchy (products within categories, stores within regions). Need hybrid CV strategy.

**Key Features**:
- Historical sales: Past 2 years daily
- Seasonality: Day of week, month, holidays
- Promotions: Discount percentage, campaign type
- External: Weather, economic indicators
- Hierarchy: Product → Category, Store → Region

**Implementation Hints**:
- Outer loop: Time Series Split (temporal)
- Consider grouping by region for cross-region validation
- Use separate models per category or hierarchical model
- Aggregate forecasts to ensure consistency (bottom-up or top-down)

**Success Metrics**:
- Time Series CV MAPE < 15% (industry standard)
- Performance stable across seasons
- Forecast accuracy within 10% at category level
- Enable 20% inventory reduction via better planning

**Business Value**: $2M+ savings annually through optimized inventory management and reduced stockouts/overstock.

## 🎓 Key Takeaways and Best Practices

### Core Principles

#### 1. **Match CV Strategy to Data Structure**
- ✅ **Temporal data** → Time Series Split (forward chaining)
- ✅ **Imbalanced classes** → Stratified K-Fold
- ✅ **Group structure** → Group K-Fold
- ✅ **Hyperparameter tuning** → Nested CV
- ✅ **i.i.d. data** → Standard K-Fold

**Golden Rule**: Your CV strategy should mimic how the model will be used in production.

#### 2. **Understand the Bias-Variance Tradeoff in CV**
- **More folds (K=10)**: Lower bias, higher variance, more computation
- **Fewer folds (K=3)**: Higher bias, lower variance, less computation
- **LOO (K=n)**: Lowest bias, highest variance, expensive
- **Typical choice**: K=5 (good balance)

**Recommendation**: Start with K=5, increase to K=10 for final evaluation.

#### 3. **Report Uncertainty Honestly**
- Always report **mean ± std** (not just mean)
- Include **confidence intervals** (95% CI using t-distribution)
- Show **per-fold results** (check for outliers)
- Document **CV strategy** in model card

**Bad reporting**: "Model achieves 92% accuracy"  
**Good reporting**: "Model achieves 88.2% ± 2.8% accuracy (95% CI: [84.7%, 91.7%], t-distribution with 4 degrees of freedom) using 5-Fold Stratified CV on an imbalanced dataset (15% positive class)"
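
A minimal sketch of how such a report could be computed from per-fold scores is shown below. The five accuracies are hypothetical values chosen for illustration. Because folds share training data, the interval should be read as approximate rather than exact.

```python
import numpy as np
from scipy import stats

fold_scores = np.array([0.85, 0.91, 0.88, 0.86, 0.92])   # hypothetical per-fold accuracies

k = fold_scores.size
mean = fold_scores.mean()
std = fold_scores.std(ddof=1)               # sample std, matching the 1/(K-1) formula
sem = std / np.sqrt(k)                      # standard error of the mean
t_crit = stats.t.ppf(0.975, df=k - 1)       # two-sided 95% critical value
ci_low, ci_high = mean - t_crit * sem, mean + t_crit * sem

print(f"Accuracy = {mean:.3f} ± {std:.3f} (95% CI: [{ci_low:.3f}, {ci_high:.3f}])")
print("Per-fold:", fold_scores)
```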

---

### Common Pitfalls and Solutions

#### ❌ **Pitfall 1: Data Leakage Through Time**
**Problem**: Using K-Fold on time series data  
**Consequence**: Training on future to predict past → overly optimistic  
**Solution**: Always use Time Series Split for temporal data  
**Detection**: Check if max(train_dates) > min(test_dates) in any fold

#### ❌ **Pitfall 2: Group Leakage**
**Problem**: Same patient/wafer/customer in train and test  
**Consequence**: Model learns entity-specific patterns, not generalizable  
**Solution**: Use Group K-Fold, ensure no group overlap  
**Detection**: Check if train_groups ∩ test_groups ≠ ∅

#### ❌ **Pitfall 3: Optimistic Hyperparameter Tuning**
**Problem**: Reporting best_score_ from GridSearchCV  
**Consequence**: Optimism bias from data snooping (often a few percentage points; the size depends on the dataset and search space)  
**Solution**: Use Nested CV for unbiased estimate  
**Detection**: Compare inner CV score to outer CV score (gap = bias)
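
One way to sketch the nested setup is to pass a `GridSearchCV` object to `cross_val_score`, so the whole search is refit inside every outer fold. The dataset, model, and parameter grid below are illustrative assumptions, not the ones used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)  # synthetic data

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)   # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)   # evaluation

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},   # illustrative grid
    cv=inner_cv,
)

# Outer loop: the search (tuning included) is repeated on each outer training split
outer_scores = cross_val_score(search, X, y, cv=outer_cv)

# For comparison only: the optimistic score from tuning on all the data
search.fit(X, y)
inner_score = search.best_score_

print(f"Inner (optimistic) score: {inner_score:.3f}")
print(f"Nested CV score:          {outer_scores.mean():.3f} ± {outer_scores.std(ddof=1):.3f}")
```

On a small, easy dataset like this one the gap may be tiny or even negative; it tends to grow with larger search spaces and smaller samples.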

#### ❌ **Pitfall 4: Not Stratifying Imbalanced Data**
**Problem**: Some folds have 1% positive, others 5%  
**Consequence**: High variance in metrics across folds  
**Solution**: Use Stratified K-Fold for classification  
**Detection**: Check class distribution per fold (should be consistent)

#### ❌ **Pitfall 5: Preprocessing Leakage**
**Problem**: Fitting scaler on full dataset before CV  
**Consequence**: Test set statistics leak into training  
**Solution**: Fit preprocessing inside CV loop (use sklearn Pipeline)  
**Detection**: Check if preprocessing uses test set information
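
The Pipeline-based fix can be sketched as follows: because the scaler lives inside the estimator passed to `cross_val_score`, it is refit on the training part of every fold. The synthetic dataset and the choice of `LogisticRegression` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)  # synthetic data

# Scaler and model travel together, so scaling statistics come from training folds only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print(f"Accuracy: {scores.mean():.3f} ± {scores.std(ddof=1):.3f}")
```

The leaky variant would call `StandardScaler().fit_transform(X)` once on the full dataset before cross-validating.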

---

### Production Deployment Guidelines

#### **Phase 1: Development (Fast Iteration)**
- Use simple train/test split (80/20)
- CV not required for rapid prototyping
- Focus: Model architecture, feature engineering
- Speed > Rigor

#### **Phase 2: Model Selection (Compare Algorithms)**
- Use 5-Fold CV (stratified/time series as appropriate)
- Compare multiple models with same CV strategy
- Report mean ± std for each model
- Use statistical tests (paired t-test) to compare
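
A paired comparison on identical folds could be sketched like this. The two models and the synthetic dataset are placeholders. Since CV folds overlap in their training data, the resulting p-value is only an approximation and should be treated as a rough guide.

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)  # synthetic data

# A fixed random_state makes both models see exactly the same folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Paired test: compares the per-fold score differences, not the two means in isolation
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)

print(f"Model A: {scores_a.mean():.3f} ± {scores_a.std(ddof=1):.3f}")
print(f"Model B: {scores_b.mean():.3f} ± {scores_b.std(ddof=1):.3f}")
print(f"Paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")
```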

#### **Phase 3: Final Evaluation (Production Readiness)**
- Use 10-Fold CV or Nested CV
- Report confidence intervals
- Include multiple metrics (accuracy, precision, recall, AUROC, etc.)
- Document CV strategy in model card
- Validate on hold-out test set (if available)

#### **Phase 4: Monitoring (Post-Deployment)**
- Compare production metrics to CV estimates
- Alert if performance drops below CV mean - 2σ
- Re-run CV periodically on new data
- Retrain when CV performance degrades significantly
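
The alert rule above can be sketched in a few lines. The CV scores and the helper name `needs_alert` are hypothetical, and the 2σ cutoff simply mirrors the guideline rather than a tuned value.

```python
import numpy as np

cv_scores = np.array([0.88, 0.91, 0.86, 0.90, 0.89])   # hypothetical CV scores at release

cv_mean = cv_scores.mean()
cv_std = cv_scores.std(ddof=1)
alert_threshold = cv_mean - 2 * cv_std                  # "CV mean - 2σ" rule


def needs_alert(production_score: float) -> bool:
    """Hypothetical helper: flag production scores below the CV-derived threshold."""
    return production_score < alert_threshold


print(f"Alert threshold: {alert_threshold:.4f}")
for score in (0.87, 0.80):
    print(f"production score {score:.2f} -> alert: {needs_alert(score)}")
```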

---

### Computational Optimization

#### **Speed vs Accuracy Tradeoffs**
1. **Reduce K**: Use K=3 instead of K=10 (3.3× speedup)
2. **Parallelize**: Use `n_jobs=-1` in CV functions
3. **Subsample**: Use stratified subset for large datasets
4. **Early stopping**: For iterative models (GBM, neural networks)
5. **RandomizedSearchCV**: Instead of GridSearchCV (10-100× speedup)
6. **Cache**: Use `memory` parameter in sklearn Pipeline
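
Items 1, 2 and 5 can be combined as in the sketch below: a randomized search with a fixed budget of candidates, K=3, and parallel fits. The dataset, pipeline step names, and search range are assumptions for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)  # synthetic data

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

search = RandomizedSearchCV(
    pipe,
    param_distributions={"clf__C": loguniform(1e-3, 1e2)},  # illustrative range
    n_iter=8,          # fixed budget: 8 sampled candidates instead of a full grid
    cv=3,              # fewer folds for faster iteration
    n_jobs=-1,         # use all available cores
    random_state=0,
)
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"Best CV score: {search.best_score_:.3f}")
```

The cost is `n_iter × cv` model fits (24 here), regardless of how finely the parameter range is defined.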

#### **When to Use Each Strategy**
- **Development**: K=3, single random seed
- **Model selection**: K=5, multiple metrics
- **Final evaluation**: K=10 or Nested CV, multiple seeds
- **Research/Publication**: Nested CV, comprehensive metrics, reproducibility details

---

### Semiconductor-Specific Best Practices

#### **Wafer-Level Models**
- Always group by wafer_id (avoid spatial leakage)
- Consider stratifying by yield bins
- Visualize wafer maps to understand spatial patterns
- Report performance per fab if multi-fab deployment

#### **Lot-Based Models**
- Group by lot_id (avoid process correlation leakage)
- Use Time Series Split if modeling across production time
- Monitor for process drift (performance trend analysis)
- Consider separate models per product family

#### **Test Time Models**
- Use Time Series Split (test programs evolve)
- Monitor for equipment aging effects
- Consider rolling window if equipment replaced
- Include gap between train/test for test program updates

#### **Defect Detection Models**
- Use Stratified K-Fold (maintain defect rate)
- Consider SMOTE or class weighting for extreme imbalance
- Report AUPRC (not accuracy) due to imbalance
- Tune threshold separately for production (precision vs recall tradeoff)

---

### Advanced Topics (Beyond This Notebook)

#### **Custom Cross-Validation Strategies**
- Implement custom splitters for domain-specific needs
- Example: Block CV for spatial data, Walk-Forward CV for finance
- Inherit from `sklearn.model_selection.BaseCrossValidator`
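
A minimal custom splitter might look like the sketch below. `BlockedKFold` is a hypothetical class written for this example (contiguous, unshuffled validation blocks); it shows only the inheritance pattern and is not a complete spatial block CV.

```python
import numpy as np
from sklearn.model_selection import BaseCrossValidator


class BlockedKFold(BaseCrossValidator):
    """Hypothetical splitter: each validation fold is one contiguous block of rows."""

    def __init__(self, n_splits=5):
        self.n_splits = n_splits

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def _iter_test_indices(self, X=None, y=None, groups=None):
        # The base class turns these test indices into (train, test) index pairs
        for block in np.array_split(np.arange(len(X)), self.n_splits):
            yield block


X = np.zeros((12, 1))
blocked_splits = list(BlockedKFold(n_splits=4).split(X))

for train_idx, test_idx in blocked_splits:
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

Because it follows the splitter interface, an instance can be passed as `cv=` to functions such as `cross_val_score` or `GridSearchCV`.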

#### **Ensemble Cross-Validation**
- Train ensemble where each member trained on different CV fold
- Average predictions for better generalization
- Requires K models (memory/computation tradeoff)
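
The idea can be sketched by keeping the model fitted on each training split and averaging their predictions on new data. The regression dataset, the `Ridge` model, and the 10 held-back rows standing in for unseen data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

X_all, y_all = make_regression(n_samples=210, n_features=5, noise=10.0, random_state=0)
X, y = X_all[:200], y_all[:200]      # data used for the K fold models
X_new = X_all[200:]                  # 10 rows held back as stand-in "new" data

# One model per fold, each trained on a different (K-1)/K subset
models = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    models.append(Ridge(alpha=1.0).fit(X[train_idx], y[train_idx]))

# Ensemble prediction: simple average over the K fold models
ensemble_pred = np.mean([m.predict(X_new) for m in models], axis=0)

print("Number of models:", len(models))
print("Ensemble predictions:", np.round(ensemble_pred, 1))
```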

#### **Cross-Validation for Deep Learning**
- Use K-Fold for small datasets (<10K samples)
- Use single train/val split for large datasets (>100K)
- Consider stratified split by class for image classification
- Use TimeSeriesSplit for sequential data (RNN, LSTM)

#### **Multi-Objective Cross-Validation**
- Optimize multiple metrics simultaneously
- Example: Maximize accuracy while minimizing inference time
- Use Pareto-optimal solutions

---

### Final Recommendations

#### **For Beginners**
1. Start with standard K-Fold (K=5)
2. Learn to recognize data structure violations (time, groups, imbalance)
3. Switch to appropriate CV strategy when needed
4. Always report mean ± std

#### **For Practitioners**
1. Match CV strategy to production deployment
2. Use Nested CV for hyperparameter tuning
3. Monitor for data leakage (time, groups, preprocessing)
4. Document CV strategy in model card

#### **For Researchers**
1. Use Nested CV for unbiased estimates
2. Report both inner and outer CV scores
3. Use multiple random seeds to verify stability
4. Provide full reproducibility details

#### **For Semiconductor Engineers**
1. Group by wafer_id or lot_id (avoid spatial/process leakage)
2. Use Time Series Split for production time series
3. Stratify by yield bins or defect classes
4. Monitor for process drift post-deployment

---

### Resources for Further Learning

#### **Sklearn Documentation**
- [Cross-validation guide](https://scikit-learn.org/stable/modules/cross_validation.html)
- [Model evaluation metrics](https://scikit-learn.org/stable/modules/model_evaluation.html)

#### **Academic Papers**
- Cawley, G. C., & Talbot, N. L. (2010). "On over-fitting in model selection and subsequent selection bias in performance evaluation." JMLR.
- Bergmeir, C., & Benítez, J. M. (2012). "On the use of cross-validation for time series predictor evaluation." Information Sciences.

#### **Practical Guides**
- Raschka, S. (2018). "Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning."
- Kuhn, M., & Johnson, K. (2013). "Applied Predictive Modeling." Chapter 4: Over-Fitting and Model Tuning.

---

### Summary

**Cross-validation is not optional; it's the foundation of reliable machine learning.**

- ✅ Choose CV strategy based on data structure
- ✅ Report uncertainty (mean ± std, confidence intervals)
- ✅ Avoid data leakage (time, groups, preprocessing)
- ✅ Use Nested CV for hyperparameter tuning
- ✅ Monitor production performance vs CV estimates

**Remember**: The goal is not to maximize CV score; it's to get an **honest, unbiased estimate** of how your model will perform in production. A lower but realistic CV score is infinitely more valuable than a high but optimistic one.

---

**Congratulations!** You now have a comprehensive understanding of cross-validation strategies. Use this knowledge to build robust, reliable machine learning models that generalize well to production environments.

**Next Steps**: Apply these techniques to real datasets, experiment with different CV strategies, and always validate your assumptions. Cross-validation is both an art and a science; master it, and you'll build models that stand the test of time (and production!).