# 106: A/B Testing for ML Models

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Understand** statistical foundations of A/B testing: hypothesis testing, p-values, statistical power
- **Implement** online A/B tests comparing model variants in production
- **Build** multi-armed bandit strategies for adaptive experimentation
- **Apply** A/B testing to semiconductor yield prediction model deployments
- **Evaluate** test duration, sample size requirements, and early stopping criteria

## 📚 What is A/B Testing for ML Models?

A/B testing for machine learning validates whether a new model actually performs better than the current production model under real-world conditions. Unlike offline evaluation on test sets, A/B testing exposes both models to live data simultaneously, randomly routing traffic between them while measuring business metrics. This reveals issues invisible in offline testing: data distribution shifts, user behavior changes, system integration bugs, and actual business impact.

Traditional A/B testing compares static variants (e.g., blue button vs red button). ML model A/B testing is more complex because model behavior depends on the live input distribution, predictions interact with downstream systems, and metrics may have high variance or delayed feedback. A rigorous A/B test requires proper randomization, sufficient statistical power, guardrail metrics (to catch regressions), and clear success criteria agreed upon before deployment.

In semiconductor manufacturing, A/B testing validates whether new yield prediction models, test time optimizations, or binning algorithms actually improve KPIs (yield, cost, quality) without introducing unexpected failures. For example, a model may show 95% accuracy offline but cause 10% more false rejects in production due to calibration drift; A/B testing catches this before full rollout.

**Why A/B Testing for ML Models?**
- ✅ **Validation**: Offline metrics (R², AUC) don't guarantee real-world improvement; A/B tests measure actual impact
- ✅ **Risk Mitigation**: Gradual rollout (5% → 50% → 100%) limits blast radius if new model fails
- ✅ **Causal Inference**: Randomization ensures performance differences are due to model, not confounders
- ✅ **Business Metrics**: Test what matters (revenue, cost, yield) not just ML metrics (RMSE, accuracy)
- ✅ **Continuous Improvement**: Culture of experimentation enables rapid model iteration

## 🏭 Post-Silicon Validation Use Cases

**Use Case 1: Yield Prediction Model Upgrade**
- **Setup**: A = Current Random Forest (R²=0.88), B = New XGBoost (R²=0.92 offline)
- **Metric**: False reject rate (devices incorrectly predicted to fail)
- **Test**: Route 50% of lots to each model for 2 weeks (n=200 lots)
- **Result**: Model B reduces false rejects 15% (p=0.003), saves $500K/month in unnecessary scrapping
- **Decision**: Roll out Model B to 100% of production

**Use Case 2: Adaptive Test Insertion Algorithm**
- **Setup**: A = Fixed test sequence, B = ML-driven adaptive testing (skip low-risk tests)
- **Metrics**: Primary = Test time, Guardrail = Defect escape rate
- **Test**: Multi-armed bandit with Thompson sampling, 10K devices
- **Result**: Model B reduces test time 28% BUT defect escapes increase 2% → REJECT Model B
- **Decision**: Retrain Model B with stricter safety constraints, re-test

**Use Case 3: Wafer Map Defect Classifier**
- **Setup**: A = Rule-based classifier, B = CNN-based AutoML model
- **Metric**: Correct defect type identification (validated by engineers)
- **Test**: Parallel deployment, engineers label 500 wafer maps for ground truth
- **Result**: Model B achieves 94% accuracy vs 78% for Model A (p<0.001)
- **Decision**: Deploy Model B, decommission rule-based system

**Use Case 4: Binning Algorithm Optimization**
- **Setup**: A = Manual binning rules, B = Data-driven ML binning
- **Metrics**: Primary = BIN1 yield (premium), Guardrail = Customer returns <0.1%
- **Test**: A/A test first (validate infrastructure), then A/B for 4 weeks
- **Result**: Model B increases BIN1 yield 6% with zero return rate increase (p=0.012)
- **Value**: $3M additional quarterly revenue from premium bin optimization

## 🔄 A/B Testing Workflow

```mermaid
graph TB
    A[New Model Candidate] --> B[Offline Evaluation]
    B --> C{Passes Threshold?}
    
    C -->|No| D[Reject Model]
    C -->|Yes| E[Define Success Metrics]
    
    E --> F[Power Analysis]
    F --> G[Calculate Sample Size]
    G --> H[Design Experiment]
    
    H --> I[A/A Test]
    I --> J{Infrastructure OK?}
    J -->|No| K[Fix Bias]
    K --> I
    
    J -->|Yes| L[A/B Test Launch]
    L --> M[Traffic Splitting]
    
    M --> N[Control: Model A]
    M --> O[Treatment: Model B]
    
    N --> P[Monitor Metrics]
    O --> P
    
    P --> Q{Guardrails OK?}
    Q -->|No| R[Emergency Stop]
    R --> D
    
    Q -->|Yes| S{Significant Result?}
    S -->|Not Yet| T{Budget Exhausted?}
    T -->|No| P
    T -->|Yes| U[Inconclusive]
    
    S -->|Yes, B Better| V[Gradual Rollout]
    V --> W[100% Traffic to B]
    
    S -->|Yes, A Better| D
    
    style A fill:#e1f5ff
    style W fill:#e1ffe1
    style R fill:#ffe1e1
    style D fill:#ffe1e1
```

## 📊 Learning Path Context

**Prerequisites:**
- **041**: Model Evaluation - Understanding offline metrics
- **104**: Model Interpretability - Debugging model differences
- **105**: AutoML - Generating candidate models to test

**This Notebook (106):**
- Hypothesis testing fundamentals (t-tests, chi-square)
- Sample size and statistical power calculations
- A/B test implementation (traffic splitting, metric collection)
- Multi-armed bandits (Thompson sampling, UCB)
- Sequential testing and early stopping

**Next Steps:**
- **107**: Model Monitoring - Continuous performance tracking post-deployment
- **131**: Cloud Deployment - Production infrastructure for A/B testing at scale

---

Let's test models the right way, with real data and real impact! 📊

## 1. Setup and Imports

```python
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import ttest_ind, chi2_contingency, norm
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 8)

# Random seed
np.random.seed(42)

print("✅ Environment ready for A/B testing!")
```

## 2. Generate Semiconductor Production Data

**Purpose:** Simulate production environment for A/B testing.

**Key Points:**
- **Realistic variance**: Production data has more noise than offline test sets
- **Time dependency**: Sequential lots with temporal patterns
- **Business metrics**: False rejects (cost), false accepts (quality risk)
- **Why this matters**: A/B tests must handle real-world variability

```python
# Simulate 500 production lots tested over time
n_lots = 500
devices_per_lot = 100

# Time-based patterns (production drift)
time = np.arange(n_lots)
drift = 0.02 * np.sin(2 * np.pi * time / 100)  # Periodic drift, one cycle per 100 lots

# Generate lot-level features
lot_data = []
for lot_id in range(n_lots):
    # Parametric measurements with drift
    vdd = np.random.normal(1.2 + drift[lot_id], 0.08, devices_per_lot)
    idd = np.random.normal(50 + drift[lot_id] * 10, 8, devices_per_lot)
    freq = np.random.normal(2000, 150, devices_per_lot)
    temp = np.random.normal(85, 12, devices_per_lot)
    vth = np.random.normal(0.4 + drift[lot_id], 0.03, devices_per_lot)

    # Device-level quality score (unknown in production).
    # The offset of 108 puts lot-level pass rates near the 85% cutoff used in
    # later cells, so the lot pass/fail labels are not all identical.
    power = vdd * idd
    true_yield = (
        108 - 0.35 * power + 12 * vth - 0.01 * temp * freq / 1000
        + np.random.normal(0, 3, devices_per_lot)
    )
    true_yield = np.clip(true_yield, 60, 100)

    # Device pass/fail labels (score > 85 = pass)
    pass_fail = (true_yield > 85).astype(int)

    # Store lot-level aggregates; lot yield = percentage of passing devices
    lot_data.append({
        'lot_id': lot_id,
        'time': lot_id,
        'avg_vdd': vdd.mean(),
        'avg_idd': idd.mean(),
        'avg_freq': freq.mean(),
        'avg_temp': temp.mean(),
        'avg_vth': vth.mean(),
        'true_yield_pct': pass_fail.mean() * 100,
        'devices': devices_per_lot
    })

df_production = pd.DataFrame(lot_data)

print(f"Production dataset: {len(df_production)} lots, {devices_per_lot} devices/lot")
print(f"\nYield statistics:")
print(df_production['true_yield_pct'].describe())
print(f"\nTemporal drift range: {drift.min():.4f} to {drift.max():.4f}")
```

## 3. Train Two Model Variants

**Purpose:** Create Model A (baseline) and Model B (new candidate) for comparison.

**Key Points:**
- **Model A**: Current production model (Random Forest)
- **Model B**: New candidate (Gradient Boosting)
- **Offline metrics**: B appears better, but does it hold in production?
- **Why this matters**: Offline superiority ≠ production superiority

```python
# Chronological split: first 300 lots for training, remaining 200 held out
train_df = df_production.iloc[:300].copy()
test_df = df_production.iloc[300:].copy()

feature_cols = ['avg_vdd', 'avg_idd', 'avg_freq', 'avg_temp', 'avg_vth']
X_train = train_df[feature_cols]
y_train = train_df['true_yield_pct']
X_test = test_df[feature_cols]
y_test = test_df['true_yield_pct']

# Model A: Random Forest (current production)
model_a = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1)
model_a.fit(X_train, y_train)
y_pred_a = model_a.predict(X_test)
rmse_a = np.sqrt(mean_squared_error(y_test, y_pred_a))
r2_a = r2_score(y_test, y_pred_a)

# Model B: Gradient Boosting (new candidate)
model_b = GradientBoostingRegressor(n_estimators=150, max_depth=5, learning_rate=0.1, random_state=42)
model_b.fit(X_train, y_train)
y_pred_b = model_b.predict(X_test)
rmse_b = np.sqrt(mean_squared_error(y_test, y_pred_b))
r2_b = r2_score(y_test, y_pred_b)

print("Offline Evaluation (on test set):")
print(f"\nModel A (Random Forest):")
print(f"  RMSE: {rmse_a:.3f}%")
print(f"  R²: {r2_a:.4f}")

print(f"\nModel B (Gradient Boosting):")
print(f"  RMSE: {rmse_b:.3f}%")
print(f"  R²: {r2_b:.4f}")

print(f"\n📊 Offline Comparison:")
print(f"  Model B RMSE improvement: {((rmse_a - rmse_b) / rmse_a * 100):.1f}%")
print(f"  Model B R² improvement: {(r2_b - r2_a):.4f}")
print(f"\n❓ Question: Will this offline improvement translate to production?")
```

## 4. Statistical Power Analysis

**Concept:** Calculate required sample size for detecting meaningful differences.

**Mathematics:**
$$n = \frac{2(Z_{\alpha/2} + Z_{\beta})^2 \sigma^2}{\delta^2}$$

Where:
- $n$ = sample size per group
- $Z_{\alpha/2}$ = critical value for significance level (1.96 for α=0.05)
- $Z_{\beta}$ = critical value for power (0.84 for 80% power)
- $\sigma$ = standard deviation
- $\delta$ = minimum detectable effect

**Why critical:** Underpowered tests waste time, overpowered tests waste resources

```python
def calculate_sample_size(baseline_std, min_effect, alpha=0.05, power=0.80):
    """
    Calculate required sample size per group for a two-sided two-sample test
    on a continuous metric (normal approximation).

    Parameters:
    - baseline_std: Standard deviation of metric
    - min_effect: Minimum effect size to detect (absolute units)
    - alpha: Significance level (Type I error rate)
    - power: Statistical power (1 - Type II error rate)
    """
    z_alpha = norm.ppf(1 - alpha/2)
    z_beta = norm.ppf(power)

    n = 2 * ((z_alpha + z_beta) ** 2) * (baseline_std ** 2) / (min_effect ** 2)

    return int(np.ceil(n))

# Power analysis for our A/B test
baseline_std = y_test.std()  # Standard deviation of lot yield
min_effect = 2.0  # Want to detect a 2 percentage point yield difference

sample_size = calculate_sample_size(baseline_std, min_effect)

print("Statistical Power Analysis:")
print(f"\nParameters:")
print(f"  Baseline std: {baseline_std:.2f}%")
print(f"  Minimum detectable effect: {min_effect:.1f} percentage points")
print(f"  Significance level (α): 0.05")
print(f"  Statistical power (1-β): 0.80")

print(f"\nRequired sample size: {sample_size} lots per group")
print(f"Total lots needed: {sample_size * 2}")

# Test duration estimate
lots_per_day = 10
test_days = (sample_size * 2) / lots_per_day
print(f"\nEstimated test duration: {test_days:.1f} days (at {lots_per_day} lots/day)")

print(f"\n💡 Interpretation:")
print(f"  Need {sample_size} lots in each group to detect a {min_effect} point yield difference")
print(f"  With 80% probability (power) and 5% false positive rate (α)")
```
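
The formula above assumes a continuous metric, while the primary metric in the later sections (false reject rate) is a proportion. As a rough companion, the sketch below sizes a two-sided two-proportion z-test with the usual normal approximation. The function name `sample_size_two_proportions` and the 10% and 5% rates are illustrative assumptions, not values from this notebook, and only the standard library is used so it runs outside the notebook too.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p_control, p_treatment, alpha=0.05, power=0.80):
    """Units per group for a two-sided two-proportion z-test (normal approximation)."""
    if p_control == p_treatment:
        raise ValueError("The two rates must differ to define an effect size.")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_treatment) / 2
    # Pooled variance under H0, unpooled variance under H1
    term_null = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    term_alt = z_beta * sqrt(
        p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    )
    return ceil((term_null + term_alt) ** 2 / (p_control - p_treatment) ** 2)


# Hypothetical rates: 10% false rejects under Model A, 5% under Model B
n_per_group = sample_size_two_proportions(0.10, 0.05)
print(f"Lots per group: {n_per_group}")
```

Halving the detectable difference roughly quadruples the required sample, which is why the minimum detectable effect should be fixed before the test starts.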

## 5. Simulate A/B Test Execution

**Purpose:** Randomly assign production lots to Model A vs Model B.

**Key Points:**
- **Random assignment**: Coin flip for each lot ensures unbiased comparison
- **Business metric**: False reject rate (predicted fail, actually pass)
- **Guardrail metric**: False accept rate (predicted pass, actually fail)
- **Why this matters**: Real A/B tests track multiple metrics simultaneously

```python
# Use production lots 300-499 for A/B test (200 lots available)
ab_test_df = df_production.iloc[300:].copy()
n_ab_lots = len(ab_test_df)

# Random assignment (50/50 split)
np.random.seed(42)
ab_test_df['variant'] = np.random.choice(['A', 'B'], size=n_ab_lots)

# Make predictions for each lot based on assignment
predictions_a = model_a.predict(ab_test_df[feature_cols])
predictions_b = model_b.predict(ab_test_df[feature_cols])

ab_test_df['predicted_yield'] = np.where(
    ab_test_df['variant'] == 'A',
    predictions_a,
    predictions_b
)

# Business metrics (using 85% threshold)
ab_test_df['predicted_pass'] = (ab_test_df['predicted_yield'] > 85).astype(int)
ab_test_df['actual_pass'] = (ab_test_df['true_yield_pct'] > 85).astype(int)

# False rejects: Predicted fail, actually pass (COSTLY - we reject good devices)
ab_test_df['false_reject'] = (
    (ab_test_df['predicted_pass'] == 0) & (ab_test_df['actual_pass'] == 1)
).astype(int)

# False accepts: Predicted pass, actually fail (RISKY - quality escapes)
ab_test_df['false_accept'] = (
    (ab_test_df['predicted_pass'] == 1) & (ab_test_df['actual_pass'] == 0)
).astype(int)

print("A/B Test Setup:")
print(f"\nTotal lots: {n_ab_lots}")
print(f"  Variant A: {(ab_test_df['variant'] == 'A').sum()} lots")
print(f"  Variant B: {(ab_test_df['variant'] == 'B').sum()} lots")

print(f"\nRandomization check (should be ~50/50):")
print(f"  A: {(ab_test_df['variant'] == 'A').sum() / n_ab_lots * 100:.1f}%")
print(f"  B: {(ab_test_df['variant'] == 'B').sum() / n_ab_lots * 100:.1f}%")
```
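
In a live system the coin flip is often replaced by a deterministic hash of the unit ID, so that the same lot always lands in the same variant even if the request is retried. The following is a minimal sketch of that idea; the function name `assign_variant`, the experiment salt `"yield_model_ab"`, and the use of SHA-256 are illustrative assumptions rather than part of this notebook's pipeline.

```python
import hashlib


def assign_variant(unit_id, experiment="yield_model_ab", treatment_share=0.5):
    """Map a lot ID to 'A' or 'B' deterministically via a salted hash."""
    key = f"{experiment}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # First 8 hex digits -> integer in [0, 2**32), scaled to [0, 1)
    bucket = int(digest[:8], 16) / 2**32
    return "B" if bucket < treatment_share else "A"


# Assign the 200 A/B lots by ID (hypothetical usage)
assignments = [assign_variant(lot_id) for lot_id in range(300, 500)]
print(f"A: {assignments.count('A')}, B: {assignments.count('B')}")
```

Salting the hash with the experiment name keeps assignments independent across experiments, so a lot that was in group B for one test is not systematically in group B for the next.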

## 6. Analyze A/B Test Results

**Purpose:** Statistical comparison of Model A vs Model B performance.

**Key Points:**
- **Primary metric**: False reject rate (cost reduction)
- **Guardrail metric**: False accept rate (quality protection)
- **Statistical test**: Two-proportion z-test
- **Why this matters**: Need statistically significant improvement to justify deployment

```python
from statsmodels.stats.proportion import proportions_ztest

# Calculate metrics by variant
results_a = ab_test_df[ab_test_df['variant'] == 'A']
results_b = ab_test_df[ab_test_df['variant'] == 'B']

# Primary metric: False reject rate
fr_rate_a = results_a['false_reject'].mean() * 100
fr_rate_b = results_b['false_reject'].mean() * 100

# Guardrail metric: False accept rate
fa_rate_a = results_a['false_accept'].mean() * 100
fa_rate_b = results_b['false_accept'].mean() * 100

# Prediction error (RMSE)
rmse_prod_a = np.sqrt(mean_squared_error(
    results_a['true_yield_pct'],
    results_a['predicted_yield']
))
rmse_prod_b = np.sqrt(mean_squared_error(
    results_b['true_yield_pct'],
    results_b['predicted_yield']
))

print("A/B Test Results:")
print("="*60)

print(f"\n📊 PRIMARY METRIC: False Reject Rate")
print(f"  Model A: {fr_rate_a:.2f}% ({results_a['false_reject'].sum()} / {len(results_a)} lots)")
print(f"  Model B: {fr_rate_b:.2f}% ({results_b['false_reject'].sum()} / {len(results_b)} lots)")
print(f"  Improvement: {fr_rate_a - fr_rate_b:.2f} percentage points")
if fr_rate_a > 0:
    print(f"  Relative improvement: {((fr_rate_a - fr_rate_b) / fr_rate_a * 100):.1f}%")
else:
    print(f"  Relative improvement: n/a (Model A had no false rejects)")

print(f"\n🛡️ GUARDRAIL METRIC: False Accept Rate")
print(f"  Model A: {fa_rate_a:.2f}%")
print(f"  Model B: {fa_rate_b:.2f}%")
print(f"  Change: {fa_rate_b - fa_rate_a:+.2f} percentage points")

print(f"\n📈 RMSE (Prediction Accuracy)")
print(f"  Model A: {rmse_prod_a:.3f}%")
print(f"  Model B: {rmse_prod_b:.3f}%")
print(f"  Improvement: {rmse_prod_a - rmse_prod_b:.3f}%")

# Two-sided two-proportion z-test on the false reject counts
successes = np.array([results_a['false_reject'].sum(), results_b['false_reject'].sum()])
samples = np.array([len(results_a), len(results_b)])

z_stat, p_value = proportions_ztest(successes, samples)

print(f"\n📊 Statistical Significance Test (False Reject Rate):")
print(f"  Z-statistic: {z_stat:.3f}")
print(f"  P-value: {p_value:.4f}")
print(f"  Significance level: α = 0.05")

if np.isnan(p_value):
    # Happens when neither variant produced a false reject (zero pooled variance)
    print(f"\n  ❌ UNDEFINED: No false rejects observed, the test cannot be computed")
elif p_value < 0.05:
    # The test is two-sided, so check the direction before naming a winner
    better = 'B' if fr_rate_b < fr_rate_a else 'A'
    print(f"\n  ✅ SIGNIFICANT: Difference is statistically significant (p < 0.05)")
    print(f"     Model {better} has the lower false reject rate")
else:
    print(f"\n  ❌ NOT SIGNIFICANT: Difference could be due to chance (p >= 0.05)")
    print(f"     Need more data or larger effect size")
```
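
A p-value alone says nothing about how large the difference might plausibly be. One common complement is a confidence interval for the difference in rates. The sketch below computes a simple Wald interval with the unpooled standard error; the function name `diff_in_proportions_ci` and the example counts are hypothetical and are not results from the test above.

```python
from math import sqrt
from statistics import NormalDist


def diff_in_proportions_ci(events_a, n_a, events_b, n_b, confidence=0.95):
    """Wald confidence interval for p_a - p_b (unpooled standard error)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_a - p_b
    return diff, diff - z * se, diff + z * se


# Hypothetical counts: 18 false rejects in 100 lots (A) vs 9 in 100 lots (B)
diff, low, high = diff_in_proportions_ci(18, 100, 9, 100)
print(f"Difference (A - B): {diff:.3f}, 95% CI: [{low:.3f}, {high:.3f}]")
```

If the interval contains zero, the data are compatible with no difference at that confidence level. The Wald interval is known to be unreliable for very small counts, so treat it as a first look rather than a final answer.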

## 7. Visualize A/B Test Results

```python
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: False reject rate comparison
metrics = ['False Reject Rate (%)', 'False Accept Rate (%)']
a_values = [fr_rate_a, fa_rate_a]
b_values = [fr_rate_b, fa_rate_b]

x = np.arange(len(metrics))
width = 0.35

axes[0, 0].bar(x - width/2, a_values, width, label='Model A', alpha=0.8, color='skyblue')
axes[0, 0].bar(x + width/2, b_values, width, label='Model B', alpha=0.8, color='lightcoral')
axes[0, 0].set_ylabel('Rate (%)')
axes[0, 0].set_title('Business Metrics Comparison')
axes[0, 0].set_xticks(x)
axes[0, 0].set_xticklabels(metrics)
axes[0, 0].legend()
axes[0, 0].grid(axis='y', alpha=0.3)

# Plot 2: Prediction error distribution
errors_a = results_a['true_yield_pct'] - results_a['predicted_yield']
errors_b = results_b['true_yield_pct'] - results_b['predicted_yield']

axes[0, 1].hist(errors_a, bins=20, alpha=0.6, label=f'Model A (σ={errors_a.std():.2f})', color='skyblue')
axes[0, 1].hist(errors_b, bins=20, alpha=0.6, label=f'Model B (σ={errors_b.std():.2f})', color='lightcoral')
axes[0, 1].axvline(0, color='black', linestyle='--', linewidth=2)
axes[0, 1].set_xlabel('Prediction Error (%)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Error Distribution')
axes[0, 1].legend()
axes[0, 1].grid(axis='y', alpha=0.3)

# Plot 3: Cumulative false reject rate over time
results_a_sorted = results_a.sort_values('time')
results_b_sorted = results_b.sort_values('time')

cumulative_fr_a = results_a_sorted['false_reject'].cumsum() / np.arange(1, len(results_a_sorted) + 1) * 100
cumulative_fr_b = results_b_sorted['false_reject'].cumsum() / np.arange(1, len(results_b_sorted) + 1) * 100

axes[1, 0].plot(cumulative_fr_a.values, label='Model A', linewidth=2, color='skyblue')
axes[1, 0].plot(cumulative_fr_b.values, label='Model B', linewidth=2, color='lightcoral')
axes[1, 0].set_xlabel('Lots Tested')
axes[1, 0].set_ylabel('Cumulative False Reject Rate (%)')
axes[1, 0].set_title('Sequential Test Monitoring')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Prediction scatter
axes[1, 1].scatter(results_a['true_yield_pct'], results_a['predicted_yield'],
                   alpha=0.5, s=30, label='Model A', color='skyblue')
axes[1, 1].scatter(results_b['true_yield_pct'], results_b['predicted_yield'],
                   alpha=0.5, s=30, label='Model B', color='lightcoral')
axes[1, 1].plot([60, 100], [60, 100], 'k--', lw=2, label='Perfect prediction')
axes[1, 1].axhline(85, color='red', linestyle=':', linewidth=1, label='Pass/Fail threshold')
axes[1, 1].axvline(85, color='red', linestyle=':', linewidth=1)
axes[1, 1].set_xlabel('Actual Yield (%)')
axes[1, 1].set_ylabel('Predicted Yield (%)')
axes[1, 1].set_title('Prediction Accuracy')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

## 8. Multi-Armed Bandit (Thompson Sampling)

**Concept:** Adaptively allocate traffic to better-performing variant during test.

**Mathematics (Beta-Bernoulli):**
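$$\theta_A \mid \text{data} \sim \text{Beta}(\alpha_A + s_A,\ \beta_A + f_A)$$

Where:
- $\theta_A$ = unknown success rate of variant A
- $\alpha_A, \beta_A$ = prior parameters (both 1 for a uniform prior)
- $s_A$ = successes (correct predictions)
- $f_A$ = failures (incorrect predictions)

For each lot, draw one sample from each variant's posterior and route the lot to the variant with the higher sample.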

**Advantage:** Minimizes regret (cost of testing inferior variant)

```python
class ThompsonSampling:
    def __init__(self, n_variants=2):
        # Beta distribution parameters (prior: uniform, Beta(1, 1))
        self.alpha = np.ones(n_variants)  # Successes + 1
        self.beta = np.ones(n_variants)   # Failures + 1
        self.n_variants = n_variants
        self.history = []

    def select_variant(self):
        """Sample from each variant's Beta posterior and select the largest draw."""
        samples = np.random.beta(self.alpha, self.beta)
        selected = np.argmax(samples)
        return selected, samples

    def update(self, variant, reward):
        """Update Beta parameters based on outcome."""
        if reward == 1:  # Success (correct prediction)
            self.alpha[variant] += 1
        else:  # Failure (incorrect prediction)
            self.beta[variant] += 1

        self.history.append({
            'variant': variant,
            'reward': reward,
            'alpha': self.alpha.copy(),
            'beta': self.beta.copy()
        })

# Run Thompson Sampling on production data
ts = ThompsonSampling(n_variants=2)
selections = []
rewards = []

# Score every lot with both models up front (column 0 = Model A, column 1 = Model B).
# The bandit still only observes the reward of the variant it selects.
all_preds = np.column_stack([
    model_a.predict(ab_test_df[feature_cols]),
    model_b.predict(ab_test_df[feature_cols]),
])
actual_pass = (ab_test_df['true_yield_pct'] > 85).to_numpy()

for i in range(len(ab_test_df)):
    # Select variant
    variant, _ = ts.select_variant()

    # Evaluate (reward = 1 if pass/fail prediction is correct, 0 otherwise)
    pred_pass = all_preds[i, variant] > 85
    reward = int(pred_pass == actual_pass[i])

    # Update
    ts.update(variant, reward)
    selections.append(variant)
    rewards.append(reward)

# Analyze results
selections = np.array(selections)
rewards = np.array(rewards)

print("Thompson Sampling Results:")
print(f"\nVariant selection:")
print(f"  Model A: {(selections == 0).sum()} times ({(selections == 0).mean() * 100:.1f}%)")
print(f"  Model B: {(selections == 1).sum()} times ({(selections == 1).mean() * 100:.1f}%)")

print(f"\nPerformance by variant:")
print(f"  Model A accuracy: {rewards[selections == 0].mean() * 100:.1f}%")
print(f"  Model B accuracy: {rewards[selections == 1].mean() * 100:.1f}%")

print(f"\nOverall accuracy: {rewards.mean() * 100:.1f}%")

# Plot selection over time
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Cumulative selection proportion
cumulative_b = np.cumsum(selections == 1) / np.arange(1, len(selections) + 1)
axes[0].plot(cumulative_b * 100, linewidth=2)
axes[0].axhline(50, color='red', linestyle='--', label='Equal split')
axes[0].set_xlabel('Trial Number')
axes[0].set_ylabel('Model B Selection Rate (%)')
axes[0].set_title('Thompson Sampling: Adaptive Allocation')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Beta distributions at end of test
x = np.linspace(0, 1, 1000)
dist_a = stats.beta.pdf(x, ts.alpha[0], ts.beta[0])
dist_b = stats.beta.pdf(x, ts.alpha[1], ts.beta[1])

axes[1].plot(x, dist_a, label=f'Model A (α={ts.alpha[0]:.0f}, β={ts.beta[0]:.0f})', linewidth=2)
axes[1].plot(x, dist_b, label=f'Model B (α={ts.alpha[1]:.0f}, β={ts.beta[1]:.0f})', linewidth=2)
axes[1].set_xlabel('Success Rate')
axes[1].set_ylabel('Probability Density')
axes[1].set_title('Posterior Distributions (End of Test)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Report which variant received more traffic instead of assuming a winner
favored = 'A' if (selections == 0).sum() > (selections == 1).sum() else 'B'
print(f"\n💡 Thompson Sampling routed more traffic to Model {favored}")
print(f"   When the models perform similarly, the split stays close to 50/50")
```
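
The learning path for this notebook also lists UCB alongside Thompson sampling. As a point of comparison, here is a minimal UCB1 sketch with the same `select_variant` / `update` interface as the class above. It runs on simulated Bernoulli arms whose success rates (0.60 and 0.85) are hypothetical and unrelated to the notebook's models, and it uses only the standard library.

```python
import math
import random


class UCB1:
    """UCB1 bandit for rewards in [0, 1]."""

    def __init__(self, n_variants=2):
        self.counts = [0] * n_variants
        self.reward_sums = [0.0] * n_variants

    def select_variant(self):
        # Play every arm once before applying the confidence bound
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            self.reward_sums[arm] / self.counts[arm]
            + math.sqrt(2 * math.log(total) / self.counts[arm])
            for arm in range(len(self.counts))
        ]
        return scores.index(max(scores))

    def update(self, variant, reward):
        self.counts[variant] += 1
        self.reward_sums[variant] += reward


# Simulated arms with hypothetical success rates (not taken from the notebook data)
rng = random.Random(0)
true_rates = [0.60, 0.85]
bandit = UCB1(n_variants=2)
for _ in range(2000):
    arm = bandit.select_variant()
    bandit.update(arm, 1 if rng.random() < true_rates[arm] else 0)

print(f"Pulls per arm: {bandit.counts}")
```

Unlike Thompson sampling, UCB1 is deterministic given the observed rewards: it adds an exploration bonus that shrinks as an arm is played more often, so a rarely played arm is revisited until its upper bound falls below the leader's.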

## 9. Decision Framework

**Purpose:** Structured decision criteria for A/B test outcomes.

**Key Points:**
- **Statistical significance**: p < 0.05
- **Practical significance**: Effect size > minimum threshold
- **Guardrail checks**: No degradation in quality metrics
- **Business value**: ROI justifies deployment cost

```python
def make_ab_decision(p_value, effect_size, guardrail_ok, min_effect=1.0, alpha=0.05,
                     guardrail_control=None, guardrail_treatment=None):
    """
    Decision framework for A/B test outcomes.

    Parameters:
    - p_value: Statistical significance
    - effect_size: Improvement of B over A (percentage points, positive = B better)
    - guardrail_ok: Boolean, True if guardrails passed
    - min_effect: Minimum practical effect size
    - alpha: Significance threshold
    - guardrail_control, guardrail_treatment: Optional guardrail rates (%) for the report

    Returns:
    - Decision string: 'SHIP IT! 🚀', 'ITERATE ⚙️', 'STOP 🛑', or 'INCONCLUSIVE 🤷'
    """
    statistically_significant = p_value < alpha
    # Only an improvement counts; abs() would also pass a significant regression
    practically_significant = effect_size >= min_effect

    print("A/B Test Decision Framework")
    print("="*60)

    print(f"\n1. Statistical Significance:")
    print(f"   p-value: {p_value:.4f}")
    print(f"   Threshold: {alpha}")
    print(f"   Result: {'✅ PASS' if statistically_significant else '❌ FAIL'}")

    print(f"\n2. Practical Significance:")
    print(f"   Effect size: {effect_size:.2f} percentage points")
    print(f"   Minimum threshold: {min_effect:.2f}")
    print(f"   Result: {'✅ PASS' if practically_significant else '❌ FAIL'}")

    print(f"\n3. Guardrail Metrics:")
    if guardrail_treatment is not None:
        print(f"   False accept rate: {guardrail_treatment:.2f}% (Model B)")
    if guardrail_control is not None:
        print(f"   Baseline: {guardrail_control:.2f}% (Model A)")
    print(f"   Result: {'✅ PASS' if guardrail_ok else '❌ FAIL (Quality degradation!)'}")

    # Decision logic
    print(f"\n" + "="*60)
    if statistically_significant and practically_significant and guardrail_ok:
        decision = "SHIP IT! 🚀"
        print(f"\n✅ {decision}")
        print(f"   Model B is significantly better and guardrails passed")
        print(f"   Recommended rollout: 10% → 50% → 100% over 2 weeks")
    elif statistically_significant and practically_significant and not guardrail_ok:
        decision = "ITERATE ⚙️"
        print(f"\n⚠️  {decision}")
        print(f"   Model B improves primary metric but degrades guardrail")
        print(f"   Recommended: Retrain with guardrail constraints, re-test")
    elif statistically_significant and not practically_significant:
        decision = "STOP 🛑"
        print(f"\n❌ {decision}")
        print(f"   Improvement is below the minimum threshold (or Model B is worse)")
        print(f"   Recommended: Keep Model A, focus on larger improvements")
    else:
        decision = "INCONCLUSIVE 🤷"
        print(f"\n❌ {decision}")
        print(f"   Not statistically significant - could be noise")
        print(f"   Recommended: Extend test duration or increase sample size")

    return decision

# Apply decision framework to our test
effect_size = fr_rate_a - fr_rate_b  # Improvement in false reject rate
guardrail_ok = fa_rate_b <= fa_rate_a * 1.05  # Allow 5% relative guardrail degradation

decision = make_ab_decision(
    p_value=p_value,
    effect_size=effect_size,
    guardrail_ok=guardrail_ok,
    min_effect=1.0,
    guardrail_control=fa_rate_a,
    guardrail_treatment=fa_rate_b
)
```

## 10. Project Templates

### Project 1: Production A/B Testing Infrastructure
**Objective:** Build reusable A/B testing platform for all ML models
- Create traffic splitter routing production requests to variant A or B
- Implement metric logging (latency, accuracy, business KPIs)
- Build real-time dashboard showing test progress
- Auto-stop feature if guardrails violated
- **Success Metric:** Deploy 5+ A/B tests in 6 months, <2% infrastructure overhead

### Project 2: Sequential A/B Testing with Early Stopping
**Objective:** Implement SPRT (Sequential Probability Ratio Test) for faster decisions
- Calculate upper/lower boundaries for cumulative test statistic
- Stop test as soon as crossing boundary (don't wait for fixed duration)
- Reduce average test time 40% while maintaining Type I/II error rates
- Validate with simulations before production deployment
- **Success Metric:** Reduce test duration from 4 weeks to 10 days on average
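
As a starting point for this project, the boundary logic of Wald's SPRT can be sketched for a single stream of 0/1 outcomes. This is a simplification: it tests one variant's event rate against two fixed hypothesized rates (`p0`, `p1`) rather than comparing two live arms, and the function name and example rates are assumptions made for illustration.

```python
import math
import random


def sprt_bernoulli(outcomes, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: p = p0 versus H1: p = p1 on a 0/1 stream.

    Returns (decision, n_used); decision is 'accept_h1', 'accept_h0',
    or 'continue' if the stream ends before a boundary is crossed.
    """
    upper = math.log((1 - beta) / alpha)   # crossing above: accept H1
    lower = math.log(beta / (1 - alpha))   # crossing below: accept H0
    llr = 0.0
    n_used = 0
    for n_used, x in enumerate(outcomes, start=1):
        # Add the log-likelihood ratio contribution of this observation
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n_used
        if llr <= lower:
            return "accept_h0", n_used
    return "continue", n_used


# Hypothetical stream: false rejects (1) occurring at a 5% rate.
# H0: rate is the 10% baseline; H1: rate has dropped to 5%.
rng = random.Random(1)
stream = [1 if rng.random() < 0.05 else 0 for _ in range(2000)]
decision, n_used = sprt_bernoulli(stream, p0=0.10, p1=0.05)
print(f"Decision: {decision} after {n_used} observations")
```

The two boundaries are derived from the target Type I and Type II error rates, which is what allows stopping at any observation without the false positive inflation that comes from informally peeking at a fixed-horizon test.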

### Project 3: Multi-Metric A/B Testing
**Objective:** Optimize for multiple objectives simultaneously
- Primary: False reject rate, Secondary: Test time, Guardrail: False accept rate
- Use Bonferroni correction for multiple comparisons
- Build Pareto frontier of non-dominated solutions
- Let stakeholders choose preferred trade-off point
- **Success Metric:** Deployed model balances 3 metrics vs optimizing single metric
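
The Bonferroni correction mentioned above can be sketched in a few lines, shown here next to Holm's step-down procedure, which controls the same family-wise error rate and is never less powerful. The three p-values are hypothetical placeholders for the primary, secondary, and guardrail metrics.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down procedure; controls the family-wise error rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one hypothesis is retained, all larger p-values are too
    return reject


# Hypothetical p-values for: false reject rate, test time, false accept rate
p_values = [0.012, 0.020, 0.200]
print("Bonferroni:", bonferroni(p_values))
print("Holm:      ", holm(p_values))
```

With these inputs Bonferroni rejects only the first hypothesis, while Holm also rejects the second, because each successive p-value is compared against a less strict threshold.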

### Project 4: Contextual Bandits for Personalized Models
**Objective:** Route to best model based on lot characteristics
- Features: Product family, wafer fab, test program version
- LinUCB algorithm for exploration-exploitation
- Learn which model works best for which context
- Deploy hybrid system using multiple specialized models
- **Success Metric:** 10% better performance than single global model

### Project 5: A/B Testing ROI Calculator
**Objective:** Business case tool for justifying test investments
- Inputs: False reject cost, test duration, deployment effort
- Calculate NPV of deploying Model B vs staying with Model A
- Sensitivity analysis on key assumptions
- Automated report generation for management
- **Success Metric:** 100% of A/B tests have pre-approved ROI threshold

### Project 6: Bayesian A/B Testing
**Objective:** Replace frequentist tests with Bayesian credible intervals
- Implement Bayesian t-test with informative priors
- Report probability that Model B is better (not just p-values)
- Incorporate domain knowledge ("Model B shouldn't be 50% better")
- Continuous monitoring with posterior updates
- **Success Metric:** More intuitive results for stakeholders, faster convergence
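
For a rate metric, "probability that Model B is better" can be estimated directly from Beta posteriors, using the same Beta-Bernoulli model as the Thompson sampling section. The sketch below uses a uniform Beta(1, 1) prior and Monte Carlo draws from the standard library; the function name `prob_b_better` and the counts are hypothetical, and an informative prior would replace the two `1 +` terms.

```python
import random


def prob_b_better(events_a, n_a, events_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B < rate_A) under Beta(1, 1) priors.

    'events' are counts of an unwanted outcome (e.g. false rejects),
    so a lower rate is better.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + events_a, 1 + n_a - events_a)
        rate_b = rng.betavariate(1 + events_b, 1 + n_b - events_b)
        wins += rate_b < rate_a
    return wins / draws


# Hypothetical counts: 18/100 false rejects for A, 9/100 for B
print(f"P(B has lower false reject rate) = {prob_b_better(18, 100, 9, 100):.3f}")
```

The output reads as a direct statement about the models, which is usually easier to communicate to stakeholders than a p-value, but it depends on the chosen prior and should be reported together with it.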

### Project 7: Automated A/B Test Analysis
**Objective:** Auto-generate insights from completed tests
- Segment analysis: Which product families benefit most?
- Temporal analysis: Does performance vary by time-of-day/week?
- Novelty detection: Flag unusual patterns in test data
- Natural language summary: "Model B reduces costs 12% for Product X"
- **Success Metric:** Zero manual analysis, insights delivered within 1 hour of test completion

### Project 8: Long-Term Holdout Validation
**Objective:** Catch slow degradation missed by short A/B tests
- Keep 5% traffic on Model A permanently (even after B wins)
- Monitor for concept drift over 3-6 months
- Detect if Model B advantage disappears over time
- Alert if Model A becomes better (trigger rollback)
- **Success Metric:** Catch 2+ drift-related failures before impacting 100% of traffic

## 🎓 Key Takeaways

**When to Use A/B Testing:**
- ✅ **Production deployment**: Always test new models in production before 100% rollout
- ✅ **High-stakes decisions**: When mistakes cost >$100K (yield, quality, safety)
- ✅ **Uncertain offline-online correlation**: When offline metrics poorly predict real impact
- ✅ **Multiple candidates**: Compare 2-5 model variants simultaneously
- ✅ **Iterative improvement**: Culture of continuous experimentation

**When NOT to Use A/B Testing:**
- ❌ **Low traffic**: When the sample size required by the power analysis can't be collected in a reasonable time
- ❌ **Immediate need**: Can't wait weeks for statistical significance
- ❌ **Unsafe testing**: Can't expose customers/devices to potentially worse variant
- ❌ **Identical offline performance**: If offline metrics are identical and no production-only difference is expected, an A/B test is unlikely to separate the models

**Critical Success Factors:**
1. **Randomization**: Truly random assignment (coin flip, hash function)
2. **Sample size**: Power analysis upfront, don't start underpowered tests
3. **Guardrails**: Monitor quality/safety metrics, auto-stop if violated
4. **Pre-registration**: Define success criteria before test starts (avoid p-hacking)
5. **Multiple metrics**: Track primary + secondary + guardrail metrics
6. **Iteration**: Failed test = learning, iterate on Model B and re-test

**Common Pitfalls:**
- ⚠️ **Peeking**: Checking results early and stopping when "significant" (inflates false positives)
- ⚠️ **Multiple testing**: Running many A/B tests without correction (Bonferroni, FDR)
- ⚠️ **Sample ratio mismatch**: An observed split that deviates from the planned 50/50 by more than chance allows (check with a chi-square test) indicates a bias in assignment or logging
- ⚠️ **Ignoring guardrails**: Optimizing primary metric at expense of quality
- ⚠️ **Novelty effects**: Initial improvement fades after 2-4 weeks
- ⚠️ **Insufficient power**: Starting a test that is unlikely to detect real effects at the planned sample size
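
Whether a split counts as a sample ratio mismatch depends on the sample size, not on the percentages alone. A minimal check is a chi-square goodness-of-fit test against the planned ratio; with two arms there is one degree of freedom, so the p-value can be computed with `math.erfc` and no extra dependencies. The function name `srm_p_value` and the counts below are illustrative.

```python
import math


def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    return math.erfc(math.sqrt(chi2 / 2))


# The same 51/49 ratio at two very different sample sizes
for n_a, n_b in [(102, 98), (51_000, 49_000)]:
    print(f"{n_a}/{n_b}: p = {srm_p_value(n_a, n_b):.4g}")
```

A 51/49 split over 200 lots is well within chance, while the same ratio over 100,000 units is strong evidence that assignment or logging is broken. A small alert threshold (for example p < 0.001, a commonly used convention) avoids false alarms when the check is run repeatedly.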

**Best Practices:**
1. **A/A test first**: Validate infrastructure has no bias (should see no difference)
2. **Pre-compute sample size**: Don't guess; use power analysis
3. **Define MDE**: Minimum detectable effect, the smallest improvement worth deploying
4. **Segment analysis**: Does Model B help all customer segments equally?
5. **Long-term holdout**: Keep 5% traffic on Model A to catch drift
6. **Document everything**: Test plan, results, decision rationale
7. **Automate analysis**: Reduce human error in statistical tests
8. **Gradual rollout**: 10% → 50% → 100% over days/weeks
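
To make the "A/A test first" practice concrete, the sketch below simulates many A/A tests in which both arms share the same true event rate and counts how often a two-proportion z-test comes out significant. In a healthy setup that share should sit near the chosen α. The rate, arm size, and number of simulated tests are hypothetical, and the z-test is re-implemented with the standard library so the block runs on its own.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(x_a, n_a, x_b, n_b):
    """Two-sided p-value of the pooled two-proportion z-test."""
    pooled = (x_a + x_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # identical all-or-nothing arms: no evidence of a difference
    z = (x_a / n_a - x_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_false_positive_rate(rate=0.10, n_per_arm=500, n_tests=1000, alpha=0.05, seed=0):
    """Share of simulated A/A tests that come out 'significant' at level alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        x_a = sum(rng.random() < rate for _ in range(n_per_arm))
        x_b = sum(rng.random() < rate for _ in range(n_per_arm))
        hits += two_proportion_p_value(x_a, n_per_arm, x_b, n_per_arm) < alpha
    return hits / n_tests


print(f"A/A false positive rate: {aa_false_positive_rate():.3f}")
```

If a real A/A test on production infrastructure flags differences much more often than α, the usual suspects are biased assignment, unequal logging between arms, or correlated units being treated as independent.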

**Statistical Checklist:**
- [ ] Null/alternative hypotheses defined
- [ ] Significance level (α) and power (1-β) set
- [ ] Sample size calculated and achievable
- [ ] Randomization mechanism validated (A/A test)
- [ ] Primary metric defined and measurable
- [ ] Guardrail metrics defined with thresholds
- [ ] Test duration estimated (time to N samples)
- [ ] Early stopping rules defined (if using sequential testing)
- [ ] Multiple testing correction planned (if >1 metric)

**Next Steps:**
- Study **107: Model Monitoring** for post-deployment tracking
- Explore Bayesian A/B testing (more intuitive for stakeholders)
- Learn multi-armed bandits for faster convergence
- Read "Trustworthy Online Controlled Experiments" (Kohavi et al.)
- Practice with real production traffic (start with low-stakes models)