# 110: Experimental Design & A/B Testing

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Understand** statistical hypothesis testing and experimental design principles
- **Calculate** sample sizes using power analysis for reliable experiments
- **Implement** A/B tests with proper statistical rigor (t-tests, chi-square, Bayesian)
- **Apply** multi-armed bandit strategies for adaptive experimentation
- **Evaluate** experiment validity using p-values, confidence intervals, and effect sizes
- **Design** production-ready A/B testing frameworks for post-silicon and product development

## 📚 What is Experimental Design?

**Experimental design** is the systematic planning of experiments to answer specific questions while controlling for confounding variables. Done well, it lets you attribute observed effects to treatments (interventions) rather than to chance or bias.

**A/B testing** (also called split testing) is the most common experimental design in tech, comparing two variants (A = control, B = treatment) to determine which performs better on a key metric. It's the foundation of data-driven decision making in product development, marketing, and engineering.

**Why Experimental Design?**
- ✅ **Causal Inference**: Randomization supports causal claims, not just correlations
- ✅ **Risk Mitigation**: Test changes on small groups before full rollout
- ✅ **Quantified Impact**: Measure effect size with statistical confidence
- ✅ **Optimization**: Continuously improve products/processes through iteration

## 🏭 Post-Silicon Validation Use Cases

**Test Program Optimization**
- Input: Test flow variants (e.g., parallel vs sequential test insertion)
- Output: Test time reduction % with maintained defect coverage
- Value: $500K+ annual savings per tester (faster throughput)

**Device Binning Strategy**
- Input: Alternative voltage/frequency bin thresholds
- Output: Yield improvement % vs product quality trade-off
- Value: 2-5% yield gain = millions in revenue for high-volume products

**Burn-In Process Effectiveness**
- Input: Burn-in duration variants (24hr vs 48hr vs 72hr)
- Output: Infant mortality reduction % vs cost increase
- Value: Reduced field failures, warranty costs

**Parametric Limit Tuning**
- Input: Tighter vs relaxed test limits for Vdd, Idd, frequency
- Output: Defect escape rate vs yield loss
- Value: Balance quality (customer satisfaction) with profitability

## 🔄 Experimental Design Workflow

```mermaid
graph LR
    A[Define Hypothesis] --> B[Calculate Sample Size]
    B --> C[Randomize Assignment]
    C --> D[Run Experiment]
    D --> E[Collect Data]
    E --> F[Statistical Test]
    F --> G{Significant?}
    G -->|Yes| H[Implement Winner]
    G -->|No| I[No Change]
    
    style A fill:#e1f5ff
    style H fill:#e1ffe1
    style I fill:#ffe1e1
```
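
The "Randomize Assignment" step in the workflow can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed method: the `assign_variant` helper, the salt string, and the device ID format are all hypothetical. Hashing the ID makes the assignment deterministic and reproducible, so the same unit always lands in the same variant regardless of arrival order.

```python
import hashlib

def assign_variant(unit_id: str, salt: str = "exp-110", variants=("A", "B")) -> str:
    """Deterministically map a unit ID to a variant (hypothetical helper)."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).hexdigest()
    # Interpret the hash as an integer and bucket it across the variants.
    return variants[int(digest, 16) % len(variants)]

# Hypothetical device IDs, for illustration only
device_ids = [f"DEV-{i:05d}" for i in range(10_000)]
assignments = [assign_variant(d) for d in device_ids]
share_A = assignments.count("A") / len(assignments)
print(f"Share assigned to A: {share_A:.3f}")
```

Changing the salt per experiment keeps assignments independent across experiments that reuse the same IDs.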

## 📊 Learning Path Context

**Prerequisites:**
- 010: Linear Regression (hypothesis testing basics)
- 106: A/B Testing ML Models (model comparison context)

**Next Steps:**
- 111: Causal Inference (advanced treatment effect estimation)
- 112: Bayesian Statistics (Bayesian A/B testing)

---

Let's build rigorous experimentation systems! 🚀

## 1. Setup & Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import chi2_contingency, ttest_ind, mannwhitneyu
from statsmodels.stats.power import TTestIndPower, NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize
import warnings
warnings.filterwarnings('ignore')

# Visualization settings
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 10

# Random seed for reproducibility
np.random.seed(42)

print("✅ Libraries loaded successfully!")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")

## 2. Statistical Power Analysis

**Purpose:** Calculate required sample size to detect meaningful effects with high confidence.

**Key Points:**
- **Statistical Power (1 - β)**: Probability of detecting a true effect (target: 80%+)
- **Significance Level (α)**: Probability of a false positive (Type I error, typically 0.05)
- **Effect Size (Cohen's d)**: Magnitude of difference (small=0.2, medium=0.5, large=0.8)
- **Sample Size Trade-off**: Larger samples detect smaller effects but cost more

**Why This Matters:** Underpowered experiments waste resources and miss real effects. Overpowered experiments waste money detecting trivial differences.
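
To make the cost of an underpowered design concrete, power can also be estimated by simulation. The sketch below is illustrative only: `simulated_power` is a hypothetical helper, the data are synthetic unit-variance normals (so the mean shift equals Cohen's d), and the simulation count and seed are arbitrary choices.

```python
import numpy as np
from scipy.stats import ttest_ind

def simulated_power(effect_size, n_per_group, alpha=0.05, n_sims=2000, seed=0):
    """Fraction of simulated experiments in which the t-test rejects H0."""
    rng = np.random.default_rng(seed)
    # One row per simulated experiment; unit variance so the shift equals Cohen's d.
    a = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    b = rng.normal(effect_size, 1.0, size=(n_sims, n_per_group))
    p_values = ttest_ind(a, b, axis=1).pvalue
    return float(np.mean(p_values < alpha))

power_small_n = simulated_power(0.5, n_per_group=20)
power_target_n = simulated_power(0.5, n_per_group=64)
print(f"Power at n=20 per group: {power_small_n:.2f}")
print(f"Power at n=64 per group: {power_target_n:.2f}")
```

With a medium effect, the smaller design should reject far less often than the larger one, which is what "missing real effects" means in practice.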

In [None]:
# Power analysis for t-test (continuous metrics)
def calculate_sample_size_ttest(effect_size, alpha=0.05, power=0.8):
    """
    Calculate required sample size per group for independent t-test.
    
    Parameters:
    - effect_size: Cohen's d (standardized difference between means)
    - alpha: Significance level (Type I error rate)
    - power: Statistical power (1 - Type II error rate)
    
    Returns:
    - Required sample size per group
    """
    analysis = TTestIndPower()
    sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=1.0)
    return int(np.ceil(sample_size))

# Example: Post-silicon test time reduction experiment
# Current avg test time: 5.0s, New flow: 4.5s, Std: 0.8s
# Effect size = (5.0 - 4.5) / 0.8 = 0.625 (medium-large effect)

effect_sizes = [0.2, 0.5, 0.8]  # Small, medium, large
labels = ['Small (0.2)', 'Medium (0.5)', 'Large (0.8)']

print("Sample Size Requirements for t-test:")
print("=" * 60)
for effect, label in zip(effect_sizes, labels):
    n = calculate_sample_size_ttest(effect)
    print(f"Effect Size {label:15s}: {n:4d} samples per group ({n*2} total)")

# Visualize power curves
sample_sizes = np.arange(10, 500, 10)
fig, ax = plt.subplots(figsize=(10, 6))

for effect, label in zip(effect_sizes, labels):
    analysis = TTestIndPower()
    power_values = [analysis.solve_power(effect_size=effect, nobs1=n, alpha=0.05, ratio=1.0) for n in sample_sizes]
    ax.plot(sample_sizes, power_values, label=label, linewidth=2)

ax.axhline(y=0.8, color='red', linestyle='--', label='Target Power (80%)')
ax.set_xlabel('Sample Size per Group')
ax.set_ylabel('Statistical Power')
ax.set_title('Power Curves for Different Effect Sizes (α = 0.05)')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\n📊 Interpretation: To detect a medium effect (0.5) with 80% power, you need ~64 devices per variant.")

## 3. A/B Test: Continuous Metrics (t-test)

**Purpose:** Compare means of two groups when the metric is continuous (e.g., test time, voltage, revenue).

**Key Points:**
- **Null Hypothesis (H₀)**: Mean_A = Mean_B (no difference)
- **Alternative Hypothesis (H₁)**: Mean_A ≠ Mean_B (two-tailed test)
- **T-statistic**: Measures how many standard errors the means differ by
- **P-value**: Probability of observing a difference at least this large if H₀ is true (p < 0.05 → reject H₀)

**Why This Matters:** T-tests are the workhorse of A/B testing for continuous outcomes. Post-silicon examples: test time, power consumption, frequency performance.
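
The "how many standard errors" reading of the t-statistic can be checked by hand. The sketch below swaps in Welch's unequal-variance form, which is a reasonable choice when the two groups have different spreads (as with the 0.8s and 0.7s standard deviations simulated in this section). The data are synthetic and the seed is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(5.0, 0.8, 100)  # control, synthetic
b = rng.normal(4.5, 0.7, 100)  # treatment, synthetic

# Standard error of the difference, without assuming equal variances
var_a, var_b = a.var(ddof=1), b.var(ddof=1)
se = np.sqrt(var_a / a.size + var_b / b.size)
t_manual = (a.mean() - b.mean()) / se

# Welch-Satterthwaite approximation to the degrees of freedom
df = (var_a / a.size + var_b / b.size) ** 2 / (
    (var_a / a.size) ** 2 / (a.size - 1) + (var_b / b.size) ** 2 / (b.size - 1)
)
p_manual = 2 * stats.t.sf(abs(t_manual), df)

# SciPy's Welch test should agree with the manual calculation
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(f"manual t = {t_manual:.3f}, scipy t = {t_scipy:.3f}")
```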

In [None]:
# Simulate A/B test: Test flow optimization
# Control (A): Current test flow, avg = 5.0s, std = 0.8s
# Treatment (B): Optimized flow, avg = 4.5s, std = 0.7s (10% faster)

np.random.seed(123)
n_samples = 100  # Per group (more than the ~40 per group that power analysis suggests for effect size ~0.625)

# Generate data
test_time_A = np.random.normal(5.0, 0.8, n_samples)  # Control
test_time_B = np.random.normal(4.5, 0.7, n_samples)  # Treatment

# Create dataframe
ab_data = pd.DataFrame({
    'variant': ['A'] * n_samples + ['B'] * n_samples,
    'test_time_sec': np.concatenate([test_time_A, test_time_B])
})

# Descriptive statistics
summary_stats = ab_data.groupby('variant')['test_time_sec'].agg(['mean', 'std', 'count'])
print("A/B Test Summary Statistics:")
print(summary_stats)

# Perform t-test
t_stat, p_value = ttest_ind(test_time_A, test_time_B)
mean_diff = test_time_A.mean() - test_time_B.mean()
percent_improvement = (mean_diff / test_time_A.mean()) * 100

# Cohen's d (effect size), using sample standard deviations (ddof=1)
pooled_std = np.sqrt(((n_samples - 1) * test_time_A.std(ddof=1)**2 + (n_samples - 1) * test_time_B.std(ddof=1)**2) / (2 * n_samples - 2))
cohens_d = mean_diff / pooled_std

# Confidence interval (95%), normal approximation with sample variances (ddof=1)
se_diff = np.sqrt(test_time_A.var(ddof=1)/n_samples + test_time_B.var(ddof=1)/n_samples)
ci_lower = mean_diff - 1.96 * se_diff
ci_upper = mean_diff + 1.96 * se_diff

print(f"\n{'='*60}")
print(f"T-Test Results:")
print(f"{'='*60}")
print(f"Mean Difference (A - B): {mean_diff:.3f} seconds")
print(f"Improvement: {percent_improvement:.2f}%")
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.6f}")
print(f"Cohen's d (Effect Size): {cohens_d:.3f}")
print(f"95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")

if p_value < 0.05:
    print(f"\n✅ RESULT: Statistically significant (p < 0.05)")
    print(f"   Decision: Adopt variant B (optimized test flow)")
    print(f"   Expected savings: {percent_improvement:.1f}% test time reduction")
else:
    print(f"\n❌ RESULT: Not statistically significant (p ≥ 0.05)")
    print(f"   Decision: Insufficient evidence to change")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram comparison
axes[0].hist(test_time_A, bins=20, alpha=0.6, label='Control (A)', color='blue', edgecolor='black')
axes[0].hist(test_time_B, bins=20, alpha=0.6, label='Treatment (B)', color='green', edgecolor='black')
axes[0].axvline(test_time_A.mean(), color='blue', linestyle='--', linewidth=2, label=f'Mean A: {test_time_A.mean():.2f}s')
axes[0].axvline(test_time_B.mean(), color='green', linestyle='--', linewidth=2, label=f'Mean B: {test_time_B.mean():.2f}s')
axes[0].set_xlabel('Test Time (seconds)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution Comparison')
axes[0].legend()

# Boxplot comparison
ab_data.boxplot(column='test_time_sec', by='variant', ax=axes[1])
axes[1].set_title('Test Time by Variant')
axes[1].set_xlabel('Variant')
axes[1].set_ylabel('Test Time (seconds)')
plt.suptitle('')  # Remove default title

plt.tight_layout()
plt.show()

## 4. A/B Test: Proportion Metrics (Chi-Square Test)

**Purpose:** Compare conversion rates or proportions between two groups (e.g., pass/fail rates, click-through rates).

**Key Points:**
- **Use Case**: Binary outcomes (pass/fail, click/no-click, buy/no-buy)
- **Chi-Square Test**: Tests independence between categorical variables
- **Expected vs Observed Counts**: Compares actual data to what's expected under H₀
- **Contingency Table**: 2x2 table showing counts for each variant × outcome combination

**Why This Matters:** Most product metrics are proportions (conversion rate, defect rate, yield%). Chi-square is the standard test for categorical A/B tests.
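
For a 2x2 table, the chi-square test without continuity correction is equivalent to a pooled two-proportion z-test (the chi-square statistic equals z squared). The sketch below checks this numerically on hypothetical counts; the counts are invented for illustration and are not results from this notebook.

```python
import numpy as np
from scipy.stats import chi2_contingency, norm

# Hypothetical pass counts, for illustration only
pass_a, n_a = 440, 500
pass_b, n_b = 460, 500

# Pooled two-proportion z-test
p_a, p_b = pass_a / n_a, pass_b / n_b
p_pool = (pass_a + pass_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_z = 2 * norm.sf(abs(z))

# Chi-square test on the same counts, with Yates' correction switched off
table = np.array([[pass_a, n_a - pass_a], [pass_b, n_b - pass_b]])
chi2_stat, p_chi2, dof, expected = chi2_contingency(table, correction=False)

print(f"z^2 = {z**2:.4f}, chi2 = {chi2_stat:.4f}")
print(f"p (z-test) = {p_z:.4f}, p (chi-square) = {p_chi2:.4f}")
```

With the default `correction=True`, SciPy applies Yates' continuity correction to 2x2 tables, so the two p-values would no longer match exactly.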

In [None]:
# Simulate A/B test: Device binning strategy
# Control (A): Conservative limits, 88% pass rate
# Treatment (B): Relaxed limits, 92% pass rate (4 percentage point yield improvement)

np.random.seed(456)
n_devices = 500  # Per variant

# Generate data
pass_rate_A = 0.88
pass_rate_B = 0.92

devices_A = np.random.binomial(1, pass_rate_A, n_devices)  # 1 = pass, 0 = fail
devices_B = np.random.binomial(1, pass_rate_B, n_devices)

# Create contingency table
pass_A = devices_A.sum()
fail_A = n_devices - pass_A
pass_B = devices_B.sum()
fail_B = n_devices - pass_B

contingency_table = np.array([
    [pass_A, fail_A],  # Control
    [pass_B, fail_B]   # Treatment
])

# Chi-square test (for 2x2 tables, SciPy applies Yates' continuity correction by default)
chi2, p_value_chi, dof, expected = chi2_contingency(contingency_table)

# Effect size (Cramer's V)
n_total = contingency_table.sum()
cramers_v = np.sqrt(chi2 / n_total)

# Observed rates
observed_rate_A = pass_A / n_devices
observed_rate_B = pass_B / n_devices
rate_diff = observed_rate_B - observed_rate_A
relative_lift = (rate_diff / observed_rate_A) * 100

print("Contingency Table (Observed Counts):")
print("=" * 40)
print(f"{'Variant':<15} {'Pass':<10} {'Fail':<10} {'Total':<10}")
print(f"{'A (Control)':<15} {pass_A:<10} {fail_A:<10} {n_devices:<10}")
print(f"{'B (Treatment)':<15} {pass_B:<10} {fail_B:<10} {n_devices:<10}")

print(f"\n{'='*60}")
print(f"Chi-Square Test Results:")
print(f"{'='*60}")
print(f"Pass Rate A: {observed_rate_A:.4f} ({observed_rate_A*100:.2f}%)")
print(f"Pass Rate B: {observed_rate_B:.4f} ({observed_rate_B*100:.2f}%)")
print(f"Absolute Difference: {rate_diff:.4f} ({rate_diff*100:.2f} percentage points)")
print(f"Relative Lift: {relative_lift:.2f}%")
print(f"Chi-Square Statistic: {chi2:.3f}")
print(f"P-value: {p_value_chi:.6f}")
print(f"Cramer's V (Effect Size): {cramers_v:.3f}")

if p_value_chi < 0.05:
    print(f"\n✅ RESULT: Statistically significant (p < 0.05)")
    print(f"   Decision: Adopt variant B (relaxed binning limits)")
    print(f"   Expected yield gain: {rate_diff*100:.2f} percentage points")
    print(f"   Revenue impact: ~${(rate_diff * 500_000 * 10):,.0f}/year (assuming $10/device, 500K units/year)")
else:
    print(f"\n❌ RESULT: Not statistically significant (p ≥ 0.05)")
    print(f"   Decision: Insufficient evidence to change binning strategy")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart: Pass rates
variants = ['A (Control)', 'B (Treatment)']
pass_rates = [observed_rate_A, observed_rate_B]
colors = ['blue', 'green']

axes[0].bar(variants, pass_rates, color=colors, alpha=0.7, edgecolor='black')
axes[0].axhline(y=0.88, color='red', linestyle='--', label='Baseline (88%)')
axes[0].set_ylabel('Pass Rate')
axes[0].set_title('Device Pass Rate by Variant')
axes[0].set_ylim(0.8, 1.0)
axes[0].legend()

# Add percentage labels
for i, (variant, rate) in enumerate(zip(variants, pass_rates)):
    axes[0].text(i, rate + 0.01, f'{rate*100:.2f}%', ha='center', fontweight='bold')

# Stacked bar: Pass/Fail counts
pass_counts = [pass_A, pass_B]
fail_counts = [fail_A, fail_B]

axes[1].bar(variants, pass_counts, label='Pass', color='green', alpha=0.7, edgecolor='black')
axes[1].bar(variants, fail_counts, bottom=pass_counts, label='Fail', color='red', alpha=0.7, edgecolor='black')
axes[1].set_ylabel('Count')
axes[1].set_title('Pass/Fail Distribution')
axes[1].legend()

plt.tight_layout()
plt.show()

## 5. Bayesian A/B Testing

**Purpose:** Use Bayesian inference to estimate probability that variant B is better than A (more intuitive than p-values).

**Key Points:**
- **Prior Distribution**: Initial belief before seeing data (e.g., Beta(1,1) = uniform)
- **Likelihood**: Observed data (successes/failures)
- **Posterior Distribution**: Updated belief after seeing data (Beta distribution)
- **Probability B > A**: Direct probability statement ("95% chance B is better")

**Why This Matters:** Bayesian A/B testing yields direct probability statements, which are often easier to communicate than p-values. It also supports early stopping based on posterior probability or expected loss, provided the stopping threshold is fixed in advance.
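
The prior-to-posterior update for a pass rate is simple arithmetic, which makes it easy to see how much the prior matters. The sketch below uses hypothetical helpers (`beta_posterior`, `beta_mean`) and invented data (46 passes out of 50) to compare a uniform prior with an informative prior centred on 88%.

```python
def beta_posterior(alpha_prior, beta_prior, successes, failures):
    """Conjugate update: Beta prior plus binomial data gives a Beta posterior."""
    return alpha_prior + successes, beta_prior + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical data: 46 passes out of 50 devices
successes, failures = 46, 4

uniform_post = beta_posterior(1, 1, successes, failures)        # Beta(47, 5)
informative_post = beta_posterior(88, 12, successes, failures)  # Beta(134, 16)

print(f"Uniform prior     -> posterior mean {beta_mean(*uniform_post):.4f}")
print(f"Informative prior -> posterior mean {beta_mean(*informative_post):.4f}")
```

The informative prior acts like 100 extra pseudo-observations at 88%, so it pulls the posterior mean toward 0.88. As real sample sizes grow, the two posteriors converge.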

In [None]:
from scipy.stats import beta

# Use same data from chi-square test
# Prior: Beta(1, 1) = uniform (no prior knowledge)
alpha_prior = 1
beta_prior = 1

# Posterior parameters (Beta distribution)
# Posterior ~ Beta(alpha_prior + successes, beta_prior + failures)
alpha_A = alpha_prior + pass_A
beta_A = beta_prior + fail_A

alpha_B = alpha_prior + pass_B
beta_B = beta_prior + fail_B

# Sample from posterior distributions
n_samples_bayes = 100000
samples_A = np.random.beta(alpha_A, beta_A, n_samples_bayes)
samples_B = np.random.beta(alpha_B, beta_B, n_samples_bayes)

# Probability that B > A
prob_B_better = (samples_B > samples_A).mean()

# Expected loss if we choose wrong variant
loss_if_choose_A = np.maximum(samples_B - samples_A, 0).mean()  # Loss if we pick A but B is better
loss_if_choose_B = np.maximum(samples_A - samples_B, 0).mean()  # Loss if we pick B but A is better

# Credible intervals (Bayesian equivalent of confidence intervals)
ci_A = np.percentile(samples_A, [2.5, 97.5])
ci_B = np.percentile(samples_B, [2.5, 97.5])

print("Bayesian A/B Test Results:")
print("=" * 60)
print(f"Posterior Mean (A): {samples_A.mean():.4f}")
print(f"Posterior Mean (B): {samples_B.mean():.4f}")
print(f"95% Credible Interval (A): [{ci_A[0]:.4f}, {ci_A[1]:.4f}]")
print(f"95% Credible Interval (B): [{ci_B[0]:.4f}, {ci_B[1]:.4f}]")
print(f"\nProbability B > A: {prob_B_better:.4f} ({prob_B_better*100:.2f}%)")
print(f"Expected Loss if Choose A: {loss_if_choose_A:.6f}")
print(f"Expected Loss if Choose B: {loss_if_choose_B:.6f}")

if prob_B_better > 0.95:
    print(f"\n✅ DECISION: Choose variant B ({prob_B_better*100:.0f}% probability of being better)")
elif prob_B_better < 0.05:
    print(f"\n✅ DECISION: Choose variant A ({(1-prob_B_better)*100:.0f}% probability of being better)")
else:
    print(f"\n⚠️ DECISION: Inconclusive - continue testing or use business judgment")

# Visualization: Posterior distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Posterior distributions
x = np.linspace(0.8, 1.0, 1000)
posterior_A = beta.pdf(x, alpha_A, beta_A)
posterior_B = beta.pdf(x, alpha_B, beta_B)

axes[0].plot(x, posterior_A, label='A (Control)', color='blue', linewidth=2)
axes[0].plot(x, posterior_B, label='B (Treatment)', color='green', linewidth=2)
axes[0].fill_between(x, posterior_A, alpha=0.3, color='blue')
axes[0].fill_between(x, posterior_B, alpha=0.3, color='green')
axes[0].set_xlabel('Pass Rate')
axes[0].set_ylabel('Probability Density')
axes[0].set_title('Posterior Distributions')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Difference distribution (B - A)
difference = samples_B - samples_A
axes[1].hist(difference, bins=50, color='purple', alpha=0.7, edgecolor='black')
axes[1].axvline(x=0, color='red', linestyle='--', linewidth=2, label='No Difference')
axes[1].axvline(x=difference.mean(), color='green', linestyle='-', linewidth=2, label=f'Mean: {difference.mean():.4f}')
axes[1].set_xlabel('Pass Rate Difference (B - A)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Pass Rate Improvement')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n📊 Interpretation: With {prob_B_better*100:.1f}% probability, variant B has higher pass rate.")

## 6. Multi-Armed Bandit (Thompson Sampling)

**Purpose:** Adaptive experimentation that balances exploration (trying variants) and exploitation (using best variant).

**Key Points:**
- **Regret Minimization**: Reduce opportunity cost of using suboptimal variants during testing
- **Thompson Sampling**: Bayesian algorithm that samples from posterior and picks best sample
- **Dynamic Allocation**: Automatically shifts traffic to better-performing variants
- **Continuous Learning**: No fixed test duration, keeps optimizing

**Why This Matters:** A traditional 50/50 A/B test sends half of all traffic to the inferior variant for the entire test. Bandits reduce this cost while still learning which variant is best.
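
The opportunity cost of a fixed split can be worked out directly. The sketch below computes expected regret (passes given up relative to always using the best arm) for an equal three-way split, using the illustrative pass rates of this section's simulation. These are expected values under known rates, not measured results.

```python
true_rates = [0.88, 0.91, 0.93]  # illustrative pass rates (conservative, moderate, aggressive)
n_rounds = 1000

best_rate = max(true_rates)
mean_rate = sum(true_rates) / len(true_rates)

# Each round of an equal split gives up (best - average) expected passes.
equal_split_regret = n_rounds * (best_rate - mean_rate)

# Sending every device to the worst arm gives the upper bound on regret.
worst_case_regret = n_rounds * (best_rate - min(true_rates))

print(f"Expected regret, equal split: {equal_split_regret:.1f} passes")
print(f"Expected regret, worst arm only: {worst_case_regret:.1f} passes")
```

A bandit's cumulative regret can be compared against the equal-split figure to judge how much adaptive allocation saved.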

In [None]:
# Simulate multi-armed bandit for device binning optimization
# 3 variants: Conservative (88%), Moderate (91%), Aggressive (93%)

class ThompsonSamplingBandit:
    def __init__(self, n_arms, true_rates):
        self.n_arms = n_arms
        self.true_rates = true_rates  # True pass rates (unknown to algorithm)
        
        # Prior: Beta(1, 1) for each arm
        self.alpha = np.ones(n_arms)
        self.beta = np.ones(n_arms)
        
        # Tracking
        self.pulls = np.zeros(n_arms)
        self.successes = np.zeros(n_arms)
        self.cumulative_reward = 0
        self.cumulative_regret = 0
        self.history = []
        
    def select_arm(self):
        """Thompson Sampling: Sample from each arm's posterior and pick highest."""
        samples = [np.random.beta(self.alpha[i], self.beta[i]) for i in range(self.n_arms)]
        return np.argmax(samples)
    
    def update(self, arm, reward):
        """Update posterior after observing reward (1 = pass, 0 = fail)."""
        self.pulls[arm] += 1
        self.successes[arm] += reward
        
        # Update Beta parameters
        self.alpha[arm] += reward
        self.beta[arm] += (1 - reward)
        
        # Track performance
        self.cumulative_reward += reward
        best_rate = max(self.true_rates)
        self.cumulative_regret += (best_rate - self.true_rates[arm])
        
        self.history.append({
            'arm': arm,
            'reward': reward,
            'cumulative_regret': self.cumulative_regret
        })
    
    def run(self, n_rounds):
        """Run bandit for n_rounds."""
        for _ in range(n_rounds):
            arm = self.select_arm()
            reward = np.random.binomial(1, self.true_rates[arm])  # Simulate device test
            self.update(arm, reward)

# Run simulation
np.random.seed(789)
true_rates = [0.88, 0.91, 0.93]  # Conservative, Moderate, Aggressive
n_rounds = 1000

bandit = ThompsonSamplingBandit(n_arms=3, true_rates=true_rates)
bandit.run(n_rounds)

# Results
print("Multi-Armed Bandit Results (Thompson Sampling):")
print("=" * 60)
for i in range(3):
    empirical_rate = bandit.successes[i] / bandit.pulls[i] if bandit.pulls[i] > 0 else 0
    print(f"Arm {i} (True Rate: {true_rates[i]:.2f}):")
    print(f"  Pulls: {int(bandit.pulls[i])} ({bandit.pulls[i]/n_rounds*100:.1f}%)")
    print(f"  Successes: {int(bandit.successes[i])}")
    print(f"  Empirical Rate: {empirical_rate:.4f}")

best_arm = np.argmax(bandit.successes / (bandit.pulls + 1e-10))
print(f"\n✅ Best Arm (by empirical rate): {best_arm} (True Rate: {true_rates[best_arm]:.2f})")
print(f"Cumulative Reward: {bandit.cumulative_reward:.0f} / {n_rounds} = {bandit.cumulative_reward/n_rounds:.4f}")
print(f"Cumulative Regret: {bandit.cumulative_regret:.2f}")

# Compare to fixed A/B/C test (equal traffic split)
# Note: the fixed-split and oracle figures are expected values; the bandit figure is one realized run
fixed_reward = n_rounds * np.mean(true_rates)  # If we split traffic evenly
optimal_reward = n_rounds * max(true_rates)  # If we knew best arm upfront

print(f"\nComparison:")
print(f"  Fixed A/B/C Test (equal split): {fixed_reward:.0f} expected successes")
print(f"  Thompson Sampling: {bandit.cumulative_reward:.0f} successes")
print(f"  Optimal (oracle): {optimal_reward:.0f} expected successes")
print(f"  Bandit Gain vs Fixed: {bandit.cumulative_reward - fixed_reward:.0f} extra passes")

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Arm selection over time
history_df = pd.DataFrame(bandit.history)
arm_counts = history_df.groupby([history_df.index // 50, 'arm']).size().unstack(fill_value=0)
arm_counts.plot(kind='bar', stacked=True, ax=axes[0, 0], color=['blue', 'orange', 'green'])
axes[0, 0].set_title('Arm Selection Over Time (50-round bins)')
axes[0, 0].set_xlabel('Time Period')
axes[0, 0].set_ylabel('Pulls')
axes[0, 0].legend(['Arm 0 (88%)', 'Arm 1 (91%)', 'Arm 2 (93%)'])

# 2. Cumulative regret
axes[0, 1].plot(history_df['cumulative_regret'], color='red', linewidth=2)
axes[0, 1].set_title('Cumulative Regret Over Time')
axes[0, 1].set_xlabel('Round')
axes[0, 1].set_ylabel('Cumulative Regret')
axes[0, 1].grid(alpha=0.3)

# 3. Final arm distribution
pull_percentages = bandit.pulls / n_rounds * 100
axes[1, 0].bar(['Arm 0\n(88%)', 'Arm 1\n(91%)', 'Arm 2\n(93%)'], pull_percentages, 
               color=['blue', 'orange', 'green'], edgecolor='black', alpha=0.7)
axes[1, 0].set_ylabel('Percentage of Pulls')
axes[1, 0].set_title('Final Traffic Allocation')
axes[1, 0].axhline(y=33.33, color='red', linestyle='--', label='Equal Split (33.3%)')
axes[1, 0].legend()

# 4. Posterior distributions
x_post = np.linspace(0.8, 1.0, 1000)
for i in range(3):
    posterior = beta.pdf(x_post, bandit.alpha[i], bandit.beta[i])
    axes[1, 1].plot(x_post, posterior, label=f'Arm {i}', linewidth=2)
    axes[1, 1].axvline(true_rates[i], color=f'C{i}', linestyle='--', alpha=0.5)

axes[1, 1].set_xlabel('Pass Rate')
axes[1, 1].set_ylabel('Posterior Density')
axes[1, 1].set_title('Learned Posterior Distributions')
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n📊 Interpretation: The bandit sent the most traffic (~{pull_percentages.max():.0f}%) to arm {np.argmax(bandit.pulls)} (true best arm: {np.argmax(true_rates)}).")

## 🚀 Real-World Project Templates

Build production experimentation systems using these frameworks:

### 1️⃣ **Post-Silicon Test Program Optimizer**
- **Objective**: A/B test different test insertion orders to minimize total test time  
- **Data**: 10K+ devices, test times per block, defect coverage metrics  
- **Success Metric**: Reduce test time by 10%+ while maintaining 99%+ defect coverage  
- **Features**: Multi-objective optimization (time vs coverage), sequential testing, factorial design  
- **Tech Stack**: Python, statsmodels, DOE (Design of Experiments), Monte Carlo simulation

### 2️⃣ **Website Conversion Optimization Platform**
- **Objective**: Test landing page variants to maximize sign-up conversion rate  
- **Data**: 100K+ visitors/month, click-through, bounce rate, conversion events  
- **Success Metric**: Increase conversion from 2.5% → 3.0% (20% relative lift)  
- **Features**: Bayesian sequential testing, multi-variant testing, segmentation analysis  
- **Tech Stack**: Google Optimize (discontinued in September 2023; substitute a current platform), Python, BigQuery, Looker dashboards

### 3️⃣ **Email Campaign A/B Testing Engine**
- **Objective**: Test subject lines, send times, content to maximize open/click rates  
- **Data**: 500K subscribers, open rates, click rates, unsubscribe rates  
- **Success Metric**: Improve click-through rate from 3.2% → 4.0%  
- **Features**: Multi-armed bandit for subject lines, time-based segmentation, fatigue analysis  
- **Tech Stack**: Mailchimp API, Python, Thompson Sampling, Redshift

### 4️⃣ **Manufacturing Process Optimization**
- **Objective**: Test burn-in durations to balance infant mortality vs throughput  
- **Data**: 50K devices, field failure rates (0-90 days), burn-in costs  
- **Success Metric**: Reduce infant mortality by 30% with < 5% throughput loss  
- **Features**: Survival analysis, cost-benefit modeling, sequential experimentation  
- **Tech Stack**: JMP/Minitab for DOE, Python (lifelines), Tableau

### 5️⃣ **Recommendation Algorithm A/B Test**
- **Objective**: Test collaborative filtering vs content-based recommendations  
- **Data**: 1M+ users, click-through, watch time, purchase conversion  
- **Success Metric**: Increase engagement time by 15% (avg session: 12 min → 14 min)  
- **Features**: Stratified randomization (by user tenure), long-term holdout, network effects correction  
- **Tech Stack**: Spark, MLflow, custom experimentation framework, Kafka

### 6️⃣ **Pricing Experimentation Platform**
- **Objective**: Test pricing tiers to maximize revenue per user  
- **Data**: 200K customers, price elasticity, churn rates, LTV  
- **Success Metric**: Increase ARPU (average revenue per user) by $5/month  
- **Features**: Conjoint analysis, demand curve estimation, competitive positioning  
- **Tech Stack**: Optimizely, Python (econometrics), Stripe integration, Mixpanel

### 7️⃣ **Mobile App Onboarding Flow Test**
- **Objective**: Optimize tutorial flow to maximize Day-7 retention  
- **Data**: 50K new users/week, onboarding completion, D1/D7/D30 retention  
- **Success Metric**: Improve D7 retention from 40% → 48%  
- **Features**: Funnel analysis, sequential A/B testing, cohort comparison  
- **Tech Stack**: Firebase A/B Testing, Amplitude, Python (survival analysis)

### 8️⃣ **Search Ranking Algorithm Experiment**
- **Objective**: Test BM25 vs neural ranking model for search relevance  
- **Data**: 10M searches/month, click position, dwell time, conversion  
- **Success Metric**: Reduce "pogo-sticking" (return to search) by 20%  
- **Features**: Interleaving experiments, pairwise comparison, novelty/diversity metrics  
- **Tech Stack**: Elasticsearch, custom interleaving framework, ClickHouse, Grafana

## 🎯 Key Takeaways

### What is Experimental Design?
Systematic planning of controlled experiments to establish causal relationships between interventions (treatments) and outcomes while minimizing bias and confounding variables.

### Why A/B Testing?
- **Causal Evidence**: Randomization balances confounders across groups → supports causal claims
- **Risk Management**: Test changes on small samples before full rollout
- **Data-Driven Decisions**: Quantify impact with statistical confidence
- **Continuous Improvement**: Iterate quickly with measured improvements

### Core Statistical Concepts

| **Concept** | **Definition** | **Typical Value** |
|------------|---------------|------------------|
| **Significance Level (α)** | Probability of Type I error (false positive) | 0.05 (5%) |
| **Statistical Power (1-β)** | Probability of detecting a true effect | 0.80 (80%) |
| **Effect Size** | Standardized magnitude of difference | Small: 0.2, Medium: 0.5, Large: 0.8 |
| **P-value** | Probability of data at least as extreme as observed, if H₀ is true | < 0.05 for significance |
| **Confidence Interval** | Range likely containing true parameter | 95% CI common |

### Sample Size Formulas

**T-test (continuous metrics), sample size per group:**
$$n = \frac{2(Z_{\alpha/2} + Z_{\beta})^2 \sigma^2}{\delta^2}$$
- $Z_{\alpha/2}$ = 1.96 for α = 0.05
- $Z_{\beta}$ = 0.84 for power = 0.8
- $\sigma$ = standard deviation
- $\delta$ = minimum detectable effect
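
As a worked example, the formula above can be evaluated with only the standard library. The `n_per_group_means` helper is hypothetical, and the inputs (σ = 0.8s, δ = 0.5s) are the illustrative test-time numbers used earlier in this notebook.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_means(sigma, delta, alpha=0.05, power=0.80):
    """Per-group sample size for comparing two means (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

n_test_time = n_per_group_means(sigma=0.8, delta=0.5)
n_medium_effect = n_per_group_means(sigma=1.0, delta=0.5)
print(n_test_time)      # 41
print(n_medium_effect)  # 63
```

Because this is a normal approximation, it comes out slightly below a t-distribution-based solver (63 here versus the ~64 quoted in Section 2 for a medium effect).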

**Proportion test (conversion rates):**
$$n = \frac{(Z_{\alpha/2} + Z_{\beta})^2 [p_1(1-p_1) + p_2(1-p_2)]}{(p_1 - p_2)^2}$$
- $p_1, p_2$ = baseline and treatment proportions
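
The proportion formula can be evaluated the same way. The `n_per_group_proportions` helper is hypothetical; the inputs are the 88% and 92% pass rates from the binning example in Section 4.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for comparing two proportions (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)

n_binning = n_per_group_proportions(0.88, 0.92)
print(n_binning)  # 880
```

Under this approximation, detecting an 88% versus 92% difference with 80% power calls for roughly 880 devices per variant, which is more than the 500 per variant simulated in Section 4.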

### Test Selection Guide

| **Metric Type** | **Test** | **When to Use** | **Example** |
|----------------|---------|----------------|------------|
| **Continuous (mean)** | T-test | Approximately normal data or large samples; use Welch's variant if variances differ | Test time, revenue, temperature |
| **Continuous (rank-based)** | Mann-Whitney U | Non-normal, outliers | Skewed distributions |
| **Proportion** | Chi-square / Z-test | Binary outcome | Pass/fail, click/no-click |
| **Count data** | Poisson test | Rare events | Defects per wafer |
| **Survival** | Log-rank test | Time-to-event | Device lifetime, churn |
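
As an illustration of the second row of the table, the sketch below runs a Mann-Whitney U test on synthetic right-skewed data, alongside a Welch t-test for comparison. The lognormal parameters, sample sizes, and seed are arbitrary assumptions chosen for the example.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

rng = np.random.default_rng(0)

# Synthetic right-skewed metric (lognormal), for illustration only
control = rng.lognormal(mean=0.0, sigma=1.0, size=200)
treatment = rng.lognormal(mean=1.0, sigma=1.0, size=200)

# Rank-based test: makes no normality assumption
u_stat, p_mw = mannwhitneyu(control, treatment, alternative="two-sided")

# Welch t-test on the raw (skewed) values, for comparison
t_stat, p_t = ttest_ind(control, treatment, equal_var=False)

print(f"Mann-Whitney U = {u_stat:.0f}, p = {p_mw:.2e}")
print(f"Welch t = {t_stat:.2f}, p = {p_t:.2e}")
```

Note that Mann-Whitney compares the two distributions through ranks; it is only a test of medians under additional assumptions, such as both groups having the same shape.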

### Bayesian vs Frequentist A/B Testing

**Frequentist (Classical):**
- ✅ Well-understood, industry standard
- ✅ Fixed sample size, clear stopping rule
- ❌ P-values are easy to misread (they describe the data given H₀, not the probability that H₀ is true)
- ❌ Can't peek at results without a sequential correction (peeking inflates Type I error)

**Bayesian:**
- ✅ Intuitive probabilities ("95% chance B is better")
- ✅ Can update continuously; early stopping is possible with a pre-set decision threshold
- ✅ Incorporates prior knowledge
- ❌ Requires prior specification (can introduce bias)
- ❌ Computationally intensive for complex models

### Multi-Armed Bandits vs Fixed A/B

**Fixed A/B Test:**
- Allocate 50/50 traffic until reaching sample size
- Regret: Sends 50% of traffic to the inferior variant for the full test
- Clear statistical guarantees

**Multi-Armed Bandit (Thompson Sampling):**
- Dynamically allocate more traffic to better variants
- Regret: Grows logarithmically (much better)
- No fixed stopping time (continuous optimization)
- Trade-off: Slower to converge on true best with certainty

**When to Use Bandits:**
- High traffic (can learn quickly)
- Short-term optimizations (email subject lines)
- Cost of using suboptimal variant is high

**When to Use Fixed A/B:**
- Low traffic (need clean statistical test)
- Long-term strategic decisions (product redesign)
- Regulatory/compliance requirements (clear p-value needed)

### Common Pitfalls

- ❌ **Peeking at Results**: Checking p-values repeatedly inflates false positive rate → Use sequential testing or Bayesian methods
- ❌ **Underpowered Tests**: Small samples miss real effects → Always do power analysis first
- ❌ **Multiple Testing**: Running 20 tests with no true effects → expect 1 false positive at α=0.05 → Use Bonferroni correction
- ❌ **Ignoring Novelty Effect**: New variants get temporary boost → Run for full business cycle
- ❌ **Selection Bias**: Non-random assignment → Use proper randomization
- ❌ **Network Effects**: User interactions affect each other → Use cluster randomization
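
The peeking pitfall is easy to demonstrate by simulation. The sketch below runs A/A tests (both groups drawn from the same distribution, so any "significant" result is a false positive) and compares a single final test against stopping at the first of ten interim looks with p < 0.05. The simulation sizes, look schedule, and seed are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, n_max, alpha = 1000, 500, 0.05
looks = range(50, n_max + 1, 50)  # peek every 50 samples per group (10 looks)

fp_fixed = 0    # false positives when testing once, at the final sample size
fp_peeking = 0  # false positives when stopping at the first "significant" look

for _ in range(n_sims):
    # A/A test: identical distributions, so H0 is true by construction
    a = rng.normal(0.0, 1.0, n_max)
    b = rng.normal(0.0, 1.0, n_max)
    p_at_looks = [ttest_ind(a[:n], b[:n]).pvalue for n in looks]
    fp_peeking += any(p < alpha for p in p_at_looks)
    fp_fixed += p_at_looks[-1] < alpha

fpr_fixed = fp_fixed / n_sims
fpr_peeking = fp_peeking / n_sims
print(f"False positive rate, single final test: {fpr_fixed:.3f}")
print(f"False positive rate, 10 peeks: {fpr_peeking:.3f}")
```

The single-test rate should sit near the nominal 5%, while the peeking rate comes out well above it.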

### Post-Silicon Experimentation Best Practices

**Randomization:**
- Randomize at wafer/lot level to avoid tester bias
- Control for spatial effects (edge vs center dies)
- Match on process node, foundry, vintage
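
One way to control for spatial effects is to randomize within strata, so that edge and center dies are split evenly between variants. The sketch below is illustrative only: `stratified_assignment` is a hypothetical helper, and the wafer map (die IDs and zones) is invented.

```python
import random
from collections import defaultdict

def stratified_assignment(units, stratum_of, variants=("A", "B"), seed=42):
    """Shuffle within each stratum, then alternate variants (hypothetical helper)."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for unit_id in units:
        by_stratum[stratum_of(unit_id)].append(unit_id)

    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)  # random order within the stratum
        for position, unit_id in enumerate(members):
            assignment[unit_id] = variants[position % len(variants)]
    return assignment

# Hypothetical wafer map: die IDs 0-59 are edge dies, 60-199 are center dies
die_zone = {die_id: ("edge" if die_id < 60 else "center") for die_id in range(200)}
assignment = stratified_assignment(list(die_zone), stratum_of=die_zone.get)

edge_A = sum(1 for d, v in assignment.items() if die_zone[d] == "edge" and v == "A")
center_A = sum(1 for d, v in assignment.items() if die_zone[d] == "center" and v == "A")
print(f"Edge dies in A: {edge_A}/60, center dies in A: {center_A}/140")
```

The same pattern extends to other strata such as lot or tester, by changing what `stratum_of` returns.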

**Multi-Objective Optimization:**
- Balance test time, defect coverage, yield
- Use Pareto frontier for trade-off analysis
- Weight objectives by business value

**Sequential Testing:**
- Test program changes incrementally (insert → reorder → remove)
- Use Bonferroni correction for multiple comparisons
- Document assumptions for regulatory audit
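
The Bonferroni correction mentioned above amounts to dividing α by the number of comparisons. A minimal sketch, with a hypothetical `bonferroni` helper and invented p-values:

```python
def bonferroni(p_values, alpha=0.05):
    """Return per-test reject decisions and the adjusted threshold alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values], threshold

# Hypothetical p-values from four test-program changes
p_values = [0.001, 0.012, 0.030, 0.200]
reject, threshold = bonferroni(p_values)

print(f"Adjusted threshold: {threshold:.4f}")  # 0.0125
print(reject)                                   # [True, True, False, False]
```

The third p-value (0.030) would pass an uncorrected 0.05 threshold but not the corrected one, which is the conservatism Bonferroni trades for family-wise error control.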

### Production Implementation Checklist

- ✅ **Pre-Experiment:**
  - Define primary metric and guardrail metrics
  - Calculate sample size (power analysis)
  - Set up randomization infrastructure
  - Plan for outlier/anomaly handling

- ✅ **During Experiment:**
  - Monitor sample ratio mismatch (50/50 assignment working?)
  - Check guardrail metrics (no unexpected harm)
  - Log all assignment decisions for reproducibility

- ✅ **Post-Experiment:**
  - Calculate confidence intervals, not just p-values
  - Segment analysis (does effect vary by user type?)
  - Long-term holdout to measure sustained impact
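
The sample ratio mismatch check in the list above can be sketched as a chi-square goodness-of-fit test on the assignment counts. The `srm_p_value` helper and the counts are hypothetical; a very small p-value suggests the randomization or logging is broken, and the alert threshold is a team choice.

```python
from scipy.stats import chisquare

def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """P-value for observed assignment counts against the planned split."""
    total = n_a + n_b
    expected = [total * expected_share_a, total * (1 - expected_share_a)]
    return chisquare([n_a, n_b], f_exp=expected).pvalue

p_balanced = srm_p_value(5000, 5000)  # exactly on plan
p_skewed = srm_p_value(5200, 4800)    # 52/48 split on 10,000 units

print(f"Balanced split p-value: {p_balanced:.4f}")
print(f"Skewed split p-value:   {p_skewed:.6f}")
```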

### Tool Ecosystem

**Experimentation Platforms:**
- **Optimizely, VWO**: Commercial A/B testing (web/mobile)
- **Google Optimize**: Formerly free for small-scale tests (discontinued by Google in September 2023)
- **Statsig, Eppo**: Modern, Bayesian-focused platforms
- **Custom**: Python + feature flags (LaunchDarkly) + analytics DB

**Statistical Analysis:**
- **statsmodels**: Power analysis, hypothesis tests
- **scipy.stats**: T-tests, chi-square, distributions
- **PyMC/Stan**: Bayesian inference
- **JMP/Minitab**: Industrial DOE (design of experiments)

### Next Steps
- **Notebook 111**: Causal Inference (propensity scores, DiD, instrumental variables)
- **Notebook 112**: Bayesian Statistics (PyMC, hierarchical models)
- **Advanced**: Multi-armed bandits with contextual information, adaptive experimental design

---

**Remember**: *In God we trust, all others bring data... and proper experimental design!* 📊