```{contents}
```
## A/B Testing

---

### 1. Definition

**A/B testing in Generative AI** is a controlled experimental framework for comparing two or more versions of a generative system (models, prompts, pipelines, or policies) under real usage to determine which produces better outcomes according to defined metrics.

Formally:

> Given variants $A$ and $B$, expose users or tasks randomly and independently, measure performance, and perform statistical inference to select the superior system.

---

### 2. Why A/B Testing is Special for Generative AI

| Classical A/B        | Generative AI A/B                     |
| -------------------- | ------------------------------------- |
| Deterministic output | **Stochastic output**                 |
| Single scalar metric | **Multi-dimensional quality metrics** |
| Stable responses     | **High variance responses**           |
| Easy ground truth    | **Often no single ground truth**      |

Challenges unique to GenAI:

* Non-determinism from sampling (temperature, top-p)
* Human-in-the-loop evaluation
* Subjective quality metrics
* Prompt and context sensitivity
* Latency–quality–cost tradeoffs

---

### 3. What Can Be A/B Tested in GenAI

| Layer           | Examples                    |
| --------------- | --------------------------- |
| Model           | GPT-4 vs fine-tuned GPT-3.5 |
| Prompt          | Prompt A vs Prompt B        |
| Decoding        | temperature=0.7 vs 0.2      |
| Retrieval       | Vector DB v1 vs v2          |
| Tool policy     | With tools vs without tools |
| Safety filters  | Strict vs relaxed           |
| System pipeline | RAG vs non-RAG              |

---

### 4. Core Workflow

```
Design → Randomize → Deploy → Collect → Evaluate → Decide
```

### Step-by-step

1. **Define objective**

   * e.g., maximize helpfulness while keeping latency < 2s

2. **Select metrics**

| Category | Example Metrics        |
| -------- | ---------------------- |
| Quality  | Human rating, win-rate |
| Safety   | Toxicity score         |
| Cost     | Tokens / request       |
| Latency  | P95 response time      |
| Business | Conversion rate        |

3. **Random assignment**

$$
P(\text{variant A}) = P(\text{variant B}) = 0.5
$$

4. **Run experiment**
5. **Collect logs**
6. **Statistical analysis**
7. **Ship winner**
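
Step 3 (random assignment) is commonly implemented by hashing a stable identifier, so that the same user always sees the same variant. The sketch below is one possible approach rather than a prescribed one; the experiment name `prompt-test-v1` and the `user-<n>` ID format are hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "prompt-test-v1") -> str:
    """Deterministically map a user to 'A' or 'B' with a roughly 50/50 split.

    The experiment name (a hypothetical value here) salts the hash so that
    different experiments bucket the same user independently.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # integer in 0..99
    return "A" if bucket < 50 else "B"

# Sanity check: the observed split over many synthetic users should be near 0.5
assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
share_a = assignments.count("A") / len(assignments)
print(f"Share assigned to A: {share_a:.3f}")
```

Hashing instead of calling a random number generator per request keeps assignment stable across sessions, which avoids one user contributing data to both variants.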

---

### 5. Metrics for Generative AI

| Metric             | Description                                          |
| ------------------ | ---------------------------------------------------- |
| Human Preference   | Pairwise human judgments of A vs B                   |
| Win Rate           | Share of comparisons a variant wins                  |
| LLM-as-Judge       | Verdicts or scores from an automated evaluator model |
| Task Success       | Whether the user's goal was completed                |
| Hallucination Rate | Share of outputs containing false factual statements |
| Cost Efficiency    | Quality per dollar spent                             |

**Important:** Always combine **automatic + human** evaluation.

---

### 6. Statistical Evaluation

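Because sampled outputs are noisy, GenAI A/B tests usually need:

* Large sample sizes, so that real differences stand out from sampling variance
* Paired comparisons when possible (both variants answer the same input)
* Non-parametric tests when scores are ordinal or clearly non-normal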

Common tests:

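| Scenario                            | Test                     |
| ----------------------------------- | ------------------------ |
| Binary win/loss (pairwise verdicts) | Binomial (sign) test     |
| Binary outcome, independent groups  | Chi-square               |
| Paired judgments                    | Wilcoxon signed-rank     |
| Continuous scores                   | t-test / Mann-Whitney    |
| Multiple variants                   | ANOVA / Kruskal-Wallis   |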
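
To make "large sample sizes" concrete, the following sketch estimates how many pairwise judgments are needed to detect a given true win rate against the null of 0.5. It uses the standard normal approximation for a one-sample proportion test and assumes ties are excluded; the function name `required_comparisons` and the defaults (5% significance, 80% power) are illustrative choices, not values from this document.

```python
import math
from statistics import NormalDist

def required_comparisons(p_alt: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate number of pairwise judgments needed to detect a true win
    rate of `p_alt` against H0: win rate = 0.5 (two-sided, normal approximation)."""
    p_null = 0.5
    if p_alt == p_null:
        raise ValueError("p_alt must differ from 0.5")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    numerator = (z_alpha * math.sqrt(p_null * (1 - p_null))
                 + z_beta * math.sqrt(p_alt * (1 - p_alt)))
    return math.ceil((numerator / abs(p_alt - p_null)) ** 2)

for p in (0.55, 0.60, 0.70):
    print(f"true win rate {p:.2f}: about {required_comparisons(p)} comparisons")
```

The required sample grows quickly as the true win rate approaches 0.5, which is why small prompt tweaks often need far more judgments than a handful of spot checks.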

---

### 7. Example: Prompt A/B Test (Code)

```python
from scipy import stats

# Simulated human preference data
# 1 = A wins, 0 = B wins
results = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0]

wins_A = sum(results)
win_rate_A = wins_A / len(results)

# Two-sided exact binomial test of H0: P(A wins) = 0.5
test = stats.binomtest(wins_A, len(results), p=0.5)

print("Win rate A:", win_rate_A)
print("p-value:", test.pvalue)
```

---

### 8. LLM-as-Judge Evaluation

```python
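from typing import Callable

def judge(output_A: str, output_B: str, reference: str,
          llm: Callable[[str], str]) -> str:
    # `llm` is assumed to be any callable that sends a prompt to a model
    # and returns its text reply; plug in your own client here.
    prompt = f"""
    Compare two answers and choose the better one.

    Reference: {reference}
    A: {output_A}
    B: {output_B}

    Answer only: A or B
    """
    # Normalize whitespace so the verdict can be compared against "A" / "B"
    return llm(prompt).strip()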
```

This enables **scalable automated A/B testing** before human review.
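
LLM judges can favor whichever answer appears first, so a common safeguard is to query the judge in both presentation orders and count a win only when the two verdicts agree. The sketch below illustrates that idea under stated assumptions: `llm` is a hypothetical callable taking a prompt string and returning text, and the two stub judges exist only to make the example runnable.

```python
from typing import Callable

def judge_once(llm: Callable[[str], str], first: str, second: str, reference: str) -> str:
    """Ask the judge which of two answers is better; returns the raw label."""
    prompt = (
        "Compare two answers and choose the better one.\n\n"
        f"Reference: {reference}\n"
        f"A: {first}\n"
        f"B: {second}\n\n"
        "Answer only: A or B"
    )
    return llm(prompt).strip().upper()

def judge_both_orders(llm: Callable[[str], str], output_a: str, output_b: str,
                      reference: str) -> str:
    """Return 'A', 'B', or 'tie' after judging in both presentation orders."""
    forward = judge_once(llm, output_a, output_b, reference)   # variant A shown first
    backward = judge_once(llm, output_b, output_a, reference)  # variant B shown first
    # In the swapped prompt, label "A" refers to variant B and vice versa.
    backward_mapped = {"A": "B", "B": "A"}.get(backward)
    if forward in ("A", "B") and forward == backward_mapped:
        return forward
    return "tie"  # disagreement or unparseable verdict

# Stub judges, for illustration only
def always_first(prompt: str) -> str:
    return "A"  # purely position-biased judge

def prefers_paris(prompt: str) -> str:
    # Toy content-based judge: picks the labeled answer that mentions "Paris"
    for line in prompt.splitlines():
        if line.startswith("A:") and "Paris" in line:
            return "A"
    return "B"

ref = "Paris is the capital of France."
print(judge_both_orders(always_first, "Paris", "Lyon", ref))   # prints: tie
print(judge_both_orders(prefers_paris, "Paris", "Lyon", ref))  # prints: A
```

Treating disagreements as ties costs some sample size but prevents position bias from being counted as a quality difference.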

---

### 9. Multi-Metric Decision Table

| Variant | Win Rate | Hallucination | Cost      | Latency  | Decision |
| ------- | -------- | ------------- | --------- | -------- | -------- |
| A       | 62%      | 3%            | $0.02     | 1.3s     | ❌        |
| B       | 58%      | **1%**        | **$0.01** | **0.9s** | ✅        |

Final choice optimizes **overall system utility**, not just quality.
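
One way to see why the table needs a utility judgment rather than a simple ranking is to check Pareto dominance: a variant dominates another only if it is at least as good on every metric and strictly better on one. The sketch below applies that check to the table's numbers; the class and function names are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VariantMetrics:
    name: str
    win_rate: float       # higher is better
    hallucination: float  # lower is better
    cost_usd: float       # lower is better
    latency_s: float      # lower is better

def dominates(x: VariantMetrics, y: VariantMetrics) -> bool:
    """True if x is at least as good as y everywhere and strictly better somewhere."""
    # Negate "lower is better" metrics so that larger always means better.
    xs = (x.win_rate, -x.hallucination, -x.cost_usd, -x.latency_s)
    ys = (y.win_rate, -y.hallucination, -y.cost_usd, -y.latency_s)
    return all(a >= b for a, b in zip(xs, ys)) and any(a > b for a, b in zip(xs, ys))

a = VariantMetrics("A", win_rate=0.62, hallucination=0.03, cost_usd=0.02, latency_s=1.3)
b = VariantMetrics("B", win_rate=0.58, hallucination=0.01, cost_usd=0.01, latency_s=0.9)

print(dominates(a, b), dominates(b, a))  # prints: False False
```

Neither variant dominates the other, since A wins more comparisons while B is safer, cheaper, and faster. Choosing B therefore reflects an explicit weighting of those metrics, not a purely statistical result.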

---

### 10. Advanced Techniques

* **Contextual bandits**: adaptive traffic allocation
* **Sequential testing**: stop early when confident
* **Offline evaluation** with replay logs
* **Prompt ensembles**
* **Pareto frontier optimization**
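
As a rough illustration of adaptive traffic allocation, the sketch below runs Thompson sampling over two variants with binary feedback. Note that this is a plain, non-contextual bandit, a simpler relative of the contextual bandits listed above, and the success rates are simulated values chosen for the demo, not measurements.

```python
import random

def thompson_sampling(true_success: dict, rounds: int = 2000, seed: int = 0) -> dict:
    """Simulate Beta-Bernoulli Thompson sampling; returns pulls per variant.

    `true_success` maps variant name to its (simulated) success probability.
    """
    rng = random.Random(seed)
    wins = {arm: 0 for arm in true_success}
    losses = {arm: 0 for arm in true_success}
    pulls = {arm: 0 for arm in true_success}

    for _ in range(rounds):
        # Draw a plausible success rate per arm from its Beta(1+wins, 1+losses) posterior
        sampled = {arm: rng.betavariate(1 + wins[arm], 1 + losses[arm])
                   for arm in true_success}
        chosen = max(sampled, key=sampled.get)  # route this request to the best draw
        pulls[chosen] += 1
        # Simulated binary feedback (e.g. thumbs up / task success)
        if rng.random() < true_success[chosen]:
            wins[chosen] += 1
        else:
            losses[chosen] += 1
    return pulls

pulls = thompson_sampling({"A": 0.45, "B": 0.65})
print(pulls)
```

Over time most traffic flows to the variant with the higher observed success rate, which reduces exposure to the weaker variant at the cost of making classical fixed-sample significance tests harder to apply.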

---

### 11. Failure Modes

* Overfitting to short-term metrics
* Ignoring variance from sampling
* Inadequate sample size
* Single-metric optimization
* No human validation

---

### 12. Summary

> **A/B testing is the core scientific instrument for improving Generative AI systems in production.**

It enables:

* Objective model comparison
* Safe deployment
* Continuous improvement
* Measurable product gains

