# Stage 5: Experiments - Data-Driven Agent Optimization

---

## 🎯 What You'll Learn

By the end of this walkthrough, you'll understand:

1. **What experiments are** and why gut feelings aren't enough
2. **How to define variants** (different configurations to compare)
3. **How to run controlled experiments** with consistent test sets
4. **How to analyze results** and make data-driven decisions

---

## 📚 Understanding Experiments

### The Problem: Too Many Knobs

When improving your agent, you face questions like:
- "Is GPT-4o worth the extra cost vs GPT-4o-mini?"
- "Does this new system prompt reduce hallucinations?"
- "What temperature gives the best quality/latency tradeoff?"

**You could guess.** Or you could **run experiments and know for sure.**

### What Are Experiments?

Experiments run the **same test suite across different agent configurations**, collecting:
- Pass rates
- Quality scores (rubrics)
- Latency
- Cost
- Tool usage patterns

Then you compare the results and pick the winner.
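
As a rough sketch of what "collecting" means in practice, the snippet below collapses per-test-case records into the per-variant numbers you compare. The `CaseResult` record and `summarize` helper are hypothetical names made up for this illustration (they are not the cookbook's `runner.py` API), and the demo numbers are invented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CaseResult:
    """One test case run against one variant (hypothetical record shape)."""
    passed: bool
    rubric_score: float  # 1-5 scale
    latency_s: float     # seconds
    cost_usd: float      # estimated dollars for this query


def summarize(results: list[CaseResult]) -> dict[str, float]:
    """Collapse per-case records into the per-variant metrics being compared."""
    if not results:
        raise ValueError("cannot summarize an empty result list")
    return {
        "pass_rate": sum(r.passed for r in results) / len(results),
        "avg_rubric": mean(r.rubric_score for r in results),
        "avg_latency_s": mean(r.latency_s for r in results),
        "avg_cost_usd": mean(r.cost_usd for r in results),
    }


# Made-up numbers, for illustration only.
demo = [
    CaseResult(True, 4.5, 1.0, 0.003),
    CaseResult(True, 4.0, 1.4, 0.003),
    CaseResult(False, 2.5, 1.2, 0.003),
    CaseResult(True, 5.0, 1.2, 0.003),
]
summary = summarize(demo)
print(summary)
```

Running the same `summarize` step over every variant's results is what makes the numbers comparable side by side.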

| Intuition-Based | Experiment-Based |
|-----------------|------------------|
| "I think GPT-4o is better" | "GPT-4o scores 4.5/5 vs 4.1/5" |
| "The new prompt seems good" | "New prompt: 95% pass vs 87%" |
| "Faster is always better" | "50ms faster but 5% less accurate" |

---

## 🏗️ Where Experiments Fit: The Eval Maturity Model

```
┌────────────────────────────────────────────────────────────────────┐
│                      EVAL FRAMEWORK MATURITY                       │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ★ Stage 5: EXPERIMENTS ★  ← Compare configurations (YOU ARE HERE) │
│      ▲                                                             │
│  Stage 4: Rubrics          ← Multi-dimensional scoring (done)      │
│      ▲                                                             │
│  Stage 3: Replay Harnesses ← Reproducibility (done)                │
│      ▲                                                             │
│  Stage 2: Labeled Scenarios← Coverage mapping (done)               │
│      ▲                                                             │
│  Stage 1: Golden Sets      ← Baseline correctness (done)           │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```

**This is the top of the maturity model.** Experiments enable continuous improvement through data.

## 🔍 Behind the Scenes: How Experiments Work

Here's the experiment workflow:

```
┌───────────────────────────────────────────────────────────────────────┐
│                    EXPERIMENT WORKFLOW                                │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  STEP 1: DEFINE VARIANTS                                              │
│  ┌──────────────────────────────────────────────────────────────┐     │
│  │ variants.yaml:                                               │     │
│  │                                                              │     │
│  │   baseline:                                                  │     │
│  │     model: gpt-4o-mini                                       │     │
│  │     temperature: 0.1                                         │     │
│  │     system_prompt: v1                                        │     │
│  │                                                              │     │
│  │   gpt4o_upgrade:                                             │     │
│  │     model: gpt-4o         ← Change one thing                 │     │
│  │     temperature: 0.1                                         │     │
│  │     system_prompt: v1                                        │     │
│  │                                                              │     │
│  │   new_prompt:                                                │     │
│  │     model: gpt-4o-mini                                       │     │
│  │     temperature: 0.1                                         │     │
│  │     system_prompt: v2     ← Change one thing                 │     │
│  └──────────────────────────────────────────────────────────────┘     │
│                            │                                          │
│                            ▼                                          │
│  STEP 2: RUN TEST SUITE FOR EACH VARIANT                              │
│  ┌──────────────────────────────────────────────────────────────┐     │
│  │ For each variant:                                            │     │
│  │   1. Create agent with variant config                        │     │
│  │   2. Run all golden set test cases                           │     │
│  │   3. Collect: pass/fail, rubric scores, latency, tokens      │     │
│  │   4. Save detailed results to JSON                           │     │
│  └──────────────────────────────────────────────────────────────┘     │
│                            │                                          │
│                            ▼                                          │
│  STEP 3: COMPARE RESULTS                                              │
│  ┌──────────────────────────────────────────────────────────────┐     │
│  │                                                              │     │
│  │  Variant        Pass %   Rubric   Latency    Cost            │     │
│  │  ─────────────────────────────────────────────────           │     │
│  │  baseline         87%     4.1/5    1.2s     $0.003           │     │
│  │  gpt4o_upgrade    93%     4.5/5    2.1s     $0.015  ← Best!  │     │
│  │  new_prompt       91%     4.3/5    1.3s     $0.003           │     │
│  │                                                              │     │
│  └──────────────────────────────────────────────────────────────┘     │
│                            │                                          │
│                            ▼                                          │
│  STEP 4: DECIDE                                                       │
│  ┌──────────────────────────────────────────────────────────────┐     │
│  │  "GPT-4o gains 6 points of pass rate and 0.4 rubric score,   │     │
│  │   but costs 5x as much and is 75% slower.                    │     │
│  │                                                              │     │
│  │   Decision: Use GPT-4o for high-stakes queries,              │     │
│  │            GPT-4o-mini for everything else."                 │     │
│  └──────────────────────────────────────────────────────────────┘     │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
```

**Key principle:** Change one variable at a time to understand its impact.

```python
# Setup: import required modules
import sys
from pathlib import Path

# Add the sibling setup_agent and stage_4_rubrics directories to the import path
sys.path.insert(0, str(Path.cwd().parent / "setup_agent"))
sys.path.insert(0, str(Path.cwd().parent / "stage_4_rubrics"))

import yaml
from runner import load_variants, load_test_cases, run_experiment, print_comparison
from reporter import load_results, print_comparison_table
```

## 💻 Hands-On: Exploring Variant Definitions

Let's look at how variants are defined in `variants.yaml`:

```python
# Load and display variant definitions
variants_path = Path("variants.yaml")

with open(variants_path) as f:
    variants_config = yaml.safe_load(f)

print("📋 VARIANT DEFINITIONS")
print("=" * 60)

# Defaults are shared settings; each variant lists only what it overrides
print("\n🔧 Defaults (applied to all variants unless overridden):")
for key, value in variants_config.get("defaults", {}).items():
    print(f"   {key}: {value}")

print("\n📊 Variants:")
for name, config in variants_config.get("variants", {}).items():
    print(f"\n   {name}:")
    for key, value in config.items():
        print(f"      {key}: {value}")
```

## 📝 Understanding Variant Parameters

Each variant can customize these parameters:

| Parameter | Description | Example Values |
|-----------|-------------|----------------|
| `model` | LLM model to use | `gpt-4o`, `gpt-4o-mini`, `gpt-3.5-turbo` |
| `temperature` | Response randomness | `0.0` (deterministic) to `1.0` (creative) |
| `system_prompt` | Which prompt version | `v1`, `v2` (files in `prompts/`) |
| `max_tokens` | Max response length | `500`, `1000`, `2000` |

### Best Practice: One Change at a Time

```yaml
# ❌ Bad: Multiple changes - can't tell what helped
experiment_v2:
  model: gpt-4o        # Changed
  temperature: 0.0     # Changed  
  system_prompt: v2    # Changed

# ✅ Good: Single change - clear attribution
gpt4o_test:
  model: gpt-4o        # Changed
  temperature: 0.1     # Same
  system_prompt: v1    # Same
```
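
If you want to enforce this rule rather than eyeball it, a small check can list which parameters differ from the baseline. This is only a sketch: `changed_keys` is a hypothetical helper, and the configs are copied from the YAML example above, assuming the baseline uses `gpt-4o-mini`, temperature `0.1`, and prompt `v1`.

```python
_MISSING = object()  # sentinel so an absent key never equals an explicit None


def changed_keys(baseline: dict, candidate: dict) -> list[str]:
    """Return the sorted parameter names whose values differ between two configs."""
    keys = baseline.keys() | candidate.keys()
    return sorted(
        k for k in keys
        if baseline.get(k, _MISSING) != candidate.get(k, _MISSING)
    )


baseline = {"model": "gpt-4o-mini", "temperature": 0.1, "system_prompt": "v1"}
experiment_v2 = {"model": "gpt-4o", "temperature": 0.0, "system_prompt": "v2"}
gpt4o_test = {"model": "gpt-4o", "temperature": 0.1, "system_prompt": "v1"}

print(changed_keys(baseline, experiment_v2))  # ['model', 'system_prompt', 'temperature']
print(changed_keys(baseline, gpt4o_test))     # ['model']
```

A variant with more than one changed key can still be worth running, but any improvement it shows cannot be attributed to a single parameter.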

## 🧪 Viewing Test Cases

Experiments run on test cases from previous stages. Let's see what's available:

```python
# Load test cases from different sources
print("📊 TEST CASE SOURCES")
print("=" * 60)

for source in ["golden", "scenarios", "rubrics"]:
    try:
        cases = load_test_cases(source)
        print(f"\n{source.upper()}: {len(cases)} test cases")
        # Show the first 3 queries, truncated to 50 characters
        for case in cases[:3]:
            query = case.get("query", "N/A")[:50]
            print(f"   • {query}...")
    except Exception as e:
        print(f"\n{source.upper()}: Could not load - {e}")
```

## 🏃 Running an Experiment

Now let's run an experiment! This will:
1. Create an agent for each variant
2. Run all test cases
3. Collect metrics
4. Save results to disk

**⚠️ Note:** This makes real API calls and costs money. While testing, cap the number of cases with `limit=` in `run_experiment` or `--limit` on the command line.

```python
# Run a small experiment (3 test cases per variant to save costs)
# Uncomment to run:

# results = run_experiment(
#     variant_names=["baseline", "new_prompt"],  # Which variants to test
#     test_source="golden",                       # Use golden set test cases
#     limit=3,                                    # Only 3 cases (for demo)
#     include_rubrics=True,                       # Also score with rubrics
#     verbose=True
# )
#
# # Print comparison
# print_comparison(results)

print("💡 To run an experiment, uncomment the code above.")
print("   This will make real API calls (costs $).")
print()
print("   Or run from command line:")
print("   uv run python runner.py --test-source golden --limit 3")
```

## 📊 Analyzing Results

After running experiments, the reporter generates comparison tables. Here's how to interpret them:

### The Comparison Table

```
Variant        Pass %   Rubric   Latency    Cost
─────────────────────────────────────────────────
baseline         87%     4.1/5    1.2s     $0.003
gpt4o_upgrade    93%     4.5/5    2.1s     $0.015  ← Best quality
new_prompt       91%     4.3/5    1.3s     $0.003  ← Best value
```
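
It often helps to restate a row relative to the baseline before deciding. The sketch below does that arithmetic for the sample numbers in the table above; `compare_to_baseline` is a hypothetical helper, not part of `reporter.py`. Note that a move from 87% to 93% is a gain of 6 percentage points, which is not the same as a 6% relative improvement.

```python
def compare_to_baseline(base: dict, cand: dict) -> dict:
    """Express a candidate's metrics relative to the baseline's."""
    return {
        # Difference between two percentages, in percentage points
        "pass_rate_points": round((cand["pass_rate"] - base["pass_rate"]) * 100, 1),
        "rubric_delta": round(cand["rubric"] - base["rubric"], 2),
        # Relative change in latency, as a percentage of the baseline
        "latency_increase_pct": round((cand["latency_s"] / base["latency_s"] - 1) * 100, 1),
        # How many times the baseline cost the candidate costs
        "cost_multiple": round(cand["cost_usd"] / base["cost_usd"], 1),
    }


# Sample values copied from the comparison table above.
baseline = {"pass_rate": 0.87, "rubric": 4.1, "latency_s": 1.2, "cost_usd": 0.003}
gpt4o_upgrade = {"pass_rate": 0.93, "rubric": 4.5, "latency_s": 2.1, "cost_usd": 0.015}

delta = compare_to_baseline(baseline, gpt4o_upgrade)
print(delta)
```

For these sample numbers the upgrade buys 6 points of pass rate and 0.4 rubric points, at 5 times the cost and 75% more latency.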

### What Each Column Means

| Column | Meaning | Good Value |
|--------|---------|------------|
| **Pass %** | Percentage of test cases passing | Higher is better |
| **Rubric** | Average quality score (1-5) | > 4.0 is good |
| **Latency** | Average response time | Lower is better |
| **Cost** | Estimated $ per query | Lower is better |

### Making Decisions

The "best" variant depends on your priorities:

| Priority | Choose | Why |
|----------|--------|-----|
| **Quality first** | Highest rubric score | You accept higher cost and latency |
| **Cost first** | Lowest cost with acceptable quality | Usually the smaller model (e.g. `gpt-4o-mini`) |
| **Balanced** | Best quality/cost ratio | A better prompt often lifts quality at no extra cost |

## 📈 Viewing Past Experiment Results

Results are saved to the `results/` directory. Let's check for existing experiments:

```python
# Check for existing experiment results
results_dir = Path("results")

if results_dir.exists():
    # Sort so the listing is stable between runs
    result_files = sorted(results_dir.glob("*.json"))
    print(f"📁 Found {len(result_files)} result files in results/")

    for f in result_files[:5]:  # Show first 5
        print(f"   • {f.name}")

    if result_files:
        print("\n📊 Loading and displaying results...")
        try:
            summaries = load_results(str(results_dir))
            if summaries:
                print_comparison_table(summaries)
            else:
                print("   No valid summaries found in results.")
        except Exception as e:
            print(f"   Could not load results: {e}")
else:
    print("📁 No results directory found.")
    print("   Run an experiment first: uv run python runner.py")
```

## 🔬 Common Experiments to Run

Here are experiments you should run for your agent:

### 1. Model Comparison
```yaml
variants:
  mini: { model: gpt-4o-mini }
  full: { model: gpt-4o }
```
**Question:** Is the bigger model worth its higher per-query cost?

### 2. Temperature Tuning
```yaml
variants:
  deterministic: { temperature: 0.0 }
  balanced: { temperature: 0.3 }
  creative: { temperature: 0.7 }
```
**Question:** How much randomness is optimal?

### 3. Prompt Engineering
```yaml
variants:
  current: { system_prompt: v1 }
  detailed: { system_prompt: v2 }
  concise: { system_prompt: v3 }
```
**Question:** Does prompt wording affect quality?

### 4. Tool Selection
```yaml
variants:
  all_tools: { tools: ["vector", "sql", "jira", "slack"] }
  limited: { tools: ["vector", "sql"] }
```
**Question:** Do more tools help or hurt?

## 🎓 Key Takeaways

1. **Data beats intuition**: Run experiments instead of guessing
2. **One change at a time**: Isolate variables to understand impact
3. **Same test set**: Compare apples to apples across variants
4. **Track all metrics**: Quality, latency, and cost together
5. **Document decisions**: Record why you chose a configuration

---

## 📋 Experiment Checklist

Before running an experiment:
- [ ] Define clear hypothesis ("New prompt will improve accuracy")
- [ ] Create variant with single change
- [ ] Choose appropriate test set (golden for quick, scenarios for thorough)
- [ ] Set reasonable sample size (more = more confidence but more $)
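
To get a feel for how sample size affects confidence, the sketch below estimates the margin of error around a measured pass rate. It uses the normal approximation to the binomial with a 95% z-value of 1.96, which is a rough, assumed method (it is unreliable for very small samples or pass rates near 0% or 100%); `pass_rate_margin` is a hypothetical helper.

```python
import math


def pass_rate_margin(pass_rate: float, n_cases: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate measured on n_cases."""
    if n_cases <= 0:
        raise ValueError("n_cases must be positive")
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n_cases)


for n in (20, 100, 500):
    margin = pass_rate_margin(0.87, n)
    print(f"n={n:>3}: 87% pass rate, give or take {margin * 100:.1f} points")
```

At 20 cases the margin is roughly 15 points, so a 4-point gap between two variants could easily be noise; at 500 cases it shrinks to about 3 points.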

After running:
- [ ] Compare all metrics, not just one
- [ ] Consider cost/quality tradeoffs
- [ ] Document the decision and rationale
- [ ] Update production config if winner is found
- [ ] Set up regression tests to protect the improvement

---

## 🏆 Congratulations!

You've completed all five stages of the Production Evals Cookbook:

| Stage | What You Learned |
|-------|-----------------|
| 1. Golden Sets | Baseline correctness with curated test cases |
| 2. Labeled Scenarios | Coverage mapping with categorized tests |
| 3. Replay Harnesses | Reproducibility with recorded sessions |
| 4. Rubrics | Quality scoring with multi-dimensional rubrics |
| 5. Experiments | Configuration optimization with controlled experiments |

### What's Next?

1. **Integrate into CI**: Run golden sets on every commit
2. **Set up monitoring**: Track rubric scores over time
3. **Build feedback loops**: Turn production issues into test cases
4. **Iterate**: Use experiments to continuously improve

Happy evaluating! 🚀