# A/B Testing for Prompts

Learn how to systematically compare prompt variants to find the most effective approach.

## What You'll Learn
- Testing multiple prompt variants
- Analyzing response differences
- Comparing variants across repeated runs
- Best practices for A/B testing

In [None]:
from prompt_playground.client import create_client, send_prompt, send_batch
from prompt_playground.analysis import compare_responses, calculate_metrics, analyze_tone, visualize_comparison
from rich import print as rprint
from rich.panel import Panel
from rich.table import Table
from rich.console import Console
import matplotlib.pyplot as plt

client = create_client()
console = Console()
rprint("[green]✓[/green] Ready for A/B testing")

## Why A/B Test Prompts?

A/B testing helps you:
- Find the most effective prompt structure
- Optimize for specific metrics (clarity, conciseness, tone)
- Base prompt decisions on measured results instead of intuition
- Learn which changes actually matter for your use case

## Basic A/B Test

Compare two prompt variants for the same task:

In [None]:
task = "explaining recursion to beginners"

variant_a = "Explain recursion in simple terms."
variant_b = "Explain recursion using a real-world analogy that a beginner programmer would understand."

prompts = [variant_a, variant_b]
responses = send_batch(prompts=prompts, client=client)

for i, response in enumerate(responses, 1):
    rprint(f"\n[bold cyan]Variant {chr(64+i)}:[/bold cyan]")
    rprint(Panel(response['text'], expand=False))

## Comparing Metrics

In [None]:
from IPython.display import display  # available implicitly in Jupyter; imported here for clarity

df = compare_responses(responses)
df['variant'] = ['A', 'B']

display(df[['variant', 'word_count', 'sentence_count', 'output_tokens', 'estimated_cost']])
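
The columns above come from `compare_responses`, whose internals this notebook does not show. As a rough, independent sanity check, word and sentence counts can be approximated with the standard library. The helper `simple_text_metrics` below is hypothetical and not part of `prompt_playground`; its counting rules are an assumption and may differ from the library's.

```python
import re

def simple_text_metrics(text: str) -> dict:
    """Approximate word and sentence counts for a response string.

    Hypothetical helper for illustration; not part of prompt_playground.
    """
    words = text.split()
    # Treat runs of '.', '!' or '?' as sentence terminators.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }

sample = "Recursion is when a function calls itself. It needs a base case!"
print(simple_text_metrics(sample))
```

Average words per sentence is a cheap readability proxy that can sit next to the raw counts when comparing variants aimed at beginners.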

## Tone Analysis Comparison

In [None]:
table = Table(title="Tone Comparison")
table.add_column("Variant", style="cyan")
table.add_column("Formality", style="yellow")
table.add_column("Complexity", style="green")
table.add_column("Perspective", style="blue")

for i, response in enumerate(responses):
    tone = analyze_tone(response['text'])
    table.add_row(
        chr(65+i),
        tone['formality'],
        tone['complexity'],
        tone['perspective']
    )

console.print(table)

## Visual Comparison

In [None]:
fig = visualize_comparison(responses, metric='length')
plt.show()

fig = visualize_comparison(responses, metric='tokens')
plt.show()

## Multi-Variant Testing (A/B/C/D)

Test multiple approaches simultaneously:

In [None]:
variants = [
    "Write a product description for wireless headphones.",
    "Write a compelling product description for wireless headphones that highlights key benefits.",
    "Create a product description for wireless headphones. Focus on: sound quality, battery life, comfort. Use persuasive language.",
    "You are a product copywriter. Write an engaging description for wireless headphones that would appeal to music enthusiasts."
]

responses = send_batch(prompts=variants, temperature=0.7, client=client)

for i, response in enumerate(responses, 1):
    metrics = calculate_metrics(response)
    rprint(f"\n[bold]Variant {chr(64+i)}:[/bold] {metrics['word_count']} words, ${metrics['estimated_cost']:.6f}")
    rprint(response['text'][:200] + "...")

## Testing Prompt Structure

Compare different structural approaches:

In [None]:
structures = [
    "List 3 benefits of exercise.",
    
    """Task: List benefits of exercise
Format: Numbered list
Count: 3 items""",
    
    """Please list 3 key benefits of regular exercise.
    
For each benefit:
1. State the benefit
2. Explain why it matters"""
]

responses = send_batch(prompts=structures, temperature=0.3, client=client)
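df = compare_responses(responses)
df['structure'] = list(range(1, len(responses) + 1))
display(df[['structure', 'word_count', 'sentence_count', 'output_tokens']])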

for i, response in enumerate(responses, 1):
    rprint(f"\n[bold cyan]Structure {i}:[/bold cyan]")
    rprint(Panel(response['text'], expand=False))

## Testing Temperature Effects

In [None]:
base_prompt = "Write a creative tagline for an eco-friendly water bottle."
temperatures = [0.3, 0.7, 1.0]

results = []
for temp in temperatures:
    response = send_prompt(prompt=base_prompt, temperature=temp, client=client)
    results.append({
        'temperature': temp,
        'response': response['text'],
        'tokens': response['output_tokens']
    })

for result in results:
    rprint(f"\n[cyan]Temperature {result['temperature']}:[/cyan]")
    rprint(result['response'])
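
A single sample per temperature says little about how much the setting changes output variety. One rough way to quantify it is to draw several samples per temperature and measure what fraction are distinct. The sketch below uses hard-coded placeholder taglines in place of real API responses, and `distinct_ratio` is a hypothetical helper, not part of `prompt_playground`.

```python
def distinct_ratio(samples: list[str]) -> float:
    """Fraction of samples that are unique after light normalization."""
    if not samples:
        return 0.0
    # Lowercase and collapse whitespace so trivial differences do not count.
    normalized = {" ".join(s.lower().split()) for s in samples}
    return len(normalized) / len(samples)

# Placeholder outputs standing in for repeated calls at each temperature.
samples_by_temperature = {
    0.3: ["Sip green.", "Sip green.", "sip  green.", "Drink clean, live green."],
    1.0: ["Sip green.", "Hydrate the planet.", "Refill, don't landfill.", "Pure water, pure Earth."],
}

for temp, samples in samples_by_temperature.items():
    print(f"temperature={temp}: distinct ratio {distinct_ratio(samples):.2f}")
```

In a real run, the placeholder lists would be filled by calling `send_prompt` several times per temperature with otherwise identical parameters.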

## Best Practices

### 1. Test One Variable at a Time

In [None]:
base = "Explain photosynthesis."

test_specificity = [
    base,
    "Explain photosynthesis in simple terms.",
]

test_audience = [
    "Explain photosynthesis in simple terms.",
    "Explain photosynthesis in simple terms to a 5th grader.",
]

rprint("[green]✓[/green] Testing specificity first, then audience")
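
The same idea can be expressed programmatically: start from a base prompt and add one modifier at a time, so that each comparison pair differs by exactly one change. This is a minimal sketch; `build_variant_chain` is a hypothetical helper introduced here for illustration.

```python
def build_variant_chain(base: str, modifiers: list[str]) -> list[tuple[str, str]]:
    """Return consecutive (control, treatment) pairs that differ by one modifier."""
    stem = base.rstrip(".")
    prompts = [base]
    for modifier in modifiers:
        stem = f"{stem} {modifier}"
        prompts.append(f"{stem}.")
    # Pair each prompt with the next one in the chain.
    return list(zip(prompts, prompts[1:]))

pairs = build_variant_chain(
    "Explain photosynthesis.",
    ["in simple terms", "to a 5th grader"],
)
for control, treatment in pairs:
    print(f"{control!r} vs {treatment!r}")
```

Each pair can then be sent as its own two-prompt batch, which keeps the effect of every modifier separately attributable.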

### 2. Use Consistent Parameters

In [None]:
test_params = {
    'temperature': 0.5,
    'max_tokens': 200,
    'client': client
}

variant_1 = send_prompt(prompt="Variant 1...", **test_params)
variant_2 = send_prompt(prompt="Variant 2...", **test_params)

rprint("[green]✓[/green] Same parameters ensure fair comparison")

### 3. Define Success Criteria

In [None]:
def evaluate_response(response, criteria):
    """Return how many of the supplied criteria the response meets (0-3)."""
    metrics = calculate_metrics(response)
    tone = analyze_tone(response['text'])

    score = 0

    # Compare against None rather than relying on truthiness, so a limit of 0 still counts.
    if criteria.get('max_words') is not None and metrics['word_count'] <= criteria['max_words']:
        score += 1

    if criteria.get('formality') is not None and tone['formality'] == criteria['formality']:
        score += 1

    if criteria.get('max_cost') is not None and metrics['estimated_cost'] <= criteria['max_cost']:
        score += 1

    return score

criteria = {
    'max_words': 100,
    'formality': 'formal',
    'max_cost': 0.01
}

rprint("[green]✓[/green] Defined evaluation criteria")
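
Criteria only help once they are applied to every variant and the scores are compared. The sketch below ranks variants by the number of criteria met. To stay runnable without API calls it scores plain dictionaries; the metric values are placeholders rather than real results, and `score_metrics` is a hypothetical stand-in that mirrors the logic of `evaluate_response`.

```python
def score_metrics(metrics: dict, criteria: dict) -> int:
    """Count how many of the supplied criteria a set of metrics satisfies."""
    checks = {
        "max_words": lambda limit: metrics["word_count"] <= limit,
        "formality": lambda wanted: metrics["formality"] == wanted,
        "max_cost": lambda limit: metrics["estimated_cost"] <= limit,
    }
    return sum(
        1 for name, check in checks.items()
        if name in criteria and check(criteria[name])
    )

example_criteria = {"max_words": 100, "formality": "formal", "max_cost": 0.01}

# Placeholder metrics standing in for calculate_metrics / analyze_tone output.
candidates = {
    "A": {"word_count": 85, "formality": "casual", "estimated_cost": 0.002},
    "B": {"word_count": 140, "formality": "formal", "estimated_cost": 0.004},
    "C": {"word_count": 95, "formality": "formal", "estimated_cost": 0.003},
}

ranked = sorted(
    candidates,
    key=lambda name: score_metrics(candidates[name], example_criteria),
    reverse=True,
)
for name in ranked:
    print(name, score_metrics(candidates[name], example_criteria))
```

Counting every criterion equally is the simplest choice; if one criterion matters more than the others, a weighted sum is a natural extension.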

### 4. Run Multiple Tests

In [None]:
prompt_a = "Explain quantum entanglement briefly."
prompt_b = "Explain quantum entanglement in 2-3 sentences."

runs = 3
results_a = []
results_b = []

for i in range(runs):
    resp_a = send_prompt(prompt=prompt_a, temperature=0.7, client=client)
    resp_b = send_prompt(prompt=prompt_b, temperature=0.7, client=client)
    results_a.append(calculate_metrics(resp_a))
    results_b.append(calculate_metrics(resp_b))

avg_words_a = sum(r['word_count'] for r in results_a) / runs
avg_words_b = sum(r['word_count'] for r in results_b) / runs

rprint(f"\nAverage words - A: {avg_words_a:.1f}, B: {avg_words_b:.1f}")
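if avg_words_a == avg_words_b:
    rprint("[yellow]Tie:[/yellow] both variants average the same length")
else:
    # With only a few runs this is a rough signal, not a statistically reliable result.
    shorter = 'A' if avg_words_a < avg_words_b else 'B'
    rprint(f"[green]Shorter on average:[/green] {shorter} (more concise over {runs} runs)")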
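
Averages from three runs per variant are a weak basis for choosing between prompts, because sampling noise alone can produce a gap. A permutation test is one simple way to ask whether an observed difference in means is larger than chance would explain. The sketch below uses only the standard library; the word counts are placeholders, not real results, and `permutation_test` is a hypothetical helper rather than part of `prompt_playground`.

```python
import random

def permutation_test(sample_a, sample_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns the observed difference (mean_a - mean_b) and an estimated p-value.
    """
    rng = random.Random(seed)
    n_a, n_b = len(sample_a), len(sample_b)
    observed = sum(sample_a) / n_a - sum(sample_b) / n_b
    pooled = list(sample_a) + list(sample_b)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the pooled values to the two groups.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return observed, (extreme + 1) / (n_permutations + 1)

# Placeholder word counts standing in for results from more runs per variant.
words_a = [62, 71, 58, 66, 75, 69, 64, 70]
words_b = [41, 45, 39, 48, 44, 42, 46, 40]

diff, p_value = permutation_test(words_a, words_b)
print(f"Mean difference: {diff:.1f} words, p ≈ {p_value:.4f}")
```

With three runs per variant there are only 20 ways to split six values into two groups of three, so an exact two-sided test cannot produce a p-value below 0.1. That is a practical reason to collect more runs before drawing conclusions.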

## Summary

You've learned:
- ✓ Running basic A/B tests
- ✓ Comparing multiple variants (A/B/C/D)
- ✓ Analyzing metrics and tone
- ✓ Visualizing comparisons
- ✓ Testing different aspects (structure, temperature)
- ✓ Best practices for reliable testing

## Next Steps

- **04_batch_processing.ipynb**: Scale your A/B tests
- **05_evaluation_metrics.ipynb**: Define custom success metrics