# Week 6.3: LLM as Judge - Automated Evaluation with Language Models

In this notebook, we'll explore how to use Large Language Models (LLMs) as judges to evaluate the quality of other model outputs. This approach, popularized by Stanford's AlpacaEval, provides a fast, scalable, and reproducible alternative to human evaluation.

## Resource Requirements

This notebook can be run on:
- **Google Colab**: Free tier (no GPU required)
- **Local machine**: With Python 3.8+ and API access to OpenAI or other LLM providers
- **Estimated time**: 45-60 minutes
- **API costs**: Minimal (< $1 for all examples)

## Learning Objectives

By the end of this notebook, you will:

- ✅ Understand the fundamentals of LLM-as-Judge evaluation
- ✅ Implement a basic pairwise comparison evaluator
- ✅ Learn how to design effective evaluation prompts
- ✅ Recognize common biases and limitations (length bias, position bias, etc.)
- ✅ Understand length-controlled evaluation techniques
- ✅ Apply batch evaluation for efficiency
- ✅ Know when to use automated vs. human evaluation

## Practical Setup

First, let's install the required packages and set up our environment.

In [None]:
# Install required packages
!pip install openai pandas numpy matplotlib seaborn tqdm

import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass
from tqdm import tqdm
import time

# For API calls
import openai
from openai import OpenAI

# Set up visualization defaults
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

---

## 1. Introduction to LLM-as-Judge Evaluation

### What is LLM-as-Judge?

LLM-as-Judge is an evaluation paradigm where we use a powerful language model (like GPT-4 or Claude) to assess the quality of outputs from other language models. This approach has gained popularity because:

1. **Scalability**: Can evaluate thousands of examples quickly
2. **Consistency**: The same judge applies the same criteria across all evaluations
3. **Cost-effectiveness**: Much cheaper than human evaluation
4. **Reproducibility**: Results can be replicated with the same prompts and the same judge model version

### The Core Innovation of AlpacaEval

AlpacaEval revolutionized LLM evaluation by introducing several key concepts:

#### 1. **Pairwise Comparison vs Absolute Scoring**
Instead of asking "Rate this response from 1-10", AlpacaEval asks "Which response is better, A or B?". This is crucial because:
- Humans (and LLMs) are better at relative comparisons than absolute scoring
- It reduces variability and increases agreement rates
- It matches how humans naturally evaluate quality

#### 2. **Reference-Based Evaluation**
AlpacaEval compares your model against strong reference models (like GPT-4):
```
Your Model Output <--compare--> GPT-4 Output
                       |
                   Judge LLM
                       |
                  Win Rate %
```

#### 3. **Win Rate as Primary Metric**
The win rate answers: "On what percentage of instructions does the judge prefer your model's output over the reference's?"
- 50% win rate = The judge prefers your model as often as the reference (e.g., GPT-4)
- 30% win rate = The judge prefers your model on 30% of instructions
- This single number is easy to interpret and track
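
As a rough illustration of the metric, the sketch below computes a win rate from a list of judge verdicts. The function name `win_rate`, the `"A"`/`"B"`/`"TIE"` labels, and the choice to count a tie as half a win are assumptions made for this example, and the verdict list is made up for demonstration.

```python
from typing import Iterable


def win_rate(verdicts: Iterable[str], candidate: str = "B",
             tie_weight: float = 0.5) -> float:
    """Share of comparisons won by `candidate`.

    Assumption for this sketch: a "TIE" counts as `tie_weight` of a win.
    """
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("need at least one verdict")

    score = 0.0
    for v in verdicts:
        if v == candidate:
            score += 1.0
        elif v == "TIE":
            score += tie_weight
    return score / len(verdicts)


# Hypothetical verdicts: "B" is the model under test, "A" is the reference
example_verdicts = ["B", "A", "B", "TIE", "A", "A", "B", "A", "A", "A"]
print(f"Win rate: {win_rate(example_verdicts):.1%}")
```

Setting `tie_weight=0.0` gives a stricter variant in which only outright wins count.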

### How AlpacaEval Works Under the Hood

The evaluation pipeline consists of three main stages:

```
Stage 1: Data Collection
├── 805 diverse instructions (questions/tasks)
├── Your model generates responses
└── Reference model responses (pre-computed)

Stage 2: Pairwise Evaluation
├── For each instruction:
│   ├── Create evaluation prompt
│   ├── Send to judge LLM (GPT-4)
│   └── Record verdict (win/loss/tie)
└── Handle edge cases (refusals, errors)

Stage 3: Aggregation
├── Calculate win rate
├── Compute confidence intervals
├── Apply length control (v2.0)
└── Generate leaderboard entry
```
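
The "compute confidence intervals" step in Stage 3 can be sketched with a normal approximation based on the standard error of the mean. This is only one possible approach and not necessarily what AlpacaEval does internally; the function name `win_rate_with_ci` and the encoding of outcomes (1.0 for a win, 0.0 for a loss, 0.5 for a tie) are assumptions for this example.

```python
import math
from typing import Sequence, Tuple


def win_rate_with_ci(outcomes: Sequence[float],
                     z: float = 1.96) -> Tuple[float, float, float]:
    """Return (win_rate, lower, upper) using a normal approximation.

    `outcomes` holds one score per instruction: 1.0 = win, 0.0 = loss,
    0.5 = tie (an assumption of this sketch). z = 1.96 gives roughly a
    95% interval.
    """
    n = len(outcomes)
    if n < 2:
        raise ValueError("need at least two outcomes")

    mean = sum(outcomes) / n
    # Sample variance (n - 1 in the denominator), then standard error
    variance = sum((x - mean) ** 2 for x in outcomes) / (n - 1)
    std_err = math.sqrt(variance / n)

    lower = max(0.0, mean - z * std_err)
    upper = min(1.0, mean + z * std_err)
    return mean, lower, upper


# Hypothetical data: 30 wins and 70 losses
rate, lo, hi = win_rate_with_ci([1.0] * 30 + [0.0] * 70)
print(f"Win rate: {rate:.1%} (approx. 95% CI: {lo:.1%} to {hi:.1%})")
```

With only a handful of examples the interval becomes very wide, which is one reason a benchmark uses hundreds of instructions.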

In [None]:
# Let's visualize the AlpacaEval pipeline
from matplotlib.patches import FancyBboxPatch

def visualize_alpaca_eval_pipeline():
    """Create a visual representation of the AlpacaEval pipeline."""
    fig, ax = plt.subplots(1, 1, figsize=(14, 8))
    ax.set_xlim(0, 10)
    ax.set_ylim(0, 10)
    ax.axis('off')
    
    # Title
    ax.text(5, 9.5, 'AlpacaEval Pipeline', fontsize=20, fontweight='bold', ha='center')
    
    # Stage 1: Inputs
    input_box = FancyBboxPatch((0.5, 6.5), 2, 2, boxstyle="round,pad=0.1",
                               facecolor='lightblue', edgecolor='black', linewidth=2)
    ax.add_patch(input_box)
    ax.text(1.5, 7.5, '805 Test\nInstructions', ha='center', va='center', fontsize=10, fontweight='bold')
    
    # Your Model
    your_model = FancyBboxPatch((0.5, 4), 2, 1.5, boxstyle="round,pad=0.1",
                                facecolor='lightgreen', edgecolor='black', linewidth=2)
    ax.add_patch(your_model)
    ax.text(1.5, 4.75, 'Your Model', ha='center', va='center', fontsize=10, fontweight='bold')
    
    # Reference Model
    ref_model = FancyBboxPatch((0.5, 2), 2, 1.5, boxstyle="round,pad=0.1",
                               facecolor='lightcoral', edgecolor='black', linewidth=2)
    ax.add_patch(ref_model)
    ax.text(1.5, 2.75, 'Reference\n(GPT-4)', ha='center', va='center', fontsize=10, fontweight='bold')
    
    # Arrows from instructions to models
    ax.arrow(1.5, 6.5, 0, -0.8, head_width=0.1, head_length=0.1, fc='black', ec='black')
    ax.arrow(1.5, 6.5, 0, -2.8, head_width=0.1, head_length=0.1, fc='black', ec='black')
    
    # Stage 2: Comparison
    compare_box = FancyBboxPatch((4, 3), 2.5, 3, boxstyle="round,pad=0.1",
                                 facecolor='lightyellow', edgecolor='black', linewidth=2)
    ax.add_patch(compare_box)
    ax.text(5.25, 5.5, 'Pairwise\nComparison', ha='center', va='center', fontsize=12, fontweight='bold')
    ax.text(5.25, 4.8, 'Output A vs B', ha='center', va='center', fontsize=9)
    ax.text(5.25, 4.3, 'for each\ninstruction', ha='center', va='center', fontsize=9)
    
    # Judge LLM
    judge_box = FancyBboxPatch((4.25, 0.5), 2, 1.5, boxstyle="round,pad=0.1",
                               facecolor='gold', edgecolor='black', linewidth=2)
    ax.add_patch(judge_box)
    ax.text(5.25, 1.25, 'Judge LLM\n(GPT-4)', ha='center', va='center', fontsize=10, fontweight='bold')
    
    # Arrows
    ax.arrow(2.5, 4.75, 1.3, -0.3, head_width=0.1, head_length=0.1, fc='black', ec='black')
    ax.arrow(2.5, 2.75, 1.3, 0.7, head_width=0.1, head_length=0.1, fc='black', ec='black')
    ax.arrow(5.25, 3, 0, -0.8, head_width=0.1, head_length=0.1, fc='black', ec='black')
    
    # Stage 3: Results
    results_box = FancyBboxPatch((7.5, 3), 2, 3, boxstyle="round,pad=0.1",
                                 facecolor='lightgray', edgecolor='black', linewidth=2)
    ax.add_patch(results_box)
    ax.text(8.5, 5.5, 'Results', ha='center', va='center', fontsize=12, fontweight='bold')
    ax.text(8.5, 4.8, 'Win Rate: X%', ha='center', va='center', fontsize=9)
    ax.text(8.5, 4.3, 'Confidence: ±Y%', ha='center', va='center', fontsize=9)
    ax.text(8.5, 3.8, 'Length-controlled', ha='center', va='center', fontsize=9)
    ax.text(8.5, 3.3, 'win rate: Z%', ha='center', va='center', fontsize=9)
    
    # Arrow from judge to results
    ax.arrow(6.25, 1.25, 1.1, 2.2, head_width=0.1, head_length=0.1, fc='black', ec='black')
    
    # Add annotations
    ax.text(3, 7.5, '1. Generate\nResponses', ha='center', fontsize=9, style='italic')
    ax.text(5.25, 7, '2. Compare\nPairwise', ha='center', fontsize=9, style='italic')
    ax.text(8.5, 7, '3. Aggregate\nResults', ha='center', fontsize=9, style='italic')
    
    plt.tight_layout()
    plt.show()

visualize_alpaca_eval_pipeline()

In [None]:
# First, let's set up our OpenAI client
# You'll need to set your API key as an environment variable
# For Google Colab: use Secrets tab or os.environ['OPENAI_API_KEY'] = 'your-key'

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY", "your-api-key-here")
)

# Let's create a simple example dataset
evaluation_examples = [
    {
        "instruction": "Explain what recursion is in programming.",
        "output_a": "Recursion is when a function calls itself to solve a smaller instance of the same problem. It continues until reaching a base case that stops the recursion.",
        "output_b": "Recursion in programming is a technique where a function calls itself. It's like a loop but uses function calls. Each recursive call works on a smaller piece of the problem until it reaches a simple case that can be solved directly. For example, calculating factorial: factorial(5) = 5 * factorial(4), and so on until factorial(1) = 1."
    },
    {
        "instruction": "What is the capital of France?",
        "output_a": "The capital of France is Paris.",
        "output_b": "Paris is the capital city of France. It's located in the north-central part of the country and is known for landmarks like the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral."
    },
    {
        "instruction": "Write a haiku about coding.",
        "output_a": "Lines of logic flow\nBugs hide in syntax errors\nDebugger finds peace",
        "output_b": "Code"
    }
]

print(f"Loaded {len(evaluation_examples)} evaluation examples")

---

## 2. Basic LLM-as-Judge Implementation

Let's implement a simple pairwise evaluator. We'll create a function that asks an LLM to compare two outputs and decide which one is better.

In [None]:
def create_evaluation_prompt(instruction: str, output_a: str, output_b: str) -> str:
    """
    Create a prompt for the judge LLM to evaluate two outputs.
    
    Args:
        instruction: The original instruction/question
        output_a: First model's response
        output_b: Second model's response
    
    Returns:
        Formatted prompt for the judge
    """
    prompt = f"""You are a helpful assistant that evaluates the quality of AI model responses.

Given an instruction and two responses, determine which response is better.

Instruction: {instruction}

Response A: {output_a}

Response B: {output_b}

Please evaluate which response is better by considering:
1. Accuracy and correctness
2. Helpfulness and completeness
3. Clarity and coherence
4. Following the instruction properly

Output only "A" if Response A is better, "B" if Response B is better, or "TIE" if they are equally good.

Your verdict:"""
    
    return prompt


def evaluate_pair(instruction: str, output_a: str, output_b: str, 
                  model: str = "gpt-3.5-turbo") -> str:
    """
    Use an LLM to judge which output is better.
    
    Returns: "A", "B", or "TIE"
    """
    prompt = create_evaluation_prompt(instruction, output_a, output_b)
    
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful evaluation assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=0,  # Use deterministic output
            max_tokens=10
        )
        
        verdict = response.choices[0].message.content.strip().upper()
        
        # Validate the response
        if verdict not in ["A", "B", "TIE"]:
            print(f"Warning: Unexpected verdict '{verdict}'. Defaulting to TIE.")
            return "TIE"
        
        return verdict
    
    except Exception as e:
        print(f"Error during evaluation: {e}")
        return "TIE"


# Test our evaluator on the first example
example = evaluation_examples[0]
verdict = evaluate_pair(
    example["instruction"], 
    example["output_a"], 
    example["output_b"]
)

print(f"Instruction: {example['instruction']}")
print(f"\nVerdict: Output {verdict} is better")
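
The validation in `evaluate_pair` accepts only an exact `"A"`, `"B"`, or `"TIE"`, so a reply such as `"A."` or `"Response B"` falls back to a tie. A slightly more tolerant parser could look like the sketch below. The helper name `parse_verdict` and its rule (accept a reply only if it names exactly one verdict) are assumptions for this example.

```python
import re
from typing import Optional

# Match A, B, or TIE only when they appear as standalone words
_VERDICT_RE = re.compile(r"\b(TIE|A|B)\b")


def parse_verdict(raw: str) -> Optional[str]:
    """Return "A", "B", or "TIE" if the reply names exactly one verdict.

    Returns None for empty or ambiguous replies so the caller can decide
    whether to retry or to record a tie.
    """
    found = set(_VERDICT_RE.findall(raw.strip().upper()))
    if len(found) == 1:
        return found.pop()
    return None


for reply in ["A", "A.", "Response B", "Verdict: tie", "A or B", ""]:
    print(repr(reply), "->", parse_verdict(reply))
```

Returning `None` instead of silently defaulting to `"TIE"` keeps parse failures visible, so they are not mistaken for genuine ties when computing a win rate.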

### Running Evaluation on Multiple Examples

Now let's evaluate all our examples and see the results:

In [None]:
# Evaluate all examples
results = []

for i, example in enumerate(evaluation_examples):
    print(f"Evaluating example {i+1}/{len(evaluation_examples)}...")
    
    verdict = evaluate_pair(
        example["instruction"],
        example["output_a"],
        example["output_b"]
    )
    
    results.append({
        "instruction": example["instruction"],
        "output_a_length": len(example["output_a"]),
        "output_b_length": len(example["output_b"]),
        "verdict": verdict,
        "winner": "output_a" if verdict == "A" else ("output_b" if verdict == "B" else "tie")
    })
    
    # Add a small delay to avoid rate limits
    time.sleep(0.5)

# Convert to DataFrame for easier analysis
results_df = pd.DataFrame(results)
print("\n" + "="*50)
print("Evaluation Results:")
print(results_df[["instruction", "verdict", "output_a_length", "output_b_length"]])

---

## 3. Understanding Evaluation Prompts - The Heart of AlpacaEval

The quality of LLM-as-Judge evaluation depends heavily on prompt design. Let's explore prompts modeled on those used in AlpacaEval:

### AlpacaEval's Prompt Evolution

#### Version 1.0: Ranking-based Approach
AlpacaEval 1.0 asked the judge to rank outputs from best to worst. This had issues:
- Ties were ambiguous
- Position bias was strong (first option preferred)
- Hard to extract clean signal

#### Version 2.0: Binary Classification with Logprobs
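The breakthrough came with using token probabilities:
- Ask for a single-token response ('m' or 'M'; the simplified template below uses 'a' and 'b' instead)
- Use logprobs to turn the verdict into a continuous confidence score
- Provides the continuous preferences that length-controlled evaluation builds on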
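
To make the logprob idea concrete, the sketch below converts the log-probabilities of the two verdict tokens into a preference between 0 and 1. It illustrates the arithmetic only and is not AlpacaEval's code: the function name `preference_from_logprobs` is hypothetical, and the logprob values are made up rather than returned by an API.

```python
import math


def preference_from_logprobs(logprob_a: float, logprob_b: float) -> float:
    """Probability that the judge prefers response (b).

    The two token probabilities are renormalized so they sum to 1,
    ignoring any probability mass the judge put on other tokens.
    """
    # Subtract the max before exponentiating for numerical stability
    m = max(logprob_a, logprob_b)
    exp_a = math.exp(logprob_a - m)
    exp_b = math.exp(logprob_b - m)
    return exp_b / (exp_a + exp_b)


# Hypothetical judge output: P("a") = 0.2, P("b") = 0.6, rest on other tokens
p_b = preference_from_logprobs(math.log(0.2), math.log(0.6))
print(f"Preference for (b): {p_b:.2f}")
```

A value near 0.5 signals an uncertain judge, information that a hard "a"/"b" verdict throws away.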

### The Actual AlpacaEval Prompt Template

Here's a simplified version of the real prompt used in AlpacaEval 2.0:

In [None]:
# A simplified AlpacaEval 2.0 style prompt template (not the verbatim prompt)
ALPACA_EVAL_V2_PROMPT = """I need you to help me evaluate responses to instructions.

For the instruction below, which response is better?

Instruction: {instruction}

Response (a): {output_a}

Response (b): {output_b}

IMPORTANT: To ensure fair evaluation, please consider:
- Helpfulness: Does it solve the user's problem?
- Accuracy: Is the information correct?
- Clarity: Is it easy to understand?
- Completeness: Does it fully address the instruction?
- Conciseness: Avoid preferring longer answers just for being longer

Please respond with ONLY a single character:
- "a" if response (a) is better
- "b" if response (b) is better

Your answer: """

def create_alpaca_v2_style_prompt(instruction: str, output_a: str, output_b: str) -> str:
    """
    Create a prompt following AlpacaEval v2 style.
    Key innovations:
    1. Single token response for clean extraction
    2. Explicit instruction to avoid length bias
    3. Clear evaluation criteria
    """
    return ALPACA_EVAL_V2_PROMPT.format(
        instruction=instruction,
        output_a=output_a,
        output_b=output_b
    )

# A batch-evaluation prompt in the same style (illustrative, not AlpacaEval's verbatim prompt)
ALPACA_BATCH_PROMPT = """You are a helpful assistant that evaluates language model outputs.

You will see {batch_size} examples. For each example:
1. There is an instruction
2. There are two responses: (a) and (b)
3. You must choose which is better

## Evaluation Criteria
- Helpful: Addresses the user's needs
- Harmless: Avoids unsafe or inappropriate content  
- Honest: Provides accurate information
- Clear: Easy to understand
- Complete: Fully addresses the instruction

IMPORTANT: Do not prefer longer responses unless the length adds value.

{examples}

## Your Task
For each example, respond with just "a" or "b" on a new line.
No explanations needed.

Example 1:"""

# Let's see how the prompt structure affects evaluation
def compare_prompt_styles(instruction: str, output_a: str, output_b: str):
    """Compare different prompt styles and their effects."""
    
    # Style 1: Basic prompt (our original)
    basic_prompt = create_evaluation_prompt(instruction, output_a, output_b)
    
    # Style 2: AlpacaEval v2 style
    alpaca_prompt = create_alpaca_v2_style_prompt(instruction, output_a, output_b)
    
    # Style 3: Detailed criteria prompt
    detailed_prompt = f"""Evaluate these responses using specific criteria.

Instruction: {instruction}

Response A: {output_a}

Response B: {output_b}

Rate each response on:
1. Accuracy (factually correct?)
2. Helpfulness (solves the problem?)
3. Clarity (easy to understand?)
4. Safety (appropriate content?)
5. Completeness (fully addresses request?)

Consider all factors, then output only "A" or "B" for the better response.

Your verdict:"""
    
    return {
        "basic": basic_prompt,
        "alpaca_v2": alpaca_prompt,
        "detailed": detailed_prompt
    }

# Example to show prompt differences
example = {
    "instruction": "How do I make coffee?",
    "output_a": "Add hot water to coffee grounds.",
    "output_b": "To make coffee: 1) Boil water to 195-205°F, 2) Add 2 tablespoons of ground coffee per 6 oz of water, 3) Pour water over grounds, 4) Let steep for 4 minutes, 5) Filter and serve."
}

prompts = compare_prompt_styles(
    example["instruction"],
    example["output_a"],
    example["output_b"]
)

print("COMPARISON OF PROMPT STYLES")
print("="*60)
print("\n1. BASIC PROMPT (Simple):")
print("-"*40)
print(prompts["basic"][:200] + "...")

print("\n\n2. ALPACA V2 STYLE (Single token):")
print("-"*40)
print(prompts["alpaca_v2"][:300] + "...")

print("\n\n3. DETAILED CRITERIA:")
print("-"*40)
print(prompts["detailed"][:250] + "...")

---

## 4. Batch Evaluation for Efficiency - Real-World Optimization

Evaluating examples one by one is slow and expensive, and evaluating thousands of examples in production requires optimization. AlpacaEval uses batch evaluation to process multiple examples in a single API call, along with several other strategies:

### Key Optimization Strategies:

1. **Batching**: Process multiple examples in one API call
2. **Caching**: Store results to avoid re-evaluation
3. **Parallel Processing**: Use multiple API calls concurrently
4. **Smart Retries**: Handle failures gracefully
5. **Cost Optimization**: Balance batch size vs API limits

Let's implement a production-ready batch evaluator:

In [None]:
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Dict, Optional, Callable
import time

class OptimizedBatchEvaluator:
    """
    Production-ready batch evaluator with AlpacaEval-style optimizations.
    """
    
    def __init__(self, 
                 model: str = "gpt-3.5-turbo",
                 cache_dir: Optional[str] = "./eval_cache",
                 max_workers: int = 5,
                 max_retries: int = 3):
        self.model = model
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.cache = {}  # In-memory cache only; cache_dir is accepted but not used for persistence here
        self.max_workers = max_workers
        self.max_retries = max_retries
        
        # Batch size optimization based on model
        self.optimal_batch_sizes = {
            "gpt-3.5-turbo": 10,      # Can handle larger batches
            "gpt-4": 5,               # More expensive, smaller batches
            "gpt-4-turbo": 8,         # Balance of cost and capability
        }
    
    def _get_cache_key(self, instruction: str, output_a: str, output_b: str) -> str:
        """Generate deterministic cache key for an evaluation."""
        content = f"{instruction}|{output_a}|{output_b}|{self.model}"
        return hashlib.md5(content.encode()).hexdigest()
    
    def _create_optimized_batch_prompt(self, examples: List[Dict]) -> str:
        """Create an optimized batch prompt following AlpacaEval style."""
        # Use minimal tokens while maintaining clarity
        prompt = f"Evaluate {len(examples)} response pairs. Output only 'a' or 'b' for each.\n"
        prompt += "Criteria: helpful, accurate, clear. Avoid length bias.\n\n"
        
        for i, ex in enumerate(examples, 1):
            # Compact format to save tokens
            prompt += f"{i}. {ex['instruction'][:100]}{'...' if len(ex['instruction']) > 100 else ''}\n"
            prompt += f"a: {ex['output_a'][:200]}{'...' if len(ex['output_a']) > 200 else ''}\n"
            prompt += f"b: {ex['output_b'][:200]}{'...' if len(ex['output_b']) > 200 else ''}\n\n"
        
        prompt += "Verdicts (one per line):\n"
        return prompt
    
    def _evaluate_batch_with_retry(self, batch: List[Dict], attempt: int = 1) -> List[str]:
        """Evaluate a batch with retry logic."""
        if attempt > self.max_retries:
            return ["TIE"] * len(batch)
        
        try:
            prompt = self._create_optimized_batch_prompt(batch)
            
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": "You are an expert evaluator. Be concise."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0,
                max_tokens=len(batch) * 3,  # Allow for some extra tokens
                timeout=30  # Add timeout
            )
            
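            # Parse verdicts: use the last token on each non-empty line so that
            # formats like "a", "1. b", or "Example 2: a" are all handled.
            # (A substring check would misread "Example 1: b", which contains an "a".)
            response_text = response.choices[0].message.content.strip()
            verdicts = []
            
            for line in response_text.split('\n'):
                tokens = line.strip().lower().split()
                if not tokens:
                    continue
                token = tokens[-1].strip('()[]"\'.:,')
                if token in ('a', 'b'):
                    verdicts.append(token.upper())
            
            # Pad with TIEs if needed. Note: an unparseable line is skipped, which
            # shifts later verdicts up by one position within the batch.
            while len(verdicts) < len(batch):
                verdicts.append('TIE')
            
            return verdicts[:len(batch)]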
            
        except Exception as e:
            print(f"Batch evaluation failed (attempt {attempt}): {e}")
            time.sleep(2 ** attempt)  # Exponential backoff
            return self._evaluate_batch_with_retry(batch, attempt + 1)
    
    def evaluate_dataset_parallel(self, examples: List[Dict], 
                                 show_progress: bool = True) -> List[Dict]:
        """
        Evaluate dataset with parallel processing and optimizations.
        
        This mimics AlpacaEval's approach:
        1. Check cache first
        2. Batch remaining examples optimally
        3. Process batches in parallel
        4. Handle failures gracefully
        """
        results = [None] * len(examples)
        to_evaluate = []
        
        # Step 1: Check cache
        for i, example in enumerate(examples):
            cache_key = self._get_cache_key(
                example["instruction"],
                example["output_a"],
                example["output_b"]
            )
            
            if cache_key in self.cache:
                results[i] = self.cache[cache_key]
            else:
                to_evaluate.append((i, example))
        
        if show_progress:
            print(f"Found {len(examples) - len(to_evaluate)} cached results")
            print(f"Need to evaluate {len(to_evaluate)} examples")
        
        # Step 2: Optimal batching
        batch_size = self.optimal_batch_sizes.get(self.model, 5)
        batches = []
        
        for i in range(0, len(to_evaluate), batch_size):
            batch_indices = []
            batch_examples = []
            
            for j in range(i, min(i + batch_size, len(to_evaluate))):
                idx, example = to_evaluate[j]
                batch_indices.append(idx)
                batch_examples.append(example)
            
            batches.append((batch_indices, batch_examples))
        
        # Step 3: Parallel processing
        if show_progress:
            pbar = tqdm(total=len(batches), desc="Processing batches")
        
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            future_to_batch = {
                executor.submit(self._evaluate_batch_with_retry, batch_examples): (batch_indices, batch_examples)
                for batch_indices, batch_examples in batches
            }
            
            for future in as_completed(future_to_batch):
                batch_indices, batch_examples = future_to_batch[future]
                try:
                    verdicts = future.result()
                    
                    # Store results and update cache
                    for idx, example, verdict in zip(batch_indices, batch_examples, verdicts):
                        result = {
                            "instruction": example["instruction"],
                            "verdict": verdict,
                            "output_a_len": len(example["output_a"]),
                            "output_b_len": len(example["output_b"]),
                            "model": self.model
                        }
                        results[idx] = result
                        
                        # Update cache
                        cache_key = self._get_cache_key(
                            example["instruction"],
                            example["output_a"],
                            example["output_b"]
                        )
                        self.cache[cache_key] = result
                
                except Exception as e:
                    print(f"Batch failed completely: {e}")
                
                if show_progress:
                    pbar.update(1)
        
        if show_progress:
            pbar.close()
        
        return results
    
    def calculate_statistics(self, results: List[Dict]) -> Dict:
        """Calculate AlpacaEval-style statistics."""
        total = len(results)
        # Convention: output_b is the model under test, output_a is the reference
        wins = sum(1 for r in results if r and r["verdict"] == "B")
        losses = sum(1 for r in results if r and r["verdict"] == "A")
        ties = sum(1 for r in results if r and r["verdict"] == "TIE")
        
        # Win rate over decided comparisons only (ties excluded).
        # Note: AlpacaEval itself counts a tie as half a win.
        win_rate = wins / (wins + losses) if (wins + losses) > 0 else 0
        
        # Calculate average length ratio
        length_ratios = []
        for r in results:
            if r and r["output_a_len"] > 0:
                length_ratios.append(r["output_b_len"] / r["output_a_len"])
        
        avg_length_ratio = np.mean(length_ratios) if length_ratios else 1.0
        
        return {
            "total_examples": total,
            "wins": wins,
            "losses": losses,
            "ties": ties,
            "win_rate": win_rate,
            "win_rate_percentage": win_rate * 100,
            "avg_length_ratio": avg_length_ratio,
            "evaluation_model": self.model
        }


# Demonstrate the optimized evaluator
print("OPTIMIZED BATCH EVALUATION DEMO")
print("="*60)

# Create a larger test dataset
test_dataset = []
instructions = [
    "Explain quantum computing",
    "What is machine learning?",
    "How do I make pasta?",
    "What is the meaning of life?",
    "Explain photosynthesis",
    "What is blockchain?",
    "How do neural networks work?",
    "What is climate change?",
    "Explain the water cycle",
    "What is artificial intelligence?"
]

for i, instruction in enumerate(instructions):
    test_dataset.append({
        "instruction": instruction,
        "output_a": f"Short answer to: {instruction}",
        "output_b": f"This is a detailed explanation about {instruction}. " * 3
    })

# Initialize and run evaluator
evaluator = OptimizedBatchEvaluator(
    model="gpt-3.5-turbo",
    max_workers=3
)

# Run evaluation
results = evaluator.evaluate_dataset_parallel(test_dataset)

# Calculate and display statistics
stats = evaluator.calculate_statistics(results)

print("\nEVALUATION STATISTICS")
print("="*40)
for key, value in stats.items():
    if isinstance(value, float):
        print(f"{key}: {value:.3f}")
    else:
        print(f"{key}: {value}")

# Show a few example results
print("\nSAMPLE RESULTS")
print("="*40)
for i in range(min(3, len(results))):
    r = results[i]
    if r:
        print(f"\nExample {i+1}:")
        print(f"  Instruction: {r['instruction'][:50]}...")
        print(f"  Verdict: {r['verdict']}")
        print(f"  Length ratio: {r['output_b_len']/r['output_a_len']:.2f}")
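
`OptimizedBatchEvaluator` accepts a `cache_dir` argument but keeps results only in an in-memory dictionary, so cached verdicts are lost when the kernel restarts. One way to persist them is sketched below. The class `JsonFileCache` and its one-JSON-file-per-key layout are assumptions for this example and are not part of AlpacaEval or of the evaluator above.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Optional


class JsonFileCache:
    """Minimal disk cache: one small JSON file per evaluation result."""

    def __init__(self, cache_dir: str):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def make_key(*parts: str) -> str:
        # JSON-encode the parts so that ("a|b", "c") and ("a", "b|c")
        # cannot collide the way a plain "|"-joined string could
        payload = json.dumps(parts, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[dict]:
        path = self.dir / f"{key}.json"
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding="utf-8"))

    def set(self, key: str, value: dict) -> None:
        path = self.dir / f"{key}.json"
        path.write_text(json.dumps(value), encoding="utf-8")


# Demo in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    cache = JsonFileCache(tmp)
    key = cache.make_key("What is 2+2?", "4", "Four", "gpt-3.5-turbo")
    print("Before:", cache.get(key))
    cache.set(key, {"verdict": "A"})
    print("After:", cache.get(key))
```

When wiring something like this into an evaluator, it is probably wise to skip caching verdicts that came from a failed API call, so that a transient error is not replayed on the next run.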

In [None]:
# Let's demonstrate common biases in LLM evaluation

# 1. Length Bias - LLMs often prefer longer responses
length_bias_examples = [
    {
        "instruction": "What is the capital of Japan?",
        "output_a": "Tokyo",
        "output_b": "The capital of Japan is Tokyo, which is located on the eastern coast of Honshu, the largest of Japan's four main islands. Tokyo has been Japan's capital since 1868, when it was renamed from Edo during the Meiji Restoration. It's not only the political center but also the economic and cultural heart of Japan, with a metropolitan area population exceeding 37 million people, making it the world's most populous metropolitan area."
    },
    {
        "instruction": "What color is the sky?",
        "output_a": "Blue",
        "output_b": "The sky appears blue during the day due to a phenomenon called Rayleigh scattering. When sunlight enters Earth's atmosphere, it collides with gas molecules. Blue light waves are shorter than other colors, so they are scattered in all directions by the tiny molecules of air in Earth's atmosphere. This is why we perceive the sky as blue during clear daylight hours."
    }
]

# 2. Position Bias - Order can affect judgment
def test_position_bias(instruction: str, output_1: str, output_2: str):
    """Test if the order of responses affects evaluation."""
    # Test both orders
    verdict_ab = evaluate_pair(instruction, output_1, output_2)
    verdict_ba = evaluate_pair(instruction, output_2, output_1)
    
    # Convert verdicts for comparison
    if verdict_ba == "A":
        verdict_ba_converted = "B"
    elif verdict_ba == "B":
        verdict_ba_converted = "A"
    else:
        verdict_ba_converted = "TIE"
    
    return {
        "order_AB": verdict_ab,
        "order_BA_converted": verdict_ba_converted,
        "consistent": verdict_ab == verdict_ba_converted
    }

# 3. Analyze length bias
print("Testing Length Bias:")
print("="*50)
for example in length_bias_examples:
    verdict = evaluate_pair(example["instruction"], example["output_a"], example["output_b"])
    print(f"\nInstruction: {example['instruction']}")
    print(f"Output A length: {len(example['output_a'])} chars")
    print(f"Output B length: {len(example['output_b'])} chars")
    print(f"Verdict: {verdict}")
    print(f"Longer response preferred: {'Yes' if verdict == 'B' else 'No'}")

# 4. Test position bias
print("\n\nTesting Position Bias:")
print("="*50)
position_example = {
    "instruction": "What is machine learning?",
    "output_1": "Machine learning is a subset of AI that enables systems to learn from data.",
    "output_2": "Machine learning allows computers to learn patterns from data without explicit programming."
}

position_result = test_position_bias(
    position_example["instruction"],
    position_example["output_1"],
    position_example["output_2"]
)

print(f"Instruction: {position_example['instruction']}")
print(f"When A is first: {position_result['order_AB']}")
print(f"When B is first (converted): {position_result['order_BA_converted']}")
print(f"Consistent across orders: {position_result['consistent']}")
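
A common way to reduce position bias, rather than only detect it, is to judge each pair in both orders and keep a verdict only when the two runs agree. The sketch below shows the idea with stand-in judges so that it runs without API calls. The function `debiased_verdict`, its `"1"`/`"2"`/`"TIE"` return labels, and both toy judges are hypothetical; a function with the same signature as `evaluate_pair` could be passed as `judge` instead.

```python
from typing import Callable

# A judge takes (instruction, output_a, output_b) and returns "A", "B", or "TIE"
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, instruction: str,
                     output_1: str, output_2: str) -> str:
    """Return "1", "2", or "TIE" after judging the pair in both orders."""
    first = judge(instruction, output_1, output_2)   # output_1 shown as A
    second = judge(instruction, output_2, output_1)  # output_1 shown as B

    # Map the second verdict back into the first run's labels
    second_mapped = {"A": "B", "B": "A"}.get(second, "TIE")

    if first == second_mapped and first in ("A", "B"):
        return "1" if first == "A" else "2"
    return "TIE"  # disagreement across orders is treated as no preference


# Toy judges for demonstration only
def always_first(instruction: str, a: str, b: str) -> str:
    return "A"  # a maximally position-biased judge


def prefers_longer(instruction: str, a: str, b: str) -> str:
    if len(a) == len(b):
        return "TIE"
    return "A" if len(a) > len(b) else "B"


print(debiased_verdict(always_first, "q", "short", "a longer answer"))
print(debiased_verdict(prefers_longer, "q", "short", "a longer answer"))
```

The position-biased judge collapses to a tie, while the length-biased judge still picks the longer output in both orders. Swapping the order removes position bias only, so length bias needs separate handling.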

### Analyzing Biases Statistically

Let's create a more comprehensive analysis of how output length affects evaluation outcomes:

In [None]:
# Create a dataset to analyze length bias
def create_length_bias_dataset():
    """Create examples with varying length differences."""
    examples = []
    
    # Same content, different lengths
    base_answers = [
        ("2+2=4", "The sum of 2 and 2 equals 4, which is a basic arithmetic fact."),
        ("Paris", "Paris is the capital city of France, known for the Eiffel Tower."),
        ("H2O", "Water has the chemical formula H2O, consisting of hydrogen and oxygen."),
        ("1969", "The moon landing occurred in 1969, a historic achievement for humanity."),
        ("Python", "Python is a popular programming language known for its readability.")
    ]
    
    instructions = [
        "What is 2+2?",
        "What is the capital of France?",
        "What is the chemical formula for water?",
        "When was the moon landing?",
        "Name a popular programming language."
    ]
    
    for instruction, (short, long) in zip(instructions, base_answers):
        examples.append({
            "instruction": instruction,
            "output_a": short,
            "output_b": long,
            "length_diff": len(long) - len(short)
        })
    
    return examples

# Analyze length bias
length_examples = create_length_bias_dataset()
length_results = []

print("Analyzing Length Bias Across Examples:")
print("="*50)

for example in length_examples:
    verdict = evaluate_pair(
        example["instruction"],
        example["output_a"],
        example["output_b"]
    )
    
    length_results.append({
        "instruction": example["instruction"],
        "verdict": verdict,
        "length_diff": example["length_diff"],
        "preferred_longer": verdict == "B"
    })
    
    time.sleep(0.5)  # Rate limiting

# Convert to DataFrame for analysis
length_df = pd.DataFrame(length_results)

# Calculate statistics
prefer_longer_rate = length_df["preferred_longer"].mean()
print(f"\nProportion preferring longer response: {prefer_longer_rate:.2%}")
print(f"Average length difference: {length_df['length_diff'].mean():.1f} characters")

# Visualize the results
plt.figure(figsize=(10, 6))

# Plot 1: Bar chart of preferences
plt.subplot(1, 2, 1)
verdict_counts = length_df["verdict"].value_counts()
plt.bar(verdict_counts.index, verdict_counts.values)
plt.title("Distribution of Verdicts")
plt.xlabel("Verdict")
plt.ylabel("Count")

# Plot 2: Length difference vs preference
plt.subplot(1, 2, 2)
plt.scatter(length_df["length_diff"], 
           length_df["preferred_longer"].astype(int),
           alpha=0.6, s=100)
plt.xlabel("Length Difference (chars)")
plt.ylabel("Preferred Longer (1=Yes, 0=No)")
plt.title("Length Difference vs Preference")

plt.tight_layout()
plt.show()

# Additional analysis: List biases
print("\n" + "="*50)
print("Other Common Biases in LLM Evaluation:")
print("="*50)
print("""
1. Self-Preference Bias: Judges tend to favor outputs from their own model family or written in a similar style
2. Format Bias: Preference for bullet points, numbered lists, or structured formats
3. Verbosity Bias: Beyond raw length, preference for elaborate explanations
4. Recency Bias: Information appearing later in the prompt may be weighted more heavily
5. Instruction-Following Bias: Judges differ on whether a strict or a helpful interpretation of the instruction is rewarded
""")

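With only five examples, the observed prefer-longer rate is a noisy estimate, so it helps to attach an interval before reading it as evidence of bias. The sketch below computes a Wilson score interval for a proportion. `wilson_interval` is a hypothetical helper and the counts are illustrative, not results from this notebook. If the interval comfortably contains 0.5, the sample is too small to conclude much either way.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Illustrative counts: the longer answer was preferred in 4 of 5 comparisons
low, high = wilson_interval(successes=4, n=5)
print(f"4/5 longer preferred -> approx. 95% interval [{low:.2f}, {high:.2f}]")
print("Interval contains 0.5:", low <= 0.5 <= high)
```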
---

## 6. Understanding Logprobs and Length-Controlled Evaluation

### What are Logprobs?

Logprobs (log probabilities) are the natural logarithm of the probability that a model assigns to each token. They tell us how confident the model is about its choice.

**Key Concepts:**
- **Probability**: How likely the model thinks a token is (0 to 1)
- **Log Probability**: The natural log of the probability (always ≤ 0; closer to 0 means more likely)
- **Why logs?**: Prevents numerical underflow and makes math easier

```
Example:
Token "A" probability = 0.7  → logprob = ln(0.7) = -0.357
Token "B" probability = 0.3  → logprob = ln(0.3) = -1.204

Higher logprob (closer to 0) = More confident
```
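
To make the conversion concrete, the short sketch below turns logprobs back into probabilities and renormalizes over the two candidate tokens. Renormalizing matters because the judge usually leaves some probability mass on other tokens, so the two raw probabilities rarely sum to exactly 1. `normalized_preference` is a hypothetical helper name used only for this illustration.

```python
import math

def normalized_preference(logprob_a: float, logprob_b: float) -> float:
    """Return P(A) renormalized over the two candidate tokens only."""
    p_a, p_b = math.exp(logprob_a), math.exp(logprob_b)
    return p_a / (p_a + p_b)

# Logprobs taken from the example above
print(round(math.exp(-0.357), 3))                        # probability of token "A"
print(round(normalized_preference(-0.357, -1.204), 3))   # P(A) among {A, B}

# When mass leaks to other tokens, renormalizing changes the number
print(round(normalized_preference(math.log(0.6), math.log(0.2)), 3))
```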

### How AlpacaEval Uses Logprobs

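AlpacaEval 2.0 reads the judge's logprobs to turn each verdict into a continuous preference rather than a hard A/B label. Its length-controlled win rate is then obtained by fitting a regression model that includes a length-difference term and predicting the preference as if both outputs had the same length. The cell below is a simplified heuristic in that spirit, not AlpacaEval's actual formula: it discounts the judge's confidence whenever the preferred output is also the longer one: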

In [None]:
def detailed_length_controlled_evaluation(instruction: str, output_a: str, output_b: str,
                                         model: str = "gpt-3.5-turbo") -> Dict:
    """
    Simplified illustration of logprob-based, length-aware judging.
    
    This is a teaching heuristic, not AlpacaEval's actual formula.
    The key insight: if the judge is very confident (high probability) about 
    preferring the longer output, we should be suspicious of length bias.
    """
    
    # AlpacaEval v2 uses single-token responses
    prompt = f"""I need you to help me evaluate responses to instructions.

Instruction: {instruction}

Response m: {output_a}

Response M: {output_b}

Please respond with just 'm' or 'M' to indicate which is better.
IMPORTANT: Focus on quality, not length.

Your answer:"""
    
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=1,
            logprobs=True,
            top_logprobs=5  # Get top 5 token probabilities
        )
        
        # Extract the verdict and logprobs
        verdict_token = response.choices[0].message.content.strip()
        
        # Get logprobs for both options
        logprobs_data = {}
        if hasattr(response.choices[0], 'logprobs') and response.choices[0].logprobs:
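            for item in response.choices[0].logprobs.content[0].top_logprobs:
                token = item.token.strip()  # tokens can carry leading whitespace
                # Keep only the first match per label so duplicates do not overwrite it
                if token in ['m', 'M'] and token not in logprobs_data: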
                    logprobs_data[token] = {
                        'logprob': item.logprob,
                        'prob': np.exp(item.logprob)
                    }
        
        # Calculate length features
        len_a, len_b = len(output_a), len(output_b)
        length_ratio = len_b / len_a if len_a > 0 else 1.0
        log_length_ratio = np.log(length_ratio) if length_ratio > 0 else 0
        
        # Illustrative length penalty (a heuristic, not AlpacaEval's actual formula)
        # The idea: discount the judge's confidence when it prefers the longer output
        if verdict_token == 'M' and length_ratio > 1:
            # Judge prefers B, and B is the longer output
            # Reduce confidence as the length gap grows
            base_prob = logprobs_data.get('M', {}).get('prob', 0.5)
            
            # Penalty falls below 1.0 as the log length ratio grows
            length_penalty = 1 / (1 + 0.5 * log_length_ratio)
            adjusted_prob = base_prob * length_penalty
            
        elif verdict_token == 'm' and length_ratio < 1:
            # Model prefers the longer output (A is longer)
            base_prob = logprobs_data.get('m', {}).get('prob', 0.5)
            length_penalty = 1 / (1 + 0.5 * abs(log_length_ratio))
            adjusted_prob = base_prob * length_penalty
            
        else:
            # No length adjustment needed
            base_prob = logprobs_data.get(verdict_token, {}).get('prob', 0.5)
            adjusted_prob = base_prob
            length_penalty = 1.0
        
        return {
            "verdict": {"m": "A", "M": "B"}.get(verdict_token, "TIE"),  # unexpected tokens count as a tie
            "verdict_token": verdict_token,
            "logprobs_m": logprobs_data.get('m', {}).get('logprob', None),
            "logprobs_M": logprobs_data.get('M', {}).get('logprob', None),
            "prob_m": logprobs_data.get('m', {}).get('prob', None),
            "prob_M": logprobs_data.get('M', {}).get('prob', None),
            "base_confidence": base_prob,
            "adjusted_confidence": adjusted_prob,
            "length_penalty": length_penalty,
            "length_ratio": length_ratio,
            "log_length_ratio": log_length_ratio
        }
        
    except Exception as e:
        print(f"Error: {e}")
        return {"verdict": "TIE", "error": str(e)}


# Demonstrate with examples
print("DETAILED LOGPROB-BASED EVALUATION")
print("="*60)

test_cases = [
    {
        "name": "Clear quality difference",
        "instruction": "What is 2+2?",
        "output_a": "4",
        "output_b": "2+2 equals 4"
    },
    {
        "name": "Length bias test",
        "instruction": "What is the capital of France?",
        "output_a": "Paris",
        "output_b": "The capital of France is Paris, which is located in the northern part of the country along the Seine River. It has been the capital since 987 AD and is known for landmarks like the Eiffel Tower."
    }
]

for test in test_cases:
    print(f"\nTest: {test['name']}")
    print("-"*40)
    result = detailed_length_controlled_evaluation(
        test["instruction"],
        test["output_a"],
        test["output_b"]
    )
    
    if "error" not in result:
        print(f"Verdict: {result['verdict']} (token: '{result['verdict_token']}')")
        print(f"Length ratio (B/A): {result['length_ratio']:.2f}")
        print(f"\nProbabilities:")
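        # A token missing from the top-5 logprobs is reported as None, so format defensively
        prob_m = f"{result['prob_m']:.3f}" if result['prob_m'] is not None else "n/a"
        prob_M = f"{result['prob_M']:.3f}" if result['prob_M'] is not None else "n/a"
        print(f"  P(m|context) = {prob_m} (prefers A)")
        print(f"  P(M|context) = {prob_M} (prefers B)")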
        print(f"\nLength Control:")
        print(f"  Base confidence: {result['base_confidence']:.3f}")
        print(f"  Length penalty: {result['length_penalty']:.3f}")
        print(f"  Adjusted confidence: {result['adjusted_confidence']:.3f}")
    
    time.sleep(1)


# Visualize the relationship between length ratio and penalty
def visualize_length_penalty_formula():
    """Plot the illustrative length penalty used above (a heuristic, not AlpacaEval's actual method)."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
    
    # Plot 1: Length penalty function
    length_ratios = np.linspace(0.1, 5, 100)
    log_ratios = np.log(length_ratios)
    penalties = 1 / (1 + 0.5 * np.abs(log_ratios))
    
    ax1.plot(length_ratios, penalties, linewidth=2, label='Length Penalty')
    ax1.axvline(x=1.0, color='red', linestyle='--', alpha=0.5, label='Equal Length')
    ax1.fill_between(length_ratios, penalties, alpha=0.3)
    ax1.set_xlabel('Length Ratio (Output B / Output A)')
    ax1.set_ylabel('Penalty Factor')
    ax1.set_title('Illustrative Length Penalty Function')
    ax1.grid(True, alpha=0.3)
    ax1.legend()
    ax1.set_xlim(0, 5)
    
    # Plot 2: Effect on confidence
    base_confidences = [0.6, 0.7, 0.8, 0.9]
    
    for base_conf in base_confidences:
        adjusted = base_conf * penalties
        ax2.plot(length_ratios, adjusted, linewidth=2, 
                label=f'Base: {base_conf}')
    
    ax2.axvline(x=1.0, color='red', linestyle='--', alpha=0.5)
    ax2.set_xlabel('Length Ratio (Output B / Output A)')
    ax2.set_ylabel('Adjusted Confidence')
    ax2.set_title('How Length Affects Final Confidence')
    ax2.grid(True, alpha=0.3)
    ax2.legend()
    ax2.set_xlim(0, 5)
    ax2.set_ylim(0, 1)
    
    plt.tight_layout()
    plt.show()

visualize_length_penalty_formula()
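
The penalty above adjusts each verdict in isolation. A regression-based alternative, loosely following the idea behind length-controlled AlpacaEval, corrects at the dataset level: fit a logistic model of the judge's preference that includes a length-difference feature, then read off the win rate the model predicts when that feature is zero. The sketch below is a minimal version under stated assumptions: the data is synthetic, the `tanh` squashing and plain gradient descent are choices made for this illustration, and `fit_length_controlled_win_rate` is a hypothetical name. The real method includes additional terms and should be taken from the AlpacaEval repository.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_length_controlled_win_rate(prefs, length_diffs, lr=0.5, steps=5000):
    """
    Fit P(model wins) = sigmoid(bias + beta * x), where x is a squashed,
    standardized length difference, then report the win rate predicted at x = 0.
    """
    prefs = np.asarray(prefs, dtype=float)         # 1.0 = model preferred, 0.0 = baseline preferred
    diffs = np.asarray(length_diffs, dtype=float)  # len(model output) - len(baseline output)
    x = np.tanh(diffs / (diffs.std() + 1e-8))      # squash extreme length gaps
    bias, beta = 0.0, 0.0
    for _ in range(steps):                         # plain gradient descent on the log loss
        err = sigmoid(bias + beta * x) - prefs
        bias -= lr * err.mean()
        beta -= lr * (err * x).mean()
    return {
        "raw_win_rate": float(prefs.mean()),
        "lc_win_rate": float(sigmoid(bias)),       # predicted win rate at zero length difference
        "beta": float(beta),                       # how strongly length drives the judge
    }

# Synthetic data: the model's outputs are usually longer, and the simulated judge
# rewards length on top of a true 50% quality-based win rate.
rng = np.random.default_rng(0)
length_diffs = rng.normal(loc=100, scale=150, size=2000)
true_logit = 0.0 + 1.5 * np.tanh(length_diffs / length_diffs.std())
prefs = (rng.random(2000) < sigmoid(true_logit)).astype(float)

result = fit_length_controlled_win_rate(prefs, length_diffs)
print({k: round(v, 3) for k, v in result.items()})
```

In this simulation the raw win rate is inflated by the judge's taste for length, while the length-controlled estimate should land near the 50% that was built into the synthetic data.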

---

## 7. Practical Applications and Best Practices

Let's create a complete evaluation pipeline that incorporates everything we've learned:

In [None]:
@dataclass
class EvaluationConfig:
    """Configuration for LLM evaluation."""
    model: str = "gpt-3.5-turbo"
    temperature: float = 0.0
    use_length_control: bool = True
    check_position_bias: bool = True
    batch_size: int = 5
    
    
class LLMJudgeEvaluator:
    """Complete LLM-as-Judge evaluation system."""
    
    def __init__(self, config: EvaluationConfig):
        self.config = config
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.results = []
        
    def evaluate_single(self, instruction: str, output_a: str, output_b: str) -> Dict:
        """Evaluate a single example with all checks."""
        result = {
            "instruction": instruction,
            "output_a_length": len(output_a),
            "output_b_length": len(output_b),
        }
        
        # Basic evaluation
        basic_verdict = self._get_verdict(instruction, output_a, output_b)
        result["verdict"] = basic_verdict
        
        # Length-controlled evaluation
        if self.config.use_length_control:
            lc_result = self._length_controlled_eval(instruction, output_a, output_b)
            result.update(lc_result)
        
        # Position bias check
        if self.config.check_position_bias:
            position_check = self._check_position_bias(instruction, output_a, output_b)
            result["position_consistent"] = position_check
        
        return result
    
    def _get_verdict(self, instruction: str, output_a: str, output_b: str) -> str:
        """Get basic verdict."""
        prompt = create_alpaca_style_prompt(instruction, output_a, output_b)
        
        try:
            response = self.client.chat.completions.create(
                model=self.config.model,
                messages=[{"role": "user", "content": prompt}],
                temperature=self.config.temperature,
                max_tokens=10
            )
            
            verdict = response.choices[0].message.content.strip().upper()
            return verdict if verdict in ["A", "B"] else "TIE"
            
        except Exception as e:
            print(f"Error: {e}")
            return "TIE"
    
    def _length_controlled_eval(self, instruction: str, output_a: str, output_b: str) -> Dict:
        """Perform length-controlled evaluation."""
        return length_controlled_evaluation(instruction, output_a, output_b, self.config.model)
    
    def _check_position_bias(self, instruction: str, output_a: str, output_b: str) -> bool:
        """Check for position bias."""
        verdict_ab = self._get_verdict(instruction, output_a, output_b)
        verdict_ba = self._get_verdict(instruction, output_b, output_a)
        
        # Convert verdict_ba for comparison
        if verdict_ba == "A":
            verdict_ba = "B"
        elif verdict_ba == "B":
            verdict_ba = "A"
            
        return verdict_ab == verdict_ba
    
    def evaluate_dataset(self, examples: List[Dict]) -> pd.DataFrame:
        """Evaluate a full dataset."""
        results = []
        
        for i, example in enumerate(tqdm(examples, desc="Evaluating")):
            result = self.evaluate_single(
                example["instruction"],
                example["output_a"],
                example["output_b"]
            )
            results.append(result)
            
            # Rate limiting
            time.sleep(0.5)
        
        self.results = results
        return pd.DataFrame(results)
    
    def generate_report(self) -> Dict:
        """Generate evaluation report with statistics."""
        df = pd.DataFrame(self.results)
        
        report = {
            "total_examples": len(df),
            "verdict_distribution": df["verdict"].value_counts().to_dict(),
            "avg_length_ratio": (df["output_b_length"] / df["output_a_length"]).mean(),
        }
        
        if "position_consistent" in df.columns:
            report["position_consistency_rate"] = df["position_consistent"].mean()
        
        if "length_adjusted" in df.columns:
            report["length_adjustments_made"] = df["length_adjusted"].sum()
        
        # Analyze preference by length
        df["longer_output"] = df.apply(
            lambda x: "A" if x["output_a_length"] > x["output_b_length"] else "B", 
            axis=1
        )
        df["preferred_longer"] = df["verdict"] == df["longer_output"]
        report["prefer_longer_rate"] = df["preferred_longer"].mean()
        
        return report


# Create and test the complete evaluator
config = EvaluationConfig(
    use_length_control=True,
    check_position_bias=False,  # Disable for speed in demo
    batch_size=3
)

evaluator = LLMJudgeEvaluator(config)

# Create test dataset
test_dataset = [
    {
        "instruction": "What is Python?",
        "output_a": "Python is a high-level programming language.",
        "output_b": "Python is a versatile, high-level programming language known for its simple syntax and readability. It supports multiple programming paradigms and has a vast ecosystem."
    },
    {
        "instruction": "Calculate 15 + 27",
        "output_a": "15 + 27 = 42",
        "output_b": "42"
    },
    {
        "instruction": "Name three primary colors",
        "output_a": "Red, blue, and yellow are the three primary colors.",
        "output_b": "The primary colors are:\n1. Red\n2. Blue\n3. Yellow"
    }
]

# Run evaluation
print("Running Complete Evaluation Pipeline...")
print("="*50)
results_df = evaluator.evaluate_dataset(test_dataset)

# Generate report
report = evaluator.generate_report()

print("\nEvaluation Report:")
print("="*50)
for key, value in report.items():
    print(f"{key}: {value}")

# Show detailed results
print("\nDetailed Results:")
print(results_df[["instruction", "verdict", "output_a_length", "output_b_length"]].to_string())
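
The `_get_verdict` method above accepts only an exact "A" or "B" and turns everything else into a tie, so replies such as "A." or "Output (B)" are silently lost. A slightly more forgiving parser is sketched below. `parse_verdict` is a hypothetical helper, not part of the pipeline above, and its regex is deliberately conservative: it returns a label only when exactly one of the two letters appears as a standalone word.

```python
import re

def parse_verdict(raw: str) -> str:
    """Map a free-form judge reply to "A", "B", or "TIE"."""
    text = raw.strip().upper()
    if text in {"A", "B"}:
        return text
    # Accept variants such as "Output (B)" or "A." but only when unambiguous
    found = set(re.findall(r"\b([AB])\b", text))
    return found.pop() if len(found) == 1 else "TIE"

for reply in ["A", " b ", "Output (B)", "A.", "Both A and B are fine", "neither"]:
    print(repr(reply), "->", parse_verdict(reply))
```

One limitation of this sketch: because the reply is uppercased, the English article "a" counts as a standalone "A", so a sentence like "B is a better answer" is reported as a tie. Keeping the judge's reply to a single token, as in the logprob-based cell earlier, avoids the problem altogether.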

---

## 8. When to Use LLM vs Human Evaluation

Understanding when to use each evaluation method is crucial for effective model development:

In [None]:
# Create a comparison framework
evaluation_comparison = pd.DataFrame({
    "Aspect": [
        "Speed",
        "Cost", 
        "Consistency",
        "Scalability",
        "Nuanced Understanding",
        "Cultural Sensitivity",
        "Domain Expertise",
        "Bias Detection",
        "Reproducibility",
        "Feedback Quality"
    ],
    "LLM Evaluation": [
        "Very Fast (seconds)",
        "Low ($0.001-0.01 per eval)",
        "High (near-deterministic at temperature 0)",
        "Excellent (thousands/hour)",
        "Good for general tasks",
        "Limited",
        "General knowledge only",
        "Can perpetuate biases",
        "High (with a pinned model version)",
        "Structured but generic"
    ],
    "Human Evaluation": [
        "Slow (minutes-hours)",
        "High ($0.10-1.00 per eval)",
        "Variable (inter-rater differences)",
        "Poor (dozens/day)",
        "Excellent",
        "High with diverse evaluators",
        "Can use domain experts",
        "Better bias awareness",
        "Difficult",
        "Rich and specific"
    ]
})

print("LLM vs Human Evaluation Comparison")
print("="*80)
print(evaluation_comparison.to_string(index=False))

# Decision framework
print("\n\nDecision Framework:")
print("="*50)
print("""
Use LLM Evaluation When:
✓ Rapid iteration during development
✓ Large-scale evaluation needed (1000+ examples)
✓ Comparing similar models/prompts
✓ Budget constraints
✓ Need reproducible results
✓ General quality assessment

Use Human Evaluation When:
✓ Final model validation
✓ Safety-critical applications
✓ Creative or subjective tasks
✓ Domain-specific evaluation
✓ Detecting subtle biases
✓ Understanding user preferences

Hybrid Approach (Recommended):
1. Use LLM evaluation for rapid development cycles
2. Validate with human evaluation on a subset
3. Calibrate LLM judge based on human feedback
4. Use human evaluation for final validation
""")

# Practical example: Agreement analysis
def analyze_human_llm_agreement(human_verdicts: List[str], llm_verdicts: List[str]) -> Dict:
    """Analyze agreement between human and LLM evaluators."""
    assert len(human_verdicts) == len(llm_verdicts)
    
    agreements = [h == l for h, l in zip(human_verdicts, llm_verdicts)]
    agreement_rate = sum(agreements) / len(agreements)
    
    # Calculate Cohen's Kappa (simple version)
    from collections import Counter
    
    # Observed agreement
    po = agreement_rate
    
    # Expected agreement by chance
    human_counts = Counter(human_verdicts)
    llm_counts = Counter(llm_verdicts)
    n = len(human_verdicts)
    
    pe = sum((human_counts[k] * llm_counts[k]) / (n * n) for k in set(human_verdicts))
    
    # Cohen's Kappa
    kappa = (po - pe) / (1 - pe) if pe < 1 else 0
    
    return {
        "agreement_rate": agreement_rate,
        "cohens_kappa": kappa,
        "total_examples": len(human_verdicts),
        "disagreements": sum(1 for a in agreements if not a)
    }

# Simulate human vs LLM agreement
np.random.seed(42)
n_examples = 20

# Simulate verdicts (with some correlation)
human_verdicts = np.random.choice(["A", "B"], size=n_examples, p=[0.6, 0.4])
llm_verdicts = []

for h in human_verdicts:
    if np.random.random() < 0.8:  # 80% agreement
        llm_verdicts.append(h)
    else:
        llm_verdicts.append("B" if h == "A" else "A")

# Analyze agreement
agreement_stats = analyze_human_llm_agreement(
    human_verdicts.tolist(), 
    llm_verdicts
)

print("\n\nHuman-LLM Agreement Analysis:")
print("="*50)
for key, value in agreement_stats.items():
    print(f"{key}: {value:.3f}" if isinstance(value, float) else f"{key}: {value}")

# Visualize agreement
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

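# Confusion-matrix style plot (pd.crosstab avoids an extra scikit-learn dependency,
# which is not in the pip install list at the top of the notebook)
cm = pd.crosstab(
    pd.Series(human_verdicts, name="human"),
    pd.Series(llm_verdicts, name="llm")
).reindex(index=["A", "B"], columns=["A", "B"], fill_value=0)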

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax1,
            xticklabels=["A", "B"], yticklabels=["A", "B"])
ax1.set_xlabel('LLM Verdict')
ax1.set_ylabel('Human Verdict')
ax1.set_title('Human vs LLM Agreement Matrix')

# Agreement over examples
agreements = [h == l for h, l in zip(human_verdicts, llm_verdicts)]
ax2.bar(range(len(agreements)), [int(a) for a in agreements], alpha=0.7)  # cast booleans to 0/1 bar heights
ax2.set_xlabel('Example Index')
ax2.set_ylabel('Agreement (1=Yes, 0=No)')
ax2.set_title('Agreement Pattern Across Examples')
ax2.axhline(y=agreement_stats["agreement_rate"], color='r', 
            linestyle='--', label=f'Avg: {agreement_stats["agreement_rate"]:.2f}')
ax2.legend()

plt.tight_layout()
plt.show()
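
To see why Cohen's kappa is reported alongside the raw agreement rate, it helps to work one case by hand. The sketch below uses made-up verdicts (not results from this notebook): 20 examples where the human and the LLM each choose "A" 12 times and agree on 16 examples.

```python
from collections import Counter

# Made-up verdicts for illustration: 10 A/A, 2 A/B, 2 B/A, 6 B/B
human = ["A"] * 12 + ["B"] * 8
llm = ["A"] * 10 + ["B"] * 2 + ["A"] * 2 + ["B"] * 6

n = len(human)
p_o = sum(h == m for h, m in zip(human, llm)) / n               # observed agreement
h_counts, l_counts = Counter(human), Counter(llm)
p_e = sum(h_counts[k] * l_counts[k] for k in h_counts) / n**2   # agreement expected by chance
kappa = (p_o - p_e) / (1 - p_e)

print(f"p_o={p_o:.2f}, p_e={p_e:.2f}, kappa={kappa:.3f}")
```

Here the observed agreement is 16/20 = 0.80, the chance agreement is (12·12 + 8·8)/400 = 0.52, and kappa is (0.80 − 0.52)/(1 − 0.52) ≈ 0.583. An 80% agreement rate therefore looks less impressive once the agreement two raters would reach by chance is removed.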

---

## Summary and Key Takeaways

In this notebook, we've explored the fundamentals of LLM-as-Judge evaluation:

### What We Learned:

1. **Basic Implementation**: How to use LLMs to evaluate other model outputs through pairwise comparison
2. **Prompt Engineering**: The importance of well-designed evaluation prompts for consistent results
3. **Batch Processing**: Efficient evaluation of multiple examples in single API calls
4. **Common Biases**: 
   - Length bias (preference for longer outputs)
   - Position bias (order affects judgment)
   - Self-preference bias (models prefer their own style)
5. **Mitigation Strategies**: Length-controlled evaluation using confidence scores
6. **Practical Applications**: When to use LLM vs human evaluation

### Key Insights:

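- **LLM evaluation can track human judgment closely**: length-controlled AlpacaEval reports a Spearman correlation of 0.98 with Chatbot Arena model rankings (a correlation between leaderboards, not between individual verdicts)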
- **Biases are systematic and predictable**, allowing for correction
- **Hybrid approaches work best**: Use LLM evaluation for development, human evaluation for validation
- **Length control is crucial** for fair evaluation of concise vs verbose models

### Best Practices for LLM-as-Judge Evaluation:

1. Always test for position bias by evaluating both orders
2. Validate LLM judgments with human evaluation on a subset
3. Monitor for drift - LLM judges can change behavior over time
4. Use multiple judges when possible to reduce single-model bias
5. Use length-controlled metrics when comparing models
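
As a small illustration of practice 4, verdicts from several judges can be combined by majority vote. The sketch below is one simple aggregation rule among many. `majority_verdict` is a hypothetical helper, and the verdict lists stand in for the outputs of different judge models.

```python
from collections import Counter
from typing import List

def majority_verdict(verdicts: List[str]) -> str:
    """Return the most common verdict, or "TIE" when no single verdict leads."""
    counts = Counter(verdicts).most_common()
    if not counts:
        return "TIE"
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "TIE"  # the top two verdicts have the same number of votes
    return counts[0][0]

# Hypothetical verdicts from three judges, then from two judges that disagree
print(majority_verdict(["A", "A", "B"]))
print(majority_verdict(["A", "B"]))
```

Using an odd number of judges reduces the number of ties, and judges from different model families are less likely to share the same self-preference bias.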

---

## Additional Resources

For further exploration of LLM-as-Judge evaluation:

### Papers and Research:
- **AlpacaEval GitHub**: ["AlpacaEval: An Automatic Evaluator for Instruction-following Language Models"](https://github.com/tatsu-lab/alpaca_eval)
- **Length-Controlled AlpacaEval**: Introduces a regression-based correction to mitigate length bias
- **Chatbot Arena**: Human preference platform commonly used to validate LLM-judge rankings

### Next Steps:
- Experiment with different judge models (GPT-4, Claude, etc.)
- Try evaluating your own model outputs
- Implement more sophisticated bias mitigation techniques
- Explore multi-turn dialogue evaluation
- Build domain-specific evaluators for your use cases

Remember: LLM-as-Judge is a powerful tool, but it's not a complete replacement for human evaluation. Use it wisely as part of a comprehensive evaluation strategy!