# Elo Ranking Tutorial: Dynamic Leaderboards for LLM Evaluation

## Learning Objectives

By the end of this tutorial, you will:
- ✅ Understand the Elo rating formula and its intuition
- ✅ Implement Elo ranking from scratch
- ✅ Record pairwise matches and update rankings dynamically
- ✅ Visualize leaderboard evolution over time
- ✅ Calculate confidence intervals for rankings
- ✅ Detect transitivity violations in comparisons

## Execution Details

- **Execution Time:** <5 minutes
- **Cost:** $0 (simulation-based, no API calls)
- **Prerequisites:** Understanding of comparative evaluation (see `comparative_evaluation_guide.md`)

## Background

The **Elo rating system** was developed by Arpad Elo for chess in the 1960s. It's now used for:
- Chess rankings (FIDE)
- Game matchmaking (League of Legends, Overwatch)
- **LLM evaluation** (e.g., Chatbot Arena)

**Why Elo for LLMs?**
- Simple and interpretable
- Online updates (process comparisons one at a time)
- Self-correcting (over- or under-rated models are pulled back toward their true strength as comparisons accumulate)
- Handles new models easily (just assign initial rating)


In [1]:
# Cell 2: Setup and imports
import json
import sys
from collections import defaultdict
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np

# Add backend to path
sys.path.insert(0, str(Path.cwd().parent / "backend"))

from comparative_evaluation import EloRanking

# Load pairwise comparisons dataset
data_path = Path("data/pairwise_comparisons.json")
with open(data_path) as f:
    comparisons = json.load(f)

print(f"✅ Loaded {len(comparisons)} pairwise comparisons")
print("✅ EloRanking class imported successfully")
print("\nFirst comparison example:")
print(f"Query: {comparisons[0]['query'][:80]}...")
print(f"Winner: {comparisons[0]['winner']}")
print(f"Dimension: {comparisons[0]['dimension']}")

✅ Loaded 100 pairwise comparisons
✅ EloRanking class imported successfully

First comparison example:
Query: How do I make gluten-free pasta from scratch?...
Winner: A
Dimension: helpfulness


## Elo Formula and Intuition

### The Math

**Step 1: Calculate expected score (win probability)**
```
E_A = 1 / (1 + 10^((R_B - R_A) / 400))
```
- `R_A`, `R_B`: Current ratings for models A and B
- `E_A`: Expected probability that A beats B
- The 400 scaling constant is a chess convention: a 400-point gap corresponds to 10:1 odds, and a 100-point gap to roughly a 64% win probability

**Step 2: Update rating based on actual outcome**
```
R_A' = R_A + K * (S_A - E_A)
```
- `S_A`: Actual score (1 if A wins, 0 if B wins, 0.5 if tie)
- `K`: Learning rate / K-factor (how much ratings can change)
- `(S_A - E_A)`: Prediction error (surprise!)
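
The two steps above can be sketched in plain Python. This is a minimal illustration, not the backend `EloRanking` implementation; the function names `expected_score` and `update_ratings` are assumptions introduced here.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Step 1: probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_ratings(rating_a, rating_b, score_a, k=32):
    """Step 2: return the new (R_A, R_B) after one match.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    e_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - e_a)
    # B's change mirrors A's because S_B = 1 - S_A and E_B = 1 - E_A.
    return rating_a + delta, rating_b - delta


for gap in (0, 100, 400):
    print(f"gap={gap:>3}: E_A={expected_score(1500 + gap, 1500):.2f}")
print(update_ratings(1500, 1500, 1.0))
```

Because both players share the same K here, the update is zero-sum: whatever A gains, B loses.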

### The Intuition

**Example 1: Evenly matched models**
- Model A: 1500 Elo, Model B: 1500 Elo
- Expected: `E_A = 0.5` (50% chance A wins)
- A wins: `R_A' = 1500 + 32 * (1 - 0.5) = 1516`
- B loses: `R_B' = 1500 + 32 * (0 - 0.5) = 1484`
- **Rating change: ±16 points**

**Example 2: Underdog wins (upset!)**
- Model A: 1300 Elo, Model B: 1600 Elo
- Expected: `E_A = 1 / (1 + 10^(300 / 400)) ≈ 0.15` (only a 15% chance A wins)
- A wins: `R_A' = 1300 + 32 * (1 - 0.15) ≈ 1327`
- B loses: `R_B' = 1600 + 32 * (0 - 0.85) ≈ 1573`
- **Rating change: +27 for A, -27 for B (big surprise!)**

**Example 3: Favorite wins (expected)**
- Model A: 1600 Elo, Model B: 1300 Elo
- Expected: `E_A ≈ 0.85` (85% chance A wins)
- A wins: `R_A' = 1600 + 32 * (1 - 0.85) ≈ 1605`
- B loses: `R_B' = 1300 + 32 * (0 - 0.15) ≈ 1295`
- **Rating change: ±5 points (no surprise)**


In [None]:
# Cell 4: Implement and demonstrate Elo ranking
# Initialize Elo ranking system
elo = EloRanking(initial_rating=1500, k_factor=32)

# Demonstrate the three examples from above
print("=" * 60)
print("Example 1: Evenly matched models")
print("=" * 60)
elo_demo1 = EloRanking(initial_rating=1500, k_factor=32)
elo_demo1.ratings = {"Model_A": 1500, "Model_B": 1500}
print(f"Before: A={elo_demo1.ratings['Model_A']}, B={elo_demo1.ratings['Model_B']}")
elo_demo1.record_match("Model_A", "Model_B", result=1.0)
print(f"After A wins: A={elo_demo1.ratings['Model_A']:.0f}, B={elo_demo1.ratings['Model_B']:.0f}")
print(f"Rating change: ±{abs(elo_demo1.ratings['Model_A'] - 1500):.0f} points\n")

print("=" * 60)
print("Example 2: Underdog wins (upset!)")
print("=" * 60)
elo_demo2 = EloRanking(initial_rating=1500, k_factor=32)
elo_demo2.ratings = {"Model_A": 1300, "Model_B": 1600}
expected_a = elo_demo2._calculate_expected_score(1300, 1600)
print(f"Before: A={elo_demo2.ratings['Model_A']}, B={elo_demo2.ratings['Model_B']}")
print(f"Expected win probability for A: {expected_a:.1%}")
elo_demo2.record_match("Model_A", "Model_B", result=1.0)
print(f"After A wins: A={elo_demo2.ratings['Model_A']:.0f}, B={elo_demo2.ratings['Model_B']:.0f}")
print(f"Rating change: +{elo_demo2.ratings['Model_A'] - 1300:.0f} for A, {elo_demo2.ratings['Model_B'] - 1600:.0f} for B\n")

print("=" * 60)
print("Example 3: Favorite wins (expected)")
print("=" * 60)
elo_demo3 = EloRanking(initial_rating=1500, k_factor=32)
elo_demo3.ratings = {"Model_A": 1600, "Model_B": 1300}
expected_a = elo_demo3._calculate_expected_score(1600, 1300)
print(f"Before: A={elo_demo3.ratings['Model_A']}, B={elo_demo3.ratings['Model_B']}")
print(f"Expected win probability for A: {expected_a:.1%}")
elo_demo3.record_match("Model_A", "Model_B", result=1.0)
print(f"After A wins: A={elo_demo3.ratings['Model_A']:.0f}, B={elo_demo3.ratings['Model_B']:.0f}")
print(f"Rating change: +{elo_demo3.ratings['Model_A'] - 1600:.0f} for A, {elo_demo3.ratings['Model_B'] - 1300:.0f} for B")

print("\n✅ Elo formula demonstrated successfully!")
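
For readers who want to see what sits behind calls like `record_match`, here is a from-scratch sketch. It is a hypothetical stand-in, not the backend `EloRanking` source: the class name `MiniElo` and its `expected` and `leaderboard` methods are assumptions made for illustration, and the real class may differ in details.

```python
class MiniElo:
    """Hypothetical from-scratch Elo tracker (not the backend EloRanking)."""

    def __init__(self, initial_rating=1500.0, k_factor=32.0):
        self.initial_rating = initial_rating
        self.k_factor = k_factor
        self.ratings = {}

    def expected(self, model_a, model_b):
        """Probability that model_a beats model_b; unseen models use the initial rating."""
        r_a = self.ratings.get(model_a, self.initial_rating)
        r_b = self.ratings.get(model_b, self.initial_rating)
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def record_match(self, model_a, model_b, result):
        """result: 1.0 if model_a wins, 0.0 if model_b wins, 0.5 for a tie."""
        e_a = self.expected(model_a, model_b)
        delta = self.k_factor * (result - e_a)
        self.ratings[model_a] = self.ratings.get(model_a, self.initial_rating) + delta
        self.ratings[model_b] = self.ratings.get(model_b, self.initial_rating) - delta

    def leaderboard(self):
        """Return (model, rating) pairs sorted from highest to lowest rating."""
        return sorted(self.ratings.items(), key=lambda kv: kv[1], reverse=True)


mini = MiniElo()
mini.ratings = {"Model_A": 1300.0, "Model_B": 1600.0}
mini.record_match("Model_A", "Model_B", result=1.0)  # the underdog wins
print(mini.leaderboard())
```

Storing ratings in a plain dict keeps new models cheap to add: the first time a name appears, it is simply looked up at the initial rating.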

In [None]:
# Cell 5: Record all pairwise matches from dataset
# Reset Elo ranking system
elo = EloRanking(initial_rating=1500, k_factor=32)

# Track rating evolution
rating_history = defaultdict(list)
match_count = defaultdict(int)

# Process each comparison
for i, comp in enumerate(comparisons):
    # Extract winner and loser from comparison
    # In our dataset, responses are labeled as response_a and response_b
    # We'll treat them as different "models" for this tutorial
    model_a = "Response_A"
    model_b = "Response_B"
    
    # Determine result (1.0 if A wins, 0.0 if B wins, 0.5 if tie)
    if comp['winner'] == 'A':
        result = 1.0
    elif comp['winner'] == 'B':
        result = 0.0
    else:
        result = 0.5  # Tie
    
    # Record match
    elo.record_match(model_a, model_b, result)
    
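    # Track history: snapshot every model's rating after this match
    for model in elo.ratings:
        rating_history[model].append(elo.ratings[model])
    # Count the match only for the two participants
    match_count[model_a] += 1
    match_count[model_b] += 1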

# Get final leaderboard
leaderboard = elo.get_leaderboard()

print("=" * 60)
print("Final Elo Leaderboard")
print("=" * 60)
for i, (model, rating) in enumerate(leaderboard, 1):
    print(f"{i}. {model:20s} {rating:7.1f} Elo ({match_count[model]} matches)")

print(f"\n✅ Processed {len(comparisons)} pairwise comparisons")
print(f"✅ {len(leaderboard)} models ranked")

In [None]:
# Cell 6: Visualize leaderboard evolution over time
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Rating evolution over matches
ax1 = axes[0]
for model, history in rating_history.items():
    ax1.plot(history, label=model, linewidth=2, marker='o', markersize=3, alpha=0.7)

ax1.axhline(y=1500, color='gray', linestyle='--', alpha=0.5, label='Initial rating')
ax1.set_xlabel('Match Number', fontsize=11)
ax1.set_ylabel('Elo Rating', fontsize=11)
ax1.set_title('Elo Rating Evolution Over Time', fontsize=13, fontweight='bold')
ax1.legend(loc='best', fontsize=9)
ax1.grid(True, alpha=0.3)

# Plot 2: Final ratings bar chart
ax2 = axes[1]
models = [m for m, _ in leaderboard]
ratings = [r for _, r in leaderboard]
colors = ['#2ecc71' if r > 1500 else '#e74c3c' for r in ratings]

bars = ax2.barh(models, ratings, color=colors, alpha=0.7, edgecolor='black')
ax2.axvline(x=1500, color='gray', linestyle='--', alpha=0.5, label='Initial rating')
ax2.set_xlabel('Elo Rating', fontsize=11)
ax2.set_title('Final Elo Rankings', fontsize=13, fontweight='bold')
ax2.legend(loc='best', fontsize=9)
ax2.grid(True, alpha=0.3, axis='x')

# Add rating values on bars
for i, (bar, rating) in enumerate(zip(bars, ratings)):
    ax2.text(rating + 5, i, f'{rating:.0f}', va='center', fontsize=10, fontweight='bold')

plt.tight_layout()
Path("results").mkdir(exist_ok=True)  # savefig fails if the folder is missing
plt.savefig('results/elo_evolution.png', dpi=150, bbox_inches='tight')
print("\n✅ Visualization saved to results/elo_evolution.png")
plt.show()

In [None]:
# Cell 7: Calculate confidence intervals for rankings
# Elo doesn't natively provide uncertainty, so we'll estimate it using rating volatility
# Method: Calculate standard deviation of rating over last N matches
# Note: this is a heuristic volatility band, not a formal confidence interval

def calculate_rating_uncertainty(rating_history, window=10):
    """Estimate rating uncertainty from recent volatility."""
    if len(rating_history) < window:
        window = len(rating_history)
    recent = rating_history[-window:]
    return np.std(recent) if len(recent) > 1 else 50.0  # Default uncertainty

# Calculate 95% confidence intervals (±1.96 * std)
print("=" * 70)
print("Elo Rankings with 95% Confidence Intervals")
print("=" * 70)
print(f"{'Rank':<6} {'Model':<20} {'Rating':<10} {'95% CI':<20} {'Uncertainty'}")
print("=" * 70)

uncertainties = {}
for i, (model, rating) in enumerate(leaderboard, 1):
    uncertainty = calculate_rating_uncertainty(rating_history[model])
    uncertainties[model] = uncertainty
    ci_lower = rating - 1.96 * uncertainty
    ci_upper = rating + 1.96 * uncertainty
    print(f"{i:<6} {model:<20} {rating:7.1f}    [{ci_lower:6.1f}, {ci_upper:6.1f}]    ±{uncertainty:.1f}")

print("\n📊 Interpretation:")
print("- Lower uncertainty = more stable/confident ranking")
print("- Higher uncertainty = volatile ranking (needs more comparisons)")
print("- Overlapping CIs = rankings not clearly separated (heuristic, not a formal test)")

# Visualize uncertainty
fig, ax = plt.subplots(figsize=(10, 6))
models = [m for m, _ in leaderboard]
ratings = [r for _, r in leaderboard]
errors = [1.96 * uncertainties[m] for m in models]

y_pos = np.arange(len(models))
ax.barh(y_pos, ratings, xerr=errors, color='steelblue', alpha=0.7, 
        ecolor='black', capsize=5, error_kw={'linewidth': 2})
ax.set_yticks(y_pos)
ax.set_yticklabels(models)
ax.set_xlabel('Elo Rating (with 95% CI)', fontsize=11)
ax.set_title('Elo Rankings with Uncertainty', fontsize=13, fontweight='bold')
ax.axvline(x=1500, color='gray', linestyle='--', alpha=0.5, label='Initial rating')
ax.legend()
ax.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig('results/elo_uncertainty.png', dpi=150, bbox_inches='tight')
print("\n✅ Uncertainty visualization saved to results/elo_uncertainty.png")
plt.show()
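
A common alternative to the volatility heuristic is bootstrapping: resample the matches with replacement, recompute Elo each time, and read percentile intervals off the resulting ratings. The sketch below swaps in that technique under stated assumptions. The helpers `elo_ratings` and `bootstrap_intervals` and the synthetic `toy_matches` data are made up for illustration and are not part of the tutorial's backend or dataset.

```python
import numpy as np


def elo_ratings(matches, k=32.0, initial=1500.0):
    """Run plain Elo over (model_a, model_b, score_a) tuples, in order."""
    ratings = {}
    for a, b, score_a in matches:
        r_a = ratings.get(a, initial)
        r_b = ratings.get(b, initial)
        e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
        delta = k * (score_a - e_a)
        ratings[a], ratings[b] = r_a + delta, r_b - delta
    return ratings


def bootstrap_intervals(matches, n_rounds=200, seed=0):
    """95% percentile intervals from resampling matches with replacement."""
    rng = np.random.default_rng(seed)
    samples = {}
    for _ in range(n_rounds):
        idx = rng.integers(0, len(matches), size=len(matches))
        resampled = [matches[i] for i in idx]
        for model, rating in elo_ratings(resampled).items():
            samples.setdefault(model, []).append(rating)
    return {
        model: tuple(np.percentile(values, [2.5, 97.5]))
        for model, values in samples.items()
    }


# Synthetic data for illustration only: "strong" wins 70 of 100 matches.
toy_matches = [("strong", "weak", 1.0)] * 70 + [("strong", "weak", 0.0)] * 30
intervals = bootstrap_intervals(toy_matches)
for model, (low, high) in intervals.items():
    print(f"{model}: [{low:.0f}, {high:.0f}]")
```

Resampling also shuffles match order, so the interval width reflects both sampling noise and Elo's sensitivity to ordering.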

In [None]:
# Cell 8: Analyze transitivity violations
# Transitivity: If A > B and B > C, then A > C should hold
# Violations indicate inconsistency in judgments

def find_transitivity_violations(comparisons):
    """Find cycles in pairwise comparisons (A > B > C > A)."""
    # Build win graph
    wins = defaultdict(set)  # wins[A] = set of models that A beat
    
    for comp in comparisons:
        if comp['winner'] == 'A':
            wins['Response_A'].add('Response_B')
        elif comp['winner'] == 'B':
            wins['Response_B'].add('Response_A')
    
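    # Find simple cycles (A > B > C > A)
    # Note: with only two labels (Response_A, Response_B) no 3-cycle can occur;
    # the check becomes meaningful once three or more models are compared.
    violations = set()
    models = list(wins.keys())
    
    for a in models:
        for b in wins[a]:  # A beat B
            if b in wins:
                for c in wins[b]:  # B beat C
                    if c in wins and a in wins[c]:  # C beat A (cycle!)
                        # Rotate so the smallest name comes first; otherwise
                        # each cycle is reported three times (once per start)
                        cycle = (a, b, c)
                        start = cycle.index(min(cycle))
                        violations.add(cycle[start:] + cycle[:start])
    
    return sorted(violations)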

violations = find_transitivity_violations(comparisons)

print("=" * 60)
print("Transitivity Violation Analysis")
print("=" * 60)

if violations:
    print(f"⚠️  Found {len(violations)} transitivity violations (cycles):\n")
    for i, (a, b, c) in enumerate(violations[:5], 1):  # Show first 5
        print(f"{i}. {a} > {b} > {c} > {a}")
    if len(violations) > 5:
        print(f"   ... and {len(violations) - 5} more")
    
    violation_rate = len(violations) / len(comparisons)
    print(f"\nViolation rate: {violation_rate:.1%}")
    
    if violation_rate > 0.1:
        print("\n⚠️  High violation rate (>10%) suggests:")
        print("   - Inconsistent judge behavior")
        print("   - Query-dependent preferences")
        print("   - Forced wins (should have been ties)")
    else:
        print("\n✅ Acceptable violation rate (<10%)")
        print("   Some cycles are expected due to measurement noise")
else:
    print("✅ No transitivity violations found!")
    print("   All comparisons are consistent (A > B > C implies A > C)")

print("\n📊 Interpretation:")
print("- Elo rankings are robust to some violations (averages out noise)")
print("- Bradley-Terry model assumes transitivity (may fit poorly if many violations)")
print("- Investigate violations to improve judge consistency")
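
The dataset in this tutorial only carries two labels, so a three-way cycle cannot show up in it. To see what the check looks for, here is a small self-contained sketch with made-up results for hypothetical models; `find_cycles` and `toy_results` are illustrative names, not part of the tutorial's backend.

```python
from collections import defaultdict


def find_cycles(results):
    """Return the unique 3-cycles found in a list of (winner, loser) pairs."""
    beats = defaultdict(set)
    for winner, loser in results:
        beats[winner].add(loser)

    cycles = set()
    for a in list(beats):
        for b in beats[a]:
            for c in beats.get(b, ()):
                if a in beats.get(c, ()):
                    # Rotate so the smallest name leads; this collapses the
                    # three rotations of one cycle into a single entry.
                    cycle = (a, b, c)
                    start = cycle.index(min(cycle))
                    cycles.add(cycle[start:] + cycle[:start])
    return sorted(cycles)


# Made-up results for four hypothetical models.
toy_results = [
    ("model_x", "model_y"),
    ("model_y", "model_z"),
    ("model_z", "model_x"),  # closes the cycle x > y > z > x
    ("model_x", "model_w"),
]
print(find_cycles(toy_results))
```

Using `beats.get(...)` rather than indexing avoids silently adding empty entries to the `defaultdict` while it is being scanned.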

## Summary and Key Takeaways

### What You Learned

1. **Elo Formula**: 
   - Expected score: `E_A = 1 / (1 + 10^((R_B - R_A) / 400))`
   - Rating update: `R_A' = R_A + K * (S_A - E_A)`
   - Larger updates for surprising results (upsets)

2. **Dynamic Updates**:
   - Process comparisons incrementally (online learning)
   - Ratings move toward true skill over time (with a fixed K they keep fluctuating around it)
   - Easy to add new models (just assign initial rating)

3. **Uncertainty Estimation**:
   - Elo doesn't provide native uncertainty
   - Can estimate from rating volatility
   - More matches generally give a more stable rating

4. **Transitivity Violations**:
   - Some cycles expected due to noise
   - >10% violation rate indicates judge inconsistency
   - Elo tolerates some violations, but like Bradley-Terry it assumes one underlying strength per model

### When to Use Elo

✅ **Use Elo when:**
- Building live leaderboard with continuous updates (e.g., Chatbot Arena)
- New models added frequently
- Want simple, interpretable rankings
- Order of comparisons doesn't matter for final use

❌ **Don't use Elo when:**
- Need uncertainty estimates (use Bradley-Terry)
- Want batch analysis of fixed model set (use Bradley-Terry)
- Order-independence is critical for fairness
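
The order-dependence point can be made concrete with a small sketch. It feeds the same ten made-up outcomes to plain Elo in two different orders; `final_rating` and the models `m1` and `m2` are hypothetical names used only for this illustration.

```python
def final_rating(matches, model, k=32.0, initial=1500.0):
    """Run plain Elo over (model_a, model_b, score_a) tuples; return one rating."""
    ratings = {}
    for a, b, score_a in matches:
        r_a = ratings.get(a, initial)
        r_b = ratings.get(b, initial)
        e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
        delta = k * (score_a - e_a)
        ratings[a], ratings[b] = r_a + delta, r_b - delta
    return ratings[model]


# Made-up record: m1 beats m2 six times and loses four times.
wins = [("m1", "m2", 1.0)] * 6
losses = [("m1", "m2", 0.0)] * 4

wins_first = final_rating(wins + losses, "m1")
losses_first = final_rating(losses + wins, "m1")
print(f"wins first:   {wins_first:.1f}")
print(f"losses first: {losses_first:.1f}")
```

The two runs end at different ratings because later matches carry more weight in an online update. A batch fit such as Bradley-Terry uses all results at once and does not depend on their order.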

### Practical Tips

1. **K-factor tuning:**
   - K=32: Standard (balanced convergence speed)
   - K=64: New/volatile models (faster updates)
   - K=16: Mature/stable models (slower updates)

2. **Cold start problem:**
   - New models start at initial rating (e.g., 1500)
   - First ~20 matches have high volatility
   - Consider higher K-factor for new models

3. **Rating interpretation:**
   - 100 Elo difference ≈ 64% win probability
   - 200 Elo difference ≈ 76% win probability
   - 400 Elo difference ≈ 91% win probability
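
The tips above can be sketched as two small helpers. Both are assumptions for illustration: `win_probability` restates the expected-score formula in terms of a rating lead, and `k_for` is one hypothetical way to give new models a larger K during their first matches (the 20-match warmup mirrors the rough figure in tip 2).

```python
def win_probability(rating_gap: float) -> float:
    """Win probability for a model that leads its opponent by rating_gap points."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))


def k_for(matches_played: int, new_model_k=64.0, settled_k=32.0, warmup=20):
    """Hypothetical schedule: larger K during a model's first `warmup` matches."""
    return new_model_k if matches_played < warmup else settled_k


for gap in (100, 200, 400):
    print(f"{gap} Elo lead -> {win_probability(gap):.0%} win probability")

print(k_for(0), k_for(25))
```

One design consequence to keep in mind: if the two models in a match use different K values, the points one gains no longer equal the points the other loses.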

### Next Steps

- 📊 [Bradley-Terry Ranking Tutorial](bradley_terry_ranking_tutorial.ipynb) - Probabilistic alternative with uncertainty
- 🔬 [A/B Testing vs Comparative Eval](ab_testing_vs_comparative_eval.ipynb) - When to use comparative evaluation
- 📖 [Comparative Evaluation Guide](comparative_evaluation_guide.md) - Comprehensive methodology overview

### Resources

- [Elo Rating System (Wikipedia)](https://en.wikipedia.org/wiki/Elo_rating_system)
- [Chatbot Arena Leaderboard](https://chat.lmsys.org/?leaderboard) - Live Elo rankings
- [Backend Implementation](../backend/comparative_evaluation.py) - EloRanking class source code

---

**🎉 Congratulations!** You've mastered Elo ranking for LLM evaluation. You can now build dynamic leaderboards and understand rating evolution over time.
