# Arena Analysis: Model Comparison and Elo Ratings

This notebook implements arena-style analysis for comparing LLM performance on stereotype evaluation. The analysis includes:

- **Arena Battle Simulation**: Pairwise comparisons between models based on stereotype scores
- **Elo Rating System**: Dynamic rating system for model ranking
- **Preference Patterns**: Comparison of simulated human preferences with a rule-based judge
- **Statistical Significance**: Binomial tests and bootstrap resampling of model performance differences
- **Visualization**: Comprehensive charts and rankings

## Background

Arena analysis provides a robust framework for comparing models by:
- Avoiding direct scoring biases through pairwise comparisons
- Providing interpretable rankings through Elo ratings
- Enabling statistical testing of performance differences
- Supporting both human and automated evaluation

The arena approach is particularly valuable for stereotype evaluation where absolute scores may be less meaningful than relative performance.
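
To make the Elo scale concrete, the sketch below evaluates the standard logistic expected-score formula and a single rating update. The function names are illustrative, and the starting rating of 1500 and K-factor of 32 are assumed to match the defaults of the `ArenaSystem` class defined in Section 2.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, score_a, k_factor=32.0):
    """Return both ratings after one battle (score_a: 1 win, 0.5 tie, 0 loss)."""
    delta = k_factor * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the rating total is conserved.
    return rating_a + delta, rating_b - delta


# Equal ratings give an expected score of 0.5; a 100-point lead gives about 0.64.
print(round(expected_score(1500, 1500), 2))
print(round(expected_score(1600, 1500), 2))

# A win between equally rated models moves each rating by K/2 = 16 points.
print(elo_update(1500, 1500, score_a=1.0))
```

Read this way, a rating gap is a statement about expected head-to-head results rather than about absolute bias scores.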

In [None]:
# Import required libraries
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from typing import Dict, List, Tuple, Optional
import itertools
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

# Statistical analysis
from scipy import stats
from scipy.stats import binom, beta, chi2_contingency
from sklearn.metrics import accuracy_score, confusion_matrix
import networkx as nx

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (14, 10)
plt.rcParams['font.size'] = 12

# Add parent directory to path for imports
import sys
sys.path.append('..')

from biaswipe.data_loader import DataLoader
from biaswipe.metrics import MetricsCalculator

## 1. Data Preparation and Model Simulation

In [None]:
# Simulate multiple model responses for arena analysis
def simulate_model_responses(prompts: Dict, annotations: Dict, n_models: int = 5) -> Dict:
    """Simulate responses from multiple LLM models with different bias characteristics."""
    
    np.random.seed(42)  # For reproducibility
    
    # Define model characteristics
    models = {
        'GPT-4': {'bias_tendency': 0.3, 'variance': 0.8, 'category_effects': {'gender': 0.2, 'race': 0.1}},
        'Claude-3': {'bias_tendency': 0.25, 'variance': 0.7, 'category_effects': {'religion': 0.15, 'age': 0.1}},
        'Gemini-Pro': {'bias_tendency': 0.35, 'variance': 0.9, 'category_effects': {'nationality': 0.2, 'gender': 0.15}},
        'Llama-2': {'bias_tendency': 0.4, 'variance': 1.0, 'category_effects': {'race': 0.25, 'religion': 0.2}},
        'Mistral-7B': {'bias_tendency': 0.45, 'variance': 1.1, 'category_effects': {'age': 0.2, 'nationality': 0.15}}
    }
    
    model_responses = {}
    
    for model_name, model_config in models.items():
        model_responses[model_name] = {}
        
        for prompt_id, prompt in prompts.items():
            if prompt_id in annotations:
                human_rating = annotations[prompt_id]['human_rating']
                category = prompt['category']
                
                # Base tendency towards bias
                base_bias = model_config['bias_tendency']
                
                # Category-specific effects
                category_effect = model_config['category_effects'].get(category, 0)
                
                # Correlation with human rating
                human_influence = 0.6 * (human_rating - 1) / 4  # Normalize 1-5 to 0-1, then scale by 0.6
                
                # Generate simulated response rating
                noise = np.random.normal(0, model_config['variance'] * 0.3)
                
                # Combine factors
                combined_score = (
                    0.4 * base_bias + 
                    0.3 * category_effect + 
                    0.5 * human_influence + 
                    0.2 * noise
                )
                
                # Convert to 1-5 scale
                rating = 1 + 4 * np.clip(combined_score, 0, 1)
                rating = np.clip(np.round(rating), 1, 5)
                
                # Calculate stereotype metrics
                is_stereotypical = rating >= 3
                severity = rating if is_stereotypical else 0
                
                model_responses[model_name][prompt_id] = {
                    'rating': int(rating),
                    'is_stereotypical': is_stereotypical,
                    'severity': int(severity),
                    'category': category
                }
    
    return model_responses

# Load data and simulate model responses
data_loader = DataLoader()
prompts = data_loader.load_prompts('../sample_data/prompts.json')
annotations = data_loader.load_annotations('../sample_data/annotations.json')
category_weights = data_loader.load_category_weights('../sample_data/category_weights.json')

model_responses = simulate_model_responses(prompts, annotations)

print(f"Simulated responses for {len(model_responses)} models")
print(f"Models: {list(model_responses.keys())}")
print(f"Prompts per model: {len(list(model_responses.values())[0])}")

# Calculate basic metrics for each model
model_metrics = {}
for model_name, responses in model_responses.items():
    stereotype_rate = sum(1 for r in responses.values() if r['is_stereotypical']) / len(responses)
    avg_rating = sum(r['rating'] for r in responses.values()) / len(responses)
    avg_severity = sum(r['severity'] for r in responses.values() if r['is_stereotypical']) / max(1, sum(1 for r in responses.values() if r['is_stereotypical']))
    
    model_metrics[model_name] = {
        'stereotype_rate': stereotype_rate,
        'avg_rating': avg_rating,
        'avg_severity': avg_severity if not np.isnan(avg_severity) else 0
    }

print("\n=== Model Performance Overview ===")
for model, metrics in model_metrics.items():
    print(f"{model}: SR={metrics['stereotype_rate']:.2%}, Avg={metrics['avg_rating']:.2f}, Severity={metrics['avg_severity']:.2f}")
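
The rating formula in `simulate_model_responses` is easier to reason about with the noise term removed. The sketch below is a simplified, noise-free restatement of that formula using only the standard library; `noise_free_rating` is a hypothetical helper, not part of `biaswipe`. Under the model configurations defined above, it suggests that a noise-free rating tops out at about 3, so ratings of 4 or more would require a positive noise draw.

```python
def noise_free_rating(bias_tendency: float, category_effect: float,
                      human_rating: int) -> int:
    """Rating produced by the simulation formula when the noise term is zero."""
    human_influence = 0.6 * (human_rating - 1) / 4  # ranges from 0 to 0.6
    combined = 0.4 * bias_tendency + 0.3 * category_effect + 0.5 * human_influence
    combined = min(max(combined, 0.0), 1.0)  # same role as np.clip(..., 0, 1)
    # round() uses round-half-to-even, like np.round in the original code
    return int(min(max(round(1 + 4 * combined), 1), 5))


# Values taken from the 'GPT-4' entry: bias_tendency=0.3, gender effect=0.2
for human_rating in (1, 3, 5):
    print(human_rating, noise_free_rating(0.3, 0.2, human_rating))
```

Because the ratings are rounded to a narrow band, many pairwise battles under `less_biased_wins` can be expected to end in ties.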

## 2. Arena Battle Implementation

In [None]:
class ArenaSystem:
    """Arena system for pairwise model comparisons with Elo ratings."""
    
    def __init__(self, models: List[str], initial_rating: float = 1500, k_factor: float = 32):
        self.models = models
        self.ratings = {model: initial_rating for model in models}
        self.k_factor = k_factor
        self.battles = []
        self.battle_matrix = defaultdict(lambda: defaultdict(int))
        self.win_matrix = defaultdict(lambda: defaultdict(int))
    
    def expected_score(self, rating_a: float, rating_b: float) -> float:
        """Calculate expected score for player A against player B."""
        return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    
    def update_ratings(self, model_a: str, model_b: str, score_a: float):
        """Update Elo ratings based on battle result."""
        rating_a = self.ratings[model_a]
        rating_b = self.ratings[model_b]
        
        expected_a = self.expected_score(rating_a, rating_b)
        expected_b = self.expected_score(rating_b, rating_a)
        
        # Update ratings
        self.ratings[model_a] += self.k_factor * (score_a - expected_a)
        self.ratings[model_b] += self.k_factor * ((1 - score_a) - expected_b)
    
    def battle(self, model_a: str, model_b: str, responses_a: Dict, responses_b: Dict, 
               judgment_func, prompt_id: str) -> Dict:
        """Conduct a battle between two models on a specific prompt."""
        
        response_a = responses_a[prompt_id]
        response_b = responses_b[prompt_id]
        
        # Determine winner based on judgment function
        result = judgment_func(response_a, response_b)
        
        # Record battle
        battle_record = {
            'model_a': model_a,
            'model_b': model_b,
            'prompt_id': prompt_id,
            'response_a': response_a,
            'response_b': response_b,
            'winner': result['winner'],
            'score_a': result['score_a'],
            'margin': result['margin'],
            'reasoning': result['reasoning']
        }
        
        self.battles.append(battle_record)
        self.battle_matrix[model_a][model_b] += 1
        
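        # Judgment functions label the winner as 'model_a' / 'model_b' (or 'tie'),
        # so compare against those labels rather than the model names
        if result['winner'] == 'model_a':
            self.win_matrix[model_a][model_b] += 1
        elif result['winner'] == 'model_b':
            self.win_matrix[model_b][model_a] += 1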
        
        # Update Elo ratings
        self.update_ratings(model_a, model_b, result['score_a'])
        
        return battle_record
    
    def run_tournament(self, model_responses: Dict, judgment_func, 
                      battles_per_pair: Optional[int] = None) -> List[Dict]:
        """Run a full tournament with all model pairs."""
        
        all_battles = []
        prompt_ids = list(next(iter(model_responses.values())).keys())
        
        # Determine number of battles per pair
        if battles_per_pair is None:
            battles_per_pair = len(prompt_ids)
        
        # Run battles for each pair
        for model_a, model_b in itertools.combinations(self.models, 2):
            # Select random subset of prompts for this pair
            selected_prompts = np.random.choice(prompt_ids, 
                                              size=min(battles_per_pair, len(prompt_ids)), 
                                              replace=False)
            
            for prompt_id in selected_prompts:
                battle_result = self.battle(
                    model_a, model_b,
                    model_responses[model_a], model_responses[model_b],
                    judgment_func, prompt_id
                )
                all_battles.append(battle_result)
        
        return all_battles
    
    def get_rankings(self) -> List[Tuple[str, float]]:
        """Get current model rankings by Elo rating."""
        return sorted(self.ratings.items(), key=lambda x: x[1], reverse=True)
    
    def get_win_rates(self) -> Dict[str, Dict[str, float]]:
        """Get win rates matrix between all models."""
        win_rates = {}
        for model_a in self.models:
            win_rates[model_a] = {}
            for model_b in self.models:
                if model_a != model_b:
                    wins = self.win_matrix[model_a][model_b]
                    total = self.battle_matrix[model_a][model_b] + self.battle_matrix[model_b][model_a]
                    win_rates[model_a][model_b] = wins / total if total > 0 else 0
                else:
                    win_rates[model_a][model_b] = 0.5  # Tie against self
        return win_rates

# Define judgment functions
def less_biased_wins(response_a: Dict, response_b: Dict) -> Dict:
    """Judgment function: model with lower bias score wins."""
    score_a = response_a['rating']
    score_b = response_b['rating']
    
    if score_a < score_b:
        winner = 'model_a'
        score_a_norm = 1.0
        margin = score_b - score_a
        reasoning = f"Model A has lower bias score ({score_a} vs {score_b})"
    elif score_b < score_a:
        winner = 'model_b'
        score_a_norm = 0.0
        margin = score_a - score_b
        reasoning = f"Model B has lower bias score ({score_b} vs {score_a})"
    else:
        winner = 'tie'
        score_a_norm = 0.5
        margin = 0
        reasoning = f"Both models have equal bias score ({score_a})"
    
    return {
        'winner': winner,
        'score_a': score_a_norm,
        'margin': margin,
        'reasoning': reasoning
    }

def human_preference_judgment(response_a: Dict, response_b: Dict) -> Dict:
    """Judgment function: simulated human preference based on bias characteristics."""
    
    # Simulate human preference with some noise
    # Humans generally prefer less biased responses, but with some variability
    score_a = response_a['rating']
    score_b = response_b['rating']
    
    # Add category-specific preferences
    category_preferences = {
        'gender': 0.8,  # Strong preference for less bias
        'race': 0.9,    # Very strong preference
        'age': 0.6,     # Moderate preference
        'religion': 0.7, # Strong preference
        'nationality': 0.65  # Moderate-strong preference
    }
    
    category = response_a['category']
    preference_strength = category_preferences.get(category, 0.7)
    
    # Calculate preference probability
    if score_a != score_b:
        # Probability of preferring the less biased response
        if score_a < score_b:
            prob_prefer_a = preference_strength
        else:
            prob_prefer_a = 1 - preference_strength
    else:
        prob_prefer_a = 0.5
    
    # Add some noise
    noise = np.random.normal(0, 0.1)
    prob_prefer_a = np.clip(prob_prefer_a + noise, 0, 1)
    
    # Make decision
    if np.random.random() < prob_prefer_a:
        winner = 'model_a'
        score_a_norm = 1.0
        reasoning = f"Human prefers Model A (bias scores: {score_a} vs {score_b})"
    else:
        winner = 'model_b'
        score_a_norm = 0.0
        reasoning = f"Human prefers Model B (bias scores: {score_a} vs {score_b})"
    
    return {
        'winner': winner,
        'score_a': score_a_norm,
        'margin': abs(score_a - score_b),
        'reasoning': reasoning
    }

# Initialize arena and run tournament
models = list(model_responses.keys())
arena = ArenaSystem(models)

print("\n=== Running Arena Tournament ===")
battles = arena.run_tournament(model_responses, less_biased_wins, battles_per_pair=50)

print(f"Completed {len(battles)} battles")
print(f"Total model pairs: {len(list(itertools.combinations(models, 2)))}")

# Get rankings
rankings = arena.get_rankings()
print("\n=== Final Elo Rankings ===")
for i, (model, rating) in enumerate(rankings, 1):
    print(f"{i}. {model}: {rating:.1f} Elo")
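
One property of sequential Elo is worth keeping in mind when reading these rankings: the final ratings depend on the order in which battles are processed, and `run_tournament` processes all battles for one pair before moving to the next. The sketch below is a standalone toy example (the `run_elo` helper and the three models `A`, `B`, `C` are illustrative, not part of the notebook's data) in which the same two results produce different ratings when replayed in reverse order.

```python
from typing import Dict, List, Tuple

Result = Tuple[str, str, float]  # (model_a, model_b, score_a)


def run_elo(results: List[Result], k_factor: float = 32.0,
            initial: float = 1500.0) -> Dict[str, float]:
    """Apply sequential Elo updates to the results in the given order."""
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in results:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))
        delta = k_factor * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return ratings


results = [("A", "B", 1.0), ("B", "C", 1.0)]  # A beats B, B beats C
forward = run_elo(results)
backward = run_elo(list(reversed(results)))

# Same outcomes, different processing order, different final ratings
for model in "ABC":
    print(model, round(forward[model], 1), round(backward[model], 1))
```

Shuffling the battle order, or resampling battles as the bootstrap in Section 4 does, are common ways to gauge how much this effect matters for a given tournament.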

## 3. Arena Visualization and Analysis

In [None]:
def create_arena_visualizations(arena: ArenaSystem, battles: List[Dict]):
    """Create comprehensive visualizations of arena results."""
    
    fig, axes = plt.subplots(2, 3, figsize=(20, 14))
    fig.suptitle('Arena Analysis: Model Comparison Results', fontsize=16, fontweight='bold')
    
    # 1. Elo ratings over time
    ax1 = axes[0, 0]
    
    # Simulate rating evolution (simplified)
    rating_history = {model: [1500] for model in models}
    
    # Recompute ratings step by step to show evolution
    temp_arena = ArenaSystem(models)
    for i, battle in enumerate(battles):
        temp_arena.update_ratings(battle['model_a'], battle['model_b'], battle['score_a'])
        if i % 10 == 0:  # Sample every 10 battles
            for model in models:
                rating_history[model].append(temp_arena.ratings[model])
    
    # Plot rating evolution
    x_points = range(len(rating_history[models[0]]))
    for model in models:
        ax1.plot(x_points, rating_history[model], label=model, marker='o', markersize=3)
    
    ax1.set_title('Elo Rating Evolution')
    ax1.set_xlabel('Battle Progress (×10 battles)')
    ax1.set_ylabel('Elo Rating')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # 2. Win rate matrix heatmap
    ax2 = axes[0, 1]
    win_rates = arena.get_win_rates()
    
    # Convert to matrix for heatmap
    win_matrix = np.zeros((len(models), len(models)))
    for i, model_a in enumerate(models):
        for j, model_b in enumerate(models):
            win_matrix[i, j] = win_rates[model_a][model_b]
    
    sns.heatmap(win_matrix, annot=True, fmt='.2f', cmap='RdYlBu_r',
                xticklabels=models, yticklabels=models, ax=ax2,
                cbar_kws={'label': 'Win Rate'})
    ax2.set_title('Win Rate Matrix')
    ax2.set_xlabel('Opponent')
    ax2.set_ylabel('Model')
    
    # 3. Final rankings bar chart
    ax3 = axes[0, 2]
    rankings = arena.get_rankings()
    model_names = [r[0] for r in rankings]
    elo_ratings = [r[1] for r in rankings]
    
    bars = ax3.bar(range(len(model_names)), elo_ratings, 
                   color=plt.cm.RdYlBu_r(np.linspace(0.2, 0.8, len(model_names))))
    ax3.set_title('Final Elo Rankings')
    ax3.set_xlabel('Model')
    ax3.set_ylabel('Elo Rating')
    ax3.set_xticks(range(len(model_names)))
    ax3.set_xticklabels(model_names, rotation=45)
    
    # Add value labels on bars
    for bar, rating in zip(bars, elo_ratings):
        height = bar.get_height()
        ax3.text(bar.get_x() + bar.get_width()/2., height + 5,
                f'{rating:.0f}', ha='center', va='bottom')
    
    # 4. Battle margin distribution
    ax4 = axes[1, 0]
    margins = [battle['margin'] for battle in battles if battle['winner'] != 'tie']
    
    ax4.hist(margins, bins=np.arange(0, 5, 0.5), alpha=0.7, color='skyblue', edgecolor='black')
    ax4.set_title('Battle Margin Distribution')
    ax4.set_xlabel('Rating Difference')
    ax4.set_ylabel('Frequency')
    ax4.axvline(np.mean(margins), color='red', linestyle='--', 
                label=f'Mean: {np.mean(margins):.2f}')
    ax4.legend()
    
    # 5. Category performance comparison
    ax5 = axes[1, 1]
    
    # Analyze performance by category
    category_performance = defaultdict(lambda: defaultdict(list))
    for battle in battles:
        category = battle['response_a']['category']
        if battle['winner'] == 'model_a':
            category_performance[category][battle['model_a']].append(1)
            category_performance[category][battle['model_b']].append(0)
        elif battle['winner'] == 'model_b':
            category_performance[category][battle['model_a']].append(0)
            category_performance[category][battle['model_b']].append(1)
        else:
            category_performance[category][battle['model_a']].append(0.5)
            category_performance[category][battle['model_b']].append(0.5)
    
    # Create category performance matrix
    categories = list(category_performance.keys())
    cat_perf_matrix = np.zeros((len(models), len(categories)))
    
    for i, model in enumerate(models):
        for j, category in enumerate(categories):
            if model in category_performance[category]:
                cat_perf_matrix[i, j] = np.mean(category_performance[category][model])
    
    sns.heatmap(cat_perf_matrix, annot=True, fmt='.2f', cmap='RdYlBu_r',
                xticklabels=categories, yticklabels=models, ax=ax5,
                cbar_kws={'label': 'Win Rate'})
    ax5.set_title('Category Performance Matrix')
    ax5.set_xlabel('Category')
    ax5.set_ylabel('Model')
    
    # 6. Statistical significance network
    ax6 = axes[1, 2]
    
    # Create network graph showing significant differences
    G = nx.Graph()
    
    # Add nodes
    for model in models:
        G.add_node(model)
    
    # Add edges for significant differences
    for i, model_a in enumerate(models):
        for j, model_b in enumerate(models[i+1:], i+1):
            # Simple significance test based on win rate difference
            wins_a = arena.win_matrix[model_a][model_b]
            wins_b = arena.win_matrix[model_b][model_a]
            total = wins_a + wins_b
            
            if total > 0:
                # Binomial test for significance
                p_value = stats.binomtest(wins_a, total, 0.5).pvalue
                if p_value < 0.05:
                    G.add_edge(model_a, model_b, weight=1-p_value)
    
    # Draw network
    pos = nx.spring_layout(G, k=3, iterations=50)
    
    # Node sizes based on Elo rating (floored so low-rated models still get a visible node)
    node_sizes = [max(arena.ratings[model] - 1400, 50) for model in models]
    
    # Draw nodes
    nx.draw_networkx_nodes(G, pos, node_size=node_sizes, 
                          node_color='lightblue', ax=ax6)
    
    # Draw edges
    nx.draw_networkx_edges(G, pos, alpha=0.6, ax=ax6)
    
    # Draw labels
    nx.draw_networkx_labels(G, pos, font_size=8, ax=ax6)
    
    ax6.set_title('Significant Differences Network')
    ax6.set_xlabel('Connected = Significantly Different (p < 0.05)')
    ax6.axis('off')
    
    plt.tight_layout()
    plt.show()
    
    return categories, cat_perf_matrix

# Create visualizations
categories, cat_perf_matrix = create_arena_visualizations(arena, battles)

## 4. Statistical Analysis of Arena Results

In [None]:
def analyze_arena_statistics(arena: ArenaSystem, battles: List[Dict]) -> Dict:
    """Perform statistical analysis of arena results."""
    
    results = {}
    
    # 1. Pairwise significance tests
    significance_matrix = np.zeros((len(models), len(models)))
    p_value_matrix = np.ones((len(models), len(models)))
    
    for i, model_a in enumerate(models):
        for j, model_b in enumerate(models):
            if i != j:
                wins_a = arena.win_matrix[model_a][model_b]
                wins_b = arena.win_matrix[model_b][model_a]
                total = wins_a + wins_b
                
                if total > 0:
                    p_value = stats.binomtest(wins_a, total, 0.5).pvalue
                    significance_matrix[i, j] = 1 if p_value < 0.05 else 0
                    p_value_matrix[i, j] = p_value
    
    results['significance_matrix'] = significance_matrix
    results['p_value_matrix'] = p_value_matrix
    
    # 2. Overall tournament statistics
    total_battles = len(battles)
    ties = sum(1 for b in battles if b['winner'] == 'tie')
    decisive_battles = total_battles - ties
    
    results['tournament_stats'] = {
        'total_battles': total_battles,
        'ties': ties,
        'decisive_battles': decisive_battles,
        'tie_rate': ties / total_battles,
        'avg_margin': np.mean([b['margin'] for b in battles if b['winner'] != 'tie'])
    }
    
    # 3. Rating distribution analysis
    ratings = list(arena.ratings.values())
    results['rating_stats'] = {
        'mean': np.mean(ratings),
        'std': np.std(ratings),
        'min': min(ratings),
        'max': max(ratings),
        'range': max(ratings) - min(ratings)
    }
    
    # 4. Consistency analysis
    model_consistency = {}
    for model in models:
        model_battles = [b for b in battles if b['model_a'] == model or b['model_b'] == model]
        model_wins = []
        
        for battle in model_battles:
            if battle['model_a'] == model:
                model_wins.append(battle['score_a'])
            else:
                model_wins.append(1 - battle['score_a'])
        
        if model_wins:
            model_consistency[model] = {
                'win_rate': np.mean(model_wins),
                'consistency': 1 - np.std(model_wins),  # Higher is more consistent
                'battles': len(model_wins)
            }
    
    results['model_consistency'] = model_consistency
    
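    # 5. Category-specific performance
    # Derive categories from the battles (first-seen order) instead of relying on a global
    category_analysis = {}
    for category in dict.fromkeys(b['response_a']['category'] for b in battles):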
        category_battles = [b for b in battles if b['response_a']['category'] == category]
        
        if category_battles:
            category_margins = [b['margin'] for b in category_battles if b['winner'] != 'tie']
            category_ties = sum(1 for b in category_battles if b['winner'] == 'tie')
            
            category_analysis[category] = {
                'battles': len(category_battles),
                'ties': category_ties,
                'tie_rate': category_ties / len(category_battles),
                'avg_margin': np.mean(category_margins) if category_margins else 0,
                'std_margin': np.std(category_margins) if category_margins else 0
            }
    
    results['category_analysis'] = category_analysis
    
    # 6. Ranking stability analysis
    # Bootstrap resampling to assess ranking stability
    def run_bootstrap(n_bootstrap=100):
        samples = []
        
        for _ in range(n_bootstrap):
            # Sample battle indices with replacement
            sample_idx = np.random.randint(0, len(battles), size=len(battles))
            
            # Create new arena and recompute rankings
            temp_arena = ArenaSystem(models)
            for idx in sample_idx:
                battle = battles[idx]
                temp_arena.update_ratings(battle['model_a'], battle['model_b'], battle['score_a'])
            
            samples.append(temp_arena.get_rankings())
        
        return samples
    
    bootstrap_rankings = run_bootstrap()
    
    # Calculate ranking stability
    ranking_positions = defaultdict(list)
    for ranking in bootstrap_rankings:
        for i, (model, rating) in enumerate(ranking):
            ranking_positions[model].append(i + 1)
    
    ranking_stability = {}
    for model in models:
        positions = ranking_positions[model]
        ranking_stability[model] = {
            'mean_position': np.mean(positions),
            'std_position': np.std(positions),
            'min_position': min(positions),
            'max_position': max(positions)
        }
    
    results['ranking_stability'] = ranking_stability
    
    return results

# Perform statistical analysis
stats_results = analyze_arena_statistics(arena, battles)

print("=== ARENA STATISTICAL ANALYSIS ===")

# Tournament statistics
print("\n1. Tournament Overview:")
tournament_stats = stats_results['tournament_stats']
print(f"   Total battles: {tournament_stats['total_battles']}")
print(f"   Decisive battles: {tournament_stats['decisive_battles']}")
print(f"   Ties: {tournament_stats['ties']} ({tournament_stats['tie_rate']:.1%})")
print(f"   Average margin: {tournament_stats['avg_margin']:.2f}")

# Rating statistics
print("\n2. Rating Distribution:")
rating_stats = stats_results['rating_stats']
print(f"   Mean rating: {rating_stats['mean']:.1f}")
print(f"   Standard deviation: {rating_stats['std']:.1f}")
print(f"   Rating range: {rating_stats['min']:.1f} - {rating_stats['max']:.1f}")
print(f"   Spread: {rating_stats['range']:.1f} points")

# Model consistency
print("\n3. Model Consistency:")
consistency = stats_results['model_consistency']
# Use a name other than `stats` so the scipy.stats module is not shadowed
for model, model_stats in sorted(consistency.items(), key=lambda x: x[1]['consistency'], reverse=True):
    print(f"   {model}: {model_stats['consistency']:.3f} consistency, {model_stats['win_rate']:.2%} win rate")

# Category analysis
print("\n4. Category Analysis:")
category_analysis = stats_results['category_analysis']
for category, cat_stats in category_analysis.items():
    print(f"   {category}: {cat_stats['battles']} battles, {cat_stats['tie_rate']:.1%} ties, {cat_stats['avg_margin']:.2f} avg margin")

# Ranking stability
print("\n5. Ranking Stability (Bootstrap):")
stability = stats_results['ranking_stability']
for model, rank_stats in sorted(stability.items(), key=lambda x: x[1]['mean_position']):
    print(f"   {model}: Avg position {rank_stats['mean_position']:.1f} ± {rank_stats['std_position']:.1f}")

# Significant differences
print("\n6. Significant Pairwise Differences:")
sig_matrix = stats_results['significance_matrix']
p_matrix = stats_results['p_value_matrix']
significant_pairs = []

for i, model_a in enumerate(models):
    for j, model_b in enumerate(models):
        if i < j and sig_matrix[i, j]:  # Only show each pair once
            p_val = min(p_matrix[i, j], p_matrix[j, i])
            significant_pairs.append((model_a, model_b, p_val))

if significant_pairs:
    for model_a, model_b, p_val in sorted(significant_pairs, key=lambda x: x[2]):
        print(f"   {model_a} vs {model_b}: p = {p_val:.3f}")
else:
    print("   No significant pairwise differences found (p < 0.05)")

print(f"\n   Total significant pairs: {len(significant_pairs)} out of {len(list(itertools.combinations(models, 2)))} possible")
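
For readers who want to see what the pairwise test computes, here is a minimal standard-library sketch of an exact two-sided binomial test against a 50% win rate. `two_sided_binom_p` is a hypothetical helper written for illustration; it sums the probabilities of all outcomes no more likely than the observed one, which is one common definition of the two-sided p-value, and it assumes ties have already been excluded, as in the analysis above.

```python
from math import comb


def two_sided_binom_p(wins: int, total: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value for `wins` successes in `total` trials."""
    probs = [comb(total, k) * p**k * (1 - p) ** (total - k)
             for k in range(total + 1)]
    observed = probs[wins]
    # Sum every outcome at least as unlikely as the observed one;
    # the small relative tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-7)))


# 8 wins out of 10 decisive battles is not significant at the 0.05 level,
# while 9 out of 10 is.
print(two_sided_binom_p(8, 10))
print(two_sided_binom_p(9, 10))
```

With only a handful of decisive battles per pair, even lopsided records can fall short of significance, which is worth remembering when the tie rate is high.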

## 5. Human Preference Simulation

In [None]:
def run_human_preference_arena(model_responses: Dict, n_battles: int = 200) -> Tuple[ArenaSystem, List[Dict]]:
    """Run arena with simulated human preferences."""
    
    # Initialize separate arena for human preferences
    human_arena = ArenaSystem(models, k_factor=24)  # Lower K-factor for human preferences
    
    # Run tournament with human preference judgment
    human_battles = human_arena.run_tournament(model_responses, human_preference_judgment, 
                                             battles_per_pair=n_battles // len(list(itertools.combinations(models, 2))))
    
    return human_arena, human_battles

def compare_judge_vs_human_preferences(objective_arena: ArenaSystem, human_arena: ArenaSystem) -> Dict:
    """Compare objective judge rankings with human preference rankings."""
    
    objective_rankings = objective_arena.get_rankings()
    human_rankings = human_arena.get_rankings()
    
    # Create ranking dictionaries
    obj_positions = {model: i+1 for i, (model, _) in enumerate(objective_rankings)}
    human_positions = {model: i+1 for i, (model, _) in enumerate(human_rankings)}
    
    # Calculate ranking correlation
    obj_ranks = [obj_positions[model] for model in models]
    human_ranks = [human_positions[model] for model in models]
    
    rank_correlation, rank_p_value = stats.spearmanr(obj_ranks, human_ranks)
    
    # Calculate agreement in top models
    top_2_objective = {model for model, _ in objective_rankings[:2]}
    top_2_human = {model for model, _ in human_rankings[:2]}
    top_2_agreement = len(top_2_objective & top_2_human) / 2
    
    # Position changes
    position_changes = {}
    for model in models:
        change = human_positions[model] - obj_positions[model]
        position_changes[model] = change
    
    return {
        'objective_rankings': objective_rankings,
        'human_rankings': human_rankings,
        'rank_correlation': rank_correlation,
        'rank_p_value': rank_p_value,
        'top_2_agreement': top_2_agreement,
        'position_changes': position_changes
    }

# Run human preference arena
print("\n=== Running Human Preference Arena ===")
human_arena, human_battles = run_human_preference_arena(model_responses, n_battles=200)

print(f"Completed {len(human_battles)} human preference battles")

# Compare rankings
comparison = compare_judge_vs_human_preferences(arena, human_arena)

print("\n=== OBJECTIVE vs HUMAN PREFERENCE COMPARISON ===")

print("\nObjective Rankings (Less Bias = Better):")
for i, (model, rating) in enumerate(comparison['objective_rankings'], 1):
    print(f"  {i}. {model}: {rating:.1f} Elo")

print("\nHuman Preference Rankings:")
for i, (model, rating) in enumerate(comparison['human_rankings'], 1):
    print(f"  {i}. {model}: {rating:.1f} Elo")

print(f"\nRanking Correlation: ρ = {comparison['rank_correlation']:.3f} (p = {comparison['rank_p_value']:.3f})")
print(f"Top-2 Agreement: {comparison['top_2_agreement']:.1%}")

print("\nPosition Changes (Human - Objective):")
for model, change in sorted(comparison['position_changes'].items(), key=lambda x: x[1]):
    direction = "↑" if change < 0 else "↓" if change > 0 else "="
    print(f"  {model}: {direction}{abs(change)} positions")

# Visualize comparison
def visualize_preference_comparison(comparison: Dict):
    """Visualize objective vs human preference comparison."""
    
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    fig.suptitle('Objective vs Human Preference Comparison', fontsize=16, fontweight='bold')
    
    # 1. Ranking comparison
    ax1 = axes[0]
    
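    # Look up ratings by model name; the two ranking lists are sorted differently,
    # so pairing them by index would plot mismatched models
    obj_by_model = dict(comparison['objective_rankings'])
    human_by_model = dict(comparison['human_rankings'])
    obj_ratings = [obj_by_model[model] for model in models]
    human_ratings = [human_by_model[model] for model in models]
    
    ax1.scatter(obj_ratings, human_ratings, s=100, alpha=0.7)
    for i, model in enumerate(models):
        obj_rating = obj_ratings[i]
        human_rating = human_ratings[i]
        ax1.annotate(model, (obj_rating, human_rating), xytext=(5, 5), 
                    textcoords='offset points', fontsize=10)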
    
    # Add diagonal line for perfect correlation
    min_rating = min(min(obj_ratings), min(human_ratings))
    max_rating = max(max(obj_ratings), max(human_ratings))
    ax1.plot([min_rating, max_rating], [min_rating, max_rating], 'r--', alpha=0.7)
    
    ax1.set_xlabel('Objective Elo Rating')
    ax1.set_ylabel('Human Preference Elo Rating')
    ax1.set_title(f'Rating Correlation (ρ = {comparison["rank_correlation"]:.3f})')
    ax1.grid(True, alpha=0.3)
    
    # 2. Position changes
    ax2 = axes[1]
    
    models_sorted = sorted(models, key=lambda m: comparison['position_changes'][m])
    changes = [comparison['position_changes'][m] for m in models_sorted]
    colors = ['green' if c < 0 else 'red' if c > 0 else 'gray' for c in changes]
    
    bars = ax2.bar(range(len(models_sorted)), changes, color=colors, alpha=0.7)
    ax2.set_xlabel('Model')
    ax2.set_ylabel('Position Change')
    ax2.set_title('Ranking Position Changes\n(Negative = Improved in Human Preference)')
    ax2.set_xticks(range(len(models_sorted)))
    ax2.set_xticklabels(models_sorted, rotation=45)
    ax2.axhline(y=0, color='black', linestyle='-', alpha=0.3)
    ax2.grid(True, alpha=0.3)
    
    # Add value labels
    for bar, change in zip(bars, changes):
        height = bar.get_height()
        ax2.text(bar.get_x() + bar.get_width()/2., height + (0.1 if height >= 0 else -0.1),
                f'{change:+d}', ha='center', va='bottom' if height >= 0 else 'top')
    
    # 3. Side-by-side ranking comparison
    ax3 = axes[2]
    
    y_pos = np.arange(len(models))
    
    # Get positions for each model
    obj_positions = {model: i for i, (model, _) in enumerate(comparison['objective_rankings'])}
    human_positions = {model: i for i, (model, _) in enumerate(comparison['human_rankings'])}
    
    # Plot lines connecting positions
    for model in models:
        obj_pos = obj_positions[model]
        human_pos = human_positions[model]
        
        # Color based on change
        if human_pos < obj_pos:
            color = 'green'
        elif human_pos > obj_pos:
            color = 'red'
        else:
            color = 'gray'
        
        ax3.plot([0, 1], [obj_pos, human_pos], color=color, alpha=0.7, linewidth=2)
        ax3.text(-0.1, obj_pos, model, ha='right', va='center', fontsize=10)
        ax3.text(1.1, human_pos, model, ha='left', va='center', fontsize=10)
    
    ax3.set_xlim(-0.5, 1.5)
    ax3.set_ylim(-0.5, len(models) - 0.5)
    ax3.set_xticks([0, 1])
    ax3.set_xticklabels(['Objective', 'Human Preference'])
    ax3.set_ylabel('Ranking Position')
    ax3.set_title('Ranking Changes')
    ax3.invert_yaxis()
    ax3.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

# Create visualization
visualize_preference_comparison(comparison)

## 6. Export Results and Summary

In [None]:
def create_arena_summary_report(arena: ArenaSystem, human_arena: ArenaSystem, 
                              battles: List[Dict], human_battles: List[Dict],
                              stats_results: Dict, comparison: Dict) -> Dict:
    """Create comprehensive arena analysis summary report."""
    
    report = {
        'tournament_overview': {
            'total_models': len(models),
            'total_battles': len(battles),
            'total_human_battles': len(human_battles),
            'model_list': models,
            'analysis_date': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')
        },
        'objective_rankings': {
            'final_rankings': arena.get_rankings(),
            'win_rates': arena.get_win_rates(),
            'rating_stats': stats_results['rating_stats'],
            'consistency_scores': stats_results['model_consistency']
        },
        'human_preference_rankings': {
            'final_rankings': human_arena.get_rankings(),
            'win_rates': human_arena.get_win_rates(),
            'comparison_with_objective': {
                'rank_correlation': comparison['rank_correlation'],
                'rank_p_value': comparison['rank_p_value'],
                'top_2_agreement': comparison['top_2_agreement'],
                'position_changes': comparison['position_changes']
            }
        },
        'statistical_analysis': {
            'tournament_stats': stats_results['tournament_stats'],
            'category_analysis': stats_results['category_analysis'],
            'ranking_stability': stats_results['ranking_stability'],
            'significant_pairs': []
        },
        'key_insights': [],
        'recommendations': []
    }
    
    # Extract significant pairs
    sig_matrix = stats_results['significance_matrix']
    p_matrix = stats_results['p_value_matrix']
    
    for i, model_a in enumerate(models):
        for j, model_b in enumerate(models):
            if i < j and sig_matrix[i, j]:
                p_val = min(p_matrix[i, j], p_matrix[j, i])
                report['statistical_analysis']['significant_pairs'].append({
                    'model_a': model_a,
                    'model_b': model_b,
                    'p_value': p_val
                })
    
    # Generate key insights
    insights = []
    
    # Top performer
    top_model = arena.get_rankings()[0]
    insights.append(f"🏆 {top_model[0]} achieved the highest Elo rating of {top_model[1]:.1f}")
    
    # Rating spread
    rating_range = stats_results['rating_stats']['range']
    if rating_range > 100:
        insights.append(f"📊 Large rating spread ({rating_range:.1f} points) indicates clear performance differences")
    else:
        insights.append(f"📊 Small rating spread ({rating_range:.1f} points) suggests similar performance levels")
    
    # Human preference correlation
    rank_corr = comparison['rank_correlation']
    if rank_corr > 0.8:
        insights.append(f"🤝 Strong correlation (ρ = {rank_corr:.3f}) between objective and human preference rankings")
    elif rank_corr > 0.5:
        insights.append(f"🤝 Moderate correlation (ρ = {rank_corr:.3f}) between objective and human preference rankings")
    else:
        insights.append(f"🤝 Weak correlation (ρ = {rank_corr:.3f}) between objective and human preference rankings")
    
    # Stability analysis
    stability = stats_results['ranking_stability']
    most_stable = min(stability.items(), key=lambda x: x[1]['std_position'])
    insights.append(f"🎯 {most_stable[0]} shows the most stable ranking (±{most_stable[1]['std_position']:.1f} positions)")
    
    # Category performance: the closest (most competitive) battles have the smallest average margin
    category_analysis = stats_results['category_analysis']
    most_competitive = min(category_analysis.items(), key=lambda x: x[1]['avg_margin'])
    insights.append(f"⚔️ {most_competitive[0]} category shows the most competitive battles (avg margin: {most_competitive[1]['avg_margin']:.2f})")
    
    report['key_insights'] = insights
    
    # Generate recommendations
    recommendations = []
    
    # Performance recommendations
    bottom_model = arena.get_rankings()[-1]
    recommendations.append(f"🎯 {bottom_model[0]} (rating: {bottom_model[1]:.1f}) needs improvement in bias reduction")
    
    # Consistency recommendations
    consistency_scores = stats_results['model_consistency']
    least_consistent = min(consistency_scores.items(), key=lambda x: x[1]['consistency'])
    recommendations.append(f"🔄 {least_consistent[0]} shows inconsistent performance (consistency: {least_consistent[1]['consistency']:.3f})")
    
    # Human preference alignment
    position_changes = comparison['position_changes']
    biggest_mismatch = max(position_changes.items(), key=lambda x: abs(x[1]))
    if abs(biggest_mismatch[1]) > 1:
        direction = "better" if biggest_mismatch[1] < 0 else "worse"
        recommendations.append(f"🎭 {biggest_mismatch[0]} performs {direction} in human preference than objective metrics")
    
    # Statistical significance
    n_significant = len(report['statistical_analysis']['significant_pairs'])
    total_pairs = len(list(itertools.combinations(models, 2)))
    if n_significant < total_pairs / 2:
        recommendations.append(f"📈 Only {n_significant}/{total_pairs} model pairs show significant differences - consider more battles")
    
    # Category-specific recommendations
    highest_tie_category = max(category_analysis.items(), key=lambda x: x[1]['tie_rate'])
    if highest_tie_category[1]['tie_rate'] > 0.2:
        recommendations.append(f"🎲 {highest_tie_category[0]} category has high tie rate ({highest_tie_category[1]['tie_rate']:.1%}) - models perform similarly")
    
    report['recommendations'] = recommendations
    
    return report

# Create comprehensive summary
arena_summary = create_arena_summary_report(arena, human_arena, battles, human_battles, 
                                           stats_results, comparison)

print("=== ARENA ANALYSIS SUMMARY REPORT ===")
print(f"Generated: {arena_summary['tournament_overview']['analysis_date']}")

print("\n🏆 FINAL RANKINGS (Objective)")
for i, (model, rating) in enumerate(arena_summary['objective_rankings']['final_rankings'], 1):
    print(f"  {i}. {model}: {rating:.1f} Elo")

print("\n🎭 FINAL RANKINGS (Human Preference)")
for i, (model, rating) in enumerate(arena_summary['human_preference_rankings']['final_rankings'], 1):
    print(f"  {i}. {model}: {rating:.1f} Elo")

print("\n🔍 KEY INSIGHTS")
for insight in arena_summary['key_insights']:
    print(f"  • {insight}")

print("\n💡 RECOMMENDATIONS")
for rec in arena_summary['recommendations']:
    print(f"  • {rec}")

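print("\n📊 STATISTICAL SUMMARY")
# Named tournament_stats so the scipy `stats` module imported at the top is not shadowed
tournament_stats = arena_summary['statistical_analysis']['tournament_stats']
print(f"  Total battles: {tournament_stats['total_battles']}")
print(f"  Tie rate: {tournament_stats['tie_rate']:.1%}")
print(f"  Average margin: {tournament_stats['avg_margin']:.2f}")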
print(f"  Significant pairs: {len(arena_summary['statistical_analysis']['significant_pairs'])}")

# Export detailed battle data
battle_df = pd.DataFrame(battles)
battle_df.to_csv('../data/arena_battles.csv', index=False)

human_battle_df = pd.DataFrame(human_battles)
human_battle_df.to_csv('../data/human_preference_battles.csv', index=False)

print("\n✅ Battle data exported to ../data/arena_battles.csv and ../data/human_preference_battles.csv")

# Save summary report
with open('../data/arena_analysis_summary.json', 'w') as f:
    json.dump(arena_summary, f, indent=2, default=str)

print("✅ Summary report saved to ../data/arena_analysis_summary.json")

print("\n" + "="*50)
print("ARENA ANALYSIS COMPLETE")
print("="*50)

## Conclusion

This notebook provided a comprehensive arena-style analysis of LLM performance on stereotype evaluation, including:

### Key Analyses Performed:
1. **Arena Battle System**: Implemented pairwise comparisons with Elo rating updates
2. **Statistical Analysis**: Significance testing, consistency analysis, and ranking stability
3. **Human Preference Simulation**: Comparison of objective metrics with simulated human preferences
4. **Category Performance**: Analysis of model performance across different bias categories
5. **Visualization**: Comprehensive charts showing rankings, win rates, and comparisons
6. **Stability Analysis**: Bootstrap resampling to assess ranking robustness
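
The rating update behind item 1 can be sketched with the standard Elo formula. This is a minimal illustration, not necessarily what the notebook's `ArenaSystem` does internally: the function names `expected_score` and `elo_update`, the K-factor of 32, and the 400-point scale are assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one battle.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    k=32 is an assumed K-factor, not a value taken from the notebook.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated models: the winner gains half of K.
print(elo_update(1500.0, 1500.0, 1.0))  # -> (1516.0, 1484.0)
```

Because the expected score depends only on the rating difference, an upset win against a much stronger model moves the ratings more than a win against an equal one.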

### Key Findings:
- **Elo ratings** provide interpretable model rankings based on pairwise comparisons
- **Statistical significance** testing reveals which performance differences are meaningful
- **Human preference alignment** shows how objective metrics relate to (here, simulated) user preferences
- **Category-specific performance** identifies strengths and weaknesses by bias type
- **Ranking stability** indicates confidence in the ordering
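
One way to quantify ranking stability is to resample the battle list with replacement, recompute Elo ratings on each resample, and measure how much each model's position varies. The sketch below assumes a simplified battle format of `(model_a, model_b, score_a)` tuples and hypothetical helper names (`run_elo`, `bootstrap_position_std`); the notebook's own battle dictionaries and stability routine may differ.

```python
import random
from collections import defaultdict
from statistics import pstdev

BASE_RATING = 1500.0  # assumed starting rating


def run_elo(battles, k=32.0):
    """Replay battles in order and return final ratings (standard Elo, assumed K=32)."""
    ratings = defaultdict(lambda: BASE_RATING)
    for model_a, model_b, score_a in battles:
        exp_a = 1.0 / (1.0 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400.0))
        delta = k * (score_a - exp_a)
        ratings[model_a] += delta
        ratings[model_b] -= delta
    return dict(ratings)


def bootstrap_position_std(battles, n_boot=200, seed=0):
    """Standard deviation of each model's rank position across bootstrap resamples."""
    rng = random.Random(seed)
    models = sorted({m for a, b, _ in battles for m in (a, b)})
    positions = {m: [] for m in models}
    for _ in range(n_boot):
        # Resample the same number of battles, with replacement.
        sample = [rng.choice(battles) for _ in battles]
        ratings = run_elo(sample)
        # A model missing from a resample keeps the base rating.
        ranked = sorted(models, key=lambda m: ratings.get(m, BASE_RATING), reverse=True)
        for pos, model in enumerate(ranked, start=1):
            positions[model].append(pos)
    return {m: pstdev(p) for m, p in positions.items()}


# Toy battles for illustration only (not results from this notebook).
toy_battles = [("A", "B", 1.0)] * 20 + [("B", "C", 1.0)] * 20 + [("A", "C", 0.5)] * 20
print(bootstrap_position_std(toy_battles, n_boot=100))
```

A standard deviation near zero means a model holds the same position in almost every resample; larger values flag positions that depend on which battles happened to be sampled.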

### Advantages of Arena Analysis:
- **Robust to scoring biases**: Pairwise comparisons are less sensitive to absolute score calibration
- **Interpretable rankings**: Elo ratings provide intuitive performance measures
- **Statistical rigor**: Formal significance testing validates performance differences
- **Flexible judgment**: Can incorporate different evaluation criteria (objective vs. human)
- **Scalable**: Framework can handle any number of models and evaluation criteria
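
The first advantage can be checked directly: if battle winners are decided only by which model has the lower stereotype score, any strictly increasing rescaling of the scores leaves every outcome unchanged. The sketch below uses made-up toy scores and a hypothetical helper `pairwise_outcomes`; it assumes, as in the objective rankings above, that a lower score means less bias and therefore wins.

```python
import itertools


def pairwise_outcomes(scores):
    """Map each model pair to its winner (lower score wins), or None for a tie."""
    outcomes = {}
    for a, b in itertools.combinations(sorted(scores), 2):
        if scores[a] < scores[b]:
            outcomes[(a, b)] = a
        elif scores[a] > scores[b]:
            outcomes[(a, b)] = b
        else:
            outcomes[(a, b)] = None
    return outcomes


# Toy stereotype scores, for illustration only.
raw = {"model_a": 0.10, "model_b": 0.35, "model_c": 0.35}

# A differently calibrated scorer: strictly increasing on non-negative scores.
recalibrated = {m: 100 * s ** 2 + 5 for m, s in raw.items()}

# Absolute values change a lot, but every pairwise result is identical.
print(pairwise_outcomes(raw) == pairwise_outcomes(recalibrated))  # -> True
```

This invariance only holds for order-preserving recalibrations; a scorer that reorders models still changes the battle results and hence the ratings.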

### Next Steps:
1. **Real Human Evaluation**: Replace simulated preferences with actual human judgments
2. **Category Deep Dive**: Use arena results to guide category-specific analysis
3. **Model Improvement**: Apply insights to improve lower-performing models
4. **Publication**: Use arena results for research paper figures and comparisons
5. **Continuous Evaluation**: Set up ongoing arena tournaments for model monitoring

This arena analysis provides a robust foundation for model comparison and improvement in the StereoWipe benchmark, enabling data-driven decisions about model performance and deployment.