# Autorater Calibration Tutorial

**Lesson 14 - Agent Evaluation Methodology (Task 4.11)**

## Learning Objectives

By the end of this tutorial, you will:

1. Understand how to build LLM-as-Judge autoraters for agent evaluation
2. Design effective autorater prompts with multi-dimensional scoring
3. Calibrate autoraters against human annotations
4. Measure agreement metrics (correlation, Cohen's kappa, MAE)
5. Identify and analyze systematic biases in autorater evaluations
6. Visualize autorater vs. human score comparisons

## Tutorial Overview

This notebook demonstrates:
- **Part 1**: Introduction & mode selection (DEMO/FULL)
- **Part 2**: Load agent responses and human annotations
- **Part 3**: Define evaluation criteria (accuracy, completeness, tool usage, reasoning quality)
- **Part 4**: Build autorater prompts with scoring rubrics
- **Part 5**: Run autorater evaluations
- **Part 6**: Calibrate with human feedback
- **Part 7**: Visualize correlation and bias
- **Part 8**: Save results and validate

## Execution Modes

- **DEMO mode**: 10 responses × 4 criteria = 40 evaluations (~$0.60-$0.80, <3 min)
- **FULL mode**: 50 responses × 4 criteria = 200 evaluations (~$2.50-$3.00, <10 min)

## Prerequisites

- Completed `agent_evaluation_fundamentals.md`
- Completed `autorater_final_response_eval.md`
- Anthropic API key configured
- Basic understanding of LLM-as-Judge methodology

---
## Part 1: Introduction & Setup

In [None]:
# Import required libraries
import json
import os
from pathlib import Path
from typing import Any, Dict, List, Tuple
import anthropic
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.metrics import cohen_kappa_score, mean_absolute_error
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("✅ Libraries imported successfully")

In [None]:
# Mode selection
MODE = "DEMO"  # Change to "FULL" for complete evaluation

# Configuration
CONFIG = {
    "DEMO": {
        "num_responses": 10,
        "estimated_cost": "$0.60-$0.80",
        "estimated_time": "2-3 minutes"
    },
    "FULL": {
        "num_responses": 50,
        "estimated_cost": "$2.50-$3.00",
        "estimated_time": "8-10 minutes"
    }
}

print(f"🎯 Mode: {MODE}")
print(f"📊 Responses to evaluate: {CONFIG[MODE]['num_responses']}")
print(f"💰 Estimated cost: {CONFIG[MODE]['estimated_cost']}")
print(f"⏱️  Estimated time: {CONFIG[MODE]['estimated_time']}")
print("\n" + "="*80)

In [None]:
# Validate Anthropic API key
api_key = os.getenv("ANTHROPIC_API_KEY")
if not api_key:
    raise ValueError(
        "ANTHROPIC_API_KEY not found. Please set it in your environment:\n"
        "export ANTHROPIC_API_KEY='your-api-key-here'"
    )

client = anthropic.Anthropic(api_key=api_key)
print("✅ Anthropic API client initialized")

# Test API connection
try:
    test_response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=10,
        messages=[{"role": "user", "content": "Hello"}]
    )
    print("✅ API connection successful")
except Exception as e:
    # Chain the original exception so the underlying cause stays in the traceback
    raise ConnectionError(f"Failed to connect to Anthropic API: {e}") from e

---
## Part 2: Load Agent Response Data

In [None]:
# Load agent responses
responses_path = Path("data/agent_responses_sample.json")
annotations_path = Path("data/human_annotations.json")

if not responses_path.exists():
    raise FileNotFoundError(f"Agent responses file not found: {responses_path}")
if not annotations_path.exists():
    raise FileNotFoundError(f"Human annotations file not found: {annotations_path}")

with open(responses_path, 'r') as f:
    responses_data = json.load(f)

with open(annotations_path, 'r') as f:
    annotations_data = json.load(f)

# Extract responses and filter by mode
all_responses = responses_data['responses']
all_annotations = annotations_data['annotations']

num_to_evaluate = CONFIG[MODE]['num_responses']
responses = all_responses[:num_to_evaluate]
annotations = all_annotations[:num_to_evaluate]

print(f"📥 Loaded {len(responses)} agent responses")
print(f"📥 Loaded {len(annotations)} human annotations")
print(f"\nSample response:")
print(f"  Query: {responses[0]['query']}")
print(f"  Type: {responses[0]['query_type']}")
print(f"  Tools used: {', '.join(responses[0]['tools_used'])}")
print(f"  Response length: {len(responses[0]['agent_output'])} chars")

In [None]:
# Inspect data distribution
query_types = [r['query_type'] for r in responses]
from collections import Counter

type_counts = Counter(query_types)

print("📊 Query type distribution:")
for qtype, count in sorted(type_counts.items(), key=lambda x: -x[1]):
    print(f"  {qtype}: {count} ({count/len(responses)*100:.1f}%)")

# Check annotation completeness
assert len(responses) == len(annotations), "Mismatch between responses and annotations"
for i, (resp, annot) in enumerate(zip(responses, annotations)):
    assert resp['response_id'] == annot['response_id'], f"ID mismatch at index {i}"

print("\n✅ Data validation passed")

---
## Part 3: Define Evaluation Criteria

In [None]:
# Define 4 evaluation criteria with detailed rubrics
CRITERIA = {
    "accuracy": {
        "name": "Accuracy",
        "description": "Factual correctness of the information provided",
        "rubric": {
            1: "Multiple factual errors or completely incorrect information",
            2: "Some factual errors or outdated information",
            3: "Mostly accurate with minor errors or ambiguities",
            4: "Accurate information with proper context",
            5: "Completely accurate with excellent supporting details"
        }
    },
    "completeness": {
        "name": "Completeness",
        "description": "How well the response addresses all aspects of the query",
        "rubric": {
            1: "Fails to address the main query or missing critical information",
            2: "Addresses query partially, missing important aspects",
            3: "Addresses main query but could include more relevant details",
            4: "Thoroughly addresses query with good coverage",
            5: "Comprehensively addresses all aspects with excellent depth"
        }
    },
    "tool_usage": {
        "name": "Tool Usage",
        "description": "Appropriate selection and use of tools/functions",
        "rubric": {
            1: "Wrong tools used or critical tools missing",
            2: "Suboptimal tool selection or execution",
            3: "Appropriate tools used with minor inefficiencies",
            4: "Well-chosen tools used effectively",
            5: "Optimal tool selection and execution"
        }
    },
    "reasoning_quality": {
        "name": "Reasoning Quality",
        "description": "Logical coherence and clarity of the reasoning process",
        "rubric": {
            1: "Illogical or incoherent reasoning",
            2: "Weak reasoning with logical gaps",
            3: "Generally sound reasoning with minor issues",
            4: "Clear and logical reasoning throughout",
            5: "Exceptional reasoning with excellent step-by-step clarity"
        }
    }
}

print("📋 Evaluation Criteria:")
print("="*80)
for criterion_key, criterion in CRITERIA.items():
    print(f"\n{criterion['name'].upper()}")
    print(f"Description: {criterion['description']}")
    print("\nScoring Rubric (1-5):")
    for score, description in criterion['rubric'].items():
        print(f"  {score} - {description}")

---
## Part 4: Build Autorater Prompts

In [None]:
# Autorater prompt template
def build_autorater_prompt(query: str, response: str, tools: List[str], 
                          reasoning: List[str], criterion_key: str) -> str:
    """Build autorater prompt for a specific criterion.
    
    Args:
        query: User query
        response: Agent response
        tools: Tools used by agent
        reasoning: Reasoning trace
        criterion_key: Which criterion to evaluate (accuracy, completeness, etc.)
    
    Returns:
        Formatted prompt for LLM-as-Judge
    """
    criterion = CRITERIA[criterion_key]
    
    prompt = f"""You are an expert evaluator assessing an AI agent's response to a user query.

**USER QUERY:**
{query}

**AGENT RESPONSE:**
{response}

**TOOLS USED:**
{', '.join(tools) if tools else 'None'}

**REASONING TRACE:**
{chr(10).join(f'{i+1}. {step}' for i, step in enumerate(reasoning))}

---

**EVALUATION CRITERION: {criterion['name'].upper()}**

Definition: {criterion['description']}

**SCORING RUBRIC (1-5):**
{chr(10).join(f'{score}. {description}' for score, description in criterion['rubric'].items())}

---

**INSTRUCTIONS:**

1. Carefully analyze the agent's response against the scoring rubric above
2. Provide a score from 1-5 based on the rubric
3. Explain your reasoning with specific evidence from the response
4. Be objective and consistent in your evaluation

**OUTPUT FORMAT (JSON):**

{{
  "score": <integer 1-5>,
  "reasoning": "<2-3 sentence explanation>",
  "evidence": "<specific quote or example from the response>"
}}

Provide ONLY the JSON output, no additional text."""
    
    return prompt

# Test the prompt template
test_prompt = build_autorater_prompt(
    query=responses[0]['query'],
    response=responses[0]['agent_output'],
    tools=responses[0]['tools_used'],
    reasoning=responses[0]['reasoning_trace'],
    criterion_key='accuracy'
)

print("📝 Sample Autorater Prompt:")
print("="*80)
print(test_prompt[:500] + "...\n[truncated]")
print("\n✅ Prompt template created successfully")

---
## Part 5: Run Autorater Evaluation

In [None]:
def evaluate_with_autorater(query: str, response: str, tools: List[str], 
                           reasoning: List[str], criterion_key: str) -> Dict[str, Any]:
    """Evaluate a single response on a single criterion using LLM-as-Judge.
    
    Args:
        query: User query
        response: Agent response
        tools: Tools used
        reasoning: Reasoning trace
        criterion_key: Criterion to evaluate
    
    Returns:
        Evaluation result with score, reasoning, evidence
    """
    prompt = build_autorater_prompt(query, response, tools, reasoning, criterion_key)
    
    try:
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=500,
            temperature=0.0,  # Minimizes sampling randomness; outputs may still vary slightly between calls
            messages=[{"role": "user", "content": prompt}]
        )
        
        content = message.content[0].text
        
        # Parse JSON response
        result = json.loads(content)
        
        # Validate score is in range
        if not isinstance(result['score'], int) or not 1 <= result['score'] <= 5:
            raise ValueError(f"Invalid score: {result['score']}")
        
        return {
            "score": result['score'],
            "reasoning": result.get('reasoning', ''),
            "evidence": result.get('evidence', ''),
            "input_tokens": message.usage.input_tokens,
            "output_tokens": message.usage.output_tokens
        }
        
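    except json.JSONDecodeError as e:
        print(f"⚠️  JSON parsing error: {e}")
        print(f"Raw response: {content[:200]}")
        # NOTE: the fallback score of 3 is a placeholder, not a real judgment.
        # Check for "Error in ..." reasoning before trusting agreement metrics.
        return {"score": 3, "reasoning": "Error in parsing", "evidence": "", 
                "input_tokens": 0, "output_tokens": 0}
    except Exception as e:
        # Also catches the ValueError raised above for missing or out-of-range scores
        print(f"⚠️  Evaluation error: {e}")
        return {"score": 3, "reasoning": "Error in evaluation", "evidence": "",
                "input_tokens": 0, "output_tokens": 0}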

print("✅ Autorater evaluation function created")
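
The function above hands the raw model text straight to `json.loads`, so a reply wrapped in a Markdown code fence or preceded by a sentence falls into the error branch and receives the placeholder score. The sketch below shows one more tolerant way to parse such replies. `parse_judge_output` is a hypothetical helper, not part of this tutorial's code, and it returns `None` on failure so that unusable replies can be excluded rather than scored as 3.

```python
import json
import re
from typing import Any, Dict, Optional


def parse_judge_output(text: str) -> Optional[Dict[str, Any]]:
    """Return the judge's JSON object, or None if no valid 1-5 score is found."""
    # Take everything from the first "{" to the last "}" so that code fences
    # or extra sentences around the JSON are ignored.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        result = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(result, dict):
        return None
    score = result.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        return None
    return result


wrapped = 'Here is my evaluation:\n{"score": 4, "reasoning": "Mostly correct."}\nDone.'
print(parse_judge_output(wrapped))                    # parsed dict with score 4
print(parse_judge_output("I cannot evaluate this."))  # None
print(parse_judge_output('{"score": 9}'))             # None (out of range)
```

Returning `None` is a design choice for this sketch: the caller then has to decide whether to retry, skip the item, or flag it for human review.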

In [None]:
# Run autorater evaluation on all responses
print(f"🚀 Starting autorater evaluation ({MODE} mode)")
print(f"Evaluating {len(responses)} responses × {len(CRITERIA)} criteria = {len(responses) * len(CRITERIA)} total evaluations\n")

evaluation_results = []
total_cost = 0.0
total_input_tokens = 0
total_output_tokens = 0

# Cost per 1M tokens (Claude 3.5 Sonnet)
INPUT_COST_PER_1M = 3.00
OUTPUT_COST_PER_1M = 15.00

for i, (resp, annot) in enumerate(tqdm(zip(responses, annotations), 
                                       total=len(responses),
                                       desc="Evaluating responses")):
    
    response_eval = {
        "response_id": resp['response_id'],
        "query": resp['query'],
        "query_type": resp['query_type'],
        "autorater_scores": {},
        "human_scores": {
            "accuracy": annot['accuracy'],
            "completeness": annot['completeness'],
            "tool_usage": annot['tool_usage'],
            "reasoning_quality": annot['reasoning_quality']
        },
        "human_annotator": annot['annotator_id'],
        "human_confidence": annot['confidence']
    }
    
    # Evaluate on each criterion
    for criterion_key in CRITERIA.keys():
        eval_result = evaluate_with_autorater(
            query=resp['query'],
            response=resp['agent_output'],
            tools=resp['tools_used'],
            reasoning=resp['reasoning_trace'],
            criterion_key=criterion_key
        )
        
        response_eval['autorater_scores'][criterion_key] = {
            "score": eval_result['score'],
            "reasoning": eval_result['reasoning'],
            "evidence": eval_result['evidence']
        }
        
        # Track token usage and cost
        total_input_tokens += eval_result['input_tokens']
        total_output_tokens += eval_result['output_tokens']
    
    evaluation_results.append(response_eval)

# Calculate total cost
total_cost = (
    (total_input_tokens / 1_000_000) * INPUT_COST_PER_1M +
    (total_output_tokens / 1_000_000) * OUTPUT_COST_PER_1M
)

print(f"\n✅ Evaluation complete!")
print(f"📊 Total evaluations: {len(evaluation_results) * len(CRITERIA)}")
print(f"🔢 Input tokens: {total_input_tokens:,}")
print(f"🔢 Output tokens: {total_output_tokens:,}")
print(f"💰 Total cost: ${total_cost:.2f}")

In [None]:
# Inspect sample evaluation
sample_eval = evaluation_results[0]

print("📋 Sample Evaluation Result:")
print("="*80)
print(f"Query: {sample_eval['query']}")
print(f"Query Type: {sample_eval['query_type']}")
print(f"\nScores Comparison:")
print(f"{'Criterion':<20} {'Autorater':<12} {'Human':<12} {'Difference'}")
print("-" * 60)

for criterion_key in CRITERIA.keys():
    auto_score = sample_eval['autorater_scores'][criterion_key]['score']
    human_score = sample_eval['human_scores'][criterion_key]
    diff = auto_score - human_score
    
    print(f"{CRITERIA[criterion_key]['name']:<20} {auto_score:<12} {human_score:<12} {diff:+.1f}")

print(f"\nAutorater Reasoning (Accuracy):")
print(sample_eval['autorater_scores']['accuracy']['reasoning'])

---
## Part 6: Calibration with Human Feedback

In [None]:
# Calculate agreement metrics
def calculate_agreement_metrics(evaluation_results: List[Dict]) -> Dict[str, Any]:
    """Calculate agreement metrics between autorater and human scores.
    
    Args:
        evaluation_results: List of evaluation results
    
    Returns:
        Dictionary of agreement metrics
    """
    metrics = {
        "overall": {},
        "by_criterion": {}
    }
    
    # Collect all scores
    all_autorater_scores = []
    all_human_scores = []
    
    criterion_scores = {crit: {"autorater": [], "human": []} for crit in CRITERIA.keys()}
    
    for eval_result in evaluation_results:
        for criterion_key in CRITERIA.keys():
            auto_score = eval_result['autorater_scores'][criterion_key]['score']
            human_score = eval_result['human_scores'][criterion_key]
            
            all_autorater_scores.append(auto_score)
            all_human_scores.append(human_score)
            
            criterion_scores[criterion_key]['autorater'].append(auto_score)
            criterion_scores[criterion_key]['human'].append(human_score)
    
    # Overall metrics
    metrics['overall']['pearson_correlation'] = stats.pearsonr(all_autorater_scores, all_human_scores)[0]
    metrics['overall']['spearman_correlation'] = stats.spearmanr(all_autorater_scores, all_human_scores)[0]
    metrics['overall']['mean_absolute_error'] = mean_absolute_error(all_human_scores, all_autorater_scores)
    
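    # Cohen's kappa treats each 1-5 score as a category; this is the unweighted
    # form, so a 4-vs-5 disagreement counts as much as a 1-vs-5 disagreement
    metrics['overall']['cohens_kappa'] = cohen_kappa_score(all_human_scores, all_autorater_scores)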
    
    # Mean difference (bias)
    differences = np.array(all_autorater_scores) - np.array(all_human_scores)
    metrics['overall']['mean_difference'] = float(np.mean(differences))
    metrics['overall']['std_difference'] = float(np.std(differences))
    
    # Per-criterion metrics
    for criterion_key, scores in criterion_scores.items():
        auto = scores['autorater']
        human = scores['human']
        
        metrics['by_criterion'][criterion_key] = {
            "pearson_correlation": stats.pearsonr(auto, human)[0],
            "spearman_correlation": stats.spearmanr(auto, human)[0],
            "mean_absolute_error": mean_absolute_error(human, auto),
            "mean_difference": float(np.mean(np.array(auto) - np.array(human))),
            "autorater_mean": float(np.mean(auto)),
            "human_mean": float(np.mean(human))
        }
    
    return metrics

agreement_metrics = calculate_agreement_metrics(evaluation_results)

print("📊 AGREEMENT METRICS")
print("="*80)
print("\nOVERALL METRICS:")
print(f"  Pearson Correlation:  {agreement_metrics['overall']['pearson_correlation']:.3f}")
print(f"  Spearman Correlation: {agreement_metrics['overall']['spearman_correlation']:.3f}")
print(f"  Cohen's Kappa:        {agreement_metrics['overall']['cohens_kappa']:.3f}")
print(f"  Mean Absolute Error:  {agreement_metrics['overall']['mean_absolute_error']:.3f}")
print(f"  Mean Difference:      {agreement_metrics['overall']['mean_difference']:+.3f}")
print(f"  Std Difference:       {agreement_metrics['overall']['std_difference']:.3f}")

print("\nPER-CRITERION CORRELATIONS:")
print(f"{'Criterion':<20} {'Pearson':<10} {'Spearman':<10} {'MAE':<10} {'Bias'}")
print("-" * 70)
for criterion_key, metrics in agreement_metrics['by_criterion'].items():
    print(f"{CRITERIA[criterion_key]['name']:<20} "
          f"{metrics['pearson_correlation']:<10.3f} "
          f"{metrics['spearman_correlation']:<10.3f} "
          f"{metrics['mean_absolute_error']:<10.3f} "
          f"{metrics['mean_difference']:+.3f}")
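
The kappa computed above is unweighted, which ignores the ordering of the 1-5 scale. A weighted kappa penalizes disagreements by their distance instead. scikit-learn's `cohen_kappa_score` accepts `weights="quadratic"` for this (passing `labels=[1, 2, 3, 4, 5]` as well is probably wise, so that distances reflect the full scale even when some scores never occur). The sketch below spells the calculation out in plain Python. The function name and the sample scores are illustrative assumptions, not part of the tutorial.

```python
from collections import Counter
from typing import Sequence


def quadratic_weighted_kappa(rater_a: Sequence[int], rater_b: Sequence[int],
                             min_score: int = 1, max_score: int = 5) -> float:
    """Cohen's kappa with quadratic weights for integer scores on a fixed scale."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("Both raters need the same non-zero number of scores")

    span = max_score - min_score
    levels = range(min_score, max_score + 1)
    observed = Counter(zip(rater_a, rater_b))               # joint counts
    count_a, count_b = Counter(rater_a), Counter(rater_b)   # marginal counts

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in levels:
        for j in levels:
            weight = ((i - j) / span) ** 2  # 0 on the diagonal, 1 at maximum distance
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * (count_a[i] / n) * (count_b[j] / n)

    if expected_disagreement == 0:
        return float("nan")  # both raters used one identical score; kappa is undefined
    return 1.0 - observed_disagreement / expected_disagreement


human = [1, 2, 3, 4, 5]
auto = [2, 3, 4, 5, 5]  # made-up scores: off by one on four of five items
print(round(quadratic_weighted_kappa(human, auto), 3))  # 0.8
```

Off-by-one disagreements cost little here, so the weighted value stays high even though exact agreement occurs on only one item.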

In [None]:
# Analyze bias patterns
print("🔍 BIAS ANALYSIS")
print("="*80)

mean_diff = agreement_metrics['overall']['mean_difference']

if abs(mean_diff) < 0.1:
    bias_interpretation = "No systematic bias detected"
elif mean_diff > 0:
    bias_interpretation = f"Autorater tends to score {mean_diff:.2f} points HIGHER than humans"
else:
    bias_interpretation = f"Autorater tends to score {abs(mean_diff):.2f} points LOWER than humans"

print(f"\nOverall Bias: {bias_interpretation}")

print("\nPer-Criterion Bias:")
for criterion_key, metrics in agreement_metrics['by_criterion'].items():
    diff = metrics['mean_difference']
    if abs(diff) < 0.1:
        bias = "Neutral"
    elif diff > 0:
        bias = f"Over-scores by {diff:.2f}"
    else:
        bias = f"Under-scores by {abs(diff):.2f}"
    
    print(f"  {CRITERIA[criterion_key]['name']:<20}: {bias}")

# Identify disagreement cases
large_disagreements = []
for eval_result in evaluation_results:
    for criterion_key in CRITERIA.keys():
        auto_score = eval_result['autorater_scores'][criterion_key]['score']
        human_score = eval_result['human_scores'][criterion_key]
        diff = abs(auto_score - human_score)
        
        if diff >= 2:  # Disagreement of 2+ points
            large_disagreements.append({
                "response_id": eval_result['response_id'],
                "query": eval_result['query'],
                "criterion": CRITERIA[criterion_key]['name'],
                "autorater": auto_score,
                "human": human_score,
                "difference": auto_score - human_score
            })

print(f"\nLarge Disagreements (≥2 points): {len(large_disagreements)}")
if large_disagreements:
    print("\nTop 3 Disagreements:")
    for i, disagreement in enumerate(sorted(large_disagreements, 
                                            key=lambda x: abs(x['difference']), 
                                            reverse=True)[:3], 1):
        print(f"\n{i}. {disagreement['criterion']} - {disagreement['query'][:60]}...")
        print(f"   Autorater: {disagreement['autorater']}, Human: {disagreement['human']} "
              f"(Diff: {disagreement['difference']:+d})")
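
The bias analysis above groups differences by criterion only. Since each result also carries a `query_type`, the same mean-difference idea can be applied per query type to check whether the autorater is inconsistent across kinds of queries. The sketch below assumes records shaped like `evaluation_results`. The helper name, the query type labels, and the scores are made up for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def bias_by_query_type(results: List[dict]) -> Dict[str, float]:
    """Mean (autorater - human) score difference for each query type."""
    diffs = defaultdict(list)
    for item in results:
        for criterion, auto in item["autorater_scores"].items():
            human = item["human_scores"][criterion]
            diffs[item["query_type"]].append(auto["score"] - human)
    return {qtype: mean(values) for qtype, values in diffs.items()}


# Hypothetical records using the same keys as evaluation_results
sample = [
    {"query_type": "factual",
     "autorater_scores": {"accuracy": {"score": 5}}, "human_scores": {"accuracy": 4}},
    {"query_type": "factual",
     "autorater_scores": {"accuracy": {"score": 4}}, "human_scores": {"accuracy": 4}},
    {"query_type": "planning",
     "autorater_scores": {"accuracy": {"score": 2}}, "human_scores": {"accuracy": 3}},
]

for qtype, bias in bias_by_query_type(sample).items():
    print(f"{qtype}: {bias:+.2f}")
```

With only a handful of responses per query type (as in DEMO mode), these per-group means are noisy and are best read as hints for where to look, not as firm estimates.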

---
## Part 7: Visualizations

In [None]:
# Correlation scatter plots
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.flatten()

for idx, (criterion_key, criterion) in enumerate(CRITERIA.items()):
    ax = axes[idx]
    
    # Extract scores
    auto_scores = [e['autorater_scores'][criterion_key]['score'] for e in evaluation_results]
    human_scores = [e['human_scores'][criterion_key] for e in evaluation_results]
    
    # Scatter plot with jitter to show overlapping points
    jitter = 0.1
    auto_jittered = np.array(auto_scores) + np.random.uniform(-jitter, jitter, len(auto_scores))
    human_jittered = np.array(human_scores) + np.random.uniform(-jitter, jitter, len(human_scores))
    
    ax.scatter(human_jittered, auto_jittered, alpha=0.6, s=80, edgecolors='black', linewidth=0.5)
    
    # Perfect agreement line
    ax.plot([1, 5], [1, 5], 'r--', linewidth=2, label='Perfect Agreement', alpha=0.7)
    
    # Regression line
    z = np.polyfit(human_scores, auto_scores, 1)
    p = np.poly1d(z)
    ax.plot([1, 5], [p(1), p(5)], 'b-', linewidth=2, label='Regression', alpha=0.7)
    
    # Formatting
    ax.set_xlabel('Human Score', fontsize=11, fontweight='bold')
    ax.set_ylabel('Autorater Score', fontsize=11, fontweight='bold')
    ax.set_title(f"{criterion['name']}\n(r={agreement_metrics['by_criterion'][criterion_key]['pearson_correlation']:.3f})",
                fontsize=12, fontweight='bold')
    ax.set_xlim(0.5, 5.5)
    ax.set_ylim(0.5, 5.5)
    ax.set_xticks(range(1, 6))
    ax.set_yticks(range(1, 6))
    ax.grid(True, alpha=0.3)
    ax.legend(loc='upper left', fontsize=9)
    ax.set_aspect('equal')

plt.suptitle('Autorater vs Human Score Correlation by Criterion', 
            fontsize=14, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

print("📊 Correlation plots generated")

In [None]:
# Confusion matrices for each criterion
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.flatten()

for idx, (criterion_key, criterion) in enumerate(CRITERIA.items()):
    ax = axes[idx]
    
    # Extract scores
    auto_scores = [e['autorater_scores'][criterion_key]['score'] for e in evaluation_results]
    human_scores = [e['human_scores'][criterion_key] for e in evaluation_results]
    
    # Create confusion matrix
    confusion = np.zeros((5, 5))
    for h, a in zip(human_scores, auto_scores):
        confusion[h-1, a-1] += 1
    
    # Plot heatmap
    sns.heatmap(confusion, annot=True, fmt='g', cmap='Blues', ax=ax,
               xticklabels=range(1, 6), yticklabels=range(1, 6),
               cbar_kws={'label': 'Count'})
    
    ax.set_xlabel('Autorater Score', fontsize=11, fontweight='bold')
    ax.set_ylabel('Human Score', fontsize=11, fontweight='bold')
    ax.set_title(f"{criterion['name']} Confusion Matrix", fontsize=12, fontweight='bold')

plt.suptitle('Score Agreement Confusion Matrices', fontsize=14, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

print("📊 Confusion matrices generated")

In [None]:
# Radar chart comparing mean scores
import matplotlib.pyplot as plt
import numpy as np

criteria_names = [CRITERIA[k]['name'] for k in CRITERIA.keys()]
autorater_means = [agreement_metrics['by_criterion'][k]['autorater_mean'] for k in CRITERIA.keys()]
human_means = [agreement_metrics['by_criterion'][k]['human_mean'] for k in CRITERIA.keys()]

# Number of criteria
num_criteria = len(criteria_names)
angles = np.linspace(0, 2 * np.pi, num_criteria, endpoint=False).tolist()
autorater_means += autorater_means[:1]  # Complete the circle
human_means += human_means[:1]
angles += angles[:1]

fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(projection='polar'))

ax.plot(angles, autorater_means, 'o-', linewidth=2, label='Autorater', color='blue')
ax.fill(angles, autorater_means, alpha=0.25, color='blue')

ax.plot(angles, human_means, 'o-', linewidth=2, label='Human', color='red')
ax.fill(angles, human_means, alpha=0.25, color='red')

ax.set_xticks(angles[:-1])
ax.set_xticklabels(criteria_names, fontsize=11)
ax.set_ylim(0, 5)
ax.set_yticks([1, 2, 3, 4, 5])
ax.set_yticklabels(['1', '2', '3', '4', '5'])
ax.set_title('Mean Score Comparison: Autorater vs Human', 
            fontsize=14, fontweight='bold', pad=20)
ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1), fontsize=11)
ax.grid(True)

plt.tight_layout()
plt.show()

print("📊 Radar chart generated")

In [None]:
# Bias heatmap by criterion and score level
fig, ax = plt.subplots(figsize=(12, 6))

# Calculate bias for each criterion at each score level
bias_matrix = []
for criterion_key in CRITERIA.keys():
    criterion_bias = []
    for score_level in range(1, 6):
        # Get cases where human scored this level
        auto_scores = [e['autorater_scores'][criterion_key]['score'] 
                      for e in evaluation_results 
                      if e['human_scores'][criterion_key] == score_level]
        
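        if auto_scores:
            mean_auto = np.mean(auto_scores)
            bias = mean_auto - score_level
        else:
            # No human scores at this level: leave the cell blank instead of
            # reporting a misleading zero bias
            bias = np.nan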
        
        criterion_bias.append(bias)
    
    bias_matrix.append(criterion_bias)

bias_matrix = np.array(bias_matrix)

sns.heatmap(bias_matrix, annot=True, fmt='.2f', cmap='RdBu_r', center=0,
           xticklabels=[f'Score {i}' for i in range(1, 6)],
           yticklabels=[CRITERIA[k]['name'] for k in CRITERIA.keys()],
           cbar_kws={'label': 'Autorater Bias (positive = over-scores)'},
           vmin=-1, vmax=1, ax=ax)

ax.set_xlabel('Human Score Level', fontsize=12, fontweight='bold')
ax.set_ylabel('Criterion', fontsize=12, fontweight='bold')
ax.set_title('Autorater Bias by Criterion and Human Score Level', 
            fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("📊 Bias heatmap generated")

---
## Part 8: Save Results & Validation

In [None]:
# Prepare results for saving
import time

results_output = {
    "metadata": {
        "mode": MODE,
        "num_responses": len(responses),
        "num_criteria": len(CRITERIA),
        "total_evaluations": len(responses) * len(CRITERIA),
        "model": "claude-3-5-sonnet-20241022",
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "total_cost_usd": round(total_cost, 2),
        "input_tokens": total_input_tokens,
        "output_tokens": total_output_tokens
    },
    "criteria": CRITERIA,
    "agreement_metrics": agreement_metrics,
    "evaluation_results": evaluation_results,
    "large_disagreements": large_disagreements
}

# Save to results directory
results_dir = Path("results")
results_dir.mkdir(exist_ok=True)

results_path = results_dir / "autorater_calibration_results.json"

with open(results_path, 'w') as f:
    json.dump(results_output, f, indent=2)

print(f"✅ Results saved to: {results_path}")
print(f"\n📊 FINAL SUMMARY")
print("="*80)
print(f"Mode:                    {MODE}")
print(f"Responses evaluated:     {len(responses)}")
print(f"Total evaluations:       {len(responses) * len(CRITERIA)}")
print(f"Total cost:              ${total_cost:.2f}")
print(f"\nAgreement Metrics:")
print(f"  Pearson correlation:   {agreement_metrics['overall']['pearson_correlation']:.3f}")
print(f"  Spearman correlation:  {agreement_metrics['overall']['spearman_correlation']:.3f}")
print(f"  Cohen's kappa:         {agreement_metrics['overall']['cohens_kappa']:.3f}")
print(f"  Mean absolute error:   {agreement_metrics['overall']['mean_absolute_error']:.3f}")
print(f"  Mean bias:             {agreement_metrics['overall']['mean_difference']:+.3f}")

In [None]:
# Validation checks
print("🔍 VALIDATION CHECKS")
print("="*80)

checks_passed = 0
checks_total = 0

# Check 1: Cost within budget
checks_total += 1
if MODE == "DEMO":
    cost_limit = 1.00
else:
    cost_limit = 3.00

if total_cost <= cost_limit:
    print(f"✅ Cost check passed: ${total_cost:.2f} <= ${cost_limit:.2f}")
    checks_passed += 1
else:
    print(f"❌ Cost check failed: ${total_cost:.2f} > ${cost_limit:.2f}")

# Check 2: All evaluations completed
checks_total += 1
expected_evals = len(responses) * len(CRITERIA)
# Count the scores actually recorded rather than assuming every criterion was scored
actual_evals = sum(len(e['autorater_scores']) for e in evaluation_results)

if actual_evals == expected_evals:
    print(f"✅ Evaluation completeness: {actual_evals}/{expected_evals}")
    checks_passed += 1
else:
    print(f"❌ Evaluation completeness: {actual_evals}/{expected_evals}")

# Check 3: Correlation is reasonable (>= 0.5)
checks_total += 1
correlation = agreement_metrics['overall']['pearson_correlation']
if correlation >= 0.5:
    print(f"✅ Correlation check: {correlation:.3f} >= 0.500")
    checks_passed += 1
else:
    print(f"⚠️  Correlation check: {correlation:.3f} < 0.500 (autorater may need calibration)")

# Check 4: Results file exists
checks_total += 1
if results_path.exists():
    print(f"✅ Results file created: {results_path}")
    checks_passed += 1
else:
    print(f"❌ Results file missing: {results_path}")

# Check 5: No NaN values in metrics
checks_total += 1
has_nan = any(np.isnan(v) if isinstance(v, (int, float)) else False 
             for v in agreement_metrics['overall'].values())

if not has_nan:
    print(f"✅ Data integrity check: No NaN values")
    checks_passed += 1
else:
    print(f"❌ Data integrity check: NaN values detected")

print("\n" + "="*80)
print(f"🎯 Validation: {checks_passed}/{checks_total} checks passed")

if checks_passed == checks_total:
    print("✅ ALL VALIDATION CHECKS PASSED - Tutorial complete!")
else:
    print(f"⚠️  {checks_total - checks_passed} validation check(s) failed")

---
## Key Takeaways

### What We Learned

1. **Autorater Design**: Effective autoraters require clear rubrics, detailed prompts, and multi-dimensional evaluation

2. **Calibration Process**: 
   - Collect human annotations on diverse examples
   - Calculate agreement metrics (correlation, kappa, MAE)
   - Identify systematic biases
   - Iterate on prompts to improve alignment

3. **Agreement Metrics Interpretation**:
   - **Pearson r > 0.7**: Good correlation
   - **Cohen's kappa > 0.6**: Substantial agreement
   - **MAE < 0.5**: Scores are close on average

4. **Common Biases**:
   - Over-scoring or under-scoring specific criteria
   - Difficulty with edge cases or ambiguous responses
   - Inconsistency across different query types

5. **Production Recommendations**:
   - Start with DEMO mode for rapid iteration
   - Use FULL mode for final calibration
   - Re-calibrate periodically with new human annotations
   - Monitor autorater performance in production
   - Use human-in-the-loop (HITL) evaluation for critical decisions
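
The rule-of-thumb thresholds listed under Agreement Metrics Interpretation can be turned into a quick check. This is a minimal sketch: the dictionary keys mirror `agreement_metrics['overall']` from Part 6, while the function name and the example values are assumptions made for illustration.

```python
from typing import Dict


def interpret_agreement(overall: Dict[str, float]) -> Dict[str, bool]:
    """Apply the tutorial's rule-of-thumb thresholds to overall agreement metrics."""
    return {
        "good_correlation": overall["pearson_correlation"] > 0.7,
        "substantial_agreement": overall["cohens_kappa"] > 0.6,
        "scores_close_on_average": overall["mean_absolute_error"] < 0.5,
    }


# Made-up metric values, for illustration only
example = {"pearson_correlation": 0.78, "cohens_kappa": 0.52, "mean_absolute_error": 0.41}
for check, passed in interpret_agreement(example).items():
    print(f"{check}: {'yes' if passed else 'no'}")
```

In this example the correlation and MAE look fine while kappa falls short. That pattern can occur when scores track each other closely but often differ by one point.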

### Next Steps

- **Tutorial**: Complete `benchmark_evaluation.ipynb` (Task 4.12)
- **Concept**: Read `human_in_the_loop_evaluation.md` for HITL workflows
- **Practice**: Build custom autoraters for your domain
- **Integration**: Use `backend/trajectory_evaluation.py` for full agent evaluation pipeline

---

## Common Pitfalls

1. **Insufficient Human Annotations**: Aim for 50 or more examples for reliable calibration
2. **Vague Rubrics**: Autoraters perform poorly without clear, detailed scoring criteria
3. **Temperature > 0**: Use temperature=0 to keep evaluations as consistent as possible (outputs may still vary slightly between runs)
4. **Ignoring Bias**: Systematic over- or under-scoring can invalidate results
5. **No Re-calibration**: Autorater performance drifts over time, so re-calibrate regularly
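
To see why a small annotation set (pitfall 1) makes calibration unreliable, it helps to put a confidence interval around the correlation. The sketch below uses a percentile bootstrap in plain Python. The helper names and the ten score pairs are made up for illustration, and the bootstrap is a technique swapped in here, not part of the tutorial's own pipeline.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Pearson correlation, or None when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def bootstrap_correlation_ci(human: Sequence[float], auto: Sequence[float],
                             n_resamples: int = 2000, alpha: float = 0.05,
                             seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the human/autorater correlation."""
    rng = random.Random(seed)
    n = len(human)
    estimates: List[float] = []
    for _ in range(n_resamples):
        # Resample (human, autorater) pairs with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([human[i] for i in idx], [auto[i] for i in idx])
        if r is not None:  # skip resamples where the correlation is undefined
            estimates.append(r)
    if not estimates:
        raise ValueError("Correlation was undefined in every resample")
    estimates.sort()
    lower = estimates[int(alpha / 2 * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


# Ten made-up score pairs, roughly the size of a DEMO-mode run
human_scores = [1, 2, 2, 3, 3, 4, 4, 5, 5, 3]
auto_scores = [2, 2, 3, 3, 4, 4, 5, 5, 4, 2]
low, high = bootstrap_correlation_ci(human_scores, auto_scores)
print(f"point estimate r = {pearson(human_scores, auto_scores):.2f}")
print(f"95% bootstrap interval: [{low:.2f}, {high:.2f}]")
```

A wide interval is a sign that the point estimate alone should not drive decisions about the autorater.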

---

## FAQ

**Q: How many human annotations do I need?**
A: Minimum 30 for initial calibration, 100+ for production use. More diverse examples = better calibration.

**Q: What if correlation is low (<0.5)?**
A: Refine your rubrics, add more detailed examples in prompts, or consider multi-annotator consensus for ground truth.
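
One simple way to build a multi-annotator consensus is to take the median score per criterion and flag items where annotators disagree widely. This is a sketch under assumptions: the tutorial's `human_annotations.json` stores a single annotator per response, so the list-of-scores input, the `consensus_score` name, and the spread threshold are all hypothetical.

```python
from statistics import median
from typing import Sequence, Tuple


def consensus_score(scores: Sequence[int], max_spread: int = 1) -> Tuple[float, bool]:
    """Return (median score, needs_review) for one response on one criterion."""
    spread = max(scores) - min(scores)
    # Flag the item when annotators differ by more than max_spread points
    return median(scores), spread > max_spread


print(consensus_score([4, 4, 5]))  # (4, False)
print(consensus_score([2, 4, 5]))  # (4, True)
```

With an even number of annotators the median can fall between two scores (for example 3.5), so the result may need rounding before it is used with metrics that expect integer categories, such as Cohen's kappa.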

**Q: Should I use autoraters or human evaluation?**
A: Use both! Autoraters for rapid iteration and scale, humans for critical decisions and calibration.

**Q: How often should I re-calibrate?**
A: Monthly for active projects, or whenever you detect performance drift (correlation drops >0.1).
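
The drift rule in this answer can be written as a one-line check. The function name and the example correlations below are illustrative assumptions. The 0.1 threshold is the one stated above.

```python
def needs_recalibration(baseline_r: float, current_r: float, max_drop: float = 0.1) -> bool:
    """True when correlation has fallen by more than max_drop since the last calibration."""
    return (baseline_r - current_r) > max_drop


print(needs_recalibration(0.82, 0.68))  # True: correlation fell by 0.14
print(needs_recalibration(0.82, 0.78))  # False: a drop of 0.04 is within tolerance
```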

**Q: Can I use this for multi-turn agent conversations?**
A: Yes! Adapt the rubrics to evaluate full trajectories instead of single responses. See `trajectory_evaluation_tutorial.ipynb`.