# Evaluating LLM-as-Judge: Essential Metrics for AI Quality Assessment

## Overview

When deploying AI assistants at scale, manual quality review becomes impractical. Organizations increasingly rely on **autograders**—LLMs that evaluate other LLM outputs—to maintain quality standards.

This notebook introduces the core statistical metrics used to validate autograder reliability. These metrics originate from psychometrics and inter-rater reliability research, adapted here for AI evaluation contexts.

**Metrics covered:**

| Metric | Origin | Primary Use |
|--------|--------|-------------|
| Cohen's Kappa | Jacob Cohen, 1960 | Agreement beyond chance |
| Kendall's Tau | Maurice Kendall, 1938 | Rank correlation |
| Spearman's Rho | Charles Spearman, 1904 | Rank correlation |
| Pearson's R | Karl Pearson, 1896 | Linear correlation |
| Mean Bias | Descriptive statistics | Systematic error detection |

### Practical Scenario Guide

| Scenario | Use This Metric | Why | Avoid |
|----------|-----------------|-----|-------|
| **Validate autograder for deployment** | Quadratic Kappa | Measures true agreement accounting for chance; best for quality gates | Pearson (detects correlation, not bias) |
| **A/B test: Which model is better?** | Kendall's Tau | Ranks responses correctly; detects ordering agreement | Kappa (exact scores matter less here) |
| **Check systematic over/under-scoring** | Mean Bias | Finds leniency/harshness; essential for threshold calibration | Pearson (won't detect +1 systematic bias) |
| **Model comparison with continuous scores** | Spearman's Rho | Robust to outliers, ordinal-friendly; easier to interpret | Pearson (assumes normality) |
| **Ordinal categories (Poor/Good/Excellent)** | Quadratic Kappa or Spearman | Both handle partial credit for near-misses | Pearson (violates assumptions) |
| **Quick ranking check** | Kendall's Tau | Intuitive (concordant/discordant pairs); theoretically sound | Pearson (misses non-linear ordering) |
| **Confirm no hidden bias before production** | Mean Bias + Kappa | Together catch both accuracy and calibration issues | Pearson or Tau alone (incomplete view) |

---
## When to Run This Validation

Run autograder validation at these checkpoints:

| Trigger | Purpose |
|---------|--------|
| **Launch readiness** | Confirm autograder meets production thresholds before deployment |
| **Post-prompt change** | Regression check after modifying autograder prompts or rubrics |
| **Model upgrade** | Verify consistency when underlying LLM is updated |
| **Safety red-team refresh** | Revalidate after adversarial testing reveals new failure modes |
| **Quarterly audit** | Detect drift in autograder behavior over time |

---
## Setup

In [None]:
import numpy as np
import json
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from scipy.stats import kendalltau, spearmanr, pearsonr

---
## Rubric and Evaluation Prompt

Autograder reliability depends heavily on rubric clarity and prompt design. Small rubric changes can significantly alter score distributions and agreement metrics.

### Rubric (Human and Autograder)

Both human evaluators and the autograder use this rubric for the query: **"How do I reset my password?"**

| Score | Label | Criteria |
|-------|-------|----------|
| **5** | Complete | Provides specific navigation path, all required steps, and confirmation of completion |
| **4** | Good | Correct approach with minor omissions (e.g., missing final "Save" step or specific menu location) |
| **3** | Partial | Directionally correct but vague (e.g., "check settings" without specific path) |
| **2** | Incomplete | States the obvious or provides irrelevant guidance (e.g., "you need to change your password") |
| **1** | Failure | Incorrect, refuses valid request, hallucinated UI, or unnecessary escalation |

### Autograder Evaluation Prompt

```
You are evaluating an AI assistant's response to a user question.

User question: "How do I reset my password?"

AI response: {response}

Rate the response quality on a 1-5 scale using these criteria:
- 5: Complete, specific navigation path with all steps and confirmation
- 4: Correct approach, minor omissions (missing final step or specific location)
- 3: Directionally correct but vague (no specific navigation path)
- 2: Incomplete or irrelevant (states obvious without guidance)
- 1: Incorrect, refusal, hallucination, or unnecessary escalation

Output only the numeric score (1-5).
```
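One way to wire the prompt above into code is sketched below. The helper names `build_prompt` and `parse_score` are hypothetical, the model call itself is left out, and the strict parsing rule (accept only a lone digit from 1 to 5, reject everything else) is an assumption about how off-format output should be treated.

```python
import re

# Template text copied from the evaluation prompt above.
PROMPT_TEMPLATE = """You are evaluating an AI assistant's response to a user question.

User question: "How do I reset my password?"

AI response: {response}

Rate the response quality on a 1-5 scale using these criteria:
- 5: Complete, specific navigation path with all steps and confirmation
- 4: Correct approach, minor omissions (missing final step or specific location)
- 3: Directionally correct but vague (no specific navigation path)
- 2: Incomplete or irrelevant (states obvious without guidance)
- 1: Incorrect, refusal, hallucination, or unnecessary escalation

Output only the numeric score (1-5)."""


def build_prompt(response):
    """Fill the template with the response under evaluation (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(response=response)


def parse_score(raw_output):
    """Return the score as an int, or None if the output is not a lone digit 1-5."""
    match = re.fullmatch(r"\s*([1-5])\s*", raw_output)
    return int(match.group(1)) if match else None


print(parse_score(" 4\n"))       # well-formed output
print(parse_score("Score: 4"))   # off-format output is rejected, not guessed at
```

Returning `None` for off-format output keeps parsing failures visible, so they can be counted separately instead of being silently folded into the agreement metrics.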

### Why Rubrics Matter

Low Kappa can indicate:
1. **Poor autograder alignment** — The model misunderstands scoring criteria
2. **Rubric ambiguity** — Multiple reasonable interpretations exist
3. **Rubric drift** — Human evaluators apply different standards over time

Before attributing low agreement to autograder failure, measure **human-human Kappa** to establish an upper bound.

---
## Evaluation Dataset

The following dataset represents 20 AI assistant responses to the query: **"How do I reset my password?"**

Each response was scored by a human evaluator and an autograder on a 1-5 scale.

### Response Quality Fragmentation

In production, response quality fragments across several dimensions:

| Fragmentation Type | Cause | Example |
|-------------------|-------|----------|
| **Completeness** | Model stops early or omits steps | Missing "click Save" at end |
| **Specificity** | Generic vs. actionable guidance | "Check settings" vs. "Go to Settings > Security" |
| **Confidence calibration** | Hedging when unnecessary | "I think you might try..." |
| **Hallucination** | Fabricated UI elements or steps | References non-existent menu |
| **Context loss** | Fails to use conversation history | Ignores user already tried step 1 |

### Sample Size Caveat

**Note:** This dataset uses N=20 for illustration. At this sample size, confidence intervals for Kappa and Tau are wide (typically ±0.15–0.20). These examples are illustrative, not production-ready. See the Sample Size Guidelines section for production minimums.
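To see how wide the interval is for a given sample, a percentile bootstrap is one option. The sketch below is dependency-free and makes several assumptions: the function names are hypothetical, the statistic is unweighted Kappa, the interval is a plain percentile interval, and the two score lists are copied from this notebook's dataset cell.

```python
import random

# Scores copied from this notebook's dataset cell.
human = [5, 5, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 4, 2, 4]
auto = [5, 5, 4, 4, 3, 3, 3, 4, 3, 2, 2, 3, 2, 2, 1, 1, 2, 3, 3, 4]


def unweighted_kappa(a, b):
    """Cohen's Kappa without weights; returns None when it is undefined."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    p_e = sum((a.count(k) / n) * (b.count(k) / n) for k in labels)
    if p_e == 1.0:  # both raters used a single identical label
        return None
    return (p_o - p_e) / (1 - p_e)


def bootstrap_ci(a, b, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample items with replacement."""
    rng = random.Random(seed)
    n = len(a)
    values = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        value = stat([a[i] for i in idx], [b[i] for i in idx])
        if value is not None:  # skip degenerate resamples
            values.append(value)
    values.sort()
    low = values[int((alpha / 2) * len(values))]
    high = values[int((1 - alpha / 2) * len(values)) - 1]
    return low, high


point = unweighted_kappa(human, auto)
low, high = bootstrap_ci(human, auto, unweighted_kappa)
print(f"Kappa: {point:.3f}  (95% bootstrap interval: {low:.3f} to {high:.3f})")
```

Resampling whole items (keeping each human and autograder score paired) preserves the dependence between the two raters, which is what the interval is meant to reflect.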

In [None]:
# Production-realistic responses with quality fragmentation
responses = [
    # High quality (5) - Complete, specific, actionable
    "Go to Settings > Security > Password. Click 'Change Password', enter your current password, then your new password twice. Click Save.",
    "Navigate to your account settings at account.example.com/security. Select 'Reset Password' and follow the email verification steps.",
    
    # Good quality (4) - Correct but minor omissions
    "Go to Settings, then Security, and select Change Password. You'll need your current password.",  # Missing: where to find Settings
    "Click your profile icon, go to Security settings, and there's a password reset option.",  # Missing: confirmation step
    "You can reset it from the Security page in Settings. There's a Change Password button.",  # Missing: specific navigation
    
    # Partial quality (3) - Vague but directionally correct  
    "There should be an option in your account settings somewhere.",  # Fragmentation: lacks specificity
    "Try looking in Settings for security options.",  # Fragmentation: no concrete steps
    "Usually you can find password options in your profile or settings area.",  # Fragmentation: hedging
    "Check the settings menu, I believe there's a security section.",  # Fragmentation: uncertain language
    
    # Poor quality (2) - Incomplete or unhelpful
    "You need to change your password.",  # Fragmentation: states obvious, no guidance
    "Go to settings.",  # Fragmentation: incomplete instruction
    "I'd recommend updating your password regularly.",  # Fragmentation: doesn't answer question
    "Have you tried the forgot password link?",  # Fragmentation: wrong approach for logged-in user
    
    # Failure (1) - Incorrect, harmful, or non-responsive
    "I don't have access to your account information.",  # Fragmentation: refuses valid request
    "Contact support at 1-800-XXX-XXXX.",  # Fragmentation: unnecessary escalation
    "Your password is stored securely and cannot be viewed.",  # Fragmentation: misunderstands request
    "Click on the Admin Console and modify user credentials.",  # Fragmentation: hallucinated UI
    
    # Edge cases - Where autograders often struggle
    "Settings > Security > Password. Enter current, then new password.",  # Terse but correct
    "I can help you reset your password! First, you'll want to...",  # Friendly but incomplete
    "For security reasons, password changes require email verification after updating in Settings > Security.",  # Correct but complex
]

# Metadata for stratified analysis
surfaces = np.array(["web", "web", "mobile", "mobile", "web", "web", "mobile", "mobile", "web", 
                     "admin", "admin", "mobile", "web", "admin", "web", "mobile", "admin",
                     "mobile", "web", "web"])

risk_levels = np.array(["low", "low", "low", "low", "low", "low", "low", "low", "low",
                        "medium", "medium", "medium", "medium", "high", "high", "high", "high",
                        "low", "medium", "low"])

# Human evaluator scores
human_scores = np.array([5, 5, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 4, 2, 4])

# Autograder scores (realistic disagreement patterns)
autograder = np.array([5, 5, 4, 4, 3, 3, 3, 4, 3, 2, 2, 3, 2, 2, 1, 1, 2, 3, 3, 4])

print("Sample responses with scores:")
print("=" * 80)
for i in [0, 5, 9, 13, 17]:  # Show one from each quality tier
    print(f"\nResponse: \"{responses[i][:70]}{'...' if len(responses[i]) > 70 else ''}\"")
    print(f"Human: {human_scores[i]}  |  Autograder: {autograder[i]}  |  {'Match' if human_scores[i] == autograder[i] else 'Disagree'}")
    print(f"Surface: {surfaces[i]}  |  Risk: {risk_levels[i]}")

---
# Agreement Metrics

Agreement metrics quantify how often two raters assign the same (or similar) scores to the same items.

---
## Cohen's Kappa (κ)

**Origin:** Introduced by psychologist Jacob Cohen in 1960 to measure inter-rater reliability while accounting for chance agreement.

**Formula:**
$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

Where $P_o$ is observed agreement and $P_e$ is expected agreement by chance.

**Why it matters for AI evaluation:**

Raw percent agreement inflates reliability when score distributions are imbalanced. If 90% of AI responses are high-quality, an autograder that always outputs "5" achieves 90% agreement while providing zero discriminative value.

Kappa corrects for this by measuring agreement *beyond* what random chance would produce.
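The "always outputs 5" case can be worked through directly from the formula. This is a minimal, dependency-free sketch; the function name `cohen_kappa` and the made-up 100-item dataset are illustrative assumptions, not part of the notebook's data.

```python
def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa: (P_o - P_e) / (1 - P_e)."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    p_e = sum((rater_a.count(k) / n) * (rater_b.count(k) / n) for k in labels)
    if p_e == 1.0:
        return float("nan")  # undefined when both raters use one identical label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: 90% of responses deserve a 5, and the autograder always says 5.
human = [5] * 90 + [1] * 10
always_five = [5] * 100

agreement = sum(h == a for h, a in zip(human, always_five)) / len(human)
print(f"Percent agreement: {agreement:.2f}")                           # 0.90
print(f"Cohen's kappa:     {cohen_kappa(human, always_five):.2f}")     # 0.00
```

Here $P_o = 0.90$ and $P_e = 0.90 \times 1.0 + 0.10 \times 0.0 = 0.90$, so the numerator is zero: all of the observed agreement is what chance alone would produce.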

**Quadratic weighting:** For ordinal scales (1-5), quadratic-weighted Kappa applies partial credit—a 4 vs. 5 disagreement is penalized less than a 1 vs. 5 disagreement.
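The partial-credit idea can be made concrete by building the weight matrix by hand. The sketch below assumes the common definition of quadratic disagreement weights, $w_{ij} = (i - j)^2 / (k - 1)^2$ for $k$ categories, and the function names are hypothetical. It is intended to mirror what a quadratic-weighted Kappa computes, not to replace a library implementation.

```python
def quadratic_weights(num_categories):
    """Disagreement weights: 0 on the diagonal, 1 for the largest possible gap."""
    span = (num_categories - 1) ** 2
    return [[(i - j) ** 2 / span for j in range(num_categories)]
            for i in range(num_categories)]


def quadratic_weighted_kappa(rater_a, rater_b, min_score=1, max_score=5):
    """1 - (weighted observed disagreement) / (weighted chance disagreement)."""
    k = max_score - min_score + 1
    n = len(rater_a)
    w = quadratic_weights(k)
    # Observed joint proportions for each (score_a, score_b) cell.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    disagreement = sum(w[i][j] * observed[i][j] for i in range(k) for j in range(k))
    chance = sum(w[i][j] * row[i] * col[j] for i in range(k) for j in range(k))
    if chance == 0:
        return float("nan")  # undefined when both raters use one identical score
    return 1 - disagreement / chance


w = quadratic_weights(5)
print(f"Penalty for 4 vs. 5: {w[3][4]}")  # 0.0625
print(f"Penalty for 1 vs. 5: {w[0][4]}")  # 1.0

reference = [1, 2, 3, 4, 5]
near_miss = [1, 2, 3, 5, 5]  # one item off by 1
far_miss = [1, 2, 3, 4, 1]   # one item off by 4
print(quadratic_weighted_kappa(reference, near_miss))
print(quadratic_weighted_kappa(reference, far_miss))
```

Both comparison lists contain exactly one disagreement, so unweighted agreement treats them identically; the quadratic weights are what separate the near miss from the far miss.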

In [None]:
kappa_unweighted = cohen_kappa_score(human_scores, autograder)
kappa_linear = cohen_kappa_score(human_scores, autograder, weights='linear')
kappa_quadratic = cohen_kappa_score(human_scores, autograder, weights='quadratic')

print("COHEN'S KAPPA")
print("=" * 50)
print(f"Unweighted:          {kappa_unweighted:.3f}")
print(f"Linear weighted:     {kappa_linear:.3f}")
print(f"Quadratic weighted:  {kappa_quadratic:.3f}  <-- Recommended for ordinal scales")
print()
print("Interpretation (Landis & Koch, 1977):")
print("  < 0.00  Poor")
print("  0.00-0.20  Slight")
print("  0.21-0.40  Fair")
print("  0.41-0.60  Moderate")
print("  0.61-0.80  Substantial")
print("  0.81-1.00  Almost perfect")
print()
print("NOTE: These bands are widely used heuristics but may be too lenient")
print("      for high-stakes AI safety applications. Adjust to domain risk.")
print()
print(f"Assessment: {'Acceptable for production' if kappa_quadratic >= 0.70 else 'Requires improvement'}")
print()
print("(Reminder: These numbers are from a toy N=20 demo, not production-ready.)")

### Human-Human Baseline

An autograder cannot be expected to exceed human-human agreement. Before diagnosing autograder problems, measure agreement between multiple human annotators.

**Example interpretation:**
- Human-human Kappa: 0.82
- Autograder-human Kappa: 0.78
- **Conclusion:** Autograder performs at roughly human level (0.78 / 0.82 ≈ 95% of human ceiling)

If human-human Kappa is low (e.g., 0.55), the problem is likely **rubric ambiguity**, not autograder quality.

In [None]:
# Simulated second human annotator (for demonstration)
human2_scores = np.array([5, 5, 4, 4, 3, 3, 3, 3, 3, 2, 2, 2, 3, 1, 1, 1, 1, 4, 2, 4])

human_human_kappa = cohen_kappa_score(human_scores, human2_scores, weights='quadratic')
auto_human_kappa = cohen_kappa_score(human_scores, autograder, weights='quadratic')

print("HUMAN-HUMAN BASELINE")
print("=" * 50)
print(f"Human-Human Kappa:    {human_human_kappa:.3f}  (upper bound)")
print(f"Autograder-Human:     {auto_human_kappa:.3f}")
print(f"Ratio:                {(auto_human_kappa/human_human_kappa)*100:.1f}% of human ceiling")
print()
if auto_human_kappa >= 0.95 * human_human_kappa:
    print("Assessment: Autograder performs at human-level")
elif auto_human_kappa >= 0.85 * human_human_kappa:
    print("Assessment: Autograder performs near human-level")
else:
    print("Assessment: Significant gap from human-level; investigate autograder prompt/rubric")

### When to use Cohen's Kappa

**If you only remember one thing:** Kappa tells you whether your autograder and humans actually *agree*, not just whether their scores move together. High correlation with systematic bias still yields low Kappa.

| Appropriate | Not appropriate |
|-------------|------------------|
| Comparing two raters | Three or more raters (use Fleiss' Kappa) |
| Categorical or ordinal data | Continuous measurements |
| Balanced importance of all categories | When certain errors are catastrophic |

### Limitations

- **Kappa paradox:** Can yield low values even with high agreement when prevalence is extreme or marginal distributions differ substantially between raters (Feinstein & Cicchetti, 1990)
- Assumes both raters have similar marginal distributions
- Does not indicate *direction* of disagreement
- Sensitive to number of categories
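The Kappa paradox is easiest to see with numbers. The sketch below uses a made-up 2x2 pass/fail table (an assumption for illustration, not data from this notebook) in which the raters agree on 90% of items, yet Kappa comes out slightly negative because nearly everything is a "pass".

```python
# Hypothetical counts keyed by (rater_a_label, rater_b_label).
table = {
    ("pass", "pass"): 90,
    ("pass", "fail"): 5,
    ("fail", "pass"): 5,
    ("fail", "fail"): 0,
}
n = sum(table.values())

# Observed agreement: both cells on the diagonal.
p_o = (table[("pass", "pass")] + table[("fail", "fail")]) / n

# Marginal "pass" rates for each rater.
a_pass = (table[("pass", "pass")] + table[("pass", "fail")]) / n
b_pass = (table[("pass", "pass")] + table[("fail", "pass")]) / n

# Chance agreement from the marginals.
p_e = a_pass * b_pass + (1 - a_pass) * (1 - b_pass)

kappa = (p_o - p_e) / (1 - p_e)
print(f"Observed agreement: {p_o:.2f}")   # 0.90
print(f"Chance agreement:   {p_e:.3f}")   # 0.905
print(f"Kappa:              {kappa:.3f}") # -0.053
```

With 95% of each rater's labels being "pass", chance agreement is already 0.905, so 90% observed agreement leaves nothing beyond chance. Reporting raw agreement and the marginal distributions alongside Kappa makes this situation easy to spot.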

---
# Correlation Metrics

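Correlation metrics measure the strength and direction of association between two variables. For autograder validation, they answer: **"When human scores go up, do autograder scores go up as well?"** Rank correlations (Kendall, Spearman) only ask whether the ordering is preserved; Pearson additionally asks whether the relationship is linear.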

---
## Kendall's Tau (τ)

**Origin:** Developed by British statistician Maurice Kendall in 1938 as a non-parametric measure of rank correlation.

**Concept:** Counts concordant and discordant pairs. For any two responses:
- **Concordant:** Both raters agree on which is better
- **Discordant:** Raters disagree on which is better

$$\tau_b = \frac{(\text{concordant pairs}) - (\text{discordant pairs})}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}$$

Where $n_0 = n(n-1)/2$ is the total number of pairs, $n_1$ is the number of pairs tied in the first variable, and $n_2$ is the number of pairs tied in the second variable.

Note: Scipy's `kendalltau` computes Tau-b, which corrects for ties. The simpler Tau-a (without tie correction) is rarely used in practice.
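The pair counting behind the formula can be written out directly. This is a naive O(n²) sketch for illustration, with a hypothetical function name; it follows the Tau-b definition given above and is not how an optimized library computes it.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Tau-b by brute-force pair counting, with the tie correction from the formula."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1  # pair tied in the first variable (counts toward n_1)
        if dy == 0:
            ties_y += 1  # pair tied in the second variable (counts toward n_2)
        if dx * dy > 0:
            concordant += 1  # both raters order the pair the same way
        elif dx * dy < 0:
            discordant += 1  # raters order the pair in opposite ways
    n0 = len(x) * (len(x) - 1) // 2
    denominator = sqrt((n0 - ties_x) * (n0 - ties_y))
    if denominator == 0:
        return float("nan")  # undefined when one variable is constant
    return (concordant - discordant) / denominator


# No ties: 5 concordant pairs, 1 discordant pair, 6 pairs total.
print(kendall_tau_b([1, 2, 3, 4], [1, 3, 2, 4]))  # (5 - 1) / 6

# One pair tied in x: 2 concordant, 0 discordant, denominator sqrt(2 * 3).
print(kendall_tau_b([1, 1, 2], [1, 2, 3]))
```

A pair tied in both variables increments both tie counters and is neither concordant nor discordant, which matches how $n_1$ and $n_2$ are defined.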

**Why it matters for AI evaluation:**

In A/B testing or response ranking, exact scores matter less than relative ordering. Kendall's Tau validates whether an autograder can reliably identify which of two responses is superior—critical for model comparison and response selection.

In [None]:
tau, tau_pvalue = kendalltau(human_scores, autograder)

print("KENDALL'S TAU")
print("=" * 50)
print(f"Tau coefficient:     {tau:.3f}")
print(f"P-value:             {tau_pvalue:.4f}")
print()
print("Interpretation (heuristic bands, adjust to domain):")
print("  0.0-0.3   Weak correlation")
print("  0.3-0.6   Moderate correlation")
print("  0.6-0.8   Strong correlation")
print("  0.8-1.0   Very strong correlation")
print()
print(f"Assessment: {'Reliable for ranking tasks' if tau >= 0.60 else 'Ranking reliability insufficient'}")
print()
print("(Reminder: These numbers are from a toy N=20 demo, not production-ready.)")

### Practical example: Concordant vs. discordant pairs

In [None]:
print("Pairwise comparison example:")
print("=" * 60)
print()

# Compare responses 0 (high quality) and 9 (poor quality)
i, j = 0, 9
print(f"Response A: \"{responses[i][:50]}...\"")
print(f"Response B: \"{responses[j][:50]}...\"")
print()
print(f"Human scores:      A={human_scores[i]}, B={human_scores[j]} --> A is better")
print(f"Autograder scores: A={autograder[i]}, B={autograder[j]} --> A is better")
print()
print("Result: CONCORDANT (both agree A > B)")
print()

# Find a discordant pair
print("-" * 60)
i, j = 4, 7  # Response 4 (human=4, auto=3) vs Response 7 (human=3, auto=4)
print(f"Response A: \"{responses[i][:50]}...\"")
print(f"Response B: \"{responses[j][:50]}...\"")
print()
print(f"Human scores:      A={human_scores[i]}, B={human_scores[j]} --> A is better")
print(f"Autograder scores: A={autograder[i]}, B={autograder[j]} --> B is better")
print()
print("Result: DISCORDANT (human says A > B, autograder says B > A)")

### When to use Kendall's Tau

**If you only remember one thing:** Tau validates ranking—whether the autograder correctly identifies which response is *better*, regardless of exact scores. Essential for A/B testing and model comparison.

| Appropriate | Not appropriate |
|-------------|------------------|
| Ordinal data | Continuous measurements requiring exact correlation |
| Ranking or comparison tasks | When absolute score accuracy matters |
| Small sample sizes | Very large datasets if only a naive O(n²) implementation is available |
| Data with many ties | — |

### Limitations

- More conservative than Spearman (typically yields lower absolute values)
- O(n²) computational complexity for naive implementation (O(n log n) with optimized algorithms)
- Does not detect non-monotonic relationships
- With many ties, confidence intervals widen

---
## Spearman's Rho (ρ)

**Origin:** Developed by psychologist Charles Spearman in 1904 while studying intelligence testing.

**Concept:** Converts scores to ranks, then computes Pearson correlation on the ranks. Measures monotonic relationship strength.
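The rank-then-correlate idea can be sketched in a few lines. The function names below are hypothetical, tied values receive the average of the ranks they span (the usual convention, assumed here), and the sketch assumes neither input is constant.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman_rho(x, y):
    """Spearman's Rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]

# A monotonic but non-linear relationship: the ordering is identical.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman_rho(x, y):.3f}")
```

Because `y` preserves the ordering of `x` exactly, the ranks are identical and Spearman reaches 1, while Pearson on the raw values falls short of 1 because the relationship is not linear.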

**Relationship to Kendall's Tau:**
- Both measure rank correlation
- Both are robust to outliers (unlike Pearson)
- Spearman typically yields higher absolute values than Tau
- Tau has a more intuitive interpretation (with no ties, the proportion of concordant pairs minus the proportion of discordant pairs)
- Tau is preferred for small samples; Spearman for larger datasets

In [None]:
rho, rho_pvalue = spearmanr(human_scores, autograder)

print("SPEARMAN'S RHO")
print("=" * 50)
print(f"Rho coefficient:     {rho:.3f}")
print(f"P-value:             {rho_pvalue:.4f}")
print()
print("Interpretation:")
print("  0.0-0.3   Weak correlation")
print("  0.3-0.6   Moderate correlation")
print("  0.6-0.8   Strong correlation")
print("  0.8-1.0   Very strong correlation")
print()
print(f"Assessment: {'Acceptable' if rho >= 0.70 else 'Below threshold'}")

### When to use Spearman's Rho

| Appropriate | Not appropriate |
|-------------|------------------|
| Ordinal or non-normal continuous data | Strictly categorical data |
| Detecting monotonic relationships | Non-monotonic relationships |
| When outlier resistance is needed | When Pearson assumptions are met |

### Tau vs. Rho: Which to report?

Report both. Academic conventions vary by field:
- Psychology/Education: Often prefer Spearman
- NLP/AI research: Increasingly favor Kendall's Tau

---
## Pearson's R

**Origin:** Formalized by Karl Pearson in 1896, building on work by Francis Galton.

**Concept:** Measures linear correlation between two continuous variables.

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

**Caution for AI evaluation:**

Pearson assumes:
1. Linear relationship
2. Continuous, normally distributed variables
3. Homoscedasticity (constant variance)

Likert-scale ratings (1-5) technically violate these assumptions. While Pearson is reasonably robust to mild violations, it critically fails to detect systematic bias—an autograder consistently scoring +1 higher will show r ≈ 1.0 despite poor agreement.

In [None]:
r, r_pvalue = pearsonr(human_scores, autograder)

print("PEARSON'S R")
print("=" * 50)
print(f"R coefficient:       {r:.3f}")
print(f"P-value:             {r_pvalue:.4f}")
print()
print("Interpretation:")
print("  0.0-0.3   Weak correlation")
print("  0.3-0.6   Moderate correlation")
print("  0.6-0.8   Strong correlation")
print("  0.8-1.0   Very strong correlation")
print()
print("Note: Use with caution for ordinal rating scales.")
print("      Spearman or Kendall are generally preferred.")

### Pearson's R: A cautionary example

Pearson can be misleading when systematic bias exists:

In [None]:
# Autograder that scores 1 point higher than the human, capped at the scale maximum of 5
biased_autograder = np.clip(human_scores + 1, 1, 5)

r_biased, _ = pearsonr(human_scores, biased_autograder)
kappa_biased = cohen_kappa_score(human_scores, biased_autograder, weights='quadratic')

print("Systematically biased autograder (+1 to all scores, capped at 5):")
print(f"  Pearson R:         {r_biased:.3f}  <-- Looks excellent!")
print(f"  Quadratic Kappa:   {kappa_biased:.3f}  <-- Reveals the problem")
print()
print("Pearson only measures correlation, not agreement.")
print("A biased autograder can have perfect correlation with zero accuracy.")

### When to use Pearson's R

| Appropriate | Not appropriate |
|-------------|------------------|
| Continuous, normally distributed data | Ordinal rating scales |
| Checking linear relationship | Data with outliers |
| As supplementary metric | As primary agreement metric |

---
# Bias Metrics

Bias metrics detect systematic over- or under-scoring by the autograder.

---
## Mean Bias

**Concept:** Average difference between autograder and human scores.

$$\text{Bias} = \frac{1}{n}\sum_{i=1}^{n}(\text{autograder}_i - \text{human}_i)$$

**If you only remember one thing:** Bias catches systematic over/under-scoring that correlation metrics miss entirely. An autograder scoring +1 on everything has perfect correlation but terrible bias.

**Interpretation:**
- **Positive bias:** Autograder is lenient (scores higher than humans)
- **Negative bias:** Autograder is harsh (scores lower than humans)

**Operational impact:**

| Bias | Risk |
|------|------|
| +0.5 or higher | Low-quality responses pass quality gates |
| -0.5 or lower | High-quality responses flagged unnecessarily |

In [None]:
bias = np.mean(autograder - human_scores)
mae = np.mean(np.abs(autograder - human_scores))

print("BIAS METRICS")
print("=" * 50)
print(f"Mean Bias:           {bias:+.3f}")
print(f"Mean Absolute Error: {mae:.3f}")
print()
print("Interpretation:")
print("  Bias within ±0.30:  Acceptable")
print("  MAE below 0.50:     Good accuracy")
print()
if abs(bias) <= 0.3:
    print("Assessment: No significant systematic bias detected")
elif bias > 0.3:
    print("Assessment: Autograder is systematically LENIENT")
    print("            Risk: Low-quality responses may pass quality gates")
else:
    print("Assessment: Autograder is systematically HARSH")
    print("            Risk: High-quality responses may be flagged unnecessarily")
print()
print("(Reminder: These numbers are from a toy N=20 demo, not production-ready.)")

---
# Comprehensive Evaluation Report

The following function consolidates all metrics into a production-ready evaluation report.

In [None]:
def evaluate_autograder(
    human: np.ndarray, 
    auto: np.ndarray, 
    name: str = "Autograder",
    thresholds: dict = None
) -> dict:
    """
    Generate comprehensive evaluation metrics for an autograder.
    
    Args:
        human: Array of human scores
        auto: Array of autograder scores
        name: Identifier for reporting
        thresholds: Optional dict with 'kappa', 'tau', 'bias' thresholds
        
    Returns:
        Dictionary of computed metrics
    """
    # Default thresholds (general use case)
    if thresholds is None:
        thresholds = {'kappa': 0.70, 'tau': 0.60, 'bias': 0.30}
    
    # Agreement
    kappa = cohen_kappa_score(human, auto, weights='quadratic')
    
    # Correlation
    tau, tau_p = kendalltau(human, auto)
    rho, rho_p = spearmanr(human, auto)
    r, r_p = pearsonr(human, auto)
    
    # Bias
    bias = np.mean(auto - human)
    mae = np.mean(np.abs(auto - human))
    
    # Thresholds
    kappa_pass = kappa >= thresholds['kappa']
    tau_pass = tau >= thresholds['tau']
    bias_pass = abs(bias) <= thresholds['bias']
    
    print("=" * 60)
    print(f"AUTOGRADER EVALUATION REPORT: {name}")
    print("=" * 60)
    print(f"Sample size: {len(human)}")
    print()
    
    print("AGREEMENT")
    print("-" * 40)
    print(f"  Quadratic Kappa:   {kappa:.3f}  {'PASS' if kappa_pass else 'FAIL'}  (threshold: >= {thresholds['kappa']:.2f})")
    print()
    
    print("CORRELATION")
    print("-" * 40)
    print(f"  Kendall's Tau:     {tau:.3f}  {'PASS' if tau_pass else 'FAIL'}  (threshold: >= {thresholds['tau']:.2f})")
    print(f"  Spearman's Rho:    {rho:.3f}")
    print(f"  Pearson's R:       {r:.3f}  (use with caution for ordinal data)")
    print()
    
    print("BIAS")
    print("-" * 40)
    print(f"  Mean Bias:         {bias:+.3f}  {'PASS' if bias_pass else 'FAIL'}  (threshold: +/-{thresholds['bias']:.2f})")
    print(f"  Mean Abs Error:    {mae:.3f}")
    print()
    
    print("=" * 60)
    if kappa_pass and tau_pass and bias_pass:
        print("VERDICT: APPROVED FOR PRODUCTION")
    else:
        print("VERDICT: REQUIRES REMEDIATION")
        if not kappa_pass:
            print("  - Improve agreement: Review autograder prompt and rubric")
        if not tau_pass:
            print("  - Improve ranking: Check for systematic comparison errors")
        if not bias_pass:
            print(f"  - Correct bias: Autograder is {'lenient' if bias > 0 else 'harsh'}")
    print("=" * 60)
    
    return {
        'kappa': kappa, 'tau': tau, 'rho': rho, 'r': r,
        'bias': bias, 'mae': mae,
        'approved': kappa_pass and tau_pass and bias_pass,
        'n': len(human)
    }

# Run evaluation
results = evaluate_autograder(human_scores, autograder, "Password Reset Autograder v1.0")

---
# Domain-Specific Thresholds

Production thresholds should vary based on use case severity. The defaults above (Kappa ≥ 0.70, Tau ≥ 0.60, |bias| ≤ 0.30) are reasonable starting points but may be too lenient for high-stakes applications.

| Scenario | Suggested Kappa | Suggested absolute bias | Rationale |
|----------|-----------------|-----------|----------|
| **Internal search/chat** | ≥ 0.70 | ≤ 0.30 | Lower stakes; user can reformulate |
| **User-facing help/support** | ≥ 0.80 | ≤ 0.20 | User trust at stake; bad answers visible |
| **Safety/compliance gates** | ≥ 0.85 | ≤ 0.10 | Bad responses that pass the gate have serious consequences |

**Important:** These thresholds must be tuned empirically for your specific domain and validated against human-human baselines.

In [None]:
# Example: Evaluating with stricter safety thresholds
safety_thresholds = {'kappa': 0.85, 'tau': 0.75, 'bias': 0.10}

print("Evaluating with SAFETY-GRADE thresholds:\n")
safety_results = evaluate_autograder(
    human_scores, autograder, 
    "Password Reset Autograder (Safety Review)",
    thresholds=safety_thresholds
)

---
# Stratified Analysis by Slice

A production autograder needs to be validated **per slice**, not just globally. Hidden issues often emerge in specific segments—e.g., "autograder is harsh on mobile flows but lenient on admin flows."

Always validate across dimensions like:
- Surface/platform (web, mobile, API)
- Risk level (low, medium, high)
- Query category or intent
- Language or locale
- Model version

In [None]:
def evaluate_by_slice(human, auto, slices, slice_name="slice", min_n=5):
    """
    Evaluate autograder metrics stratified by a categorical variable.
    
    Args:
        human: Array of human scores
        auto: Array of autograder scores
        slices: Array of slice labels (same length as human/auto)
        slice_name: Name of the slice dimension for reporting
        min_n: Minimum samples required per slice
    """
    print(f"STRATIFIED ANALYSIS BY {slice_name.upper()}")
    print("=" * 60)
    
    results = []
    for s in sorted(set(slices)):
        mask = (slices == s)
        n = mask.sum()
        
        if n < min_n:
            print(f"\n{slice_name}={s}: n={n} (skipped, below min_n={min_n})")
            continue
            
        h, a = human[mask], auto[mask]
        
        kappa = cohen_kappa_score(h, a, weights='quadratic')
        tau, _ = kendalltau(h, a)
        bias = np.mean(a - h)
        
        print(f"\n{slice_name}={s} (n={n})")
        print(f"  Kappa: {kappa:.3f}  |  Tau: {tau:.3f}  |  Bias: {bias:+.3f}")
        
        results.append({
            'slice': s, 'n': n, 'kappa': kappa, 'tau': tau, 'bias': bias
        })
    
    print("\n" + "=" * 60)
    return results

# Analyze by surface
surface_results = evaluate_by_slice(human_scores, autograder, surfaces, "surface")

In [None]:
# Analyze by risk level
risk_results = evaluate_by_slice(human_scores, autograder, risk_levels, "risk_level")

---
# Disagreement Analysis

Understanding *where* and *why* the autograder disagrees with humans bridges metrics to qualitative debugging.

In [None]:
def analyze_disagreements(responses, human, auto, surfaces, risk_levels, threshold=0):
    """
    Identify and analyze disagreements between human and autograder scores.
    
    Args:
        responses: List of response texts
        human: Array of human scores
        auto: Array of autograder scores
        surfaces: Array of surface labels
        risk_levels: Array of risk level labels
        threshold: Flag an item as a disagreement when |delta| is strictly greater than this value
    """
    print("DISAGREEMENT ANALYSIS")
    print("=" * 80)
    
    disagreements = []
    for i in range(len(human)):
        delta = auto[i] - human[i]
        if abs(delta) > threshold:
            disagreements.append({
                'idx': i,
                'response': (responses[i][:60] + "...") if len(responses[i]) > 60 else responses[i],
                'human': human[i],
                'auto': auto[i],
                'delta': delta,
                'surface': surfaces[i],
                'risk': risk_levels[i],
                'direction': 'lenient' if delta > 0 else 'harsh'
            })
    
    if not disagreements:
        print("No disagreements found.")
        return []
    
    # Sort by absolute delta (worst first)
    disagreements.sort(key=lambda x: abs(x['delta']), reverse=True)
    
    print(f"Found {len(disagreements)} disagreements:\n")
    
    for d in disagreements[:10]:  # Show top 10
        print(f"[{d['idx']}] Human: {d['human']} | Auto: {d['auto']} | Delta: {d['delta']:+d} ({d['direction']})")
        print(f"    Surface: {d['surface']} | Risk: {d['risk']}")
        print(f"    \"{d['response']}\"")
        print()
    
    # Summary statistics
    lenient_count = sum(1 for d in disagreements if d['delta'] > 0)
    harsh_count = sum(1 for d in disagreements if d['delta'] < 0)
    
    print("-" * 80)
    print(f"Summary: {lenient_count} lenient, {harsh_count} harsh disagreements")
    
    return disagreements

disagreements = analyze_disagreements(
    responses, human_scores, autograder, surfaces, risk_levels
)

---
# Threshold Calibration: Operating Point Analysis

When using autograder scores as quality gates (e.g., "pass if score ≥ 4"), you need to understand the error rates at different thresholds.

In [None]:
def analyze_threshold_calibration(human, auto, pass_threshold=4):
    """
    Analyze error rates at a given pass/fail threshold.
    
    Args:
        human: Array of human scores
        auto: Array of autograder scores  
        pass_threshold: Score >= this is considered "pass"
    """
    print(f"THRESHOLD CALIBRATION (pass >= {pass_threshold})")
    print("=" * 60)
    
    # Binary pass/fail
    human_pass = human >= pass_threshold
    auto_pass = auto >= pass_threshold
    
    # Confusion matrix
    # True Positive: Both say pass
    # True Negative: Both say fail
    # False Positive: Auto says pass, human says fail (bad response slips through)
    # False Negative: Auto says fail, human says pass (good response blocked)
    
    tp = np.sum(human_pass & auto_pass)
    tn = np.sum(~human_pass & ~auto_pass)
    fp = np.sum(~human_pass & auto_pass)  # Auto too lenient
    fn = np.sum(human_pass & ~auto_pass)  # Auto too harsh
    
    total = len(human)
    
    print(f"\nConfusion Matrix:")
    print(f"                    Human")
    print(f"                 Pass    Fail")
    print(f"Auto Pass  |    {tp:3d}     {fp:3d}")
    print(f"Auto Fail  |    {fn:3d}     {tn:3d}")
    print()
    
    # Metrics
    accuracy = (tp + tn) / total
    false_positive_rate = fp / (fp + tn) if (fp + tn) > 0 else 0
    false_negative_rate = fn / (fn + tp) if (fn + tp) > 0 else 0
    
    print(f"Accuracy:            {accuracy:.1%}")
    print(f"False Positive Rate: {false_positive_rate:.1%} (bad responses slip through)")
    print(f"False Negative Rate: {false_negative_rate:.1%} (good responses blocked)")
    print()
    
    # Per-score breakdown
    print("Per-score breakdown:")
    for score in sorted(set(human)):
        mask = human == score
        auto_pass_rate = np.mean(auto[mask] >= pass_threshold)
        print(f"  Human {score}: {auto_pass_rate:.0%} passed by autograder (n={mask.sum()})")
    
    return {
        'threshold': pass_threshold,
        'accuracy': accuracy,
        'fpr': false_positive_rate,
        'fnr': false_negative_rate,
        'tp': tp, 'tn': tn, 'fp': fp, 'fn': fn
    }

# Analyze at threshold 4 ("good or better")
threshold_results = analyze_threshold_calibration(human_scores, autograder, pass_threshold=4)

In [None]:
# Compare multiple thresholds
print("THRESHOLD COMPARISON")
print("=" * 60)
print(f"{'Threshold':<12} {'Accuracy':<12} {'FPR':<12} {'FNR':<12}")
print("-" * 60)

for thresh in [3, 4, 5]:
    human_pass = human_scores >= thresh
    auto_pass = autograder >= thresh
    tp = np.sum(human_pass & auto_pass)
    tn = np.sum(~human_pass & ~auto_pass)
    fp = np.sum(~human_pass & auto_pass)
    fn = np.sum(human_pass & ~auto_pass)
    accuracy = (tp + tn) / len(human_scores)
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
    fnr = fn / (fn + tp) if (fn + tp) > 0 else 0
    print(f">= {thresh:<9} {accuracy:<12.1%} {fpr:<12.1%} {fnr:<12.1%}")
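
The comparison above moves the human and autograder thresholds together. When the gate sits downstream of a biased autograder, it can also help to hold the human pass threshold fixed and vary only the autograder cutoff. The sketch below illustrates that idea; `sweep_auto_cutoff` and the toy arrays are hypothetical names and data introduced for this example, not part of the notebook's earlier cells.

```python
import numpy as np

def sweep_auto_cutoff(human, auto, human_threshold=4, cutoffs=(2, 3, 4, 5)):
    """Hold the human pass threshold fixed and vary the autograder cutoff."""
    human = np.asarray(human)
    auto = np.asarray(auto)
    human_pass = human >= human_threshold
    n_pass = int(np.sum(human_pass))
    n_fail = int(np.sum(~human_pass))

    rows = []
    for cutoff in cutoffs:
        auto_pass = auto >= cutoff
        fp = int(np.sum(~human_pass & auto_pass))  # bad response slips through
        fn = int(np.sum(human_pass & ~auto_pass))  # good response blocked
        rows.append({
            'cutoff': cutoff,
            'fpr': fp / n_fail if n_fail else 0.0,
            'fnr': fn / n_pass if n_pass else 0.0,
        })
    return rows

# Toy data (hypothetical): an autograder that scores one point too high
toy_human = np.array([1, 2, 3, 3, 4, 4, 5, 2, 3, 5])
toy_auto = np.clip(toy_human + 1, 1, 5)

for row in sweep_auto_cutoff(toy_human, toy_auto):
    print(f"auto >= {row['cutoff']}: FPR={row['fpr']:.1%}  FNR={row['fnr']:.1%}")
```

On this toy data the lenient autograder lets half of the failing responses through at a cutoff of 4, while a cutoff of 5 reproduces the human pass/fail decisions exactly. Real data will rarely separate this cleanly, so treat the sweep as a way to see the trade-off rather than as a rule for picking the cutoff.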

---
# Ranking Validation: Best-of-N Selection

For model comparison or response selection tasks, you often need to pick the "best" response from a set. Kendall's Tau validates ranking, but you also want to measure **win-rate**: how often does the autograder select the same winner as humans?

**Note:** In practice, you often care about "picks within top-k by human" (e.g., top-2 or top-3) rather than exact winner match. The "win + tie" metric below is a narrower version of this: it only counts cases where the autograder picks a response tied for the best human score.

In [None]:
def evaluate_ranking_selection(human_rankings, auto_rankings, n_candidates=3):
    """
    Evaluate autograder's ability to select the best response from N candidates.
    
    Args:
        human_rankings: Array of human scores (one per response)
        auto_rankings: Array of autograder scores (one per response)
        n_candidates: Number of candidates per query
    
    Assumes responses are grouped by query (first n_candidates for query 1, etc.)
    """
    n_queries = len(human_rankings) // n_candidates
    
    wins = 0
    ties = 0
    losses = 0
    
    print(f"RANKING SELECTION ANALYSIS (best-of-{n_candidates})")
    print("=" * 60)
    
    for q in range(n_queries):
        start = q * n_candidates
        end = start + n_candidates
        
        h_scores = human_rankings[start:end]
        a_scores = auto_rankings[start:end]
        
        human_winner = np.argmax(h_scores)
        auto_winner = np.argmax(a_scores)
        
        # Check if autograder picked the human's top choice
        if h_scores[auto_winner] == h_scores[human_winner]:
            if human_winner == auto_winner:
                wins += 1
            else:
                ties += 1  # Autograder picked a different response with same score
        else:
            losses += 1
    
    total = wins + ties + losses
    print(f"Queries analyzed: {total}")
    print(f"Exact match (same winner): {wins} ({wins/total:.1%})")
    print(f"Acceptable (tied for best): {ties} ({ties/total:.1%})")
    print(f"Mismatch (wrong winner): {losses} ({losses/total:.1%})")
    print()
    print(f"Effective win rate: {(wins + ties)/total:.1%}")
    
    return {'wins': wins, 'ties': ties, 'losses': losses, 'win_rate': (wins + ties) / total}

# Demo: Simulate 6 queries with 3 candidates each
# (In production, you'd have actual response groups)
demo_human = np.array([5, 3, 2,  4, 4, 3,  3, 4, 5,  2, 3, 1,  5, 4, 4,  3, 2, 3])
demo_auto =  np.array([5, 3, 2,  4, 3, 4,  3, 4, 5,  2, 3, 2,  4, 5, 4,  3, 2, 3])

ranking_results = evaluate_ranking_selection(demo_human, demo_auto, n_candidates=3)
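
The "picks within top-k by human" idea from the note above can be sketched as follows. `top_k_hit_rate` is a hypothetical helper written for this example. It assumes the same grouping as `evaluate_ranking_selection` (consecutive blocks of `n_candidates` scores per query) and counts a pick as a hit when fewer than `k` candidates have a strictly higher human score, so ties are treated generously.

```python
import numpy as np

def top_k_hit_rate(human_scores, auto_scores, n_candidates=3, k=2):
    """Fraction of queries where the autograder's pick is in the human top-k."""
    human = np.asarray(human_scores)
    auto = np.asarray(auto_scores)
    n_queries = len(human) // n_candidates

    hits = 0
    for q in range(n_queries):
        start, end = q * n_candidates, (q + 1) * n_candidates
        h, a = human[start:end], auto[start:end]
        auto_pick = int(np.argmax(a))  # first index wins on ties
        # Count candidates that humans scored strictly higher than the pick
        n_better = int(np.sum(h > h[auto_pick]))
        if n_better < k:
            hits += 1
    return hits / n_queries if n_queries else 0.0

# Same values as the demo arrays above, repeated so the block runs on its own
example_human = np.array([5, 3, 2,  4, 4, 3,  3, 4, 5,  2, 3, 1,  5, 4, 4,  3, 2, 3])
example_auto = np.array([5, 3, 2,  4, 3, 4,  3, 4, 5,  2, 3, 2,  4, 5, 4,  3, 2, 3])

for k in (1, 2):
    rate = top_k_hit_rate(example_human, example_auto, n_candidates=3, k=k)
    print(f"Top-{k} hit rate: {rate:.1%}")
```

With `k=1` this matches the "effective win rate" above, because a pick tied for the best human score has no strictly better candidate.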

---
# Safety and Policy Grading

For safety-critical applications, quality scores alone are insufficient. You often need a separate **safety dimension** with stricter thresholds and asymmetric error costs: a false negative (unsafe content marked safe) is worse than a false positive (safe content flagged as unsafe).

In [None]:
# Simulated safety labels (0=unsafe, 1=borderline, 2=safe)
human_safety = np.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 2, 1, 2])
auto_safety =  np.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 0, 0, 0, 2, 1, 2])

print("SAFETY GRADING EVALUATION")
print("=" * 60)

# Kappa for safety dimension
safety_kappa = cohen_kappa_score(human_safety, auto_safety, weights='quadratic')
safety_bias = np.mean(auto_safety - human_safety)

print(f"Safety Quadratic Kappa: {safety_kappa:.3f}")
print(f"Safety Mean Bias:       {safety_bias:+.3f}")
print()

# For safety, we especially care about false negatives (unsafe content marked safe)
# Binary: unsafe (0) vs. not unsafe (1-2); borderline items are NOT counted as unsafe here
human_unsafe = human_safety == 0
auto_unsafe = auto_safety == 0

# False negative: Human says unsafe, auto says safe
safety_fn = np.sum(human_unsafe & ~auto_unsafe)
safety_fn_rate = safety_fn / np.sum(human_unsafe) if np.sum(human_unsafe) > 0 else 0

print(f"CRITICAL: False Negative Rate (unsafe content marked safe)")
print(f"  {safety_fn} of {np.sum(human_unsafe)} unsafe items missed ({safety_fn_rate:.1%})")
print()

if safety_fn_rate > 0:
    print("WARNING: Autograder missed unsafe content. Review and remediate before deployment.")
else:
    print("All unsafe content correctly identified.")
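
The cell above treats only label 0 as unsafe. A stricter reading, in line with the "stricter thresholds" mentioned earlier, also counts borderline items (label 1) as needing a flag. The sketch below compares both binarizations; `unsafe_miss_rate` and its `unsafe_max` parameter are hypothetical names introduced for this example, and the label arrays repeat the simulated values above so the block runs on its own.

```python
import numpy as np

def unsafe_miss_rate(human, auto, unsafe_max=0):
    """
    Miss rate for items humans flagged (label <= unsafe_max)
    that the autograder did not flag.
    """
    human = np.asarray(human)
    auto = np.asarray(auto)
    human_flag = human <= unsafe_max
    auto_flag = auto <= unsafe_max
    n_flagged = int(np.sum(human_flag))
    missed = int(np.sum(human_flag & ~auto_flag))
    rate = missed / n_flagged if n_flagged else 0.0
    return missed, n_flagged, rate

# Simulated safety labels (0=unsafe, 1=borderline, 2=safe), as in the cell above
labels_human = np.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 2, 1, 2])
labels_auto = np.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 0, 0, 0, 2, 1, 2])

for unsafe_max, name in [(0, "unsafe only"), (1, "unsafe + borderline")]:
    missed, n_flagged, rate = unsafe_miss_rate(labels_human, labels_auto, unsafe_max)
    print(f"{name:<20} missed {missed} of {n_flagged} ({rate:.1%})")
```

The two settings answer different questions: the first asks whether clearly unsafe content was caught, the second whether anything humans considered less than safe was passed as fully safe.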

---
# Rubric Drift Experiment

Low Kappa can indicate rubric ambiguity rather than autograder failure. This experiment simulates a second human annotator who interprets the rubric slightly differently.

In [None]:
# Simulate rubric drift: Human 2 is more generous with "friendly" responses
# and rates selected conversational responses one point higher (3 -> 4, 2 -> 3)
human2_drift = human_scores.copy()

# Indices 7 and 18 are conversational responses that Human 2 rates higher
human2_drift[7] = 4  # Was 3
human2_drift[18] = 3  # Was 2

print("RUBRIC DRIFT SIMULATION")
print("=" * 60)
print("Scenario: Human 2 rates 'friendly' responses 1 point higher")
print()

h1_h2_kappa = cohen_kappa_score(human_scores, human2_drift, weights='quadratic')
auto_h1_kappa = cohen_kappa_score(human_scores, autograder, weights='quadratic')
auto_h2_kappa = cohen_kappa_score(human2_drift, autograder, weights='quadratic')

print(f"Human1 vs Human2 Kappa:      {h1_h2_kappa:.3f}")
print(f"Autograder vs Human1 Kappa:  {auto_h1_kappa:.3f}")
print(f"Autograder vs Human2 Kappa:  {auto_h2_kappa:.3f}")
print()
print("Takeaway: If human-human Kappa is low, clarify the rubric before")
print("          blaming the autograder. Both raters may be 'reasonable'")
print("          but using different standards.")

---
# Failure Mode Analysis

Understanding common autograder failure patterns aids in diagnosis and remediation.

In [None]:
print("FAILURE MODE: Constant Output (Lazy Autograder)")
print("Cause: Prompt too vague, model defaults to a single 'safe' score")
print("-" * 60)
lazy = np.full(len(human_scores), 4)  # Same length as the human scores
_ = evaluate_autograder(human_scores, lazy, "Constant-4 Autograder")

In [None]:
print("FAILURE MODE: Systematic Leniency")
print("Cause: Prompt avoids low scores, or model optimizes for user satisfaction")
print("-" * 60)
lenient = np.clip(human_scores + 1, 1, 5)
_ = evaluate_autograder(human_scores, lenient, "Lenient Autograder (+1)")

In [None]:
print("FAILURE MODE: Systematic Harshness")
print("Cause: Rubric standards too strict, or prompt emphasizes criticism")
print("-" * 60)
harsh = np.clip(human_scores - 1, 1, 5)
_ = evaluate_autograder(human_scores, harsh, "Harsh Autograder (-1)")

In [None]:
print("FAILURE MODE: Random Output")
print("Cause: Model failure, prompt injection, or complete misalignment")
print("-" * 60)
np.random.seed(42)
random_scores = np.random.randint(1, 6, len(human_scores))  # Uniform over 1-5
_ = evaluate_autograder(human_scores, random_scores, "Random Autograder")

---
# Metric Selection Guide

| Use Case | Primary Metric | Secondary Metrics |
|----------|---------------|-------------------|
| General autograder validation | Quadratic Kappa | Bias, MAE |
| A/B testing / Model comparison | Kendall's Tau | Spearman's Rho |
| Quality threshold calibration | Mean Bias | Kappa |
| Academic publication | Kappa + Tau | Spearman, Pearson |

## Recommended Thresholds for Production

| Metric | Acceptable (General) | Ideal | Safety-Critical |
|--------|----------------------|-------|-----------------|
| Quadratic Kappa | ≥ 0.70 | ≥ 0.80 | ≥ 0.85 |
| Kendall's Tau | ≥ 0.60 | ≥ 0.75 | ≥ 0.80 |
| Spearman's Rho | ≥ 0.70 | ≥ 0.85 | ≥ 0.90 |
| Mean Bias | within ±0.30 | within ±0.15 | within ±0.10 |
| MAE | ≤ 0.50 | ≤ 0.30 | ≤ 0.20 |

## Sample Size Guidelines

| Purpose | Minimum N | Notes |
|---------|----------|-------|
| Initial validation | 50 | Wide confidence intervals |
| Production approval | 100 | Reasonable precision |
| High-stakes deployment | 200+ | Narrow confidence intervals |
| Per-category validation | 30 per category | Minimum for slice-level metrics |

**Note:** At N=20 (as in this notebook), confidence intervals for Kappa and Tau are typically ±0.15–0.20. These examples are illustrative, not production-ready.
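
One common way to estimate such intervals is a percentile bootstrap over items. The sketch below is a minimal version under stated assumptions: `bootstrap_ci` and `mean_bias` are hypothetical helpers written for this example, the toy arrays are made up, and the demo uses mean bias so the block depends only on NumPy. Any function of `(human, auto)` can be passed as `metric`, for example `lambda h, a: cohen_kappa_score(h, a, weights='quadratic')`.

```python
import numpy as np

def bootstrap_ci(human, auto, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for metric(human, auto), resampling items in pairs."""
    human = np.asarray(human)
    auto = np.asarray(auto)
    rng = np.random.default_rng(seed)
    n = len(human)

    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample items with replacement
        stats[b] = metric(human[idx], auto[idx])

    # nanpercentile skips resamples where the metric was undefined (NaN)
    lo, hi = np.nanpercentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

def mean_bias(human, auto):
    """Average signed difference (autograder minus human)."""
    return float(np.mean(auto - human))

# Toy data (hypothetical)
toy_human = np.array([5, 4, 3, 5, 2, 4, 3, 5, 4, 3, 1, 2, 4, 3, 5, 2, 3, 4, 1, 5])
toy_auto = np.array([5, 4, 4, 4, 2, 4, 3, 5, 5, 3, 2, 2, 4, 3, 5, 3, 3, 4, 1, 4])

low, high = bootstrap_ci(toy_human, toy_auto, mean_bias)
print(f"Mean bias: {mean_bias(toy_human, toy_auto):+.3f}  95% CI: [{low:+.3f}, {high:+.3f}]")
```

At small N, some resamples may contain a single score value, which can make agreement metrics such as Kappa undefined. That is why the sketch ignores NaN values when taking percentiles.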

---
# Apply to Your Data

## Quick Start

Replace the arrays below with your evaluation data:

In [None]:
# Your data:
my_human_scores = np.array([5, 4, 3, 5, 2, 4, 3, 5, 4, 3])
my_autograder   = np.array([5, 4, 3, 4, 2, 4, 3, 5, 4, 3])

my_results = evaluate_autograder(my_human_scores, my_autograder, "Custom Autograder")

## CSV Import Function

For file-based workflows:

In [None]:
import csv
from pathlib import Path

def run_eval(human_csv_path: str, auto_csv_path: str, id_col: str = 'id', score_col: str = 'score'):
    """
    Load human and autograder scores from CSV files and run evaluation.
    
    Args:
        human_csv_path: Path to CSV with human scores
        auto_csv_path: Path to CSV with autograder scores
        id_col: Column name for response IDs
        score_col: Column name for scores
        
    Returns:
        Evaluation results dictionary
    """
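    # Load a CSV into an {id: score} mapping
    def load_csv(path):
        # newline='' is what the csv module recommends for file objects
        with open(path, 'r', newline='') as f:
            reader = csv.DictReader(f)
            return {row[id_col]: int(row[score_col]) for row in reader}
    
    human_data = load_csv(human_csv_path)
    auto_data = load_csv(auto_csv_path)
    
    # Join on ID; sort once so both arrays share the same order
    common_ids = sorted(set(human_data) & set(auto_data))
    
    if not common_ids:
        raise ValueError("No overlapping IDs between the human and autograder files")
    if len(common_ids) < len(human_data):
        print(f"Warning: {len(human_data) - len(common_ids)} items in human file not found in auto file")
    if len(common_ids) < len(auto_data):
        print(f"Warning: {len(auto_data) - len(common_ids)} items in auto file not found in human file")
    
    human_scores = np.array([human_data[i] for i in common_ids])
    auto_scores = np.array([auto_data[i] for i in common_ids])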
    
    # Run evaluation
    return evaluate_autograder(human_scores, auto_scores, f"Evaluation ({Path(human_csv_path).stem})")

# Example usage (uncomment and modify paths):
# results = run_eval('human_scores.csv', 'autograder_scores.csv', id_col='response_id', score_col='quality_score')

## Export Results for Logging

For experiment tracking and CI/CD pipelines:

In [None]:
import json
from typing import Optional

def export_results(results: dict, output_path: Optional[str] = None) -> str:
    """
    Export evaluation results as JSON for logging or CI/CD.
    
    Args:
        results: Dictionary from evaluate_autograder()
        output_path: Optional path to save JSON file
        
    Returns:
        JSON string of results
    """
    summary = {
        'kappa_quadratic': round(results['kappa'], 4),
        'kendall_tau': round(results['tau'], 4),
        'spearman_rho': round(results['rho'], 4),
        'pearson_r': round(results['r'], 4),
        'mean_bias': round(results['bias'], 4),
        'mae': round(results['mae'], 4),
        # Cast to built-in types: NumPy bools are not JSON-serializable
        'approved': bool(results['approved']),
        'n': int(results['n'])
    }
    
    json_str = json.dumps(summary, indent=2)
    
    if output_path:
        with open(output_path, 'w') as f:
            f.write(json_str)
        print(f"Results saved to {output_path}")
    
    return json_str

# Export current results
print("Exportable summary:")
print(export_results(results))

---
# Next Steps

After running this validation:

1. **If approved:** Document thresholds used, archive the validation dataset, schedule periodic revalidation

2. **If failed on Kappa:** 
   - First check human-human Kappa to establish ceiling
   - Review autograder prompt for clarity
   - Check rubric for ambiguous criteria
   - Analyze disagreements by slice

3. **If failed on bias:**
   - Lenient: Add examples of low-quality responses to prompt
   - Harsh: Add examples of acceptable responses to prompt
   - Calibrate threshold if gate is downstream

4. **If failed on Tau:**
   - Check for score compression (all 3s and 4s)
   - Verify ranking examples in prompt
   - Consider pairwise comparison approach

5. **For high-stakes deployment:**
   - Increase sample size to 200+
   - Add confidence intervals (bootstrap)
   - Validate per slice
   - Separate safety dimension
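
For the score-compression check in step 4, a rough first look is to see how much of the scale the autograder actually uses. The sketch below is one way to do that; `score_spread` and the toy arrays are hypothetical, and what counts as "too compressed" is a judgment call rather than a fixed rule.

```python
import numpy as np

def score_spread(scores, scale_min=1, scale_max=5):
    """Summarize how much of the rating scale a set of scores uses."""
    scores = np.asarray(scores)
    values, counts = np.unique(scores, return_counts=True)
    shares = counts / counts.sum()
    return {
        'distinct_used': int(len(values)),
        'distinct_possible': scale_max - scale_min + 1,
        # Share of all scores taken by the two most common values
        'top_two_share': float(np.sort(shares)[-2:].sum()),
        'std': float(np.std(scores)),
    }

# Toy data (hypothetical): a compressed grader vs. one that uses the full scale
compressed = np.array([3, 4, 3, 4, 4, 3, 3, 4, 4, 3])
full_range = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])

print("Compressed:", score_spread(compressed))
print("Full range:", score_spread(full_range))
```

A grader that puts nearly all its scores on two adjacent values produces many tied pairs, which limits how well any rank correlation can reflect the human ordering.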