# Stage 4: Rubric-Based Evaluation

This notebook walks through **multi-dimensional scoring** of AI responses using rubrics.

Unlike pass/fail tests, rubrics grade responses across multiple quality dimensions:
- **Relevance**: Does it answer the question?
- **Accuracy**: Is the information correct?
- **Completeness**: Does it cover everything needed?
- **Clarity**: Is it well-organized and readable?
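
This notebook does not show how `RubricScorer` combines the dimensions into a single overall score. A common approach is a weighted average, sketched below. The scores, the weights, and the 1-5 scale are all illustrative assumptions; the real weights come from `rubrics.yaml`.

```python
# Hypothetical per-dimension scores on an assumed 1-5 scale.
scores = {"relevance": 5, "accuracy": 4, "completeness": 3, "clarity": 4}

# Hypothetical weights; the real values are defined in rubrics.yaml.
weights = {"relevance": 0.25, "accuracy": 0.35, "completeness": 0.25, "clarity": 0.15}


def weighted_overall(scores, weights):
    """Return the weighted average of the dimension scores."""
    total_weight = sum(weights[dim] for dim in scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(scores[dim] * weights[dim] for dim in scores) / total_weight


overall = weighted_overall(scores, weights)
print(f"Overall: {overall:.2f}")
```

Dividing by the total weight keeps the result on the same scale as the inputs even if the weights do not sum exactly to 1.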

## Setup

In [None]:
import sys
from pathlib import Path

# Add parent directory to path
sys.path.insert(0, str(Path.cwd().parent))

import yaml
from scorer import RubricScorer, print_result
from evaluator import run_rubric_evaluation, print_summary, load_rubric_test_cases

## Understanding Rubrics

Let's examine how rubrics are structured:

In [None]:
# Load and explore the rubrics
with open("rubrics.yaml") as f:
    rubrics = yaml.safe_load(f)

print("Evaluation Dimensions:")
print("=" * 50)
for dim_name, dim_info in rubrics["dimensions"].items():
    print(f"\n{dim_info['name']} (weight: {dim_info['weight']:.0%})")
    print(f"  {dim_info['description']}")

In [None]:
# View the scoring criteria for one dimension
print("Accuracy Scoring Criteria:")
print("=" * 50)
for score, desc in rubrics["dimensions"]["accuracy"]["criteria"].items():
    print(f"  {score}: {desc}")
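
The cells above read four keys from each dimension: `name`, `weight`, `description`, and `criteria`. The sketch below uses that inferred shape to sanity-check a rubric before scoring. The example values, the `validate_rubrics` helper, and the expectation that weights sum to 1 are assumptions for illustration, not part of `scorer`.

```python
import math

# Illustrative rubric in the shape the cells above read; not copied from rubrics.yaml.
example_rubrics = {
    "dimensions": {
        "accuracy": {
            "name": "Accuracy",
            "weight": 0.6,
            "description": "Is the information correct?",
            "criteria": {5: "Fully correct", 3: "Minor errors", 1: "Mostly wrong"},
        },
        "clarity": {
            "name": "Clarity",
            "weight": 0.4,
            "description": "Is it well-organized and readable?",
            "criteria": {5: "Easy to follow", 3: "Understandable", 1: "Confusing"},
        },
    }
}

REQUIRED_KEYS = {"name", "weight", "description", "criteria"}


def validate_rubrics(rubrics):
    """Return a list of problems found in a rubrics mapping (empty if none)."""
    problems = []
    dimensions = rubrics.get("dimensions", {})
    if not dimensions:
        problems.append("no dimensions defined")
    for dim_name, dim_info in dimensions.items():
        missing = REQUIRED_KEYS - dim_info.keys()
        if missing:
            problems.append(f"{dim_name}: missing {sorted(missing)}")
    # Assumption: default weights are fractions that should sum to 1.
    total = sum(d.get("weight", 0) for d in dimensions.values())
    if dimensions and not math.isclose(total, 1.0, abs_tol=1e-6):
        problems.append(f"weights sum to {total:.2f}, expected 1.00")
    return problems


problems = validate_rubrics(example_rubrics)
print(problems or "Rubric looks well-formed")
```

Running a check like this right after `yaml.safe_load` surfaces a missing key or mistyped weight before any responses are scored.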

## Scoring a Single Response

Let's run a query through the agent and score it:

In [None]:
# Import the agent from local config
from rubric_config import ask_acme

# Run a query
query = "What is our remote work policy?"
response = ask_acme(query)

print(f"Query: {query}")
print("-" * 50)
print(f"Response:\n{response}")

In [None]:
# Score the response
scorer = RubricScorer()
result = scorer.score(
    query=query,
    response=response,
    sources=["remote_work_policy.md"],
    category="policy"
)

print_result(result)

## Category-Specific Weights

Different query types may prioritize different dimensions. For example:
- **Policy questions**: Accuracy is critical
- **Metrics questions**: Completeness carries more weight

Let's compare the weights:

In [None]:
print("Default Weights vs Category-Specific Weights")
print("=" * 60)

default_weights = scorer.get_weights()
print("\nDefault:")
for dim, weight in default_weights.items():
    print(f"  {dim}: {weight:.0%}")

for category in ["policy", "metrics", "engineering"]:
    weights = scorer.get_weights(category)
    print(f"\n{category.capitalize()}:")
    for dim, weight in weights.items():
        diff = weight - default_weights[dim]
        indicator = "↑" if diff > 0 else "↓" if diff < 0 else " "
        print(f"  {dim}: {weight:.0%} {indicator}")
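
How `scorer.get_weights(category)` derives its per-category weights is not shown in this notebook. One plausible approach is to overlay a category's overrides on the defaults and renormalize so the weights still sum to 1. The `resolve_weights` helper and all numbers below are hypothetical.

```python
def resolve_weights(default_weights, overrides=None):
    """Overlay category overrides on the defaults, then renormalize to sum to 1."""
    merged = {**default_weights, **(overrides or {})}
    total = sum(merged.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {dim: weight / total for dim, weight in merged.items()}


# Hypothetical defaults and a hypothetical override for policy questions.
defaults = {"relevance": 0.25, "accuracy": 0.25, "completeness": 0.25, "clarity": 0.25}
policy_overrides = {"accuracy": 0.50}

policy_weights = resolve_weights(defaults, policy_overrides)
for dim, weight in policy_weights.items():
    print(f"  {dim}: {weight:.0%}")
```

With these numbers, accuracy rises from 25% to 40% and every other dimension drops to 20%, so raising one dimension implicitly lowers the rest.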

## Running a Full Evaluation

Let's run the complete rubric evaluation suite:

In [None]:
# Load test cases
test_cases = load_rubric_test_cases("rubrics.yaml")

print(f"Found {len(test_cases)} test cases:")
for case in test_cases[:5]:
    print(f"  [{case['id']}] {case['query'][:50]}...")

In [None]:
# Run evaluation on a subset (for speed)
results, summary = run_rubric_evaluation(test_cases[:3], verbose=True)

In [None]:
# View summary
print_summary(summary)

## Quality Thresholds

Set thresholds to catch quality regressions:

In [None]:
# Define minimum acceptable scores
QUALITY_THRESHOLDS = {
    "overall": 3.5,
    "accuracy": 4.0,  # Accuracy is critical
    "relevance": 3.5,
    "completeness": 3.0,
    "clarity": 3.0
}

def check_quality_gates(results):
    """Check if results meet quality thresholds."""
    # Nothing to average: report it instead of dividing by zero
    if not results:
        return ["No results to evaluate"]

    failures = []

    # Check overall average
    avg_overall = sum(r.overall_score for r in results) / len(results)
    if avg_overall < QUALITY_THRESHOLDS["overall"]:
        failures.append(f"Overall avg {avg_overall:.2f} < {QUALITY_THRESHOLDS['overall']}")

    # Check dimension averages
    for dim in ["accuracy", "relevance", "completeness", "clarity"]:
        scores = [s.score for r in results for s in r.scores if s.dimension == dim]
        # A dimension with no scores cannot pass its gate
        if not scores:
            failures.append(f"{dim.capitalize()}: no scores found")
            continue
        avg = sum(scores) / len(scores)
        if avg < QUALITY_THRESHOLDS[dim]:
            failures.append(f"{dim.capitalize()} avg {avg:.2f} < {QUALITY_THRESHOLDS[dim]}")

    return failures

# Check our results
failures = check_quality_gates(results)
if failures:
    print("Quality Gate Failures:")
    for f in failures:
        print(f"  - {f}")
else:
    print("All quality gates passed!")
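
Average-based gates can hide a single poor response: one accuracy score of 3 is masked by another of 5. The sketch below flags individual scores under a threshold. The dataclasses are stand-ins that mirror only the attributes used above (`overall_score`, `scores`, `dimension`, `score`); the `query` field and `find_low_scoring` helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DimensionScore:
    dimension: str
    score: float


@dataclass
class ScoredResponse:
    query: str  # hypothetical field, used here only to label the result
    overall_score: float
    scores: list = field(default_factory=list)


def find_low_scoring(results, thresholds):
    """Return (query, dimension, score) for each individual score below its threshold."""
    flagged = []
    for r in results:
        for s in r.scores:
            minimum = thresholds.get(s.dimension)
            if minimum is not None and s.score < minimum:
                flagged.append((r.query, s.dimension, s.score))
    return flagged


# Two made-up results whose accuracy averages exactly 4.0, which meets the gate.
sample_results = [
    ScoredResponse("q1", 4.8, [DimensionScore("accuracy", 5.0)]),
    ScoredResponse("q2", 3.2, [DimensionScore("accuracy", 3.0)]),
]

flagged = find_low_scoring(sample_results, {"accuracy": 4.0})
for query, dim, score in flagged:
    print(f"  {query}: {dim} scored {score}")
```

Here the averaged accuracy gate passes, but `q2` is still flagged, which makes it easier to find the specific response behind a borderline average.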

## Try Your Own Query

Test a query and see how it scores:

In [None]:
# Change this to test different queries
test_query = "How do I request time off?"
test_category = "policy"  # optional: policy, metrics, engineering, cross_tool

# Get response
response = ask_acme(test_query)
print(f"Query: {test_query}")
print(f"Response: {response}\n")

# Score it
result = scorer.score(
    query=test_query,
    response=response,
    category=test_category
)
print_result(result)

## Key Takeaways

1. **Rubrics provide nuance**: Graded scores reveal gradual degradation before it shows up as hard failures
2. **Multiple dimensions**: Different aspects of quality can be tracked independently
3. **Category weights**: Customize what matters for different query types
4. **Integration**: Rubrics layer on top of golden sets and scenarios
5. **Quality gates**: Set thresholds to prevent quality regressions

## Next Steps

- Calibrate rubric weights for your specific use case
- Set up automated quality tracking over time
- Move to CI integration for continuous evaluation
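
As a starting point for tracking quality over time, the sketch below saves per-dimension averages as a JSON baseline and reports dimensions that dropped by more than a tolerance on a later run. The helper names, the file layout, and the 0.25 tolerance are assumptions for illustration, not part of `evaluator`.

```python
import json
import tempfile
from pathlib import Path


def save_baseline(averages, path):
    """Write per-dimension average scores to a JSON file."""
    Path(path).write_text(json.dumps(averages, indent=2))


def load_baseline(path):
    """Read per-dimension average scores back from a JSON file."""
    return json.loads(Path(path).read_text())


def find_regressions(current, baseline, tolerance=0.25):
    """Return {dimension: (baseline, current)} for drops larger than tolerance."""
    return {
        dim: (baseline[dim], score)
        for dim, score in current.items()
        if dim in baseline and baseline[dim] - score > tolerance
    }


# Made-up averages from an earlier run and from the current run.
previous_run = {"overall": 4.2, "accuracy": 4.5, "clarity": 4.0}
current_run = {"overall": 4.1, "accuracy": 3.9, "clarity": 4.0}

with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "baseline.json"
    save_baseline(previous_run, baseline_path)
    regressions = find_regressions(current_run, load_baseline(baseline_path))

for dim, (before, after) in regressions.items():
    print(f"  {dim}: {before:.2f} -> {after:.2f}")
```

The tolerance absorbs small run-to-run noise, so only the accuracy drop from 4.5 to 3.9 is reported here. In a CI job, a non-empty result could be turned into a failing exit code.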