# Workout: Evaluation

## Setup
```bash
uv add openai sentence-transformers
```

---
## Drill 1: Basic Response Check 🟢
**Task:** Create heuristic response evaluator

```python
def evaluate_response(response: str) -> dict:
    """Quick quality checks on response."""
    return {
        "length": ...,
        "word_count": ...,
        "is_empty": ...,
        "is_refusal": ...,  # "I cannot", "I'm unable to"
    }

# Test
result = evaluate_response("I cannot answer that question.")
print(result)
```
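
For reference after attempting the drill, here is one possible way to fill in the four fields. The name `evaluate_response_sketch` and the `REFUSAL_MARKERS` tuple are assumptions made for illustration; a real refusal detector would likely need a longer phrase list or a classifier.

```python
# Hypothetical phrase list; extend it for your own model's refusal style.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm unable to", "i am unable to")


def evaluate_response_sketch(response: str) -> dict:
    """Cheap heuristic checks that need no model call."""
    text = response.strip()
    lowered = text.lower()
    return {
        "length": len(text),                # characters, ignoring outer whitespace
        "word_count": len(text.split()),    # whitespace-separated tokens
        "is_empty": not text,               # True for "" or whitespace only
        "is_refusal": any(marker in lowered for marker in REFUSAL_MARKERS),
    }


print(evaluate_response_sketch("I cannot answer that question."))
```

Substring matching is deliberately simple here: it is fast and transparent, but it will miss paraphrased refusals and can misfire on quoted text.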

---
## Drill 2: String Similarity 🟢
**Task:** Calculate text similarity

```python
from difflib import SequenceMatcher

def string_similarity(a: str, b: str) -> float:
    """Return 0-1 similarity score."""
    pass

# Test
print(string_similarity("hello world", "hello there"))
print(string_similarity("python", "python"))
```
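
One possible solution, shown for comparison once you have tried the drill: `SequenceMatcher.ratio()` already returns a value in the 0 to 1 range, so the function can be a thin wrapper. The name `string_similarity_sketch` is only used here to avoid clashing with your own implementation.

```python
from difflib import SequenceMatcher


def string_similarity_sketch(a: str, b: str) -> float:
    """Return a 0-1 similarity score based on matching character blocks."""
    # The first argument is the "junk" predicate; None disables junk filtering.
    return SequenceMatcher(None, a, b).ratio()


print(string_similarity_sketch("hello world", "hello there"))
print(string_similarity_sketch("python", "python"))
```

This measures surface overlap only, so two paraphrases with different wording will score low even when they mean the same thing.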

---
## Drill 3: Semantic Similarity 🟡
**Task:** Use embeddings for similarity

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(a: str, b: str) -> float:
    """Return cosine similarity of embeddings."""
    pass

# Test
print(semantic_similarity("I love dogs", "Dogs are my favorite"))
print(semantic_similarity("I love dogs", "The weather is nice"))
```
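
The core of this drill is the cosine formula itself. The sketch below computes it on plain Python lists so it runs without downloading a model; in the drill, the two vectors would instead be the embeddings of `a` and `b` (for example, the output of `model.encode`). The helper name `cosine_similarity` and the toy vectors are assumptions for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_u * norm_v)


# Toy vectors standing in for sentence embeddings.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Because cosine similarity ignores vector length, only the direction of the embeddings matters, which is why it is a common choice for comparing sentence embeddings.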

---
## Drill 4: LLM-as-Judge 🟡
**Task:** Evaluate response with GPT-4

```python
from openai import OpenAI

client = OpenAI()

def evaluate_with_llm(
    query: str,
    response: str,
    criteria: str = "relevance and accuracy"
) -> dict:
    """Return {score: 1-5, reasoning: str}."""
    pass

# Test
result = evaluate_with_llm(
    "What is Python?",
    "Python is a snake species.",
    "accuracy"
)
print(result)
```
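
Two parts of this drill can be worked out without calling the API: building the judge prompt and parsing the judge's reply defensively. The sketch below covers only those parts; the model call is left out on purpose, and the helper names `build_judge_prompt` and `parse_judge_reply`, as well as the prompt wording, are assumptions rather than a prescribed format.

```python
import json


def build_judge_prompt(query: str, response: str,
                       criteria: str = "relevance and accuracy") -> str:
    """Assemble a grading prompt that asks for a JSON-only answer."""
    return (
        f"Rate the response to the query on {criteria}.\n"
        f"Query: {query}\n"
        f"Response: {response}\n"
        'Reply with JSON only: {"score": <integer 1-5>, "reasoning": "<one sentence>"}'
    )


def parse_judge_reply(raw: str) -> dict:
    """Turn the judge's text reply into {score, reasoning}, tolerating bad output."""
    try:
        data = json.loads(raw)
        score = int(data["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Judges sometimes ignore the format; surface that instead of crashing.
        return {"score": None, "reasoning": f"unparseable reply: {raw!r}"}
    score = min(5, max(1, score))  # clamp to the 1-5 scale
    return {"score": score, "reasoning": str(data.get("reasoning", ""))}


print(build_judge_prompt("What is Python?", "Python is a snake species.", "accuracy"))
print(parse_judge_reply('{"score": 1, "reasoning": "Confuses the language with the animal."}'))
```

In your `evaluate_with_llm`, the string passed to `parse_judge_reply` would be the text of the model's reply to the prompt.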

---
## Drill 5: Pairwise Comparison 🟡
**Task:** Compare two responses

```python
def compare_responses(query: str, response_a: str, response_b: str) -> str:
    """Return 'A', 'B', or 'tie'."""
    pass

# Test
winner = compare_responses(
    "What is 2+2?",
    "The answer is 4.",
    "2+2 equals 4, which is the sum of two 2s."
)
print(f"Winner: {winner}")
```
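
LLM judges can favor whichever response appears first, so one common safeguard is to ask twice with the order swapped and accept a winner only when both runs agree. The sketch below shows that control flow with a pluggable judge; `compare_both_orders`, the `Judge` type, and the toy `longer_is_better` judge are hypothetical names, and a real judge would call a model instead.

```python
from typing import Callable

# A judge sees (query, first, second) and answers "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def compare_both_orders(query: str, response_a: str, response_b: str,
                        judge: Judge) -> str:
    """Return 'A', 'B', or 'tie', requiring agreement across both orderings."""
    forward = judge(query, response_a, response_b)   # A shown first
    backward = judge(query, response_b, response_a)  # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # disagreement or explicit ties count as a tie


def longer_is_better(query: str, first: str, second: str) -> str:
    """Toy stand-in judge: prefers the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


print(compare_both_orders(
    "What is 2+2?",
    "The answer is 4.",
    "2+2 equals 4, which is the sum of two 2s.",
    longer_is_better,
))
```

A judge that always picks the first position produces contradictory verdicts across the two runs, which this scheme reports as a tie rather than a win.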

---
## Drill 6: Retrieval Metrics 🟡
**Task:** Calculate precision, recall, MRR

```python
def retrieval_metrics(
    retrieved: list[str],
    relevant: list[str],
    k: int = 5
) -> dict:
    """Calculate retrieval quality metrics."""
    pass

# Test
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = ["doc2", "doc5", "doc7"]
print(retrieval_metrics(retrieved, relevant))
```
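
A possible reference implementation for a single query follows. It assumes precision@k divides by `k` (some definitions divide by the number of documents actually returned) and reports the reciprocal rank of the first relevant hit; MRR is then the mean of that value over many queries. The function name and dictionary keys are illustrative choices.

```python
def retrieval_metrics_sketch(retrieved: list[str], relevant: list[str],
                             k: int = 5) -> dict:
    """Precision@k, recall@k, and reciprocal rank for one query."""
    top_k = retrieved[:k]
    relevant_set = set(relevant)
    hits = sum(1 for doc in top_k if doc in relevant_set)

    # Rank (1-based) of the first relevant document, if any.
    reciprocal_rank = 0.0
    for rank, doc in enumerate(top_k, start=1):
        if doc in relevant_set:
            reciprocal_rank = 1.0 / rank
            break

    return {
        "precision_at_k": hits / k if k > 0 else 0.0,
        "recall_at_k": hits / len(relevant_set) if relevant_set else 0.0,
        "reciprocal_rank": reciprocal_rank,
    }


retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = ["doc2", "doc5", "doc7"]
print(retrieval_metrics_sketch(retrieved, relevant))
```

For the test data, two of the five retrieved documents are relevant (precision 0.4), two of the three relevant documents were found (recall 2/3), and the first hit is at rank 2 (reciprocal rank 0.5).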

---
## Drill 7: Faithfulness Check 🔴
**Task:** Check if response is grounded in context

```python
def check_faithfulness(
    response: str,
    context: list[str]
) -> dict:
    """Return {score: 0-1, ungrounded_claims: list}."""
    pass

# Test
context = ["Python was created by Guido van Rossum in 1991."]
response = "Python was created by Guido in 1991. It's the most popular language."
# "most popular" is not in context - should flag
print(check_faithfulness(response, context))
```
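
As a baseline before reaching for an LLM, faithfulness can be approximated lexically: split the response into sentences and flag any sentence whose words mostly do not appear in the context. This is a crude stand-in and not a real claim verifier; the function name, the tokenizer, and the `threshold` of 0.6 are all assumptions chosen for illustration.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def check_faithfulness_sketch(response: str, context: list[str],
                              threshold: float = 0.6) -> dict:
    """Flag sentences whose token overlap with the context is below threshold."""
    context_tokens = _tokens(" ".join(context))
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]

    ungrounded = []
    for sentence in sentences:
        words = _tokens(sentence)
        if not words:
            continue
        support = len(words & context_tokens) / len(words)
        if support < threshold:
            ungrounded.append(sentence)

    grounded = len(sentences) - len(ungrounded)
    score = grounded / len(sentences) if sentences else 1.0
    return {"score": score, "ungrounded_claims": ungrounded}


context = ["Python was created by Guido van Rossum in 1991."]
response = "Python was created by Guido in 1991. It's the most popular language."
print(check_faithfulness_sketch(response, context))
```

Word overlap cannot detect a sentence that reuses the context's vocabulary to say something false, so treat this as a cheap first filter and compare it against an LLM-based check.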

---
## Drill 8: Test Case Runner 🔴
**Task:** Build evaluation test suite

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    id: str
    query: str
    expected: str

def run_test_suite(
    cases: list[TestCase],
    pipeline: Callable[[str], str],  # maps a query to a response
    threshold: float = 0.8
) -> dict:
    """Run tests and return summary."""
    pass

# Test
cases = [
    TestCase("q1", "2+2?", "4"),
    TestCase("q2", "Capital of France?", "Paris"),
]
# The lambda is a stand-in; pass your own pipeline here.
print(run_test_suite(cases, pipeline=lambda query: "4"))
```
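
One way to structure the runner is sketched below. It scores each case with the character-level similarity from Drill 2 and treats anything under `threshold` as a failure; swapping in semantic similarity or an LLM judge is a reasonable alternative. The summary keys and the `canned` lookup used as a fake pipeline are assumptions for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable


@dataclass
class TestCase:
    id: str
    query: str
    expected: str


def run_suite_sketch(cases: list[TestCase], pipeline: Callable[[str], str],
                     threshold: float = 0.8) -> dict:
    """Run every case through the pipeline and summarize passes and failures."""
    failures = []
    for case in cases:
        actual = pipeline(case.query)
        # Case-insensitive string similarity between expected and actual.
        score = SequenceMatcher(None, case.expected.lower(), actual.lower()).ratio()
        if score < threshold:
            failures.append({"id": case.id, "score": round(score, 2), "actual": actual})

    total = len(cases)
    passed = total - len(failures)
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "failures": failures,
    }


# Fake pipeline with one deliberately wrong answer.
canned = {"2+2?": "4", "Capital of France?": "Lyon"}
cases = [
    TestCase("q1", "2+2?", "4"),
    TestCase("q2", "Capital of France?", "Paris"),
]
print(run_suite_sketch(cases, lambda query: canned[query]))
```

Keeping the failing cases in the summary, not just the pass rate, makes regressions much easier to debug when the suite grows.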

---
## Self-Check

- [ ] Can implement heuristic checks
- [ ] Can calculate string and semantic similarity
- [ ] Can use LLM-as-judge pattern
- [ ] Can evaluate retrieval quality
- [ ] Can check response faithfulness
- [ ] Can build and run test suites