# Golden Sets Walkthrough

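This notebook walks through golden set evaluation, your first line of defense for AI quality. A golden set is a fixed list of queries, each paired with the tools, sources, and keywords a correct response should involve.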

## What You'll Learn

1. How golden sets are structured
2. How to run evaluations
3. How to interpret results
4. How to add new test cases

In [None]:
import sys
sys.path.insert(0, "../setup_agent")

import yaml
from pathlib import Path
from evaluator import run_test_case, load_golden_set, check_tools, check_must_contain

## 1. Understanding Golden Set Structure

Let's look at how a golden set test case is defined:

In [None]:
# Load the golden set
test_cases = load_golden_set()
print(f"Loaded {len(test_cases)} test cases\n")

# Look at the first test case
first_case = test_cases[0]
print("Example test case:")
print("-" * 40)
for key, value in first_case.items():
    print(f"{key}: {value}")
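
Before running anything, it can help to confirm that a test case is well-formed. The sketch below is not part of `evaluator.py`; `validate_test_case` is a hypothetical helper. It assumes the field names used by the custom test case in section 5, and it assumes that `id` and `query` are required while the list fields are optional.

```python
# Hypothetical validator: the field names mirror the test case dictionary in
# section 5; which fields are required is an assumption, not evaluator behavior.
REQUIRED_FIELDS = {"id", "query"}
LIST_FIELDS = {"expected_tools", "expected_sources", "must_contain", "must_not_contain"}


def validate_test_case(case: dict) -> list[str]:
    """Return a list of problems; an empty list means the case looks well-formed."""
    problems = []
    for field in sorted(REQUIRED_FIELDS):
        value = case.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{field}' must be a non-empty string")
    for field in sorted(LIST_FIELDS):
        value = case.get(field, [])  # a missing list field is treated as empty
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"'{field}' must be a list of strings")
    return problems
```

Calling `validate_test_case(first_case)` on a loaded case should return an empty list if the golden set follows this shape.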

## 2. Running a Single Test Case

Let's run one test case and see what happens:

In [None]:
# Run the first test case
result = run_test_case(test_cases[0], verbose=True)

print(f"Test ID: {result.id}")
print(f"Query: {result.query}")
print(f"Passed: {result.passed}")
print(f"Tools Used: {result.tools_used}")
print("\nChecks:")
print(f"  - Tool Check: {result.tool_check}")
print(f"  - Source Check: {result.source_check}")
print(f"  - Content Check: {result.content_check}")
print(f"  - Negative Check: {result.negative_check}")

if result.errors:
    print(f"\nErrors: {result.errors}")

print("\nResponse (first 500 chars):")
print("-" * 40)
print(result.response[:500])

## 3. Understanding the Checks

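Each test case can specify up to four checks: a tool check, a source check, a content check, and a negative check. The cells below call the tool and content check functions directly: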

In [None]:
# Tool check example
expected_tools = ["vector_search"]
actual_tools = ["vector_search"]

passed, error = check_tools(expected_tools, actual_tools)
print(f"Tool Check: {'PASS' if passed else 'FAIL'}")
print(f"Expected: {expected_tools}")
print(f"Actual: {actual_tools}")

# Try with wrong tools
actual_wrong = ["sql_query"]
passed, error = check_tools(expected_tools, actual_wrong)
print(f"\nWrong Tool Check: {'PASS' if passed else 'FAIL'}")
print(f"Error: {error}")
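
To make the `(passed, error)` return shape concrete, here is a minimal sketch of how a tool check could work. It is not the implementation in `evaluator.py`; `check_tools_sketch` is a hypothetical name, and the rule that every expected tool must appear among the tools actually used (extra tools are allowed) is an assumption.

```python
from typing import Optional


def check_tools_sketch(expected: list[str], actual: list[str]) -> tuple[bool, Optional[str]]:
    """Pass when every expected tool was used; extra tools are tolerated (assumed rule)."""
    missing = [tool for tool in expected if tool not in actual]
    if missing:
        return False, f"Missing expected tools: {missing} (used: {actual})"
    return True, None
```

A stricter variant would compare `set(expected) == set(actual)`, so that unexpected tool calls also fail the check.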

In [None]:
# Content check example
keywords = ["remote", "core hours", "500"]
response = "Our remote work policy includes core hours from 10 AM to 3 PM and a $500 stipend."

passed, error = check_must_contain(keywords, response)
print(f"Content Check: {'PASS' if passed else 'FAIL'}")
print(f"Looking for: {keywords}")
print(f"In response: {response}")

# Try with missing keyword
response_missing = "Our remote work policy includes core hours."
passed, error = check_must_contain(keywords, response_missing)
print(f"\nMissing Keyword Check: {'PASS' if passed else 'FAIL'}")
print(f"Error: {error}")
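
The content check and its negative counterpart can be sketched the same way. The functions below are hypothetical stand-ins, not the code in `evaluator.py`, and they assume case-insensitive substring matching, which the real checks may or may not use.

```python
from typing import Optional


def must_contain_sketch(keywords: list[str], response: str) -> tuple[bool, Optional[str]]:
    """Pass when every keyword occurs in the response (case-insensitive, assumed)."""
    text = response.lower()
    missing = [kw for kw in keywords if kw.lower() not in text]
    if missing:
        return False, f"Missing keywords: {missing}"
    return True, None


def must_not_contain_sketch(phrases: list[str], response: str) -> tuple[bool, Optional[str]]:
    """Pass when none of the forbidden phrases occur in the response."""
    text = response.lower()
    found = [p for p in phrases if p.lower() in text]
    if found:
        return False, f"Forbidden phrases present: {found}"
    return True, None
```

Substring matching is deliberately loose: the keyword `"500"` matches `"$500"`, but it would also match `"1500"`, so keywords should be chosen to be distinctive.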

## 4. Running Multiple Test Cases

Let's run the first five test cases and summarize how many pass:

In [None]:
# Run a subset of tests (first 5)
results = []
for tc in test_cases[:5]:
    print(f"Running {tc['id']}: {tc['query'][:40]}...")
    result = run_test_case(tc)
    results.append(result)
    status = "✓ PASS" if result.passed else "✗ FAIL"
    print(f"  {status}")
    if not result.passed:
        for err in result.errors:
            print(f"    - {err}")
    print()

# Summary
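# Use a distinct name so the boolean `passed` from the earlier cells is not overwritten
passed_count = sum(1 for r in results if r.passed)
print(f"Results: {passed_count}/{len(results)} passed")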
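
A single pass count hides where failures cluster. The sketch below groups results by the `category` field of each test case. `pass_rate_by_category` is a hypothetical helper, and it assumes only that each result exposes a boolean `passed` attribute, as the results above do.

```python
from collections import defaultdict


def pass_rate_by_category(cases: list[dict], results: list) -> dict[str, tuple[int, int]]:
    """Map each category to (passed, total); cases and results must be in the same order."""
    totals = defaultdict(lambda: [0, 0])
    for case, result in zip(cases, results):
        bucket = totals[case.get("category", "uncategorized")]
        bucket[0] += int(result.passed)  # count of passing cases
        bucket[1] += 1                   # count of all cases
    return {category: (p, t) for category, (p, t) in totals.items()}
```

With the variables from the cell above, `pass_rate_by_category(test_cases[:5], results)` would return one `(passed, total)` pair per category.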

## 5. Adding a New Test Case

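To add a new golden set test case, define it as a dictionary with the same fields as the existing cases and run it. The cell below runs the case in memory only; it does not add it to the set returned by `load_golden_set()`: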

In [None]:
# Define a new test case
new_test = {
    "id": "gs-custom-001",
    "category": "vector_search",
    "query": "What are the code review requirements?",
    "expected_tools": ["vector_search"],
    "expected_sources": ["code_review_standards.md"],
    "must_contain": ["review", "approval"],
    "must_not_contain": ["I don't know"],
}

# Run it
print("Testing new case:")
print(f"Query: {new_test['query']}")
print()

result = run_test_case(new_test)
print(f"Passed: {result.passed}")
print(f"Tools used: {result.tools_used}")
if result.errors:
    print(f"Errors: {result.errors}")
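
Once a new case passes, it has to be merged into the golden set. How the set is stored is not shown in this notebook, so the sketch below covers only the in-memory step: `add_test_case` is a hypothetical helper that rejects duplicate IDs and returns a new list rather than mutating the loaded one.

```python
def add_test_case(cases: list[dict], new_case: dict) -> list[dict]:
    """Return a new list with new_case appended; raise if its id is already taken."""
    existing_ids = {case["id"] for case in cases}
    if new_case["id"] in existing_ids:
        raise ValueError(f"Duplicate test case id: {new_case['id']}")
    return cases + [new_case]
```

For example, `test_cases = add_test_case(test_cases, new_test)` would extend the loaded set for the rest of the session. Saving the result is left to whatever storage `load_golden_set()` reads from.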

## Next Steps

Now that you understand golden sets:

1. Run the full suite: `uv run python evaluator.py`
2. Add test cases for edge cases you discover
3. Move on to **Stage 2: Labeled Scenarios** for broader coverage

```bash
cd ../stage_2_labeled_scenarios
cat README.md
```