27 evals
📖 This page is generated from
modules/27-evals/README.md. Edit the source, not the wiki; edits here are overwritten on the next sync. Run the hands-on labs from the repo, linked inline.
⬅ Previous: Module 26: Orchestrating Multiple Agents
You will swap the model. Evals are the only thing that tells you whether the swap was safe. This is the instrument that turns "the agent's output looks fine" into a number you can gate on, and it's where the whole course's thesis finally pays out.
This is the closer. It assumes the whole course, but it leans hardest on:
- Module 1: the thesis (the model is the cheap, swappable part; the workflow is the durable skill) and the `tasks-app` we've carried the whole way. This module is where the thesis gets its proof.
- Module 10, Reviewing Code You Didn't Write: the human review skill that evals partially automate, and partially replace once a human isn't in the loop.
- Module 13, Testing in the AI Era: you can write a deterministic pass/fail check. Evals are the next thing up the ladder: scoring output that a single test can't fully pin down.
- Module 14, Continuous Integration: running checks automatically on every change, with an exit code that gates. Evals run the same way and gate the same way.
- Modules 24–26, the Unit 5 agent ladder: assistive agents (24), autonomous-but-supervised agents (25), and orchestrated fleets (26). Evals are what decide how far up that ladder any given agent is allowed to climb.
By the end of this module you can:
- State precisely what an eval is and how it differs from a test, and when you need one instead of the other.
- Build a small eval set for a concrete agent task: representative cases plus a grader that turns output into a score.
- Score agent output programmatically, and use an LLM-as-judge where you must, honestly, knowing its failure modes.
- Run a regression eval across a model or prompt change and read whether the change was safe.
- Set a guardrail: tie an autonomy level to an eval score so an agent earns the right to act unattended instead of being granted it on faith.
Unit 5 walked the agent from your elbow into the pipeline: assisting you (Module 24), then acting under supervision (Module 25), then several of them at once (Module 26). Each step removed a human from a loop. So the question this module exists to answer is blunt:
An agent did work while you were asleep. How do you know it did good work?
"I read the diff" doesn't scale: the whole point of an unattended agent is that you weren't there. "CI passed" is necessary but thin. CI proves the code builds and your existing tests are green, not that the agent actually did the right thing, well, on the cases that matter. You need a way to measure agent output systematically, the same way every time, on a fixed set of cases, with a score you can compare across runs. That measurement is an eval.
An eval has exactly three parts. None of them are exotic:
- An eval set: a fixed list of representative cases. Inputs the agent will face, chosen to cover the normal path and the edges where it tends to fail.
- A grader: something that turns each case's output into a result. Pass/fail, or a score. The grader can be code (`==`, a regex, "does it compile, run, and produce this output") or, when the output is open-ended, another model (LLM-as-judge).
- An aggregate + a threshold: roll the per-case results into one number, and a line that number has to clear. "18/20 = 90%, and I require 90%."
That's it. An eval is a test suite pointed at agent behavior instead of a function, with a score instead of a single green check, run against a moving target (the model) instead of frozen code.
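The three parts fit in a few lines of Python. What follows is a minimal sketch, not the lab's harness: the `Case` class, the toy doubling task, and the two candidate functions are all hypothetical stand-ins chosen only to make the shape visible.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Case:
    """One hypothetical eval case: an input and the output we expect."""
    name: str
    argument: int
    expected: int


# Part 1: the eval set, a fixed list of cases (normal path plus edges).
EVAL_SET = [
    Case("small positive", 3, 6),
    Case("zero", 0, 0),
    Case("negative", -2, -4),
    Case("large", 10_000, 20_000),
]


# Part 2: the grader. Programmatic here: run the candidate, compare with ==.
def grade(candidate: Callable[[int], int], case: Case) -> bool:
    try:
        return candidate(case.argument) == case.expected
    except Exception:
        return False  # a crash counts as a failed case


# Part 3: aggregate + threshold. One number, and a line it has to clear.
def run_eval(candidate, cases: Sequence[Case], threshold: float):
    passed = sum(grade(candidate, case) for case in cases)
    score = passed / len(cases)
    return score, score >= threshold


def good_double(n: int) -> int:
    return n * 2


def sloppy_double(n: int) -> int:
    return abs(n) * 2  # looks right on the happy path, wrong for negatives


for candidate in (good_double, sloppy_double):
    score, cleared = run_eval(candidate, EVAL_SET, threshold=0.9)
    print(f"{candidate.__name__}: {score:.0%} cleared={cleared}")
```

The second candidate passes three of four cases, so it scores 75% and misses a 90% bar. Only the "negative" case separates the two, which is the sense in which the cases, not the grader, carry the judgment.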
This audience already writes tests (Module 13). The instinct to ask "isn't an eval just a test?" is correct enough to be dangerous. Where they diverge:
| A test (Module 13) | An eval | |
|---|---|---|
| Subject | Your code, frozen | An agent/model's output, which changes under you |
| Result | Binary: pass/fail | A score across many cases (90%, not "green") |
| Determinism | Same input → same output | Same input may give different output run to run |
| Failure meaning | The code is broken | The agent is less good, maybe still acceptable |
| What it gates | "Is the code correct?" | "Is this model/prompt good enough to trust here?" |
The practical upshot: a single failing case doesn't condemn an agent the way a failing unit test condemns code. You're measuring a rate. An agent that gets 19/20 right may be exactly what you want unattended on low-stakes work and nowhere near enough for high-stakes work. The eval gives you the rate; you set the bar per task.
And the inverse: where a deterministic test is possible, write the test, not an eval. Evals are
for the band of behavior tests can't pin down: open-ended output, judgment calls, "did it pick a
reasonable approach." Reaching for an LLM judge to grade something `==` could have caught is how you
get a slower, flakier, more expensive test that you trust less. (The lab's grader is deliberately
programmatic for exactly this reason.)
The eval set is the asset. The grader is plumbing; the cases are where the judgment lives, and a good set is mostly edges. Three sources fill it fast:
- The normal path: a couple of cases proving the agent does the obvious thing. These rarely catch anything; they're the floor.
- The edges you already know break: every "it looked right but" bug your agents have shipped is a permanent case. Module 13 left us a perfect one: an agent implemented `pending_count()` as `len(self.tasks)`. It passes any quick manual check (add three tasks, count says three) and is wrong the instant a task is marked done. That bug becomes case #4 in this module's lab and never escapes again.
- The cases you'd manually check anyway: write down the inputs you reflexively try when reviewing this kind of change. That list is your eval set; you've just been running it in your head and forgetting the results.
Keep it small and sharp. Twenty discriminating cases beat two hundred that all test the happy path. A case that every candidate passes tells you nothing; the cases that separate a good agent from a bad one are the whole value. And the eval set is code-adjacent data: commit it, review changes to it in PRs (Module 10), and grow it every time an agent surprises you. It is durable in exactly the way the syllabus means: it outlives every model it ever judges.
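One way to see which cases earn their place is to run several candidates over the set and keep track of the cases where they disagree. The sketch below assumes a hypothetical case format (a list of operations plus an expected count) and two stand-in task classes; it is not the contents of the lab's `eval_set.py`.

```python
# Hypothetical case format: operations to replay, then the expected pending count.
CASES = [
    {"name": "empty list", "ops": [], "expected": 0},
    {"name": "three added", "ops": [("add", "a"), ("add", "b"), ("add", "c")], "expected": 3},
    {"name": "one of two done", "ops": [("add", "a"), ("add", "b"), ("done", 0)], "expected": 1},
    {"name": "all done", "ops": [("add", "a"), ("done", 0)], "expected": 0},
]


class CorrectTasks:
    """Stand-in for a candidate that counts only unfinished tasks."""

    def __init__(self):
        self.tasks = []

    def add(self, title):
        self.tasks.append({"title": title, "done": False})

    def mark_done(self, index):
        self.tasks[index]["done"] = True

    def pending_count(self):
        return sum(1 for task in self.tasks if not task["done"])


class BuggyTasks(CorrectTasks):
    """Stand-in for the Module 13 bug: counts every task, done or not."""

    def pending_count(self):
        return len(self.tasks)


def run_case(candidate_cls, case):
    tasks = candidate_cls()
    for op, arg in case["ops"]:
        if op == "add":
            tasks.add(arg)
        elif op == "done":
            tasks.mark_done(arg)
    return tasks.pending_count() == case["expected"]


def discriminating_cases(cases, candidates):
    """Names of the cases on which the candidates do not all get the same result."""
    return [
        case["name"]
        for case in cases
        if len({run_case(candidate, case) for candidate in candidates}) > 1
    ]


print(discriminating_cases(CASES, [CorrectTasks, BuggyTasks]))
```

Here the first two cases pass for both candidates and separate nothing; only the cases that mark a task done expose the bug. A case that never shows up in this list across many candidates is a candidate for pruning, though the normal-path floor is still worth keeping.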
Two graders, in strict priority order.
Programmatic. If "correct" is checkable in code (exact value, output matches, exit code is 0, the file it shouldn't have touched is untouched), do that. It's deterministic, free, fast, and you trust it completely. Most of what an agent does to a codebase is checkable this way, because code either runs and produces the right thing or it doesn't.
LLM-as-judge. Some output has no `==`: "is this commit message clear?", "does this PR
description explain the change?", "is this refactor actually cleaner?" The standard move is to ask
another model to grade it against a rubric. It works, and sometimes it's the only option, but be
honest about what you've built:
- Correlated blind spots. A judge is a model grading a model. It can share the candidate's confusion and pass a wrong answer because both are wrong the same way. Your grader and the thing it grades are not independent.
- Bias. Judges favor longer, more confident, and first-presented answers regardless of correctness. Control for position and length or your scores measure verbosity.
- Drift. Swap the judge model and your scores move while the candidate didn't change. The ruler is made of rubber, which is poison for regression evals, whose entire job is to hold the ruler still.
So when you must use a judge: pin it (fixed model, temperature: 0), keep it separate from the
model under test, and calibrate it against human labels: hand-grade ~20 examples, run the judge
on the same 20, and confirm it agrees with you before you let it gate anything. An uncalibrated
judge is a vibe with a number attached. The lab ships a model-agnostic judge stub (llm_judge.py)
that abstains until you point it at your own endpoint, with these limits written into the file.
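The calibration step can be as small as comparing two lists of labels. This is a sketch under stated assumptions: the labels are made-up booleans (pass/fail), the 90% agreement bar is a hypothetical choice, and none of it reflects what `llm_judge.py` actually contains.

```python
MIN_AGREEMENT = 0.9  # hypothetical bar; pick your own before trusting a judge


def agreement_rate(human, judge) -> float:
    """Fraction of examples where the judge's label matches the human's."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def false_passes(human, judge) -> list[int]:
    """Indexes the judge passed but a human failed: the dangerous direction."""
    return [i for i, (h, j) in enumerate(zip(human, judge)) if j and not h]


# Made-up labels for 20 hand-graded examples (True = acceptable output).
human_labels = [True] * 12 + [False] * 8
judge_labels = [True] * 12 + [True] * 3 + [False] * 5

rate = agreement_rate(human_labels, judge_labels)
print(f"agreement: {rate:.0%}")
print("judge passed what a human failed at:", false_passes(human_labels, judge_labels))
print("fit to gate:", rate >= MIN_AGREEMENT)
```

With these invented labels the judge agrees on 17 of 20 examples, which falls short of the bar, and every disagreement is a false pass. Reading the disagreements matters as much as the rate: a judge that errs toward passing bad output is the failure a gate can least afford.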
Here is where the course thesis stops being a slogan and becomes a procedure.
You will swap the model. A cheaper one ships, your provider deprecates the one you're on, a new release benchmarks better, someone edits the agent's prompt or its committed instructions file (Module 5). Every one of those changes the behavior of every agent you run, silently. The code around the model didn't change; the model did, and the model is the part you don't control.
A regression eval is the discipline of running the same eval set before and after the change and comparing the scores. The current model/prompt earns a baseline score. After the change (a new model, a new prompt), the same eval set runs again and the two scores get compared. A score that held or rose means the swap is safe by this eval; a score that dropped is a regression caught before it ran unattended against real work, not after.
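A regression comparison can be sketched as a diff over per-case results. The case names and results below are invented, and the per-case breakdown is an addition of this sketch rather than something the text above requires; it is there because an unchanged score can hide one case breaking while another gets fixed.

```python
def score(results: dict) -> float:
    """Fraction of cases that passed."""
    return sum(results.values()) / len(results)


def compare(baseline: dict, after: dict) -> dict:
    """Compare two runs of the same eval set, keyed by case name."""
    if baseline.keys() != after.keys():
        raise ValueError("eval set changed between runs; scores are not comparable")
    return {
        "baseline": score(baseline),
        "after": score(after),
        # Safe by this eval: the score held or rose.
        "safe_by_score": score(after) >= score(baseline),
        # Cases that passed before the change and fail after it.
        "regressed": sorted(n for n in baseline if baseline[n] and not after[n]),
        # Cases that failed before the change and pass after it.
        "fixed": sorted(n for n in baseline if not baseline[n] and after[n]),
    }


# Invented results: True means the case passed.
before_swap = {"a": True, "b": True, "c": True, "d": False, "e": True}
after_swap = {"a": True, "b": False, "c": True, "d": True, "e": True}

report = compare(before_swap, after_swap)
for key, value in report.items():
    print(f"{key}: {value}")
```

In this invented run the score holds at 80%, so the swap is safe by the score alone, yet case `b` regressed while case `d` was fixed. Whether that trade is acceptable is a judgment about which cases matter, which is why the sketch refuses to compare runs over different eval sets.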
This is the answer to "the model is swappable." It's swappable because the eval set is what makes swapping safe. Your prompts, your pipeline, your review reflexes, and, most of all, your eval set don't expire when the model does. They're the durable skill the course promised in Module 1. The model is a component you can replace; the eval is the regression test that tells you the replacement fits. That's the whole argument, made operational.
The last piece, and the real subject of Unit 5: how much is this agent allowed to do without a human? Don't answer that by gut. Answer it with the eval score, and make the score gate the autonomy.
| Eval score on this task | Reasonable autonomy (the Unit 5 ladder) |
|---|---|
| Low / unmeasured | Assistive only; it suggests, a human decides (Module 24). |
| Solid, below your bar | Autonomous but fully gated; opens a PR, a human reviews and merges (Module 25). |
| At/above bar, stable across runs | Unattended on this narrow task, landing behind CI + the eval as a gate. |
| High across a broad set, held over time | Orchestrate it; let it run in a fleet (Module 26). |
Two things make a guardrail bite:
- The threshold blocks. The eval returns an exit code; below-bar exits non-zero and stops the pipeline exactly like a failing test (Module 14). The lab does this. An eval whose result nobody is forced to act on is a dashboard, not a guardrail.
- Autonomy is per-task, not per-agent. The same model can be trustworthy enough to merge doc fixes unattended and nowhere near enough to touch auth code. You hold a different eval and a different bar for each. "Trust the agent" is the wrong granularity; "trust this agent, on this task, to this score" is the right one.
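Both points can be sketched together: a per-task bar, a mapping from score to a rung of the ladder, and an exit code that blocks. The task names, the bars, and the `SOLID_FLOOR` cutoff are hypothetical, and the sketch leaves out the fleet rung and the "stable across runs" condition from the table.

```python
from enum import Enum
from typing import Optional


class Autonomy(Enum):
    ASSISTIVE = "suggests only; a human decides"
    GATED_PR = "opens a PR; a human reviews and merges"
    UNATTENDED = "lands behind CI and the eval gate"


# Hypothetical per-task bars: same agent, different bar for each task.
BARS = {"doc-fixes": 0.90, "auth-code": 1.00}
SOLID_FLOOR = 0.70  # hypothetical cutoff for "solid, below your bar"


def autonomy_for(task: str, score: Optional[float]) -> Autonomy:
    bar = BARS.get(task)
    if score is None or bar is None:
        return Autonomy.ASSISTIVE  # unmeasured work earns no autonomy
    if score >= bar:
        return Autonomy.UNATTENDED
    if score >= SOLID_FLOOR:
        return Autonomy.GATED_PR
    return Autonomy.ASSISTIVE


def gate_exit_code(task: str, score: Optional[float]) -> int:
    """0 lets an unattended pipeline proceed; 1 blocks it, like a failing test."""
    return 0 if autonomy_for(task, score) is Autonomy.UNATTENDED else 1


for task in ("doc-fixes", "auth-code"):
    level = autonomy_for(task, 0.95)
    print(f"{task} at 95%: {level.name}, exit code {gate_exit_code(task, 0.95)}")
```

The same 95% score clears the bar for `doc-fixes` and falls short for `auth-code`, so the gate passes one and blocks the other. In a pipeline the returned code would be handed to the process exit so that a below-bar score stops the job.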
Every other module made a tool more valuable because you're using AI. This module closes the argument the course opened with.
Module 1 claimed the model is the cheap, swappable part and the workflow is the durable skill. Every module since has been an installment on that claim: version control, review, CI, containers, secrets, MCP, agents. Evals are where it's proven. An eval set is, literally, a model-agnostic instrument: it judges output without caring which model produced it, which is exactly why it survives the swap that retires the model. You don't trust an agent because you trust the vendor or this quarter's benchmark; you trust it because your eval, on your cases, scored it above your bar, and you'll re-run that same eval the day the model changes under you, which it will.
That's the durable skill. Models are weather. The eval set is the thermometer you keep.
Starting point (this lab is skip-friendly). This lab is self-contained and does not depend on the earlier labs. Its files live in `modules/27-evals/lab/`. Copy them into a working folder and make a first commit so you start clean:

```bash
cp -r ~/ai-workflow-course/modules/27-evals/lab ~/ai-workflow-course/27-evals-lab
cd ~/ai-workflow-course/27-evals-lab && git init -b main && git add -A && git commit -m "start: module 27"
```
Lab language: Python + shell. You'll run a tiny eval harness, point an agent at a task, and run a regression eval across a "model swap."
The lab files are in `lab/`:
- `eval_set.py`: five cases for the `pending_count` task (data only).
- `run_eval.py`: the runner; it imports a candidate, scores it, prints a scorecard, and exits non-zero below threshold.
- `candidates/current_model/tasks.py`: a correct candidate (stand-in for your current model's output).
- `candidates/swapped_model/tasks.py`: a plausible-but-wrong candidate (stand-in for a bad swap).
- `llm_judge.py`: a model-agnostic LLM-as-judge stub, with its limits written in.
You'll need: Python 3.10+, the tasks-app you've carried since Module 1, and Claude Code (sub
your own agent). No API key or paid model is required to complete the lab; the bundled candidates let
the regression demo run offline. The real payoff comes when you replace them with your own agent's
output.
- From the lab folder, run the eval against the passing candidate:

  ```bash
  cd modules/27-evals/lab
  python3 run_eval.py candidates/current_model
  echo "exit code: $?"
  ```

  Five cases pass, the score is 100%, and the exit code is `0`. This is your baseline: the score the current model earns on this task. Read the cases in `eval_set.py`: notice case #4, "completed tasks are NOT pending." That's the Module 13 bug, now a permanent case.
- Now simulate the swap: run the exact same eval set against the other candidate:

  ```bash
  python3 run_eval.py candidates/swapped_model
  echo "exit code: $?"
  ```

  It drops to 60% and exits `1`. Look at which cases failed: the easy ones still pass; this output would sail through a casual manual check. The eval caught a regression that a skim would have missed, and the non-zero exit code means a pipeline would have blocked it. That is a guardrail doing its job.
- Open your `tasks-app` and tell Claude Code (sub your own agent) to implement (or re-implement) `pending_count()` and write its version straight into `candidates/my_run_1/tasks.py`, creating the folder if it doesn't exist. You direct; the agent does the file plumbing. Then run the eval yourself and read the scorecard:

  ```bash
  python3 run_eval.py candidates/my_run_1
  ```
- Now actually swap something. Either change the model Claude Code uses, or change the prompt (ask the same thing a different way, or tweak your committed instructions file from Module 5). Have the agent write this run into `candidates/my_run_2/`, then run `run_eval.py` yourself and compare the two scores. You just ran a regression eval on a real model/prompt change and got a number that tells you whether the change was safe. If a run scores below 100%, read the failing case and direct the agent to append the input that broke it as a new permanent case in `eval_set.py`; verify the case it added. The set gets sharper every time an agent surprises you.
- (Optional, needs a model endpoint.) Open `llm_judge.py`, read the limits at the bottom, set the `EVAL_JUDGE_*` environment variables to your own endpoint, and grade an open-ended output, say a commit message your agent wrote. Note how much shakier that score feels than the programmatic one. That feeling is correct, and it's why programmatic graders come first.
- Decide the autonomy for this task using the ladder in Key concepts. Write one sentence: "`pending_count` changes may merge unattended only when `run_eval.py` scores 100%; otherwise a human reviews." Then make it enforceable. This is one job in a CI workflow (Module 14), so direct Claude Code (sub your own agent) to add an eval-gate job to the workflow it already wired up in Module 14, running the same command from Parts A–B. The job it adds should look like this:

  ```yaml
  - name: Eval gate
    working-directory: modules/27-evals/lab
    run: python run_eval.py candidates/current_model --threshold 1.0
  ```

  Review the diff before you accept it, and confirm the path logic is right. The `working-directory:` line makes the CI job `cd` into the lab folder first, so the `candidates/...` path and `run_eval.py`'s own `from eval_set import CASES` resolve exactly as they did on your machine. (Drop it and point a repo-root job straight at `python3 modules/27-evals/lab/run_eval.py candidates/current_model`, and `candidates/` won't exist from the repo root: the gate crashes with a false failure, which is worse than no gate. If the agent prefers a single line, it can spell both paths out from the repo root: `python3 modules/27-evals/lab/run_eval.py modules/27-evals/lab/candidates/current_model --threshold 1.0`.)

  Below threshold exits non-zero and the pipeline blocks, exactly like a failing test. The guardrail is now structural, not a promise.

  One honest caveat, or this gate guards nothing. `candidates/current_model` is the bundled, always-correct stand-in: it scores 100% on every run, forever, so a gate pointed at it can never fail. That's a dashboard, not a guardrail: the exact trap this section warns about. In a real pipeline, point the gate at the candidate that actually varies: your agent's real output for this task (the `candidates/my_run_2` you made in Part C, or wherever your pipeline writes the model's output before merge). Prove the gate bites by aiming it at `candidates/swapped_model`: the same command drops to 60%, exits `1`, and blocks the merge.
The honesty this course has insisted on all the way through applies hardest to its own closer.
- Evals measure what you put in them, and nothing else. A 100% score means the agent passed your cases, not that it's correct in general. The gap between "passes my eval" and "is actually good" is exactly the cases you didn't think to write. An eval set is a lower bound on quality, never a proof. Treat a green eval as "no known regression," not "verified correct."
- Eval sets rot. Cases that no model ever fails stop discriminating; tasks drift away from what you actually do. An eval set you don't prune and grow becomes a comforting green light that's measuring last year's problems. Budget maintenance for it like any other test suite.
- LLM-as-judge is a model grading a model. Re-read that section: correlated blind spots, bias, and drift are not edge cases, they're the default behavior. An uncalibrated judge can hand you a confident wrong score, which is worse than no score. Where you can grade in code, do.
- A score is not a decision. The eval tells you the rate; you still set the bar, and the right bar depends on stakes the eval can't see. 95% might be plenty for triaging issue labels and reckless for anything touching auth, money, or customer data. The number informs the judgment; it doesn't replace it.
- Evals don't catch novel harms, only measured ones. A genuinely new failure mode (a class of mistake no case anticipates) passes every eval until the day it doesn't and you add the case after the fact. Evals make agents trustworthy on known territory. They are not a substitute for the recovery muscles (Module 12) that exist for when something gets through anyway.
You're done when:
- You can explain the difference between a test and an eval, and say when you'd reach for each.
- You've run `run_eval.py` against both bundled candidates and watched the same eval set pass one and fail the other, including the exit code flipping to `1`.
- You've graded your own agent's output, then changed the model or prompt and re-run the same eval set as a regression check, and you can read the before/after scores as "safe" or "not safe."
- You can state, for one concrete task, the eval score that would let an agent act unattended on it, and where that threshold would live in your pipeline.
- You can say, in your own words, why the eval set is the durable skill and the model is the swappable part. That's the whole course in one sentence, and you can now run it from the keyboard.
That's the close. You started by copy-pasting out of a chat window; you're ending by letting an agent act without you and holding a measured, enforceable line on whether to trust it. The model under that line will change many times. The line is yours to keep.
This is an expansion-zone module over fast-moving ground. Re-check at build/publish time:
- No vendor pinned. Confirm the module text, lab, and `llm_judge.py` still name no specific LLM provider, model id, or pricing, and that `llm_judge.py`'s endpoint config is still generic (env-var driven, OpenAI-style-compatible but not branded).
- Eval frameworks named. If the module names any eval framework or LLM-as-judge tool by name (it currently names none on purpose), verify it still exists and behaves as described. Prefer keeping it tool-agnostic.
- LLM-as-judge claims. The bias/drift/correlation caveats are durable, but re-check that no cited best practice (e.g., calibration-against-human-labels guidance) has been superseded.
- Module cross-references. Confirm Modules 13, 14, 10, and 24–26 still carry the responsibilities referenced here (tests, CI gating, review, the agent autonomy ladder) and that none were renumbered.
- Lab still runs. `python3 run_eval.py candidates/current_model` exits 0 at 100%, and `candidates/swapped_model` exits 1 below threshold, on a current Python 3.x.
Continue to: Capstone: The Full Loop ➡
Generated from the ai-workflow-course repo • the model is the cheap, swappable part; the workflow is the durable skill.