After assert Breaks Down — SwarmAI's Eval Architecture and Methodology
#83
xg-gh-25
started this conversation in
Show and tell
TL;DR
Traditional software relies on `assert` + a CI red/green light to guarantee "no regressions." Agents can't: non-determinism breaks `assert`, prompt-as-source-code has no diff/review/rollback, and dependencies drift on their own (the model updates silently; you deployed nothing, but behavior changed).
SwarmAI's answer: Eval is the Agent-era replacement for `assert`, and it must be a git-bound hard gate.
This piece covers three things:
1. Why Agents Need to Redefine "Testing"
Source: AWS's official Eval-First: Enterprise Agents with AgentCore (Summit 2026-06, public repo, MIT-0). Its thesis in one line:
Three root causes that break traditional software engineering on agents:
`assert x == 5` always holds → `assert` dies, only eval works
This is why SwarmAI treats Eval as the Agent-era implementation of the TDD pillar in the AIDLC methodology: when `assert` breaks, Eval takes over as the deployment gate.
2. Architecture: Eval as a System-Level Decoupled Subsystem
2.1 The single most important decision
The whole architecture is decided by one sentence (the product owner's directive):
Why does this reorganize everything? Because previously eval was welded to the governance system via tags like `affected_by: STEERING.R1` plus reading DDD content at runtime. That coupling caused three diseases:
... `STEERING.R1`, so the case can't run
Solution: cut the umbilical cord.
2.2 The decoupling invariant (the spine of the architecture)
One boundary I have to state plainly (otherwise it's a lie): the LLM-judge capability path does read live STEERING/SOUL/AGENT at verdict time, and this is intentional: the judge needs to know which rules the agent actually follows in order to judge "would it be compliant?" (`eval_runner.py:_load_rules_context`; the judge prompt comment reads verbatim "so it knows what rules exist"). That path runs only in nightly monitoring and is never a gate. So "decoupling" holds in a more precise, two-part form: (1) public cases have zero governance references; (2) the gating `ci_eval_gate` only checks digest+report, never reads DDD. The judge capability path reading live rules is by design, not coupling.
The success metric (executable, pointing at files that actually exist):
2.3 Physical structure
Code side (public repo):
`backend/scripts/eval_runner.py`
`backend/core/eval_service.py`
`backend/scripts/ci_eval_gate.py`
`s_golden-case` + `golden_case_validator.py`
`desktop/src/pages/EvalDashboard.tsx`
2.4 The bridge: a one-way wire
DDD and eval are decoupled but not disconnected — they're bridged by a tool, not a dependency. Direction is everything:
`s_golden-case` (or the auto-seed hook) reads DDD to author the case, then bakes the needed context into the case as literal text. The case carries its own expected behavior, not a "pointer to STEERING.R1" for eval to resolve later.
2.5 Public/Private split = decoupling as a security boundary
The discriminator is one question: what does this case depend on?
Fail-closed three layers (to prevent a privacy-leak recurrence):
`get_case_detail` only allowlists safe metadata for private cases. A denylist leaks every newly-added field by default; an allowlist fails closed (an unanticipated field is dropped, never exposed).
Misclassification fails toward safety: a misjudged case stays private (you lose a little shareability); it never leaks (which would be the disaster). Public is earned, not the default.
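To make the allowlist-versus-denylist point concrete, here is a minimal sketch. It is not the real `get_case_detail`: the field names, the `visibility` key, and both function names are assumptions made up for illustration.

```python
# Hypothetical allowlist; the real set of safe fields is not shown in this post.
SAFE_PRIVATE_FIELDS = frozenset({"id", "type", "status", "visibility"})


def project_case(case: dict) -> dict:
    """Return the fields that may be shown for a case (allowlist, fail closed)."""
    if case.get("visibility") == "public":
        return dict(case)
    # Anything not explicitly public is treated as private, so a
    # misclassified or unlabeled case stays hidden rather than leaking.
    return {k: v for k, v in case.items() if k in SAFE_PRIVATE_FIELDS}


def project_case_denylist(case: dict, denied=frozenset({"prompt"})) -> dict:
    """The contrasting denylist approach: every field nobody thought of leaks."""
    return {k: v for k, v in case.items() if k not in denied}
```

With a private case that later gains a new field (say `expected_output`), `project_case` drops it without any code change, while `project_case_denylist` exposes it until someone remembers to extend the denylist.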
3. Methodology: How to Do Agent Eval
This section maps directly to the hardest, most reusable design in AWS's official methodology: the two-pillar framework.
3.1 Pillar one · Three granularities (mapping to session/trace/span)
Rule of thumb: score by outcome, give partial credit, attribute via trajectory rather than exact-sequence matching.
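One way to read "partial credit, not exact-sequence matching" is an in-order overlap score. The sketch below uses a longest-common-subsequence ratio; that choice of metric, the function names, and the tool names in the usage note are illustrative assumptions, not the scorer SwarmAI or AWS actually ships.

```python
def lcs_len(a: list, b: list) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def trajectory_score(expected: list, actual: list) -> float:
    """Fraction of expected steps that appear in `actual` in the same order.

    Extra steps in `actual` are not penalized; missing steps lose credit.
    """
    if not expected:
        return 1.0
    return lcs_len(expected, actual) / len(expected)
```

An agent that takes an extra detour (`["Read", "Grep", "Edit"]` against an expected `["Read", "Edit"]`) still scores 1.0, where exact-sequence matching would fail it; an agent that skips a step gets proportionally less.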
3.2 Pillar two · Three evidence-weight tiers (the most overlooked, most critical)
The two pillars are orthogonal → a 3×3 matrix. "Choosing metrics = picking the cells for your business, not turning them all on."
3.3 Three scorer types (fill the matrix, align to the evidence tier)
LLM-judge biases (don't use it naked):
3.4 Capability eval vs Regression eval (we used to conflate these — biggest takeaway)
This is the distinction you should think through first in the whole methodology:
A mature capability case "graduates" into the regression suite → into CI.
3.5 Where we stand (verified against code)
`eval_runner.py` judge pinned, T=0.0
Honest about the gaps: the L3 reject tier, bidirectional judge, and PoLL are all not done yet; they're deferred "legibility-track" items that don't touch the core gate.
3.6 Golden Set = Eval IP
Real production data + expert labeling, 4 scenario classes (common/edge/compliance/escalation), failed cases flow back in. Start from ~20, look at the data before scoring (error analysis: cluster failures → distill the rubric). Path: 20→100→500. Anti-patterns: rubric-before-data, deploying an LLM-judge too early, a frozen golden set, synthetic data masking reality.
4. Gate Trade-offs: The Part Most Worth Explaining
A gate's entire value is in where it sits, what it binds to, and when it lets through. We've stepped on the rakes; every trade-off has a cost.
4.1 What to bind: bind INPUTS, not HEAD
The fatal bug in v1 (caught by adversarial review): the gate criterion was written as `report.git_commit == HEAD`. But the report itself is a git-tracked file: commit the report and HEAD advances, so `git_commit != HEAD` → the gate is permanently red. A self-contradicting fixed point.
The right way: bind to the inputs eval depends on (code + golden_set), not which commit it lives in. Like `uv.lock` hashing its inputs rather than hashing itself:
    {
      "code_digest": "<sha256 of git ls-tree over eval-relevant paths + golden_set content>",
      "tree_dirty_at_run": false,
      "bvt": { "total": N, "passed": N, "failed": 0, "error": 0, "green": true }
    }
Why digest-of-inputs:
`EvalHistory/` isn't in the digest → digest unchanged → gate stays green (fixes the fixed-point)
A change to `golden_set.yaml` (which defines what the gate tests) → digest changes → report stale → gate blocks (closes the "quietly edit a case to bypass a green gate" hole)
`code_digest` uses `git ls-tree` (not byte-wise hashing): it respects `.gitignore`, needs O(1) subprocess calls, and scopes only eval-relevant paths, not all code. Because public cases don't reference DDD, the digest never depends on DDD → changing STEERING never makes the gate stale. Decoupling is precisely the prerequisite that makes a scoped digest correct.
4.2 Who gets in: BVT is a derived view, not a hand-maintained list
BVT (Build Verification Test) = the regression gate set, defined as a derived view (anti-rot):
`file_contains`, `keyword_match`, `trajectory_*`, plus `canary_pass` (~3s shell; deterministic, just slow)
`runtime_health`: it spawns the full session graph + SDK, 30s timeout, flaky under load → would poison the gate → goes to nightly
`draft` → 4-gate validation → leaves draft → auto-joins BVT (riding the existing stable-promotion mechanism, no new system)
Gate criterion (regression = zero tolerance, binary):
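As an illustration only, a derived view plus a binary criterion could look like the sketch below. The case shape (`type`, `status`, `body`, `validated_by_4gate`), the assumption that the stamp stores the sha256 of the body (see §5), and the rule that an empty suite is not green are all mine; only the scorer type names and the `bvt` counter names come from this post.

```python
import hashlib

# Deterministic scorer types taken from the list above; runtime_health is left out.
GATE_TYPES = ("file_contains", "keyword_match", "canary_pass")


def body_sha256(body: str) -> str:
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def in_bvt(case: dict) -> bool:
    """Membership is computed from the case itself, never from a hand-kept list."""
    gate_type = case["type"] in GATE_TYPES or case["type"].startswith("trajectory_")
    # Assumed stamp format: the sha256 of the body that passed the 4 gates.
    stamped = case.get("validated_by_4gate") == body_sha256(case["body"])
    return gate_type and case.get("status") != "draft" and stamped


def bvt_view(cases: list) -> list:
    return [c for c in cases if in_bvt(c)]


def bvt_green(bvt: dict) -> bool:
    """Binary: every case passed, nothing failed, nothing errored."""
    return (
        bvt["total"] > 0  # assumption: an empty suite proves nothing
        and bvt["passed"] == bvt["total"]
        and bvt["failed"] == 0
        and bvt["error"] == 0
    )
```

Because membership is recomputed on every run, a case edited after validation silently falls out of the view until it is re-stamped, which is the anti-rot property.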
Not `score ≥ threshold`: BVT is regression, so it is all green or block, binary and clear.
4.3 Which layer it sits at: build doesn't block, only release does ⭐
This is the newest and most counterintuitive trade-off.
V1 put the gate at step 0 of `prod.sh build`. It looked right: "eval must pass before build." But in practice: `build` is a high-frequency dev action (I ran it several times in one session), and gating build badly slowed iteration.
The product owner's one-line correction: "build can't block, only release can."
Behind it is a generalizable rule:
The discriminator signal: if a gate fires on an action you do dozens of times a day, it's at the wrong layer.
After moving it, it's pure profit — faster dev loop + unchanged release safety. The gate now lives at:
`prod.sh`'s `cmd_release` / `cmd_release_hive` + `s_swarm-release`'s PREFLIGHT
`_eval_gate()` helper, ensuring no release path runs unguarded (adversarial review caught that the `release-hive` independent release path originally had no gate at all)
4.4 Three-state semantics: why not two states
The gate returns three states, not a simple pass/fail:
Why the soft exit-2 state? Because without it, a fresh clone / bootstrap (eval never run, no report) could never release. Three states are the key to "shippable."
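A minimal sketch of the three-state decision follows. Only the soft exit 2 ("no report yet") is taken from the text; treating 0 as pass and 1 as block, and the names `GateState` and `gate_state`, are assumptions for illustration. The report fields are the ones shown in §4.1.

```python
from enum import IntEnum
from typing import Optional


class GateState(IntEnum):
    PASS = 0       # assumed: report is fresh and green
    BLOCK = 1      # assumed: report is stale or red
    NO_REPORT = 2  # soft state from the text: eval has never been run


def gate_state(report: Optional[dict], current_digest: str) -> GateState:
    """Classify a committed report against the digest of the current inputs."""
    if report is None:
        # Fresh clone / bootstrap: not a failure, but not a pass either.
        return GateState.NO_REPORT
    fresh = report.get("code_digest") == current_digest
    green = report.get("bvt", {}).get("green") is True
    return GateState.PASS if fresh and green else GateState.BLOCK
```

Collapsing `NO_REPORT` into `BLOCK` would make a fresh clone unreleasable, and collapsing it into `PASS` would let an unevaluated tree ship silently; keeping it separate lets the caller decide (prompt a human, or fail closed when there is no TTY).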
The `set -e` trap (caught by adversarial review): `_x=$(cmd)` will abort the whole script at the assignment on a non-zero return, before `$?` can be read, so the gate call must be wrapped in `set +e` / `set -e`.
The TTY guard: on exit 2 the script can't `read` user input under CI (no TTY); it would hang or silently die on EOF, so it fails closed with an explicit error. Escape hatch: `SWARMAI_SKIP_EVAL_GATE=1` (CI / emergency).
The once-per-process guard (`_EVAL_GATE_PASSED`): `release-all` calls `cmd_release` then `cmd_release_hive`, and both pass through the gate; the guard avoids asking twice on exit 2.
4.5 Three mount points
`ci_eval_gate.py` pure-checks the committed report vs HEAD
`ci_eval_gate.py`
Key design: CI never runs eval (zero Bedrock cost); it only verifies the committed report is fresh and green. Eval is actually run by the developer locally + nightly. This is the lock-file pattern: CI checks the lockfile, it doesn't re-resolve dependencies.
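The "fresh" half of that check rests on the input digest from §4.1. Below is a rough sketch of how such a digest could be computed cheaply, following the description "sha256 of git ls-tree over eval-relevant paths + golden_set content". The function names, the separator byte, and the example paths are assumptions; this is not the code in `ci_eval_gate.py`.

```python
import hashlib
import subprocess


def ls_tree(paths: list, rev: str = "HEAD") -> str:
    """One git call: modes, blob ids, and names of tracked files under `paths`."""
    return subprocess.run(
        ["git", "ls-tree", "-r", rev, "--", *paths],
        check=True, capture_output=True, text=True,
    ).stdout


def input_digest(ls_tree_output: str, golden_set: bytes) -> str:
    """Hash the scoped tree listing plus the golden set content."""
    h = hashlib.sha256()
    h.update(ls_tree_output.encode("utf-8"))
    h.update(b"\0")  # separator so the two inputs cannot run together
    h.update(golden_set)
    return h.hexdigest()
```

A caller would do something like `input_digest(ls_tree(["backend/scripts"]), golden_bytes)` (paths hypothetical) and compare the result with the report's `code_digest`. Since the listing already contains git's blob ids, no file is re-read or re-hashed, and anything outside the scoped paths, including the committed report itself, cannot move the digest.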
5. Case Intake: The Only Sanctioned Path to Add a Case
A diluted golden set = a dead gate. Quality must be enforced at intake time, not prayed for.
The `s_golden-case` skill is the only sanctioned path to add/edit a case, with 4 quality gates (a case can't leave `draft` unless it passes all of them):
G3 is honest about its own limits: L1/programmatic cases do a real mutation test (run with deliberately-broken input, confirm it goes red); behavior cases can only do a weaker judge-discrimination (feed the judge a hand-written bad trajectory + a real good one, confirm the judge can separate them). This tests "the judge can discriminate," not "the case catches a genuinely-misbehaving agent": a genuinely weaker guarantee, and we don't pretend otherwise.
The `validated_by_4gate` stamp is content-bound (sha256 of the case body): edit a case and the stamp drops until re-validation. This structurally guarantees: auto-seed → draft (no stamp) → not in BVT; hand-edit the yaml → active but no stamp → not in BVT. BVT eligibility is earned only by passing the 4 gates.
6. Sources (all verified, not from memory)
AWS official methodology:
`docs/enterprise-agent-guide-series-ENGLISH.md`
Academic papers:
Two custom evaluators (real AWS-validated code, not slideware):
`interplay.py`
7. AIDLC Wrap-up: Why This Is a Leap, Not a Chore
The AIDLC stack = DDD (business understanding) + SDD (intent → spec) + TDD (binary verification).
In the Agent era, traditional TDD's `assert` breaks (non-determinism, prompt-as-code, dependency drift). Eval is the Agent-era replacement for the TDD gate. By git-binding the regression BVT to push + release, SwarmAI becomes living proof that "the AIDLC third pillar is runnable, not theory," and it's directly pitchable to enterprise customers facing the same wall: "When I can't `assert`, how do I gate agent deployment?"
The success metric is still those two grep lines (pointing at real files, not an empty directory):
(The judge capability path reading live rules is by design and not part of this metric — see the boundary note in §2.2.)
8. Real Golden-Case Samples: How to Design a Case With Teeth
By now you might be wondering: what does a case actually look like? Below are three real cases (taken straight from our golden set), ordered from weakest to strongest "verdict power." The core principle of case design is one sentence: assertions land on "behavior/fact," not "a string happens to appear" — otherwise it's test-theater (testing nothing).
Sample 1: `file_contains` - lightest, anchors a code invariant
The cheapest case type: assert a fact holds in the code. Zero Bedrock cost, millisecond-level, good for guarding "architectural invariants."
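For readers who want to picture the scorer behind this case type, here is a minimal sketch. It is not the real case or the real scorer: the function name and its parameters are assumptions.

```python
from pathlib import Path


def check_file_contains(repo_root: Path, rel_path: str, needle: str) -> bool:
    """GREEN only if the file exists, is readable text, and contains `needle`."""
    try:
        text = (repo_root / rel_path).read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError):
        # A missing, unreadable, or binary file counts as RED, not as an error to skip.
        return False
    return needle in text
```

A check like this needs no model call at all, which is where the "zero Bedrock cost, millisecond-level" profile comes from.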
Why it has teeth: if someone renames or deletes `SessionRouter`, this case goes RED immediately. `affected_by` makes it enter the BVT subset only when the relevant file changes; that's the key to "re-run on demand."
Sample 2: `canary_pass` - medium, proves a capability "still runs"
Assert a script/capability can be loaded and executed without erroring. One notch stronger than `file_contains`: it executes a real code path, not just checks for text.
Why it has teeth: a broken import chain, a missing dependency, a changed signature all turn it RED. Note `expected_contains: OK` is what the script prints after it actually executes, not a string in the source, so it tests "does it run," not "is this line of text in the code."
Sample 3: `trajectory_capture` (behavior) - strongest, observes the agent's real behavior trajectory
This is our most powerful case type, and where Agent eval departs from traditional unit testing: it doesn't check output text; it observes which tools the agent actually called, i.e. "did it do the right thing," not "did it say the right thing."
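A minimal sketch of what a verdict over a captured trajectory might look like. The trajectory shape (a list of dicts with `tool` and `input.file_path`), the function name, and the example path are hypothetical and chosen for illustration; they are not SwarmAI's actual capture format.

```python
def agent_read_file(trajectory: list, path_suffix: str) -> bool:
    """True if any recorded tool call is a Read whose target ends with `path_suffix`."""
    for call in trajectory:
        if call.get("tool") != "Read":
            continue
        target = str(call.get("input", {}).get("file_path", ""))
        if target.endswith(path_suffix):
            return True
    return False
```

The verdict never looks at the final answer text, so an agent that produces a plausible answer without opening the file still fails.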
Why this is the strongest design: early on we had a case (`GS_ACT005`) that used an LLM-judge to decide "does the answer reflect DDD," and it fell into the circular-judge trap: the judge hallucinated the DDD content itself and gave a high score, while the agent had never read the file at all. `trajectory_capture` moves the verdict from "does the output look right" to "did it actually Read that file," which is observable and unfakeable.
Appendix: Gaps We Still Flag Honestly
These are all deferred and don't touch the core gate. The gate stands — and that's what this round was for.
This article is based on SwarmAI's internal design docs (`Knowledge/Designs/2026-06-26-eval-system-decoupled-design.md`, `2026-06-26-eval-first-leap-design.md`) + notes on AWS's official methodology (`Knowledge/Learned/2026-06-26-aws-eval-first-agentcore-methodology.md`), with all runtime numbers verified against code on 2026-06-26/27/28.