feat(eval): lift token-recall checker + export ErrorCluster (agent-app scaffold prep)#223
Conversation
…ster createTokenRecallChecker — the no-LLM CorrectnessChecker, sibling of createLlmCorrectnessChecker — lands in the substrate so the agent-app eval-campaign scaffold and every product compose it instead of hand-rolling their own default gate. Substrate-shaped: a pure deterministic checker that makes sense with no running agent loop. ErrorCluster joins the trace-analyst barrel (and flows to the root via export *), so agent-runtime imports the named type instead of deriving DatasetOverview['error_clusters'][number].
✅ No Blockers —
|
| deepseek | glm | aggregate | |
|---|---|---|---|
| Readiness | 73 | 83 | 73 |
| Confidence | 80 | 80 | 80 |
| Correctness | 73 | 83 | 73 |
| Security | 73 | 83 | 73 |
| Testing | 73 | 83 | 73 |
| Architecture | 73 | 83 | 73 |
Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Substring-includes matching is more permissive than structural tokenRecall — src/completion-verifier.ts
At line 408,
lower.includes(t)checks substring containment, whereas the structural matcher at line 136 usescand.has(t)on a token Set for exact matches. A title token like 'car' matches 'scary', 'carbon', 'vicar' via includes. This inflates recall for short tokens. Fix: tokenize the body the same way and check set membership for consistency.
🟡 LOW Integration test doesn't use existing emptyState() helper — src/completion-verifier.test.ts
Line 266-276: The 'plugs into verifyCompletion' test manually constructs
{ artifacts: [...], proposals: [], toolCalls: [] }instead of spreadingemptyState()like the other tests (e.g. line 54-59, 71). Minor inconsistency — functionally correct. Fix: use{ ...emptyState(), artifacts: [...] }for consistency.
🟡 LOW No negative integration test for token-recall inside verifyCompletion — src/completion-verifier.test.ts
Test 6 verifies the happy path (token-recall checker passes inside verifyCompletion → fullyComplete=true). There is no corresponding test verifying that when token-recall rejects, verifyCompletion correctly reports fullyComplete=false with correct.correct=false. The unit tests cover this path separately but the integration wiring is only tested one direction.
🟡 LOW No-significant-tokens test uses filler content that's trivially meaningless — src/completion-verifier.test.ts
Line 251:
await check({ reqId: 'r', title: 'Review the new update' }, LONG)— LONG is 120 'x' chars. The test assersr.correctis true. This faithfully mirrors the implementation behavior (empty token set → structural match accepted), but the test would be stronger if it documented WHY filler content is acceptable here (the structural match already passed before the checker was called). Consider adding a second assertion verifying the reason includes 'no significant tokens' (already done) plus a comment noting the structural gate ran first. Current state is accurate but the reader may wonder why 120 'x' chars passes.
🟡 LOW Checker's minContentLength default (120) mismatches structural MIN_CONTENT_CHARS (50) — src/completion-verifier.ts
An artifact with 80 chars of substantive content passes the structural gate (line 158:
>= MIN_CONTENT_CHARS = 50) but thencreateTokenRecallCheckerreturnscorrect: falsewith 'content too thin' (line 399:minLen ?? 120). This 50-120 char gap means structurally-present items can never pass the default checker. Failing loud is correct, but the asymmetry is undocumented.
🟡 LOW Duplicate stopword list with divergence from existing STOPWORDS — src/completion-verifier.ts
STOPWORDS(line 99) contains 'by';TITLE_STOPWORDS(line 366) omits 'by' but adds 'review', 'update', 'new', 'proposed'. Two stopword sets in the same file with ~85% overlap means future edits risk updating only one. Consider extracting the shared set and extending it, or documenting why they diverge.
🟡 LOW Substring token matching can false-positive on short tokens — src/completion-verifier.ts
Line 408:
lower.includes(t)does substring matching against the full body. A 3-char title token (the minimum aftert.length > 2filter) like 'cap' matches 'capacity', 'capital', 'caption', etc. The structural matcher (line 131) uses exact Set-based token matching for comparison. The difference is intentional (prose vs identifiers) and mitigated by TITLE_STOPWORDS filtering common short words, but no test asserts the boundary behavior. Consider adding a word-boundary check (new RegExp('\\b' + escape(t) + '\\b')) or a test that a short to
🟡 LOW Two divergent stopwords sets risk silent drift — src/completion-verifier.ts
STOPWORDS(line 99, 12 entries, includes 'by') andTITLE_STOPWORDS(line 366, 16 entries, includes 'review'/'update'/'new'/'proposed', omits 'by') serve similar token-filtering roles but are independent Sets with different membership. They also use different minimum token lengths (>1vs>2). Not a bug, but a future editor could miss that adding a word to one set doesn't affect the other. A code comment cross-referencing the other set, or a shared base set, would reduce drift risk.
🟡 LOW requirement.category is ignored by token-recall checker — src/completion-verifier.ts
createLlmCorrectnessCheckerincludesrequirement.categoryin the LLM prompt (line 349), butcreateTokenRecallCheckeronly tokenizesrequirement.title(line 401). If a requirement title is entirely stopwords and the semantics are in the category field, the checker returnscorrect: truevacuously (line 405-406). This is a documented shortfall of the title-only approach.
tangletools · 2026-06-05T22:06:24Z · trace
tangletools
left a comment
There was a problem hiding this comment.
✅ Approved — 9 non-blocking findings — 49d4e186
Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-05T22:06:24Z · immutable trace
biome wraps the over-width _findings-text imports in ace.ts/memory.ts that landed unformatted in #222 (the husky pre-commit hook was non-executable so biome format never ran), and the new createTokenRecallChecker returns. Pure formatting — no logic change. Unblocks the lint job.
tangletools
left a comment
There was a problem hiding this comment.
✅ Refreshed approval after new commits — 982288e4
A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-05T22:37:08Z
Git tracked it 100644, so the hook silently no-ops on commit (the 'hook was ignored because it's not set as executable' warning) — which is how unformatted code reached main. 100755 makes it run.
tangletools
left a comment
There was a problem hiding this comment.
✅ Refreshed approval after new commits — 377cfb39
A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-05T22:37:24Z
Substrate prep for the shared agent-app eval-campaign scaffold: two primitives products currently hand-roll move down into the substrate, so the scaffold is thin composition rather than reimplementation.
What
createTokenRecallChecker— the deterministic, no-LLMCorrectnessChecker, the sibling ofcreateLlmCorrectnessChecker. Currently owned by@tangle-network/agent-app/eval; its own doc said "agent-eval exports onlycreateLlmCorrectnessChecker" — i.e. it was always meant to live here. A produced item fulfils a requirement when its content is substantive (≥minContentLength) AND recalls ≥minRecallof the requirement title's significant tokens. The default gate for apps/tests with no LLM judge. Pass it toverifyCompletion.ErrorClusterjoins the trace-analyst barrel (and flows to the root viaexport *). agent-runtime currently carries a/** ErrorCluster isn't re-exported from the agent-eval root; derive it from the overview. */hack (type ErrorCluster = DatasetOverview['error_clusters'][number]). With the named export published, that derive-hack can be deleted in agent-runtime's next bump.Why substrate, not agent-app
Both pass the layering test ("does this concept make sense with no running agent loop?"): a deterministic content checker and a failure-cluster type are pure substrate primitives. agent-eval is the bottom — every consumer composes these; the reverse import is forbidden.
Verification
pnpm typecheckcleanpnpm test— 1879 passed, 2 skipped (1870 prior + 9 newcreateTokenRecallCheckercases incl. thin-content reject, recall pass/fail, no-significant-tokens structural accept, customminRecall, and averifyCompletionintegration)pnpm buildclean; both symbols verified present indist/index.js+dist/index.d.tsFollow-ups (separate PRs)
createTokenRecallCheckerfrom the substrate; delete the local copyErrorClusterderive-hack, import the named type