feat(eval): lift token-recall checker + export ErrorCluster (agent-app scaffold prep) by drewstone · Pull Request #223 · tangle-network/agent-eval

drewstone · 2026-06-05T21:57:10Z

Substrate prep for the shared agent-app eval-campaign scaffold: two primitives products currently hand-roll move down into the substrate, so the scaffold is thin composition rather than reimplementation.

What

createTokenRecallChecker — the deterministic, no-LLM CorrectnessChecker, the sibling of createLlmCorrectnessChecker. Currently owned by @tangle-network/agent-app/eval; its own doc said "agent-eval exports only createLlmCorrectnessChecker" — i.e. it was always meant to live here. A produced item fulfils a requirement when its content is substantive (≥ minContentLength) AND recalls ≥ minRecall of the requirement title's significant tokens. The default gate for apps/tests with no LLM judge. Pass it to verifyCompletion.

ErrorCluster joins the trace-analyst barrel (and flows to the root via export *). agent-runtime currently carries a /** ErrorCluster isn't re-exported from the agent-eval root; derive it from the overview. */ hack (type ErrorCluster = DatasetOverview['error_clusters'][number]). With the named export published, that derive-hack can be deleted in agent-runtime's next bump.

Why substrate, not agent-app

Both pass the layering test ("does this concept make sense with no running agent loop?"): a deterministic content checker and a failure-cluster type are pure substrate primitives. agent-eval is the bottom — every consumer composes these; the reverse import is forbidden.

Verification

pnpm typecheck clean
pnpm test — 1879 passed, 2 skipped (1870 prior + 9 new createTokenRecallChecker cases incl. thin-content reject, recall pass/fail, no-significant-tokens structural accept, custom minRecall, and a verifyCompletion integration)
pnpm build clean; both symbols verified present in dist/index.js + dist/index.d.ts
Additive only — no behavior change to existing exports

Follow-ups (separate PRs)

agent-app: re-export createTokenRecallChecker from the substrate; delete the local copy
agent-runtime: drop the ErrorCluster derive-hack, import the named type

…ster createTokenRecallChecker — the no-LLM CorrectnessChecker, sibling of createLlmCorrectnessChecker — lands in the substrate so the agent-app eval-campaign scaffold and every product compose it instead of hand-rolling their own default gate. Substrate-shaped: a pure deterministic checker that makes sense with no running agent loop. ErrorCluster joins the trace-analyst barrel (and flows to the root via export *), so agent-runtime imports the named type instead of deriving DatasetOverview['error_clusters'][number].

tangletools · 2026-06-05T22:06:27Z

✅ No Blockers — `49d4e186`

Readiness 73/100 · Confidence 80/100 · 9 findings (1 medium, 8 low)

	deepseek	glm	aggregate
Readiness	73	83	73
Confidence	80	80	80
Correctness	73	83	73
Security	73	83	73
Testing	73	83	73
Architecture	73	83	73

Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Substring-includes matching is more permissive than structural tokenRecall — src/completion-verifier.ts

At line 408, lower.includes(t) checks substring containment, whereas the structural matcher at line 136 uses cand.has(t) on a token Set for exact matches. A title token like 'car' matches 'scary', 'carbon', 'vicar' via includes. This inflates recall for short tokens. Fix: tokenize the body the same way and check set membership for consistency.

🟡 LOW Integration test doesn't use existing emptyState() helper — src/completion-verifier.test.ts

Line 266-276: The 'plugs into verifyCompletion' test manually constructs { artifacts: [...], proposals: [], toolCalls: [] } instead of spreading emptyState() like the other tests (e.g. line 54-59, 71). Minor inconsistency — functionally correct. Fix: use { ...emptyState(), artifacts: [...] } for consistency.

🟡 LOW No negative integration test for token-recall inside verifyCompletion — src/completion-verifier.test.ts

Test 6 verifies the happy path (token-recall checker passes inside verifyCompletion → fullyComplete=true). There is no corresponding test verifying that when token-recall rejects, verifyCompletion correctly reports fullyComplete=false with correct.correct=false. The unit tests cover this path separately but the integration wiring is only tested one direction.

🟡 LOW No-significant-tokens test uses filler content that's trivially meaningless — src/completion-verifier.test.ts

Line 251: await check({ reqId: 'r', title: 'Review the new update' }, LONG) — LONG is 120 'x' chars. The test assers r.correct is true. This faithfully mirrors the implementation behavior (empty token set → structural match accepted), but the test would be stronger if it documented WHY filler content is acceptable here (the structural match already passed before the checker was called). Consider adding a second assertion verifying the reason includes 'no significant tokens' (already done) plus a comment noting the structural gate ran first. Current state is accurate but the reader may wonder why 120 'x' chars passes.

🟡 LOW Checker's minContentLength default (120) mismatches structural MIN_CONTENT_CHARS (50) — src/completion-verifier.ts

An artifact with 80 chars of substantive content passes the structural gate (line 158: >= MIN_CONTENT_CHARS = 50) but then createTokenRecallChecker returns correct: false with 'content too thin' (line 399: minLen ?? 120). This 50-120 char gap means structurally-present items can never pass the default checker. Failing loud is correct, but the asymmetry is undocumented.

🟡 LOW Duplicate stopword list with divergence from existing STOPWORDS — src/completion-verifier.ts

STOPWORDS (line 99) contains 'by'; TITLE_STOPWORDS (line 366) omits 'by' but adds 'review', 'update', 'new', 'proposed'. Two stopword sets in the same file with ~85% overlap means future edits risk updating only one. Consider extracting the shared set and extending it, or documenting why they diverge.

🟡 LOW Substring token matching can false-positive on short tokens — src/completion-verifier.ts

Line 408: lower.includes(t) does substring matching against the full body. A 3-char title token (the minimum after t.length > 2 filter) like 'cap' matches 'capacity', 'capital', 'caption', etc. The structural matcher (line 131) uses exact Set-based token matching for comparison. The difference is intentional (prose vs identifiers) and mitigated by TITLE_STOPWORDS filtering common short words, but no test asserts the boundary behavior. Consider adding a word-boundary check (new RegExp('\\b' + escape(t) + '\\b')) or a test that a short to

🟡 LOW Two divergent stopwords sets risk silent drift — src/completion-verifier.ts

STOPWORDS (line 99, 12 entries, includes 'by') and TITLE_STOPWORDS (line 366, 16 entries, includes 'review'/'update'/'new'/'proposed', omits 'by') serve similar token-filtering roles but are independent Sets with different membership. They also use different minimum token lengths (>1 vs >2). Not a bug, but a future editor could miss that adding a word to one set doesn't affect the other. A code comment cross-referencing the other set, or a shared base set, would reduce drift risk.

🟡 LOW requirement.category is ignored by token-recall checker — src/completion-verifier.ts

createLlmCorrectnessChecker includes requirement.category in the LLM prompt (line 349), but createTokenRecallChecker only tokenizes requirement.title (line 401). If a requirement title is entirely stopwords and the semantics are in the category field, the checker returns correct: true vacuously (line 405-406). This is a documented shortfall of the title-only approach.

_{tangletools · 2026-06-05T22:06:24Z · trace}

tangletools

✅ Approved — 9 non-blocking findings — `49d4e186`

Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary

_{tangletools · 2026-06-05T22:06:24Z · immutable trace}

biome wraps the over-width _findings-text imports in ace.ts/memory.ts that landed unformatted in #222 (the husky pre-commit hook was non-executable so biome format never ran), and the new createTokenRecallChecker returns. Pure formatting — no logic change. Unblocks the lint job.

tangletools

✅ Refreshed approval after new commits — `982288e4`

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-05T22:37:08Z}

Git tracked it 100644, so the hook silently no-ops on commit (the 'hook was ignored because it's not set as executable' warning) — which is how unformatted code reached main. 100755 makes it run.

tangletools

✅ Refreshed approval after new commits — `377cfb39`

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-05T22:37:24Z}

Lockstep version bump (npm + pyproject + python __version__ fallback) for the eval-campaign scaffold prep primitives merged in #223 + #224.

tangletools previously approved these changes Jun 5, 2026

View reviewed changes

drewstone dismissed tangletools’s stale review via 982288e June 5, 2026 22:37

tangletools previously approved these changes Jun 5, 2026

View reviewed changes

fix(hooks): mark .husky/pre-commit executable

377cfb3

Git tracked it 100644, so the hook silently no-ops on commit (the 'hook was ignored because it's not set as executable' warning) — which is how unformatted code reached main. 100755 makes it run.

drewstone dismissed tangletools’s stale review via 377cfb3 June 5, 2026 22:37

tangletools approved these changes Jun 5, 2026

View reviewed changes

drewstone merged commit 68c4962 into main Jun 5, 2026
1 check passed

drewstone deleted the feat/substrate-prep-token-recall-errorcluster branch June 5, 2026 22:39

drewstone mentioned this pull request Jun 5, 2026

chore(release): 0.81.0 — eval-campaign scaffold prep primitives #225

Merged

drewstone added a commit that referenced this pull request Jun 5, 2026

chore(release): 0.81.0 — eval-campaign scaffold prep primitives (#225)

9368f7d

Lockstep version bump (npm + pyproject + python __version__ fallback) for the eval-campaign scaffold prep primitives merged in #223 + #224.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(eval): lift token-recall checker + export ErrorCluster (agent-app scaffold prep)#223

feat(eval): lift token-recall checker + export ErrorCluster (agent-app scaffold prep)#223
drewstone merged 3 commits into
mainfrom
feat/substrate-prep-token-recall-errorcluster

drewstone commented Jun 5, 2026

Uh oh!

tangletools commented Jun 5, 2026

Uh oh!

tangletools left a comment

Uh oh!

tangletools left a comment

Uh oh!

tangletools left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

drewstone commented Jun 5, 2026

What

Why substrate, not agent-app

Verification

Follow-ups (separate PRs)

Uh oh!

tangletools commented Jun 5, 2026

✅ No Blockers — 49d4e186

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Approved — 9 non-blocking findings — 49d4e186

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Refreshed approval after new commits — 982288e4

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Refreshed approval after new commits — 377cfb39

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

✅ No Blockers — `49d4e186`

✅ Approved — 9 non-blocking findings — `49d4e186`

✅ Refreshed approval after new commits — `982288e4`

✅ Refreshed approval after new commits — `377cfb39`