Skip to content

feat(eval): lift token-recall checker + export ErrorCluster (agent-app scaffold prep)#223

Merged
drewstone merged 3 commits into
mainfrom
feat/substrate-prep-token-recall-errorcluster
Jun 5, 2026
Merged

feat(eval): lift token-recall checker + export ErrorCluster (agent-app scaffold prep)#223
drewstone merged 3 commits into
mainfrom
feat/substrate-prep-token-recall-errorcluster

Conversation

@drewstone
Copy link
Copy Markdown
Contributor

Substrate prep for the shared agent-app eval-campaign scaffold: two primitives products currently hand-roll move down into the substrate, so the scaffold is thin composition rather than reimplementation.

What

createTokenRecallChecker — the deterministic, no-LLM CorrectnessChecker, the sibling of createLlmCorrectnessChecker. Currently owned by @tangle-network/agent-app/eval; its own doc said "agent-eval exports only createLlmCorrectnessChecker" — i.e. it was always meant to live here. A produced item fulfils a requirement when its content is substantive (≥ minContentLength) AND recalls ≥ minRecall of the requirement title's significant tokens. The default gate for apps/tests with no LLM judge. Pass it to verifyCompletion.

ErrorCluster joins the trace-analyst barrel (and flows to the root via export *). agent-runtime currently carries a /** ErrorCluster isn't re-exported from the agent-eval root; derive it from the overview. */ hack (type ErrorCluster = DatasetOverview['error_clusters'][number]). With the named export published, that derive-hack can be deleted in agent-runtime's next bump.

Why substrate, not agent-app

Both pass the layering test ("does this concept make sense with no running agent loop?"): a deterministic content checker and a failure-cluster type are pure substrate primitives. agent-eval is the bottom — every consumer composes these; the reverse import is forbidden.

Verification

  • pnpm typecheck clean
  • pnpm test1879 passed, 2 skipped (1870 prior + 9 new createTokenRecallChecker cases incl. thin-content reject, recall pass/fail, no-significant-tokens structural accept, custom minRecall, and a verifyCompletion integration)
  • pnpm build clean; both symbols verified present in dist/index.js + dist/index.d.ts
  • Additive only — no behavior change to existing exports

Follow-ups (separate PRs)

  • agent-app: re-export createTokenRecallChecker from the substrate; delete the local copy
  • agent-runtime: drop the ErrorCluster derive-hack, import the named type

…ster

createTokenRecallChecker — the no-LLM CorrectnessChecker, sibling of
createLlmCorrectnessChecker — lands in the substrate so the agent-app
eval-campaign scaffold and every product compose it instead of hand-rolling
their own default gate. Substrate-shaped: a pure deterministic checker that
makes sense with no running agent loop.

ErrorCluster joins the trace-analyst barrel (and flows to the root via
export *), so agent-runtime imports the named type instead of deriving
DatasetOverview['error_clusters'][number].
@tangletools
Copy link
Copy Markdown
Contributor

✅ No Blockers — 49d4e186

Readiness 73/100 · Confidence 80/100 · 9 findings (1 medium, 8 low)

deepseek glm aggregate
Readiness 73 83 73
Confidence 80 80 80
Correctness 73 83 73
Security 73 83 73
Testing 73 83 73
Architecture 73 83 73

Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Substring-includes matching is more permissive than structural tokenRecall — src/completion-verifier.ts

At line 408, lower.includes(t) checks substring containment, whereas the structural matcher at line 136 uses cand.has(t) on a token Set for exact matches. A title token like 'car' matches 'scary', 'carbon', 'vicar' via includes. This inflates recall for short tokens. Fix: tokenize the body the same way and check set membership for consistency.

🟡 LOW Integration test doesn't use existing emptyState() helper — src/completion-verifier.test.ts

Line 266-276: The 'plugs into verifyCompletion' test manually constructs { artifacts: [...], proposals: [], toolCalls: [] } instead of spreading emptyState() like the other tests (e.g. line 54-59, 71). Minor inconsistency — functionally correct. Fix: use { ...emptyState(), artifacts: [...] } for consistency.

🟡 LOW No negative integration test for token-recall inside verifyCompletion — src/completion-verifier.test.ts

Test 6 verifies the happy path (token-recall checker passes inside verifyCompletion → fullyComplete=true). There is no corresponding test verifying that when token-recall rejects, verifyCompletion correctly reports fullyComplete=false with correct.correct=false. The unit tests cover this path separately but the integration wiring is only tested one direction.

🟡 LOW No-significant-tokens test uses filler content that's trivially meaningless — src/completion-verifier.test.ts

Line 251: await check({ reqId: 'r', title: 'Review the new update' }, LONG) — LONG is 120 'x' chars. The test assers r.correct is true. This faithfully mirrors the implementation behavior (empty token set → structural match accepted), but the test would be stronger if it documented WHY filler content is acceptable here (the structural match already passed before the checker was called). Consider adding a second assertion verifying the reason includes 'no significant tokens' (already done) plus a comment noting the structural gate ran first. Current state is accurate but the reader may wonder why 120 'x' chars passes.

🟡 LOW Checker's minContentLength default (120) mismatches structural MIN_CONTENT_CHARS (50) — src/completion-verifier.ts

An artifact with 80 chars of substantive content passes the structural gate (line 158: >= MIN_CONTENT_CHARS = 50) but then createTokenRecallChecker returns correct: false with 'content too thin' (line 399: minLen ?? 120). This 50-120 char gap means structurally-present items can never pass the default checker. Failing loud is correct, but the asymmetry is undocumented.

🟡 LOW Duplicate stopword list with divergence from existing STOPWORDS — src/completion-verifier.ts

STOPWORDS (line 99) contains 'by'; TITLE_STOPWORDS (line 366) omits 'by' but adds 'review', 'update', 'new', 'proposed'. Two stopword sets in the same file with ~85% overlap means future edits risk updating only one. Consider extracting the shared set and extending it, or documenting why they diverge.

🟡 LOW Substring token matching can false-positive on short tokens — src/completion-verifier.ts

Line 408: lower.includes(t) does substring matching against the full body. A 3-char title token (the minimum after t.length > 2 filter) like 'cap' matches 'capacity', 'capital', 'caption', etc. The structural matcher (line 131) uses exact Set-based token matching for comparison. The difference is intentional (prose vs identifiers) and mitigated by TITLE_STOPWORDS filtering common short words, but no test asserts the boundary behavior. Consider adding a word-boundary check (new RegExp('\\b' + escape(t) + '\\b')) or a test that a short to

🟡 LOW Two divergent stopwords sets risk silent drift — src/completion-verifier.ts

STOPWORDS (line 99, 12 entries, includes 'by') and TITLE_STOPWORDS (line 366, 16 entries, includes 'review'/'update'/'new'/'proposed', omits 'by') serve similar token-filtering roles but are independent Sets with different membership. They also use different minimum token lengths (>1 vs >2). Not a bug, but a future editor could miss that adding a word to one set doesn't affect the other. A code comment cross-referencing the other set, or a shared base set, would reduce drift risk.

🟡 LOW requirement.category is ignored by token-recall checker — src/completion-verifier.ts

createLlmCorrectnessChecker includes requirement.category in the LLM prompt (line 349), but createTokenRecallChecker only tokenizes requirement.title (line 401). If a requirement title is entirely stopwords and the semantics are in the category field, the checker returns correct: true vacuously (line 405-406). This is a documented shortfall of the title-only approach.


tangletools · 2026-06-05T22:06:24Z · trace

tangletools
tangletools previously approved these changes Jun 5, 2026
Copy link
Copy Markdown
Contributor

@tangletools tangletools left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Approved — 9 non-blocking findings — 49d4e186

Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 4/4 planned shots over 4 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-05T22:06:24Z · immutable trace

biome wraps the over-width _findings-text imports in ace.ts/memory.ts that
landed unformatted in #222 (the husky pre-commit hook was non-executable so
biome format never ran), and the new createTokenRecallChecker returns. Pure
formatting — no logic change. Unblocks the lint job.
tangletools
tangletools previously approved these changes Jun 5, 2026
Copy link
Copy Markdown
Contributor

@tangletools tangletools left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Refreshed approval after new commits — 982288e4

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-05T22:37:08Z

Git tracked it 100644, so the hook silently no-ops on commit (the
'hook was ignored because it's not set as executable' warning) — which is
how unformatted code reached main. 100755 makes it run.
Copy link
Copy Markdown
Contributor

@tangletools tangletools left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Refreshed approval after new commits — 377cfb39

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-05T22:37:24Z

@drewstone drewstone merged commit 68c4962 into main Jun 5, 2026
1 check passed
@drewstone drewstone deleted the feat/substrate-prep-token-recall-errorcluster branch June 5, 2026 22:39
drewstone added a commit that referenced this pull request Jun 5, 2026
Lockstep version bump (npm + pyproject + python __version__ fallback) for the
eval-campaign scaffold prep primitives merged in #223 + #224.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants