feat(analyst): typed kinds for failure + recursive self-improvement (on top of #56) #57
Merged
drewstone merged 1 commit on May 19, 2026
…-improvement
The original PR-A adapter (`createTraceAnalystAdapter`) shipped a single
generic flow whose output was `findings:string[]` — every bullet became
a flat-severity `medium` / confidence `0.6` `AnalystFinding`, losing the
per-finding grading the analyzer LLM is perfectly capable of producing.
This commit adds the typed-kind architecture on top:
finding-signature.ts
Strict Zod schema for one analyst-emitted row (severity / claim /
subject / evidence_uri / confidence / rationale / recommended_action).
`RAW_FINDING_SCHEMA_PROMPT` embeds the shape into every kind's actor
prompt; the LLM emits structured JSON; the factory Zod-validates each
row at the boundary and drops malformed rows with a logged reason.
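
A minimal sketch of the boundary check described above, under stated assumptions. The real module uses a strict Zod schema; this version swaps in a hand-rolled validator so the example runs without any packages. The field names come from the text above, while the severity values, the 0..1 confidence range, and the helper name `parseRow` are assumptions made for illustration.

```typescript
// Hand-rolled stand-in for the strict Zod schema, so the sketch needs no packages.
// Field names come from the commit message; the severity values and the
// 0..1 confidence range are assumptions.
const SEVERITIES = ["low", "medium", "high", "critical"] as const;
type Severity = (typeof SEVERITIES)[number];

interface RawFindingRow {
  severity: Severity;
  claim: string;
  subject: string;
  evidence_uri: string;
  confidence: number;
  rationale: string;
  recommended_action: string;
}

const STRING_FIELDS = [
  "claim", "subject", "evidence_uri", "rationale", "recommended_action",
] as const;
const ALLOWED_KEYS = new Set<string>(["severity", "confidence", ...STRING_FIELDS]);

// Hypothetical helper: returns the typed row, or null after logging the reason.
function parseRow(
  row: unknown,
  log: (reason: string) => void = (reason) => console.warn(reason),
): RawFindingRow | null {
  if (typeof row !== "object" || row === null || Array.isArray(row)) {
    log("row is not an object");
    return null;
  }
  const rec = row as Record<string, unknown>;
  // Strict mode: any key outside the schema rejects the whole row.
  const extra = Object.keys(rec).filter((key) => !ALLOWED_KEYS.has(key));
  if (extra.length > 0) {
    log(`unexpected fields: ${extra.join(", ")}`);
    return null;
  }
  const severity = rec["severity"];
  if (typeof severity !== "string" || !(SEVERITIES as readonly string[]).includes(severity)) {
    log(`unknown severity: ${String(severity)}`);
    return null;
  }
  const confidence = rec["confidence"];
  // The negated range check also rejects NaN.
  if (typeof confidence !== "number" || !(confidence >= 0 && confidence <= 1)) {
    log(`confidence out of range: ${String(confidence)}`);
    return null;
  }
  for (const field of STRING_FIELDS) {
    if (typeof rec[field] !== "string") {
      log(`${field} must be a string`);
      return null;
    }
  }
  return rec as unknown as RawFindingRow;
}
```

Dropping a malformed row with a logged reason, rather than lifting it with default severity and confidence, keeps the flat-default problem of the legacy adapter from returning through the back door.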
kind-factory.ts
`createTraceAnalystKind(spec, { ai })` lifts a kind spec into the
existing `Analyst<TraceAnalysisStore>` contract. Wires Ax `agent(...)`
with `findings:json[]` output, AxJSRuntime sandbox, advanced-mode
recursion when `maxDepth>0`, and the per-kind tool subset. `versionSuffix`
is reserved for prompt-optimization artifacts (MIPRO/GEPA); wiring it up
needs an Ax version bump and lands in a follow-up.
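
The two spec-level rules mentioned above (advanced-mode recursion when `maxDepth>0`, and `versionSuffix` appending to the kind version) can be sketched without touching Ax at all. Everything below is hypothetical: the type and function names, the `"simple"` mode label, and the `-` separator used when appending the suffix are assumptions, not the factory's actual code.

```typescript
// All names here are hypothetical; only `maxDepth`, `versionSuffix`, and the
// "advanced mode when maxDepth > 0" rule come from the commit message.
interface KindSpecSketch {
  id: string;
  version: string;
  maxDepth: number;  // how many levels of subagent recursion are allowed
  parallel: number;  // how many subagents may run side by side
  toolGroup: string; // name of a tool subset
}

interface ResolvedKind extends KindSpecSketch {
  // "simple" is an assumed label for the non-recursive mode.
  mode: "simple" | "advanced";
}

function resolveKind(
  spec: KindSpecSketch,
  opts: { versionSuffix?: string } = {},
): ResolvedKind {
  if (!Number.isInteger(spec.maxDepth) || spec.maxDepth < 0) {
    throw new Error(`maxDepth must be a non-negative integer, got ${spec.maxDepth}`);
  }
  // Assumed join: the suffix is appended after a "-" separator.
  const version = opts.versionSuffix
    ? `${spec.version}-${opts.versionSuffix}`
    : spec.version;
  return { ...spec, version, mode: spec.maxDepth > 0 ? "advanced" : "simple" };
}
```

Keeping the suffix in the version string means a finding produced by an optimizer-fitted prompt can be told apart from one produced by the baseline prompt of the same kind.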
tool-groups.ts
Five named subsets — `all`, `discovery`, `discoveryAndRead`,
`discoveryAndSearch`, `targeted` — so each kind takes only what it
needs from the seven trace-analyst tools. Unknown group names throw
(silent-all would defeat the cost-control point).
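
A sketch of the lookup-or-throw behaviour described above. The five group names are from the text; the seven tool names and the membership of each group are placeholders invented for illustration, since the source does not list them.

```typescript
// Tool names and group membership are hypothetical; the commit message gives
// only the five group names and the count of seven tools.
const ALL_TOOLS = [
  "listTraces", "datasetOverview", "readTrace", "readSpan",
  "searchTraces", "searchSpans", "queryMetrics",
] as const;
type ToolName = (typeof ALL_TOOLS)[number];

const DISCOVERY: readonly ToolName[] = ["listTraces", "datasetOverview"];
const READ: readonly ToolName[] = ["readTrace", "readSpan"];
const SEARCH: readonly ToolName[] = ["searchTraces", "searchSpans"];

// A Map avoids accidental hits on inherited object keys such as "constructor".
const TOOL_GROUPS = new Map<string, readonly ToolName[]>([
  ["all", ALL_TOOLS],
  ["discovery", DISCOVERY],
  ["discoveryAndRead", [...DISCOVERY, ...READ]],
  ["discoveryAndSearch", [...DISCOVERY, ...SEARCH]],
  ["targeted", [...READ, ...SEARCH]],
]);

function resolveToolGroup(name: string): readonly ToolName[] {
  const tools = TOOL_GROUPS.get(name);
  if (tools === undefined) {
    // Falling back to "all" here would silently hand every tool to the kind.
    const known = Array.from(TOOL_GROUPS.keys()).join(", ");
    throw new Error(`unknown tool group "${name}"; expected one of: ${known}`);
  }
  return tools;
}
```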
kinds/ — four default kinds, ordered for dependency-aware runs:
1. failure-mode (maxDepth 3, parallel 4) — clusters dataset failures
into distinct modes with cited evidence. Discovery → cluster → cite
protocol. Aggressive RLM delegation: one subagent per cluster, and
confounded clusters split again at the next level.
2. knowledge-gap (maxDepth 2, parallel 4) — names the specific
information the agent lacked or that was stale, attributed to the
runtime layer that should have surfaced it. Anchored on
`@tangle-network/agent-knowledge` (wiki page / claim / raw source
loci) with secondary loci for `websearch:outdated:*`, `tool-doc:*`,
`system-prompt:*`, `memory:*`. Subagents fan out per layer.
3. knowledge-poisoning (maxDepth 2, parallel 4) — finds confident-
but-wrong actions. DUAL-VERIFY protocol: subagents prove (i) the
agent acted on the belief and (ii) the belief is false in this
trace's evidence. Only findings with both halves proven survive.
4. improvement (maxDepth 3, parallel 4) — converts upstream findings
into concrete locus-named edits. DISCOVERY → CANDIDATE-FIXES →
COMPETE → CITE: subagents simulate competing fix candidates per
cluster; the winning candidate per cluster is emitted with leverage
grade, rationale, and a literal edit phrased as a diff. Cross-
references upstream findings via `evidence_uri: "finding://<id>"`
so the dependency graph renders.
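
To illustrate how the `finding://<id>` cross-references could turn into a dependency graph, here is a small sketch. The `FindingRef` shape, the function names, and the decision to report unresolved references separately are assumptions; the source only states that improvement findings cite upstream findings through `evidence_uri`.

```typescript
// Minimal finding shape for the sketch; the real envelope carries more fields.
interface FindingRef {
  finding_id: string;
  evidence_uri: string;
}

const FINDING_SCHEME = "finding://";

// Returns the upstream id for a finding:// URI, or null for any other URI.
function upstreamId(evidenceUri: string): string | null {
  if (!evidenceUri.startsWith(FINDING_SCHEME)) return null;
  const id = evidenceUri.slice(FINDING_SCHEME.length);
  return id.length > 0 ? id : null;
}

interface DependencyGraph {
  edges: Array<{ from: string; to: string }>; // `from` builds on `to`
  dangling: string[]; // finding_ids that cite an upstream finding not in the set
}

function buildDependencyGraph(findings: readonly FindingRef[]): DependencyGraph {
  const known = new Set(findings.map((f) => f.finding_id));
  const graph: DependencyGraph = { edges: [], dangling: [] };
  for (const finding of findings) {
    const to = upstreamId(finding.evidence_uri);
    if (to === null) continue; // evidence points at a trace, not another finding
    if (known.has(to)) {
      graph.edges.push({ from: finding.finding_id, to });
    } else {
      graph.dangling.push(finding.finding_id);
    }
  }
  return graph;
}
```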
Tests (21 new in `kinds/kinds.test.ts`):
- Zod schema rejects out-of-range confidence / unknown severity /
extra fields (strict mode), logs the rejection reason
- `parseRawFinding` returns null + logs on failure, types value on success
- default suite emits the four kinds in run order
- every kind exercises Ax recursion (maxDepth ≥ 1)
- improvement has the deepest depth (competing candidate fixes)
- knowledge-gap prompt anchors on agent-knowledge + websearch + tool-doc,
not generic RAG
- knowledge-poisoning enforces dual-verify
- failure-mode requires clustering, not enumeration
- tool groups filter narrowly; unknown name throws
- `versionSuffix` appends to kind version (for future optimizer pins)
- `finding_id` stable across runs for same kind + area + claim + subject
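
The last test above relies on the id being a pure function of kind, area, claim, and subject. A sketch of that idea follows. The package describes the id as sha-stable; this example swaps in a 32-bit FNV-1a hash purely to stay import-free, and the names `fnv1a32`, `FindingKey`, and `findingId` are hypothetical.

```typescript
// FNV-1a (32-bit) over UTF-16 code units. This stands in for the SHA-based
// hash so the sketch has no imports; it is not what the package uses.
function fnv1a32(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

interface FindingKey {
  kind: string;
  area: string;
  claim: string;
  subject: string;
}

function findingId(key: FindingKey): string {
  // Encoding the fields as a JSON array keeps their boundaries unambiguous,
  // so ("ab", "c") and ("a", "bc") never produce the same input string.
  const canonical = JSON.stringify([key.kind, key.area, key.claim, key.subject]);
  return fnv1a32(canonical);
}
```

Leaving severity, confidence, and rationale out of the hash input is what lets a later run regrade a finding without it being treated as a new one.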
The legacy `createTraceAnalystAdapter` is now `@deprecated` (kept for one
minor release so consumers can migrate). New code should use kinds first.
Total: 1146/1146 tests pass, typecheck clean. Ax 19's `.d.ts` is missing
some optimizer-class exports that 21.x ships; the optimizer-fit pipeline
lands in a follow-up after the Ax bump.
drewstone added a commit that referenced this pull request on May 19, 2026
…56)

* feat(analyst): registry + findings envelope over existing primitives

Adds a generic, model-agnostic, transport-agnostic Analyst layer that orchestrates agent-eval's existing analyzers without re-implementing them. One contract, one runner, one persistence path: reusable by VB operator bench, the leaderboard submission pipeline, and the orchestrator on-completion reports surface with the same code.

- `src/analyst/types.ts`: `Analyst` contract, `AnalystFinding` envelope with sha-stable `finding_id`, `AnalystRunInputs` with `inputKind` routing (trace-store | artifact-dir | run-record | judge-input | custom)
- `src/analyst/chat-client.ts`: `ChatClient` abstraction over router | sandbox-sdk | cli-bridge | direct-provider | mock so analyst code never depends on the transport
- `src/analyst/registry.ts`: register/list/run with input routing, per-analyst isolation (one failure does not stop others), budget split, per-analyst telemetry
- `src/analyst/findings-store.ts`: locked JSONL append + `diffFindings` (appeared / disappeared / persisted / changed) keyed by stable id
- `src/analyst/adapters.ts`: five thin lifters wrapping `analyzeTraces`, `MultiLayerVerifier`, `RunCritic`, `JudgeFn`, `SemanticConceptJudge`
- `src/analyst/analyst.test.ts`: 12 tests covering hash stability, registration validation, routing, failure isolation, only/skip, cost attribution, store round-trip, diff semantics, mock transport

Version: 0.28.0

* refactor(analyst): hook + policy surface for cross-cutting concerns

Lifts five reviewer concerns into a small policy surface so consumers override what they need without changing the registry. All defaults preserve previous behavior.

- types.ts: `AnalystContext.chat` is now `ChatClient` (was `LlmClient`). Drops the cast in registry.run() and matches what the PR promised: analyst code is transport-agnostic by contract, not by convention.
- chat-client.ts: `wrapLlmClient` races the in-flight call against `ChatCallOpts.signal`. Awaiting code unblocks on abort; the in-flight HTTP request is still bounded by `timeoutMs` (LlmClient doesn't yet accept an external AbortSignal; documented inline).
- registry.ts:
  * `AnalystHooks`: `onBeforeAnalyze`, `onAfterAnalyze`, `onError`, `onComplete`. `onError` MAY return findings to convert a crash into structured findings; `onAfterAnalyze` runs for ok | failed | skipped. This is the seam for telemetry, cost ingestion, storage rotation, error-to-finding conversion, all without registry changes.
  * `BudgetPolicy`: `{ totalUsd, weights, allocate }`. Default still equal-split; `allocate` is the precise hook when weights aren't enough.
- findings-store.ts: `diffFindings(prev, cur, { isMaterial })`. Default materiality test (severity / confidence Δ > 0.05 / evidence count) is exported as `defaultIsMaterial` so consumers can layer stricter predicates without re-implementing the base.
- 8 new tests cover hook ordering, error→finding conversion, skipped hooks, equal-split + weighted budget, default + custom diff policy, signal racing.

1125/1125 tests pass.

* chore(analyst): biome format + organize imports

* feat(analyst): typed kinds for failure + recursive self-improvement (on top of #56) (#57)
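
The `diffFindings(prev, cur, { isMaterial })` semantics described in the refactor commit above can be sketched as follows. This is an illustration under assumptions, not the code in `findings-store.ts`: the `FindingSketch` shape (including the `evidence` field name) and the shape of the returned diff are invented for the example, while the four buckets and the default materiality test (severity, confidence Δ > 0.05, evidence count) follow the commit text.

```typescript
interface FindingSketch {
  finding_id: string;
  severity: string;
  confidence: number;
  evidence: string[]; // assumed field name for the evidence list
}

type IsMaterial = (prev: FindingSketch, cur: FindingSketch) => boolean;

// Default test from the commit text: severity, confidence delta > 0.05, evidence count.
const defaultIsMaterial: IsMaterial = (prev, cur) =>
  prev.severity !== cur.severity ||
  Math.abs(prev.confidence - cur.confidence) > 0.05 ||
  prev.evidence.length !== cur.evidence.length;

interface FindingsDiff {
  appeared: FindingSketch[];    // in cur only
  disappeared: FindingSketch[]; // in prev only
  persisted: FindingSketch[];   // in both, no material change
  changed: Array<{ prev: FindingSketch; cur: FindingSketch }>;
}

function diffFindings(
  prev: readonly FindingSketch[],
  cur: readonly FindingSketch[],
  opts: { isMaterial?: IsMaterial } = {},
): FindingsDiff {
  const isMaterial = opts.isMaterial ?? defaultIsMaterial;
  const prevById = new Map(prev.map((f) => [f.finding_id, f] as const));
  const curIds = new Set(cur.map((f) => f.finding_id));
  const diff: FindingsDiff = { appeared: [], disappeared: [], persisted: [], changed: [] };

  for (const finding of cur) {
    const before = prevById.get(finding.finding_id);
    if (before === undefined) diff.appeared.push(finding);
    else if (isMaterial(before, finding)) diff.changed.push({ prev: before, cur: finding });
    else diff.persisted.push(finding);
  }
  for (const finding of prev) {
    if (!curIds.has(finding.finding_id)) diff.disappeared.push(finding);
  }
  return diff;
}
```

A consumer that wants a stricter policy can pass its own predicate, for example one that calls `defaultIsMaterial` first and then also flags any confidence change at all.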
drewstone added a commit that referenced this pull request on May 19, 2026
… priorFindings

Wraps two unreleased features into a single minor:

- PR #57: kind-factory + 4 trace-analyst kinds (failure-mode, knowledge-gap, knowledge-poisoning, improvement) via Ax structured output + Zod-validated `RawAnalystFinding`.
- PR #58: priorFindings context wiring so kinds chain across runs ('improvement' sees what 'failure-mode' surfaced).

createTraceAnalystAdapter retains @deprecated marker pointing at the new kinds; kept for one minor while consumers migrate.

1153/1153 tests green.
Stacks on #56 (`feat/analyst-registry-and-findings`).

## What this adds

Replaces the legacy `createTraceAnalystAdapter` (string-bullet findings, flat severity/confidence defaults) with a typed-kind architecture anchored on Ax structured output + Zod validation, plus four default kinds focused on agent failure and recursive self-improvement, not cost.

## Substrate

- `finding-signature.ts`: strict Zod schema for one analyst-emitted row (severity / claim / subject / `evidence_uri` / confidence / rationale / recommended_action). `RAW_FINDING_SCHEMA_PROMPT` embeds the contract into every kind's actor prompt; rows that fail Zod are logged and dropped, not silently lifted with flat defaults.
- `kind-factory.ts`: `createTraceAnalystKind(spec, { ai })` wires Ax `agent(...)` with `findings:json[]` output, the JS-runtime sandbox, advanced-mode recursion when `maxDepth>0`, and the per-kind tool subset. `versionSuffix` reserves the hook for optimizer-fitted prompts.
- `tool-groups.ts`: five named subsets (`all` / `discovery` / `discoveryAndRead` / `discoveryAndSearch` / `targeted`) so kinds take only the tools they need. Unknown group names throw.

## Default kinds (in dependency order)

- knowledge-gap: anchored on `@tangle-network/agent-knowledge` (`wiki:<page>`, `claim:<topic>`, `raw:<source>`, `stale:<page>`), with secondary loci `websearch:outdated:*`, `tool-doc:*`, `system-prompt:*`, `memory:*`. Subagents fan out per layer.
- improvement: cross-references upstream findings via `evidence_uri: "finding://<id>"`.

## Tests

21 new in `kinds/kinds.test.ts` covering:

- `parseRawFinding`: null + log on failure, typed value on success

Full suite: 1146/1146 pass, typecheck clean, `pnpm build` clean.

## Why kinds, not one big trace-analyst

- `finding://<id>`: the improvement kind chains on top of upstream findings. The registry already supports the `finding` evidence kind; this PR makes it useful.

## Deferred

- Ax 19's `.d.ts` is missing optimizer-class exports that 21.x ships. Goldens are kept as data on `TraceAnalystKindSpec` so the hook lands in a small follow-up after the Ax bump.
- `streamingForward` exists in Ax; kinds use `forward` for now. Switching is mechanical once any kind benefits from progressive emission.

## Test plan

- `pnpm typecheck`: clean
- `pnpm vitest run src/analyst`: 41/41 pass
- `pnpm test`: 1146/1146 pass
- `pnpm build`: clean