feat(analyst): typed kinds for failure + recursive self-improvement (on top of #56)#57

Merged
drewstone merged 1 commit into feat/analyst-registry-and-findings from feat/analyst-typed-kinds
May 19, 2026

@drewstone
Contributor

Stacks on #56 (feat/analyst-registry-and-findings).

What this adds

Replaces the legacy createTraceAnalystAdapter (string-bullet findings, flat severity/confidence defaults) with a typed-kind architecture anchored on Ax structured-output + Zod validation, plus four default kinds focused on agent failure and recursive self-improvement — not cost.

Substrate

  • finding-signature.ts — strict Zod schema for one analyst-emitted row (severity / claim / subject / evidence_uri / confidence / rationale / recommended_action). RAW_FINDING_SCHEMA_PROMPT embeds the contract into every kind's actor prompt; rows that fail Zod are logged and dropped, not silently lifted with flat defaults.
  • kind-factory.ts: createTraceAnalystKind(spec, { ai }) wires Ax agent(...) with findings:json[] output, the JS-runtime sandbox, advanced-mode recursion when maxDepth>0, and the per-kind tool subset. versionSuffix reserves the hook for optimizer-fitted prompts.
  • tool-groups.ts — five named subsets (all / discovery / discoveryAndRead / discoveryAndSearch / targeted) so kinds take only the tools they need. Unknown group names throw.
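
To make the "logged and dropped" behaviour concrete, here is a dependency-free sketch of strict row validation. The PR uses Zod for this; the sketch swaps in hand-written checks so it runs standalone. The severity levels, the 0 to 1 confidence range, the requirement that every string field is non-empty, and the names `validateRow` and `parseRows` are assumptions for illustration, not the contents of finding-signature.ts.

```typescript
// Hypothetical stand-in for the strict Zod schema (assumed severity levels and ranges).
const SEVERITIES = ["low", "medium", "high", "critical"] as const;
type Severity = (typeof SEVERITIES)[number];

interface RawFinding {
  severity: Severity;
  claim: string;
  subject: string;
  evidence_uri: string;
  confidence: number;
  rationale: string;
  recommended_action: string;
}

const STRING_FIELDS = ["claim", "subject", "evidence_uri", "rationale", "recommended_action"] as const;
const ALLOWED = new Set<string>(["severity", "confidence", ...STRING_FIELDS]);

type RowResult = { ok: true; value: RawFinding } | { ok: false; reason: string };

// Returns the typed row, or a rejection reason when validation fails.
function validateRow(row: unknown): RowResult {
  if (typeof row !== "object" || row === null || Array.isArray(row)) {
    return { ok: false, reason: "row is not an object" };
  }
  const r = row as Record<string, unknown>;
  // Strict mode: extra keys are rejected rather than silently stripped.
  const extra = Object.keys(r).filter((k) => !ALLOWED.has(k));
  if (extra.length > 0) return { ok: false, reason: `unknown fields: ${extra.join(", ")}` };
  const severity = r["severity"];
  if (typeof severity !== "string" || !(SEVERITIES as readonly string[]).includes(severity)) {
    return { ok: false, reason: `unknown severity: ${String(severity)}` };
  }
  const confidence = r["confidence"];
  if (typeof confidence !== "number" || !(confidence >= 0 && confidence <= 1)) {
    return { ok: false, reason: `confidence out of range: ${String(confidence)}` };
  }
  for (const f of STRING_FIELDS) {
    const v = r[f];
    if (typeof v !== "string" || v.length === 0) {
      return { ok: false, reason: `missing or empty field: ${f}` };
    }
  }
  return { ok: true, value: r as unknown as RawFinding };
}

// Keeps valid rows; logs and drops the rest instead of filling in defaults.
function parseRows(rows: unknown[], log: (msg: string) => void = console.warn): RawFinding[] {
  const kept: RawFinding[] = [];
  rows.forEach((row, i) => {
    const res = validateRow(row);
    if (res.ok) kept.push(res.value);
    else log(`dropped row ${i}: ${res.reason}`);
  });
  return kept;
}
```

The point of the design is that a malformed row never becomes a finding with invented severity or confidence; it only leaves a log line behind.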

Default kinds (in dependency order)

| Kind | maxDepth | Brief |
|---|---|---|
| failure-mode | 3 | Clusters dataset failures into distinct modes with cited evidence. RLM fans out one subagent per cluster; confounded clusters split again next level. |
| knowledge-gap | 2 | Names missing/stale knowledge attributed to the runtime layer that should have held it. Anchored on @tangle-network/agent-knowledge (wiki:<page>, claim:<topic>, raw:<source>, stale:<page>), with secondary loci websearch:outdated:*, tool-doc:*, system-prompt:*, memory:*. Subagents fan out per layer. |
| knowledge-poisoning | 2 | Confident-but-wrong actions. DUAL-VERIFY protocol: subagents prove both (i) the agent acted on the belief AND (ii) the belief is false in this trace's evidence. Single-evidence findings get dropped. |
| improvement | 3 | Converts upstream findings into concrete locus-named edits. DISCOVERY → CANDIDATE-FIXES → COMPETE → CITE: competing fix candidates simulated by subagents; winning candidate per cluster emitted with leverage grade, rationale, and literal edit phrased as a diff. Cross-references upstream findings via evidence_uri: "finding://<id>". |
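
The DUAL-VERIFY rule for knowledge-poisoning can be read as a simple filter: a candidate survives only when both halves have evidence. In the PR the protocol lives in the kind's prompt and is carried out by subagents, so the sketch below only illustrates the acceptance rule. The `PoisoningCandidate` shape and the `dualVerify` name are hypothetical.

```typescript
// Hypothetical shape: one candidate with the two proofs DUAL-VERIFY asks for.
interface PoisoningCandidate {
  claim: string;
  actedOnEvidence: string[];     // evidence that the agent acted on the belief
  falsifiedByEvidence: string[]; // evidence that the belief is false in this trace
}

// Keeps only candidates where both halves are proven; single-evidence ones are dropped.
function dualVerify(candidates: PoisoningCandidate[]): PoisoningCandidate[] {
  return candidates.filter(
    (c) => c.actedOnEvidence.length > 0 && c.falsifiedByEvidence.length > 0,
  );
}
```

Requiring both halves separates a belief that was merely wrong from one that was wrong and actually drove an action.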

Tests

21 new in kinds/kinds.test.ts covering:

  • Zod schema: rejects out-of-range confidence, unknown severity, extra fields; logs rejection reason
  • parseRawFinding: null + log on failure, typed value on success
  • Default suite shape, run order, recursion budgets, finding_id stability
  • Per-kind prompt invariants (clustering for failure-mode, dual-verify for poisoning, agent-knowledge anchoring for gaps)
  • Tool-group filtering and unknown-name throw
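
As a rough illustration of the tool-group behaviour these tests cover, the sketch below resolves a named group to a tool subset and throws on an unknown name. The five group names come from this PR; the tool names, the group membership, and the `resolveToolGroup` helper are placeholders, not the contents of tool-groups.ts.

```typescript
// Group names are from the PR; the tool names and membership are placeholders.
const TOOL_GROUPS: Record<string, readonly string[]> = {
  all: ["listTraces", "getOverview", "readTrace", "searchTraces"],
  discovery: ["listTraces", "getOverview"],
  discoveryAndRead: ["listTraces", "getOverview", "readTrace"],
  discoveryAndSearch: ["listTraces", "getOverview", "searchTraces"],
  targeted: ["readTrace", "searchTraces"],
};

// Returns only the tools that belong to the named group.
function resolveToolGroup<T extends { name: string }>(tools: T[], group: string): T[] {
  // Unknown names throw: falling back to "all" would hide a typo and widen access.
  if (!Object.prototype.hasOwnProperty.call(TOOL_GROUPS, group)) {
    throw new Error(`unknown tool group: ${group}`);
  }
  const allowed = TOOL_GROUPS[group];
  return tools.filter((t) => allowed.includes(t.name));
}
```

Throwing instead of defaulting keeps a misspelled group name from quietly handing a kind every tool.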

Full suite: 1146/1146 pass, typecheck clean, pnpm build clean.

Why kinds, not one big trace-analyst

  • Typed outputs — severity / confidence / evidence become first-class instead of flat-defaulted at the adapter layer.
  • Narrow tools per kind — failure-mode gets full discovery; knowledge-gap only needs search; targeted analysts skip full-trace dumps.
  • Real RLM depth — recursion budgets per kind (2 or 3) so the LLM actually fans out instead of doing all work at the root.
  • Composable via finding://<id> — improvement kind chains on top of upstream findings. The registry already supports the finding evidence kind; this PR makes it useful.
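
The `finding://<id>` composition in the last bullet might look like the following when resolved into dependency edges. The `finding://` scheme and the `finding_id` and `evidence_uri` fields come from this PR; the `Finding` shape, `parseFindingUri`, and `dependencyEdges` are hypothetical names for the sketch.

```typescript
const FINDING_SCHEME = "finding://";

// Returns the upstream finding id, or null when the URI is another evidence kind.
function parseFindingUri(uri: string): string | null {
  if (!uri.startsWith(FINDING_SCHEME)) return null;
  const id = uri.slice(FINDING_SCHEME.length);
  return id.length > 0 ? id : null;
}

// Hypothetical minimal finding shape for the sketch.
interface Finding {
  finding_id: string;
  kind: string;
  evidence_uri: string;
}

interface Edge {
  from: string; // downstream finding (for example an improvement)
  to: string;   // upstream finding it cites
}

// Builds one edge per finding that cites a known upstream finding.
function dependencyEdges(findings: Finding[]): Edge[] {
  const known = new Set(findings.map((f) => f.finding_id));
  const edges: Edge[] = [];
  for (const f of findings) {
    const upstream = parseFindingUri(f.evidence_uri);
    // Dangling references are skipped here; a real registry might report them.
    if (upstream !== null && known.has(upstream)) {
      edges.push({ from: f.finding_id, to: upstream });
    }
  }
  return edges;
}
```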

Deferred

  • Ax optimizer hook (MIPRO / GEPA / Bootstrap) — Ax 19.0.45 .d.ts is missing optimizer-class exports that 21.x ships. Goldens are kept as data on TraceAnalystKindSpec so the hook lands in a small follow-up after Ax bump.
  • Streaming forward: streamingForward exists in Ax; kinds use forward for now. Switching is mechanical once any kind benefits from progressive emission.

Test plan

  • pnpm typecheck — clean
  • pnpm vitest run src/analyst — 41/41 pass
  • pnpm test — 1146/1146 pass
  • pnpm build — clean
  • Wire one kind end-to-end against a real trace dataset in PR-B (blueprint-agent)

…-improvement

The original PR-A adapter (`createTraceAnalystAdapter`) shipped a single
generic flow whose output was `findings:string[]` — every bullet became
a flat-severity `medium` / confidence `0.6` `AnalystFinding`, losing the
per-finding grading the analyzer LLM is perfectly capable of producing.

This commit adds the typed-kind architecture on top:

  finding-signature.ts
    Strict Zod schema for one analyst-emitted row (severity / claim /
    subject / evidence_uri / confidence / rationale / recommended_action).
    `RAW_FINDING_SCHEMA_PROMPT` embeds the shape into every kind's actor
    prompt; the LLM emits structured JSON; the factory Zod-validates each
    row at the boundary and drops malformed rows with a logged reason.

  kind-factory.ts
    `createTraceAnalystKind(spec, { ai })` lifts a kind spec into the
    existing `Analyst<TraceAnalysisStore>` contract. Wires Ax `agent(...)`
    with `findings:json[]` output, AxJSRuntime sandbox, advanced-mode
    recursion when `maxDepth>0`, and the per-kind tool subset. `versionSuffix`
    is reserved for prompt-optimization artifacts (MIPRO/GEPA needs an Ax
    version bump first; the wire-up lands in a follow-up).

  tool-groups.ts
    Five named subsets — `all`, `discovery`, `discoveryAndRead`,
    `discoveryAndSearch`, `targeted` — so each kind takes only what it
    needs from the seven trace-analyst tools. Unknown group names throw
    (silent-all would defeat the cost-control point).

  kinds/ — four default kinds, ordered for dependency-aware runs:

    1. failure-mode (maxDepth 3, parallel 4) — clusters dataset failures
       into distinct modes with cited evidence. Discovery → cluster → cite
       protocol. Aggressive RLM delegation: one subagent per cluster, and
       confounded clusters split again at the next level.

    2. knowledge-gap (maxDepth 2, parallel 4) — names the specific
       information the agent lacked or that was stale, attributed to the
       runtime layer that should have surfaced it. Anchored on
       `@tangle-network/agent-knowledge` (wiki page / claim / raw source
       loci) with secondary loci for `websearch:outdated:*`, `tool-doc:*`,
       `system-prompt:*`, `memory:*`. Subagents fan out per layer.

    3. knowledge-poisoning (maxDepth 2, parallel 4) — finds confident-
       but-wrong actions. DUAL-VERIFY protocol: subagents prove (i) the
       agent acted on the belief and (ii) the belief is false in this
       trace's evidence. Only findings with both halves proven survive.

    4. improvement (maxDepth 3, parallel 4) — converts upstream findings
       into concrete locus-named edits. DISCOVERY → CANDIDATE-FIXES →
       COMPETE → CITE: subagents simulate competing fix candidates per
       cluster; the winning candidate per cluster is emitted with leverage
       grade, rationale, and a literal edit phrased as a diff. Cross-
       references upstream findings via `evidence_uri: "finding://<id>"`
       so the dependency graph renders.

  Tests (21 new in `kinds/kinds.test.ts`):
    - Zod schema rejects out-of-range confidence / unknown severity /
      extra fields (strict mode), logs the rejection reason
    - `parseRawFinding` returns null + logs on failure, typed value on success
    - default suite emits the four kinds in run order
    - every kind exercises Ax recursion (maxDepth ≥ 1)
    - improvement has the maximum recursion depth (competing candidate fixes)
    - knowledge-gap prompt anchors on agent-knowledge + websearch + tool-doc,
      not generic RAG
    - knowledge-poisoning enforces dual-verify
    - failure-mode requires clustering, not enumeration
    - tool groups filter narrowly; unknown name throws
    - `versionSuffix` appends to kind version (for future optimizer pins)
    - `finding_id` stable across runs for same kind + area + claim + subject

  The legacy `createTraceAnalystAdapter` is now `@deprecated` (kept one
  minor for consumer migration). New code should reach for kinds first.

Total: 1146/1146 tests pass, typecheck clean. Ax 19's `.d.ts` is missing
some optimizer-class exports that 21.x ships; the optimizer-fit pipeline
lands in a follow-up after the Ax bump.
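
One way to get the `finding_id` stability the tests above assert (same kind + area + claim + subject gives the same id) is to hash a canonical encoding of those four fields. This is a sketch under assumptions: the hash algorithm, the JSON canonicalisation, the 16-character truncation, and the `findingId` name are illustrative and not taken from the registry code.

```typescript
import { createHash } from "node:crypto";

// The four identity fields named in the test bullet above.
interface FindingKey {
  kind: string;
  area: string;
  claim: string;
  subject: string;
}

// Derives a deterministic id from the identity fields only, so severity,
// confidence, or rationale changes between runs do not change the id.
function findingId(key: FindingKey): string {
  // A JSON array keeps field boundaries unambiguous, unlike plain concatenation.
  const canonical = JSON.stringify([key.kind, key.area, key.claim, key.subject]);
  return createHash("sha256").update(canonical).digest("hex").slice(0, 16);
}
```

Keeping graded fields out of the hash is what lets a diff across runs report a finding as changed rather than as one that disappeared and another that appeared.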
drewstone merged commit 35cbff8 into feat/analyst-registry-and-findings on May 19, 2026
drewstone added a commit that referenced this pull request May 19, 2026
…56)

* feat(analyst): registry + findings envelope over existing primitives

Adds a generic, model-agnostic, transport-agnostic Analyst layer that
orchestrates agent-eval's existing analyzers without re-implementing
them. One contract, one runner, one persistence path — reusable by VB
operator bench, the leaderboard submission pipeline, and the orchestrator
on-completion reports surface with the same code.

- `src/analyst/types.ts` — `Analyst` contract, `AnalystFinding` envelope
  with sha-stable `finding_id`, `AnalystRunInputs` with `inputKind`
  routing (trace-store | artifact-dir | run-record | judge-input | custom)
- `src/analyst/chat-client.ts` — `ChatClient` abstraction over
  router | sandbox-sdk | cli-bridge | direct-provider | mock so analyst
  code never depends on the transport
- `src/analyst/registry.ts` — register/list/run with input routing,
  per-analyst isolation (one failure does not stop others), budget split,
  per-analyst telemetry
- `src/analyst/findings-store.ts` — locked JSONL append + `diffFindings`
  (appeared / disappeared / persisted / changed) keyed by stable id
- `src/analyst/adapters.ts` — five thin lifters wrapping `analyzeTraces`,
  `MultiLayerVerifier`, `RunCritic`, `JudgeFn`, `SemanticConceptJudge`
- `src/analyst/analyst.test.ts` — 12 tests covering hash stability,
  registration validation, routing, failure isolation, only/skip,
  cost attribution, store round-trip, diff semantics, mock transport

Version: 0.28.0

* refactor(analyst): hook + policy surface for cross-cutting concerns

Lifts five reviewer concerns into a small policy surface so consumers
override what they need without changing the registry. All defaults
preserve previous behavior.

- types.ts: `AnalystContext.chat` is now `ChatClient` (was `LlmClient`).
  Drops the cast in registry.run() and matches what the PR promised —
  analyst code is transport-agnostic by contract, not by convention.

- chat-client.ts: `wrapLlmClient` races the in-flight call against
  `ChatCallOpts.signal`. Awaiting code unblocks on abort; the in-flight
  HTTP request still bounds by `timeoutMs` (LlmClient doesn't yet
  accept an external AbortSignal — documented inline).

- registry.ts:
  * `AnalystHooks` — `onBeforeAnalyze`, `onAfterAnalyze`, `onError`,
    `onComplete`. `onError` MAY return findings to convert a crash into
    structured findings; `onAfterAnalyze` runs for ok | failed | skipped.
    This is the seam for telemetry, cost ingestion, storage rotation,
    error-to-finding conversion — all without registry changes.
  * `BudgetPolicy` — `{ totalUsd, weights, allocate }`. Default still
    equal-split; `allocate` is the precise hook when weights aren't
    enough.

- findings-store.ts: `diffFindings(prev, cur, { isMaterial })`. Default
  materiality test (severity / confidence Δ > 0.05 / evidence count) is
  exported as `defaultIsMaterial` so consumers can layer stricter
  predicates without re-implementing the base.

- 8 new tests cover hook ordering, error→finding conversion, skipped
  hooks, equal-split + weighted budget, default + custom diff policy,
  signal racing.
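
The budget behaviour described above (equal split by default, optional weights, and an `allocate` override) could be sketched as follows. The field names `totalUsd`, `weights`, and `allocate` come from the commit message; their types, the `allocate` signature, and the `allocateBudget` helper are assumptions for illustration.

```typescript
// Field names follow the commit message; the types are assumed.
interface BudgetPolicy {
  totalUsd: number;
  weights?: Record<string, number>;
  allocate?: (analystIds: string[], totalUsd: number) => Record<string, number>;
}

// Splits the total budget across analysts, in USD per analyst id.
function allocateBudget(analystIds: string[], policy: BudgetPolicy): Record<string, number> {
  if (analystIds.length === 0) return {};
  // A custom allocator takes precedence over weights.
  if (policy.allocate) return policy.allocate(analystIds, policy.totalUsd);
  // A missing weight defaults to 1, so no weights at all yields an equal split.
  const weights = analystIds.map((id) => policy.weights?.[id] ?? 1);
  const sum = weights.reduce((a, b) => a + b, 0);
  if (sum <= 0) throw new Error("weights must sum to a positive number");
  const out: Record<string, number> = {};
  analystIds.forEach((id, i) => {
    out[id] = (policy.totalUsd * weights[i]) / sum;
  });
  return out;
}
```

With this shape, weights cover proportional cases and `allocate` remains the escape hatch for rules that are not proportional, such as a fixed floor per analyst.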

1125/1125 tests pass.

* chore(analyst): biome format + organize imports

* feat(analyst): typed kinds for failure + recursive self-improvement (on top of #56) (#57)

drewstone added a commit that referenced this pull request May 19, 2026
… priorFindings

Wraps two unreleased features into a single minor:
- PR #57: kind-factory + 4 trace-analyst kinds (failure-mode,
  knowledge-gap, knowledge-poisoning, improvement) via Ax structured
  output + Zod-validated `RawAnalystFinding`.
- PR #58: priorFindings context wiring so kinds chain across runs
  ('improvement' sees what 'failure-mode' surfaced).

createTraceAnalystAdapter retains @deprecated marker pointing at the
new kinds; kept for one minor while consumers migrate.

1153/1153 tests green.