feat(0.11.0): defineAgent + surfaces-driven adapters + outcome measurement#16
Merged
Conversation
…ement
The substrate redesign. Closes every universal failure flagged by the audit
of the 4 per-agent draft PRs:
1. Subject-routing dead code → `parseFindingSubject` (agent-eval 0.30.0) +
`resolveSubjectPath(subject, surfaces, repoRoot)` resolve a typed
FindingSubject to a real file path. No `startsWith(...)` prose matching.
2. ImprovementAdapter has no apply → `createSurfaceImprovementAdapter`
ships a first-class apply with two modes:
`write` — `git apply -p0` in-place; operator reviews via git diff
`open-pr` — branch + commit + push + `gh pr create`
plus a `none` mode for report-only runs. Race-checked via SHA-256 of
the file content the patch was drafted against.
3. No outcome measurement → `measureOutcome(result, opts)` re-runs the
cohort after apply, computes composite delta, optionally rolls back
applied paths on regression. Process-counts-only reporting is gone.
4. Fabricated file paths → `validateSurfaces` runs at `defineAgent` time
and throws `AgentManifestError` listing every missing surface. A
manifest that ships broken can't get past `pnpm typecheck`.
5. Per-vertical ImprovementAdapter code (~150 lines × N agents) →
`createSurfaceImprovementAdapter(opts)` is the ONE adapter every
vertical uses. Per-agent customization happens at the manifest level
(`surfaces` + `autoApply`), not by writing a new adapter.
6. Knowledge proposals lacking lint → `createSurfaceKnowledgeAdapter`
wraps agent-knowledge's `applyKnowledgeWriteBlocks` with an optional
`lintAfterApply` hook. Wiki drift surfaces in `warnings` immediately.
API surface (new sub-export `@tangle-network/agent-runtime/agent`):
- `defineAgent<TPersona, TRunOutput>(manifest)` — typed, validated factory
- `AgentSurfaces`, `validateSurfaces`, `resolveSubjectPath` — surface map +
Subject→Path resolver
- `createSurfaceImprovementAdapter(opts)` — LLM-drafted patches +
`git apply` / `gh pr create` apply
- `createSurfaceKnowledgeAdapter(opts, deps)` — agent-knowledge integration
with post-apply lint
- `measureOutcome(result, opts)` — before/after cohort delta + rollback
- `AgentManifestError` — fail-loud manifest validation
Bumps agent-eval dep to ^0.30.0 (FindingSubject lives there).
Tests: 124/124 pass (27 new under `tests/agent.test.ts`) covering manifest
validation, subject-resolution for every surface variant, propose/apply
error paths, race-detection via SHA mismatch, and outcome rollback on
regression.
Per-vertical PRs now collapse from ~700 lines of glue to a ~50-line
`defineAgent({...})` call + the substrate's default adapters. Tracking
that cascade in follow-up PRs per repo.
drewstone
added a commit
that referenced
this pull request
May 20, 2026
drewstone
added a commit
that referenced
this pull request
May 20, 2026
Two follow-ups missed by the #16 merge window — both required by the per-agent manifest PRs (tax #69, legal #70, gtm #122, creative #98) to edit existing system-prompt skill directories rather than fabricating flat <section>.md siblings. Multi-candidate path probing (surfaces.ts): resolveSubjectPath now probes candidates in order rather than picking a single path. system-prompt:<section> resolves to: <surfaces.systemPrompt>/<section>.md (canonical create-new) <surfaces.systemPrompt>/<section>/SKILL.md (tax/legal/gtm/creative) <surfaces.systemPrompt>/<section>/index.md (flat-md repos) tool-doc:<tool> probes <tool>/README.md then <tool>.md. First hit wins for edit-existing; first candidate is the canonical create-new target. Streaming-first AgentRuntime.act (define-agent.ts): Returns AgentRunInvocation { events, output } instead of Promise<T>. Preserves the chat-centric product's streaming surface through the substrate: - events: AsyncIterable<RuntimeStreamEvent> — chat UX consumes verbatim (SSE / WebSocket / inline render). runChatTurn plugs in directly. - output: Promise<TRunOutput> — resolves after stream drains; eval substrate awaits this; chat UX ignores (already rendered). Helpers: unimplementedAgentRun(reason?) — stub manifests use until their eval path is fully wired; yields no events, rejects output loudly. collectAgentRun(invocation) — eval-path drain helper; chat UX MUST NOT call this (defeats streaming). 133/133 tests pass (6 new: streaming contract + multi-candidate probing). Bumps to 0.11.1.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The substrate redesign. Closes every universal failure flagged by the audit of the four per-agent draft PRs (now closed):
Subject-routing dead code → typed `FindingSubject` (agent-eval 0.30.0) + `resolveSubjectPath(subject, surfaces, repoRoot)` resolves to a real file path. No `startsWith(...)` prose matching.
ImprovementAdapter has no apply → `createSurfaceImprovementAdapter` ships first-class apply with three modes:
Race-checked via SHA-256 of the file content the patch was drafted against.
No outcome measurement → `measureOutcome(result, opts)` re-runs the cohort after apply, computes composite delta, optionally rolls back applied paths on regression.
Fabricated file paths → `validateSurfaces` runs at `defineAgent` time and throws `AgentManifestError` listing every missing surface. Broken manifests can't ship.
Per-vertical ImprovementAdapter duplication → `createSurfaceImprovementAdapter(opts)` is the ONE adapter every vertical uses. Customization moves to the manifest's `surfaces` + `autoApply`, not new adapter code.
Knowledge proposals lacking lint → `createSurfaceKnowledgeAdapter` wraps `applyKnowledgeWriteBlocks` with an optional `lintAfterApply` hook.
API surface
New sub-export `@tangle-network/agent-runtime/agent` ships:
Why the redesign
The audit of the four draft PRs found the same structural failures across every one: subject grammar unenforced, no apply on the improvement side, no outcome metric, fabricated paths, per-vertical glue that doesn't scale past 4 verticals — let alone the 1000s the substrate needs to serve. Fixing those in-place across 4 PRs would have shipped the same theater; fixing the substrate fixes them by construction for every future agent.
After this lands, per-vertical PRs collapse from ~700 lines of glue to a ~50-line `defineAgent({...})` call + the substrate's default adapters.
Test plan
Gates