feat: eval substrate (search-bench + generate-eval) + unify the executor/driver surfaces by drewstone · Pull Request #190 · tangle-network/agent-runtime

drewstone · 2026-06-07T20:44:09Z

Two threads plus a process-hygiene pass, all landed green (tsc: 0 errors · biome: clean · 676/676 tests passing).

1. The eval substrate — measure provider × harness × model on fresh, non-trainable tasks

The neutral measurement layer that the RSI runtime needs for grounding and that also sells as data. docs/eval-substrate.md is the north star and lists the measurement non-negotiables.

- bench/src/search-bench/: a coding-harness × web-search comparison. It provides per-arm AgentProfile builders (native / provider-MCP+native-disabled / off), deterministic oracles (exact-identifier checks, no LLM judge), a runner over the unified executor (sandbox + cli-bridge backends), and a rigorous exporter (methodology + all task definitions + per-task matrix + paired sign-test + CSV + gist).
- 21 web-verified discriminating tasks (tasks-fresh.ts) on which GPT-4.1 fails 90% parametrically, i.e. without search, which is the search-correctable headroom. The tasks were generated and adversarially verified, not hand-authored.
- bench/src/generate-eval/: the data engine as a skill + certifier + kernel loop. An agent authors a task and the runtime certifies it through two gates: the grounding gate (the reference must execute and pass against the real pinned target) and the discrimination gate (a no-tools baseline must fail). Soundness is guaranteed, production is budget-bounded, and exhaustion is loud. skills/generate-eval/SKILL.md is portable to any agent/stack.
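The two gates and the budget-bounded loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in bench/src/generate-eval/: every type and function name here (Candidate, Gates, certify, generateCertified, BudgetExhausted) is hypothetical, and the gate checks are injected as plain async callbacks.

```typescript
// Hypothetical shapes for illustration; none of these names come from the repository.
type Verdict =
  | { ok: true }
  | { ok: false; gate: "malformed" | "grounding" | "discrimination"; reason: string };

interface Candidate {
  prompt: string;
  reference: string; // the author's reference solution
}

interface Gates {
  // Grounding gate: the reference must execute and pass against the pinned target.
  referencePasses(c: Candidate): Promise<boolean>;
  // Discrimination gate: a no-tools baseline must FAIL, so this should return false.
  baselinePasses(c: Candidate): Promise<boolean>;
}

async function certify(c: Candidate | null, gates: Gates): Promise<Verdict> {
  if (c === null) {
    return { ok: false, gate: "malformed", reason: "author output did not parse" };
  }
  if (!(await gates.referencePasses(c))) {
    return { ok: false, gate: "grounding", reason: "reference failed against the pinned target" };
  }
  if (await gates.baselinePasses(c)) {
    return { ok: false, gate: "discrimination", reason: "no-tools baseline already solves the task" };
  }
  return { ok: true };
}

// Exhaustion is an error that carries every per-round verdict, so it says why it failed.
class BudgetExhausted extends Error {
  readonly verdicts: Verdict[];
  constructor(verdicts: Verdict[]) {
    super(`no certified task after ${verdicts.length} rounds`);
    Object.setPrototypeOf(this, new.target.prototype); // keep instanceof working on old targets
    this.name = "BudgetExhausted";
    this.verdicts = verdicts;
  }
}

async function generateCertified(
  author: (round: number) => Promise<Candidate | null>,
  gates: Gates,
  maxRounds: number,
): Promise<Candidate> {
  const verdicts: Verdict[] = [];
  for (let round = 0; round < maxRounds; round++) {
    const candidate = await author(round);
    const verdict = await certify(candidate, gates);
    verdicts.push(verdict);
    if (verdict.ok && candidate !== null) return candidate; // only certified tasks escape
  }
  throw new BudgetExhausted(verdicts);
}
```

In this sketch a task is returned only after passing both gates, and running out of rounds throws rather than returning a partial result.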

Honest first result (shipped as data, not marketing): on these 21 tasks, you.com is at correctness parity with the harness's native search (opencode 71/67/67, p=1.0; claude-code 65/66) and markedly more token-efficient (−36%/−61% input tokens vs opencode's page-dumping webfetch). The substrate's value is reporting where each provider wins/ties/loses, continuously.
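For readers unfamiliar with the paired sign-test behind the p-value above, here is a minimal sketch. It assumes an exact two-sided binomial test over discordant task pairs; the exporter's actual implementation is not shown in this PR and may differ, and the names pairedSignTest and binomialCoefficient are hypothetical.

```typescript
interface SignTestResult {
  wins: number;   // tasks arm A passed and arm B failed
  losses: number; // tasks arm A failed and arm B passed
  ties: number;   // tasks where both arms agree (ignored by the test)
  pValue: number; // exact two-sided p-value under H0: win and loss equally likely
}

// C(n, k) built up multiplicatively; every intermediate value is an integer,
// so this is exact for small n (a 21-task benchmark is well within range).
function binomialCoefficient(n: number, k: number): number {
  let c = 1;
  for (let i = 1; i <= k; i++) c = (c * (n - k + i)) / i;
  return c;
}

function pairedSignTest(armA: boolean[], armB: boolean[]): SignTestResult {
  if (armA.length !== armB.length) throw new Error("both arms must cover the same tasks");
  let wins = 0;
  let losses = 0;
  let ties = 0;
  for (let i = 0; i < armA.length; i++) {
    if (armA[i] === armB[i]) ties++;
    else if (armA[i]) wins++;
    else losses++;
  }
  const n = wins + losses; // only discordant pairs carry information
  if (n === 0) return { wins, losses, ties, pValue: 1 };
  const k = Math.min(wins, losses);
  let tail = 0;
  for (let i = 0; i <= k; i++) tail += binomialCoefficient(n, i);
  // Double the smaller tail and clamp to 1.
  const pValue = Math.min(1, 2 * tail * Math.pow(0.5, n));
  return { wins, losses, ties, pValue };
}
```

Under this formulation, equal numbers of wins and losses give p = 1, which is how a parity result reads: the per-task outcomes provide no evidence that either arm is more correct.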

2. Unify the executor / driver surfaces (no aliases — aggressive)

Collapsed the sprawl the survey mapped (six parallel surfaces) onto one vocabulary and one port:

- Executor / ExecutorFactory / ExecutorResult (was LeafExecutor*): "executor" is the literature-standard term; the supervision-tree "leaf" role is now a docstring.
- createDriver / DriverDecision (was createDynamicDriver): "dynamic" distinguished it from static drivers that no longer exist. dynamic.ts → driver.ts.
- SandboxClient (was LoopSandboxClient): it is no longer the loop's port name; it is the box-shaped structural contract.
- createExecutor({ backend: 'router'|'bridge'|'cli'|'sandbox', …seam }): the ONE built-in. The backend is serializable data (a profile/experiment-config/journal can name it), not an import choice. The per-backend factories are internal case-arms; the registry feeds from the same bodies; BYO agents implement Executor directly (the port stays open).
- inlineSandboxClient(factory): the ONE pseudo-box adapter. Any non-box Executor drives runLoop without re-faking a box. The three duplicated shims (router-executor.ts, generate-eval's bridge client, search-bench's bridge POST) now share it.
- Renamed across 241 sites / 47 files; all docs, the adoption skill, and examples were updated so no map disagrees with the code.
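To illustrate what "the backend is serializable data, not an import choice" means in practice, here is a small sketch. The interfaces are stand-ins, not the real ones: the actual Executor, ExecutorResult, and createExecutor signatures are not shown in this PR description, so the run method, the stub bodies, and the names createExecutorSketch and parseExecutorConfig are all assumptions.

```typescript
// Illustrative stand-ins only; the real Executor / createExecutor may differ.
const BACKENDS = ["router", "bridge", "cli", "sandbox"] as const;
type Backend = (typeof BACKENDS)[number];

interface ExecutorResult {
  backend: string;
  output: string;
}

interface Executor {
  run(prompt: string): Promise<ExecutorResult>;
}

interface ExecutorConfig {
  backend: Backend;
}

// Hypothetical per-backend body; a real one would call the router, bridge, CLI, or sandbox.
function stubExecutor(backend: Backend): Executor {
  return { run: async (prompt) => ({ backend, output: `[${backend}] ${prompt}` }) };
}

// One entrypoint; the per-backend factories are internal case-arms of the switch.
function createExecutorSketch(config: ExecutorConfig): Executor {
  switch (config.backend) {
    case "router":
      return stubExecutor("router");
    case "bridge":
      return stubExecutor("bridge");
    case "cli":
      return stubExecutor("cli");
    case "sandbox":
      return stubExecutor("sandbox");
    default: {
      // Compile-time exhaustiveness check: adding a backend without a case fails here.
      const unreachable: never = config.backend;
      throw new Error(`unknown backend: ${String(unreachable)}`);
    }
  }
}

// Because the backend is plain data, a JSON profile or journal entry can name it.
function parseExecutorConfig(json: string): ExecutorConfig {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null) throw new Error("config must be an object");
  const backend = (raw as { backend?: unknown }).backend;
  if (typeof backend !== "string" || !BACKENDS.includes(backend as Backend)) {
    throw new Error(`unsupported backend: ${String(backend)}`);
  }
  return { backend: backend as Backend };
}

// The port stays open: a bring-your-own agent implements Executor directly
// and never goes through the factory.
const byoExecutor: Executor = {
  run: async (prompt) => ({ backend: "byo", output: prompt.toUpperCase() }),
};
```

The design point this sketch tries to capture is that choosing a backend becomes a value that can be stored, diffed, and replayed, while custom agents still plug in through the same interface.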

3. Process hygiene

CLAUDE.md decluttered to the timeless contract — a top-of-file rule (pointers, not state: no gate numbers / run ids / session status; those live in .evolve/current.json + memory/) and the code-map fixed to the unified names.

Test plan

pnpm run typecheck (src) + tsc --noEmit (bench): 0 errors. pnpm run lint: clean. pnpm test: 66 files / 676 tests pass.
search-bench + generate-eval proven live end-to-end (real sandbox + cli-bridge); see docs/eval-substrate.md.
Merges clean into main.

…un env passthrough

research-gate.mts: off-sandbox research-bench leaderboard (model x web-search-provider x multi-shot) over the router -- provider-pinned /v1/search + web_fetch, then answer. Deep-cleaned onto the kernel primitives (routerChatWithUsage, runPool, appendRunRecord, adapter.judge); deleted the reinvented pool/corpus/sandbox backends. 424 -> 259 lines.

experiment.ts: sandboxAgentRun gains an optional env passthrough (merged onto OPENAI_*), letting a caller pin the in-box agent search provider (TANGLE_SEARCH_DEFAULT_PROVIDER). rsi.ts forwards SEARCH / EXA_API_KEY to the box via it.

Verified: tsc clean; SimpleQA you-arm reproduces 2/2 through the cleaned worker.

…boxClient

No aliases, hard rename across src/bench/tests (241 sites, 47 files):

- LeafExecutor → Executor, LeafExecutorFactory → ExecutorFactory, LeafResult → ExecutorResult (the literature-standard executor vocabulary; the supervision-tree 'leaf' role moves to the docstring)
- createDynamicDriver → createDriver, CreateDynamicDriverOptions → CreateDriverOptions, DynamicDecision → DriverDecision ('dynamic' distinguished it from static drivers that no longer exist)
- LoopSandboxClient → SandboxClient (no longer the loop's port name; it is the box-shaped structural contract for the sandbox substrate: lineage, fs artifacts, capabilities, MCP delegation, in-process clients)
- src/runtime/dynamic.ts → src/runtime/driver.ts

… + one pseudo-box adapter

Collapses per-backend executor factories into a single config-driven entrypoint; the backend becomes serializable DATA (a profile/experiment-config/journal can NAME it) instead of an import choice.

- createExecutor({ backend: 'router'|'bridge'|'cli'|'sandbox', …seam }) is the ONE public built-in; routerInline/sandbox/cli become internal case-arms; the registry (Supervisor's resolve-by-harness path) feeds from the same bodies.
- bridgeExecutor: the cli-bridge harness turn (model = harness selector, agent_profile = the arm's native-disable/MCP), implemented ONCE, in src/.
- inlineSandboxClient(factory): the ONE pseudo-box adapter. Any non-box Executor drives runLoop without re-faking a box. Replaces the shims each call site grew.
- generate-eval migrated onto it (deletes its bespoke bridge SandboxClient). router-executor.ts + search-bench/bridge.ts migrate next (same pattern).

The port stays OPEN: BYO agents implement Executor and never pass through here.

…ied executor

Delete the last two pseudo-box / raw-fetch duplicates:

- router-executor.ts: a BYO Executor over runResearchShot wrapped by inlineSandboxClient. It owns only the research-shot specifics, with no hand-rolled create/streamPrompt/delete box shell.
- search-bench/bridge.ts: the cell scorer's bridge POST+usage-parse becomes a createExecutor({backend:'bridge'}) call; it keeps only oracle scoring + citation extraction. Same backend the loop path uses, one implementation.
- generate-eval run loop: per-round gate verdicts are now logged (an exhaust says WHY: malformed JSON vs grounding vs discrimination).
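As a rough picture of what a pseudo-box adapter does, the sketch below wraps a non-box executor in a create/streamPrompt/delete shell. Only those three method names come from the commit message; their signatures, the Box type, the callback-based streaming, and the name inlineSandboxClientSketch are assumptions made for illustration, not the real SandboxClient contract.

```typescript
// Assumed shapes; the real Executor and SandboxClient interfaces may differ.
interface Executor {
  run(prompt: string): Promise<{ output: string }>;
}

interface Box {
  id: string;
}

interface SandboxClient {
  create(): Promise<Box>;
  streamPrompt(box: Box, prompt: string, onChunk: (chunk: string) => void): Promise<void>;
  delete(box: Box): Promise<void>;
}

// One adapter instead of a hand-rolled box shell at every call site.
function inlineSandboxClientSketch(factory: () => Executor): SandboxClient {
  const live = new Map<string, Executor>();
  let nextId = 0;
  return {
    async create(): Promise<Box> {
      const id = `inline-${nextId++}`;
      live.set(id, factory()); // each pseudo-box owns one executor instance
      return { id };
    },
    async streamPrompt(box: Box, prompt: string, onChunk: (chunk: string) => void): Promise<void> {
      const executor = live.get(box.id);
      if (executor === undefined) throw new Error(`unknown box: ${box.id}`);
      const result = await executor.run(prompt);
      onChunk(result.output); // a non-streaming executor yields a single chunk
    },
    async delete(box: Box): Promise<void> {
      live.delete(box.id);
    },
  };
}
```

A loop written against the box-shaped contract can then drive any executor, since the adapter supplies the lifecycle the executor itself does not have.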

…name

- CLAUDE.md: add the top-of-file rule (pointers, not state: no gate numbers, run ids, or session/generation status; those live in .evolve/current.json + memory/). Replace the embedded science-state paragraph with a pointer; fix the code-map to the unified names (driver.ts, Executor, createExecutor, inlineSandboxClient, SandboxClient).
- Propagate the executor/driver rename across all docs, the adoption skill, and examples so no map disagrees with the code (anti-staleness law).

drewstone added 7 commits June 6, 2026 16:36

feat(bench): search-bench + generate-eval + eval-substrate canon (pre…

bda36f4

…-unification checkpoint)

Merge remote-tracking branch 'origin/main' into feat/research-statefu…

faf5c7e

…l-loop # Conflicts: # bench/src/research-gate.mts

drewstone merged commit 7bd250c into main Jun 7, 2026
1 check passed

drewstone mentioned this pull request Jun 7, 2026

chore(release): 0.46.0 — eval substrate unification + sandbox live-path fix #192

Merged

drewstone merged 7 commits into main from feat/research-stateful-loop
