feat: eval substrate (search-bench + generate-eval) + unify the executor/driver surfaces #190

Merged: drewstone merged 7 commits into main from feat/research-stateful-loop on Jun 7, 2026

@drewstone
Contributor

Two threads, both landed green (tsc 0 · biome clean · 676/676 tests).

1. The eval substrate — measure provider × harness × model on fresh, non-trainable tasks

The neutral-measurement layer the RSI runtime needs (grounding) and that sells as data (docs/eval-substrate.md is the north star + measurement non-negotiables).

  • bench/src/search-bench/ — coding-harness × web-search comparison: per-arm AgentProfile builders (native / provider-MCP+native-disabled / off), deterministic oracles (exact-identifier checks, no LLM judge), a runner over the unified executor (sandbox + cli-bridge backends), and a rigorous exporter (methodology + all tasks defined + per-task matrix + paired sign-test + CSV + gist).
  • 21 web-verified discriminating tasks (tasks-fresh.ts) on which GPT-4.1, answering from parametric knowledge alone, has a 90% failure rate (the search-correctable headroom); the tasks are generated and adversarially verified, not hand-authored.
  • bench/src/generate-eval/ — the data engine as a skill + certifier + kernel loop: an agent authors a task; the runtime certifies it (grounding gate = reference must execute+pass against the real pinned target; discrimination gate = a no-tools baseline must fail). Soundness guaranteed, production budget-bounded, exhaustion loud. skills/generate-eval/SKILL.md is portable to any agent/stack.
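The two certification gates can be illustrated with a minimal sketch. Everything below is hypothetical: the type names, `certify`, and `exactIdentifier` are invented for illustration and are not the generate-eval API. The grounding gate is also simplified to "the reference passes its own oracle", whereas the real gate executes the reference against the pinned target.

```typescript
// Hypothetical shapes; the real generate-eval types are not shown in this PR.
type Oracle = (answer: string) => boolean;

interface CandidateTask {
  prompt: string;
  reference: string; // the authoring agent's reference answer
  oracle: Oracle; // deterministic check, no LLM judge
}

type Verdict =
  | { certified: true }
  | { certified: false; gate: "grounding" | "discrimination"; reason: string };

// A task is certified only if the reference passes AND a no-tools baseline fails.
function certify(task: CandidateTask, baselineAnswer: string): Verdict {
  if (!task.oracle(task.reference)) {
    return { certified: false, gate: "grounding", reason: "reference does not pass its own oracle" };
  }
  if (task.oracle(baselineAnswer)) {
    return { certified: false, gate: "discrimination", reason: "no-tools baseline already passes" };
  }
  return { certified: true };
}

// Exact-identifier oracle: passes only when the answer contains the identifier verbatim.
const exactIdentifier = (id: string): Oracle => (answer) => answer.includes(id);
```

Ordering the gates this way means a task whose reference is wrong is rejected before any baseline run is considered, so an unsound task can never be certified merely because the baseline happened to fail.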

Honest first result (shipped as data, not marketing): on these 21 tasks, you.com is at correctness parity with the harness's native search (opencode 71/67/67, p=1.0; claude-code 65/66) and markedly more token-efficient (−36%/−61% input tokens vs opencode's page-dumping webfetch). The substrate's value is reporting where each provider wins/ties/loses, continuously.
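For readers unfamiliar with the paired sign test behind a figure like p=1.0, here is a minimal sketch of the textbook exact two-sided version. It is not the exporter's code (which this PR does not show); `pairedSignTest` and its result shape are hypothetical names. Each task contributes one pair of pass/fail outcomes, ties are dropped, and the p-value comes from a Binomial(n, 0.5) tail over the discordant pairs.

```typescript
interface SignTestResult {
  aWins: number; // tasks arm A passed and arm B failed
  bWins: number; // tasks arm B passed and arm A failed
  ties: number; // tasks where both arms agree (ignored by the test)
  pValue: number;
}

// C(n, k), computed incrementally; exact for the small n used in a task matrix.
function binomial(n: number, k: number): number {
  let c = 1;
  for (let i = 1; i <= k; i++) c = (c * (n - k + i)) / i;
  return c;
}

// Exact two-sided sign test on paired per-task outcomes.
function pairedSignTest(a: boolean[], b: boolean[]): SignTestResult {
  if (a.length !== b.length) throw new Error("both arms must cover the same tasks");
  let aWins = 0;
  let bWins = 0;
  let ties = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] === b[i]) ties++;
    else if (a[i]) aWins++;
    else bWins++;
  }
  const n = aWins + bWins;
  if (n === 0) return { aWins, bWins, ties, pValue: 1 };
  // Sum the lower tail up to the smaller win count, then double it for two sides.
  let tail = 0;
  for (let k = 0; k <= Math.min(aWins, bWins); k++) tail += binomial(n, k);
  const pValue = Math.min(1, (2 * tail) / Math.pow(2, n));
  return { aWins, bWins, ties, pValue };
}
```

When the discordant tasks split evenly between the two arms, the p-value is 1, which is the shape of a parity result like the one reported above.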

2. Unify the executor / driver surfaces (no aliases — aggressive)

Collapsed the sprawl the survey mapped (six parallel surfaces) onto one vocabulary and one port:

  • Executor / ExecutorFactory / ExecutorResult (was LeafExecutor* — the literature-standard term; the supervision-tree "leaf" role is now a docstring).
  • createDriver / DriverDecision (was createDynamicDriver; "dynamic" distinguished it from static drivers that no longer exist); dynamic.ts → driver.ts.
  • SandboxClient (was LoopSandboxClient — no longer the loop's port name; it's the box-shaped structural contract).
  • createExecutor({ backend: 'router'|'bridge'|'cli'|'sandbox', …seam }) — the ONE built-in: the backend is serializable data (a profile/experiment-config/journal can name it), not an import choice. The per-backend factories are internal case-arms; the registry feeds from the same bodies; BYO agents implement Executor directly (the port stays open).
  • inlineSandboxClient(factory) — the ONE pseudo-box adapter: any non-box Executor drives runLoop without re-faking a box. The three duplicated shims (router-executor.ts, generate-eval's bridge client, search-bench's bridge POST) now share it.
  • Renamed across 241 sites / 47 files; all docs, the adoption skill, and examples propagated so no map disagrees with the code.
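To make "the backend is serializable data" concrete, here is a minimal sketch under stated assumptions. The `Executor` shape, `parseExecutorConfig`, and `createExecutorSketch` are hypothetical stand-ins, not the signatures in src/; the case-arms are stubs where the real ones would do I/O.

```typescript
// Hypothetical shapes; the real Executor port and createExecutor options are not shown in this PR.
type Backend = "router" | "bridge" | "cli" | "sandbox";
const BACKENDS: readonly Backend[] = ["router", "bridge", "cli", "sandbox"];

interface ExecutorConfig {
  backend: Backend;
  model?: string;
}

interface Executor {
  readonly backend: Backend;
  run(prompt: string): Promise<string>;
}

// Parse a config that arrived as text (a profile, experiment config, or journal entry).
function parseExecutorConfig(json: string): ExecutorConfig {
  const raw = JSON.parse(json) as { backend?: unknown; model?: unknown };
  const backend = BACKENDS.find((b) => b === raw.backend);
  if (backend === undefined) throw new Error(`unknown backend: ${String(raw.backend)}`);
  return typeof raw.model === "string" ? { backend, model: raw.model } : { backend };
}

// One entrypoint; each backend is an internal case-arm (stubbed here).
function createExecutorSketch(config: ExecutorConfig): Executor {
  const stub = (label: string): Executor => ({
    backend: config.backend,
    run: async (prompt) => `[${label}] ${prompt}`,
  });
  switch (config.backend) {
    case "router":
      return stub("router");
    case "bridge":
      return stub("cli-bridge");
    case "cli":
      return stub("cli");
    case "sandbox":
      return stub("sandbox");
    default: {
      // Compile-time exhaustiveness check: adding a backend without an arm fails to type-check.
      const unreachable: never = config.backend;
      throw new Error(`unhandled backend: ${String(unreachable)}`);
    }
  }
}
```

Because the selector is a plain string in a plain object, the same config can be written to a journal and replayed later without the caller importing a backend-specific factory.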

3. Process hygiene

CLAUDE.md decluttered to the timeless contract — a top-of-file rule (pointers, not state: no gate numbers / run ids / session status; those live in .evolve/current.json + memory/) and the code-map fixed to the unified names.

Test plan

  • pnpm run typecheck (src) + tsc --noEmit (bench): 0 errors. pnpm run lint: clean. pnpm test: 66 files / 676 tests pass.
  • search-bench + generate-eval proven live end-to-end (real sandbox + cli-bridge); see docs/eval-substrate.md.
  • Merges clean into main.

drewstone added 7 commits June 6, 2026 16:36
…un env passthrough

research-gate.mts: off-sandbox research-bench leaderboard (model x web-search-provider x multi-shot) over the router -- provider-pinned /v1/search + web_fetch, then answer. Deep-cleaned onto the kernel primitives (routerChatWithUsage, runPool, appendRunRecord, adapter.judge); deleted the reinvented pool/corpus/sandbox backends. 424 -> 259 lines.

experiment.ts: sandboxAgentRun gains an optional env passthrough (merged onto OPENAI_*), letting a caller pin the in-box agent search provider (TANGLE_SEARCH_DEFAULT_PROVIDER). rsi.ts forwards SEARCH / EXA_API_KEY to the box via it.
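The env passthrough can be pictured as a small merge step. This is only a sketch: `mergeBoxEnv` is a hypothetical helper, the example values are placeholders, and the precedence (caller-supplied keys layered on top of the OPENAI_* base) is an assumption read from "merged onto", not confirmed by the diff.

```typescript
type Env = Record<string, string>;

// Hypothetical helper; the real sandboxAgentRun signature is not shown in this PR.
function mergeBoxEnv(openaiEnv: Env, passthrough: Record<string, string | undefined> = {}): Env {
  const merged: Env = { ...openaiEnv };
  for (const [key, value] of Object.entries(passthrough)) {
    // Skip unset values so a key missing on the host never reaches the box as "undefined".
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}

// Placeholder values for illustration only.
const boxEnv = mergeBoxEnv(
  { OPENAI_API_KEY: "sk-placeholder", OPENAI_BASE_URL: "https://router.example/v1" },
  { TANGLE_SEARCH_DEFAULT_PROVIDER: "you", EXA_API_KEY: undefined },
);
```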

Verified: tsc clean; SimpleQA you-arm reproduces 2/2 through the cleaned worker.
…l-loop

# Conflicts:
#	bench/src/research-gate.mts
…boxClient

No aliases, hard rename across src/bench/tests (241 sites, 47 files):
- LeafExecutor → Executor, LeafExecutorFactory → ExecutorFactory,
  LeafResult → ExecutorResult (the literature-standard executor vocabulary;
  the supervision-tree 'leaf' role moves to the docstring)
- createDynamicDriver → createDriver, CreateDynamicDriverOptions →
  CreateDriverOptions, DynamicDecision → DriverDecision ('dynamic'
  distinguished it from static drivers that no longer exist)
- LoopSandboxClient → SandboxClient (no longer the loop's port name; it is
  the box-shaped structural contract for the sandbox substrate: lineage,
  fs artifacts, capabilities, MCP delegation, in-process clients)
- src/runtime/dynamic.ts → src/runtime/driver.ts
… + one pseudo-box adapter

Collapses per-backend executor factories into a single config-driven entrypoint;
the backend becomes serializable DATA (a profile/experiment-config/journal can
NAME it) instead of an import choice.

- createExecutor({ backend: 'router'|'bridge'|'cli'|'sandbox', …seam }) is the
  ONE public built-in; routerInline/sandbox/cli become internal case-arms; the
  registry (Supervisor's resolve-by-harness path) feeds from the same bodies.
- bridgeExecutor: the cli-bridge harness turn (model = harness selector,
  agent_profile = the arm's native-disable/MCP) — implemented ONCE, in src/.
- inlineSandboxClient(factory): the ONE pseudo-box adapter — any non-box Executor
  drives runLoop without re-faking a box. Replaces the shims each call site grew.
- generate-eval migrated onto it (deletes its bespoke bridge SandboxClient).
  router-executor.ts + search-bench/bridge.ts migrate next (same pattern).

The port stays OPEN: BYO agents implement Executor and never pass through here.
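A rough sketch of the pseudo-box adapter idea follows. The interfaces are hypothetical: only the method names create / streamPrompt / delete come from this PR's text, and their parameters, the `Executor` shape, and `inlineSandboxClientSketch` are invented for illustration.

```typescript
// Hypothetical shapes; the real Executor and SandboxClient contracts are richer than this.
interface Executor {
  run(prompt: string): Promise<string>;
}
type ExecutorFactory = () => Executor;

interface SandboxClient {
  create(): Promise<{ id: string }>;
  streamPrompt(boxId: string, prompt: string, onChunk: (chunk: string) => void): Promise<void>;
  delete(boxId: string): Promise<void>;
}

// Wrap any non-box Executor so a loop written against SandboxClient can drive it.
function inlineSandboxClientSketch(factory: ExecutorFactory): SandboxClient {
  const boxes = new Map<string, Executor>();
  let nextId = 0;
  return {
    async create() {
      const id = `inline-${nextId++}`;
      boxes.set(id, factory()); // one executor instance per pseudo-box
      return { id };
    },
    async streamPrompt(boxId, prompt, onChunk) {
      const executor = boxes.get(boxId);
      if (executor === undefined) throw new Error(`unknown box: ${boxId}`);
      // An inline executor returns its whole answer at once, so it is emitted as a single chunk.
      onChunk(await executor.run(prompt));
    },
    async delete(boxId) {
      boxes.delete(boxId);
    },
  };
}
```

The point of the design is that the box lifecycle bookkeeping lives in one place, so each call site supplies only an `Executor` rather than its own create/streamPrompt/delete shell.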
…ied executor

Delete the last two pseudo-box / raw-fetch duplicates:
- router-executor.ts: a BYO Executor over runResearchShot wrapped by
  inlineSandboxClient — owns only the research-shot specifics, no hand-rolled
  create/streamPrompt/delete box shell.
- search-bench/bridge.ts: the cell scorer's bridge POST+usage-parse becomes a
  createExecutor({backend:'bridge'}) call; it keeps only oracle scoring +
  citation extraction. Same backend the loop path uses, one implementation.
- generate-eval run loop: per-round gate verdicts now logged (an exhaust says
  WHY — malformed JSON vs grounding vs discrimination).
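The per-round verdict logging, together with the budget-bounded, loud-on-exhaustion behavior described in the PR summary, can be sketched as below. The names (`GateVerdict`, `runGenerateLoop`, `ExhaustedError`) are hypothetical, and the attempt callback is synchronous here for brevity where a real authoring round would be asynchronous.

```typescript
// Hypothetical verdict labels, taken from the failure kinds this commit message names.
type GateVerdict = "certified" | "malformed-json" | "grounding" | "discrimination";

interface RoundLog {
  round: number;
  verdict: GateVerdict;
}

// Exhaustion carries every round's verdict, so the failure says why, not just that it failed.
class ExhaustedError extends Error {
  rounds: RoundLog[];
  constructor(rounds: RoundLog[]) {
    const summary = rounds.map((r) => `#${r.round}=${r.verdict}`).join(", ");
    super(`budget exhausted after ${rounds.length} rounds: ${summary}`);
    this.name = "ExhaustedError";
    this.rounds = rounds;
  }
}

// Try up to `budget` rounds; return the log on the first certified task, otherwise throw.
function runGenerateLoop(attempt: (round: number) => GateVerdict, budget: number): RoundLog[] {
  const rounds: RoundLog[] = [];
  for (let round = 1; round <= budget; round++) {
    const verdict = attempt(round);
    rounds.push({ round, verdict });
    if (verdict === "certified") return rounds;
  }
  throw new ExhaustedError(rounds);
}
```

Throwing on exhaustion, rather than returning an empty result, keeps a run that produced no certified task from being mistaken for a run that produced nothing to report.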
…name

- CLAUDE.md: add the top-of-file rule (pointers, not state — no gate numbers,
  run ids, or session/generation status; those live in .evolve/current.json +
  memory/). Replace the embedded science-state paragraph with a pointer; fix
  the code-map to the unified names (driver.ts, Executor, createExecutor,
  inlineSandboxClient, SandboxClient).
- Propagate the executor/driver rename across all docs, the adoption skill,
  and examples so no map disagrees with the code (anti-staleness law).
@drewstone drewstone merged commit 7bd250c into main Jun 7, 2026
1 check passed