feat: eval substrate (search-bench + generate-eval) + unify the executor/driver surfaces #190
Merged
…un env passthrough
research-gate.mts: off-sandbox research-bench leaderboard (model x web-search-provider x multi-shot) over the router -- provider-pinned /v1/search + web_fetch, then answer. Deep-cleaned onto the kernel primitives (routerChatWithUsage, runPool, appendRunRecord, adapter.judge); deleted the reinvented pool/corpus/sandbox backends. 424 -> 259 lines.
experiment.ts: sandboxAgentRun gains an optional env passthrough (merged onto OPENAI_*), letting a caller pin the in-box agent search provider (TANGLE_SEARCH_DEFAULT_PROVIDER). rsi.ts forwards SEARCH / EXA_API_KEY to the box via it.
Verified: tsc clean; SimpleQA you-arm reproduces 2/2 through the cleaned worker.
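
The env passthrough described in this commit could be sketched as below. This is a minimal illustration, not the code in experiment.ts: the `AgentRunOptions` shape, the `buildBoxEnv` helper, and the provider value `'exa'` are hypothetical, and the precedence rule (passthrough keys win on collision) is an assumption, since the commit only says the passthrough is "merged onto OPENAI_*".

```typescript
// Hypothetical shapes: the real sandboxAgentRun signature is not shown in this PR.
type Env = Record<string, string>;

interface AgentRunOptions {
  openaiBaseUrl: string;
  openaiApiKey: string;
  /** Optional passthrough, e.g. { TANGLE_SEARCH_DEFAULT_PROVIDER: 'exa' }. */
  env?: Env;
}

// Build the in-box environment: OPENAI_* first, then the caller's passthrough on top.
// Assumption: passthrough keys override the base on collision.
function buildBoxEnv(opts: AgentRunOptions): Env {
  return {
    OPENAI_BASE_URL: opts.openaiBaseUrl,
    OPENAI_API_KEY: opts.openaiApiKey,
    ...(opts.env ?? {}),
  };
}

// A caller pins the in-box search provider without touching the OPENAI_* wiring.
const boxEnv = buildBoxEnv({
  openaiBaseUrl: 'http://router.local/v1', // placeholder URL
  openaiApiKey: 'sk-placeholder',
  env: { TANGLE_SEARCH_DEFAULT_PROVIDER: 'exa' },
});
console.log(Object.keys(boxEnv).sort().join(','));
```

Keeping the passthrough optional means existing callers that pass no `env` get exactly the OPENAI_* pair they had before.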
…-unification checkpoint)
…l-loop # Conflicts: # bench/src/research-gate.mts
…boxClient
No aliases, hard rename across src/bench/tests (241 sites, 47 files):
- LeafExecutor → Executor, LeafExecutorFactory → ExecutorFactory,
LeafResult → ExecutorResult (the literature-standard executor vocabulary;
the supervision-tree 'leaf' role moves to the docstring)
- createDynamicDriver → createDriver, CreateDynamicDriverOptions →
CreateDriverOptions, DynamicDecision → DriverDecision ('dynamic'
distinguished it from static drivers that no longer exist)
- LoopSandboxClient → SandboxClient (no longer the loop's port name; it is
the box-shaped structural contract for the sandbox substrate: lineage,
fs artifacts, capabilities, MCP delegation, in-process clients)
- src/runtime/dynamic.ts → src/runtime/driver.ts
… + one pseudo-box adapter
Collapses per-backend executor factories into a single config-driven entrypoint;
the backend becomes serializable DATA (a profile/experiment-config/journal can
NAME it) instead of an import choice.
- createExecutor({ backend: 'router'|'bridge'|'cli'|'sandbox', …seam }) is the
ONE public built-in; routerInline/sandbox/cli become internal case-arms; the
registry (Supervisor's resolve-by-harness path) feeds from the same bodies.
- bridgeExecutor: the cli-bridge harness turn (model = harness selector,
agent_profile = the arm's native-disable/MCP) — implemented ONCE, in src/.
- inlineSandboxClient(factory): the ONE pseudo-box adapter — any non-box Executor
drives runLoop without re-faking a box. Replaces the shims each call site grew.
- generate-eval migrated onto it (deletes its bespoke bridge SandboxClient).
router-executor.ts + search-bench/bridge.ts migrate next (same pattern).
The port stays OPEN: BYO agents implement Executor and never pass through here.
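
To make "the backend becomes serializable DATA" concrete, here is a minimal sketch of a config-driven factory. It assumes hypothetical signatures throughout: the PR names `Executor`, `ExecutorResult`, and `createExecutor` but does not show their definitions, and `parseExecutorConfig`, the `model` field, and the stub case-arms are illustrative only.

```typescript
// Hypothetical types: only the names Executor / ExecutorResult / createExecutor come from the PR.
const BACKENDS = ['router', 'bridge', 'cli', 'sandbox'] as const;
type Backend = (typeof BACKENDS)[number];

interface ExecutorResult { backend: Backend; output: string }
interface Executor { run(prompt: string): Promise<ExecutorResult> }
interface ExecutorConfig { backend: Backend; model?: string }

// Narrow untrusted data (a profile, experiment config, or journal entry) to a config.
function parseExecutorConfig(raw: unknown): ExecutorConfig {
  const obj = raw as { backend?: unknown; model?: unknown } | null;
  const backend = obj?.backend;
  if (typeof backend !== 'string' || !BACKENDS.some((b) => b === backend)) {
    throw new Error(`unknown executor backend: ${String(backend)}`);
  }
  const model = obj?.model;
  return typeof model === 'string'
    ? { backend: backend as Backend, model }
    : { backend: backend as Backend };
}

// One public entrypoint; each backend is an internal case-arm.
// The arms are stubs that echo the prompt instead of calling a real backend.
function createExecutor(config: ExecutorConfig): Executor {
  const stub = (backend: Backend): Executor => ({
    run: async (prompt) => ({ backend, output: `[${backend}] ${prompt}` }),
  });
  switch (config.backend) {
    case 'router': return stub('router');
    case 'bridge': return stub('bridge');
    case 'cli': return stub('cli');
    case 'sandbox': return stub('sandbox');
    default: {
      // Compile-time exhaustiveness check: adding a backend without an arm fails to type-check.
      const unreachable: never = config.backend;
      throw new Error(`unhandled backend: ${String(unreachable)}`);
    }
  }
}

// The backend is chosen by data that could have been read from a journal or config file.
const journalEntry: unknown = JSON.parse('{"backend":"bridge","model":"claude-code"}');
const executor = createExecutor(parseExecutorConfig(journalEntry));
executor.run('hello').then((r) => console.log(r.output));
```

Because the selection is a string in a plain object, the same value can be logged, diffed, and replayed, which an import choice cannot.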
…ied executor
Delete the last two pseudo-box / raw-fetch duplicates:
- router-executor.ts: a BYO Executor over runResearchShot wrapped by
inlineSandboxClient — owns only the research-shot specifics, no hand-rolled
create/streamPrompt/delete box shell.
- search-bench/bridge.ts: the cell scorer's bridge POST+usage-parse becomes a
createExecutor({backend:'bridge'}) call; it keeps only oracle scoring +
citation extraction. Same backend the loop path uses, one implementation.
- generate-eval run loop: per-round gate verdicts now logged (an exhaust says
WHY — malformed JSON vs grounding vs discrimination).
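
The pseudo-box adapter that router-executor.ts now leans on can be pictured roughly as follows. This is a sketch under stated assumptions, not the adapter in src/: the `Box` type, the callback-style `streamPrompt` signature, and the single-chunk behavior are hypothetical; only the names `Executor`, `ExecutorFactory`, `SandboxClient`, `inlineSandboxClient`, and the create/streamPrompt/delete trio come from the commit messages.

```typescript
// Hypothetical shapes: the PR names these types but does not show their definitions.
interface ExecutorResult { output: string }
interface Executor { run(prompt: string): Promise<ExecutorResult> }
type ExecutorFactory = () => Executor;

interface Box { id: string }
interface SandboxClient {
  create(): Promise<Box>;
  streamPrompt(box: Box, prompt: string, onChunk: (chunk: string) => void): Promise<void>;
  delete(box: Box): Promise<void>;
}

// Wrap any in-process Executor in a box-shaped client so a box-driven loop can use it.
function inlineSandboxClient(factory: ExecutorFactory): SandboxClient {
  const live = new Map<string, Executor>();
  let nextId = 0;
  return {
    async create() {
      const box: Box = { id: `inline-${nextId++}` };
      live.set(box.id, factory()); // one fresh executor per pseudo-box
      return box;
    },
    async streamPrompt(box, prompt, onChunk) {
      const executor = live.get(box.id);
      if (!executor) throw new Error(`unknown box: ${box.id}`);
      const result = await executor.run(prompt);
      onChunk(result.output); // a single chunk: there is no real stream behind it
    },
    async delete(box) {
      live.delete(box.id);
    },
  };
}

// Usage: a toy executor that upper-cases its prompt, driven through the box contract.
const shout: ExecutorFactory = () => ({
  run: async (prompt) => ({ output: prompt.toUpperCase() }),
});

async function demo(): Promise<string[]> {
  const client = inlineSandboxClient(shout);
  const box = await client.create();
  const chunks: string[] = [];
  await client.streamPrompt(box, 'ping', (c) => chunks.push(c));
  await client.delete(box);
  return chunks;
}
demo().then((chunks) => console.log(chunks.join('')));
```

With the shell written once, a BYO executor such as the research-shot one only has to implement `run`.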
…name
- CLAUDE.md: add the top-of-file rule (pointers, not state: no gate numbers, run ids, or session/generation status; those live in .evolve/current.json + memory/). Replace the embedded science-state paragraph with a pointer; fix the code-map to the unified names (driver.ts, Executor, createExecutor, inlineSandboxClient, SandboxClient).
- Propagate the executor/driver rename across all docs, the adoption skill, and examples so no map disagrees with the code (anti-staleness law).
Two threads, both landed green (tsc 0 · biome clean · 676/676 tests).
1. The eval substrate: measure provider × harness × model on fresh, non-trainable tasks
The neutral-measurement layer the RSI runtime needs (grounding) and that sells as data (docs/eval-substrate.md is the north star + measurement non-negotiables).
- bench/src/search-bench/: coding-harness × web-search comparison. Per-arm AgentProfile builders (native / provider-MCP+native-disabled / off), deterministic oracles (exact-identifier checks, no LLM judge), a runner over the unified executor (sandbox + cli-bridge backends), and a rigorous exporter (methodology + all tasks defined + per-task matrix + paired sign-test + CSV + gist).
- A fresh task set (tasks-fresh.ts) where GPT-4.1 fails 90% parametrically, that is, without search (the search-correctable headroom). Generated + adversarially verified, not hand-authored.
- bench/src/generate-eval/: the data engine as a skill + certifier + kernel loop. An agent authors a task; the runtime certifies it (grounding gate = reference must execute+pass against the real pinned target; discrimination gate = a no-tools baseline must fail). Soundness guaranteed, production budget-bounded, exhaustion loud. skills/generate-eval/SKILL.md is portable to any agent/stack.
Honest first result (shipped as data, not marketing): on these 21 tasks, you.com is at correctness parity with the harness's native search (opencode 71/67/67, p=1.0; claude-code 65/66) and markedly more token-efficient (−36%/−61% input tokens vs opencode's page-dumping webfetch). The substrate's value is reporting where each provider wins/ties/loses, continuously.
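
For readers unfamiliar with the paired sign-test the exporter reports, the sketch below shows one standard way to compute it: an exact two-sided binomial test over the tasks where the two arms disagree, with ties dropped. It is not the exporter's code; the function names and the `PairedOutcome` shape are hypothetical, and the inputs in the usage lines are toy data, not the PR's results.

```typescript
// Pass/fail of two arms on the same task (hypothetical shape).
interface PairedOutcome { a: boolean; b: boolean }

// C(n, k) computed incrementally; every intermediate value is itself a binomial coefficient.
function binomialCoefficient(n: number, k: number): number {
  let c = 1;
  for (let i = 1; i <= k; i++) c = (c * (n - k + i)) / i;
  return c;
}

// Exact two-sided sign test: under H0 each discordant pair favors either arm with probability 1/2.
function pairedSignTest(pairs: PairedOutcome[]): { aWins: number; bWins: number; pValue: number } {
  let aWins = 0;
  let bWins = 0;
  for (const { a, b } of pairs) {
    if (a && !b) aWins++;
    else if (b && !a) bWins++;
    // ties (both pass or both fail) carry no information and are dropped
  }
  const n = aWins + bWins;
  if (n === 0) return { aWins, bWins, pValue: 1 };
  const k = Math.min(aWins, bWins);
  let tail = 0;
  for (let i = 0; i <= k; i++) tail += binomialCoefficient(n, i);
  // Double the smaller tail and cap at 1.
  const pValue = Math.min(1, (2 * tail) / Math.pow(2, n));
  return { aWins, bWins, pValue };
}

// Toy data: arm A wins all five discordant tasks, plus two ties.
const toy: PairedOutcome[] = [
  { a: true, b: false }, { a: true, b: false }, { a: true, b: false },
  { a: true, b: false }, { a: true, b: false },
  { a: true, b: true }, { a: false, b: false },
];
console.log(pairedSignTest(toy)); // pValue is 2 / 2^5 = 0.0625
```

A p-value of 1.0, as reported for the opencode arms, is what this test returns when the discordant tasks split evenly or nearly evenly between the two arms.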
2. Unify the executor / driver surfaces (no aliases, aggressive)
Collapsed the sprawl the survey mapped (six parallel surfaces) onto one vocabulary and one port:
- Executor / ExecutorFactory / ExecutorResult (was LeafExecutor*; Executor is the literature-standard term, and the supervision-tree "leaf" role is now a docstring).
- createDriver / DriverDecision (was createDynamicDriver; "dynamic" distinguished it from static drivers that no longer exist); dynamic.ts → driver.ts.
- SandboxClient (was LoopSandboxClient; no longer the loop's port name, it's the box-shaped structural contract).
- createExecutor({ backend: 'router'|'bridge'|'cli'|'sandbox', …seam }): the ONE built-in. The backend is serializable data (a profile/experiment-config/journal can name it), not an import choice. The per-backend factories are internal case-arms; the registry feeds from the same bodies; BYO agents implement Executor directly (the port stays open).
- inlineSandboxClient(factory): the ONE pseudo-box adapter. Any non-box Executor drives runLoop without re-faking a box. The three duplicated shims (router-executor.ts, generate-eval's bridge client, search-bench's bridge POST) now share it.
3. Process hygiene
CLAUDE.md decluttered to the timeless contract: a top-of-file rule (pointers, not state: no gate numbers / run ids / session status; those live in .evolve/current.json + memory/) and the code-map fixed to the unified names.
Test plan
- pnpm run typecheck (src) + tsc --noEmit (bench): 0 errors.
- pnpm run lint: clean.
- pnpm test: 66 files / 676 tests pass.
- … docs/eval-substrate.md.
- … main.