feat(bench): research-leaderboard worker (router-RAG) + sandboxAgentRun env passthrough by drewstone · Pull Request #186 · tangle-network/agent-runtime

drewstone · 2026-06-06T22:40:10Z

What

An off-sandbox research-bench leaderboard worker, plus the one shared-primitive extension it needs.

bench/src/research-gate.mts (new): model × web-search-provider × multi-shot leaderboard on the research benches (finsearchcomp / frames / hotpotqa / simpleqa), run over the router as pure HTTP (off-sandbox, so it never contends with sandbox-bound gates). Per shot: provider-pinned /v1/search?provider=<id> + web_fetch of the top-K result pages, then answer with that evidence — no tools on the answer call, so every arm differs only by the search provider (a clean controlled A/B). SEARCH=default skips search (the parametric control).
bench/src/experiment.ts: sandboxAgentRun gains an optional env passthrough, merged onto the standard OPENAI_* box env — so a caller can pin the in-box agent's web-search provider (TANGLE_SEARCH_DEFAULT_PROVIDER) / forward provider keys. Additive; existing callers unchanged.
bench/src/rsi.ts: forwards SEARCH / EXA_API_KEY to the box via that passthrough.
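As a rough illustration of the leaderboard grid described above, the arms can be enumerated as a cross product of model, search provider, and shot index. This is a minimal sketch, not the code in research-gate.mts: the `Arm` type, `buildArms`, and the provider id `'you'` are assumptions made for the example; only the `default` sentinel (skip search, parametric control) comes from the PR description.

```typescript
// Hypothetical arm shape; the real worker may model this differently.
interface Arm {
  model: string;
  provider: string;   // 'default' means "skip search" (the parametric control)
  shot: number;       // 0-based repetition index
  useSearch: boolean; // false only for the parametric control arm
}

// Enumerate model x provider x shot so every arm differs only in one dimension.
function buildArms(models: string[], providers: string[], shots: number): Arm[] {
  const arms: Arm[] = [];
  for (const model of models) {
    for (const provider of providers) {
      for (let shot = 0; shot < shots; shot++) {
        arms.push({ model, provider, shot, useSearch: provider !== 'default' });
      }
    }
  }
  return arms;
}

// One model, control + one search provider, two shots each.
const exampleArms = buildArms(['gpt-4o-mini'], ['default', 'you'], 2);
console.log(exampleArms.length);
```

Keeping the control arm in the same grid means its records share the same shape as the search arms, so the comparison stays like-for-like.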

Why

First clean result through this path: on SimpleQA, you.com search lifts every model to ~90% (a model-equalizer; +70pp for gpt-4o-mini, +20pp for the strong models). The worker started as a 423-line standalone that reinvented runExperiment / runPool / sandboxAgentRun / corpus; this is the deep-cleaned 259-line version that consolidates onto the one-flow kernel primitives (routerChatWithUsage, runPool, appendRunRecord, adapter.judge) with zero reinvented pool/corpus/sandbox code.

Verification

research-gate.mts: tsc -p bench/tsconfig.json clean.
SimpleQA you-arm reproduces 2/2 through the cleaned worker.
Note: bench/src/experiment.ts:309 (lineage: { streaming: 'poll' }) shows a pre-existing tsc error on main — the bench's installed @tangle-network/agent-runtime is behind src (no lineage field yet). Not introduced by this PR; flagged for a dep bump.

Related (tracked separately, not in this PR)

ops-board #976: sandbox boxes (agent-dev-container) reach only the router (egress allowlist), so in-box agents can't web-search natively — the off-sandbox path here is unaffected; the harness-in-box search fix is tracked there.

…un env passthrough research-gate.mts: off-sandbox research-bench leaderboard (model x web-search-provider x multi-shot) over the router -- provider-pinned /v1/search + web_fetch, then answer. Deep-cleaned onto the kernel primitives (routerChatWithUsage, runPool, appendRunRecord, adapter.judge); deleted the reinvented pool/corpus/sandbox backends. 424 -> 259 lines. experiment.ts: sandboxAgentRun gains an optional env passthrough (merged onto OPENAI_*), letting a caller pin the in-box agent search provider (TANGLE_SEARCH_DEFAULT_PROVIDER). rsi.ts forwards SEARCH / EXA_API_KEY to the box via it. Verified: tsc clean; SimpleQA you-arm reproduces 2/2 through the cleaned worker.

tangletools · 2026-06-06T22:45:21Z

✅ No Blockers — `521c8bff`

Readiness 69/100 · Confidence 65/100 · 8 findings (3 medium, 5 low)

| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 69 | 76 | 69 |
| Confidence | 65 | 65 | 65 |
| Correctness | 69 | 76 | 69 |
| Security | 69 | 76 | 69 |
| Testing | 69 | 76 | 69 |
| Architecture | 69 | 76 | 69 |

Full multi-shot audit completed 1/1 planned shots over 3 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Router usage/cost data discarded when building corpus records — bench/src/research-gate.mts

Line 148: routerChatWithUsage returns { content, usage?, costUsd? } with real token counts and derived cost. But lines 214-223 build AttemptRecord manually and never populate costUsd, tokensIn, or tokensOut. The corpus contract (corpus.ts:25-31) says these fields are "present only when the worker actually reported them" — the worker DID report them, they're just being dropped. Downstream tools (corpus-report, eval gate) rely on cost-aware comparisons. Fix: capture { usage, costUsd } from the routerChatWithUsage re
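One way to carry the reported usage into the record, sketched under assumptions: the `{ content, usage?, costUsd? }` result shape is taken from the finding, but the `Usage` field names (`promptTokens`, `completionTokens`), the `answer` field, and the helper `toAttemptRecord` are hypothetical and may not match the actual types in the repo.

```typescript
// Assumed shapes for illustration only.
interface Usage { promptTokens: number; completionTokens: number }
interface ChatResult { content: string; usage?: Usage; costUsd?: number }
interface AttemptRecord {
  answer: string;
  costUsd?: number;
  tokensIn?: number;
  tokensOut?: number;
}

// Populate cost/token fields only when the worker actually reported them,
// which matches the "present only when reported" corpus contract.
function toAttemptRecord(res: ChatResult): AttemptRecord {
  const rec: AttemptRecord = { answer: res.content };
  if (res.costUsd !== undefined) rec.costUsd = res.costUsd;
  if (res.usage) {
    rec.tokensIn = res.usage.promptTokens;
    rec.tokensOut = res.usage.completionTokens;
  }
  return rec;
}
```

Leaving the fields absent (rather than writing zeros) when the router reports nothing keeps "unknown cost" distinguishable from "free".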

🟠 MEDIUM Search failure produces contradictory model prompt — bench/src/research-gate.mts

Lines 102-138: When useSearch is true but the search HTTP call returns non-200, context stays empty (line 114: warn + fall through). At line 133, the system prompt still says "Use the WEB SEARCH RESULTS below (snippets + fetched page content) as your primary evidence; cite the source" because useSearch remains true. But at [line 138](https://github.com/tangle-network/agent-runtime/blob/521c8bff30
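A possible shape for the fix, as a sketch only: choose the system prompt from whether evidence was actually gathered, not from the `useSearch` flag alone. The evidence prompt text is quoted from the finding; `PARAMETRIC_PROMPT` and `systemPromptFor` are hypothetical names and wording introduced for this example.

```typescript
// Quoted from the finding above.
const EVIDENCE_PROMPT =
  'Use the WEB SEARCH RESULTS below (snippets + fetched page content) as your primary evidence; cite the source';

// Hypothetical fallback wording for the no-evidence case.
const PARAMETRIC_PROMPT =
  'No web search results are available; answer from your own knowledge.';

// Only claim that evidence exists when the context is actually non-empty.
function systemPromptFor(useSearch: boolean, context: string): string {
  const hasEvidence = useSearch && context.trim().length > 0;
  return hasEvidence ? EVIDENCE_PROMPT : PARAMETRIC_PROMPT;
}
```

A failed search then degrades to the same prompt as the parametric control, which may also be worth flagging on the record so the shot is not counted as a search arm.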

🟠 MEDIUM research-gate AttemptRecords lack costUsd/tokensIn/tokensOut — all records are unmappable by corpus projection — bench/src/research-gate.mts

Lines 214-223: The AttemptRecord written by research-gate omits costUsd, tokensIn, tokensOut. The corpus projection (corpus.ts:203-210) requires all three and marks any attempt missing them as 'unmappable'. This means benchRecordToCorpusRecords() will reject every attempt from research-gate. If research-gate records are only consumed by corpus-report.mts (which reads raw RunRecords, not canonical projections), this is benign. But if anyone runs the canonical bridge on a research-gate corpus, it silently produces zero records. The RunRecord itself still writes fine via appendRunRecord — this is only a downstream-projection gap. routerChatWithUsage returns

🟡 LOW env spread can override router auth keys — bench/src/experiment.ts

Line 217: env: { OPENAI_API_KEY: opts.routerKey, OPENAI_BASE_URL: opts.routerBaseUrl, ...opts.env } — caller-supplied env is spread ON TOP and can clobber OPENAI_API_KEY and OPENAI_BASE_URL. The JSDoc (line 201) documents this as "merged ON TOP" which is accurate, but there's no runtime guard against accidental override. A caller who unintentionally includes OPENAI_API_KEY in their env object would silently break router auth for that run. Fix: defensive check that opts.env doesn't contain auth-critical keys, or log a warning.
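The defensive check suggested here could look roughly like the following. This is a sketch, not the code in experiment.ts: `buildBoxEnv` and `PROTECTED_KEYS` are hypothetical names, and throwing (rather than logging a warning) is one of the two options the finding mentions.

```typescript
// Keys the passthrough must never override (router auth for the box).
const PROTECTED_KEYS = ['OPENAI_API_KEY', 'OPENAI_BASE_URL'];

// Merge caller env on top of the standard box env, rejecting auth-critical overrides.
function buildBoxEnv(
  routerKey: string,
  routerBaseUrl: string,
  extra: Record<string, string> = {},
): Record<string, string> {
  const clobbered = PROTECTED_KEYS.filter(k =>
    Object.prototype.hasOwnProperty.call(extra, k),
  );
  if (clobbered.length > 0) {
    throw new Error(`env passthrough may not override: ${clobbered.join(', ')}`);
  }
  return { OPENAI_API_KEY: routerKey, OPENAI_BASE_URL: routerBaseUrl, ...extra };
}
```

Existing callers that pass no `env` are unaffected, so the guard stays additive in the same sense as the passthrough itself.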

🟡 LOW Inconsistent trailing-slash handling on routerBaseUrl — bench/src/research-gate.mts

routerChatWithUsage (router-client.ts:32) strips trailing slashes: cfg.routerBaseUrl.replace(/\/$/, ''). But runResearchShot (line 106) and fetchPage (line 67) use raw ${cfg.routerBaseUrl}/search... and ${cfg.routerBaseUrl}/search/mcp.... A user-supplied ROUTER_BASE ending in / (e.g. https://router.tangle.tools/v1/) produces double-slash URLs (//search) in the research gate but not in the router-client, causing silent failures in one path but not the other. Fix: apply the same replace(/\/$/, '') guard to the base
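A small shared helper would keep both paths consistent. The sketch below assumes a hypothetical `joinRouterUrl` function; it strips one or more trailing slashes (slightly broader than the single-slash `replace(/\/$/, '')` quoted above) and any leading slashes on the path before joining.

```typescript
// Join a router base URL and a path without producing '//' at the seam.
function joinRouterUrl(base: string, path: string): string {
  const trimmedBase = base.replace(/\/+$/, ''); // drop trailing slashes on the base
  const trimmedPath = path.replace(/^\/+/, ''); // drop leading slashes on the path
  return `${trimmedBase}/${trimmedPath}`;
}

// Both spellings of the base produce the same URL.
console.log(joinRouterUrl('https://router.tangle.tools/v1/', '/search?provider=exa'));
console.log(joinRouterUrl('https://router.tangle.tools/v1', 'search?provider=exa'));
```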

🟡 LOW No test coverage for new research-gate.mts (249 lines) — bench/src/research-gate.mts

No tests found in bench/tests/ for research-gate, experiment, or rsi. The bench scripts appear to be integration-test-only (run against live router/sandbox), but the pure functions (fetchPage, runResearchShot's query-extraction logic, the search sentinel check) are unit-testable. The query-extraction line task.prompt.split('\n').find(l => l.trim().length > 0) (line 105) has an edge case: if the first non-empty line is a metadata header or instruction rather than the actual question, the search query will be poor. This is testable in isolation.
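To show how the query-extraction logic could be tested in isolation, here is a sketch that pulls the quoted expression into a standalone function. `extractQuery` is a hypothetical name; the expression inside it is the one cited from line 105, with a trim and an empty-string fallback added for the example.

```typescript
// First non-empty line of the prompt, trimmed; '' when the prompt has no content.
function extractQuery(prompt: string): string {
  const line = prompt.split('\n').find(l => l.trim().length > 0);
  return (line ?? '').trim();
}

// The edge case from the finding: an instruction line precedes the question,
// so the instruction (not the question) becomes the search query.
console.log(extractQuery('Answer concisely.\nWho wrote Dune?'));
```

A unit test over a handful of real task prompts from each bench would reveal how often the first line is a header instead of the question.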

🟡 LOW fetchPage silently swallows all errors — no logging on auth/parse/network failures — bench/src/research-gate.mts

Lines 65-86: fetchPage catches all errors and returns '' with zero logging. A 401 auth failure, DNS failure, or malformed JSON response is invisible. The search shot surface logs a search FAIL (line 114) but page-fetch failures within Promise.all (line 120) are completely silent. When fetched[i] is '', the context silently omits that page. This is correct for fault isolation but makes debugging zero-page

🟡 LOW rsi.ts forwards EXA_API_KEY unconditionally — any process.env.EXA_API_KEY leaks into sandbox — bench/src/rsi.ts

Line 54: if (process.env.EXA_API_KEY) searchEnv.EXA_API_KEY = process.env.EXA_API_KEY forwards the key without allowlisting. The sandboxAgentRun docstring (experiment.ts:201-204) says 'Allowlisted keys only reach the spawned CLI' but the caller in rsi.ts does its own allowlisting by explicit key name. This is fine functionally, but the allowlist is scattered across callers rather than enforced at the sandboxAgentRun layer. If a caller accidentally spreads a broader env object, keys leak. Low risk in current code; noting for the allowlist-in-one-place principle.
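Enforcing the allowlist in one place might look like the sketch below. `FORWARDABLE_KEYS` and `pickAllowlisted` are hypothetical names; the two keys listed are the ones this PR mentions forwarding, and the actual allowlist in the repo may differ.

```typescript
// Hypothetical single source of truth for keys that may reach the box.
const FORWARDABLE_KEYS: readonly string[] = [
  'TANGLE_SEARCH_DEFAULT_PROVIDER',
  'EXA_API_KEY',
];

// Copy only allowlisted, non-empty keys; everything else is dropped,
// even if a caller spreads a broader env object by accident.
function pickAllowlisted(
  source: Record<string, string | undefined>,
  allow: readonly string[] = FORWARDABLE_KEYS,
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const key of allow) {
    const value = source[key];
    if (value !== undefined && value !== '') out[key] = value;
  }
  return out;
}
```

If sandboxAgentRun applied a filter like this to its `env` option, callers such as rsi.ts could pass their environment through without maintaining their own key lists.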

_{tangletools · 2026-06-06T22:45:19Z · trace}

tangletools

✅ Approved — 8 non-blocking findings — `521c8bff`

Full multi-shot audit completed 1/1 planned shots over 3 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary

_{tangletools · 2026-06-06T22:45:19Z · immutable trace}

tangletools · 2026-06-06T22:45:26Z

Premise check withheld merge — `521c8bff`

Classifier flagged this PR as a premise claim (numeric pp/% delta + eval terminology). Confidence: high.

Recommend re-running the underlying eval with pairedEvalueSequence before merging.

Cited claim: +70pp
PR body excerpt: feat(bench): research-leaderboard worker (router-RAG) + sandboxAgentRun env passthrough

Run:

pnpm eval:evolve --reps 5 --skip-mutation

Classifier rationale: Body cites 2 numeric claim(s) (+70pp, +20pp) and eval-related terms appear in pr_body, review_findings. PR is asserting a measurable result that repair-pr cannot polish away — re-run the underlying evaluation before merging.

_{tangletools premise check · #186}

# Conflicts: # bench/src/research-gate.mts

tangletools approved these changes Jun 6, 2026

drewstone mentioned this pull request Jun 6, 2026

feat(bench): router-backed loop executor — stateful research through the real kernel #188

Merged

drewstone merged commit b7e9e3c into main Jun 7, 2026
1 check passed

drewstone added a commit that referenced this pull request Jun 7, 2026

Merge main into research-loop-executor (resolve squash-orphan of #186)

8c2b7ac

# Conflicts: # bench/src/research-gate.mts

drewstone deleted the feat/research-leaderboard branch June 7, 2026 12:29

drewstone merged 1 commit into main from feat/research-leaderboard
