feat(bench): research-leaderboard worker (router-RAG) + sandboxAgentRun env passthrough #186

Merged
drewstone merged 1 commit into main from feat/research-leaderboard
Jun 7, 2026

@drewstone
Contributor

What

An off-sandbox research-bench leaderboard worker, plus the one shared-primitive extension it needs.

  • bench/src/research-gate.mts (new): model × web-search-provider × multi-shot leaderboard on the research benches (finsearchcomp / frames / hotpotqa / simpleqa), run over the router as pure HTTP (off-sandbox, so it never contends with sandbox-bound gates). Per shot: provider-pinned /v1/search?provider=<id> + web_fetch of the top-K result pages, then answer with that evidence — no tools on the answer call, so every arm differs only by the search provider (a clean controlled A/B). SEARCH=default skips search (the parametric control).
  • bench/src/experiment.ts: sandboxAgentRun gains an optional env passthrough, merged onto the standard OPENAI_* box env — so a caller can pin the in-box agent's web-search provider (TANGLE_SEARCH_DEFAULT_PROVIDER) / forward provider keys. Additive; existing callers unchanged.
  • bench/src/rsi.ts: forwards SEARCH / EXA_API_KEY to the box via that passthrough.
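As a rough illustration of the leaderboard shape described above, the arms can be enumerated as a plain cross product, with the `default` provider acting as the no-search control. This is a minimal sketch, not the worker's code: `Arm`, `enumerateArms`, `searchUrl`, and the `q` query parameter are hypothetical names, and `routerBaseUrl` is assumed to already end in `/v1`.

```typescript
// Hypothetical shape for illustration; not the worker's actual type.
interface Arm {
  model: string;
  provider: string; // "default" means: skip search (the parametric control)
  shot: number;
}

// Cross product: model x web-search-provider x shot index.
function enumerateArms(models: string[], providers: string[], shots: number): Arm[] {
  const arms: Arm[] = [];
  for (const model of models) {
    for (const provider of providers) {
      for (let shot = 0; shot < shots; shot++) {
        arms.push({ model, provider, shot });
      }
    }
  }
  return arms;
}

// Provider-pinned search URL, or null for the control arm.
// The "q" parameter name is an assumption; only "?provider=<id>" is stated in the PR.
function searchUrl(routerBaseUrl: string, provider: string, query: string): string | null {
  if (provider === "default") return null;
  const params = new URLSearchParams({ provider, q: query });
  return `${routerBaseUrl}/search?${params.toString()}`;
}
```

Because the answer call gets no tools, two arms with the same model and shot index differ only in what `searchUrl` returns, which is what makes the comparison a controlled A/B.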

Why

First clean result through this path: on SimpleQA, you.com search lifts every model to ~90% (a model-equalizer; +70pp for gpt-4o-mini, +20pp for the strong models). The worker started as a 423-line standalone that reinvented runExperiment / runPool / sandboxAgentRun / corpus; this is the deep-cleaned 259-line version that consolidates onto the one-flow kernel primitives (routerChatWithUsage, runPool, appendRunRecord, adapter.judge) with zero reinvented pool/corpus/sandbox code.

Verification

  • research-gate.mts: tsc -p bench/tsconfig.json clean.
  • SimpleQA you-arm reproduces 2/2 through the cleaned worker.
  • Note: bench/src/experiment.ts:309 (lineage: { streaming: 'poll' }) shows a pre-existing tsc error on main — the bench's installed @tangle-network/agent-runtime is behind src (no lineage field yet). Not introduced by this PR; flagged for a dep bump.

Related (tracked separately, not in this PR)

  • ops-board #976: sandbox boxes (agent-dev-container) reach only the router (egress allowlist), so in-box agents can't web-search natively — the off-sandbox path here is unaffected; the harness-in-box search fix is tracked there.

…un env passthrough

research-gate.mts: off-sandbox research-bench leaderboard (model x web-search-provider x multi-shot) over the router -- provider-pinned /v1/search + web_fetch, then answer. Deep-cleaned onto the kernel primitives (routerChatWithUsage, runPool, appendRunRecord, adapter.judge); deleted the reinvented pool/corpus/sandbox backends. 424 -> 259 lines.

experiment.ts: sandboxAgentRun gains an optional env passthrough (merged onto OPENAI_*), letting a caller pin the in-box agent search provider (TANGLE_SEARCH_DEFAULT_PROVIDER). rsi.ts forwards SEARCH / EXA_API_KEY to the box via it.

Verified: tsc clean; SimpleQA you-arm reproduces 2/2 through the cleaned worker.
@tangletools
Contributor

✅ No Blockers — 521c8bff

Readiness 69/100 · Confidence 65/100 · 8 findings (3 medium, 5 low)

| Metric | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 69 | 76 | 69 |
| Confidence | 65 | 65 | 65 |
| Correctness | 69 | 76 | 69 |
| Security | 69 | 76 | 69 |
| Testing | 69 | 76 | 69 |
| Architecture | 69 | 76 | 69 |

Full multi-shot audit completed 1/1 planned shots over 3 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Router usage/cost data discarded when building corpus records — bench/src/research-gate.mts

Line 148: routerChatWithUsage returns { content, usage?, costUsd? } with real token counts and derived cost. But lines 214-223 build AttemptRecord manually and never populate costUsd, tokensIn, or tokensOut. The corpus contract (corpus.ts:25-31) says these fields are "present only when the worker actually reported them" — the worker DID report them, they're just being dropped. Downstream tools (corpus-report, eval gate) rely on cost-aware comparisons. Fix: capture { usage, costUsd } from the routerChatWithUsage re
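One possible shape for carrying the reported usage through to the record, as a minimal sketch. The field names `costUsd`, `tokensIn`, and `tokensOut` come from the finding; the `RouterUsage` layout (`promptTokens` / `completionTokens`) and the helper name `costFields` are assumptions, and the real types in the bench source may differ.

```typescript
// Hypothetical shapes; the real types live in the bench source and may differ.
interface RouterUsage {
  promptTokens: number;
  completionTokens: number;
}
interface RouterResult {
  content: string;
  usage?: RouterUsage;
  costUsd?: number;
}
interface AttemptCost {
  costUsd?: number;
  tokensIn?: number;
  tokensOut?: number;
}

// Copy only what the router actually reported, so absent fields stay absent
// ("present only when the worker actually reported them").
function costFields(result: RouterResult): AttemptCost {
  const out: AttemptCost = {};
  if (result.costUsd !== undefined) out.costUsd = result.costUsd;
  if (result.usage !== undefined) {
    out.tokensIn = result.usage.promptTokens;
    out.tokensOut = result.usage.completionTokens;
  }
  return out;
}
```

The result could then be spread into the manually built record, for example `{ ...baseRecord, ...costFields(result) }`, where `baseRecord` stands in for the existing fields.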

🟠 MEDIUM Search failure produces contradictory model prompt — bench/src/research-gate.mts

Lines 102-138: When useSearch is true but the search HTTP call returns non-200, context stays empty (line 114: warn + fall through). At line 133, the system prompt still says "Use the WEB SEARCH RESULTS below (snippets + fetched page content) as your primary evidence; cite the source" because useSearch remains true. But at [line 138](https://github.com/tangle-network/agent-runtime/blob/521c8bff30
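A minimal sketch of one way to avoid the contradiction: pick the system prompt from whether evidence was actually gathered, rather than from the `useSearch` flag. The first prompt string is the one quoted in the finding; the fallback wording and the function name `systemPrompt` are hypothetical.

```typescript
// Choose the instruction from the evidence that exists, not from the flag
// that requested it. The fallback wording below is an assumption.
function systemPrompt(context: string): string {
  const hasEvidence = context.trim().length > 0;
  return hasEvidence
    ? "Use the WEB SEARCH RESULTS below (snippets + fetched page content) as your primary evidence; cite the source."
    : "No web search results are available; answer from your own knowledge.";
}
```

A failed search arm would then degrade to the same prompt as the parametric control, so it may also be worth recording the failure on the attempt so the arm is not scored as if it had searched.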

🟠 MEDIUM research-gate AttemptRecords lack costUsd/tokensIn/tokensOut — all records are unmappable by corpus projection — bench/src/research-gate.mts

Lines 214-223: The AttemptRecord written by research-gate omits costUsd, tokensIn, tokensOut. The corpus projection (corpus.ts:203-210) requires all three and marks any attempt missing them as 'unmappable'. This means benchRecordToCorpusRecords() will reject every attempt from research-gate. If research-gate records are only consumed by corpus-report.mts (which reads raw RunRecords, not canonical projections), this is benign. But if anyone runs the canonical bridge on a research-gate corpus, it silently produces zero records. The RunRecord itself still writes fine via appendRunRecord — this is only a downstream-projection gap. routerChatWithUsage returns

🟡 LOW env spread can override router auth keys — bench/src/experiment.ts

Line 217: env: { OPENAI_API_KEY: opts.routerKey, OPENAI_BASE_URL: opts.routerBaseUrl, ...opts.env } — caller-supplied env is spread ON TOP and can clobber OPENAI_API_KEY and OPENAI_BASE_URL. The JSDoc (line 201) documents this as "merged ON TOP" which is accurate, but there's no runtime guard against accidental override. A caller who unintentionally includes OPENAI_API_KEY in their env object would silently break router auth for that run. Fix: defensive check that opts.env doesn't contain auth-critical keys, or log a warning.
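The defensive check suggested above could look roughly like this. It is a sketch under assumptions: `buildBoxEnv` and `RESERVED_KEYS` are hypothetical names, and throwing (rather than logging a warning) is one design choice among several.

```typescript
// Keys the passthrough must never override; router auth depends on them.
const RESERVED_KEYS = ["OPENAI_API_KEY", "OPENAI_BASE_URL"] as const;

// Merge caller env on top of the standard box env, refusing reserved keys.
function buildBoxEnv(
  routerKey: string,
  routerBaseUrl: string,
  extra: Record<string, string> = {},
): Record<string, string> {
  const clobbered = RESERVED_KEYS.filter((key) => key in extra);
  if (clobbered.length > 0) {
    throw new Error(`env passthrough must not set: ${clobbered.join(", ")}`);
  }
  return { OPENAI_API_KEY: routerKey, OPENAI_BASE_URL: routerBaseUrl, ...extra };
}
```

Failing loudly keeps an accidental override from silently breaking router auth for a whole run; a caller that genuinely needs a different base URL would have to go through `routerBaseUrl` instead.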

🟡 LOW Inconsistent trailing-slash handling on routerBaseUrl — bench/src/research-gate.mts

routerChatWithUsage (router-client.ts:32) strips trailing slashes: cfg.routerBaseUrl.replace(/\/$/, ''). But runResearchShot (line 106) and fetchPage (line 67) use raw ${cfg.routerBaseUrl}/search... and ${cfg.routerBaseUrl}/search/mcp.... A user-supplied ROUTER_BASE ending in / (e.g. https://router.tangle.tools/v1/) produces double-slash URLs (//search) in the research gate but not in the router-client, causing silent failures in one path but not the other. Fix: apply the same replace(/\/$/, '') guard to the base
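One way to make both paths agree is to route every URL through a single join helper. A minimal sketch; `routerUrl` is a hypothetical name, and it uses `/\/+$/` (strip all trailing slashes) rather than the single-slash `/\/$/` quoted above, which is a deliberate but unconfirmed widening.

```typescript
// Join a router base URL and a path with exactly one slash between them.
function routerUrl(routerBaseUrl: string, path: string): string {
  const base = routerBaseUrl.replace(/\/+$/, ""); // drop trailing slashes
  const suffix = path.replace(/^\/+/, ""); // drop leading slashes
  return `${base}/${suffix}`;
}
```

With this, `routerUrl("https://router.tangle.tools/v1/", "/search?provider=exa")` and the same call without the trailing slash on the base both yield `https://router.tangle.tools/v1/search?provider=exa`.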

🟡 LOW No test coverage for new research-gate.mts (249 lines) — bench/src/research-gate.mts

No tests found in bench/tests/ for research-gate, experiment, or rsi. The bench scripts appear to be integration-test-only (run against live router/sandbox), but the pure functions (fetchPage, runResearchShot's query-extraction logic, the search sentinel check) are unit-testable. The query-extraction line task.prompt.split('\n').find(l => l.trim().length > 0) (line 105) has an edge case: if the first non-empty line is a metadata header or instruction rather than the actual question, the search query will be poor. This is testable in isolation.
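To show that the query-extraction step is testable in isolation, here is a minimal sketch that lifts the quoted expression into a pure function. `extractQuery` is a hypothetical name, and the trailing `trim()` and the empty-string fallback are assumptions not visible in the finding.

```typescript
// First non-empty line of the prompt, used as the search query.
function extractQuery(prompt: string): string {
  const line = prompt.split("\n").find((l) => l.trim().length > 0);
  return (line ?? "").trim();
}

// The edge case from the finding: a leading instruction line wins over the question.
const fromInstructionFirst = extractQuery("Instructions: answer concisely.\nWho wrote Dune?");
// fromInstructionFirst === "Instructions: answer concisely."
```

A unit test pinned to that second case would document the current behavior and fail visibly if the extraction is later changed to skip header lines.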

🟡 LOW fetchPage silently swallows all errors — no logging on auth/parse/network failures — bench/src/research-gate.mts

Lines 65-86: fetchPage catches all errors and returns '' with zero logging. A 401 auth failure, DNS failure, or malformed JSON response is invisible. The search shot surface logs a search FAIL (line 114) but page-fetch failures within Promise.all (line 120) are completely silent. When fetched[i] is '', the context silently omits that page. This is correct for fault isolation but makes debugging zero-page

🟡 LOW rsi.ts forwards EXA_API_KEY unconditionally — any process.env.EXA_API_KEY leaks into sandbox — bench/src/rsi.ts

Line 54: if (process.env.EXA_API_KEY) searchEnv.EXA_API_KEY = process.env.EXA_API_KEY forwards the key without allowlisting. The sandboxAgentRun docstring (experiment.ts:201-204) says 'Allowlisted keys only reach the spawned CLI' but the caller in rsi.ts does its own allowlisting by explicit key name. This is fine functionally, but the allowlist is scattered across callers rather than enforced at the sandboxAgentRun layer. If a caller accidentally spreads a broader env object, keys leak. Low risk in current code; noting for the allowlist-in-one-place principle.
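The allowlist-in-one-place idea could be sketched as a single filter applied at the sandboxAgentRun boundary. This is an illustration under assumptions: `BOX_ENV_ALLOWLIST` and `pickAllowlisted` are hypothetical names, and the two listed keys are simply the ones this PR mentions.

```typescript
// Hypothetical central allowlist; only these keys may reach the box.
const BOX_ENV_ALLOWLIST = ["TANGLE_SEARCH_DEFAULT_PROVIDER", "EXA_API_KEY"];

// Copy only allowlisted, non-empty values from a broader env object.
function pickAllowlisted(
  source: Record<string, string | undefined>,
  allowlist: string[] = BOX_ENV_ALLOWLIST,
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const key of allowlist) {
    const value = source[key];
    if (value !== undefined && value !== "") out[key] = value;
  }
  return out;
}
```

A caller could then pass `pickAllowlisted(process.env)` without risk of leaking unrelated variables, even if it later spreads a broader object.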


tangletools · 2026-06-06T22:45:19Z · trace

Contributor

@tangletools tangletools left a comment

✅ Approved — 8 non-blocking findings — 521c8bff

Full multi-shot audit completed 1/1 planned shots over 3 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-06T22:45:19Z · immutable trace

@tangletools
Contributor

Premise check withheld merge — 521c8bff

Classifier flagged this PR as a premise claim (numeric pp/% delta + eval terminology). Confidence: high.

Recommend re-running the underlying eval with pairedEvalueSequence before merging.

  • Cited claim: +70pp
  • PR body excerpt: feat(bench): research-leaderboard worker (router-RAG) + sandboxAgentRun env passthrough

Run:

pnpm eval:evolve --reps 5 --skip-mutation

Classifier rationale: Body cites 2 numeric claim(s) (+70pp, +20pp) and eval-related terms appear in pr_body, review_findings. PR is asserting a measurable result that repair-pr cannot polish away — re-run the underlying evaluation before merging.


tangletools premise check · #186

@drewstone drewstone merged commit b7e9e3c into main Jun 7, 2026
1 check passed
drewstone added a commit that referenced this pull request Jun 7, 2026
@drewstone drewstone deleted the feat/research-leaderboard branch June 7, 2026 12:29