feat(product-benchmark): export helpers for product-owned benchmark bundles#302
Conversation
…est build, run-dir export
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved drewstone PR — 45a3c090
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-07-02T02:27:31Z
tangletools
left a comment
There was a problem hiding this comment.
⚪ Value Audit — audit-incomplete
| Verdict | audit-incomplete |
| Concerns | 0 (none) |
| Heuristic | 0.0s |
| Duplication | 0.1s |
| Interrogation | 14.8s (2 bridge agents) |
| Total | 14.9s |
💰 Value — error
value agent produced no parseable value-audit JSON.
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge error: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content
🎯 Usefulness — error
usefulness agent produced no parseable value-audit JSON.
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge error: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content
No PR concerns were produced because the value/usefulness agent pass did not complete. Treat this audit as incomplete, not as approval.
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers —
|
| opencode-kimi | glm | deepseek | aggregate | |
|---|---|---|---|---|
| Readiness | 66 | 76 | 92 | 66 |
| Confidence | 70 | 70 | 70 | 70 |
| Correctness | 66 | 76 | 92 | 66 |
| Security | 66 | 76 | 92 | 66 |
| Testing | 66 | 76 | 92 | 66 |
| Architecture | 66 | 76 | 92 | 66 |
Reviewer score is advisory once the run is complete and the verdict has no blockers.
Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Manifest arms/scenarios deduplicated by id only, dropping divergent attributes — src/product-benchmark/export.ts
Arms and scenarios are built with
new Map(records.map((record) => [record.armId, record])).values()and the same pattern for scenarioId, keeping only the first record per id. If records share an armId but differ in profileId, model, backend, harness, or reasoningEffort, the manifest silently keeps one set of policy axes and profileId. The validator (index.ts:468-472) only checks that the arm/scenario id exists, not that attributes match the records. This can produce a manifest that misrepresents the actual arms/scenarios in the records. Fix: group by a composite key (armId + profileId + model + backend + harness + reasoningEffort) for arms and (scenarioId + split) for scenarios, or validate that all records with the same armId/scenarioId agree on the manifest attributes and fail loud on
🟠 MEDIUM Substrate version defaults silently fall back to 'unknown' — src/product-benchmark/export.ts
packageVersionreturns 'unknown' when a package is not declared in the cwd package.json, and bothbuildProductBenchmarkManifest(lines 496-501) andresolveOptions(line 618) use this default. The manifest validator only checks that substrate fields are non-empty strings, so 'unknown' passes validation silently. This violates the repo's 'No fallbacks. Fail loud.' doctrine (CLAUDE.md) and can produce manifests with meaningless substrate versions when export is run from a repo that does not directly declare @tangle-network/a
🟠 MEDIUM Substrate 'unknown' versions pass silently; no substrateFailures analog to repoFailures — src/product-benchmark/index.ts
The PR adds repoFailures that flags manifest.repo.* === 'unknown' (index.ts:474-476) and assertProductBenchmarkRun (index.ts:547-558) throws on it. But manifest.substrate.{agentEval,agentRuntime,agentInterface,sandbox} are validated only as non-empty strings (validateProductBenchmarkManifest:308-311), and packageVersion (export.ts) returns 'unknown' whenever the dep is absent from cwd package.json. A bundle whose entire substrate block is 'unknown' — un-reproducible — passes assertProductBenchmarkRun cleanly. Reproducibility of a benchmark depends on substrate identity at least as much as on repo branch. Add a substrateFailures list (same shape as repoFailures) flagging any substrate key === 'unknown', and include it in assertProductBenchmarkRun's failure union at index.ts:553.
🟡 LOW Test coverage gaps: option overrides and individually-exported helpers untested — src/product-benchmark/export.test.ts
The four product-drift tests + merge/single-run/error tests cover the core matrix well, but several ProductBenchmarkExportOptions fields are never asserted: passThreshold (only the default 0.7 is exercised), backendVersion override, mutableSurfaces override (only default asserted at line ~117), scenarioTagPrefix override (only the projectId→strip heuristic). The individually-exported helpers runRecordToProductBenchmarkRecord and buildProductBenchmarkManifest are exercised only indirectly through exportProductBenchmarkRuns; productBenchmarkRepoIdentity is never called directly. Add one test per override to lock the option contract.
🟡 LOW Circular module dependency between index.ts and export.ts — src/product-benchmark/export.ts
export.ts imports
validateProductBenchmarkManifestandvalidateProductBenchmarkRecordfrom./index, while index.ts re-exports types and functions from./export(index.ts:560-572). Current load order happens to work because the validators are defined before the re-export in index.ts, but the coupling is fragile and can break with future reordering or bundler changes. Fix: move the shared validators and types that both modules need into a separatecontract.tsorvalidate.tsfile that both import, eliminating the cycle.
🟡 LOW Repo branch fallback 'detached:unknown' evades repo-failure detection — src/product-benchmark/export.ts
When git cannot resolve a branch or commit,
repoBranchreturnsdetached:unknown. TherepoFailurescheck invalidateProductBenchmarkRun(index.ts:474-476) only flags values that are empty or exactly equal to'unknown', sodetached:unknownpasses as a valid branch. This contradicts the stated intent that 'unknown' values are flagged. Fix: make the repo-failure check detect sentinels containing 'unknown' (e.g.,manifest.repo[key].includes('unknown')) or avoid embedding 'unknown' in the fallback string.
🟡 LOW Safety split detection only matches numeric 1, not boolean true — src/product-benchmark/export.ts
(record.outcome.raw as Record<string, number>).safety === 1is false when a harness emitssafety: true, becausetrue === 1is false in JavaScript. AlthoughRunOutcome.rawis typed asRecord<string, number>, the export ingests raw JSONL and casts viaexpectRunRecordShape, so a boolean can reach this check at runtime. A safety row would then be misclassified as practice/dev/holdout based on splitTag. Fix: accept both1andtrue(e.g.,raw.safety === 1 || raw.safety === true).
🟡 LOW packageVersion records dep semver range, not installed version — src/product-benchmark/export.ts
packageVersion reads cwd package.json and returns
pkg.dependencies?.[name] ?? pkg.devDependencies?.[name] ?? 'unknown'— i.e. a range like '^0.101.0', not the resolved/installed version from node_modules//package.json. The field is recorded under manifest.substrate.* and backend.version, both used for reproducibility. A range understates precision (many satisfying versions exist). Consistent with prior product exporters per the doc comment, so not a regression — but reading node_modules//package.json (or accepting an explicit resolvedVersion option) would make the substrate block actually pin a run. Worth a follow-up.
🟡 LOW sourceProfileHash fallback mixes cellId with source hash semantics — src/product-benchmark/export.ts
The fallback chain at line 308–309 reads
record.agentProfile?.sourceProfile?.hash ?? record.agentProfile?.cellId. ButcellIdis"agent-profile-cell:sha256:<full cell hash>"— the identity of the entire agent profile cell (profileId + sourceProfile + harness + model + dimensions + promptHash) — not just the source profile. Using it where a source-profile hash is expected produces a materialized profile hash that incorporates cell metadata rather than just{sourceHash, model}. The result is deterministic and self-consistent within the bundle (same computation used in manifest and records), but a consumer trying to independently verify the profile has
🟡 LOW splitTag==='search' silently collapses to 'practice' (undocumented mapping) — src/product-benchmark/export.ts
splitOf: custom → safety(raw.safety===1) → 'holdout' → 'dev' → else 'practice'. RunRecord.splitTag is 'search'|'dev'|'holdout' (run-record.ts:33); product-benchmark splits add 'safety'/'sentinel' but NOT 'search'. A row with splitTag='search' (the optimization pool per run-record.ts:32) falls through to 'practice' with no comment. This is a defensible default (search ≈ practice) but is invisible to callers and not tested. Either document the mapping in the splitOf docstring or add an explicit
if (record.splitTag === 'search') return 'practice'line with a one-line comment so the intent is visible at the decision site.
tangletools · 2026-07-02T03:19:05Z · trace
tangletools
left a comment
There was a problem hiding this comment.
✅ Approved — 10 non-blocking findings — 45a3c090
Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-07-02T03:19:05Z · immutable trace
…venance - arms/scenarios dedup fails loud when records sharing an id disagree on a policy axis or split (id-only dedup silently misrepresented them) - substrateFailures mirrors repoFailures: an 'unknown' substrate version fails assertProductBenchmarkRun (un-reproducible bundle) - packageVersion falls back to the INSTALLED node_modules version (transitive deps record real provenance), plus an explicit substrate override option for environments where a package is absent
|
All three MEDIUMs fixed, latest push:
Full suite 2671 passed / 2 skipped; lint + typecheck clean; branch merged with main. |
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved drewstone PR — 282ba8bf
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-07-02T05:06:27Z
Problem
research-package.tsis triplicated across tax-agent (687 lines), legal-agent (666), and creative-agent (536) — and has already drifted 359 lines between tax and legal. agent-eval main already owns theproduct-benchmarkschema + validators (ProductBenchmarkRecord,ProductBenchmarkManifest,validateProductBenchmarkRun, integrity failures); what the products still hand-roll is the EXPORT side: converting RunRecords, building manifests, and writing the bundle.Change
Additive to the existing
@tangle-network/agent-eval/product-benchmarksubpath — no consumer updates required until products migrate:runRecordToProductBenchmarkRecord— RunRecord + artifact pointers → validated benchmark recordbuildProductBenchmarkManifest/productBenchmarkRepoIdentity— manifest with repo/commit/package provenanceexportProductBenchmark/exportProductBenchmarkRuns— run-dir(s) → manifest + records JSONL + artifact pointers, validated on writeProduct-specific code (scenario lists, runner commands, scoring, profile builders) stays in products; the next step is migrating tax/legal/creative
research-package.tsonto these helpers after release.Verification
npx vitest run src/product-benchmark/→ 12 pass / 0 failnpx tsc --noEmit→ clean🤖 Generated with Claude Code