feat(0.7.2): extract muffled-gate scanner + CostTracker.recordVerdict by drewstone · Pull Request #7 · tangle-network/agent-eval

drewstone · 2026-04-24T22:10:47Z

What

Promote two reusable primitives out of starter-foundry so every agent-eval consumer gets them for free.

1. `scanForMuffledGates` + default finders

Test helper that greps consumer source for gate/measurement anti-patterns. Five default finders cover the common forms (fallback-to-pass, literal-true-pass, auto-match-no-expectation, skip-counts-as-pass, construct-vs-call-cwd). It supports per-file, context-specific finders and an auto-derived scan across importers. A `// muffle-ok: <reason>` annotation is the opt-out escape hatch.

The pattern is documented at starter-foundry/.evolve/patterns/muffled-gate.md (both gating + measurement layers). 10+ incidents in starter-foundry motivated this; the same class hits every consumer.
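To make the idea concrete, here is a minimal sketch of a line-based scanner in the spirit of the description above. It is not the code in src/muffled-gate-scanner.ts: `scanSource`, `SKETCH_FINDERS`, the `Finder` shape, and both regexes are assumptions made for illustration. Only the `{file, line, pattern}` finding shape, the finder names, and the `muffle-ok` escape hatch come from this PR.

```typescript
// Hypothetical sketch; names and regexes are illustrative, not the agent-eval API.
interface Finding {
  file: string;
  line: number;
  pattern: string;
}

interface Finder {
  name: string;
  regex: RegExp;
}

// Two illustrative finders, loosely modeled on pattern names from this PR.
const SKETCH_FINDERS: Finder[] = [
  // e.g. `passed: result.ok ?? true` or `pass = x || true`
  { name: "fallback-to-pass", regex: /\bpass(ed)?\s*[:=]\s*[^;\n]*(\?\?|\|\|)\s*true\b/ },
  // e.g. `pass: true`
  { name: "literal-true-pass", regex: /\bpass(ed)?\s*:\s*true\b/ },
];

// A line carrying `// muffle-ok: <reason>` is skipped entirely.
const ESCAPE = /\/\/\s*muffle-ok:\s*\S+/;

function scanSource(
  file: string,
  source: string,
  finders: Finder[] = SKETCH_FINDERS,
): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, index) => {
    if (ESCAPE.test(text)) return; // opted out, with a stated reason
    for (const finder of finders) {
      if (finder.regex.test(text)) {
        // Line numbers are reported 1-indexed.
        findings.push({ file, line: index + 1, pattern: finder.name });
      }
    }
  });
  return findings;
}

// Usage: scan an in-memory snippet instead of reading files from disk.
const sampleSource = [
  "const verdict = { passed: result.ok ?? true };",
  "return { pass: true }; // muffle-ok: fixture for docs",
  "return { pass: true };",
  "const ok = compile();",
].join("\n");

const sampleFindings = scanSource("gate.ts", sampleSource);
```

A consumer test would then assert that the findings list is empty for its own source tree, so that a new muffled gate fails CI unless it carries an explicit `muffle-ok` reason.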

2. `CostTracker.recordVerdict(verdict, scenarioId, tags?)`

Convenience wrapper over record + markOutcome for {usage, verdict}-shaped judge responses. Returns null and no-ops when the verdict has no usage (compile-gate short-circuits don't spend). Starter-foundry's agent-eval-scaffold.mjs hand-rolls this 3-line pattern per seed; it is now one call.
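The record + markOutcome collapse can be sketched as follows. This is a standalone illustration, not the class in src/cost-tracker.ts: `SketchCostTracker`, the `Usage`, `JudgeVerdict`, and `CostEntry` shapes, and the signatures of `record` and `markOutcome` are all assumptions. Only the `recordVerdict(verdict, scenarioId, tags?)` signature and the null-on-missing-usage behavior are taken from this PR.

```typescript
// Hypothetical shapes; the real CostTracker types may differ.
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

type Outcome = "pass" | "fail";

interface JudgeVerdict {
  usage?: Usage; // absent when a compile gate short-circuits before the judge runs
  verdict: Outcome;
}

interface CostEntry {
  id: number;
  scenarioId: string;
  usage: Usage;
  tags: string[];
  outcome?: Outcome;
}

class SketchCostTracker {
  private entries: CostEntry[] = [];

  record(usage: Usage, scenarioId: string, tags: string[] = []): CostEntry {
    const entry: CostEntry = { id: this.entries.length, scenarioId, usage, tags };
    this.entries.push(entry);
    return entry;
  }

  markOutcome(id: number, outcome: Outcome): void {
    const entry = this.entries[id];
    if (entry) entry.outcome = outcome;
  }

  // One call instead of guard + record + markOutcome at every call site.
  recordVerdict(verdict: JudgeVerdict, scenarioId: string, tags?: string[]): CostEntry | null {
    if (!verdict.usage) return null; // nothing was spent, so nothing is recorded
    const entry = this.record(verdict.usage, scenarioId, tags);
    this.markOutcome(entry.id, verdict.verdict);
    return entry;
  }

  get size(): number {
    return this.entries.length;
  }
}

// Usage: a short-circuited verdict is a no-op, a judged one is recorded and marked.
const tracker = new SketchCostTracker();
const skipped = tracker.recordVerdict({ verdict: "fail" }, "seed-1");
const recorded = tracker.recordVerdict(
  { usage: { inputTokens: 120, outputTokens: 30 }, verdict: "pass" },
  "seed-2",
  ["judge"],
);
```

Returning null rather than throwing keeps the caller free of its own usage guard, which is the point of the helper as described above.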

Impact

- agent-eval becomes the canonical home for both primitives
- starter-foundry can replace its hand-rolled copy with an import (follow-up PR)
- any future consumer (BA, GTM agent, third-party) gets both for free

Test plan

- pnpm build: clean
- pnpm test: 336/336 (+12 new: 7 scanner + 4 recordVerdict + 1 absorbed)
- No breaking changes; purely additive exports
- No Co-Authored-By

…or fallbacks

Motivation: meta-analysis of starter-foundry's Gen 6→Round-0-post-Gen-9 arc surfaced 10+ incident-driven lessons about using this package. They lived nowhere canonical (README/CLAUDE.md described a stale 0.2-era API surface; the actual v0.7 builder-of-builders + sandbox harness exports had zero usage docs). Two shipped bugs traced to the same driver construct-vs-call cwd footgun. Consolidating into one authoritative doc + closing the footgun at the source.

Changes:

- .claude/skills/agent-eval/SKILL.md (NEW, sole source of truth)
  - minimal builder-of-builders path
  - 4 footguns (cwd-in-constructor, fallback-to-pass, fidelity-without-compile-gate, blob-vs-files channel)
  - 3 rules (both gates, single-source dispatch, Phase 1.5 walks entry points)
  - three-layer eval contract (builder → app-build → app-runtime)
  - regression tests every consumer should carry
  - extend-don't-duplicate index over the 100+ exports
  - muffled-gate pattern catalog (7 sub-shapes from shipped bugs)
- README.md + CLAUDE.md → pointers to SKILL.md. No duplicated content.
- SubprocessSandboxDriver constructor now accepts `{cwd?, env?}` as FALLBACKS when HarnessConfig omits them. Per-call config always wins. Pre-0.7.1 the constructor took no declared args, so TS tolerated `new Driver({cwd})` and silently dropped the arg at runtime: the exact shape of the Gen 8b promoter + Round-0 runtime eval bugs in starter-foundry. 0.7.1 makes the natural misuse do the obvious thing. New type: `SubprocessDriverDefaults`. Zero breaking changes for code that already reads cwd from HarnessConfig (the documented path).
- tests/sandbox-harness.test.ts: +3 tests guarding the new defaults contract: default.cwd honored, per-call wins over default, defaults.env merges correctly.

322/322 tests pass (was 319; +3 new). typecheck clean. Version: 0.7.0 → 0.7.1.
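The constructor-fallback contract in this commit (constructor values are defaults, per-call config wins) can be sketched in isolation. `SketchDriver`, `resolve`, and both option interfaces are hypothetical stand-ins, not SubprocessSandboxDriver itself, and the key-level env merge shown here (per-call keys override default keys) is an assumption about what "merges correctly" means.

```typescript
// Hypothetical stand-ins for the driver options and per-call harness config.
interface DriverDefaultsSketch {
  cwd?: string;
  env?: Record<string, string>;
}

interface HarnessConfigSketch {
  cwd?: string;
  env?: Record<string, string>;
}

interface ResolvedSpawn {
  cwd: string | undefined;
  env: Record<string, string>;
}

class SketchDriver {
  private readonly defaults: DriverDefaultsSketch;

  // Declaring the argument means `new SketchDriver({ cwd })` is honored
  // instead of being silently dropped at runtime.
  constructor(defaults: DriverDefaultsSketch = {}) {
    this.defaults = defaults;
  }

  // Per-call config always wins; constructor values only fill the gaps.
  resolve(config: HarnessConfigSketch): ResolvedSpawn {
    return {
      cwd: config.cwd ?? this.defaults.cwd,
      // Assumed merge rule: defaults first, then per-call keys on top.
      env: { ...this.defaults.env, ...config.env },
    };
  }
}

// Usage: the three cases the new tests are described as guarding.
const driver = new SketchDriver({ cwd: "/work/default", env: { A: "1", B: "2" } });
const fromDefault = driver.resolve({});
const perCallWins = driver.resolve({ cwd: "/work/override" });
const mergedEnv = driver.resolve({ env: { B: "3" } });
```

Resolving at call time rather than in the constructor keeps HarnessConfig as the documented path while making the constructor form behave the way a caller would expect.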

PR #4 shipped the same SubprocessSandboxDriver constructor-fallback fix while this branch was open.

Resolved:

- src/sandbox-harness.ts + tests/sandbox-harness.test.ts: take main's version (functionally equivalent; the type is named SubprocessSandboxDriverOptions instead of SubprocessDriverDefaults; main's name is better and already shipped)
- src/index.ts: export SubprocessSandboxDriverOptions (main forgot to export the new type)
- tests: also fix the env-merge test's printenv form. BSD printenv on macOS only prints the first matched var, making the test platform-flaky. Switch to env|grep, which survives missing vars.

Net: keep SKILL.md + README/CLAUDE pointers + version bump + type re-export + macOS test fix. All 322 tests pass.

… helper

Two reusable primitives promoted out of starter-foundry so every agent-eval consumer gets them for free:

1) scanForMuffledGates() + DEFAULT_FINDERS + UNIVERSAL_FINDERS (src/muffled-gate-scanner.ts, exported from index)

Test helper that greps consumer source for gate/measurement anti-patterns and returns {file, line, pattern} findings. 5 default finders (fallback-to-pass, literal-true-pass, auto-match-no-expectation, skip-counts-as-pass, construct-vs-call-cwd). Supports per-file context-specific finders + auto-derived scan across importers of a target string (e.g. '@tangle-network/agent-eval'). `muffle-ok: <reason>` annotation is the opt-out escape hatch.

Pattern documented at starter-foundry/.evolve/patterns/muffled-gate.md (both gating + measurement layers). 10+ incidents in starter-foundry motivated this; any agent-eval consumer hits the same class.

2) CostTracker.recordVerdict(verdict, scenarioId, tags?) (src/cost-tracker.ts)

Convenience: record + markOutcome in one call from a {usage, verdict}-shaped judge response. Returns null + no-ops when verdict has no usage (e.g. compile-gate short-circuit) so callers don't need their own guard. Starter-foundry's agent-eval-scaffold.mjs hand-rolls this 3-line pattern per seed; now one call.

Tests: +12 (7 scanner + 4 recordVerdict + 1 absorbed). 336/336 pass. Build clean. Version 0.7.1 → 0.7.2. No breaking changes; purely additive exports.

drewstone added 5 commits April 24, 2026 00:55

feat: add harness optimization primitives

31cf927

Merge branch 'main' into feat/muffled-gate-testing-util

3dbd386

drewstone merged commit 2bd19ca into main Apr 24, 2026

drewstone deleted the feat/muffled-gate-testing-util branch May 8, 2026 14:49

drewstone mentioned this pull request May 19, 2026

feat(analyst): registry + findings envelope over existing primitives #56

Merged

4 tasks

tangletools mentioned this pull request May 22, 2026

[0.32.0+] evalReportingSuite — close the measurement loop with 5 primitives, not 30 #76

Open

9 tasks
