feat(0.7.2): extract muffled-gate scanner + CostTracker.recordVerdict #7
Merged
…or fallbacks
Motivation: meta-analysis of starter-foundry's Gen 6→Round-0-post-Gen-9
arc surfaced 10+ incident-driven lessons about using this package. They
lived nowhere canonical (README/CLAUDE.md described a stale 0.2-era API
surface; the actual v0.7 builder-of-builders + sandbox harness exports
had zero usage docs). Two shipped bugs traced to the same driver
construct-vs-call cwd footgun. Consolidating into one authoritative
doc + closing the footgun at the source.
Changes:
- .claude/skills/agent-eval/SKILL.md (NEW, sole source of truth)
  - minimal builder-of-builders path
  - 4 footguns (cwd-in-constructor, fallback-to-pass,
    fidelity-without-compile-gate, blob-vs-files channel)
  - 3 rules (both gates, single-source dispatch, Phase 1.5 walks entry
    points)
  - three-layer eval contract (builder → app-build → app-runtime)
  - regression tests every consumer should carry
  - extend-don't-duplicate index over the 100+ exports
  - muffled-gate pattern catalog (7 sub-shapes from shipped bugs)
- README.md + CLAUDE.md → pointers to SKILL.md. No duplicated content.
- SubprocessSandboxDriver constructor now accepts `{cwd?, env?}` as
  FALLBACKS when HarnessConfig omits them. Per-call config always wins.
  Pre-0.7.1 the constructor took no declared args, so TS tolerated
  `new Driver({cwd})` and silently dropped the arg at runtime: the
  exact shape of the Gen 8b promoter + Round-0 runtime eval bugs in
  starter-foundry. 0.7.1 makes the natural misuse do the obvious thing.
  New type: `SubprocessDriverDefaults`. Zero breaking changes for
  code that already reads cwd from HarnessConfig (the documented path).
- tests/sandbox-harness.test.ts: +3 tests guarding the new defaults
  contract: default.cwd honored, per-call wins over default,
  defaults.env merges correctly.
322/322 tests pass (was 319; +3 new). typecheck clean.
Version: 0.7.0 → 0.7.1.
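The constructor-fallback contract described in this commit (constructor defaults fill gaps, per-call config always wins, env merges) can be illustrated with a small standalone sketch. Everything here is hypothetical: `SketchDriver`, `DriverDefaults`, `CallConfig`, and `resolveConfig` are illustration-only names rather than the package's actual types, and the key-by-key env merge is an assumption about what "merges correctly" means.

```typescript
// Hypothetical shapes for illustration; NOT the package's real types.
interface DriverDefaults {
  cwd?: string;
  env?: Record<string, string>;
}

interface CallConfig {
  cwd?: string;
  env?: Record<string, string>;
}

interface ResolvedConfig {
  cwd: string | undefined;
  env: Record<string, string>;
}

// Per-call values win; constructor defaults only fill the gaps.
function resolveConfig(defaults: DriverDefaults, call: CallConfig): ResolvedConfig {
  return {
    cwd: call.cwd ?? defaults.cwd,
    // Assumed merge rule: per-call keys override same-named default keys.
    env: { ...(defaults.env ?? {}), ...(call.env ?? {}) },
  };
}

class SketchDriver {
  private readonly defaults: DriverDefaults;

  // Declaring the parameter means `new SketchDriver({ cwd })` is kept,
  // instead of being silently dropped at runtime.
  constructor(defaults: DriverDefaults = {}) {
    this.defaults = defaults;
  }

  resolve(call: CallConfig = {}): ResolvedConfig {
    return resolveConfig(this.defaults, call);
  }
}

const driver = new SketchDriver({ cwd: "/tmp/app", env: { A: "1", B: "2" } });
const usedDefault = driver.resolve();
const overridden = driver.resolve({ cwd: "/tmp/other", env: { B: "override" } });
```

In this sketch `usedDefault.cwd` falls back to the constructor value, while `overridden` takes its `cwd` and the `B` key from the per-call config and keeps `A` from the defaults.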
PR #4 shipped the same SubprocessSandboxDriver constructor-fallback fix while this branch was open. Resolved:
- src/sandbox-harness.ts + tests/sandbox-harness.test.ts: take main's version (functionally equivalent; the type is named SubprocessSandboxDriverOptions instead of SubprocessDriverDefaults. main's name is better and already shipped)
- src/index.ts: export SubprocessSandboxDriverOptions (main forgot to export the new type)
- tests: also fix the env-merge test's printenv form. BSD printenv on macOS only prints the first matched var, making the test platform-flaky. Switch to env|grep, which survives missing vars.
Net: keep SKILL.md + README/CLAUDE pointers + version bump + type re-export + macOS test fix.
All 322 tests pass.
… helper
Two reusable primitives promoted out of starter-foundry so every
agent-eval consumer gets them for free:
1) scanForMuffledGates() + DEFAULT_FINDERS + UNIVERSAL_FINDERS
   (src/muffled-gate-scanner.ts, exported from index)
   Test helper that greps consumer source for gate/measurement
   anti-patterns and returns {file, line, pattern} findings. 5
   default finders (fallback-to-pass, literal-true-pass,
   auto-match-no-expectation, skip-counts-as-pass, construct-vs-call-cwd).
   Supports per-file context-specific finders + auto-derived scan
   across importers of a target string (e.g. '@tangle-network/agent-eval').
   `muffle-ok: <reason>` annotation is the opt-out escape hatch.
   Pattern documented at starter-foundry/.evolve/patterns/muffled-gate.md
   (both gating + measurement layers). 10+ incidents in starter-foundry
   motivated this; any agent-eval consumer hits the same class.
2) CostTracker.recordVerdict(verdict, scenarioId, tags?)
   (src/cost-tracker.ts)
   Convenience: record + markOutcome in one call from a
   {usage, verdict}-shaped judge response. Returns null + no-ops when
   verdict has no usage (e.g. compile-gate short-circuit) so callers
   don't need their own guard. Starter-foundry's agent-eval-scaffold.mjs
   hand-rolls this 3-line pattern per seed; now one call.
Tests: +12 (7 scanner + 4 recordVerdict + 1 absorbed). 336/336 pass.
Build clean. Version 0.7.1 → 0.7.2. No breaking changes; purely
additive exports.
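To make the scanner idea from this commit message concrete, here is a minimal standalone sketch of a line-oriented scan that returns {file, line, pattern} findings and honors a `muffle-ok: <reason>` opt-out. It is not the package's implementation: `scanSources`, `SKETCH_FINDERS`, the two regexes, and the choice to require the annotation on the same line as the match are all assumptions made for illustration, and the real `scanForMuffledGates()` signature is not shown in this PR text.

```typescript
// Hypothetical types and finders for illustration only.
interface Finder {
  pattern: string; // name reported in findings
  regex: RegExp;   // no `g` flag, so .test() carries no state between lines
}

interface Finding {
  file: string;
  line: number; // 1-based
  pattern: string;
}

// Two guessed regexes loosely modelled on the finder names in the commit.
const SKETCH_FINDERS: Finder[] = [
  { pattern: "fallback-to-pass", regex: /(\?\?|\|\|)\s*true\b/ },
  { pattern: "literal-true-pass", regex: /\bpass(ed)?\s*:\s*true\b/ },
];

// Scans in-memory sources (file name -> text) instead of the filesystem,
// which keeps the sketch self-contained.
function scanSources(
  sources: Record<string, string>,
  finders: Finder[] = SKETCH_FINDERS,
): Finding[] {
  const findings: Finding[] = [];
  for (const [file, text] of Object.entries(sources)) {
    text.split("\n").forEach((content, index) => {
      // Assumed opt-out rule: annotation with a non-empty reason on the same line.
      if (/muffle-ok:\s*\S/.test(content)) return;
      for (const finder of finders) {
        if (finder.regex.test(content)) {
          findings.push({ file, line: index + 1, pattern: finder.pattern });
        }
      }
    });
  }
  return findings;
}

const findings = scanSources({
  "gate.ts": [
    "const ok = result.pass ?? true;",
    "return { pass: true }; // muffle-ok: fixture for a smoke test",
    "return { pass: score > 0.8 };",
  ].join("\n"),
});
```

A consumer test would then assert that the findings array is empty. In this example only the first line is reported: the second is annotated, and the third compares against a real measurement.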
What
Promote two reusable primitives out of starter-foundry so every agent-eval consumer gets them for free.
1. scanForMuffledGates + default finders
Test helper that greps consumer source for gate/measurement anti-patterns. 5 default finders cover the common forms (fallback-to-pass, literal-true-pass, auto-match-no-expectation, skip-counts-as-pass, construct-vs-call-cwd). Supports per-file context-specific finders + auto-derived scan across importers. `// muffle-ok: <reason>` escape hatch.
Pattern is documented at starter-foundry/.evolve/patterns/muffled-gate.md (both gating + measurement layers). 10+ incidents in starter-foundry motivated this; same class hits every consumer.
2. CostTracker.recordVerdict(verdict, scenarioId, tags?)
Convenience wrapper over record + markOutcome for {usage, verdict}-shaped judge responses. Returns null + no-ops when verdict has no usage (compile-gate short-circuits don't spend). Starter-foundry's agent-eval-scaffold.mjs hand-rolls this 3-line pattern per seed; now one call.
Impact
Test plan
- pnpm build: clean
- pnpm test: 336/336 (+12 new: 7 scanner + 4 recordVerdict + 1 absorbed)
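The recordVerdict convenience described under "What" folds a guard, a record call, and a markOutcome call into one method. The sketch below shows that shape on a toy tracker. `SketchCostTracker`, `Usage`, `JudgeVerdict`, `CostEntry`, and the signatures of `record` and `markOutcome` are hypothetical stand-ins, since the real CostTracker API is not spelled out in this PR text.

```typescript
// Hypothetical shapes; the real CostTracker types may differ.
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

interface JudgeVerdict {
  usage?: Usage; // absent when e.g. a compile gate short-circuits
  verdict: "pass" | "fail";
}

interface CostEntry {
  scenarioId: string;
  usage: Usage;
  tags: string[];
  outcome?: "pass" | "fail";
}

class SketchCostTracker {
  private readonly entries: CostEntry[] = [];

  record(usage: Usage, scenarioId: string, tags: string[] = []): CostEntry {
    const entry: CostEntry = { scenarioId, usage, tags };
    this.entries.push(entry);
    return entry;
  }

  markOutcome(entry: CostEntry, outcome: "pass" | "fail"): void {
    entry.outcome = outcome;
  }

  // One call instead of the hand-rolled guard + record + markOutcome.
  recordVerdict(v: JudgeVerdict, scenarioId: string, tags: string[] = []): CostEntry | null {
    if (!v.usage) return null; // nothing was spent, so nothing is recorded
    const entry = this.record(v.usage, scenarioId, tags);
    this.markOutcome(entry, v.verdict);
    return entry;
  }

  count(): number {
    return this.entries.length;
  }
}

const tracker = new SketchCostTracker();
const judged = tracker.recordVerdict(
  { usage: { inputTokens: 120, outputTokens: 30 }, verdict: "pass" },
  "seed-1",
  ["judge"],
);
const shortCircuited = tracker.recordVerdict({ verdict: "fail" }, "seed-2");
```

Returning null for the usage-less verdict lets callers invoke the method unconditionally, which matches the "callers don't need their own guard" point in the commit message.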