ci: harden npm publish workflow by drewstone · Pull Request #7 · tangle-network/browser-agent-driver

drewstone · 2026-03-04T00:35:41Z

Summary

support both release publish and tag push triggers (v*, agent-browser-driver-v*)
add strict tag-to-package-version guard
keep provenance publish and add NPM_TOKEN support

Validation

pnpm install --frozen-lockfile
pnpm build
pnpm test

The competitive bench at v0.19.0 surfaced a real architectural bug in bad's planner: on extraction tasks the planner emits runScript→complete with placeholder values (null, "<from prior step>") in complete.result because the planner has to commit to result text BEFORE runScript runs. Pre-fix: dashboard-extract pass rate 0/3 = 0%. Three layers of defense: 1. executePlan placeholder substitution (deterministic, runner-side). Tracks lastRunScriptOutput across plan steps. When a complete.result matches hasPlaceholderPattern() AND there's a captured runScript output, substitutes the runScript output as the actual final result. Marks the substituted turn with "Gen 7.2 substituted" for forensics. 2. executePlan auto-complete-from-runScript. When the planner correctly emits ONLY runScript (no complete) per the new prompt rule, and the plan exhausts, the runner synthesizes a complete with the runScript output instead of falling through to the per-action loop. Eliminates 4-5 wasted per-action LLM calls. 3. Planner system prompt rule #7. New rule explicitly tells the planner: "EXTRACTION TASKS: when the goal asks you to READ, EXTRACT, REPORT, or RETURN values from the page, the LAST step of your plan MUST be runScript. Do NOT emit a complete step after the runScript with literal values, because at planning time you cannot know what runScript will return." Byte-stable so prompt cache still hits. hasPlaceholderPattern detects: - JSON null literals: {"x": null, "y": null} - Angle-bracket placeholders: <from prior step>, <placeholder>, <value from ...>, <extracted ...>, <observed ...>, <previous step> - Double-curly templates: {{userCount}} Conservative — "null" in prose like "null pointer exception" does NOT match because we look for the JSON null literal pattern (`: null`). Verified result (5 reps × dashboard-extract, isolated run, gpt-5.2): | metric | n | mean | min | max | |-----------------|---|--------:|--------:|--------:| | pass rate | 5 | 100% | - | - | | wall-time (s) | 5 | 7.7 | 5.1 | 9.4 | | turns | 5 | 2.0 | 2 | 2 | | LLM calls | 5 | 1.0 | 1 | 1 | | total tokens | 5 | 3835 | 3700 | 4015 | | cost ($) | 5 | 0.0131 | 0.0112 | 0.0156 | Wilson 95% CI on pass rate: [57%, 100%]. Pre-Gen 7.2 (v0.19.0): bad scored 0/3 = 0% on this task. Post-Gen 7.2 (this PR): 5/5 = 100% AND beats browser-use: | metric | bad (Gen 7.2) | browser-use 0.12.6 | verdict | |-----------|--------------:|-------------------:|------------------| | pass rate | 100% (5/5) | 100% (3/3) | tied | | wall-time | 7.7s | 20.6s | bad 2.7x faster | | LLM calls | 1.0 | 3.0 | bad 3x fewer | | tokens | 3835 | 19908 | bad 5.2x fewer | | cost | $0.0131 | $0.0258 | bad 49% cheaper | Tests: 944 passing (937 + 7 new): - placeholder substitution happy path - leave-unchanged when no placeholders - auto-complete-from-runScript when plan ends with successful runScript - does NOT auto-complete on empty runScript output - hasPlaceholderPattern: detects all 3 placeholder types, ignores prose and clean JSON Tier1 deterministic gate: PASSED (no regressions). Honest caveat: a concurrent 3-rep run during the full competitive grid (with parallel chromium contention from tier1-gate running) showed 2/3 = 67% — one rep had the LLM-generated runScript picking the wrong DOM element ("+12.5% from last month" subtitle instead of "12,847" value). That's LLM script quality, separate from this fix; tracked as a future Gen 7.3 follow-up. The Gen 7.2 mechanism is verified deterministic by the 7 unit tests.

…sks (#55) The competitive bench at v0.19.0 surfaced a real architectural bug in bad's planner: on extraction tasks the planner emits runScript→complete with placeholder values (null, "<from prior step>") in complete.result because the planner has to commit to result text BEFORE runScript runs. Pre-fix: dashboard-extract pass rate 0/3 = 0%. Three layers of defense: 1. executePlan placeholder substitution (deterministic, runner-side). Tracks lastRunScriptOutput across plan steps. When a complete.result matches hasPlaceholderPattern() AND there's a captured runScript output, substitutes the runScript output as the actual final result. Marks the substituted turn with "Gen 7.2 substituted" for forensics. 2. executePlan auto-complete-from-runScript. When the planner correctly emits ONLY runScript (no complete) per the new prompt rule, and the plan exhausts, the runner synthesizes a complete with the runScript output instead of falling through to the per-action loop. Eliminates 4-5 wasted per-action LLM calls. 3. Planner system prompt rule #7. New rule explicitly tells the planner: "EXTRACTION TASKS: when the goal asks you to READ, EXTRACT, REPORT, or RETURN values from the page, the LAST step of your plan MUST be runScript. Do NOT emit a complete step after the runScript with literal values, because at planning time you cannot know what runScript will return." Byte-stable so prompt cache still hits. hasPlaceholderPattern detects: - JSON null literals: {"x": null, "y": null} - Angle-bracket placeholders: <from prior step>, <placeholder>, <value from ...>, <extracted ...>, <observed ...>, <previous step> - Double-curly templates: {{userCount}} Conservative — "null" in prose like "null pointer exception" does NOT match because we look for the JSON null literal pattern (`: null`). Verified result (5 reps × dashboard-extract, isolated run, gpt-5.2): | metric | n | mean | min | max | |-----------------|---|--------:|--------:|--------:| | pass rate | 5 | 100% | - | - | | wall-time (s) | 5 | 7.7 | 5.1 | 9.4 | | turns | 5 | 2.0 | 2 | 2 | | LLM calls | 5 | 1.0 | 1 | 1 | | total tokens | 5 | 3835 | 3700 | 4015 | | cost ($) | 5 | 0.0131 | 0.0112 | 0.0156 | Wilson 95% CI on pass rate: [57%, 100%]. Pre-Gen 7.2 (v0.19.0): bad scored 0/3 = 0% on this task. Post-Gen 7.2 (this PR): 5/5 = 100% AND beats browser-use: | metric | bad (Gen 7.2) | browser-use 0.12.6 | verdict | |-----------|--------------:|-------------------:|------------------| | pass rate | 100% (5/5) | 100% (3/3) | tied | | wall-time | 7.7s | 20.6s | bad 2.7x faster | | LLM calls | 1.0 | 3.0 | bad 3x fewer | | tokens | 3835 | 19908 | bad 5.2x fewer | | cost | $0.0131 | $0.0258 | bad 49% cheaper | Tests: 944 passing (937 + 7 new): - placeholder substitution happy path - leave-unchanged when no placeholders - auto-complete-from-runScript when plan ends with successful runScript - does NOT auto-complete on empty runScript output - hasPlaceholderPattern: detects all 3 placeholder types, ignores prose and clean JSON Tier1 deterministic gate: PASSED (no regressions). Honest caveat: a concurrent 3-rep run during the full competitive grid (with parallel chromium contention from tier1-gate running) showed 2/3 = 67% — one rep had the LLM-generated runScript picking the wrong DOM element ("+12.5% from last month" subtitle instead of "12,847" value). That's LLM script quality, separate from this fix; tracked as a future Gen 7.3 follow-up. The Gen 7.2 mechanism is verified deterministic by the 7 unit tests.

ci: harden npm publish workflow for tag and release triggers

4d5c31e

drewstone merged commit 3b678a7 into main Mar 4, 2026
3 checks passed

drewstone mentioned this pull request Apr 8, 2026

feat(runner): Gen 7.2 — fix planner placeholder bug for extraction tasks #55

Merged

7 tasks

github-actions Bot mentioned this pull request Apr 8, 2026

Release: version packages #56

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: harden npm publish workflow#7

ci: harden npm publish workflow#7
drewstone merged 1 commit into
mainfrom
chore/publish-workflow-hardening

drewstone commented Mar 4, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

drewstone commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

drewstone commented Mar 4, 2026 •

edited

Loading