Skip to content

ci: harden npm publish workflow#7

Merged
drewstone merged 1 commit into
mainfrom
chore/publish-workflow-hardening
Mar 4, 2026
Merged

ci: harden npm publish workflow#7
drewstone merged 1 commit into
mainfrom
chore/publish-workflow-hardening

Conversation

@drewstone
Copy link
Copy Markdown
Contributor

@drewstone drewstone commented Mar 4, 2026

Summary

  • support both release publish and tag push triggers (v*, agent-browser-driver-v*)
  • add strict tag-to-package-version guard
  • keep provenance publish and add NPM_TOKEN support

Validation

  • pnpm install --frozen-lockfile
  • pnpm build
  • pnpm test

@drewstone drewstone merged commit 3b678a7 into main Mar 4, 2026
3 checks passed
drewstone added a commit that referenced this pull request Apr 8, 2026
The competitive bench at v0.19.0 surfaced a real architectural bug in
bad's planner: on extraction tasks the planner emits runScript→complete
with placeholder values (null, "<from prior step>") in complete.result
because the planner has to commit to result text BEFORE runScript runs.
Pre-fix: dashboard-extract pass rate 0/3 = 0%.

Three layers of defense:

1. executePlan placeholder substitution (deterministic, runner-side).
   Tracks lastRunScriptOutput across plan steps. When a complete.result
   matches hasPlaceholderPattern() AND there's a captured runScript
   output, substitutes the runScript output as the actual final result.
   Marks the substituted turn with "Gen 7.2 substituted" for forensics.

2. executePlan auto-complete-from-runScript. When the planner correctly
   emits ONLY runScript (no complete) per the new prompt rule, and the
   plan exhausts, the runner synthesizes a complete with the runScript
   output instead of falling through to the per-action loop. Eliminates
   4-5 wasted per-action LLM calls.

3. Planner system prompt rule #7. New rule explicitly tells the planner:
   "EXTRACTION TASKS: when the goal asks you to READ, EXTRACT, REPORT,
   or RETURN values from the page, the LAST step of your plan MUST be
   runScript. Do NOT emit a complete step after the runScript with
   literal values, because at planning time you cannot know what
   runScript will return." Byte-stable so prompt cache still hits.

hasPlaceholderPattern detects:
- JSON null literals: {"x": null, "y": null}
- Angle-bracket placeholders: <from prior step>, <placeholder>,
  <value from ...>, <extracted ...>, <observed ...>, <previous step>
- Double-curly templates: {{userCount}}
Conservative — "null" in prose like "null pointer exception" does NOT
match because we look for the JSON null literal pattern (`: null`).

Verified result (5 reps × dashboard-extract, isolated run, gpt-5.2):

| metric          | n | mean    | min     | max     |
|-----------------|---|--------:|--------:|--------:|
| pass rate       | 5 | 100%    | -       | -       |
| wall-time (s)   | 5 | 7.7     | 5.1     | 9.4     |
| turns           | 5 | 2.0     | 2       | 2       |
| LLM calls       | 5 | 1.0     | 1       | 1       |
| total tokens    | 5 | 3835    | 3700    | 4015    |
| cost ($)        | 5 | 0.0131  | 0.0112  | 0.0156  |

Wilson 95% CI on pass rate: [57%, 100%].

Pre-Gen 7.2 (v0.19.0): bad scored 0/3 = 0% on this task.
Post-Gen 7.2 (this PR): 5/5 = 100% AND beats browser-use:

| metric    | bad (Gen 7.2) | browser-use 0.12.6 | verdict          |
|-----------|--------------:|-------------------:|------------------|
| pass rate | 100% (5/5)    | 100% (3/3)         | tied             |
| wall-time | 7.7s          | 20.6s              | bad 2.7x faster  |
| LLM calls | 1.0           | 3.0                | bad 3x fewer     |
| tokens    | 3835          | 19908              | bad 5.2x fewer   |
| cost      | $0.0131       | $0.0258            | bad 49% cheaper  |

Tests: 944 passing (937 + 7 new):
- placeholder substitution happy path
- leave-unchanged when no placeholders
- auto-complete-from-runScript when plan ends with successful runScript
- does NOT auto-complete on empty runScript output
- hasPlaceholderPattern: detects all 3 placeholder types, ignores prose
  and clean JSON

Tier1 deterministic gate: PASSED (no regressions).

Honest caveat: a concurrent 3-rep run during the full competitive grid
(with parallel chromium contention from tier1-gate running) showed
2/3 = 67% — one rep had the LLM-generated runScript picking the wrong
DOM element ("+12.5% from last month" subtitle instead of "12,847"
value). That's LLM script quality, separate from this fix; tracked as
a future Gen 7.3 follow-up. The Gen 7.2 mechanism is verified
deterministic by the 7 unit tests.
drewstone added a commit that referenced this pull request Apr 8, 2026
…sks (#55)

The competitive bench at v0.19.0 surfaced a real architectural bug in
bad's planner: on extraction tasks the planner emits runScript→complete
with placeholder values (null, "<from prior step>") in complete.result
because the planner has to commit to result text BEFORE runScript runs.
Pre-fix: dashboard-extract pass rate 0/3 = 0%.

Three layers of defense:

1. executePlan placeholder substitution (deterministic, runner-side).
   Tracks lastRunScriptOutput across plan steps. When a complete.result
   matches hasPlaceholderPattern() AND there's a captured runScript
   output, substitutes the runScript output as the actual final result.
   Marks the substituted turn with "Gen 7.2 substituted" for forensics.

2. executePlan auto-complete-from-runScript. When the planner correctly
   emits ONLY runScript (no complete) per the new prompt rule, and the
   plan exhausts, the runner synthesizes a complete with the runScript
   output instead of falling through to the per-action loop. Eliminates
   4-5 wasted per-action LLM calls.

3. Planner system prompt rule #7. New rule explicitly tells the planner:
   "EXTRACTION TASKS: when the goal asks you to READ, EXTRACT, REPORT,
   or RETURN values from the page, the LAST step of your plan MUST be
   runScript. Do NOT emit a complete step after the runScript with
   literal values, because at planning time you cannot know what
   runScript will return." Byte-stable so prompt cache still hits.

hasPlaceholderPattern detects:
- JSON null literals: {"x": null, "y": null}
- Angle-bracket placeholders: <from prior step>, <placeholder>,
  <value from ...>, <extracted ...>, <observed ...>, <previous step>
- Double-curly templates: {{userCount}}
Conservative — "null" in prose like "null pointer exception" does NOT
match because we look for the JSON null literal pattern (`: null`).

Verified result (5 reps × dashboard-extract, isolated run, gpt-5.2):

| metric          | n | mean    | min     | max     |
|-----------------|---|--------:|--------:|--------:|
| pass rate       | 5 | 100%    | -       | -       |
| wall-time (s)   | 5 | 7.7     | 5.1     | 9.4     |
| turns           | 5 | 2.0     | 2       | 2       |
| LLM calls       | 5 | 1.0     | 1       | 1       |
| total tokens    | 5 | 3835    | 3700    | 4015    |
| cost ($)        | 5 | 0.0131  | 0.0112  | 0.0156  |

Wilson 95% CI on pass rate: [57%, 100%].

Pre-Gen 7.2 (v0.19.0): bad scored 0/3 = 0% on this task.
Post-Gen 7.2 (this PR): 5/5 = 100% AND beats browser-use:

| metric    | bad (Gen 7.2) | browser-use 0.12.6 | verdict          |
|-----------|--------------:|-------------------:|------------------|
| pass rate | 100% (5/5)    | 100% (3/3)         | tied             |
| wall-time | 7.7s          | 20.6s              | bad 2.7x faster  |
| LLM calls | 1.0           | 3.0                | bad 3x fewer     |
| tokens    | 3835          | 19908              | bad 5.2x fewer   |
| cost      | $0.0131       | $0.0258            | bad 49% cheaper  |

Tests: 944 passing (937 + 7 new):
- placeholder substitution happy path
- leave-unchanged when no placeholders
- auto-complete-from-runScript when plan ends with successful runScript
- does NOT auto-complete on empty runScript output
- hasPlaceholderPattern: detects all 3 placeholder types, ignores prose
  and clean JSON

Tier1 deterministic gate: PASSED (no regressions).

Honest caveat: a concurrent 3-rep run during the full competitive grid
(with parallel chromium contention from tier1-gate running) showed
2/3 = 67% — one rep had the LLM-generated runScript picking the wrong
DOM element ("+12.5% from last month" subtitle instead of "12,847"
value). That's LLM script quality, separate from this fix; tracked as
a future Gen 7.3 follow-up. The Gen 7.2 mechanism is verified
deterministic by the 7 unit tests.
@github-actions github-actions Bot mentioned this pull request Apr 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant