feat: add cross-model analysis step to eval workflow #1159

Merged
stack72 merged 1 commit into main from eval-cross-model-analysis
Apr 10, 2026

@stack72 stack72 commented Apr 10, 2026

Summary

  • Add a final analysis job to the multi-model eval workflow that runs after all model evals complete
  • Each eval job now uploads its results.json as an artifact
  • The analysis job downloads all results, produces a cross-model comparison, and writes a GitHub Actions step summary with:
    • Results table (model, pass rate, tokens, duration, pass/fail)
    • Cross-model failures (same test fails on multiple models → likely skill description issue)
    • Model-specific failures (only one model fails → likely model quirk)
    • Overall verdict
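The split between cross-model and model-specific failures described above can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/analyze_eval_results.ts`: the `ModelRun` shape, the `FailureBreakdown` type, and the `classifyFailures` name are all hypothetical, since the real `results.json` schema is not shown in this PR.

```typescript
// Hypothetical input shape: one entry per model, listing the tests it failed.
// The real results.json schema is not shown in this PR.
interface ModelRun {
  model: string;
  failedTests: string[];
}

// Hypothetical output shape for the two failure categories.
interface FailureBreakdown {
  // test name -> the two or more models that failed it
  crossModel: Map<string, string[]>;
  // test name -> the single model that failed it
  modelSpecific: Map<string, string>;
}

function classifyFailures(runs: ModelRun[]): FailureBreakdown {
  // Invert the data: for each test name, collect the models that failed it.
  const failuresByTest = new Map<string, string[]>();
  for (const run of runs) {
    for (const test of run.failedTests) {
      const models = failuresByTest.get(test) ?? [];
      // Guard against the same test being listed twice for one model.
      if (models.indexOf(run.model) === -1) {
        models.push(run.model);
      }
      failuresByTest.set(test, models);
    }
  }

  const crossModel = new Map<string, string[]>();
  const modelSpecific = new Map<string, string>();
  failuresByTest.forEach((models, test) => {
    if (models.length > 1) {
      // Several models fail the same test: likely a skill description issue.
      crossModel.set(test, models);
    } else {
      // Only one model fails: likely a model quirk.
      modelSpecific.set(test, models[0]);
    }
  });

  return { crossModel, modelSpecific };
}
```

Calling `classifyFailures` with one `ModelRun` per downloaded artifact would give the two groups that the step summary reports separately. Grouping by test name first keeps the classification to a single pass over the inverted map.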

Test plan

  • Tested analysis script locally against Gemini results — produces correct output
  • CI workflow run should show the new analysis step

🤖 Generated with Claude Code

Add a final analysis job that runs after all model evals complete.
It downloads each model's results.json, produces a cross-model
comparison, identifies shared vs model-specific failures, and
writes a summary to the GitHub Actions step summary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@github-actions github-actions bot left a comment

CI Security Review

Medium

  1. denoland/setup-deno@v2 tag-only pin (.github/workflows/multi-model-eval.yml:88): Third-party action pinned to a mutable tag rather than a full commit SHA. A compromised tag could deliver malicious code. However, this is a pre-existing pattern (same action at line 49, not changed in this PR) and denoland is an established trusted publisher per repo conventions. No action required for this PR, but consider SHA-pinning in a future cleanup pass.

Low

  1. Unscoped --allow-write Deno permission (.github/workflows/multi-model-eval.yml:98): The analysis step uses --allow-write without restricting the path. Could be tightened to --allow-write=$GITHUB_STEP_SUMMARY to follow least-privilege. Practical risk is negligible since the job has no secrets and runs on an ephemeral runner with contents: read only.

Verdict

PASS — The changes are security-clean. The new analysis job has properly scoped job-level permissions (contents: read), uses no secrets, processes only workflow-generated artifacts, and introduces no injection vectors. All new actions are GitHub-owned and appropriately pinned.

@github-actions github-actions bot left a comment

Code Review

Blocking Issues

None.

Suggestions

  1. Resilience to malformed results: In scripts/analyze_eval_results.ts:117, the destructuring const { stats, results } = data.results is outside the try/catch that handles missing files. If a model's eval job writes a partial or malformed results.json (e.g., data.results is undefined), this would crash the entire analysis rather than skipping that model. Consider wrapping the per-model processing block (lines 117–143) in its own try/catch with a console.warn + continue, matching the existing pattern for missing files.
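A rough sketch of the per-model guard suggested above, assuming each model's `results.json` has already been read into a string. The `EvalStats` fields, the `parseAllResults` name, and the `rawByModel` parameter are hypothetical and only illustrate the warn-and-skip pattern; they are not taken from the actual script.

```typescript
// Hypothetical stats shape; the real fields in results.json are not shown in this PR.
interface EvalStats {
  passed: number;
  total: number;
}

interface ParsedModel {
  model: string;
  stats: EvalStats;
}

// Narrow an unknown value to EvalStats without resorting to `any`.
function isEvalStats(value: unknown): value is EvalStats {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.passed === "number" && typeof v.total === "number";
}

// rawByModel maps a model name to the raw text of its results.json.
function parseAllResults(rawByModel: Record<string, string>): ParsedModel[] {
  const parsed: ParsedModel[] = [];
  for (const model of Object.keys(rawByModel)) {
    try {
      // JSON.parse throws on truncated or otherwise invalid JSON.
      const data = JSON.parse(rawByModel[model]) as
        | { results?: { stats?: unknown } }
        | null;
      const stats = data?.results?.stats;
      if (!isEvalStats(stats)) {
        throw new Error("missing or malformed results.stats");
      }
      parsed.push({ model, stats });
    } catch (err) {
      // One bad file should not abort the whole analysis:
      // warn, skip this model, and let the loop move on to the next one.
      const reason = err instanceof Error ? err.message : String(err);
      console.warn(`Skipping ${model}: ${reason}`);
    }
  }
  return parsed;
}
```

With this shape, a partial file from a failed eval job only removes that model from the comparison, and the remaining models are still analysed.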

Overall this is a clean, well-structured addition. The script follows project conventions (license header, no any types, proper unknown usage, named interfaces), the workflow permissions are appropriately scoped (contents: read), and the always() + artifact upload pattern correctly captures partial results from failed eval jobs. The scripts/ exclusion in deno.json is consistent with existing scripts not having tests or type-check requirements.

@stack72 stack72 merged commit 5236b24 into main Apr 10, 2026
11 checks passed
@stack72 stack72 deleted the eval-cross-model-analysis branch April 10, 2026 10:52