Pull request overview
Updates the evaluation harness and GitHub Actions workflow to run LLM-based evals with a newer Gemini configuration and to allow Claude Code to execute Bash tools during tests.
Changes:
- Enable Claude Code CLI Bash tool usage in the login eval.
- Update eval documentation to reference a newer Gemini model.
- Adjust the eval workflow triggers/config and switch the configured Gemini model string.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| evals/test_login.py | Allows Claude Code CLI to use the Bash tool during the eval run. |
| evals/README.md | Updates the documented Gemini model used by GEval. |
| .github/workflows/run-evals.yml | Changes workflow triggers/config and updates the Gemini model configuration; enables Vertex usage flag for Claude. |
Comments suppressed due to low confidence (2)
evals/test_login.py:14
`subprocess.run(...)` doesn't check the Claude CLI exit status; if the CLI fails (e.g., due to an unrecognized flag), the test will still proceed with empty/partial stdout and produce misleading evaluation results. Consider using `check=True` (or explicitly validating `returncode`) and surfacing `stderr` in the failure message.
result = subprocess.run(
    ['claude', '-p', prompt, '--allowedTools', 'Bash'],
    capture_output=True,
    text=True,
    timeout=3000
)
return result.stdout
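A minimal sketch of the suggestion above, assuming the eval wraps the CLI call in a small helper. The function name `run_cli` and the wording of the error message are hypothetical and not taken from the PR; only `subprocess.run` and its `returncode`/`stderr` attributes are standard library behavior.

```python
import subprocess


def run_cli(cmd, timeout=3000):
    """Run a CLI command and return its stdout, failing loudly on a non-zero exit.

    `run_cli` is a hypothetical helper name used for illustration only.
    """
    result = subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # Validate the exit status explicitly so stderr can be included in the message.
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with status {result.returncode}:\n{result.stderr}"
        )
    return result.stdout
```

With a helper like this, the call in the test could become `run_cli(['claude', '-p', prompt, '--allowedTools', 'Bash'])`, so an unrecognized flag would fail the test instead of feeding empty output to the evaluator. Passing `check=True` to `subprocess.run` is the shorter alternative; the explicit check is shown here because it puts `stderr` directly into the exception text.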
.github/workflows/run-evals.yml:73
With Bash tools enabled for Claude during evals, model output (and generated reports/artifacts) could inadvertently include sensitive data from the runner environment. Consider isolating secrets from the Claude execution context (e.g., separate steps/jobs, least-privilege credentials, or redact/suppress tool output in artifacts) to reduce the blast radius.
- name: Run evaluations
  working-directory: ./evals
  env:
    GOOGLE_APPLICATION_CREDENTIALS: ${{ steps.auth.outputs.credentials_file_path }}
    ANTHROPIC_VERTEX_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }}
    ANTHROPIC_VERTEX_REGION: "global"
    ANTHROPIC_DEFAULT_HAIKU_MODEL: "claude-haiku-4-5"
    ANTHROPIC_DEFAULT_SONNET_MODEL: "claude-sonnet-4-5"
    ANTHROPIC_DEFAULT_OPUS_MODEL: "claude-opus-4-5"
    CLAUDE_CODE_SUBAGENT_MODEL: "claude-sonnet-4-5"
    CLAUDE_CODE_USE_VERTEX: "1"
  run: |
    source .venv/bin/activate
    deepeval set-gemini \
      --model=gemini-3.1-pro-preview \
      --project=${{ secrets.GCP_PROJECT_ID }} \
      --location=global
    deepeval test run . \
      --junitxml=results.xml \
      --html=report.html \
      --self-contained-html
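One way to narrow what the Claude subprocess can see, sketched under assumptions: the eval launches the CLI from Python and can pass it an explicit environment. The names `ALLOWED_ENV`, `minimal_env`, and `run_isolated` are hypothetical, and the allow-list is a guess at what the CLI needs based on the variables in the workflow step above.

```python
import os
import subprocess

# Hypothetical allow-list: only the variables the CLI is assumed to need.
ALLOWED_ENV = (
    "PATH",
    "HOME",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "ANTHROPIC_VERTEX_PROJECT_ID",
    "ANTHROPIC_VERTEX_REGION",
    "CLAUDE_CODE_USE_VERTEX",
)


def minimal_env(source=None, allowed=ALLOWED_ENV):
    """Return a copy of the environment restricted to allow-listed names."""
    source = os.environ if source is None else source
    return {name: source[name] for name in allowed if name in source}


def run_isolated(cmd, timeout=3000):
    """Run cmd with the restricted environment instead of inheriting everything."""
    return subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        timeout=timeout,
        env=minimal_env(),
    )
```

This only reduces exposure. Any credential that remains on the allow-list, such as the file referenced by `GOOGLE_APPLICATION_CREDENTIALS`, is still readable by Bash tool calls, so least-privilege credentials and artifact redaction remain relevant.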
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.
This reverts commit 1aa8c06.
Co-authored-by: Patrick Dawkins <pjcdawkins@users.noreply.github.com>
Replace the `push` trigger with `pull_request` and restore the fork-safety guard to skip PRs from forks (which lack access to repository secrets).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>