fix eval in the GH action by ganeshdipdumbare · Pull Request #4 · upsun/ai

ganeshdipdumbare · 2026-03-10T13:11:21Z

Upgrade the Gemini model
Run only on pushes to the current branch and on merges to the main branch
Allow Claude to use tools so it can run Bash

Copilot

Pull request overview

Updates the evaluation harness and GitHub Actions workflow to run LLM-based evals with a newer Gemini configuration and to allow Claude Code to execute Bash tools during tests.

Changes:

Enable Claude Code CLI Bash tool usage in the login eval.
Update eval documentation to reference a newer Gemini model.
Adjust the eval workflow triggers/config and switch the configured Gemini model string.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `evals/test_login.py` | Allows Claude Code CLI to use the Bash tool during the eval run. |
| `evals/README.md` | Updates the documented Gemini model used by GEval. |
| `.github/workflows/run-evals.yml` | Changes workflow triggers/config and updates the Gemini model configuration; enables Vertex usage flag for Claude. |

Comments suppressed due to low confidence (2)

evals/test_login.py:14

subprocess.run(...) doesn't check the Claude CLI exit status; if the CLI fails (e.g., due to an unrecognized flag), the test will still proceed with empty/partial stdout and produce misleading evaluation results. Consider using check=True (or explicitly validating returncode) and surfacing stderr in the failure message.

  result = subprocess.run(
    ['claude', '-p', prompt, '--allowedTools', 'Bash'],
    capture_output=True,
    text=True,
    timeout=3000
  )
  return result.stdout
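
One way to act on this suggestion is sketched below. It is a minimal illustration, not the code that was merged: the function name `run_claude` and the `cmd_prefix` parameter are assumptions added here so the failure path can be exercised without the real CLI, while the flags are the ones quoted in the snippet above.

```python
import subprocess


def run_claude(prompt, timeout=3000, cmd_prefix=("claude", "-p")):
    """Run the CLI and fail loudly instead of returning empty or partial stdout.

    `cmd_prefix` is a hypothetical hook (not in the PR) for swapping the binary.
    """
    cmd = [*cmd_prefix, prompt, "--allowedTools", "Bash"]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        # Surface stderr so an unrecognized flag or auth failure is visible in the test log.
        raise RuntimeError(
            f"{cmd[0]} exited with status {result.returncode}; stderr:\n{result.stderr}"
        )
    return result.stdout
```

Passing `check=True` to `subprocess.run` would also work; it raises `subprocess.CalledProcessError`, whose `stderr` attribute would then need to be included in the failure message by the caller.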

.github/workflows/run-evals.yml:73

With Bash tools enabled for Claude during evals, model output (and generated reports/artifacts) could inadvertently include sensitive data from the runner environment. Consider isolating secrets from the Claude execution context (e.g., separate steps/jobs, least-privilege credentials, or redact/suppress tool output in artifacts) to reduce the blast radius.

    - name: Run evaluations
      working-directory: ./evals
      env:
        GOOGLE_APPLICATION_CREDENTIALS: ${{ steps.auth.outputs.credentials_file_path }}
        ANTHROPIC_VERTEX_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }}
        ANTHROPIC_VERTEX_REGION: "global"
        ANTHROPIC_DEFAULT_HAIKU_MODEL: "claude-haiku-4-5"
        ANTHROPIC_DEFAULT_SONNET_MODEL: "claude-sonnet-4-5"
        ANTHROPIC_DEFAULT_OPUS_MODEL: "claude-opus-4-5"
        CLAUDE_CODE_SUBAGENT_MODEL: "claude-sonnet-4-5"
        CLAUDE_CODE_USE_VERTEX: "1"

      run: |
        source .venv/bin/activate
        deepeval set-gemini \
          --model=gemini-3.1-pro-preview \
          --project=${{ secrets.GCP_PROJECT_ID }} \
          --location=global
        deepeval test run . \
          --junitxml=results.xml \
          --html=report.html \
          --self-contained-html
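
Two of the mitigations mentioned above (least-privilege environment and redacting output) could look roughly like this on the Python side of the harness. This is a sketch under assumptions: the allowlist contents and the helper names `minimal_env` and `redact` are hypothetical and do not appear in the PR.

```python
import os

# Hypothetical allowlist: variables the Claude CLI is assumed to need in this workflow.
ALLOWED_NAMES = {
    "PATH",
    "HOME",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "CLAUDE_CODE_USE_VERTEX",
    "CLAUDE_CODE_SUBAGENT_MODEL",
}
ALLOWED_PREFIXES = ("ANTHROPIC_",)


def minimal_env(source=None):
    """Return a copy of the environment restricted to the allowlist."""
    source = os.environ if source is None else source
    return {
        name: value
        for name, value in source.items()
        if name in ALLOWED_NAMES or name.startswith(ALLOWED_PREFIXES)
    }


def redact(text, secret_values):
    """Replace each non-empty secret value found in text with a fixed mask."""
    for value in secret_values:
        if value:
            text = text.replace(value, "***")
    return text
```

The filtered mapping would be passed as `env=minimal_env()` to `subprocess.run`, and `redact` would be applied to captured output before it is written to a report. Neither step replaces splitting secrets across separate jobs; they only narrow what the Bash tool can see and what ends up in artifacts.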

Copilot

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

ganeshdipdumbare added 4 commits March 10, 2026 14:10

fix eval in the GH action

f7368d9

add allowed tool

e5bb834

remove unnecessary vars

0e7748f

fix eval

1ff1089

ganeshdipdumbare marked this pull request as ready for review March 10, 2026 14:36

Copilot AI review requested due to automatic review settings March 10, 2026 14:36

Copilot started reviewing on behalf of ganeshdipdumbare March 10, 2026 14:36

Copilot AI reviewed Mar 10, 2026

Comment thread evals/README.md Outdated

Comment thread evals/test_login.py Outdated

ganeshdipdumbare and others added 8 commits March 10, 2026 15:40

Update evals/README.md

f9fa329

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

allow only upsun specific tools

249d4fe

allow upsun mcp tool as well

f9683a9

lets try non gemini by default

76871bf

remove model

d0c57a2

add model back

e79f50f

update uv.lock

752a244

skip permission for allowed tools

0daeceb

ganeshdipdumbare requested a review from Copilot March 10, 2026 15:39

Copilot started reviewing on behalf of ganeshdipdumbare March 10, 2026 15:40

Copilot AI reviewed Mar 10, 2026

Comment thread evals/test_login.py

add preapproved tools for claude

1aa8c06

pjcdawkins reviewed Mar 10, 2026

Comment thread .github/workflows/run-evals.yml Outdated

ganeshdipdumbare and others added 2 commits March 10, 2026 17:09

Revert "add preapproved tools for claude"

3bc8f7b

This reverts commit 1aa8c06.

Update .github/workflows/run-evals.yml

275fc8a

Co-authored-by: Patrick Dawkins <pjcdawkins@users.noreply.github.com>

ganeshdipdumbare requested a review from pjcdawkins March 10, 2026 16:14

Actually trigger eval workflow on PRs instead of every push

d6c8d63

Replace the `push` trigger with `pull_request` and restore the fork-safety guard to skip PRs from forks (which lack access to repository secrets).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
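
The workflow file itself is not shown on this page, so the exact guard is unknown. As an illustration of the check such a guard typically performs, here is the logic written as a Python predicate over a `pull_request` event payload; the function name `is_fork_pr` is hypothetical, and the payload fields used are the standard `pull_request.head.repo.full_name` and `repository.full_name`.

```python
def is_fork_pr(event):
    """Return True when the PR head branch lives outside the base repository."""
    head_repo = event["pull_request"]["head"]["repo"]
    # The head repository can be null if the fork was deleted; treat that as a fork.
    if head_repo is None:
        return True
    return head_repo["full_name"] != event["repository"]["full_name"]
```

A job guarded this way runs only when `is_fork_pr` would be false, which is the case where repository secrets are available to the workflow.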

pjcdawkins merged commit 70262a5 into main Mar 10, 2026
1 check passed

pjcdawkins deleted the fix-eval branch March 10, 2026 18:04

pjcdawkins merged 16 commits into main from fix-eval


3 participants
