fix eval in the GH action #4

Merged
pjcdawkins merged 16 commits into main from fix-eval
Mar 10, 2026

Conversation

@ganeshdipdumbare (Contributor) commented Mar 10, 2026

  • Upgrade the Gemini model
  • Run only on pushes to the current branch and on merges to the main branch
  • Allow Claude to use the Bash tool

@ganeshdipdumbare ganeshdipdumbare marked this pull request as ready for review March 10, 2026 14:36
Copilot AI review requested due to automatic review settings March 10, 2026 14:36

Copilot AI left a comment

Pull request overview

Updates the evaluation harness and GitHub Actions workflow to run LLM-based evals with a newer Gemini configuration and to allow Claude Code to execute Bash tools during tests.

Changes:

  • Enable Claude Code CLI Bash tool usage in the login eval.
  • Update eval documentation to reference a newer Gemini model.
  • Adjust the eval workflow triggers/config and switch the configured Gemini model string.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| evals/test_login.py | Allows Claude Code CLI to use the Bash tool during the eval run. |
| evals/README.md | Updates the documented Gemini model used by GEval. |
| .github/workflows/run-evals.yml | Changes workflow triggers/config and updates the Gemini model configuration; enables Vertex usage flag for Claude. |
Comments suppressed due to low confidence (2)

evals/test_login.py:14

  • subprocess.run(...) doesn't check the Claude CLI exit status; if the CLI fails (e.g., due to an unrecognized flag), the test will still proceed with empty/partial stdout and produce misleading evaluation results. Consider using check=True (or explicitly validating returncode) and surfacing stderr in the failure message.
  result = subprocess.run(
    ['claude', '-p', prompt, '--allowedTools', 'Bash'],
    capture_output=True,
    text=True,
    timeout=3000
  )
  return result.stdout
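
The exit-status check suggested in this comment could look roughly like the sketch below. It is not the code that was merged: the helper names `run_cli` and `run_claude` are hypothetical, and only the `claude` invocation and the `timeout` value are taken from the quoted snippet.

```python
import subprocess
import sys


def run_cli(command, timeout=3000):
    """Run a command and return its stdout, raising if it exits non-zero.

    Hypothetical helper: it wraps subprocess.run so that a failing CLI
    cannot silently hand empty or partial output to the evaluation.
    """
    result = subprocess.run(
        command,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        # Surface stderr so the test failure explains what went wrong.
        raise RuntimeError(
            f"command failed with exit code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout


def run_claude(prompt):
    # Same arguments as the invocation quoted in the review comment.
    return run_cli(['claude', '-p', prompt, '--allowedTools', 'Bash'])
```

Passing `check=True` to `subprocess.run` would also work; the explicit `returncode` test is used here only so that stderr ends up in the exception message.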

.github/workflows/run-evals.yml:73

  • With Bash tools enabled for Claude during evals, model output (and generated reports/artifacts) could inadvertently include sensitive data from the runner environment. Consider isolating secrets from the Claude execution context (e.g., separate steps/jobs, least-privilege credentials, or redact/suppress tool output in artifacts) to reduce the blast radius.
    - name: Run evaluations
      working-directory: ./evals
      env:
        GOOGLE_APPLICATION_CREDENTIALS: ${{ steps.auth.outputs.credentials_file_path }}
        ANTHROPIC_VERTEX_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }}
        ANTHROPIC_VERTEX_REGION: "global"
        ANTHROPIC_DEFAULT_HAIKU_MODEL: "claude-haiku-4-5"
        ANTHROPIC_DEFAULT_SONNET_MODEL: "claude-sonnet-4-5"
        ANTHROPIC_DEFAULT_OPUS_MODEL: "claude-opus-4-5"
        CLAUDE_CODE_SUBAGENT_MODEL: "claude-sonnet-4-5"
        CLAUDE_CODE_USE_VERTEX: "1"

      run: |
        source .venv/bin/activate
        deepeval set-gemini \
          --model=gemini-3.1-pro-preview \
          --project=${{ secrets.GCP_PROJECT_ID }} \
          --location=global
        deepeval test run . \
          --junitxml=results.xml \
          --html=report.html \
          --self-contained-html
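
One way to narrow what the Claude subprocess can see, in the spirit of this comment, is to pass it an explicit allowlisted environment instead of inheriting the whole runner environment. The sketch below is an assumption, not part of the PR: `ALLOWED_ENV`, `minimal_env`, and `run_isolated` are hypothetical names, and the allowlist is a guess at what the CLI needs when running against Vertex.

```python
import os
import subprocess

# Hypothetical allowlist: variables the Claude CLI is assumed to need.
# Anything else in the runner environment is withheld from the subprocess.
ALLOWED_ENV = (
    "PATH",
    "HOME",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "ANTHROPIC_VERTEX_PROJECT_ID",
    "ANTHROPIC_VERTEX_REGION",
    "CLAUDE_CODE_USE_VERTEX",
)


def minimal_env(source=None, allowed=ALLOWED_ENV):
    """Return a copy of `source` restricted to the allowlisted names."""
    source = os.environ if source is None else source
    return {name: source[name] for name in allowed if name in source}


def run_isolated(command, timeout=3000):
    """Run `command` with only the allowlisted environment variables."""
    return subprocess.run(
        command,
        capture_output=True,
        text=True,
        timeout=timeout,
        env=minimal_env(),
    )
```

This reduces exposure rather than eliminating it: the credentials file referenced by `GOOGLE_APPLICATION_CREDENTIALS` is still readable by any Bash command the model runs, so least-privilege credentials remain relevant.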

Comment thread evals/README.md Outdated
Comment thread evals/test_login.py Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

Comment thread evals/test_login.py
Comment thread .github/workflows/run-evals.yml Outdated
ganeshdipdumbare and others added 2 commits March 10, 2026 17:09
Co-authored-by: Patrick Dawkins <pjcdawkins@users.noreply.github.com>
Replace the `push` trigger with `pull_request` and restore the
fork-safety guard to skip PRs from forks (which lack access to
repository secrets).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
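
The workflow's actual guard is not shown in this conversation. As an illustration of the check a fork-safety guard typically performs, the sketch below compares the pull request's head repository with the repository the workflow runs in. The function names are hypothetical; `GITHUB_EVENT_PATH` and `GITHUB_REPOSITORY` are the default environment variables GitHub Actions provides.

```python
import json
import os


def is_same_repo_pull_request(event, repository):
    """Return True when the PR head branch lives in `repository`, not a fork."""
    pull_request = event.get("pull_request") or {}
    # head.repo can be null in the payload if the fork was deleted.
    head_repo = (pull_request.get("head") or {}).get("repo") or {}
    return head_repo.get("full_name") == repository


def should_run_evals():
    # Read the webhook payload that triggered the workflow run.
    with open(os.environ["GITHUB_EVENT_PATH"], encoding="utf-8") as fh:
        event = json.load(fh)
    return is_same_repo_pull_request(event, os.environ["GITHUB_REPOSITORY"])
```

In a workflow file the same comparison is usually written as a job-level `if:` expression, so the job is skipped before any secrets are requested.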
@pjcdawkins pjcdawkins merged commit 70262a5 into main Mar 10, 2026
1 check passed
@pjcdawkins pjcdawkins deleted the fix-eval branch March 10, 2026 18:04

3 participants