feat(benchmark): add benchmark job run, status #142

Merged
ross-rl merged 6 commits into main from ross/nnn
Mar 5, 2026

@ross-rl ross-rl commented Mar 5, 2026

Description

rli bmj run - Run a benchmark job with an agent

  • --agent - Agent to use (claude-code, codex, opencode, goose, gemini-cli)
  • --model - Model name for the agent
  • --benchmark - Benchmark ID or name (searches both user and public benchmarks)
  • --scenarios <ids...> - Alternative: list of scenario IDs
  • -n, --job-name - Job name
  • --env-vars, --secrets, --timeout, orchestrator options
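The `--benchmark` lookup across user and public benchmarks could look roughly like the sketch below. It is not the PR's implementation: `Benchmark`, `BenchmarkSource`, and `resolveBenchmark` are hypothetical names standing in for the `list` and `listPublic` endpoints, and the precedence (sources searched in order, an exact ID match preferred over a name match) is an assumption.

```typescript
// Hypothetical shape; the real API client types are not shown in this PR.
interface Benchmark {
  id: string;
  name: string;
}

// A source returns every benchmark it knows about (for example user-owned or public).
type BenchmarkSource = () => Promise<Benchmark[]>;

// Resolve an ID or name by searching each source in the order given.
// Assumption: an exact ID match in any source beats a name match,
// and among name matches the earliest source wins.
async function resolveBenchmark(
  idOrName: string,
  sources: BenchmarkSource[],
): Promise<Benchmark | undefined> {
  let byName: Benchmark | undefined;
  for (const source of sources) {
    const benchmarks = await source();
    const byId = benchmarks.find((b) => b.id === idOrName);
    if (byId) return byId;
    if (!byName) {
      byName = benchmarks.find((b) => b.name === idOrName);
    }
  }
  return byName;
}
```

A caller would pass the user source first and the public source second, so a user benchmark shadows a public one with the same name.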

rli bmj status - Get benchmark job status and results

  • -w, --wait - Wait for job completion (polls every 10s, up to 1 hour)
  • Displays results table with pass/fail percentages per agent/model
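The `--wait` behavior (poll every 10 seconds, give up after 1 hour) can be sketched as a simple polling loop. This is an illustration under assumptions, not the code in this PR: `JobStatus`, its state names, `fetchStatus`, and `waitForJob` are hypothetical, and only the two constants come from the diff under review.

```typescript
// Polling config, matching the values quoted in the review.
const POLL_INTERVAL_MS = 10 * 1000; // 10 seconds
const MAX_WAIT_MS = 60 * 60 * 1000; // 1 hour

// Hypothetical job shape and state names; the real API may differ.
interface JobStatus {
  state: "queued" | "running" | "completed" | "failed";
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Poll fetchStatus until the job reaches a terminal state or the deadline passes.
async function waitForJob(
  fetchStatus: () => Promise<JobStatus>,
  intervalMs: number = POLL_INTERVAL_MS,
  maxWaitMs: number = MAX_WAIT_MS,
): Promise<JobStatus> {
  const deadline = Date.now() + maxWaitMs;
  while (true) {
    const status = await fetchStatus();
    if (status.state === "completed" || status.state === "failed") {
      return status;
    }
    // Stop if the next poll would land after the deadline.
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`Job did not finish within ${maxWaitMs} ms`);
    }
    await sleep(intervalMs);
  }
}
```

Taking the interval and the maximum wait as parameters keeps the limit easy to raise for long-running jobs without touching the loop.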

Features

  1. Auto-upsert secrets: Automatically creates BMJ_* secrets from environment variables
    - E.g., ANTHROPIC_API_KEY → BMJ_ANTHROPIC_API_KEY
    - Skips creation if secret already exists
    - Logs all secret operations

  2. Agent configurations with automatic env var handling:
    | Agent | Env Vars | Required |
    |-------------|---------------------------------------------------|-----------|
    | claude-code | ANTHROPIC_API_KEY, CLAUDE_CODE_OAUTH_TOKEN | Yes (any) |
    | codex | OPENAI_API_KEY | Yes |
    | opencode | ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY | No |
    | goose | ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY | No |
    | gemini-cli | GEMINI_API_KEY, GOOGLE_API_KEY | Yes (any) |

  3. Benchmark resolution: Searches both list and listPublic endpoints when resolving benchmark names

  4. Default orchestrator config: n_concurrent_trials=10, n_attempts=1, timeout_multiplier=1.0, quiet=false

  5. Default agent timeout: 1800 seconds (30 minutes)
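Features 1 and 2 above can be illustrated together. The sketch below copies the agent table from this description and plans which `BMJ_*` secrets to create; it assumes "Yes (any)" means at least one listed variable must be set. `AGENT_ENV`, `secretName`, and `planSecrets` are hypothetical names, and the real command talks to a secrets API rather than a `Set`.

```typescript
// "any": at least one listed variable must be set; "none": all are optional.
type Requirement = "any" | "none";

interface AgentEnvConfig {
  envVars: string[];
  required: Requirement;
}

// Table copied from the PR description.
const AGENT_ENV: Record<string, AgentEnvConfig> = {
  "claude-code": { envVars: ["ANTHROPIC_API_KEY", "CLAUDE_CODE_OAUTH_TOKEN"], required: "any" },
  codex: { envVars: ["OPENAI_API_KEY"], required: "any" },
  opencode: { envVars: ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GOOGLE_API_KEY"], required: "none" },
  goose: { envVars: ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GOOGLE_API_KEY"], required: "none" },
  "gemini-cli": { envVars: ["GEMINI_API_KEY", "GOOGLE_API_KEY"], required: "any" },
};

// E.g. ANTHROPIC_API_KEY -> BMJ_ANTHROPIC_API_KEY
const secretName = (envVar: string): string => `BMJ_${envVar}`;

// Decide which secrets to create and which to skip because they already exist.
function planSecrets(
  agent: string,
  env: Record<string, string | undefined>,
  existingSecrets: Set<string>,
): { toCreate: string[]; skipped: string[] } {
  const config = AGENT_ENV[agent];
  if (!config) {
    throw new Error(`Unknown agent: ${agent}`);
  }
  const present = config.envVars.filter((name) => Boolean(env[name]));
  if (config.required === "any" && present.length === 0) {
    throw new Error(`${agent} requires one of: ${config.envVars.join(", ")}`);
  }
  const toCreate: string[] = [];
  const skipped: string[] = [];
  for (const name of present) {
    const secret = secretName(name);
    if (existingSecrets.has(secret)) {
      skipped.push(secret);
    } else {
      toCreate.push(secret);
    }
  }
  return { toCreate, skipped };
}
```

Returning a plan instead of creating secrets directly makes it straightforward to log every operation before anything is written.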

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring
  • Performance improvement
  • Test updates

Related Issues

Closes #

Changes Made

Testing

  • I have tested locally
  • I have added/updated tests
  • All existing tests pass

Checklist

  • My code follows the code style of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

@ross-rl ross-rl requested a review from dines-rl March 5, 2026 20:46
@ross-rl ross-rl requested a review from james-rl March 5, 2026 20:46
@ross-rl ross-rl changed the title feat(benchmarks): Add benchmark job run, status feat(benchmark): Add benchmark job run, status Mar 5, 2026
@ross-rl ross-rl changed the title feat(benchmark): Add benchmark job run, status feat(benchmark): add benchmark job run, status Mar 5, 2026

@james-rl james-rl left a comment

Looks good. Some questions for you


// Polling config
const POLL_INTERVAL_MS = 10 * 1000; // 10 seconds
const MAX_WAIT_MS = 60 * 60 * 1000; // 1 hour
Contributor

Is this long enough? It looks like this is the time allowed for the entire job to complete.

Contributor Author

Will raise.

.option(
  "--scenarios <ids...>",
  "Scenario IDs to run (alternative to --benchmark)",
)
Contributor

Consider adding short flags -b and -s for benchmark and scenario.

benchmarkJob
  .command("status <id>")
  .description("Get benchmark job status and results")
  .option("-w, --wait", "Wait for job to complete before showing results")
Contributor

I saw -w and assumed it meant watch. I think this letter is confusing.

Contributor Author

Agreed, updating.

@ross-rl ross-rl merged commit 80e26c1 into main Mar 5, 2026
14 checks passed
@ross-rl ross-rl deleted the ross/nnn branch March 5, 2026 22:06
ross-rl pushed a commit that referenced this pull request Mar 5, 2026
🤖 I have created a release *beep* *boop*
---


## [1.12.0](v1.11.2...v1.12.0) (2026-03-05)


### Features

* **benchmark:** add benchmark job run, status ([#142](#142)) ([80e26c1](80e26c1))
* **blueprint:** support blueprint create metadata ([#141](#141)) ([4579d91](4579d91))
* **cli:** add llms.txt ([#139](#139)) ([db21f81](db21f81))


### Bug Fixes

* using the new format for mcp-configs ([#132](#132)) ([9deeb1c](9deeb1c))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>