feat(benchmark): add benchmark job run, status by ross-rl · Pull Request #142 · runloopai/rl-cli

ross-rl · 2026-03-05T20:46:27Z

Description

rli bmj run - Run a benchmark job with an agent

--agent - Agent to use (claude-code, codex, opencode, goose, gemini-cli)
--model - Model name for the agent
--benchmark - Benchmark ID or name (searches both user and public benchmarks)
--scenarios <ids...> - Alternative: list of scenario IDs
-n, --job-name - Job name
--env-vars, --secrets, --timeout, orchestrator options

rli bmj status - Get benchmark job status and results

-w, --wait - Wait for job completion (polls every 10s, up to 1 hour)
Displays results table with pass/fail percentages per agent/model
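
The wait behaviour described above (poll every 10 seconds, give up after 1 hour) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `waitForJob`, `JobState`, `PollOptions`, and the state names are hypothetical, and the clock and sleep functions are injectable only so the loop can be exercised without real delays.

```typescript
// Hypothetical job states; the real API may report different values.
type JobState = "queued" | "running" | "completed" | "failed";

interface PollOptions {
  intervalMs: number; // e.g. 10 * 1000
  maxWaitMs: number; // e.g. 60 * 60 * 1000
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  now?: () => number; // injectable for tests
}

// Poll fetchState until the job reaches a terminal state or the deadline passes.
async function waitForJob(
  fetchState: () => Promise<JobState>,
  opts: PollOptions,
): Promise<JobState> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));
  const now = opts.now ?? Date.now;
  const deadline = now() + opts.maxWaitMs;

  for (;;) {
    const state = await fetchState();
    if (state === "completed" || state === "failed") return state;
    // Stop if the next poll would land after the deadline.
    if (now() + opts.intervalMs > deadline) {
      throw new Error(`Timed out after ${opts.maxWaitMs} ms waiting for job`);
    }
    await sleep(opts.intervalMs);
  }
}
```

With the values from the description, the loop would poll at most about 360 times before timing out.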

Features

Auto-upsert secrets: Automatically creates BMJ_* secrets from environment variables
- E.g., ANTHROPIC_API_KEY → BMJ_ANTHROPIC_API_KEY
- Skips creation if secret already exists
- Logs all secret operations
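
As an illustration of the naming and skip rules above, the following sketch plans which secrets would be created. It assumes the set of existing secret names has already been fetched; `toSecretName`, `planSecretUpserts`, and `UpsertPlan` are hypothetical names, not identifiers from this PR, and no API call is made.

```typescript
const SECRET_PREFIX = "BMJ_";

// ANTHROPIC_API_KEY -> BMJ_ANTHROPIC_API_KEY
function toSecretName(envVar: string): string {
  return `${SECRET_PREFIX}${envVar}`;
}

interface UpsertPlan {
  create: string[]; // secrets that do not exist yet
  skip: string[]; // secrets that already exist
}

// Decide, for each env var that is set locally, whether its secret
// needs to be created or can be skipped.
function planSecretUpserts(
  envVars: string[],
  env: Record<string, string | undefined>,
  existing: ReadonlySet<string>,
): UpsertPlan {
  const plan: UpsertPlan = { create: [], skip: [] };
  for (const name of envVars) {
    if (!env[name]) continue; // unset locally: nothing to upload
    const secret = toSecretName(name);
    if (existing.has(secret)) {
      plan.skip.push(secret);
    } else {
      plan.create.push(secret);
    }
  }
  return plan;
}
```

A caller could log each entry of `create` and `skip` to cover the "logs all secret operations" behaviour.
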
Agent configurations with automatic env var handling:
| Agent | Env Vars | Required |
|-------------|---------------------------------------------------|-----------|
| claude-code | ANTHROPIC_API_KEY, CLAUDE_CODE_OAUTH_TOKEN | Yes (any) |
| codex | OPENAI_API_KEY | Yes |
| opencode | ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY | No |
| goose | ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY | No |
| gemini-cli | GEMINI_API_KEY, GOOGLE_API_KEY | Yes (any) |
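
The table above could be encoded and checked along these lines. This is a sketch under assumptions: `AGENTS`, `AgentConfig`, and `checkAgentEnv` are hypothetical names, a plain "Yes" is read as "every listed variable is required", and the error wording is invented for illustration.

```typescript
type Requirement = "all" | "any" | "none";

interface AgentConfig {
  envVars: string[];
  required: Requirement;
}

// Mirrors the table: "Yes" -> "all", "Yes (any)" -> "any", "No" -> "none".
const AGENTS: Record<string, AgentConfig> = {
  "claude-code": {
    envVars: ["ANTHROPIC_API_KEY", "CLAUDE_CODE_OAUTH_TOKEN"],
    required: "any",
  },
  codex: { envVars: ["OPENAI_API_KEY"], required: "all" },
  opencode: {
    envVars: ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GOOGLE_API_KEY"],
    required: "none",
  },
  goose: {
    envVars: ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GOOGLE_API_KEY"],
    required: "none",
  },
  "gemini-cli": {
    envVars: ["GEMINI_API_KEY", "GOOGLE_API_KEY"],
    required: "any",
  },
};

// Returns an error message, or null when the environment satisfies the agent.
function checkAgentEnv(
  agent: string,
  env: Record<string, string | undefined>,
): string | null {
  const cfg = AGENTS[agent];
  if (cfg === undefined) return `Unknown agent: ${agent}`;
  const present = cfg.envVars.filter((name) => Boolean(env[name]));
  if (cfg.required === "all" && present.length < cfg.envVars.length) {
    const missing = cfg.envVars.filter((name) => !env[name]);
    return `${agent} requires ${missing.join(", ")}`;
  }
  if (cfg.required === "any" && present.length === 0) {
    return `${agent} requires one of: ${cfg.envVars.join(", ")}`;
  }
  return null;
}
```
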
Benchmark resolution: Searches both list and listPublic endpoints when resolving benchmark names
Default orchestrator config: n_concurrent_trials=10, n_attempts=1, timeout_multiplier=1.0, quiet=false
Default agent timeout: 1800 seconds (30 minutes)
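
To illustrate the benchmark resolution feature listed above, here is a minimal sketch that resolves a `--benchmark` value against both result sets. It assumes the user and public lists have already been fetched; `Benchmark`, `resolveBenchmark`, the example IDs, and the precedence (ID match before name match, user benchmarks before public ones) are assumptions rather than confirmed behaviour of this PR.

```typescript
interface Benchmark {
  id: string;
  name: string;
}

// Resolve a query that may be either a benchmark ID or a benchmark name.
function resolveBenchmark(
  query: string,
  userBenchmarks: Benchmark[],
  publicBenchmarks: Benchmark[],
): Benchmark | undefined {
  // User benchmarks come first so they win when names collide (assumed).
  const all = [...userBenchmarks, ...publicBenchmarks];
  return all.find((b) => b.id === query) ?? all.find((b) => b.name === query);
}
```

Returning `undefined` leaves it to the caller to report that no benchmark matched.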

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Code refactoring
Performance improvement
Test updates

Related Issues

Closes #

Changes Made

Testing

I have tested locally
I have added/updated tests
All existing tests pass

Checklist

My code follows the code style of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have updated the documentation accordingly
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

james-rl

Looks good. Some questions for you

src/commands/benchmark-job/run.ts

james-rl · 2026-03-05T21:22:49Z

src/commands/benchmark-job/status.ts

+
+// Polling config
+const POLL_INTERVAL_MS = 10 * 1000; // 10 seconds
+const MAX_WAIT_MS = 60 * 60 * 1000; // 1 hour


Is this long enough? It looks like this is the maximum time allowed for the entire job to complete.

james-rl · 2026-03-05T21:25:38Z

src/utils/commands.ts

+    .option(
+      "--scenarios <ids...>",
+      "Scenario IDs to run (alternative to --benchmark)",
+    )


Consider adding short flags -b and -s for --benchmark and --scenarios.

james-rl · 2026-03-05T21:25:23Z

src/utils/commands.ts

+  benchmarkJob
+    .command("status <id>")
+    .description("Get benchmark job status and results")
+    .option("-w, --wait", "Wait for job to complete before showing results")


I saw -w and assumed it meant watch -- I think that this letter is confusing.

Agreed, updating.

🤖 I have created a release *beep* *boop*

---

## [1.12.0](v1.11.2...v1.12.0) (2026-03-05)

### Features

* **benchmark:** add benchmark job run, status ([#142](#142)) ([80e26c1](80e26c1))
* **blueprint:** support blueprint create metadata ([#141](#141)) ([4579d91](4579d91))
* **cli:** add llms.txt ([#139](#139)) ([db21f81](db21f81))

### Bug Fixes

* using the new format for mcp-configs ([#132](#132)) ([9deeb1c](9deeb1c))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

ross-rl added 5 commits March 5, 2026 11:18

add benchmark job cli

3868d96

cp

e0d2419

cp

7bcb047

cp

4090043

cp

523b89c

ross-rl requested a review from dines-rl March 5, 2026 20:46

ross-rl assigned dines-rl and james-rl Mar 5, 2026

ross-rl requested a review from james-rl March 5, 2026 20:46

ross-rl changed the title ~~feat(benchmarks): Add benchmark job run, status~~ feat(benchmark): Add benchmark job run, status Mar 5, 2026

ross-rl changed the title ~~feat(benchmark): Add benchmark job run, status~~ feat(benchmark): add benchmark job run, status Mar 5, 2026

james-rl approved these changes Mar 5, 2026

pr feedback

3e364c2

ross-rl merged commit 80e26c1 into main Mar 5, 2026
14 checks passed

ross-rl deleted the ross/nnn branch March 5, 2026 22:06

github-actions bot mentioned this pull request Mar 5, 2026

chore(main): release 1.12.0 #137

Merged
