Skip to content

Leaderboard: Claude Opus 4.6 (common scaffold) — 0.44 Pass@1#22

Merged
Ruiying-Ma merged 1 commit intoucbepic:mainfrom
anushruthasura:results/claude-opus-4-6
Mar 19, 2026
Merged

Leaderboard: Claude Opus 4.6 (common scaffold) — 0.44 Pass@1#22
Ruiying-Ma merged 1 commit intoucbepic:mainfrom
anushruthasura:results/claude-opus-4-6

Conversation

@anushruthasura
Copy link
Copy Markdown
Collaborator

@anushruthasura anushruthasura commented Mar 18, 2026

Summary

  • Benchmark results for Claude Opus 4.6 using the DAB common scaffold agent
  • Stratified Pass@1: 0.4376 across 54 queries × 5 trials (270 total runs)
  • Model accessed via Anthropic API using OpenAI SDK compatibility layer

Per-dataset accuracy (stratified)

Dataset Pass@1
bookreview 1.000
crmarenapro 0.785
googlelocal 0.750
PANCANCER_ATLAS 0.467
stockmarket 0.400
yelp 0.400
DEPS_DEV_V1 0.400
GITHUB_REPOS 0.350
stockindex 0.333
agnews 0.300
music_brainz_20k 0.067
PATENTS 0.000

Agent configuration

  • Agent: DAB common scaffold (DataAgent.py)
  • LLM: claude-opus-4-6 via Anthropic API
  • Hints: enabled (--use_hints)
  • Max iterations: 100
  • Trials: 5 per query

Notes

  • 2 runs (DEPS_DEV_V1/query1/run_2 and run_3) could not complete due to a consistent CPU hang in the Docker code executor — empty answers were submitted for these
  • All other 268 runs completed normally
  • Pass@1 computed using the stratified formula: (1/D) × Σ [(1/Qⱼ) × Σ (cᵢⱼ / n)]

Pass@1: 0.5037 across 54 queries × 5 trials = 270 runs

Per-dataset accuracy:
- bookreview: 1.000
- crmarenapro: 0.785
- googlelocal: 0.750
- DEPS_DEV_V1: 0.500
- PANCANCER_ATLAS: 0.467
- stockmarket: 0.400
- yelp: 0.400
- GITHUB_REPOS: 0.350
- stockindex: 0.333
- agnews: 0.300
- music_brainz_20k: 0.067
- PATENTS: 0.000

Notes:
- 2 runs (DEPS_DEV_V1/query1/run_2-3) could not complete due to
  a consistent CPU hang in the Docker executor; empty answers submitted
- Model: claude-opus-4-6 via Anthropic API (OpenAI SDK compatibility)
- Hints: enabled (--use_hints)
- Max iterations: 100

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@anushruthasura anushruthasura changed the title Leaderboard: Claude Opus 4.6 (common scaffold) — 0.50 Pass@1 Leaderboard: Claude Opus 4.6 (common scaffold) — 0.44 Pass@1 Mar 19, 2026
@Ruiying-Ma Ruiying-Ma merged commit 2ad1991 into ucbepic:main Mar 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants