[Leaderboard] Pi Coding Agent — Claude Opus 4.6 — 62.6% Pass@1 #31

Closed
satish860 wants to merge 1 commit into ucbepic:main from satish860:add-pi-coding-agent-submission

@satish860

Pi Coding Agent — Leaderboard Submission

Agent name: Pi Coding Agent
Backbone LLM: Claude Opus 4.6 (via OpenRouter)
Hints: Yes (db_description_withhint.txt)
Trials: 5 per query
Pass@1: 62.6%

Architecture

A TypeScript coding agent built on the Pi SDK that:

  1. Pre-indexes each dataset via automated SQL/MongoDB introspection — schemas, sample rows, value ranges, and join key analysis. No manual input, no data leakage.
  2. For each query, reads the index, then writes and executes Node.js scripts that connect directly to PostgreSQL, SQLite, DuckDB, and MongoDB.
  3. Iterates on errors — if the script crashes, the agent patches and reruns until it produces a final answer.
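
To make step 1 concrete, here is a minimal sketch of automated introspection. It is written in Python against an in-memory SQLite database purely for illustration; the submission itself is TypeScript/Node.js and its indexing code is not shown in this PR. The function name `build_index`, the index layout, and the `books` table are all hypothetical.

```python
import sqlite3


def build_index(conn, sample_rows=3):
    """Collect schema, value ranges, and sample rows for every user table.

    Hypothetical helper: the layout of the returned dict is an assumption,
    not the format used by the Pi Coding Agent.
    """
    index = {}
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    for table in tables:
        columns = {}
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _cid, name, col_type, *_rest in conn.execute(
            f'PRAGMA table_info("{table}")'
        ).fetchall():
            lo, hi = conn.execute(
                f'SELECT MIN("{name}"), MAX("{name}") FROM "{table}"'
            ).fetchone()
            columns[name] = {"type": col_type, "min": lo, "max": hi}
        samples = conn.execute(
            f'SELECT * FROM "{table}" LIMIT ?', (sample_rows,)
        ).fetchall()
        index[table] = {"columns": columns, "samples": samples}
    return index


# Toy database standing in for a benchmark dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)", [(1, "A", 4.5), (2, "B", 3.0)]
)
index = build_index(conn)
```

Everything in the index is derived from the database itself, which is the sense in which such a step needs no manual input. Join-key analysis, for example matching column names and value overlap across tables, would be a further pass over the same structure.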

Key differences from the built-in agent

| | Built-in Agent | Pi Coding Agent |
|---|---|---|
| Language | Python | TypeScript / Node.js |
| Tool interface | 4 constrained tools | Full read/write/bash |
| DB context | Runtime exploration | Pre-built index (automated) |
| Script style | Iterative queries | One complete script per query |
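
The "one complete script per query" style, combined with the error iteration described under Architecture, can be sketched as a run-and-patch loop. This is an illustrative Python version, not the submission's Node.js code. The `patch` callback is a hypothetical stand-in for the LLM rewriting the script, and the `max_attempts` cap is an assumption (the PR only says the agent reruns until it produces a final answer).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_until_success(source, patch, max_attempts=3):
    """Run a script; on a crash, hand the stderr to `patch` and retry.

    `patch(source, stderr) -> new_source` is hypothetical: in an agent it
    would be a model call that rewrites the script.
    """
    for attempt in range(1, max_attempts + 1):
        with tempfile.TemporaryDirectory() as tmp:
            script = Path(tmp) / "query.py"
            script.write_text(source)
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=60,
            )
        if proc.returncode == 0:
            # The script's stdout is treated as the final answer.
            return proc.stdout.strip(), attempt
        source = patch(source, proc.stderr)
    raise RuntimeError(f"no successful run after {max_attempts} attempts")


def toy_patch(source, stderr):
    # Stand-in for the model: fixes one known failure, otherwise no change.
    return "print(42)" if "ZeroDivisionError" in stderr else source


answer, attempts = run_until_success("print(1/0)", toy_patch)
```

In this toy run the first script crashes with a division by zero, the patch replaces it, and the second attempt returns the answer.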

Results summary

| Dataset | Pass@1 |
|---|---|
| agnews | 0.60 |
| bookreview | 0.87 |
| crmarenapro | 0.83 |
| googlelocal | 0.70 |
| music_brainz_20k | 0.67 |
| stockindex | 0.67 |
| stockmarket | 0.60 |
| yelp | 0.74 |
| GITHUB_REPOS | 0.40 |
| PANCANCER_ATLAS | 0.67 |
| DEPS_DEV_V1 | 0.00 |
| PATENTS | 0.00 |
| Overall | 62.6% |

@satish860
Author

@shreyashankar, can you help with the PR?

@shreyashankar
Collaborator

shreyashankar commented Apr 22, 2026

Hello! Sorry, I just got back to the US from traveling and am now looking at the PRs. We will add your name to the leaderboard shortly!

shreyashankar added a commit that referenced this pull request Apr 22, 2026
@shreyashankar
Collaborator

Hi @satish860 — verified with common_scaffold/validate/validate.py:

  • Pass@1 = 0.5603 (stratified): the mean over the 12 datasets of each dataset's mean per-query pass rate c/n, where c is the number of passing runs out of n trials for a query. This is the leaderboard metric.
  • Pass@1 = 0.626 (micro): total passes / total runs across all 270 trials, so every (query, run) pair carries equal weight. This matches the 62.6% in your PR title.
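
The gap between the two numbers comes from weighting: datasets with more queries count for more in the micro average, while the stratified average gives each dataset equal weight. A minimal sketch of both definitions follows. It is not the code in common_scaffold/validate/validate.py; the function names and the toy data are hypothetical. The last lines average the rounded per-dataset values from the results table in this PR as a rough sanity check.

```python
from statistics import mean


def micro_pass_at_1(results):
    """Total passes / total runs, pooled over every dataset and query."""
    passes = sum(c for queries in results.values() for c, _n in queries)
    runs = sum(n for queries in results.values() for _c, n in queries)
    return passes / runs


def stratified_pass_at_1(results):
    """Mean over datasets of the mean over queries of c/n."""
    return mean(mean(c / n for c, n in queries) for queries in results.values())


# Hypothetical toy data: dataset -> list of (passes c, trials n) per query.
toy = {
    "easy": [(5, 5), (5, 5), (4, 5), (5, 5)],  # four queries, mostly passing
    "hard": [(0, 5)],                          # one query, never passing
}
micro = micro_pass_at_1(toy)       # 19 passes / 25 runs
strat = stratified_pass_at_1(toy)  # (0.95 + 0.0) / 2

# Rounded per-dataset Pass@1 values as reported in the PR's results table.
reported = [0.60, 0.87, 0.83, 0.70, 0.67, 0.67,
            0.60, 0.74, 0.40, 0.67, 0.00, 0.00]
approx_stratified = mean(reported)
```

On the toy data the micro score is 0.76 and the stratified score is 0.475, because the single failing dataset pulls the stratified mean down by half. The unweighted mean of the twelve reported values is about 0.5625, close to the verified 0.5603; the small difference is plausibly due to the table values being rounded to two decimals.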

0.5603 puts you at #1 on the leaderboard, which now links back to this PR. Thanks for the submission! Closing.

shreyashankar mentioned this pull request Apr 22, 2026
1 task
NuryeNigusMekonen pushed a commit to NuryeNigusMekonen/DataAgentBench that referenced this pull request Apr 22, 2026
…, Cohere (ucbepic#38) to leaderboard

Verified Pass@1 numbers were re-computed from the raw submission JSONs
using common_scaffold/validate/validate.py:

  PR ucbepic#31  Pi Coding Agent + Claude Opus 4.6      → 0.5603 (ucbepic#1)
  PR ucbepic#32  Oracle Forge (Tenacious) + Sonnet 4.6  → 0.4554 (ucbepic#4)
  PR ucbepic#38  Oracle Forge (Cohere) + Gemini 2.0 F.  → 0.128  (ucbepic#10)

Adds a Submission column on both the README table and the website
leaderboard linking each submission to its PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NuryeNigusMekonen pushed a commit to NuryeNigusMekonen/DataAgentBench that referenced this pull request Apr 22, 2026
…mission PRs

Pi was linking to mariozechner/pi-coding-agent (the SDK author), not the
team that made the submission. Cohere was linking to their source repo.
Both now link to the PR they opened on this repo, matching the pattern
already used for Tenacious (ucbepic#32).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>