[Leaderboard] MinusX (64.77% pass@1 | 5 trials) by nuwandavek · Pull Request #50 · ucbepic/DataAgentBench

nuwandavek · 2026-05-20T22:08:25Z

MinusX Submission

Agent name: MinusX (agent code, website)
Backbone LLM: Claude Sonnet 4.6, GPT5.5-mini, Claude Haiku 4.5
Harness: Custom harness built on top of Pi
Hints: Yes
Trials: 5 per query
Stratified Pass@1: 64.77%

Approach

MinusX is an open-source agentic BI platform built around our data agent. There are a few "tricks" under the hood:

Auto Context: the agent freely explores the dataset, both mechanically and with a smaller LLM, and records notes for the downstream agent to use.
Consensus mechanism: since there is no way to "verify" an agent's answer at the end of a turn, we run 2 agents with slightly tweaked contexts. They must reach consensus on both the answer and each other's approaches before the final answer is submitted.
SQL: we also use a modified version of SQL for data exploration, extraction, and querying.
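The consensus step can be pictured roughly as follows. This is a minimal sketch, not the MinusX implementation: the `Agent` callable interface, `reach_consensus`, `normalize`, the round limit, and the toy agents are all hypothetical, and the sketch compares only final answers, whereas the description above also has the agents agree on each other's approaches.

```python
from typing import Callable, Optional

# Hypothetical agent interface: takes the question plus the peer's previous
# answer (None on the first round) and returns a short answer string.
Agent = Callable[[str, Optional[str]], str]


def normalize(answer: str) -> str:
    """Compare answers case- and whitespace-insensitively."""
    return " ".join(answer.lower().split())


def reach_consensus(question: str, agent_a: Agent, agent_b: Agent,
                    max_rounds: int = 3) -> Optional[str]:
    """Return the agreed answer, or None if the agents never converge."""
    answer_a: Optional[str] = None
    answer_b: Optional[str] = None
    for _ in range(max_rounds):
        # Each agent sees the other's answer from the previous round.
        new_a = agent_a(question, answer_b)
        new_b = agent_b(question, answer_a)
        answer_a, answer_b = new_a, new_b
        if normalize(answer_a) == normalize(answer_b):
            return answer_a
    return None


# Toy stand-ins for real agents: B adopts A's answer once it has seen it.
def toy_agent_a(question: str, peer_answer: Optional[str]) -> str:
    return "42"


def toy_agent_b(question: str, peer_answer: Optional[str]) -> str:
    return peer_answer if peer_answer is not None else "41"


result = reach_consensus("How many rows match?", toy_agent_a, toy_agent_b)
print(result)  # 42 (agreement is reached in the second round)
```

Returning `None` when the agents fail to converge leaves the fallback policy (retry, pick one agent, abstain) to the caller; which policy MinusX uses is not stated here.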

The agent used for this benchmark is a slightly tweaked version of our production agent. Differences:

In our app, auto context is built under human supervision, since it is meant to accumulate over time. For this benchmark, the supervision step was skipped.
In our app, we prioritize narrative answers, which tripped up the eval, so we added a SubmitAnswer tool that returns only a brief response.

dataset	pass@1
agnews	50.00%
bookreview	93.33%
crmarenapro	70.77%
deps_dev_v1	20.00%
github_repos	50.00%
googlelocal	95.00%
music_brainz_20k	93.33%
pancancer_atlas	60.00%
patents	0.00%
stockindex	93.33%
stockmarket	80.00%
yelp	71.43%

Overall stratified pass@1: 64.77%
Zip of all the logs, tool calls, agent traces
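As a sanity check on the headline number, the sketch below assumes that "stratified" pass@1 is the unweighted mean of the per-dataset pass@1 values, so every dataset counts equally regardless of how many queries it has. That definition is an assumption, not something stated in this submission, but it reproduces the reported figure from the table.

```python
# Per-dataset pass@1 values in percent, copied from the table above.
pass_at_1 = {
    "agnews": 50.00,
    "bookreview": 93.33,
    "crmarenapro": 70.77,
    "deps_dev_v1": 20.00,
    "github_repos": 50.00,
    "googlelocal": 95.00,
    "music_brainz_20k": 93.33,
    "pancancer_atlas": 60.00,
    "patents": 0.00,
    "stockindex": 93.33,
    "stockmarket": 80.00,
    "yelp": 71.43,
}

# Assumed definition: unweighted (macro) average over datasets.
overall = sum(pass_at_1.values()) / len(pass_at_1)
print(f"{overall:.2f}%")  # 64.77%
```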

nuwandavek · 2026-05-20T22:40:10Z

Appreciate the effort that went into this benchmark! It is the benchmark most representative of the real-world chaos we see in our customers' data.

@shreyashankar @Ruiying-Ma Are you accepting PRs that improve the robustness of the validate fn?
Also, do you have any plans to extend this to stable-state real-world scenarios?
For example:

A dataset collection, some incomplete context, and some existing questions and dashboards. Now a new question is posed.
Creating the "best" dashboard for a dataset, to answer a few questions.

We have internal benchmarks for these, and lament the lack of good public ones. Let me know if this is on your roadmap and if you are seeking collaborators. I'm also happy to start a "discussion" on this repo if that's better.

Ruiying-Ma · 2026-05-21T20:51:31Z

Hi @nuwandavek — thank you for the contribution and for catching this! We updated the ground-truth answer for query_music_brainz/query2 from Amazon Music to iTunes after verifying that the original answer was incorrect.

For now, we used the corrected answer when re-checking your submission, which changes the music_brainz_20k accuracy from 0.93 to 0.73, and the overall accuracy to 0.6310. We have added you to our leaderboard!
(We have not yet re-evaluated the other submissions with this correction, but we’ll update the full leaderboard soon — please stay tuned :))
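The revised overall figure can be checked with a little arithmetic. This sketch assumes the overall score is the unweighted mean over the 12 datasets listed in the submission, so a change in one dataset's accuracy moves the overall score by that change divided by 12; the assumption is consistent with the numbers quoted in this thread but is not stated in it.

```python
n_datasets = 12            # datasets listed in the submission table
old_overall = 0.6477       # overall accuracy as originally submitted
old_mb, new_mb = 0.93, 0.73  # music_brainz_20k before and after the fix

# Under an unweighted mean, one dataset's change is diluted by 1/n_datasets.
new_overall = old_overall - (old_mb - new_mb) / n_datasets
print(f"{new_overall:.4f}")  # 0.6310
```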

And absolutely — PRs improving the validation robustness are very welcome. We really appreciate the community helping make the benchmark more reliable.

We’re also actively exploring extensions toward more realistic data-task settings like the ones you described. Happy to chat more — please feel free to reach out via Berkeley email. Mine is: ruiyingm@berkeley.edu

MinusX Submission

ca2b154

nuwandavek marked this pull request as draft May 20, 2026 22:09

nuwandavek changed the title ~~Add MinusX leaderboard submission~~ [Leaderboard] MinusX Agent (64.77% pass@1 | 5 trials) May 20, 2026

nuwandavek marked this pull request as ready for review May 20, 2026 22:31

nuwandavek changed the title ~~[Leaderboard] MinusX Agent (64.77% pass@1 | 5 trials)~~ [Leaderboard] MinusX (64.77% pass@1 | 5 trials) May 20, 2026

Ruiying-Ma closed this May 21, 2026

nuwandavek wants to merge 1 commit into ucbepic:main from minusxai:minusx-submission
