
[Leaderboard] MinusX (64.77% pass@1 | 5 trials) #50

Closed
nuwandavek wants to merge 1 commit into ucbepic:main from minusxai:minusx-submission


nuwandavek commented May 20, 2026

MinusX Submission

Agent name: MinusX (agent code, website)
Backbone LLM: Claude Sonnet 4.6, GPT5.5-mini, Claude Haiku 4.5
Harness: Custom harness built on top of Pi
Hints: Yes
Trials: 5 per query
Stratified Pass@1: 64.77%



Approach

MinusX is an open-source agentic BI platform built around our data agent. There are a few "tricks" under the hood:

  1. Auto Context: the agent freely explores the dataset, both mechanically and with a smaller LLM, and records notes for the downstream agent to use.
  2. Consensus mechanism: since there is no way to "verify" an agent's answer at the end of a turn, we run two agents with slightly tweaked contexts. They must reach consensus on the answer, and on each other's approaches, before the final answer is submitted.
  3. Modified SQL: we also use a modified version of SQL for data exploration, extraction, and querying.
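
The consensus step described above can be sketched roughly as follows. This is a minimal illustration, not MinusX's actual implementation: the `Proposal` type, the `Agent` callable signature, `normalize`, `reach_consensus`, and the `max_rounds` limit are all hypothetical names introduced here, and the sketch only compares final answers (each agent sees the peer's approach when revising, but agreement on approaches is not checked).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Proposal:
    answer: str
    approach: str  # short description of how the answer was derived


# Hypothetical agent interface: takes the question and the peer's previous
# proposal (None in the first round) and returns its own proposal.
Agent = Callable[[str, Optional[Proposal]], Proposal]


def normalize(answer: str) -> str:
    """Case- and whitespace-insensitive form used to compare answers."""
    return " ".join(answer.lower().split())


def reach_consensus(question: str, agent_a: Agent, agent_b: Agent,
                    max_rounds: int = 3) -> Optional[str]:
    """Return the agreed answer, or None if the agents never agree."""
    prop_a = agent_a(question, None)
    prop_b = agent_b(question, None)
    for _ in range(max_rounds):
        if normalize(prop_a.answer) == normalize(prop_b.answer):
            return prop_a.answer
        # Each agent reviews the other's answer and approach, then revises.
        prop_a, prop_b = agent_a(question, prop_b), agent_b(question, prop_a)
    if normalize(prop_a.answer) == normalize(prop_b.answer):
        return prop_a.answer
    return None


# Stub agents standing in for real LLM-backed agents.
def steady(question: str, peer: Optional[Proposal]) -> Proposal:
    return Proposal("42", "aggregate with SQL")


def flexible(question: str, peer: Optional[Proposal]) -> Proposal:
    if peer is None:
        return Proposal("41", "first guess")
    return Proposal(peer.answer, "re-checked using the peer's approach")


print(reach_consensus("How many rows match?", steady, flexible))  # 42
```

Returning `None` when the agents never converge leaves the fallback policy (retry, pick one agent, abstain) to the caller; the source does not say how disagreement is resolved.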

The agent used for this benchmark is a slightly tweaked version of our production agent. Differences:

  • In our app, auto context has human supervision, since you want to build it up over time. For this benchmark, that supervision was skipped.
  • In our app we prioritize narrative answers, which were tripping up the eval. We added a SubmitAnswer tool that returns just the brief response.
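
As a rough illustration of the second point, a submit-style tool can be modeled as a tool definition with a single short string field, plus a helper that pulls the final answer out of the agent's tool calls. Only the tool name `SubmitAnswer` comes from the description above; the schema layout, the `answer` field, the shape of a tool call, and `extract_final_answer` are assumptions made for this sketch.

```python
from typing import Optional

# Hypothetical tool definition in a JSON-Schema-like style.
SUBMIT_ANSWER_TOOL = {
    "name": "SubmitAnswer",
    "description": "Submit only the brief final answer, with no narrative.",
    "parameters": {
        "type": "object",
        "properties": {"answer": {"type": "string"}},
        "required": ["answer"],
    },
}


def extract_final_answer(tool_calls: list[dict]) -> Optional[str]:
    """Return the answer from the last SubmitAnswer call, if there is one."""
    for call in reversed(tool_calls):
        if call.get("name") == SUBMIT_ANSWER_TOOL["name"]:
            return str(call["arguments"]["answer"]).strip()
    return None


# Assumed trace shape: a list of {"name": ..., "arguments": {...}} dicts.
trace = [
    {"name": "RunSQL", "arguments": {"query": "SELECT 1"}},
    {"name": "SubmitAnswer", "arguments": {"answer": " iTunes "}},
]
print(extract_final_answer(trace))  # iTunes
```

Keeping the narrative out of the submitted value means an evaluator can compare a short string instead of parsing prose.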

dataset            pass@1
agnews             50.00%
bookreview         93.33%
crmarenapro        70.77%
deps_dev_v1        20.00%
github_repos       50.00%
googlelocal        95.00%
music_brainz_20k   93.33%
pancancer_atlas    60.00%
patents             0.00%
stockindex         93.33%
stockmarket        80.00%
yelp               71.43%

Overall stratified pass@1: 64.77%
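
The overall figure is consistent with an unweighted mean of the twelve per-dataset scores listed above. That reading is inferred from the numbers, not from a stated definition of "stratified pass@1", so treat the check below as a sanity sketch under that assumption.

```python
# Per-dataset pass@1 values (percent), copied from the table above.
per_dataset = {
    "agnews": 50.00,
    "bookreview": 93.33,
    "crmarenapro": 70.77,
    "deps_dev_v1": 20.00,
    "github_repos": 50.00,
    "googlelocal": 95.00,
    "music_brainz_20k": 93.33,
    "pancancer_atlas": 60.00,
    "patents": 0.00,
    "stockindex": 93.33,
    "stockmarket": 80.00,
    "yelp": 71.43,
}

# Assumption: every dataset counts equally, regardless of its query count.
stratified = sum(per_dataset.values()) / len(per_dataset)
print(f"{stratified:.2f}%")  # 64.77%
```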
Zip of all the logs, tool calls, agent traces

nuwandavek marked this pull request as draft May 20, 2026 22:09
nuwandavek changed the title from "Add MinusX leaderboard submission" to "[Leaderboard] MinusX Agent (64.77% pass@1 | 5 trials)" May 20, 2026
nuwandavek marked this pull request as ready for review May 20, 2026 22:31
nuwandavek changed the title from "[Leaderboard] MinusX Agent (64.77% pass@1 | 5 trials)" to "[Leaderboard] MinusX (64.77% pass@1 | 5 trials)" May 20, 2026
Author

nuwandavek commented May 20, 2026

Appreciate the effort that went into this benchmark! It is the most representative one we have seen of the real-world chaos in our customers' data.

@shreyashankar @Ruiying-Ma Are you accepting PRs to improve the robustness of the validate function?
Also, do you have any plans to extend this to stable-state, real-world scenarios?
For example:

  1. A dataset collection with some incomplete context and some existing questions and dashboards, where a new question is then posed.
  2. Creating the "best" dashboard for a dataset to answer a few questions.

We have internal benchmarks for these, and lament the lack of good public ones. Let me know if this is on your roadmap and if you are seeking collaborators. I'm also happy to start a "discussion" on this repo if that's better.

Collaborator

Ruiying-Ma commented May 21, 2026

Hi @nuwandavek — thank you for the contribution and for catching this! We updated the ground-truth answer for query_music_brainz/query2 from Amazon Music to iTunes after verifying that the original answer was incorrect.

For now, we used the corrected answer when re-checking your submission, which changes the music_brainz_20k accuracy from 0.93 to 0.73, and the overall accuracy to 0.6310. We have added you to our leaderboard!
(We have not yet re-evaluated the other submissions with this correction, but we’ll update the full leaderboard soon — please stay tuned :))
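
The corrected overall figure can be reproduced from the numbers in this thread, assuming the overall score is an unweighted mean over the 12 datasets and that 0.93 and 0.73 are rounded forms of 0.9333 and 0.7333. Both assumptions are inferred from the reported values, not stated by the maintainers.

```python
NUM_DATASETS = 12        # datasets in the submission's per-dataset table
old_overall = 0.6477     # overall pass@1 reported in the submission
old_music_brainz = 0.9333  # assumed unrounded value behind "0.93"
new_music_brainz = 0.7333  # assumed unrounded value behind "0.73"

# With an unweighted mean, one dataset's change moves the overall score
# by that change divided by the number of datasets.
new_overall = old_overall - (old_music_brainz - new_music_brainz) / NUM_DATASETS
print(f"{new_overall:.4f}")  # 0.6310
```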

And absolutely — PRs improving the validation robustness are very welcome. We really appreciate the community helping make the benchmark more reliable.

We’re also actively exploring extensions toward more realistic data-task settings like the ones you described. Happy to chat more — please feel free to reach out via Berkeley email. Mine is: ruiyingm@berkeley.edu

Ruiying-Ma closed this May 21, 2026