[Leaderboard] MinusX (64.77% pass@1 | 5 trials) #50
Appreciate the effort for this benchmark! This is the most representative of the real-world chaos we see in our customers' data. @shreyashankar @Ruiying-Ma Are you accepting PRs for improving the validate fn robustness?
We have internal benchmarks for these, and lament the lack of good public ones. Let me know if this is on your roadmap and if you are seeking collaborators. I'm also happy to start a "discussion" on this repo if that's better.
Hi @nuwandavek, thank you for the contribution and for catching this! We updated the ground-truth answer for For now, we used the corrected answer when re-checking your submission, which changes the

And absolutely, PRs improving the validation robustness are very welcome. We really appreciate the community helping make the benchmark more reliable. We're also actively exploring extensions toward more realistic data-task settings like the ones you described.

Happy to chat more. Please feel free to reach out via Berkeley email. Mine is: ruiyingm@berkeley.edu
MinusX Submission
Agent name: MinusX (agent code, website)
Backbone LLM: Claude Sonnet 4.6, GPT5.5-mini, Claude Haiku 4.5
Harness: Custom harness built on top of Pi
Hints: Yes
Trials: 5 per query
Stratified Pass@1: 64.77%
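For readers unfamiliar with the metric, here is a minimal sketch of how a stratified pass@1 over repeated trials could be computed. It is not the benchmark's scoring code: the function name `stratified_pass_at_1`, the `(stratum, query_id, passed)` record shape, and the equal weighting of strata are all assumptions made for illustration, and the benchmark's actual strata and weights may differ.

```python
from collections import defaultdict
from statistics import mean


def stratified_pass_at_1(results):
    """Hypothetical helper: results is an iterable of
    (stratum, query_id, passed) tuples, one tuple per trial."""
    # Group trial outcomes by stratum, then by query.
    by_stratum = defaultdict(lambda: defaultdict(list))
    for stratum, query_id, passed in results:
        by_stratum[stratum][query_id].append(1.0 if passed else 0.0)

    # pass@1 for a query is the fraction of its trials that passed.
    # Each stratum averages its queries; strata are then weighted
    # equally (an assumption, not taken from the benchmark).
    stratum_scores = [
        mean(mean(trials) for trials in queries.values())
        for queries in by_stratum.values()
    ]
    return mean(stratum_scores)
```

Under this assumption, a stratum with few queries counts as much as a stratum with many, so the result can differ from a plain average over all queries.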
Approach
MinusX is an open-source agentic BI platform (built around our data agent). There are a few "tricks" under the hood.
The agent used for this benchmark is a slightly tweaked version of our production agent. Differences:
SubmitAnswer tool that returns just the brief response.
Overall stratified pass@1: 64.77%
Zip of all the logs, tool calls, agent traces