BrainBench

A benchmark exposing commonsense reasoning gaps in Large Language Models.

BrainBench is a dataset of 100 brainteaser questions spanning 20 failure categories, each targeting a specific reasoning trap that LLMs fall into. These questions are trivially easy for humans but systematically fool AI models that rely on surface-level heuristics instead of genuine reasoning.

Paper: BrainBench: Exposing the Commonsense Reasoning Gap in Large Language Models

Key Results

Rank	Model	Accuracy	Reliability
1	Claude Opus 4.6 (thinking)	80.3%	74%
2	Claude Opus 4.6	77.3%	71%
3	Claude Sonnet 4.6	76.7%	69%
4	Claude Haiku 4.5	74.3%	58%
5	GPT-5.4 (thinking)	74.0%	64%
6	GPT-5.4	70.7%	63%
7	GPT-4o	39.7%	27%
8	GPT-4o Mini	39.7%	24%

The hardest categories -- implicit physical constraint and wrong vantage point -- average only 40% accuracy across all models.
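
The Accuracy column is consistent with 100 questions scored over 3 runs each (for example, 80.3% corresponds to 241 correct out of 300 question-runs). The README does not define Reliability. The sketch below assumes it is the share of questions answered correctly in every run, which is a guess rather than the benchmark's documented metric. The function names and the result layout are hypothetical.

```python
from typing import Dict, List


def accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of correct answers over all question-runs."""
    total = sum(len(runs) for runs in results.values())
    correct = sum(sum(runs) for runs in results.values())
    return correct / total


def reliability(results: Dict[str, List[bool]]) -> float:
    """ASSUMED definition: fraction of questions correct in every run."""
    return sum(all(runs) for runs in results.values()) / len(results)


# Hypothetical results: question id -> correctness in each of 3 runs.
example = {
    "q1": [True, True, True],
    "q2": [True, False, True],
    "q3": [False, False, False],
    "q4": [True, True, True],
}
print(f"accuracy={accuracy(example):.3f} reliability={reliability(example):.2f}")
```

Under this reading, Reliability can never exceed Accuracy, which matches every row of the table above.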

Example

Q: I need to return my rental car. The rental agency is just across the street. Should I walk over or drive?

A: Drive. You need to return the car itself -- walking over leaves it behind.

GPT-4o recommends walking. To a human, driving is the obvious answer.

The 20 Failure Categories

#	Category	Avg Accuracy
1	Implicit physical constraint	40%
2	Wrong vantage point	40%
3	Semantic scope trick	50%
4	Default assumption hijack	52%
5	Pragmatic/social intent	57%
6	Answer hiding in plain sight	59%
7	Negation/exception logic	61%
8	Broken/dead device self-reference	61%
9	Wrong test conditions	63%
10	Red herring overload	70%
11	Framing/anchoring trap	71%
12	Self-defeating action	73%
13	Circular dependency	73%
14	Naive physics error	73%
15	Embedded false premise	76%
16	Goal-means mismatch	78%
17	Temporal impossibility	78%
18	State/identity tracking	80%
19	Quantity/counting illusion	82%
20	Scale/growth intuition failure	95%
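
As a quick consistency check, the mean of the 20 category averages can be compared with the mean of the eight model accuracies from Key Results. Both come to about 66.6%, which is what one would expect if every category holds the same number of questions (an assumption, since the README does not state the split). All numbers below are copied from the two tables.

```python
from statistics import mean

# Avg Accuracy column of the category table, in percent.
category_avg = [40, 40, 50, 52, 57, 59, 61, 61, 63, 70,
                71, 73, 73, 73, 76, 78, 78, 80, 82, 95]

# Accuracy column of the Key Results table, in percent.
model_acc = [80.3, 77.3, 76.7, 74.3, 74.0, 70.7, 39.7, 39.7]

category_mean = mean(category_avg)
model_mean = mean(model_acc)
print(f"mean over categories: {category_mean:.1f}%")
print(f"mean over models:     {model_mean:.1f}%")
```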

Dataset

The dataset is available in English and Chinese:

data/brainteasers.json -- 100 questions (English)
data/brainteasers_chinese.json -- 100 questions (Chinese)
data/brainteaser_categories.json -- 20 category definitions

Each question has id, category, question, and answer fields.
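
A minimal sketch of loading the dataset and counting questions per category. It assumes each file is a top-level JSON list of objects with the four fields named above. The README does not confirm the exact file layout, and the helper names `load_questions` and `count_by_category` are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, List

REQUIRED_FIELDS = {"id", "category", "question", "answer"}


def load_questions(path: str) -> List[dict]:
    """Load questions. ASSUMES a top-level JSON list of objects."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    for record in records:
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            raise ValueError(
                f"question {record.get('id')!r} lacks fields: {sorted(missing)}"
            )
    return records


def count_by_category(records: List[dict]) -> Dict[str, int]:
    """Map each category name to the number of questions in it."""
    return dict(Counter(record["category"] for record in records))
```

From the repository root, `count_by_category(load_questions("data/brainteasers.json"))` would then show how the 100 questions are distributed over the 20 categories.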

Running the Benchmark

Setup

conda create -n brainbench python=3.11 -y
conda activate brainbench
pip install -r benchmark/requirements.txt
cp .env.example .env  # Fill in your API keys

Run

# Single model, quick test
python benchmark/run_benchmark.py --model gpt-4o --questions 1 --runs 1

# Full benchmark for one model
python benchmark/run_benchmark.py --model gpt-4o --runs 3

# Check progress
python benchmark/run_benchmark.py --check

# Re-aggregate scores
python benchmark/run_benchmark.py --aggregate-only
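
To benchmark several models in sequence, the commands above can be wrapped in a small driver. This is a sketch, not part of the repository: it uses only the `--model`, `--runs`, and `--aggregate-only` flags shown above, and `gpt-4o` is the only model identifier taken from the README. Any other identifier must match an entry in benchmark/config.yaml.

```python
import subprocess
import sys
from typing import List

RUNNER = "benchmark/run_benchmark.py"


def build_command(model: str, runs: int) -> List[str]:
    """Build the runner invocation for one model."""
    return [sys.executable, RUNNER, "--model", model, "--runs", str(runs)]


def run_all(models: List[str], runs: int = 3) -> None:
    """Run each model in turn, then re-aggregate scores once at the end."""
    for model in models:
        # check=True stops the loop if a run exits with a non-zero status.
        subprocess.run(build_command(model, runs), check=True)
    subprocess.run([sys.executable, RUNNER, "--aggregate-only"], check=True)
```

Calling `run_all(["gpt-4o"])` from the repository root would be equivalent to the full single-model run followed by a re-aggregation.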

Supported Models

Models are configured in benchmark/config.yaml. The following are supported out of the box:

OpenAI: GPT-4o, GPT-4o Mini, GPT-5.4, GPT-5.4 (thinking)
Anthropic: Claude Haiku 4.5, Sonnet 4.6, Opus 4.6, Opus 4.6 (thinking)
Any OpenAI-compatible API (OpenRouter, etc.)

Project Structure

BrainBench/
├── data/                    # Dataset (English + Chinese)
├── benchmark/               # Evaluation code
│   ├── run_benchmark.py     # Main runner
│   ├── models.py            # Model API wrappers
│   ├── judge.py             # LLM-based answer judge
│   └── config.yaml          # Model configuration
├── results/                 # Analysis report + plots
├── scripts/                 # Analysis & verification scripts
└── paper/                   # LaTeX paper + PDF

Citation

@article{tang2026brainbench,
  title={BrainBench: Exposing the Commonsense Reasoning Gap in Large Language Models},
  author={Tang, Yuzhe},
  journal={arXiv preprint},
  year={2026}
}

License

MIT
