Project Page | Dataset | Quick Start | Citation
MagicBench is a deception-sensitive cognitive benchmark for large language models built around magic-trick understanding. Instead of testing recall alone, it evaluates whether a model can reason about hidden causes, audience beliefs, violated expectations, and counterfactual changes in scenarios where the visible events are intentionally misleading.
The benchmark is inspired by cognitive abilities often discussed in AGI evaluation frameworks, including perception, attention, memory, reasoning, metacognition, social cognition, executive function, and transfer.
Magic tricks are useful evaluation cases because they separate:
- what is observed
- what the audience believes
- what is actually happening
That makes them a compact way to probe whether a model can track deception, infer plausible hidden mechanisms, and stay calibrated about uncertainty.
MagicBench currently defines:
- 50 magic scenarios
- 6 task types per scenario
- 300 total benchmark items per run
Each scenario includes:
- an audience-facing description of the effect
- the relevant violated expectations
- an abstract gold explanation of the method
- a belief trace contrasting audience belief vs. reality
- counterfactual variants
- difficulty annotations
- primary cognitive faculties
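The scenario fields above might be represented as something like the following sketch. Every field name and value here is an illustrative assumption based on this list, not the dataset's actual schema:

```python
# Illustrative sketch of one scenario record; field names and the example
# trick are assumptions for exposition, not the real hsiung/MagicBench schema.
scenario = {
    "effect_description": "A signed coin vanishes from the magician's hand "
                          "and reappears inside a sealed lemon.",
    "violated_expectations": ["object permanence", "containment"],
    "gold_explanation": "A duplicate object is loaded in advance; the visible "
                        "object is secretly retained.",
    "belief_trace": {
        "audience_belief": "The same coin travelled into the lemon.",
        "reality": "A duplicate was pre-loaded; the original was palmed.",
    },
    "counterfactuals": ["Would the method survive an audience-chosen fruit?"],
    "difficulty": {"inference_depth": "medium"},
    "faculties": ["social cognition", "reasoning"],
}
print(sorted(scenario))
```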
Each scenario generates one item for each of the following task types:
- `effect_recognition`: identify the type of effect the audience experienced.
- `violation_identification`: determine which expectations or rules appear to be violated.
- `best_explanation`: choose the most plausible hidden method.
- `belief_trace`: infer what the audience believes at a specific moment.
- `metacognitive_calibration`: distribute confidence across competing explanations.
- `counterfactual_reasoning`: judge whether the method would still work under a changed condition.
Scores are aggregated into five benchmark dimensions:
- `recognition`
- `causal_inference`
- `deception_modeling`
- `metacognitive_calibration`
- `transfer_robustness`
The script also maps task performance onto a broader faculty profile and reports performance by difficulty axis and trick family.
The reported overall score is a weighted average across task types:
- `effect_recognition`: 10%
- `violation_identification`: 10%
- `best_explanation`: 20%
- `belief_trace`: 25%
- `metacognitive_calibration`: 10%
- `counterfactual_reasoning`: 25%
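The weighted average can be sketched in a few lines of Python. The weights are taken from the README; the per-task scores below are made-up illustrative values:

```python
# Task-type weights as listed in the README (they sum to 1.0).
WEIGHTS = {
    "effect_recognition": 0.10,
    "violation_identification": 0.10,
    "best_explanation": 0.20,
    "belief_trace": 0.25,
    "metacognitive_calibration": 0.10,
    "counterfactual_reasoning": 0.25,
}

def overall_score(task_scores: dict) -> float:
    """Weighted average of per-task scores (each in [0, 1])."""
    return sum(WEIGHTS[task] * score for task, score in task_scores.items())

# Illustrative: a model scoring 0.8 on every task gets an overall 0.8.
scores = {task: 0.8 for task in WEIGHTS}
print(round(overall_score(scores), 3))  # -> 0.8
```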
```
.
├── magicbench.py       # benchmark loader, task generation, scoring, CLI
└── scripts/
    ├── api.sh          # local API keys (gitignored)
    └── run.sh          # local helper script
```
- Python 3.9+ recommended
- `datasets` is required to load `hsiung/MagicBench` from Hugging Face
- An API key is needed for live model evaluation with supported providers
Clone the repo and run from the project root:
```bash
pip install datasets
python magicbench.py --help
```

`magicbench.py` loads scenarios from the Hugging Face dataset `hsiung/MagicBench`.
The first run will download the dataset into the local Hugging Face cache automatically.
For local runs, store API credentials in `scripts/api.sh`.
Example:
```bash
#!/bin/bash
export OPENAI_API_KEY="your-key-here"
export ANTHROPIC_API_KEY="your-key-here"
```

```bash
source scripts/api.sh
python magicbench.py \
    --model gpt-4o \
    --provider openai \
    --api-key "$OPENAI_API_KEY"
```

Print generated prompts without calling any API:

```bash
python magicbench.py --dry-run
```

Run the benchmark interactively as a person:

```bash
python magicbench.py --human
```

```
usage: magicbench.py [-h] [--model MODEL] [--provider {anthropic,openai}]
                     [--api-key API_KEY] [--n-repeats N_REPEATS]
                     [--delay DELAY] [--dry-run] [--human]
                     [--seed SEED] [--output-dir OUTPUT_DIR]
```
Notable flags:
- `--model`: model name to send to the provider API
- `--provider`: currently `anthropic` or `openai`
- `--n-repeats`: repeat each task multiple times
- `--delay`: sleep between API calls
- `--seed`: controls randomized option ordering and sampled task variants
- `--output-dir`: directory for results artifacts
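The role of `--seed` can be illustrated with a minimal sketch of deterministic option shuffling (the actual logic inside `magicbench.py` may differ):

```python
import random

def order_options(options, seed):
    """Deterministic shuffle: the same seed always yields the same ordering."""
    rng = random.Random(seed)  # independent RNG, does not touch global state
    shuffled = list(options)
    rng.shuffle(shuffled)
    return shuffled

# Hypothetical answer options for a best_explanation item.
options = ["magnet", "mirror", "palming", "duplicate"]
assert order_options(options, seed=0) == order_options(options, seed=0)
```

Fixing the seed makes multiple evaluation runs comparable, since every model sees the options in the same order.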
Each run writes timestamped artifacts to `results/` by default:

- `*_results.json`: per-item responses and scores
- `*_profile.json`: aggregated benchmark profile, faculty mapping, and difficulty analysis
- `*_report.txt`: human-readable report
If you find our work helpful or inspiring for your research, please cite our project as follows:
```bibtex
@misc{hsiung2026magicbench,
  title={{MagicBench: A Deception-Sensitive Cognitive Benchmark for LLMs}},
  author={Hsiung, Lei},
  year={2026},
  howpublished={\url{https://hsiung.cc/MagicBench/}},
}
```
