CreativityBench is a benchmark for evaluating creative reasoning through affordance-based tool repurposing. The repository contains the full pipeline for:
- building affordance-rich entity annotations,
- constructing benchmark tasks from those annotations,
- evaluating model predictions and judging the results.
CreativityBench-Open/
├── dataset/
├── annotation/
├── task_creation/
├── evaluation/
├── assets/
├── requirements.txt
└── README.md
- dataset/: sample task file for evaluation
- annotation/: generate partonomy, physical attributes, state attributes, and functional affordances
- task_creation/: turn annotated entities into benchmark tasks
- evaluation/: run model inference and judge predictions
Each subfolder has its own concise README with folder-specific details.
Install the Python dependencies from the repository root:
pip install -r requirements.txt
The current requirements cover the code in annotation/, task_creation/, and evaluation/.
Before using the released benchmark data, first download the dataset from the link below:
- Newest Version: 🔗 Download
The link above leads to our release of the first batch of 3.3K tasks. This repository now only includes a sample task file by default:
dataset/sample_tasks.json
If you only want a quick evaluation demo, you can use this sample task file directly. For larger-scale evaluation, point TASK_FILE to the full downloaded task file.
The sample evaluation file is a JSON list of task objects. Each task currently includes:
- task_id: unique task identifier
- scenario: scenario label
- setting: metadata such as difficulty and entity count
- golds: gold entity-part affordance reference
- entities: candidate entities and their descriptions
- items: extra scene items
- environment: natural-language scene description
- task: user-facing task description
- solution: gold solution steps
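Before running a larger evaluation, it can help to confirm that a task file has the shape described above. The following is a minimal sketch, not part of the repository: the helper names `missing_fields` and `load_tasks` are hypothetical, and the check covers only the top-level field names listed in this README, not the structure nested inside each field.

```python
import json
from pathlib import Path

# Top-level field names as listed in this README for each task object.
EXPECTED_FIELDS = (
    "task_id", "scenario", "setting", "golds", "entities",
    "items", "environment", "task", "solution",
)


def missing_fields(task: dict) -> list:
    """Return the expected top-level fields that are absent from one task."""
    return [name for name in EXPECTED_FIELDS if name not in task]


def load_tasks(path: str) -> list:
    """Load a task file and confirm it is a JSON list of objects."""
    tasks = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(tasks, list):
        raise ValueError(f"{path}: expected a JSON list of task objects")
    for index, task in enumerate(tasks):
        if not isinstance(task, dict):
            raise ValueError(f"{path}: entry {index} is not a JSON object")
    return tasks


if __name__ == "__main__":
    # Run from the repository root so the relative path resolves.
    for task in load_tasks("dataset/sample_tasks.json"):
        gaps = missing_fields(task)
        if gaps:
            print(f"{task.get('task_id', '<no task_id>')}: missing {gaps}")
```

The same check can be pointed at the full downloaded task file by changing the path passed to `load_tasks`.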
Run the benchmark pipeline in this order:
Generate structured entity annotations from seed entities:
cd annotation
./run.sh
This stage starts from 1_sample_entities.json and produces:
- partonomy graphs,
- physical and state variants,
- part-level functional affordances,
- final assembled entities in annotation/outputs/.
Build benchmark tasks from the annotation outputs:
cd task_creation
./run.sh
This stage clusters affordances, samples gold examples and alternatives, and writes benchmark tasks to task_creation/outputs/.
Run model inference on the benchmark tasks and judge the outputs:
cd evaluation
./run.sh
By default, evaluation/run.sh runs both:
- evaluate.py
- judge.py
Set RUN_JUDGE=0 if you only want raw model outputs.
Select the evaluation input file with TASK_FILE. The bundled sample file is only the default, so a larger downloaded task file can be swapped in without changing the evaluation code.
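For scripted runs, the stage order and the two evaluation switches can be driven from Python. This is a sketch under assumptions: `evaluation_env` and `run_stage` are hypothetical helpers that do not exist in the repository, and the task path is converted to an absolute path because this README does not say how `run.sh` resolves a relative `TASK_FILE`.

```python
import os
import subprocess
from pathlib import Path

# Stage folders in the order given by this README.
STAGES = ("annotation", "task_creation", "evaluation")


def evaluation_env(task_file=None, run_judge=True, base_env=None):
    """Build the environment passed to evaluation/run.sh.

    TASK_FILE and RUN_JUDGE are the variables named in this README.
    When run_judge is True, any inherited RUN_JUDGE value is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    if task_file is not None:
        # Assumption: an absolute path is safe regardless of the runner's cwd.
        env["TASK_FILE"] = str(Path(task_file).resolve())
    if not run_judge:
        env["RUN_JUDGE"] = "0"
    return env


def run_stage(repo_root, stage, env=None):
    """Run one stage's run.sh from inside its own folder."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    subprocess.run(["./run.sh"], cwd=Path(repo_root) / stage, env=env, check=True)


if __name__ == "__main__":
    # Inference only, on the bundled sample file, skipping the judge step.
    env = evaluation_env("dataset/sample_tasks.json", run_judge=False)
    run_stage(".", "evaluation", env=env)
```

Running each stage with its own folder as the working directory mirrors the `cd <stage>` followed by `./run.sh` steps listed above.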
The repository uses environment variables for model and API configuration. Do not hard-code credentials in the codebase.
OpenAI:
export OPENAI_API_KEY="your_openai_api_key"
Optional:
export OPENAI_MODEL="gpt-5.2"
export OPENAI_BASE_URL="https://api.openai.com/v1"
For evaluation/, local or hosted vLLM-compatible endpoints are also supported:
export VLLM_API_KEY="EMPTY"
export VLLM_BASE_URL="http://localhost:8000/v1"

- annotation/ and task_creation/ use OpenAI-compatible API calls.
- evaluation/ currently supports OpenAI and vLLM-compatible model endpoints.
- dataset/sample_tasks.json is only a bundled sample input for evaluation, and TASK_FILE can point to any compatible task JSON.
- All stage runners are relative-path based and avoid machine-specific hard-coded .env sourcing.
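The environment-variable approach described above can be illustrated with a small reader that fails early when the key is missing instead of falling back to a hard-coded credential. This is a sketch only: `EndpointConfig`, `openai_config`, and `vllm_config` are hypothetical names, and the fallback values simply reuse the example values shown in this README, which may differ from the defaults the repository's code applies.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EndpointConfig:
    api_key: str
    base_url: str
    model: Optional[str] = None


def openai_config(env: Optional[Mapping[str, str]] = None) -> EndpointConfig:
    """Read OpenAI settings from the environment; the API key is required."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running")
    return EndpointConfig(
        api_key=api_key,
        # Assumption: fallbacks copied from the README's example exports.
        base_url=env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        model=env.get("OPENAI_MODEL", "gpt-5.2"),
    )


def vllm_config(env: Optional[Mapping[str, str]] = None) -> EndpointConfig:
    """Read vLLM-compatible endpoint settings from the environment."""
    env = os.environ if env is None else env
    return EndpointConfig(
        # Assumption: fallbacks copied from the README's example exports.
        api_key=env.get("VLLM_API_KEY", "EMPTY"),
        base_url=env.get("VLLM_BASE_URL", "http://localhost:8000/v1"),
        # The README names no model variable for vLLM, so none is read here.
        model=None,
    )
```

Passing a plain dictionary as `env` makes the readers easy to exercise without touching the real process environment.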
@article{qian2026creativitybench,
title={CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing},
author={Qian, Cheng and Ha, Hyeonjeong and Liu, Jiayu and Kim, Jeonghwan and Liu, Jiateng and Li, Bingxuan and Tiwari, Aditi and Dalal, Dwip and Wang, Zhenhailong and Chen, Xiusi and Namazifar, Mahdi and Li, Yunzhu and Ji, Heng},
journal={arXiv preprint arXiv:2605.02910},
year={2026}
}

