PRISM is a research codebase for evaluating LLM reasoning methods across multiple benchmark types:
- multiple-choice QA (`gpqa`)
- math competition problems (`hmmt`, `aime`)
The repo supports both single-shot baselines and composable multi-stage methods such as recursive aggregation, debate-style refinement, and PRISM.
This repository corresponds to the paper:
At a high level, the benchmark runner:
- loads examples from one or more datasets
- runs a method on each example
- checks correctness when possible
- appends results to CSV files in `data/outputs/`
There are two main output files:
- `data/outputs/shared-results.csv`: final answers only
- `data/outputs/depth_accuracy.csv`: per-step / per-chain depth traces plus final rows
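As a quick sanity check on a finished run, the final-answer CSV can be tallied with the standard library alone. This is a sketch that assumes the `method` and `is_correct` columns described later in this README; adjust the names if your schema differs:

```python
import csv
from collections import defaultdict

def accuracy_by_method(path):
    """Tally final-row accuracy per method from a results CSV.

    Assumes `method` and `is_correct` columns, with `is_correct`
    stored as a true/false-like string.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total[row["method"]] += 1
            if row["is_correct"].strip().lower() in ("true", "1"):
                correct[row["method"]] += 1
    return {m: correct[m] / total[m] for m in total}
```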
- Python `>=3.12` (see `pyproject.toml`)
- `uv` recommended for dependency management
- access to at least one model backend:
  - Gemini via `GEMINI_API_KEY`
  - Together via `TOGETHER_API_KEY`
  - OpenAI-compatible endpoint via `--model_url`
- `HF_TOKEN` recommended for dataset access, especially `livecodebench_v6`
```bash
uv sync
```

If you want dev tools too:

```bash
uv sync --group dev
```

Create a `.env` file in the repo root if you do not want to pass secrets through the shell:
```
GEMINI_API_KEY=...    # if using Gemini
OPENAI_API_KEY=...    # if using OpenAI
TOGETHER_API_KEY=...  # if using Together
HF_TOKEN=...          # for Hugging Face datasets
```

Only set the variables you actually need.
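If you prefer not to add a dotenv dependency elsewhere in your tooling, the `.env` format above is simple enough to load with a few lines of stdlib Python. This is a minimal illustrative sketch, not the repo's own loading logic:

```python
import os

def load_dotenv(path=".env"):
    """Minimal .env parser: KEY=VALUE lines, '#' comments ignored.

    Uses setdefault so existing environment variables are not
    overwritten, meaning shell exports still take precedence.
    """
    try:
        lines = open(path).read().splitlines()
    except FileNotFoundError:
        return
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```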
Show all CLI options:

```bash
uv run python src/main.py --help
```

Run a small zero-shot evaluation:

```bash
uv run python src/main.py \
  --datasets gpqa \
  --method zero-shot \
  -n 10 \
  --samples 1 \
  -m openai/gpt-oss-20b \
  --model_url http://localhost:8089/v1
```

Run a composable method:
```bash
uv run python src/main.py \
  --datasets gpqa \
  --cp sample_n \
  --p2p recursive_aggregate \
  --p2a majority_vote \
  --samples 10 \
  -w 10 \
  -d 5 \
  -t 0.8 \
  -m openai/gpt-oss-20b \
  --model_url http://localhost:8089/v1
```

Run PRISM:
```bash
uv run python src/main.py \
  --datasets gpqa \
  --cp sample_n \
  --p2p prism \
  --p2a prm_score_vote \
  --samples 10 \
  -w 10 \
  -d 5 \
  -t 0.8 \
  --prism_t 0.8 \
  --prism_ess 0.5 \
  --prism_noise 0.1 \
  -m openai/gpt-oss-20b \
  --model_url http://localhost:8089/v1
```

Common arguments:
- `--datasets`: dataset names to run
- `-n`, `--max_samples_per_dataset`: cap examples per dataset
- `--start`: skip the first N examples in each selected dataset
- `--question_ids`: run only specific question IDs
- `-m`: model name
- `--model_url`: OpenAI-compatible endpoint for local or remote servers
- `-t`: sampling temperature
- `--output_csv`: final-result CSV path
- `--depth_metrics_csv`: depth-trace CSV path
Composable method arguments:
- `--cp`: create-population stage
- `--p2p`: population-to-population stage
- `--p2a`: population-to-answer stage
- `--samples`: number of seed samples
- `-w`: width
- `-d`: depth
- `--agg`: aggregation pool size
PRISM-specific arguments:
- `--prism_t`
- `--prism_ess`
- `--prism_noise`
- `--follower_ratio`
Registered datasets in the current code:
`gpqa`, `hmmt`, `aime`
Built-in full methods:
`zero-shot`
Composable stages:
- create-population: `sample_n`
- population-to-population: `refine`, `agentic_debate`, `recursive_aggregate`, `mad_conformist`, `mad_follower`, `prism`
- population-to-answer: `majority_vote`, `prm_score_vote`, `llm_aggregate`
When you use `--cp`/`--p2p`/`--p2a`, the runner builds a dynamic method name such as `sample_n_recursive_aggregate_majority_vote`.
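As an illustration of what a population-to-answer stage does, a `majority_vote`-style reduction can be sketched with `collections.Counter`. This is not the repo's implementation, just the idea:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common normalized answer from a population.

    Ties break toward first occurrence, following Counter's ordering.
    """
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]
```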
The repo includes helper scripts in `script/`:

- `script/bench.sh`: run the benchmark on a CPU partition against an existing model endpoint
- `script/run_bench.sh`: start a vLLM server on a GPU node, then run the benchmark against it
- `script/serve_vllm.sh`: standalone vLLM server job
- `script/sweep_methods_parallel.sh`: submit one Slurm job per method combination
```bash
sbatch script/bench.sh src/main.py \
  -m openai/gpt-oss-20b \
  --model_url http://tc-gpu001:8089/v1 \
  --datasets gpqa \
  --cp sample_n \
  --p2p recursive_aggregate \
  --p2a majority_vote \
  --samples 10 \
  -w 10 \
  -d 5
```

```bash
sbatch script/run_bench.sh python src/main.py \
  --datasets gpqa \
  --method zero-shot \
  -n 10 \
  --samples 1 \
  -m openai/gpt-oss-20b
```

Cluster logs are written to `job-outputs/`.
`shared-results.csv` includes one final row per example. Typical columns:

- `run_id`
- `method`
- `dataset`
- `question_id`
- `question_index`
- `step`
- `raw_answer`
- `normalized_answer`
- `predicted_label`
- `is_correct`
- token usage fields

`depth_accuracy.csv` includes intermediate responses too:
- per-chain seed rows
- per-depth transformed populations
- final answer rows
This is the file to use for depth curves and chain-level analysis.
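For depth curves, per-step accuracy can be aggregated from this file with the standard `csv` module. A sketch, assuming the `step` and `is_correct` columns listed above (names may differ in your version):

```python
import csv
from collections import defaultdict

def accuracy_by_depth(path):
    """Compute accuracy at each depth step from a depth-trace CSV.

    Assumes integer-valued `step` and a true/false-like `is_correct`.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            step = int(row["step"])
            total[step] += 1
            # bool adds as 0/1, so this counts correct rows per step
            correct[step] += row["is_correct"].strip().lower() in ("true", "1")
    return {s: correct[s] / total[s] for s in sorted(total)}
```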
Run tests:

```bash
uv run pytest
```

Run linting:

```bash
uv run ruff check .
```

Run type checking:

```bash
uv run pyright
```

Current tests cover:
- answer normalization / verification
- composable cache behavior
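For intuition, answer normalization in this kind of harness typically strips formatting before comparison. A purely illustrative sketch (the repo's real normalizer may differ):

```python
import re

def normalize_answer(raw):
    """Illustrative normalization: strip whitespace, lowercase, and
    unwrap a surrounding LaTeX \\boxed{...} if present.

    This only shows the kind of transformation meant; it is not the
    function the test suite actually exercises.
    """
    s = raw.strip()
    m = re.fullmatch(r"\\boxed\{(.*)\}", s)
    if m:
        s = m.group(1)
    return s.lower().strip()
```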
- `src/main.py`: CLI entrypoint
- `src/settings.py`: CLI / env configuration
- `src/data_sources.py`: dataset registry and loaders
- `src/shared.py`: model setup, schemas, CSV logging
- `src/methods/`: methods and composable stages
- `script/bench.sh`: Slurm benchmark wrapper
- `script/run_bench.sh`: Slurm vLLM + benchmark wrapper
- `script/serve_vllm.sh`: Slurm vLLM server launcher
- `script/sweep_methods_parallel.sh`: Slurm sweep helper
- `data/outputs/`: result CSVs
- `job-outputs/`: Slurm stdout / stderr logs
- `tests/`: test suite
- Non-Gemini models require either `--model_url`, `TOGETHER_API_KEY`, or a supported OpenAI setup.
- For OpenAI-compatible local servers, `--model_url` is usually the simplest path.
- Output CSVs are append-only by default.
- Large runs can produce very large CSV files because code prompts, traces, and reasoning are stored inline.
This project is licensed under the MIT License. See LICENSE.