A lightweight evaluation harness for PRBench, a large-scale expert-annotated benchmark for high-stakes reasoning in professional domains. Current version covers Legal and Finance domains.
This repository provides the code needed to run rubric-based evaluations with the PRBench dataset and pipeline. The main entry point is `evals.py`, which handles model response generation, automated scoring via an LLM judge, caching, retries, and reporting.
PRBench consists of:

- 1,100 expert-authored conversations across the Finance and Legal domains
- 19,356 expert-curated rubric criteria (10–30 per task)
- Coverage of 114 countries, 47 U.S. jurisdictions, and 25 professional topics in total
- Hard subsets (Finance-300, Legal-250) representing the most challenging tasks
See the paper and full release details at: https://scale.com/research/prbench
Explore PRBench using our visualizer at: https://prbench-explorer.vercel.app/
- Install the dependencies in `requirements.txt`.
- Set the API key and base URL for the endpoint used to sample responses and to grade them.
- Configure `config.yaml` so that `litellm_key_path` and the base URL point to the right places (a hedged example config is sketched after this list). The code uses the OpenAI SDK, so make sure the key matches your endpoint.
- Select which response models to evaluate in `response_model_names`.
- For debugging, set `debug` to `true`, which runs the evaluation on only the first 2 samples.
- Run with `python evals.py --config config.yaml`.
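
For orientation, here is a hedged sketch of what a minimal `config.yaml` might look like, built only from the fields mentioned in this README (`litellm_key_path`, the base URL, `response_model_names`, `debug`, and `final_response_source`). The base-URL key name and the example model names are assumptions; the `config.yaml` shipped in the repository is authoritative.

```yaml
# Hypothetical minimal config.yaml sketch. Field names not spelled out in this
# README (e.g. the base-URL key) are assumptions; check the repo's config.yaml.
litellm_key_path: ~/.keys/litellm_key          # path to a file containing your API key
litellm_base_url: https://your-endpoint.example.com/v1  # assumed key name for the OpenAI-compatible base URL

response_model_names:                          # models whose responses are evaluated
  - example-model-a
  - example-model-b

final_response_source: sampled                 # or "prefilled" to load responses from a JSON file

debug: false                                   # true runs only the first 2 samples
```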
By default, when `final_response_source` is set to `sampled`, the script samples responses to the conversations from the models listed under `response_model_names`. Alternatively, setting `final_response_source: prefilled` loads responses from `filename.json`; format the JSON as a mapping from task to response (an illustrative example follows below).
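
For illustration, a prefilled responses file in the task-to-response format described above might look like the sketch below; the task identifiers and response strings are hypothetical placeholders, and the actual keys should match the task identifiers used by the PRBench dataset.

```json
{
  "finance_task_0001": "Full final response to the first conversation...",
  "legal_task_0002": "Full final response to the second conversation..."
}
```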
Results are saved under `results/`. We report `mean_clipped` scores in the paper. Results for individual data points can be found under `outputs`.
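
This README does not define `mean_clipped`; one plausible reading, given rubric-based scoring with weighted criteria, is a per-task rubric score clipped to [0, 1] before averaging across tasks. The sketch below implements only that assumption and is not the repository's actual metric; the `criteria` structure and weights are hypothetical.

```python
# Hypothetical sketch of a "mean clipped" rubric score. This is an assumption
# about what mean_clipped could mean, not the metric implemented in evals.py.

def task_score(criteria: list[dict]) -> float:
    """Weighted rubric score for one task, clipped to [0, 1].

    `criteria` is assumed to be a list of {"weight": float, "met": bool} items,
    where negative weights penalize undesirable behavior.
    """
    positive_total = sum(c["weight"] for c in criteria if c["weight"] > 0)
    earned = sum(c["weight"] for c in criteria if c["met"])
    raw = earned / positive_total if positive_total else 0.0
    return min(max(raw, 0.0), 1.0)  # clip so penalties cannot push the score below 0

def mean_clipped(all_tasks: list[list[dict]]) -> float:
    """Average of per-task clipped scores across the benchmark."""
    return sum(task_score(t) for t in all_tasks) / len(all_tasks)

# Example: one task with two met criteria and one unmet penalty criterion.
example = [[{"weight": 1.0, "met": True},
            {"weight": 2.0, "met": True},
            {"weight": -1.0, "met": False}]]
print(mean_clipped(example))  # 1.0
```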
We are grateful to the domain experts who contributed their time and expertise to PRBench.