A lightweight evaluation harness for PRBench, a large-scale expert-annotated benchmark for high-stakes reasoning in professional domains. Current version covers Legal and Finance domains.
This repository provides the code needed to run rubric-based evaluations with the PRBench dataset and pipeline. The main entry point is `evals.py`, which handles model response generation, automated scoring via an LLM judge, caching, retries, and reporting.
PRBench consists of:

- 1,100 expert-authored conversations across the Finance and Legal domains
- 19,356 expert-curated rubric criteria (10–30 per task)
- Coverage of 114 countries, 47 U.S. jurisdictions, and 25 professional topics in total
- Hard subsets (Finance-300, Legal-250) representing the most challenging tasks
See the paper and full release details at: https://scale.com/research/prbench
Explore PRBench using our visualizer at: https://prbench-explorer.vercel.app/
- Install the dependencies in `requirements.txt`.
- Set the API key and base URL for the endpoint used to sample responses and to grade them.
- Configure `config.yaml` so that `litellm_key_path` and the base URL point to the right places (a hedged example config is sketched after this list). The code uses the OpenAI SDK, so make sure the key matches your endpoint.
- Select which response models to evaluate in `response_model_names`.
- For debugging, set `debug` to `true`, which runs the evaluation on only the first 2 samples.
- Run with `python evals.py --config config.yaml`.
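
For orientation, here is a hedged sketch of what a minimal `config.yaml` might look like, built only from the fields mentioned in this README (`litellm_key_path`, the base URL, `response_model_names`, `debug`, and `final_response_source`). The base-URL key name and the example model names are assumptions; the `config.yaml` shipped in the repository is authoritative.

```yaml
# Hypothetical minimal config.yaml sketch. Field names not spelled out in this
# README (e.g. the base-URL key) are assumptions; check the repo's config.yaml.
litellm_key_path: ~/.keys/litellm_key          # path to a file containing your API key
litellm_base_url: https://your-endpoint.example.com/v1  # assumed key name for the OpenAI-compatible base URL

response_model_names:                          # models whose responses are evaluated
  - example-model-a
  - example-model-b

final_response_source: sampled                 # or "prefilled" to load responses from a JSON file

debug: false                                   # true runs only the first 2 samples
```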
By default, when `final_response_source` is set to `sampled`, the script samples responses to the conversations from the models listed under `response_model_names`. Alternatively, setting `final_response_source: prefilled` loads responses from `filename.json`; format the JSON as a mapping from task to response (an illustrative example follows below).
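
For illustration, a prefilled responses file in the task-to-response format described above might look like the sketch below; the task identifiers and response strings are hypothetical placeholders, and the actual keys should match the task identifiers used by the PRBench dataset.

```json
{
  "finance_task_0001": "Full final response to the first conversation...",
  "legal_task_0002": "Full final response to the second conversation..."
}
```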
Results are saved under `results/`. We report `mean_clipped` scores in the paper. Results for individual data points can be found under `outputs`.
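
This README does not define `mean_clipped`; one plausible reading, given rubric-based scoring with weighted criteria, is a per-task rubric score clipped to [0, 1] before averaging across tasks. The sketch below implements only that assumption and is not the repository's actual metric; the `criteria` structure and weights are hypothetical.

```python
# Hypothetical sketch of a "mean clipped" rubric score. This is an assumption
# about what mean_clipped could mean, not the metric implemented in evals.py.

def task_score(criteria: list[dict]) -> float:
    """Weighted rubric score for one task, clipped to [0, 1].

    `criteria` is assumed to be a list of {"weight": float, "met": bool} items,
    where negative weights penalize undesirable behavior.
    """
    positive_total = sum(c["weight"] for c in criteria if c["weight"] > 0)
    earned = sum(c["weight"] for c in criteria if c["met"])
    raw = earned / positive_total if positive_total else 0.0
    return min(max(raw, 0.0), 1.0)  # clip so penalties cannot push the score below 0

def mean_clipped(all_tasks: list[list[dict]]) -> float:
    """Average of per-task clipped scores across the benchmark."""
    return sum(task_score(t) for t in all_tasks) / len(all_tasks)

# Example: one task with two met criteria and one unmet penalty criterion.
example = [[{"weight": 1.0, "met": True},
            {"weight": 2.0, "met": True},
            {"weight": -1.0, "met": False}]]
print(mean_clipped(example))  # 1.0
```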
We are grateful to the domain experts who contributed their time and expertise to PRBench.