BenchmarkCards offer a standardized way to document LLM benchmarks clearly and transparently. Inspired by Model Cards and Datasheets, BenchmarkCards help researchers and practitioners understand exactly what benchmarks test, how they relate to real-world risks, and how to interpret their results responsibly.
Who is this for? AI researchers, data scientists, auditors, policymakers, and anyone concerned with responsible AI deployment.
Paper link: https://arxiv.org/abs/2410.12974
Our paper BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks has been accepted to NeurIPS 2025 (Datasets and Benchmarks Track) β yay!
- π BenchmarkCards/ β Generated BenchmarkCards in Markdown. Filenames prefixed with β indicate that the benchmark author reviewed and approved the card.
- π BenchmarkCards_JSON/ β JSON versions of the BenchmarkCards for programmatic access.
- π₯οΈ platform/ β Source code for the upcoming BenchmarkCards Platform.
- πΈ screenshots/ β Screenshots illustrating the platform interface.
- π BenchmarkCard_Template.md β A readyβtoβuse template for creating new BenchmarkCards.
- π AI_Risk_Atlas.md β Maps benchmarks to IBM Atlas AI risk categories and associated benchmarks.
We gratefully thank all benchmark authors who provided feedback and approval for the BenchmarkCards in this repository. Benchmarks approved by their original authors are marked with a β in the filename. Your collaboration is essential for making LLM evaluation more transparent, accurate, and useful. Thank you!
π€ View BenchmarkCards Dataset on Hugging Face
Our complete collection of over 4,000 BenchmarkCards is now available on Hugging Face for easy programmatic access and integration into your research workflows.
- Formats Available: JSON, Markdown
- Regular Updates: We continuously add new benchmarks - check back often! = load_dataset("ASokol/BenchmarkCards")
Note: This dataset is actively maintained and regularly updated with new benchmarks.
We're excited to introduce our new BenchmarkCards Platform - an automated tool designed to streamline the process of generating benchmark cards from research papers!
Status: ποΈ Currently under construction - coming soon!
The platform will allow users to:
- Upload benchmark papers in PDF format
- Automatically extract key information
- Generate structured BenchmarkCards
- Download cards in JSON format for easy integration
Here's a sneak peek at what the platform will look like:
Generated BenchmarkCard Example:

Platform Structure:
app.py: Main application entry point.config.py: Configuration settings.requirements.txt: Project dependencies.static/: Static files (CSS, JS, images).src/: Core source code modules:models.py: Pydantic data models.pdf_extractor.py: PDF text extraction functionality.ai_service.py: Integration with OpenAI APIs.templates.py: HTML templates management.markdown_converter.py: Convert content to Markdown.
# 1. Clone the repository
git clone https://github.com/SokolAnn/BenchmarkCards.git
cd BenchmarkCards
# 2. Set up your environment
pip install -r requirements.txt
# 3. Add your API key
# Open the config.py file and replace the placeholder with your OpenAI API key
# Example:
# OPENAI_API_KEY = "your-openai-api-key-here"
# 4. Run the app
python app.py
# 5. Access the app
# Open your browser and go to:
# http://localhost:8000/
To use the AI processing features, you need to set your OpenAI API key.
- Environment Variable: Set
OPENAI_API_KEYin your environment.export OPENAI_API_KEY="sk-..." # Windows PowerShell $env:OPENAI_API_KEY="sk-..."
- Config File: Alternatively, you can paste your key directly into
platform/config.py.
The AI generation logic is located in platform/src/ai_service.py. You can modify this file to:
- Change the system prompt.
- Use a different model (default is
gpt-4o-mini). - specific a different API provider.
We now provide command-line tools for batch processing and a search interface:
Process multiple PDFs into BenchmarkCards (JSON & Markdown):
python process_pdfs.py path/to/pdf_directory --output_dir output- Build the search index from existing JSON cards:
python build_search_index.py BenchmarkCards_JSON- Start the platform to access the search page at
/search:
cd platform
python app.pyThen visit: http://localhost:8000/search
If you use this work in your research, please cite:
@misc{sokol2025benchmarkcardsstandardizeddocumentationlarge,
title={BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks},
author={Anna Sokol and Elizabeth Daly and Michael Hind and David Piorkowski and Xiangliang Zhang and Nuno Moniz and Nitesh Chawla},
year={2025},
eprint={2410.12974},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.12974},
}Would you be interested in contributing to the BenchmarkCards initiative? Feel free to explore, fork the repository, and open issues to suggest improvements or new benchmarks. Let's collaborate and shape the future of LLM benchmarking!
All source code in this repository is licensed under the MIT License.
BenchmarkCard content (Markdown and JSON files) is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
