BENCHMARKCARDS

BenchmarkCards offer a standardized way to document LLM benchmarks clearly and transparently. Inspired by Model Cards and Datasheets, BenchmarkCards help researchers and practitioners understand exactly what benchmarks test, how they relate to real-world risks, and how to interpret their results responsibly.

Who is this for? AI researchers, data scientists, auditors, policymakers, and anyone concerned with responsible AI deployment.

Paper link: https://arxiv.org/abs/2410.12974

🎉 Exciting News!

Our paper BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks has been accepted to NeurIPS 2025 (Datasets and Benchmarks Track) — yay!

🛠️ How to Use this Repository

🔗 Key Contents

📁 BenchmarkCards/ – Generated BenchmarkCards in Markdown. Filenames prefixed with ⭐ indicate that the benchmark author reviewed and approved the card.
📁 BenchmarkCards_JSON/ – JSON versions of the BenchmarkCards for programmatic access.
🖥️ platform/ – Source code for the upcoming BenchmarkCards Platform.
📸 screenshots/ – Screenshots illustrating the platform interface.
📋 BenchmarkCard_Template.md – A ready‑to‑use template for creating new BenchmarkCards.
📊 AI_Risk_Atlas.md – Maps benchmarks to IBM Atlas AI risk categories and associated benchmarks.

🙏 Acknowledgments

We gratefully thank all benchmark authors who provided feedback and approval for the BenchmarkCards in this repository. Benchmarks approved by their original authors are marked with a ⭐ in the filename. Your collaboration is essential for making LLM evaluation more transparent, accurate, and useful. Thank you!

📊 Access to Hugging Face

🤗 View BenchmarkCards Dataset on Hugging Face

Our complete collection of over 4,000 BenchmarkCards is now available on Hugging Face for easy programmatic access and integration into your research workflows.

Formats Available: JSON, Markdown
Regular Updates: We continuously add new benchmarks - check back often! = load_dataset("ASokol/BenchmarkCards")

Note: This dataset is actively maintained and regularly updated with new benchmarks.

🚧 BenchmarkCards Platform 🚧

We're excited to introduce our new BenchmarkCards Platform - an automated tool designed to streamline the process of generating benchmark cards from research papers!

Status: 🏗️ Currently under construction - coming soon!

The platform will allow users to:

Upload benchmark papers in PDF format
Automatically extract key information
Generate structured BenchmarkCards
Download cards in JSON format for easy integration

Preview Screenshots

Here's a sneak peek at what the platform will look like:

Main Interface:

Generated BenchmarkCard Example:

Platform Structure:

app.py: Main application entry point.
config.py: Configuration settings.
requirements.txt: Project dependencies.
static/: Static files (CSS, JS, images).
src/: Core source code modules:
- models.py: Pydantic data models.
- pdf_extractor.py: PDF text extraction functionality.
- ai_service.py: Integration with OpenAI APIs.
- templates.py: HTML templates management.
- markdown_converter.py: Convert content to Markdown.

⚙️ Quick Setup Instructions

# 1. Clone the repository
git clone https://github.com/SokolAnn/BenchmarkCards.git
cd BenchmarkCards

# 2. Set up your environment
pip install -r requirements.txt

# 3. Add your API key
# Open the config.py file and replace the placeholder with your OpenAI API key
# Example:
# OPENAI_API_KEY = "your-openai-api-key-here"

# 4. Run the app
python app.py

# 5. Access the app
# Open your browser and go to:
# http://localhost:8000/

Configuration

OpenAI API Key

To use the AI processing features, you need to set your OpenAI API key.

Environment Variable: Set OPENAI_API_KEY in your environment.

export OPENAI_API_KEY="sk-..."
# Windows PowerShell
$env:OPENAI_API_KEY="sk-..."

Config File: Alternatively, you can paste your key directly into platform/config.py.

Customizing Generation

The AI generation logic is located in platform/src/ai_service.py. You can modify this file to:

Change the system prompt.
Use a different model (default is gpt-4o-mini).
specific a different API provider.

🛠️ CLI Tools & Search

We now provide command-line tools for batch processing and a search interface:

Batch PDF Processing

Process multiple PDFs into BenchmarkCards (JSON & Markdown):

python process_pdfs.py path/to/pdf_directory --output_dir output

Search Interface

Build the search index from existing JSON cards:

python build_search_index.py BenchmarkCards_JSON

Start the platform to access the search page at /search:

cd platform
python app.py

Then visit: http://localhost:8000/search

Citation

If you use this work in your research, please cite:

@misc{sokol2025benchmarkcardsstandardizeddocumentationlarge,
      title={BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks}, 
      author={Anna Sokol and Elizabeth Daly and Michael Hind and David Piorkowski and Xiangliang Zhang and Nuno Moniz and Nitesh Chawla},
      year={2025},
      eprint={2410.12974},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.12974}, 
}

🤝 Get Involved!

Would you be interested in contributing to the BenchmarkCards initiative? Feel free to explore, fork the repository, and open issues to suggest improvements or new benchmarks. Let's collaborate and shape the future of LLM benchmarking!

📄 License

All source code in this repository is licensed under the MIT License.
BenchmarkCard content (Markdown and JSON files) is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
BenchmarkCards		BenchmarkCards
BenchmarkCards_JSON		BenchmarkCards_JSON
Coverage_Map		Coverage_Map
platform		platform
screenshots		screenshots
AI_Risk_Atlas.md		AI_Risk_Atlas.md
BenchmarkCard_ Template.md		BenchmarkCard_ Template.md
Benchmark_Network_OLD.md		Benchmark_Network_OLD.md
Benchmarks_and_Risk_Table_OLD.md		Benchmarks_and_Risk_Table_OLD.md
ComparisonBenchmarks_OLD.md		ComparisonBenchmarks_OLD.md
README.md		README.md
build_search_index.py		build_search_index.py
license.md		license.md
process_pdfs.py		process_pdfs.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BENCHMARKCARDS

🎉 Exciting News!

🛠️ How to Use this Repository

🔗 Key Contents

🙏 Acknowledgments

📊 Access to Hugging Face

🚧 BenchmarkCards Platform 🚧

Preview Screenshots

⚙️ Quick Setup Instructions

Configuration

OpenAI API Key

Customizing Generation

🛠️ CLI Tools & Search

Batch PDF Processing

Search Interface

Citation

🤝 Get Involved!

📄 License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

BENCHMARKCARDS

🎉 Exciting News!

🛠️ How to Use this Repository

🔗 Key Contents

🙏 Acknowledgments

📊 Access to Hugging Face

🚧 BenchmarkCards Platform 🚧

Preview Screenshots

⚙️ Quick Setup Instructions

Configuration

OpenAI API Key

Customizing Generation

🛠️ CLI Tools & Search

Batch PDF Processing

Search Interface

Citation

🤝 Get Involved!

📄 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages