BENCHMARKCARDS


BenchmarkCards offer a standardized way to document LLM benchmarks clearly and transparently. Inspired by Model Cards and Datasheets, BenchmarkCards help researchers and practitioners understand exactly what benchmarks test, how they relate to real-world risks, and how to interpret their results responsibly.

Who is this for? AI researchers, data scientists, auditors, policymakers, and anyone concerned with responsible AI deployment.

Paper link: https://arxiv.org/abs/2410.12974

🎉 Exciting News!

Our paper BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks has been accepted to NeurIPS 2025 (Datasets and Benchmarks Track) – yay!

πŸ› οΈ How to Use this Repository

πŸ”— Key Contents

  • πŸ“ BenchmarkCards/ – Generated BenchmarkCards in Markdown. Filenames prefixed with ⭐ indicate that the benchmark author reviewed and approved the card.
  • πŸ“ BenchmarkCards_JSON/ – JSON versions of the BenchmarkCards for programmatic access.
  • πŸ–₯️ platform/ – Source code for the upcoming BenchmarkCards Platform.
  • πŸ“Έ screenshots/ – Screenshots illustrating the platform interface.
  • πŸ“‹ BenchmarkCard_Template.md – A ready‑to‑use template for creating new BenchmarkCards.
  • πŸ“Š AI_Risk_Atlas.md – Maps benchmarks to IBM Atlas AI risk categories and associated benchmarks.
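
The JSON cards in BenchmarkCards_JSON/ can be read with standard tooling. A minimal sketch (the directory name comes from this repository; the cards' exact schema is not shown here, so only top-level loading is assumed):

```python
import json
from pathlib import Path

def load_cards(directory="BenchmarkCards_JSON"):
    """Load every BenchmarkCard JSON file from a repository checkout.

    The card schema may vary, so this returns raw dicts for inspection.
    """
    cards = []
    for path in sorted(Path(directory).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            cards.append(json.load(f))
    return cards
```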

🙏 Acknowledgments

We sincerely thank all benchmark authors who provided feedback on and approval of the BenchmarkCards in this repository. Benchmarks approved by their original authors are marked with a ⭐ in the filename. Your collaboration is essential for making LLM evaluation more transparent, accurate, and useful. Thank you!

📊 Access on Hugging Face

🤗 View the BenchmarkCards Dataset on Hugging Face

Our complete collection of over 4,000 BenchmarkCards is now available on Hugging Face for easy programmatic access and integration into your research workflows.

  • Formats Available: JSON, Markdown
  • Regular Updates: We continuously add new benchmarks, so check back often!

Load the dataset with the Hugging Face datasets library:

from datasets import load_dataset

dataset = load_dataset("ASokol/BenchmarkCards")

Note: This dataset is actively maintained and regularly updated with new benchmarks.


🚧 BenchmarkCards Platform 🚧

We're excited to introduce our new BenchmarkCards Platform - an automated tool designed to streamline the process of generating benchmark cards from research papers!

Status: 🏗️ Currently under construction - coming soon!

The platform will allow users to:

  • Upload benchmark papers in PDF format
  • Automatically extract key information
  • Generate structured BenchmarkCards
  • Download cards in JSON format for easy integration
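
As a rough illustration of that upload → extract → generate → download flow, here is a pure-Python sketch; every function name and field below is a hypothetical placeholder and does not come from the platform's code:

```python
import json

def extract_text(pdf_bytes: bytes) -> str:
    # Placeholder: a real implementation would parse the PDF
    # (the platform's pdf_extractor module handles this step).
    return pdf_bytes.decode("utf-8", errors="ignore")

def generate_card(text: str) -> dict:
    # Placeholder: the platform would use an LLM to fill these fields.
    return {
        "title": text.splitlines()[0] if text else "",
        "summary": text[:200],
    }

def card_to_json(card: dict) -> str:
    # Serialize the structured card for download.
    return json.dumps(card, indent=2)
```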

Preview Screenshots

Here's a sneak peek at what the platform will look like:

Main Interface: BenchmarkCards Platform Interface

Generated BenchmarkCard Example: Generated BenchmarkCard

Platform Structure:

  • app.py: Main application entry point.
  • config.py: Configuration settings.
  • requirements.txt: Project dependencies.
  • static/: Static files (CSS, JS, images).
  • src/: Core source code modules:
    • models.py: Pydantic data models.
    • pdf_extractor.py: PDF text extraction functionality.
    • ai_service.py: Integration with OpenAI APIs.
    • templates.py: HTML templates management.
    • markdown_converter.py: Convert content to Markdown.
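
For orientation, a card model like those in src/models.py might look roughly as follows. This is a stdlib dataclass sketch with hypothetical field names; the actual file uses Pydantic and its fields and validation will differ:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class BenchmarkCard:
    # Hypothetical shape of a card record, for illustration only.
    name: str
    description: str = ""
    risks: list = field(default_factory=list)

    def to_dict(self) -> dict:
        return asdict(self)
```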

βš™οΈ Quick Setup Instructions

# 1. Clone the repository
git clone https://github.com/SokolAnn/BenchmarkCards.git
cd BenchmarkCards

# 2. Set up your environment
pip install -r requirements.txt

# 3. Add your API key
# Open the config.py file and replace the placeholder with your OpenAI API key
# Example:
# OPENAI_API_KEY = "your-openai-api-key-here"

# 4. Run the app
python app.py

# 5. Access the app
# Open your browser and go to:
# http://localhost:8000/

Configuration

OpenAI API Key

To use the AI processing features, you need to set your OpenAI API key.

  1. Environment Variable: Set OPENAI_API_KEY in your environment.
    export OPENAI_API_KEY="sk-..."
    # Windows PowerShell
    $env:OPENAI_API_KEY="sk-..."
  2. Config File: Alternatively, you can paste your key directly into platform/config.py.
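
A minimal sketch of that precedence (environment variable first, config-file value as fallback); the function name is illustrative, not the actual code in platform/config.py:

```python
import os

def resolve_api_key(config_file_key="your-openai-api-key-here"):
    # Prefer the environment variable; fall back to the value
    # pasted into the config file. The default here is a
    # placeholder, not a real key.
    return os.environ.get("OPENAI_API_KEY") or config_file_key
```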

Customizing Generation

The AI generation logic is located in platform/src/ai_service.py. You can modify this file to:

  • Change the system prompt.
  • Use a different model (default is gpt-4o-mini).
  • Specify a different API provider.
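
To illustrate the first two customizations, here is a sketch of how the request could be parameterized; the prompt text and function name are hypothetical, not the actual code in src/ai_service.py:

```python
def build_request(paper_text: str,
                  model: str = "gpt-4o-mini",
                  system_prompt: str = "Extract a BenchmarkCard from this paper."):
    # Collect the chat-completion parameters so the model and
    # system prompt can be swapped in one place.
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": paper_text},
        ],
    }

# With the official OpenAI client, these parameters would be passed as:
#   from openai import OpenAI
#   client = OpenAI()
#   response = client.chat.completions.create(**build_request(text))
```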

πŸ› οΈ CLI Tools & Search

We now provide command-line tools for batch processing and a search interface:

Batch PDF Processing

Process multiple PDFs into BenchmarkCards (JSON & Markdown):

python process_pdfs.py path/to/pdf_directory --output_dir output

Search Interface

  1. Build the search index from existing JSON cards:
python build_search_index.py BenchmarkCards_JSON
  2. Start the platform to access the search page at /search:
cd platform
python app.py

Then visit: http://localhost:8000/search


Citation

If you use this work in your research, please cite:

@misc{sokol2025benchmarkcardsstandardizeddocumentationlarge,
      title={BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks}, 
      author={Anna Sokol and Elizabeth Daly and Michael Hind and David Piorkowski and Xiangliang Zhang and Nuno Moniz and Nitesh Chawla},
      year={2025},
      eprint={2410.12974},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.12974}, 
}

🤝 Get Involved!

Interested in contributing to the BenchmarkCards initiative? Explore the repository, fork it, and open issues to suggest improvements or new benchmarks. Let's collaborate and shape the future of LLM benchmarking!


📄 License

All source code in this repository is licensed under the MIT License.
BenchmarkCard content (Markdown and JSON files) is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
