
πŸ€— Yourbench

Dynamic Evaluation Set Generation for LLM Benchmarking [NAACL '25]

Python 3.12+ Β· Code style: Ruff Β· License: MIT Β· πŸ€— Hugging Face

🌟 Overview

Yourbench is a framework for dynamically generating evaluation sets from source documents. It addresses the limitations of static benchmarks, in particular benchmark saturation, by creating diverse, contextually rich questions tailored to specific educational levels.

πŸ”„ Process Flow

[Process flow diagram]

✨ Features

  • πŸ”„ Dynamic Generation: Create evaluation sets on-the-fly from any source documents
  • πŸ“š Semantic Chunking: Smart document splitting that maintains context and meaning
  • πŸ€” Multi-hop Questions: Generate questions that require synthesizing information across document sections
  • πŸ“Š Configurable Difficulty: Tailor questions to specific educational levels
  • πŸ” Diverse Question Types: Support for 10 different question types
  • πŸ€– Model Flexibility: Works with OpenAI and Azure OpenAI models via LiteLLM
  • πŸ“¦ Hugging Face Integration: Direct dataset publishing to Hugging Face Hub

πŸ› οΈ Requirements

πŸ“¦ Installation

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
.\venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

πŸš€ Quick Start

  1. Set up your environment:
# For OpenAI / OpenAI compatible APIs
export MODEL_BASE_URL=your_openai_url
export MODEL_API_KEY=your_openai_key

# For Azure OpenAI
export AZURE_BASE_URL=your_azure_url
export AZURE_API_KEY=your_azure_key
  2. Create a task configuration (config.yaml). See the Configuration Guide for more information, or start from the example task configuration.

  3. Run the example task (after setting your πŸ€— username / organization in the config!):

python yourbench/main.py --task-name yourbench_y1

πŸ“š Documentation

Detailed documentation is available in the docs directory.

πŸ—οΈ Pipeline Components

1. Dataset Generation

  • Processes source documents
  • Creates structured datasets
  • Supports local files and Hugging Face datasets

2. Document Summarization

  • Generates document summaries
  • Provides context for question generation
  • Uses configured language model

3. Semantic Chunking

  • Splits documents intelligently
  • Maintains semantic coherence
  • Configurable chunk sizes and overlap

4. Multi-hop Chunk Creation

  • Pairs related document chunks
  • Enables complex reasoning questions
  • Smart chunk selection

5. Question Generation

  • Single-shot questions from individual chunks
  • Multi-hop questions from chunk pairs
  • 10 different question types
  • Difficulty calibration
  • Educational level targeting

6. Dataset Management

  • Hugging Face integration
  • Local storage options
  • Dataset versioning

🎯 Question Types

  1. Analytical: Break down complex ideas
  2. Application-based: Apply concepts to scenarios
  3. Clarification: Deep dive into specifics
  4. Counterfactual: Explore alternatives
  5. Conceptual: Examine theories
  6. True-false: Verify understanding
  7. Factual: Test recall
  8. Open-ended: Encourage discussion
  9. False-premise: Correct misconceptions
  10. Edge-case: Test boundaries
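
For readers wiring these types into their own tooling, they map naturally onto an enum; this class is illustrative and not part of the package:

from enum import Enum

class QuestionType(str, Enum):
    ANALYTICAL = "analytical"
    APPLICATION_BASED = "application-based"
    CLARIFICATION = "clarification"
    COUNTERFACTUAL = "counterfactual"
    CONCEPTUAL = "conceptual"
    TRUE_FALSE = "true-false"
    FACTUAL = "factual"
    OPEN_ENDED = "open-ended"
    FALSE_PREMISE = "false-premise"
    EDGE_CASE = "edge-case"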

βš™οΈ Configuration

Example configuration:

task_name: yourbench_y1
configurations:
  push_to_huggingface: true
  set_hf_repo_visibility: public
  hf_organization: your-org
  model:
    model_name: gpt-4
    model_type: openai
    max_concurrent_requests: 512

selected_choices:
  generate_dataset:
    execute: true
    files_directory: examples/data
    dataset_name: my_dataset
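
The file is plain YAML, so it can be inspected with PyYAML; a minimal sketch, where the key paths match the example above and the printout is illustrative:

import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

print(config["task_name"])                              # yourbench_y1
print(config["configurations"]["model"]["model_name"])  # gpt-4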

See the Configuration Guide for detailed options.

🧰 Development

We use:

  • Ruff for code formatting and linting
  • pytest for testing

🀝 Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Install development dependencies
  4. Make your changes
  5. Run tests and ensure code style compliance
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments
