LLM-Eval

A flexible, extensible, and reproducible framework for evaluating LLM workflows, applications, retrieval-augmented generation pipelines, and standalone models across custom and standard datasets.

πŸš€ Key Features

  • πŸ“š Document-Based Q&A Generation: Transform your technical documentation, guides, and knowledge bases into comprehensive question-answer test catalogs
  • πŸ“Š Multi-Dimensional Evaluation Metrics:
    • βœ… Answer Relevancy: Measures how well responses address the actual question
    • 🧠 G-Eval: Sophisticated evaluation using other LLMs as judges
    • πŸ” Faithfulness: Assesses adherence to source material facts
    • 🚫 Hallucination Detection: Identifies fabricated information not present in source documents
  • πŸ“ˆ Long-Term Quality Tracking:
    • πŸ“† Temporal Performance Analysis: Monitor model degradation or improvement over time
    • πŸ”„ Regression Testing: Automatically detect when model updates negatively impact performance
    • πŸ“Š Trend Visualization: Track quality metrics across model versions with interactive charts
  • πŸ”„ Universal Compatibility: Seamlessly works with all OpenAI-compatible endpoints including local solutions like Ollama
  • 🏷️ Version Control for Q&A Catalogs: Easily track changes in your evaluation sets over time
  • πŸ“Š Comparative Analysis: Visualize performance differences between models on identical question sets
  • πŸš€ Batch Processing: Evaluate multiple models simultaneously for efficient workflows
  • πŸ”Œ Extensible Plugin System: Add new providers, metrics, and dataset generation techniques

Available Providers

  • OpenAI: Integrate and evaluate models from OpenAI's API, including support for custom base URLs, temperature, and language control
  • Azure OpenAI: Use Azure-hosted OpenAI models with deployment, API version, and custom language output support
  • C4: Connect to C4 endpoints for LLM evaluation with custom configuration and API key support
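
LLM-Eval targets OpenAI-compatible endpoints (see Key Features), which is why local solutions such as Ollama work alongside hosted providers. Purely as an illustration of the kind of endpoint a provider configuration points at (this is a plain chat completions call, not an LLM-Eval command), the following assumes a default local Ollama install serving a llama3 model:

$ curl -X POST \
  'http://localhost:11434/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Summarize what answer relevancy measures."}],
    "temperature": 0.2
  }' | jq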

πŸ“– Table of Contents

  1. πŸš€ Key Features
  2. πŸ“– Table of Contents
  3. πŸ“ Introduction
  4. Getting Started
    1. Running LLM-Eval Locally
    2. Development Setup
  5. 🀝 Contributing & Code of Conduct
  6. πŸ“œ License

πŸ“ Introduction

LLM-Eval is an open-source toolkit designed to evaluate large language model workflows, applications, retrieval-augmented generation pipelines, and standalone models. Whether you're developing a conversational agent, a summarization service, or a RAG-based search tool, LLM-Eval provides a clear, reproducible framework to test and compare performance across providers, metrics, and datasets.

Key benefits include: end-to-end evaluation of real-world applications, reproducible reports, and an extensible platform for custom metrics and datasets.

Getting Started

Running LLM-Eval Locally

To run LLM-Eval locally (for evaluation and usage, not development), use our pre-configured Docker Compose setup.

Prerequisites

  • Docker
  • Docker Compose

Quick Start - for local usage

  1. Clone the repository:

    git clone <LLM-Eval github url>
    cd llm-eval
  2. Copy and configure environment:

    cp .env.example .env
    # Edit .env to add your API keys and secrets as needed

    Required: generate the encryption keys that are currently set to CHANGEME, using the commands commented next to each of them in .env (a generic example follows this list)

  3. Enable host networking in Docker Desktop (for macOS users):

    Go to Settings -> Resources -> Network and check Enable host networking. Without this step, the frontend will not be reachable on localhost on macOS.

  4. Start the stack:

    docker compose -f docker-compose.yaml -f docker-compose.local.yaml up -d
  5. Access the application in your browser.

  6. Log in with the default user:

    Default LLM-Eval credentials: username username, password password.
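
For step 2, the authoritative key-generation commands are the ones commented next to each CHANGEME value in .env, since individual keys may expect a specific format. Purely as a generic illustration of producing a random secret:

# illustration only; prefer the commands documented in .env
openssl rand -hex 32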

To stop the app:

docker compose -f docker-compose.yaml -f docker-compose.local.yaml down
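
If the application is not reachable after starting the stack, checking the container status and logs is a quick first step. These are standard Docker Compose commands, nothing specific to LLM-Eval:

# list the services in the stack and their status
docker compose -f docker-compose.yaml -f docker-compose.local.yaml ps

# follow the combined logs of all services (Ctrl+C to stop following)
docker compose -f docker-compose.yaml -f docker-compose.local.yaml logs -f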

Development Setup

If you want to contribute to LLM-Eval or run it in a development environment, follow these steps:

Development prerequisites

  • Python 3.12
  • Poetry
  • Docker (for required services)
  • Node.js & npm (for frontend)

Installation & Local Development

git clone <LLM-Eval github url>
cd llm-eval
poetry install --only=main,dev,test
poetry self add poetry-plugin-shell
  • Install Git pre-commit hook:

    pre-commit install
  1. Start Poetry shell:

    poetry shell
  2. Copy and configure environment:

    cp .env.example .env
    # Add your API keys and secrets to .env
    # Fill CHANGEME with appropriate keys
  3. Comment out the following in .env (a sed shortcut follows this list)

    from

    # container variables
    KEYCLOAK_HOST=keycloak
    CELERY_BROKER_HOST=rabbit-mq
    PG_HOST=eval-db

    to

    # container variables
    # KEYCLOAK_HOST=keycloak
    # CELERY_BROKER_HOST=rabbit-mq
    # PG_HOST=eval-db
  4. Start databases and other services:

    docker compose up -d
  5. Start backend:

    cd backend
    uvicorn llm_eval.main:app --host 0.0.0.0 --port 8070 --reload
  6. Start Celery worker:

    cd backend
    celery -A llm_eval.tasks worker --loglevel=INFO --concurrency=4
  7. Start frontend:

    cd frontend
    npm install
    npm run dev
  8. Log in with the default user:

    Default LLM-Eval credentials: username username, password password.
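
As a shortcut for step 3, the three container variables can be commented out with a single sed call. This sketch assumes GNU sed; on macOS/BSD sed use sed -i '' instead, or simply edit .env by hand:

# prepend '# ' to the three container-only variables listed in step 3
sed -i -E 's/^(KEYCLOAK_HOST|CELERY_BROKER_HOST|PG_HOST)=/# &/' .env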

Keycloak Setup (optional; only needed if you want to override the defaults)

User access is managed through Keycloak, available at localhost:8080 (Default admin credentials: admin:admin). Select the llm-eval realm to manage users.

Acquiring tokens from Keycloak

Once Keycloak is up and running, tokens can be requested as follows.

Without a session, using the service client dev-ide (for direct backend API calls):

$ curl -X POST \
  'http://localhost:8080/realms/llm-eval/protocol/openid-connect/token' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'client_id=dev-ide' \
  -d 'client_secret=dev-ide' \
  -d 'grant_type=client_credentials' | jq

Or with a session, using the client llm-eval-ui (for frontend calls):

$ curl -X POST \
  'http://localhost:8080/realms/llm-eval/protocol/openid-connect/token' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'client_id=llm-eval-ui' \
  -d 'client_secret=llm-eval-ui' \
  -d 'username=username' \
  -d 'password=password' \
  -d 'grant_type=password' | jq
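
Either call returns a JSON payload whose access_token can then be sent to the backend as a bearer token. A minimal sketch, assuming the backend from the development setup is listening on port 8070; <endpoint> is a placeholder for a real route from the backend's API documentation:

$ TOKEN=$(curl -s -X POST \
  'http://localhost:8080/realms/llm-eval/protocol/openid-connect/token' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'client_id=dev-ide' \
  -d 'client_secret=dev-ide' \
  -d 'grant_type=client_credentials' | jq -r '.access_token')

$ curl -H "Authorization: Bearer $TOKEN" "http://localhost:8070/<endpoint>"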

🀝 Contributing & Code of Conduct

As the repository isn't yet fully prepared for contributions, we are not accepting them at the moment.

πŸ“œ License

This project is licensed under the Apache 2.0 License.
