LLM-Eval

A flexible, extensible, and reproducible framework for evaluating LLM workflows, applications, retrieval-augmented generation pipelines, and standalone models across custom and standard datasets.

πŸš€ Key Features

  • πŸ“š Document-Based Q&A Generation: Transform your technical documentation, guides, and knowledge bases into comprehensive question-answer test catalogs
  • πŸ“Š Multi-Dimensional Evaluation Metrics:
    • βœ… Answer Relevancy: Measures how well responses address the actual question
    • 🧠 G-Eval: Sophisticated evaluation using other LLMs as judges
    • πŸ” Faithfulness: Assesses adherence to source material facts
    • 🚫 Hallucination Detection: Identifies fabricated information not present in source documents
  • πŸ“ˆ Long-Term Quality Tracking:
    • πŸ“† Temporal Performance Analysis: Monitor model degradation or improvement over time
    • πŸ”„ Regression Testing: Automatically detect when model updates negatively impact performance
    • πŸ“Š Trend Visualization: Track quality metrics across model versions with interactive charts
  • πŸ”„ Universal Compatibility: Seamlessly works with all OpenAI-compatible endpoints including local solutions like Ollama
  • 🏷️ Version Control for Q&A Catalogs: Easily track changes in your evaluation sets over time
  • πŸ“Š Comparative Analysis: Visualize performance differences between models on identical question sets
  • πŸš€ Batch Processing: Evaluate multiple models simultaneously for efficient workflows
  • πŸ”Œ Extensible Plugin System: Add new providers, metrics, and dataset generation techniques

Available Providers

  • OpenAI: Integrate and evaluate models from OpenAI's API, including support for custom base URLs, temperature, and language control
  • Azure OpenAI: Use Azure-hosted OpenAI models with deployment, API version, and custom language output support
  • C4: Connect to C4 endpoints for LLM evaluation with custom configuration and API key support
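
LLM-Eval targets OpenAI-compatible endpoints (see Key Features), which is why local solutions such as Ollama work alongside hosted providers. Purely as an illustration of the kind of endpoint a provider configuration points at (this is a plain chat completions call, not an LLM-Eval command), the following assumes a default local Ollama install serving a llama3 model:

$ curl -X POST \
  'http://localhost:11434/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Summarize what answer relevancy measures."}],
    "temperature": 0.2
  }' | jq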

πŸ“– Table of Contents

  1. πŸš€ Key Features
  2. πŸ“– Table of Contents
  3. πŸ“ Introduction
  4. Getting Started
    1. Running LLM-Eval Locally
    2. Development Setup
  5. 🀝 Contributing & Code of Conduct
  6. πŸ“œ License

πŸ“ Introduction

LLM-Eval is an open-source toolkit designed to evaluate large language model workflows, applications, retrieval-augmented generation pipelines, and standalone models. Whether you're developing a conversational agent, a summarization service, or a RAG-based search tool, LLM-Eval provides a clear, reproducible framework to test and compare performance across providers, metrics, and datasets.

Key benefits include: end-to-end evaluation of real-world applications, reproducible reports, and an extensible platform for custom metrics and datasets.

Getting Started

Running LLM-Eval Locally

To run LLM-Eval locally (for evaluation and usage, not development), use our pre-configured Docker Compose setup.

Prerequisites

  • Docker
  • Docker Compose

Quick Start - for local usage

  1. Clone the repository:

    git clone <LLM-Eval github url>
    cd llm-eval
  2. Copy and configure environment:

    cp .env.example .env
    # Edit .env to add your API keys and secrets as needed

    Required: generate the encryption keys that are currently set to CHANGEME, using the commands commented next to each of them in .env (a generic example follows this list)

  3. Enable host networking in Docker Desktop (for macOS users):

    Go to Settings -> Resources -> Network and check Enable host networking. Without this step, the frontend will not be reachable on localhost on macOS.

  4. Start the stack:

    docker compose -f docker-compose.yaml -f docker-compose.local.yaml up -d
  5. Access the application in your browser.

  6. Log in with the default user:

    Default LLM-Eval credentials: username username, password password.
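
For step 2, the authoritative key-generation commands are the ones commented next to each CHANGEME value in .env, since individual keys may expect a specific format. Purely as a generic illustration of producing a random secret:

# illustration only; prefer the commands documented in .env
openssl rand -hex 32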

To stop the app:

docker compose -f docker-compose.yaml -f docker-compose.local.yaml down
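
If the application is not reachable after starting the stack, checking the container status and logs is a quick first step. These are standard Docker Compose commands, nothing specific to LLM-Eval:

# list the services in the stack and their status
docker compose -f docker-compose.yaml -f docker-compose.local.yaml ps

# follow the combined logs of all services (Ctrl+C to stop following)
docker compose -f docker-compose.yaml -f docker-compose.local.yaml logs -f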

Development Setup

If you want to contribute to LLM-Eval or run it in a development environment, follow these steps:

Development prerequisites

  • Python 3.12
  • Poetry
  • Docker (for required services)
  • Node.js & npm (for frontend)

Installation & Local Development

git clone <LLM-Eval github url>
cd llm-eval
poetry install --only=main,dev,test
poetry self add poetry-plugin-shell
  • Install Git pre-commit hook:

    pre-commit install
  1. Start Poetry shell:

    poetry shell
  2. Copy and configure environment:

    cp .env.example .env
    # Add your API keys and secrets to .env
    # Fill CHANGEME with appropriate keys
  3. Comment out the following in .env (a sed shortcut follows this list)

    from

    # container variables
    KEYCLOAK_HOST=keycloak
    CELERY_BROKER_HOST=rabbit-mq
    PG_HOST=eval-db

    to

    # container variables
    # KEYCLOAK_HOST=keycloak
    # CELERY_BROKER_HOST=rabbit-mq
    # PG_HOST=eval-db
  4. Start databases and other services:

    docker compose up -d
  5. Start backend:

    cd backend
    uvicorn llm_eval.main:app --host 0.0.0.0 --port 8070 --reload
  6. Start Celery worker:

    cd backend
    celery -A llm_eval.tasks worker --loglevel=INFO --concurrency=4
  7. Start frontend:

    cd frontend
    npm install
    npm run dev
  8. Log in with the default user:

    Default LLM-Eval credentials: username username, password password.
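
As a shortcut for step 3, the three container variables can be commented out with a single sed call. This sketch assumes GNU sed; on macOS/BSD sed use sed -i '' instead, or simply edit .env by hand:

# prepend '# ' to the three container-only variables listed in step 3
sed -i -E 's/^(KEYCLOAK_HOST|CELERY_BROKER_HOST|PG_HOST)=/# &/' .env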

Keycloak Setup (optional; only needed if you want to override the defaults)

User access is managed through Keycloak, available at localhost:8080 (Default admin credentials: admin:admin). Select the llm-eval realm to manage users.

Acquiring tokens from Keycloak

Once Keycloak is up and running, tokens can be requested as follows.

Without a session, using the service client dev-ide (for direct backend API calls):

$ curl -X POST \
  'http://localhost:8080/realms/llm-eval/protocol/openid-connect/token' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'client_id=dev-ide' \
  -d 'client_secret=dev-ide' \
  -d 'grant_type=client_credentials' | jq

Or with a session, using the client llm-eval-ui (for frontend calls):

$ curl -X POST \
  'http://localhost:8080/realms/llm-eval/protocol/openid-connect/token' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'client_id=llm-eval-ui' \
  -d 'client_secret=llm-eval-ui' \
  -d 'username=username' \
  -d 'password=password' \
  -d 'grant_type=password' | jq
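
Either call returns a JSON payload whose access_token can then be sent to the backend as a bearer token. A minimal sketch, assuming the backend from the development setup is listening on port 8070; <endpoint> is a placeholder for a real route from the backend's API documentation:

$ TOKEN=$(curl -s -X POST \
  'http://localhost:8080/realms/llm-eval/protocol/openid-connect/token' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'client_id=dev-ide' \
  -d 'client_secret=dev-ide' \
  -d 'grant_type=client_credentials' | jq -r '.access_token')

$ curl -H "Authorization: Bearer $TOKEN" "http://localhost:8070/<endpoint>"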

🀝 Contributing & Code of Conduct

As the repository isn't yet fully prepared for contributions, we are not accepting them at the moment.

πŸ“œ License

This project is licensed under the Apache 2.0 License.
