A Python package + web dashboard for evaluating RAG (Retrieval-Augmented Generation) systems with dual-mode support: Function Call (LangChain) and HTTP Client.
┌──────────────────────────────────────────────────────────────────────┐
│                        RAG EVAL PIPELINE FLOW                        │
└──────────────────────────────────────────────────────────────────────┘
                           ┌─────────────────┐
                           │   INPUT LAYER   │
                           └────────┬────────┘
                                    │
                   ┌────────────────┼────────────────┐
                   │                │                │
           ┌───────▼───────┐  ┌─────▼─────┐  ┌───────▼───────┐
           │ Dataset JSON  │  │  Config   │  │    Adapter    │
           │ [{question,   │  │   YAML    │  │   Selection   │
           │  ground_truth,│  │  (thres-  │  │               │
           │  contexts}]   │  │  holds)   │  │   HTTP │ FN   │
           └───────┬───────┘  └─────┬─────┘  └───────┬───────┘
                   │                │                │
                   └────────────────┼────────────────┘
                                    │
                           ┌────────▼────────┐
                           │  ADAPTER LAYER  │
                           └────────┬────────┘
                                    │
                   ┌────────────────┴────────────────┐
                   │                                 │
          ┌────────▼────────┐               ┌────────▼────────┐
          │   HTTPAdapter   │               │ FunctionAdapter │
          │                 │               │                 │
          │  POST /query    │               │  chain.invoke() │
          │  Bearer Auth    │               │  LangChain QA   │
          │  Retry (3x)     │               │  Direct call    │
          │  Backoff on 5xx │               │                 │
          └────────┬────────┘               └────────┬────────┘
                   │                                 │
                   └────────────────┬────────────────┘
                                    │
                           ┌────────▼────────┐
                           │  {answer, ctx}  │
                           │  per question   │
                           └────────┬────────┘
                                    │
                           ┌────────▼────────┐
                           │ EVALUATOR LAYER │
                           │                 │
                           │  Try RAGAS ──┐  │
                           │      │       │  │
                           │   fallback   │  │
                           │      │       │  │
                           │   LLM Judge  │  │
                           │   (GPT-4o)   │  │
                           └────────┬────────┘
                                    │
                           ┌────────▼────────┐
                           │   EVAL REPORT   │
                           │                 │
                           │  faithfulness   │
                           │  relevancy      │
                           │  context_recall │
                           │  context_prec   │
                           │  halluc. rate   │
                           └────────┬────────┘
                                    │
                   ┌────────────────┼────────────────┐
                   │                │                │
           ┌───────▼───────┐  ┌─────▼─────┐  ┌───────▼───────┐
           │ Reporter      │  │  Thresh-  │  │ LangSmith     │
           │               │  │  olds     │  │ (optional)    │
           │ JSON file     │  │  Check    │  │               │
           │ MD summary    │  │           │  │ Experiment    │
           └───────────────┘  │ PASS/FAIL │  │ push          │
                              └─────┬─────┘  └───────────────┘
                                    │
                           ┌────────▼────────┐
                           │    EXIT CODE    │
                           │    0 = pass     │
                           │    1 = fail     │
                           └─────────────────┘
rag_eval/
├── __init__.py # Package exports
├── __main__.py # CLI entrypoint (python -m rag_eval)
├── adapters.py # RAGAdapter ABC, FunctionAdapter, HTTPAdapter
├── dataset.py # load_dataset(), generate_dataset()
├── evaluator.py # evaluate() → EvalReport
├── reporter.py # JSON, Markdown, LangSmith reporters
└── thresholds.py # load_config(), check()
backend/server.py # FastAPI dashboard API
frontend/src/App.js # React monitoring dashboard
tests/ # pytest unit + integration tests
.github/workflows/ # CI/CD pipeline
cd /app
python3.12 -m venv backend/.venv
source backend/.venv/bin/activate
python3.12 -m pip install -r backend/requirements.txt

Use Python 3.11 or 3.12 for the smoothest install experience. Some pinned dependencies
in backend/requirements.txt may not yet publish compatible wheels for Python 3.14+.
export OPENAI_API_KEY=sk-your-key-here
# Optional: LangSmith
export LANGSMITH_API_KEY=ls-your-key-here

HTTP Mode (test any RAG service exposing a /query endpoint):
python3.12 -m rag_eval \
--mode http \
--url https://my-rag-service.com \
--api-key $RAG_API_KEY \
--dataset ./tests/fixtures/dataset.json \
--config ./eval_config.yaml

Function Mode (test a LangChain chain directly):
python3.12 -m rag_eval \
--mode function \
--module myapp.chain \
--attr rag_chain \
--dataset ./tests/fixtures/dataset.json \
--config ./eval_config.yaml

The dashboard runs at http://localhost:3000 (frontend), backed by http://localhost:8001 (FastAPI backend).
- Dashboard: Overview metrics, charts, recent runs
- Eval Runs: List runs, start new evaluations, view details
- Datasets: Upload/manage evaluation datasets (JSON)
- Config: Set pass/fail thresholds via sliders
Run the backend API (from repo root):
source backend/.venv/bin/activate
python3.12 -m uvicorn backend.server:app --reload --port 8001

# Unit tests (no API key needed)
python3.12 -m pytest tests/test_adapters.py tests/test_thresholds.py tests/test_dataset.py -v
# Integration test (requires OPENAI_API_KEY)
python3.12 -m pytest tests/test_integration.py -v

| Flag | Description |
|---|---|
| --mode | http or function (required) |
| --url | Target URL for HTTP mode |
| --api-key | Bearer token for HTTP auth |
| --module | Python module path for function mode |
| --attr | Attribute name in module |
| --dataset | Path to dataset JSON file (required) |
| --config | Path to eval_config.yaml (required) |
| --name | Name for this evaluation run |
| --output-dir | Directory for output files (default: .) |
| --langsmith | Push results to LangSmith |
| --langsmith-project | LangSmith project name |
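
The flags above map naturally onto an argparse parser. The following is a minimal sketch of how such a CLI could be declared, not the package's actual `__main__.py`. The helper names `build_parser` and `parse_args`, the treatment of `--langsmith` as a boolean switch, and the per-mode validation are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="rag_eval")
    parser.add_argument("--mode", choices=["http", "function"], required=True)
    parser.add_argument("--url", help="Target URL for HTTP mode")
    parser.add_argument("--api-key", help="Bearer token for HTTP auth")
    parser.add_argument("--module", help="Python module path for function mode")
    parser.add_argument("--attr", help="Attribute name in module")
    parser.add_argument("--dataset", required=True, help="Path to dataset JSON file")
    parser.add_argument("--config", required=True, help="Path to eval_config.yaml")
    parser.add_argument("--name", help="Name for this evaluation run")
    parser.add_argument("--output-dir", default=".", help="Directory for output files")
    # Assumed to be a boolean switch; the source only says "Push results to LangSmith".
    parser.add_argument("--langsmith", action="store_true")
    parser.add_argument("--langsmith-project", help="LangSmith project name")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed validation: each mode needs its own target flags.
    if args.mode == "http" and not args.url:
        parser.error("--url is required when --mode is http")
    if args.mode == "function" and not (args.module and args.attr):
        parser.error("--module and --attr are required when --mode is function")
    return args
```

With this sketch, omitting `--url` in HTTP mode exits with argparse's standard usage error instead of failing later at request time.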
[
{
"question": "What is RAG?",
"ground_truth": "RAG combines retrieval with generation.",
"relevant_contexts": [
"RAG is a technique that enhances LLMs..."
]
}
]

faithfulness_min: 0.85
answer_relevancy_min: 0.80
context_recall_min: 0.75
context_precision_min: 0.75
hallucination_rate_max: 0.15

To evaluate a new RAG service:
- HTTP Mode: Ensure your service exposes a POST /query endpoint that accepts {"question": "..."} and returns {"answer": "...", "contexts": ["..."]}. Then run:

      python -m rag_eval --mode http --url https://your-service.com --dataset dataset.json --config eval_config.yaml

- Function Mode: Define your LangChain RetrievalQA chain as a module-level attribute:

      # myapp/chain.py
      from langchain.chains import RetrievalQA

      rag_chain = RetrievalQA.from_chain_type(...)

  Then run:

      python -m rag_eval --mode function --module myapp.chain --attr rag_chain --dataset dataset.json --config eval_config.yaml

- Custom Adapter: Extend RAGAdapter in rag_eval/adapters.py:

      from rag_eval.adapters import RAGAdapter

      class MyCustomAdapter(RAGAdapter):
          def query(self, question: str) -> dict:
              # Your logic here
              return {"answer": "...", "contexts": ["..."]}
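
As a slightly fuller illustration of the Custom Adapter option, the sketch below wraps an arbitrary transport callable with retries and exponential backoff on 5xx responses, similar in spirit to the behavior the pipeline diagram attributes to HTTPAdapter. It rests on several assumptions: the `RAGAdapter` class here is a local stand-in so the snippet runs on its own (in practice you would import it from `rag_eval.adapters`), and `RetryingAdapter`, `send`, `max_retries`, `base_delay`, and `sleep` are hypothetical names, not part of the package.

```python
import time
from abc import ABC, abstractmethod
from typing import Callable, Tuple


class RAGAdapter(ABC):
    """Local stand-in for rag_eval.adapters.RAGAdapter (assumed interface)."""

    @abstractmethod
    def query(self, question: str) -> dict:
        ...


class RetryingAdapter(RAGAdapter):
    """Hypothetical adapter: retries a transport call on 5xx status codes."""

    def __init__(self, send: Callable[[dict], Tuple[int, dict]],
                 max_retries: int = 3, base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep):
        self.send = send              # returns (status_code, json_body)
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.sleep = sleep            # injectable so tests need not wait

    def query(self, question: str) -> dict:
        for attempt in range(self.max_retries + 1):
            status, body = self.send({"question": question})
            if status < 500:
                break
            if attempt < self.max_retries:
                # Exponential backoff: base_delay, then 2x, 4x, ...
                self.sleep(self.base_delay * (2 ** attempt))
        if status >= 400:
            raise RuntimeError(f"RAG service returned HTTP {status}")
        # Normalize to the {answer, contexts} shape described above.
        return {"answer": body["answer"], "contexts": list(body.get("contexts", []))}
```

Passing the transport in as a callable keeps the retry logic independent of any particular HTTP client, so it can be exercised with a fake `send` function in unit tests.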
The included GitHub Actions workflow (.github/workflows/rag_eval.yml) runs on every PR to main:
- Installs dependencies
- Runs CLI in HTTP mode against your staging URL
- Uploads eval_report.json as an artifact
- Posts a score table as a PR comment
- Fails the check if any metric is below threshold
| Secret | Description |
|---|---|
| OPENAI_API_KEY | OpenAI API key for RAGAS/LLM evaluation |
| LANGSMITH_API_KEY | LangSmith API key (optional) |
| RAG_API_KEY | Bearer token for your staging RAG service |

| Variable | Description |
|---|---|
| RAG_STAGING_URL | URL of your staging RAG service |
| Metric | Description | Range |
|---|---|---|
| Faithfulness | Factual consistency of answer with context | 0–1 |
| Answer Relevancy | How relevant the answer is to the question | 0–1 |
| Context Recall | Coverage of ground truth by retrieved contexts | 0–1 |
| Context Precision | Relevance of retrieved contexts | 0–1 |
| Hallucination Rate | 1 - faithfulness (lower is better) | 0–1 |
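
To make the link between these metrics and the pass/fail gate concrete, here is a minimal sketch of a threshold check using the config keys from eval_config.yaml. It is not the package's own thresholds.check(): the function name `check_thresholds` and its return shape are assumptions, the scores are illustrative values rather than real results, and hallucination rate is derived as 1 - faithfulness per the table.

```python
def check_thresholds(scores: dict, config: dict) -> list:
    """Return a list of human-readable failures; an empty list means PASS."""
    # Derive hallucination rate as defined in the metrics table.
    scores = dict(scores)
    scores.setdefault("hallucination_rate", 1.0 - scores["faithfulness"])

    failures = []
    for key, limit in config.items():
        if key.endswith("_min"):
            metric = key[: -len("_min")]
            if scores[metric] < limit:
                failures.append(f"{metric}={scores[metric]:.2f} < {limit}")
        elif key.endswith("_max"):
            metric = key[: -len("_max")]
            if scores[metric] > limit:
                failures.append(f"{metric}={scores[metric]:.2f} > {limit}")
    return failures


config = {
    "faithfulness_min": 0.85,
    "answer_relevancy_min": 0.80,
    "context_recall_min": 0.75,
    "context_precision_min": 0.75,
    "hallucination_rate_max": 0.15,
}
# Illustrative scores, not real results.
scores = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.78,
    "context_recall": 0.80,
    "context_precision": 0.76,
}
failures = check_thresholds(scores, config)
exit_code = 1 if failures else 0  # mirrors the documented 0 = pass, 1 = fail
```

In this example only answer relevancy falls below its minimum, so the run would be reported as a failure.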
- langchain, langchain-openai, langchain-community: LangChain framework
- chromadb: Vector store for integration tests
- ragas: RAG evaluation metrics
- langsmith: Experiment tracking
- httpx: HTTP client with retry support
- pyyaml: Config file parsing
- openai: GPT-4o for dataset generation and LLM-as-judge
- pytest: Testing framework