RAG Eval — Agentic RAG Evaluation Pipeline

A Python package and web dashboard for evaluating RAG (Retrieval-Augmented Generation) systems. It supports two modes: Function Call (invoking a LangChain chain directly) and HTTP Client (calling a remote service).

Flow Diagram

┌──────────────────────────────────────────────────────────────────────┐
│                    RAG EVAL PIPELINE FLOW                            │
└──────────────────────────────────────────────────────────────────────┘

                    ┌─────────────────┐
                    │   INPUT LAYER   │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
     ┌────────▼──────┐ ┌────▼─────┐ ┌──────▼──────┐
     │  Dataset JSON  │ │  Config  │ │  Adapter    │
     │  [{question,   │ │  YAML    │ │  Selection  │
     │   ground_truth,│ │ (thres-  │ │             │
     │   contexts}]   │ │  holds)  │ │  HTTP  │ FN │
     └────────┬───────┘ └────┬─────┘ └──────┬──────┘
              │              │              │
              └──────────────┼──────────────┘
                             │
                    ┌────────▼────────┐
                    │  ADAPTER LAYER  │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │                             │
     ┌────────▼────────┐          ┌─────────▼────────┐
     │  HTTPAdapter     │          │  FunctionAdapter  │
     │                  │          │                   │
     │  POST /query     │          │  chain.invoke()   │
     │  Bearer Auth     │          │  LangChain QA     │
     │  Retry (3x)     │          │  Direct call      │
     │  Backoff on 5xx │          │                   │
     └────────┬─────────┘          └─────────┬────────┘
              │                             │
              └──────────────┬──────────────┘
                             │
                    ┌────────▼────────┐
                    │  {answer, ctx}  │
                    │  per question   │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │ EVALUATOR LAYER │
                    │                 │
                    │  Try RAGAS ──┐  │
                    │      │       │  │
                    │  fallback    │  │
                    │      │       │  │
                    │  LLM Judge   │  │
                    │  (GPT-4o)    │  │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │   EVAL REPORT   │
                    │                 │
                    │  faithfulness   │
                    │  relevancy      │
                    │  context_recall │
                    │  context_prec   │
                    │  halluc. rate   │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
     ┌────────▼──────┐ ┌────▼─────┐ ┌──────▼──────┐
     │  Reporter     │ │ Thresh-  │ │  LangSmith  │
     │               │ │  olds    │ │  (optional) │
     │  JSON file    │ │  Check   │ │             │
     │  MD summary   │ │          │ │  Experiment │
     └───────────────┘ │ PASS/FAIL│ │  push       │
                       └──────────┘ └─────────────┘
                             │
                    ┌────────▼────────┐
                    │   EXIT CODE     │
                    │  0 = pass       │
                    │  1 = fail       │
                    └─────────────────┘

Architecture

rag_eval/
├── __init__.py        # Package exports
├── __main__.py        # CLI entrypoint (python -m rag_eval)
├── adapters.py        # RAGAdapter ABC, FunctionAdapter, HTTPAdapter
├── dataset.py         # load_dataset(), generate_dataset()
├── evaluator.py       # evaluate() → EvalReport
├── reporter.py        # JSON, Markdown, LangSmith reporters
└── thresholds.py      # load_config(), check()

backend/server.py      # FastAPI dashboard API
frontend/src/App.js    # React monitoring dashboard
tests/                 # pytest unit + integration tests
.github/workflows/     # CI/CD pipeline

Quick Start

1. Install Dependencies

cd /app
python3.12 -m venv backend/.venv
source backend/.venv/bin/activate
python3.12 -m pip install -r backend/requirements.txt

Python version note

Use Python 3.11 or 3.12 for the smoothest install experience. Some pinned dependencies in backend/requirements.txt may not yet publish compatible wheels for Python 3.14+.

2. Set Environment Variables

export OPENAI_API_KEY=sk-your-key-here

# Optional: LangSmith
export LANGSMITH_API_KEY=ls-your-key-here

# Optional: bearer token for the target service (HTTP mode)
export RAG_API_KEY=your-token-here

3. Run via CLI

HTTP Mode — test any RAG service exposing a /query endpoint:

python3.12 -m rag_eval \
  --mode http \
  --url https://my-rag-service.com \
  --api-key "$RAG_API_KEY" \
  --dataset ./tests/fixtures/dataset.json \
  --config ./eval_config.yaml

Function Mode — test a LangChain chain directly:

python3.12 -m rag_eval \
  --mode function \
  --module myapp.chain \
  --attr rag_chain \
  --dataset ./tests/fixtures/dataset.json \
  --config ./eval_config.yaml

4. Web Dashboard

The dashboard runs at http://localhost:3000 (frontend) backed by http://localhost:8001 (FastAPI API).

Dashboard: Overview metrics, charts, recent runs
Eval Runs: List runs, start new evaluations, view details
Datasets: Upload/manage evaluation datasets (JSON)
Config: Set pass/fail thresholds via sliders

Run the backend API (from repo root):

source backend/.venv/bin/activate
python3.12 -m uvicorn backend.server:app --reload --port 8001

5. Run Tests

# Unit tests (no API key needed)
python3.12 -m pytest tests/test_adapters.py tests/test_thresholds.py tests/test_dataset.py -v

# Integration test (requires OPENAI_API_KEY)
python3.12 -m pytest tests/test_integration.py -v

CLI Options

| Flag | Description |
| --- | --- |
| `--mode` | `http` or `function` (required) |
| `--url` | Target URL for HTTP mode |
| `--api-key` | Bearer token for HTTP auth |
| `--module` | Python module path for function mode |
| `--attr` | Attribute name in module |
| `--dataset` | Path to dataset JSON file (required) |
| `--config` | Path to eval_config.yaml (required) |
| `--name` | Name for this evaluation run |
| `--output-dir` | Directory for output files (default: `.`) |
| `--langsmith` | Push results to LangSmith |
| `--langsmith-project` | LangSmith project name |

Dataset Format

[
  {
    "question": "What is RAG?",
    "ground_truth": "RAG combines retrieval with generation.",
    "relevant_contexts": [
      "RAG is a technique that enhances LLMs..."
    ]
  }
]
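Before spending API calls on an evaluation run, it can help to check that a dataset file has this shape. The following is a minimal sketch, not the package's own `load_dataset()`; the function name `validate_records` and the rule that all three keys are mandatory are assumptions made for illustration.

```python
import json

# Keys taken from the dataset format shown above.
REQUIRED_KEYS = ("question", "ground_truth", "relevant_contexts")


def validate_records(records):
    """Return a list of problems found; an empty list means the data looks well-formed."""
    if not isinstance(records, list):
        return ["top-level JSON value must be a list"]
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: expected an object")
            continue
        for key in REQUIRED_KEYS:
            if key not in rec:
                problems.append(f"record {i}: missing '{key}'")
        contexts = rec.get("relevant_contexts", [])
        if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
            problems.append(f"record {i}: 'relevant_contexts' must be a list of strings")
    return problems


sample = json.loads(
    '[{"question": "What is RAG?",'
    ' "ground_truth": "RAG combines retrieval with generation.",'
    ' "relevant_contexts": ["RAG is a technique that enhances LLMs..."]}]'
)
print(validate_records(sample))
```

For a file on disk, the same function could be applied to the result of `json.load()`.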

Threshold Configuration (`eval_config.yaml`)

faithfulness_min: 0.85
answer_relevancy_min: 0.80
context_recall_min: 0.75
context_precision_min: 0.75
hallucination_rate_max: 0.15
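To illustrate how these keys might be interpreted, here is a minimal sketch of a threshold check. It is not the package's `check()` function. It assumes that a `_min` suffix means "score must be at least this value", that `_max` means "score must be at most this value", and that the part before the suffix is the metric name. The thresholds are written as a plain dict so the example runs without a YAML parser, and the scores are made-up values.

```python
# Mirrors the eval_config.yaml values shown above.
THRESHOLDS = {
    "faithfulness_min": 0.85,
    "answer_relevancy_min": 0.80,
    "context_recall_min": 0.75,
    "context_precision_min": 0.75,
    "hallucination_rate_max": 0.15,
}


def check_thresholds(scores, thresholds):
    """Return one message per violated threshold; an empty list means PASS."""
    failures = []
    for key, limit in thresholds.items():
        # "faithfulness_min" -> metric "faithfulness", kind "min"
        metric, _, kind = key.rpartition("_")
        value = scores[metric]
        if kind == "min" and value < limit:
            failures.append(f"{metric}={value:.2f} is below min {limit:.2f}")
        elif kind == "max" and value > limit:
            failures.append(f"{metric}={value:.2f} is above max {limit:.2f}")
    return failures


# Hypothetical scores for illustration only.
example_scores = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.82,
    "context_recall": 0.70,
    "context_precision": 0.80,
    "hallucination_rate": 0.10,
}
failures = check_thresholds(example_scores, THRESHOLDS)
exit_code = 1 if failures else 0
print(failures, exit_code)
```

In this example only `context_recall` falls short, so the run would map to exit code 1 under the pass/fail convention in the flow diagram.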

Adding a New Target Service

To evaluate a new RAG service:

HTTP Mode: Ensure your service exposes a POST /query endpoint that accepts {"question": "..."} and returns {"answer": "...", "contexts": ["..."]}. Then run:
```
python -m rag_eval --mode http --url https://your-service.com --dataset dataset.json --config eval_config.yaml
```
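The flow diagram notes that the HTTP adapter retries three times and backs off on 5xx responses. The sketch below shows one way such a client loop could look. It is not the package's `HTTPAdapter`. The function name, the exponential delay schedule, and the reading of "3x" as three retries after the first attempt are all assumptions. The transport is passed in as a `send` callable returning `(status_code, body)` so the example runs without a network.

```python
import time


def query_with_retry(send, question, retries=3, base_delay=0.5, sleep=time.sleep):
    """POST-style query with retry on 5xx responses.

    `send(payload)` must return a (status_code, body_dict) tuple.
    """
    payload = {"question": question}
    for attempt in range(retries + 1):
        status, body = send(payload)
        if status < 500:
            break  # success or a client error: retrying will not help
        if attempt < retries:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    if status != 200:
        raise RuntimeError(f"query failed with status {status}")
    if "answer" not in body or "contexts" not in body:
        raise ValueError("response must contain 'answer' and 'contexts'")
    return {"answer": body["answer"], "contexts": list(body["contexts"])}


# Demo with a fake transport that fails twice, then succeeds.
attempts = []
recorded_delays = []


def flaky_send(payload):
    attempts.append(payload)
    if len(attempts) < 3:
        return 503, {}
    return 200, {"answer": "RAG combines retrieval with generation.", "contexts": ["..."]}


response = query_with_retry(flaky_send, "What is RAG?", sleep=recorded_delays.append)
print(response, recorded_delays)
```

In a real client, `send` would wrap an HTTP POST to `/query` with the bearer token in the `Authorization` header.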

Function Mode: Expose your LangChain RetrievalQA chain as a module-level attribute, so that `--module` and `--attr` can locate it:

# myapp/chain.py
from langchain.chains import RetrievalQA
rag_chain = RetrievalQA.from_chain_type(...)

Then run:

python -m rag_eval --mode function --module myapp.chain --attr rag_chain --dataset dataset.json --config eval_config.yaml

Custom Adapter: Extend RAGAdapter in rag_eval/adapters.py:

from rag_eval.adapters import RAGAdapter

class MyCustomAdapter(RAGAdapter):
    def query(self, question: str) -> dict:
        # Your logic here
        return {"answer": "...", "contexts": ["..."]}
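As a slightly fuller illustration, the sketch below implements a toy adapter and runs it over a dataset to collect one `{answer, contexts}` result per question, as in the flow diagram. The `RAGAdapter` class here is a stand-in defined locally so the example runs on its own; in the project you would import it from `rag_eval.adapters` instead. `KeywordLookupAdapter` and `collect_responses` are hypothetical names, not part of the package.

```python
from abc import ABC, abstractmethod


class RAGAdapter(ABC):
    """Local stand-in for rag_eval.adapters.RAGAdapter (assumed interface)."""

    @abstractmethod
    def query(self, question: str) -> dict:
        ...


class KeywordLookupAdapter(RAGAdapter):
    """Hypothetical adapter that answers from an in-memory dict of documents."""

    def __init__(self, documents: dict):
        self.documents = documents

    def query(self, question: str) -> dict:
        # "Retrieve" every document whose key appears in the question.
        hits = [
            text
            for key, text in self.documents.items()
            if key.lower() in question.lower()
        ]
        answer = hits[0] if hits else "I don't know."
        return {"answer": answer, "contexts": hits}


def collect_responses(adapter: RAGAdapter, dataset: list) -> list:
    """Query the adapter once per record, keeping the question with each result."""
    return [
        {"question": rec["question"], **adapter.query(rec["question"])}
        for rec in dataset
    ]


adapter = KeywordLookupAdapter({"RAG": "RAG combines retrieval with generation."})
rows = collect_responses(adapter, [{"question": "What is RAG?"}])
print(rows)
```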

CI/CD Integration

The included GitHub Actions workflow (.github/workflows/rag_eval.yml) runs on every PR to main:

Installs dependencies
Runs CLI in HTTP mode against your staging URL
Uploads eval_report.json as an artifact
Posts a score table as a PR comment
Fails the check if any metric is below threshold
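The score table posted to the PR could be rendered along the following lines. This is a sketch only: the column layout, the function name `render_score_table`, and the `_min`/`_max` suffix convention are assumptions, and the workflow's actual comment format may differ.

```python
def render_score_table(scores, thresholds):
    """Build a Markdown table with one row per configured threshold."""
    lines = [
        "| Metric | Score | Threshold | Status |",
        "| --- | --- | --- | --- |",
    ]
    for key, limit in thresholds.items():
        # "hallucination_rate_max" -> metric "hallucination_rate", kind "max"
        metric, _, kind = key.rpartition("_")
        score = scores[metric]
        passed = score >= limit if kind == "min" else score <= limit
        op = ">=" if kind == "min" else "<="
        status = "PASS" if passed else "FAIL"
        lines.append(f"| {metric} | {score:.2f} | {op} {limit:.2f} | {status} |")
    return "\n".join(lines)


# Hypothetical scores for illustration only.
table = render_score_table(
    {"faithfulness": 0.90, "hallucination_rate": 0.20},
    {"faithfulness_min": 0.85, "hallucination_rate_max": 0.15},
)
print(table)
```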

Required GitHub Secrets

| Secret | Description |
| --- | --- |
| `OPENAI_API_KEY` | OpenAI API key for RAGAS/LLM evaluation |
| `LANGSMITH_API_KEY` | LangSmith API key (optional) |
| `RAG_API_KEY` | Bearer token for your staging RAG service |

Required GitHub Variables

| Variable | Description |
| --- | --- |
| `RAG_STAGING_URL` | URL of your staging RAG service |

Evaluation Metrics

| Metric | Description | Range |
| --- | --- | --- |
| Faithfulness | Factual consistency of answer with context | 0–1 |
| Answer Relevancy | How relevant the answer is to the question | 0–1 |
| Context Recall | Coverage of ground truth by retrieved contexts | 0–1 |
| Context Precision | Relevance of retrieved contexts | 0–1 |
| Hallucination Rate | `1 - faithfulness` (lower is better) | 0–1 |
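Since the hallucination rate is defined as `1 - faithfulness`, it can be derived once faithfulness is known. The sketch below assumes the report-level faithfulness is the plain mean of per-question scores; that aggregation rule and the function name are assumptions, and the per-question values are made up.

```python
from statistics import mean


def summarize_faithfulness(per_question_scores):
    """Aggregate per-question faithfulness and derive the hallucination rate."""
    faithfulness = mean(per_question_scores)
    return {
        "faithfulness": faithfulness,
        "hallucination_rate": 1 - faithfulness,  # definition from the metrics table
    }


summary = summarize_faithfulness([1.0, 0.5, 0.75])
print(summary)
```

One consequence of this definition is that the default thresholds `faithfulness_min: 0.85` and `hallucination_rate_max: 0.15` express the same constraint, because a faithfulness of at least 0.85 implies a hallucination rate of at most 0.15.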

Dependencies

langchain, langchain-openai, langchain-community — LangChain framework
chromadb — Vector store for integration tests
ragas — RAG evaluation metrics
langsmith — Experiment tracking
httpx — HTTP client with retry support
pyyaml — Config file parsing
openai — GPT-4o for dataset generation & LLM-as-judge
pytest — Testing framework
