serverf21/rag_lens

RAG Eval — Agentic RAG Evaluation Pipeline

A Python package and web dashboard for evaluating Retrieval-Augmented Generation (RAG) systems. It supports two modes: Function Call, which invokes a LangChain chain directly, and HTTP Client, which queries a running service over HTTP.

Flow Diagram

┌──────────────────────────────────────────────────────────────────────┐
│                    RAG EVAL PIPELINE FLOW                            │
└──────────────────────────────────────────────────────────────────────┘

                    ┌─────────────────┐
                    │   INPUT LAYER   │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
     ┌────────▼───────┐ ┌────▼─────┐ ┌──────▼──────┐
     │  Dataset JSON  │ │  Config  │ │  Adapter    │
     │  [{question,   │ │  YAML    │ │  Selection  │
     │   ground_truth,│ │ (thres-  │ │             │
     │   contexts}]   │ │  holds)  │ │  HTTP  │ FN │
     └────────┬───────┘ └────┬─────┘ └──────┬──────┘
              │              │              │
              └──────────────┼──────────────┘
                             │
                    ┌────────▼────────┐
                    │  ADAPTER LAYER  │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │                             │
     ┌────────▼─────────┐         ┌─────────▼─────────┐
     │  HTTPAdapter     │         │  FunctionAdapter  │
     │                  │         │                   │
     │  POST /query     │         │  chain.invoke()   │
     │  Bearer Auth     │         │  LangChain QA     │
     │  Retry (3x)      │         │  Direct call      │
     │  Backoff on 5xx  │         │                   │
     └────────┬─────────┘         └─────────┬─────────┘
              │                             │
              └──────────────┬──────────────┘
                             │
                    ┌────────▼────────┐
                    │  {answer, ctx}  │
                    │  per question   │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │ EVALUATOR LAYER │
                    │                 │
                    │  Try RAGAS ──┐  │
                    │      │       │  │
                    │  fallback    │  │
                    │      │       │  │
                    │  LLM Judge   │  │
                    │  (GPT-4o)    │  │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │   EVAL REPORT   │
                    │                 │
                    │  faithfulness   │
                    │  relevancy      │
                    │  context_recall │
                    │  context_prec   │
                    │  halluc. rate   │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
     ┌────────▼───────┐ ┌────▼─────┐ ┌──────▼──────┐
     │  Reporter      │ │ Threshold│ │  LangSmith  │
     │                │ │  Check   │ │  (optional) │
     │  JSON file     │ │          │ │             │
     │  MD summary    │ │          │ │  Experiment │
     └────────────────┘ │ PASS/FAIL│ │  push       │
                        └──────────┘ └─────────────┘
                             │
                    ┌────────▼────────┐
                    │   EXIT CODE     │
                    │  0 = pass       │
                    │  1 = fail       │
                    └─────────────────┘
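
The evaluator step in the diagram tries RAGAS first and falls back to an LLM judge. The control flow can be sketched as follows. This is a minimal illustration, not the code in evaluator.py: `score_with_fallback`, `ragas_scorer` and `judge_scorer` are hypothetical names, and the two scorers are stand-ins that make no real RAGAS or OpenAI calls.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]
Scorer = Callable[[str, str, List[str]], Scores]


def score_with_fallback(
    primary: Scorer,
    fallback: Scorer,
    question: str,
    answer: str,
    contexts: List[str],
) -> Scores:
    """Return the primary scorer's result, or the fallback's if the primary raises."""
    try:
        return primary(question, answer, contexts)
    except Exception:
        # Any failure in the primary scorer (missing package, API error, ...)
        # routes this sample to the fallback scorer instead.
        return fallback(question, answer, contexts)


# Hypothetical stand-ins, for illustration only.
def ragas_scorer(question: str, answer: str, contexts: List[str]) -> Scores:
    raise RuntimeError("RAGAS unavailable")


def judge_scorer(question: str, answer: str, contexts: List[str]) -> Scores:
    return {"faithfulness": 0.9}


result = score_with_fallback(ragas_scorer, judge_scorer, "What is RAG?", "...", ["..."])
```

Here `result` comes from the fallback scorer because the primary one raised.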

Architecture

rag_eval/
├── __init__.py        # Package exports
├── __main__.py        # CLI entrypoint (python -m rag_eval)
├── adapters.py        # RAGAdapter ABC, FunctionAdapter, HTTPAdapter
├── dataset.py         # load_dataset(), generate_dataset()
├── evaluator.py       # evaluate() → EvalReport
├── reporter.py        # JSON, Markdown, LangSmith reporters
└── thresholds.py      # load_config(), check()

backend/server.py      # FastAPI dashboard API
frontend/src/App.js    # React monitoring dashboard
tests/                 # pytest unit + integration tests
.github/workflows/     # CI/CD pipeline

Quick Start

1. Install Dependencies

cd /app  # repository root; adjust to wherever you cloned the repo
python3.12 -m venv backend/.venv
source backend/.venv/bin/activate
python3.12 -m pip install -r backend/requirements.txt

Python version note

Use Python 3.11 or 3.12 for the smoothest install experience. Some pinned dependencies in backend/requirements.txt may not yet publish compatible wheels for Python 3.14+.

2. Set Environment Variables

export OPENAI_API_KEY=sk-your-key-here

# Optional: LangSmith
export LANGSMITH_API_KEY=ls-your-key-here

# Optional: bearer token for HTTP mode (passed via --api-key)
export RAG_API_KEY=your-token-here

3. Run via CLI

HTTP Mode — test any RAG service exposing a /query endpoint:

python3.12 -m rag_eval \
  --mode http \
  --url https://my-rag-service.com \
  --api-key "$RAG_API_KEY" \
  --dataset ./tests/fixtures/dataset.json \
  --config ./eval_config.yaml
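
The flow diagram lists "Retry (3x)" and "Backoff on 5xx" for the HTTP adapter. A minimal sketch of such a loop follows. It assumes "3x" means three retries after the first attempt and that the backoff is exponential; neither detail is confirmed by this README. `query_with_retry` and its `send` callback are hypothetical names, and the actual HTTP call is left to the caller so the sketch stays dependency-free.

```python
import time
from typing import Callable, Tuple

Response = Tuple[int, dict]  # (HTTP status code, parsed JSON body)


def query_with_retry(
    send: Callable[[], Response],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call send(); on a 5xx status, retry up to max_retries times with backoff."""
    status = 0
    for attempt in range(max_retries + 1):
        status, body = send()
        if status < 400:
            return body
        if status < 500:
            # 4xx means the request itself is wrong; retrying will not help.
            raise RuntimeError(f"client error {status}")
        if attempt < max_retries:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"server error {status} after {max_retries} retries")
```

Passing `sleep` as a parameter lets tests replace the real delay with a no-op.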

Function Mode — test a LangChain chain directly:

python3.12 -m rag_eval \
  --mode function \
  --module myapp.chain \
  --attr rag_chain \
  --dataset ./tests/fixtures/dataset.json \
  --config ./eval_config.yaml
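
The --module and --attr flags name a Python module and an attribute inside it. One common way to resolve such a pair is `importlib.import_module` followed by `getattr`, sketched here. `load_target` is a hypothetical helper, and whether the CLI resolves the pair exactly this way is an assumption.

```python
import importlib
from typing import Any


def load_target(module_path: str, attr_name: str) -> Any:
    """Import module_path and return its attribute attr_name."""
    module = importlib.import_module(module_path)
    if not hasattr(module, attr_name):
        raise ValueError(f"module {module_path!r} has no attribute {attr_name!r}")
    return getattr(module, attr_name)


# Demonstrated with a standard-library module. For the command above the
# call would be load_target("myapp.chain", "rag_chain").
dumps = load_target("json", "dumps")
```

With this approach the module must be importable from the directory where the command runs, or from PYTHONPATH.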

4. Web Dashboard

The dashboard runs at http://localhost:3000 (frontend) backed by http://localhost:8001 (FastAPI API).

  • Dashboard: Overview metrics, charts, recent runs
  • Eval Runs: List runs, start new evaluations, view details
  • Datasets: Upload/manage evaluation datasets (JSON)
  • Config: Set pass/fail thresholds via sliders

Run the backend API (from repo root):

source backend/.venv/bin/activate
python3.12 -m uvicorn backend.server:app --reload --port 8001

5. Run Tests

# Unit tests (no API key needed)
python3.12 -m pytest tests/test_adapters.py tests/test_thresholds.py tests/test_dataset.py -v

# Integration test (requires OPENAI_API_KEY)
python3.12 -m pytest tests/test_integration.py -v

CLI Options

Flag                  Description
--mode                http or function (required)
--url                 Target URL for HTTP mode
--api-key             Bearer token for HTTP auth
--module              Python module path for function mode
--attr                Attribute name in module
--dataset             Path to dataset JSON file (required)
--config              Path to eval_config.yaml (required)
--name                Name for this evaluation run
--output-dir          Directory for output files (default: .)
--langsmith           Push results to LangSmith
--langsmith-project   LangSmith project name

Dataset Format

[
  {
    "question": "What is RAG?",
    "ground_truth": "RAG combines retrieval with generation.",
    "relevant_contexts": [
      "RAG is a technique that enhances LLMs..."
    ]
  }
]
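
A quick shape check before a run can catch malformed records early. The sketch below validates the three keys shown above using only the standard library. `validate_dataset` is a hypothetical helper, not the package's load_dataset(), whose behavior is not documented here.

```python
import json

REQUIRED_KEYS = ("question", "ground_truth", "relevant_contexts")


def validate_dataset(raw: str) -> list:
    """Parse a JSON dataset string and check each record's shape."""
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("dataset must be a JSON array of records")
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"record {i} must be a JSON object")
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"record {i} is missing keys: {missing}")
        if not isinstance(record["relevant_contexts"], list):
            raise ValueError(f"record {i}: relevant_contexts must be a list")
    return records


sample = json.dumps([
    {
        "question": "What is RAG?",
        "ground_truth": "RAG combines retrieval with generation.",
        "relevant_contexts": ["RAG is a technique that enhances LLMs..."],
    }
])
records = validate_dataset(sample)
```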

Threshold Configuration (eval_config.yaml)

faithfulness_min: 0.85
answer_relevancy_min: 0.80
context_recall_min: 0.75
context_precision_min: 0.75
hallucination_rate_max: 0.15
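
Keys ending in _min are lower bounds and keys ending in _max are upper bounds. A minimal sketch of how such a config can be checked against a set of scores follows. `find_violations` is a hypothetical helper, not the package's check(), and the scores are made-up illustration values. It assumes a score exactly at the threshold passes.

```python
from typing import Dict, List

# Same values as the eval_config.yaml above, written as a dict so the
# sketch does not need a YAML parser.
THRESHOLDS = {
    "faithfulness_min": 0.85,
    "answer_relevancy_min": 0.80,
    "context_recall_min": 0.75,
    "context_precision_min": 0.75,
    "hallucination_rate_max": 0.15,
}


def find_violations(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one message per metric that violates its threshold."""
    violations = []
    for key, limit in thresholds.items():
        # "faithfulness_min" -> metric "faithfulness", kind "min"
        metric, _, kind = key.rpartition("_")
        value = scores[metric]
        if kind == "min" and value < limit:
            violations.append(f"{metric}={value:.2f} is below minimum {limit:.2f}")
        elif kind == "max" and value > limit:
            violations.append(f"{metric}={value:.2f} is above maximum {limit:.2f}")
    return violations


# Made-up scores for illustration.
scores = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.78,
    "context_recall": 0.80,
    "context_precision": 0.75,
    "hallucination_rate": 0.10,
}
violations = find_violations(scores, THRESHOLDS)
exit_code = 1 if violations else 0  # mirrors the 0 = pass / 1 = fail convention
```

In this example only answer_relevancy falls short, so the run would exit with code 1.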

Adding a New Target Service

To evaluate a new RAG service:

  1. HTTP Mode: Ensure your service exposes a POST /query endpoint that accepts {"question": "..."} and returns {"answer": "...", "contexts": ["..."]}. Add --api-key if the service requires a bearer token. Then run:

    python -m rag_eval --mode http --url https://your-service.com --dataset dataset.json --config eval_config.yaml

  2. Function Mode: Expose your LangChain RetrievalQA chain as a module-level attribute:

    # myapp/chain.py
    from langchain.chains import RetrievalQA
    rag_chain = RetrievalQA.from_chain_type(...)

    Then run:

    python -m rag_eval --mode function --module myapp.chain --attr rag_chain --dataset dataset.json --config eval_config.yaml

  3. Custom Adapter: Extend RAGAdapter in rag_eval/adapters.py:

    from rag_eval.adapters import RAGAdapter
    
    class MyCustomAdapter(RAGAdapter):
        def query(self, question: str) -> dict:
            # Your logic here
            return {"answer": "...", "contexts": ["..."]}
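
For a dry run without a live service, a custom adapter can answer from a fixed lookup table. The sketch below is self-contained, so it defines a local stand-in for RAGAdapter with a single abstract query() method. That interface is assumed from the snippet above, and `CannedAdapter` is a hypothetical name.

```python
from abc import ABC, abstractmethod
from typing import Dict


class RAGAdapter(ABC):
    """Local stand-in for rag_eval.adapters.RAGAdapter (assumed interface)."""

    @abstractmethod
    def query(self, question: str) -> dict:
        ...


class CannedAdapter(RAGAdapter):
    """Hypothetical adapter that answers from a fixed lookup table."""

    def __init__(self, answers: Dict[str, str]):
        self._answers = answers

    def query(self, question: str) -> dict:
        answer = self._answers.get(question, "")
        # Return the same {"answer", "contexts"} shape as the other adapters.
        return {"answer": answer, "contexts": [answer] if answer else []}


adapter = CannedAdapter({"What is RAG?": "RAG combines retrieval with generation."})
response = adapter.query("What is RAG?")
```

In a real project the class would subclass the imported RAGAdapter instead of the stand-in.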

CI/CD Integration

The included GitHub Actions workflow (.github/workflows/rag_eval.yml) runs on every PR to main:

  1. Installs dependencies
  2. Runs CLI in HTTP mode against your staging URL
  3. Uploads eval_report.json as an artifact
  4. Posts a score table as a PR comment
  5. Fails the check if any metric violates its threshold (below a _min value or above a _max value)

Required GitHub Secrets

Secret              Description
OPENAI_API_KEY      OpenAI API key for RAGAS/LLM evaluation
LANGSMITH_API_KEY   LangSmith API key (optional)
RAG_API_KEY         Bearer token for your staging RAG service

Required GitHub Variables

Variable          Description
RAG_STAGING_URL   URL of your staging RAG service

Evaluation Metrics

Metric               Description                                      Range
Faithfulness         Factual consistency of answer with context       0–1
Answer Relevancy     How relevant the answer is to the question       0–1
Context Recall       Coverage of ground truth by retrieved contexts   0–1
Context Precision    Relevance of retrieved contexts                  0–1
Hallucination Rate   1 - faithfulness (lower is better)               0–1
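
Because hallucination rate is defined as 1 - faithfulness, the two scores always move together. A small worked example follows, assuming the run-level faithfulness is the mean of per-question scores. The aggregation method is an assumption, and the numbers are made up.

```python
from statistics import mean

# Made-up per-question faithfulness scores.
faithfulness_per_question = [1.0, 0.8, 0.9]

faithfulness = mean(faithfulness_per_question)  # about 0.9
hallucination_rate = 1 - faithfulness           # about 0.1
```

Under this definition, a faithfulness minimum of 0.85 and a hallucination rate maximum of 0.15 express the same constraint.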

Dependencies

  • langchain, langchain-openai, langchain-community — LangChain framework
  • chromadb — Vector store for integration tests
  • ragas — RAG evaluation metrics
  • langsmith — Experiment tracking
  • httpx — HTTP client with retry support
  • pyyaml — Config file parsing
  • openai — GPT-4o for dataset generation & LLM-as-judge
  • pytest — Testing framework
