smoothemerson/ragscope

RAG API with MLflow Evaluation Dashboard

A portfolio-grade Q&A API that lets you upload PDF/text documents and ask questions about them using Retrieval-Augmented Generation (RAG). Every query is logged as an MLflow run with operational metrics and LLM-as-judge quality scores.

Fully offline — no external API keys required.


Architecture

┌──────────────────────────────────────────────────────┐
│                    Docker Compose                    │
│                                                      │
│  ┌──────────────────────────┐    ┌──────────────┐    │
│  │  FastAPI  :8000          │    │    MLflow    │    │
│  │  └─ Chroma (embedded)    │    │    :5000     │    │
│  └────┬─────────────────────┘    └──────────────┘    │
│       │                                              │
│       ▼                                              │
│  ┌──────────────────────────────────────┐            │
│  │  Ollama  :11434                      │            │
│  │  (llama3.2 · mistral · nomic-embed)  │            │
│  └──────────────────────────────────────┘            │
└──────────────────────────────────────────────────────┘

Chroma runs embedded inside the API container (no separate ChromaDB service). Vector data is persisted to a named Docker volume (chroma_data) via CHROMA_PERSIST_DIR.

RAG Pipeline:

  1. User uploads a document → POST /ingest
  2. Text is extracted, chunked (4000 characters, 20-character overlap), and embedded with nomic-embed-text
  3. Embeddings are stored in the embedded Chroma vector store (persisted to volume)
  4. User asks a question → POST /query
  5. Question is embedded and top-k chunks retrieved from Chroma by cosine similarity
  6. Retrieved chunks + question are passed to llama3.2 (configurable via OLLAMA_MODEL) via a LangChain RunnableSequence
  7. Answer is returned; metrics and quality scores are logged to MLflow under experiment ragscope
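Steps 2 and 5 can be illustrated in plain, dependency-free Python. This is a sketch, not the repo's code: `toy_embed` is a stand-in hash embedder for nomic-embed-text, and `chunk_text` is a simplified RecursiveCharacterTextSplitter (the real splitter also prefers paragraph and sentence boundaries):

```python
import math
import re
import zlib

def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 20) -> list[str]:
    # Fixed-size character windows with a small overlap between neighbors.
    chunks, start, step = [], 0, chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

def toy_embed(text: str, dim: int = 64) -> list[float]:
    # Deterministic bag-of-words hash embedding; only good enough
    # to demonstrate the retrieval math below.
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(question: str, chunks: list[str], k: int = 4) -> list[str]:
    # Rank stored chunks by cosine similarity to the question embedding,
    # which is what Chroma does with the real embedding vectors.
    q = toy_embed(question)
    return sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)[:k]
```

With an 8100-character document, `chunk_text` yields three chunks whose boundaries overlap by 20 characters; `retrieve_top_k` then picks the chunks most similar to the question.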

Prerequisites

  • Docker and Docker Compose installed
  • ~10 GB free disk space (for Ollama models)

The ./mlflow/data and ./mlflow/artifacts directories are created automatically by Docker when the bind mounts are resolved on first startup.


Quickstart

Step 1 — set your hardware profile in .env:

| Hardware       | COMPOSE_PROFILES value |
| -------------- | ---------------------- |
| CPU            | cpu                    |
| NVIDIA GPU     | gpu-nvidia             |
| AMD GPU (ROCm) | gpu-amd                |

# .env
COMPOSE_PROFILES=cpu        # or gpu-nvidia or gpu-amd

Warning: COMPOSE_PROFILES must be exactly one of cpu, gpu-nvidia, or gpu-amd. Any other value (including leaving it blank) will cause no Ollama service to start and the API will fail to connect.
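One way to catch a bad profile before running docker compose is to validate the .env entry up front. This helper is not part of the repo, just a defensive sketch:

```python
import pathlib

ALLOWED_PROFILES = {"cpu", "gpu-nvidia", "gpu-amd"}

def check_compose_profile(env_path: str = ".env") -> str:
    """Read COMPOSE_PROFILES from a .env file and fail fast if it is
    missing, blank, or not one of the three supported values."""
    value = ""
    for line in pathlib.Path(env_path).read_text().splitlines():
        line = line.split("#", 1)[0].strip()  # drop inline comments
        if line.startswith("COMPOSE_PROFILES="):
            value = line.split("=", 1)[1].strip()
    if value not in ALLOWED_PROFILES:
        raise ValueError(
            f"COMPOSE_PROFILES={value!r} is invalid; "
            f"expected one of {sorted(ALLOWED_PROFILES)}"
        )
    return value
```

Running this before `docker compose up` turns the silent "no Ollama service started" failure into an immediate, explicit error.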

Step 2 — start the stack:

docker compose up

Wait for all three Ollama models to finish pulling (progress is logged in the api service output), then try the examples below.


Example Usage

Ingest a document

curl -X POST http://localhost:8000/ingest \
  -F "file=@/path/to/your/document.pdf"

{"status": "ok", "chunks_stored": 42, "filename": "document.pdf"}

Query the RAG pipeline

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the main topic of the document?", "top_k": 4}'

{
  "answer": "The document covers...",
  "sources": ["chunk text 1", "chunk text 2"]
}
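The same query can be issued from Python with nothing but the standard library. The endpoint and payload shape come from the example above; the helper names are hypothetical:

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # FastAPI service from docker compose

def build_query_payload(question: str, top_k: int = 4) -> bytes:
    """JSON body expected by POST /query."""
    return json.dumps({"question": question, "top_k": top_k}).encode()

def query(question: str, top_k: int = 4) -> dict:
    """POST a question to the RAG API and return the parsed response
    ({"answer": ..., "sources": [...]}). Requires the stack to be running."""
    req = urllib.request.Request(
        f"{API_URL}/query",
        data=build_query_payload(question, top_k),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `query("What is the main topic of the document?")["answer"]` returns just the generated answer string.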

Health check

curl http://localhost:8000/health

{"status": "ok", "chromadb": "ok", "ollama": "ok"}

MLflow Dashboard

Every call to POST /query creates one MLflow run under the ragscope experiment.

Access the dashboard at http://localhost:5000 → select ragscope experiment.

Each run logs:

  • GenAI Quality Scores, produced by MLflow GenAI's built-in scorers (AnswerRelevancy, Hallucination, Safety) with a separate LLM judge (mistral):
    • answer_relevancy — does the answer address the question?
    • hallucination — does the answer contain information not supported by the retrieved context?
    • safety — is the answer free of harmful content?
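To make the LLM-as-judge idea concrete, here is a hypothetical sketch of a hallucination-style judge exchange. The real MLflow GenAI scorers ship their own prompts and parsing, so both the prompt text and the `parse_score` helper below are illustration only:

```python
def hallucination_judge_prompt(question: str, context: str, answer: str) -> str:
    # Illustrative judge prompt: ask the judge model (mistral) whether the
    # answer makes claims the retrieved context does not support.
    return (
        "You are an impartial judge. Given the retrieved context, decide "
        "whether the answer contains claims NOT supported by that context.\n\n"
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Answer: {answer}\n\n"
        "Reply with a single number from 1 (fully grounded) to 5 "
        "(severe hallucination)."
    )

def parse_score(reply: str) -> int:
    # Pull the first integer in [1, 5] out of the judge model's free-text reply.
    for token in reply.split():
        token = token.strip(".,")
        if token.isdigit() and 1 <= int(token) <= 5:
            return int(token)
    raise ValueError(f"no score found in judge reply: {reply!r}")
```

The prompt would be sent to the judge model via Ollama and the numeric score logged to the MLflow run alongside the operational metrics.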


Environment Variables

| Variable            | Default            | Description                                                                   |
| ------------------- | ------------------ | ----------------------------------------------------------------------------- |
| OLLAMA_MODEL        | llama3.2           | Ollama model for answer generation                                            |
| OLLAMA_JUDGE_MODEL  | mistral            | Ollama model for LLM-as-judge scoring                                         |
| OLLAMA_EMBED_MODEL  | nomic-embed-text   | Ollama model for embeddings                                                   |
| CHROMA_PERSIST_DIR  | /tmp/chroma        | Path inside the container where Chroma persists its data (chroma_data volume) |
| MLFLOW_TRACKING_URI | http://mlflow:5000 | MLflow tracking server URI                                                    |

Override any variable by setting it before running docker compose up:

OLLAMA_MODEL=llama3.1 docker compose up
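A sketch of how these settings resolve to their defaults; the repo's actual config module may be structured differently:

```python
import os

def load_settings(env=os.environ) -> dict:
    # Resolve the five settings with the defaults from the table above;
    # any value set in the environment (or .env) wins.
    return {
        "OLLAMA_MODEL": env.get("OLLAMA_MODEL", "llama3.2"),
        "OLLAMA_JUDGE_MODEL": env.get("OLLAMA_JUDGE_MODEL", "mistral"),
        "OLLAMA_EMBED_MODEL": env.get("OLLAMA_EMBED_MODEL", "nomic-embed-text"),
        "CHROMA_PERSIST_DIR": env.get("CHROMA_PERSIST_DIR", "/tmp/chroma"),
        "MLFLOW_TRACKING_URI": env.get("MLFLOW_TRACKING_URI", "http://mlflow:5000"),
    }
```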

How It Works

  1. Document Ingestion (POST /ingest):

    • File uploaded as multipart/form-data
    • PDF → PyPDFLoader.load_and_split(); TXT → TextLoader
    • Split with RecursiveCharacterTextSplitter (chunk_size=4000, chunk_overlap=20)
    • Embedded with nomic-embed-text via Ollama
    • Stored in embedded Chroma (persisted to chroma_data volume)
  2. Query (POST /query):

    • Question embedded with nomic-embed-text
    • Top-k chunks retrieved from Chroma by cosine similarity
    • LangChain RunnableSequence (PromptTemplate | ChatOllama) runs llama3.2 (or OLLAMA_MODEL) with retrieved context
    • Answer extracted from AIMessage.content and returned with source chunks
  3. MLflow Logging:

    • Experiment name: ragscope
    • autolog() enabled on startup via src/tracking/setup.py
    • MLflow GenAI evaluate() runs scorers (AnswerRelevancy, Hallucination, Safety) using judge model (mistral)
    • All traces and scores visible in MLflow UI under the GenAI section
  4. Model Warm-up:

    • On startup, the API pulls all three Ollama models via POST /api/pull
    • FastAPI does not accept requests until all models are confirmed available
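The warm-up gate reduces to a polling loop. In this sketch, `is_available` is an injected callback (in the real service it would check Ollama's API after the pulls); only the readiness logic is shown, not the repo's code:

```python
import time

def wait_for_models(models: list[str], is_available,
                    timeout: float = 600.0, poll_interval: float = 2.0) -> None:
    """Block until every model reports available, or raise TimeoutError.
    `is_available(name) -> bool` is injected so the same loop works against
    a live Ollama instance or a stub in tests."""
    deadline = time.monotonic() + timeout
    pending = set(models)
    while pending:
        # Re-check only the models that are still missing.
        pending = {m for m in pending if not is_available(m)}
        if not pending:
            return
        if time.monotonic() > deadline:
            raise TimeoutError(f"models never became available: {sorted(pending)}")
        time.sleep(poll_interval)
```

FastAPI would call this during startup so that no request reaches the RAG pipeline before all three models are ready.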
