A portfolio-grade Q&A API that lets you upload PDF/text documents and ask questions about them using Retrieval-Augmented Generation (RAG). Every query is logged as an MLflow run with operational metrics and LLM-as-judge quality scores.
Fully offline — no external API keys required.
┌─────────────────────────────────────────────────────┐
│ Docker Compose │
│ │
│ ┌──────────────────────────┐ ┌──────────────┐ │
│ │ FastAPI :8000 │ │ MLflow │ │
│ │ └─ Chroma (embedded) │ │ :5000 │ │
│ └────┬─────────────────────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────┐ │
│ │ Ollama │ (llama3.2 · mistral · nomic-embed) │
│ │ :11434 │ │
│ └──────────┘ │
└─────────────────────────────────────────────────────┘
Chroma runs embedded inside the API container (no separate ChromaDB service). Vector data is persisted to a named Docker volume (chroma_data) via CHROMA_PERSIST_DIR.
RAG Pipeline:
- User uploads a document → `POST /ingest`
  - Text is extracted, chunked (4 000 chars, 20 overlap), and embedded with `nomic-embed-text`
  - Embeddings are stored in the embedded Chroma vector store (persisted to volume)
- User asks a question → `POST /query`
  - The question is embedded and the top-k chunks are retrieved from Chroma by cosine similarity
  - Retrieved chunks + question are passed to `llama3.2` (configurable via `OLLAMA_MODEL`) via a LangChain `RunnableSequence`
  - The answer is returned; metrics and quality scores are logged to MLflow under the `ragscope` experiment
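The two stages above reduce to fixed-size chunking at ingest time and cosine-similarity top-k retrieval at query time. A minimal pure-Python sketch of the idea (the service itself uses LangChain's splitter and Chroma; the vectors passed in stand in for `nomic-embed-text` embeddings):

```python
import math

def chunk(text: str, size: int = 4000, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows with a small overlap,
    mirroring the chunk_size=4000 / overlap=20 ingest settings."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float], vectors: list[list[float]],
          chunks: list[str], k: int = 4) -> list[str]:
    """Return the k stored chunks most similar to the query embedding."""
    ranked = sorted(zip(chunks, vectors),
                    key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

The real pipeline swaps the toy chunker for `RecursiveCharacterTextSplitter` and lets Chroma do the similarity search, but the data flow is the same.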
- Docker and Docker Compose installed
- ~10 GB free disk space (for Ollama models)
The `./mlflow/data` and `./mlflow/artifacts` directories are created automatically by Docker when the bind mounts are resolved on first startup.
Step 1 — set your hardware profile in .env:
| Hardware | COMPOSE_PROFILES value |
|---|---|
| CPU | cpu |
| NVIDIA GPU | gpu-nvidia |
| AMD GPU (ROCm) | gpu-amd |
```
# .env
COMPOSE_PROFILES=cpu   # or gpu-nvidia or gpu-amd
```

Warning: `COMPOSE_PROFILES` must be exactly one of `cpu`, `gpu-nvidia`, or `gpu-amd`. Any other value (including leaving it blank) will cause no Ollama service to start, and the API will fail to connect.
Step 2 — start the stack:
```
docker compose up
```

Wait for all three Ollama models to finish pulling (logged in the `api` service output). Then:
- FastAPI docs: http://localhost:8000/docs
- MLflow UI: http://localhost:5000
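Because the API refuses requests until the model pulls finish, a startup script may want to poll the service's `/health` endpoint before sending traffic. A standard-library sketch (the URL and timeout values are assumptions):

```python
import json
import time
import urllib.request

def is_ready(payload: dict) -> bool:
    """True once the API reports itself healthy."""
    return payload.get("status") == "ok"

def wait_for_api(url: str = "http://localhost:8000/health",
                 timeout: float = 600.0) -> bool:
    """Poll /health until the API accepts requests; it blocks startup
    until the Ollama model pulls have finished."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if is_ready(json.load(resp)):
                    return True
        except OSError:
            pass  # connection refused while the stack is still starting
        time.sleep(5)
    return False
```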
Ingest a document:

```bash
curl -X POST http://localhost:8000/ingest \
  -F "file=@/path/to/your/document.pdf"
```

```json
{"status": "ok", "chunks_stored": 42, "filename": "document.pdf"}
```

Ask a question:

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the main topic of the document?", "top_k": 4}'
```

```json
{
  "answer": "The document covers...",
  "sources": ["chunk text 1", "chunk text 2"]
}
```

Check service health:

```bash
curl http://localhost:8000/health
```

```json
{"status": "ok", "chromadb": "ok", "ollama": "ok"}
```

Every call to `POST /query` creates one MLflow run under the `ragscope` experiment.
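The same query can be made from Python; a standard-library sketch mirroring the curl example (`BASE` is an assumption matching the default port, and `requests` would work equally well):

```python
import json
import urllib.request

BASE = "http://localhost:8000"  # assumed default from docker-compose

def build_query_payload(question: str, top_k: int = 4) -> bytes:
    """JSON body for POST /query, matching the curl example."""
    return json.dumps({"question": question, "top_k": top_k}).encode()

def ask(question: str, top_k: int = 4) -> dict:
    """POST the question and return the parsed {"answer", "sources"} response."""
    req = urllib.request.Request(
        f"{BASE}/query",
        data=build_query_payload(question, top_k),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```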
Access the dashboard at http://localhost:5000 → select the `ragscope` experiment.
Each run logs:
- GenAI Quality Scores (via MLflow GenAI scorers, evaluated by `mistral`):
  - `answer_relevancy` — does the answer address the question?
  - `hallucination` — does the answer contain information not supported by the context?
  - `safety` — is the answer free of harmful content?

Quality scores use a separate LLM judge (`mistral`) via MLflow GenAI's built-in scorers (`AnswerRelevancy`, `Hallucination`, `Safety`).
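Under the hood, LLM-as-judge scoring amounts to prompting the judge model to grade each answer. The prompt below is purely illustrative (MLflow's built-in scorers define their own prompts), but it shows the shape of an `answer_relevancy`-style check:

```python
def relevancy_prompt(question: str, answer: str) -> str:
    """Illustrative only: MLflow's AnswerRelevancy scorer owns the real prompt.
    The judge model grades the pair and returns a score with a rationale."""
    return (
        "You are grading whether an answer addresses a question.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a score from 1 (irrelevant) to 5 (fully relevant) "
        "and a one-line rationale."
    )
```

The hallucination scorer works the same way, except the judge is also shown the retrieved context and asked whether the answer is supported by it.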
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_MODEL` | `llama3.2` | Ollama model for answer generation |
| `OLLAMA_JUDGE_MODEL` | `mistral` | Ollama model for LLM-as-judge scoring |
| `OLLAMA_EMBED_MODEL` | `nomic-embed-text` | Ollama model for embeddings |
| `CHROMA_PERSIST_DIR` | `/tmp/chroma` | Path inside the container where Chroma persists its data (mounted to the `chroma_data` volume) |
| `MLFLOW_TRACKING_URI` | `http://mlflow:5000` | MLflow tracking server URI |
Override any variable by setting it before running `docker compose up`:

```bash
OLLAMA_MODEL=llama3.1 docker compose up
```
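Inside the API these variables presumably resolve to environment lookups with the table's defaults; a hypothetical helper (the project's actual config module is not shown here):

```python
import os

def setting(name: str, default: str, env=None) -> str:
    """Resolve a config value from the environment, falling back to the
    defaults listed in the configuration table."""
    env = os.environ if env is None else env
    return env.get(name, default)

OLLAMA_MODEL = setting("OLLAMA_MODEL", "llama3.2")
OLLAMA_JUDGE_MODEL = setting("OLLAMA_JUDGE_MODEL", "mistral")
OLLAMA_EMBED_MODEL = setting("OLLAMA_EMBED_MODEL", "nomic-embed-text")
CHROMA_PERSIST_DIR = setting("CHROMA_PERSIST_DIR", "/tmp/chroma")
MLFLOW_TRACKING_URI = setting("MLFLOW_TRACKING_URI", "http://mlflow:5000")
```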
- Document Ingestion (`POST /ingest`):
  - File uploaded as `multipart/form-data`
  - PDF → `PyPDFLoader.load_and_split()`; TXT → `TextLoader`
  - Split with `RecursiveCharacterTextSplitter` (chunk_size=4 000, overlap=20)
  - Embedded with `nomic-embed-text` via Ollama
  - Stored in embedded Chroma (persisted to the `chroma_data` volume)
- Query (`POST /query`):
  - Question embedded with `nomic-embed-text`
  - Top-k chunks retrieved from Chroma by cosine similarity
  - LangChain `RunnableSequence` (`PromptTemplate | ChatOllama`) runs `llama3.2` (or `OLLAMA_MODEL`) with the retrieved context
  - Answer extracted from `AIMessage.content` and returned with the source chunks
- MLflow Logging:
  - Experiment name: `ragscope`
  - `autolog()` enabled on startup via `src/tracking/setup.py`
  - MLflow GenAI `evaluate()` runs scorers (`AnswerRelevancy`, `Hallucination`, `Safety`) using the judge model (`mistral`)
  - All traces and scores are visible in the MLflow UI under the GenAI section
- Model Warm-up:
  - On startup, the API pulls all three Ollama models via `POST /api/pull`
  - FastAPI does not accept requests until all models are confirmed available