A small, production-shaped Retrieval-Augmented Generation API built with FastAPI. Ingest documents, then ask questions that are answered only from the indexed content, with inline citations and a hallucination guard.
The design is deliberately dependency-light and readable — every layer (chunking, embeddings, vector store, LLM client) is a single focused module you can swap out (e.g. FAISS or pgvector for the store, any OpenAI-compatible endpoint for the LLM) without touching the rest.
┌─────────────┐ ingest ┌──────────┐ ┌──────────────┐
documents ───▶│ chunking │────────────▶│ embedder │──▶│ vector store │
└─────────────┘ └──────────┘ └──────────────┘
│ cosine top-k
question ──▶ embed ──▶ retrieve ───────────────────────────────┘
│
▼
grounded prompt ──▶ LLM (Groq / OpenRouter / …) ──▶ answer + citations
app/rag/chunking.py— overlapping, boundary-aware text splitting.app/rag/embeddings.py—sentence-transformerswhen available, with a deterministic hashing fallback so it always runs.app/rag/store.py— numpy cosine-similarity store with JSON persistence.app/llm/client.py— provider-agnostic OpenAI-compatible async client.app/rag/pipeline.py— ties it together; grounded system prompt.app/main.py— FastAPI endpoints.
| Method | Path | Description |
|---|---|---|
| GET | /health |
Store size, embedder backend, model |
| POST | /ingest |
Index a document ({text, source}) |
| POST | /query |
Ask a grounded question ({question}) |
Interactive docs at /docs (Swagger UI) once running.
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # set LLM_API_KEY (free key at console.groq.com)
uvicorn app.main:app --reload# index a document
curl -X POST localhost:8000/ingest -H 'content-type: application/json' \
-d '{"text":"The Eiffel Tower is in Paris, France.","source":"facts"}'
# ask a grounded question
curl -X POST localhost:8000/query -H 'content-type: application/json' \
-d '{"question":"Where is the Eiffel Tower?"}'pytest -qTests cover chunking edge cases, embedding normalisation, retrieval relevance,
store round-trip persistence, and the full /ingest → /query flow with the
LLM mocked (no API key needed).
FastAPI · Pydantic v2 · httpx · numpy · sentence-transformers (optional) ·
Docker. LLM provider is pluggable via LLM_BASE_URL / LLM_MODEL.