A minimal, production-ready RAG service built with FastAPI. It embeds queries, retrieves relevant context from a vector store, and generates answers using an LLM. The project uses clean service boundaries and FastAPI dependency injection for testability and performance.
- Accepts a question via an HTTP API.
- Creates an embedding for the question.
- Retrieves top-k relevant documents from a vector database.
- Calls an LLM to generate a grounded answer using those documents.
- Returns the answer, sources, and processing time.
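For example, a client call could look like this (a sketch only: the host, port, and timeout are assumptions; the `/api/v1/query` path and the `answer`/`sources`/`processing_time` fields come from the request flow described below):

```python
# Hypothetical client call; assumes the service is running on localhost:8000.
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/query",
    json={"question": "What does the retry policy cover?"},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
print(data["answer"])           # grounded answer text
print(data["sources"])          # documents the answer was grounded on
print(data["processing_time"])  # server-side processing time
```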
    simple-rag-python/
    ├── api/          # FastAPI routes (health, query)
    ├── config/       # Settings and environment configuration
    ├── models/       # Pydantic request/response models
    ├── server/       # FastAPI app setup and middleware
    ├── services/     # RAGService, embedder, vector store, LLM adapters
    ├── main.py       # Entry point
    └── README.md     # This file
- `api/routes.py`: HTTP endpoints. Uses DI: `Depends(get_rag_service)` (a wiring sketch follows this list).
- `services/pincecone_service.py`: `RAGService` orchestrates embed → retrieve → generate.
- `services/embedder.py`: Creates embeddings.
- `services/vector_store.py`: Vector search (e.g., Pinecone).
- `services/llm_service.py`: Generates a response from the LLM.
- `server/server.py`: App factory, CORS, request timing middleware.
- `config/settings.py`: Centralized configuration via environment variables.
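A minimal sketch of the DI wiring (the provider name `get_rag_service` and the `/api/v1/query` route come from this README; module paths, constructor arguments, and the `answer` method name are assumptions):

```python
# api/routes.py (sketch): one cached RAGService per process, injected via Depends.
from functools import lru_cache

from fastapi import APIRouter, Depends

from models.data_models import QueryRequest, QueryResponse
from services.embedder import OpenAIEmbedder          # class names from the architecture diagram
from services.llm_service import OpenAILLM
from services.pincecone_service import RAGService
from services.vector_store import PineconeVectorStore

router = APIRouter(prefix="/api/v1")


@lru_cache
def get_rag_service() -> RAGService:
    # Built once and reused across requests; avoids a module-level global singleton.
    return RAGService(
        embedder=OpenAIEmbedder(),
        vector_store=PineconeVectorStore(),
        llm=OpenAILLM(),
    )


@router.post("/query", response_model=QueryResponse)
async def query(
    request: QueryRequest,
    rag: RAGService = Depends(get_rag_service),
) -> QueryResponse:
    return await rag.answer(request)  # "answer" is an assumed method name
```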
- Python 3.12+
- Optional: `uv` or `pip`
- Vector DB and LLM credentials (see Environment variables)
Layers

- API: `api/routes.py` (FastAPI routes, validation, DI)
- Service: `services/pincecone_service.py` (`RAGService` orchestration, sketched below)
- Integrations: `services/embedder.py`, `services/vector_store.py`, `services/llm_service.py`
- Platform: `server/server.py` (app factory, CORS, timing), `config/settings.py` (env), `models/data_models.py` (schemas)
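How `RAGService` ties the integrations together, as a hedged sketch (the embed → retrieve → generate order is from this README; method names such as `embed`, `search`, and `generate` and the constructor signature are assumptions):

```python
# services/pincecone_service.py (sketch): embed -> retrieve -> generate.
import time

from models.data_models import QueryRequest, QueryResponse


class RAGService:
    def __init__(self, embedder, vector_store, llm):
        self.embedder = embedder
        self.vector_store = vector_store
        self.llm = llm

    async def answer(self, request: QueryRequest) -> QueryResponse:
        start = time.perf_counter()
        vector = await self.embedder.embed(request.question)           # 1) embed the question
        documents = await self.vector_store.search(                    # 2) retrieve top-k context
            vector, top_k=request.top_k                                 #    top_k is an assumed request field
        )
        answer = await self.llm.generate(request.question, documents)  # 3) grounded generation
        return QueryResponse(
            answer=answer,
            sources=[doc.source for doc in documents],
            processing_time=time.perf_counter() - start,
        )
```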
Request flow

- Client → POST `/api/v1/query`
- FastAPI validates `QueryRequest`, injects `RAGService`
- `RAGService` → Embedder → VectorStore → LLM
- Return `QueryResponse` with `answer`, `sources`, `processing_time` (see the schema sketch below)
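The schemas might look roughly like this (only `answer`, `sources`, and `processing_time` are named above; `top_k` and the `Document` fields are illustrative assumptions):

```python
# models/data_models.py (sketch): Pydantic schemas for the query endpoint.
from pydantic import BaseModel, Field


class QueryRequest(BaseModel):
    question: str = Field(..., min_length=1)
    top_k: int = 5  # assumed knob for how many documents to retrieve


class Document(BaseModel):
    text: str
    source: str
    score: float | None = None  # similarity score from the vector store, if returned


class QueryResponse(BaseModel):
    answer: str
    sources: list[str]
    processing_time: float  # seconds spent inside the service
```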
ASCII diagram

    Client
      |
      v
    API (FastAPI) --> RAGService --> Embedder
                          |             |
                          v             v
                     VectorStore  -->  LLM
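Each box below the API sits behind a small seam, which is what makes the real and mock providers named in the architecture diagram further down interchangeable. A sketch of those seams (the `Protocol` shapes and method names are assumptions; only the class names come from the diagram):

```python
# services/* (sketch): structural interfaces the RAGService depends on.
from typing import Protocol

from models.data_models import Document


class Embedder(Protocol):
    async def embed(self, text: str) -> list[float]: ...


class VectorStore(Protocol):
    async def search(self, vector: list[float], top_k: int) -> list[Document]: ...


class LLM(Protocol):
    async def generate(self, question: str, documents: list[Document]) -> str: ...


class MockEmbedder:
    """Deterministic stand-in for the OpenAI embeddings API, handy in tests."""

    async def embed(self, text: str) -> list[float]:
        return [float(len(text) % 7)] * 8  # fixed-size dummy vector
```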
Notes

- DI uses a cached provider, avoiding global singletons
- Clear boundaries make providers swappable
- Request latency is exposed via `X-Process-Time-ms` (see the middleware sketch below)
- Structured logs via `structlog` in the service layer
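The timing middleware can follow FastAPI's standard `@app.middleware("http")` pattern; a sketch (the app title is an assumption, and the `structlog` call only illustrates the logging style the service layer uses):

```python
# server/server.py (sketch): app factory with CORS and request timing.
import time

import structlog
from fastapi import FastAPI, Request
from fastapi.middleware.cors import CORSMiddleware

logger = structlog.get_logger()


def create_app() -> FastAPI:
    app = FastAPI(title="simple-rag-python")
    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"],
        allow_methods=["*"],
        allow_headers=["*"],
    )

    @app.middleware("http")
    async def add_process_time_header(request: Request, call_next):
        start = time.perf_counter()
        response = await call_next(request)
        elapsed_ms = (time.perf_counter() - start) * 1000
        response.headers["X-Process-Time-ms"] = f"{elapsed_ms:.2f}"
        logger.info("request_completed", path=request.url.path, duration_ms=round(elapsed_ms, 2))
        return response

    return app
```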
Bootstrapping the project using uv and pyenv:
    # 1) Initialize a uv project (creates pyproject.toml if missing)
    uv init

    # 2) Create and activate a virtual environment
    uv venv
    source .venv/bin/activate

    # 3) Import requirements into pyproject (keeps TOML as source of truth)
    uv add --requirements requirements.txt

    # 4) Sync/install dependencies
    uv sync

    # 5) Editable install for this project (if not already)
    uv pip install -e .

Makefile shortcuts:
    make setup # Initializes venv, runs uv init, imports requirements, syncs deps
    make run   # Runs the app via uv: uv run .venv/bin/python main.py
    make test  # Runs tests if present
    make fmt   # Formats with Black (add with: uv add --dev black)
    make deps  # Shows dependency tree (uv tree)
    make lock  # Rebuilds uv lock file
    make clean # Removes caches

The following diagram illustrates the flow across client, API, business logic, data/config, and external services.
    graph TB
        subgraph Client_Layer[Client Layer]
            C[Client Apps]
        end
        subgraph API_Layer[API Layer - api-routes.py]
            R[Router /api/v1]
            H1[Health Handler: GET /health]
            H3[Query Handler: POST /query]
        end
        subgraph Business_Logic[Business Logic - services]
            RS[RAGService]
            EMB[Embedder: OpenAIEmbedder or MockEmbedder]
            VEC[VectorStore: PineconeVectorStore or MockVectorStore]
            LLM[LLM Service: OpenAILLM or MockLLM]
        end
        subgraph External_Services[External Services]
            L1[OpenAI API Chat Completions]
            L2[OpenAI API Embeddings]
            P[Pinecone Index]
        end
        subgraph Data_Config[Data and Config]
            M[Models: QueryRequest, QueryResponse, Document]
            CFG[Settings: config/settings.py]
        end
        C --> R
        R --> H1
        R --> H3
        H3 --> RS
        RS --> EMB
        RS --> VEC
        RS --> LLM
        EMB --> L2
        LLM --> L1
        VEC --> P
        RS --> M
        RS --> CFG
        EMB --> CFG
        VEC --> CFG
        LLM --> CFG
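The `Settings` node corresponds to `config/settings.py`, which centralizes environment-driven configuration. A sketch using `pydantic-settings` (the library choice and every variable name here, such as the OpenAI or Pinecone keys, are assumptions about what the service actually reads):

```python
# config/settings.py (sketch): environment-driven configuration, loaded once.
from functools import lru_cache

from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    openai_api_key: str = ""           # assumed: used by the embedder and LLM adapters
    pinecone_api_key: str = ""         # assumed: used by the vector store
    pinecone_index: str = "documents"  # assumed index name
    top_k: int = 5                     # assumed default number of retrieved documents


@lru_cache
def get_settings() -> Settings:
    # Environment variables override the defaults above; cached per process.
    return Settings()
```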