A minimal, production-ready RAG service built with FastAPI. It embeds queries, retrieves relevant context from a vector store, and generates answers using an LLM. The project uses clean service boundaries and FastAPI dependency injection for testability and performance.
- Accepts a question via an HTTP API.
- Creates an embedding for the question.
- Retrieves top-k relevant documents from a vector database.
- Calls an LLM to generate a grounded answer using those documents.
- Returns the answer, sources, and processing time.
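For example, a client call could look like this (a sketch only: the host, port, and timeout are assumptions; the `/api/v1/query` path and the `answer`/`sources`/`processing_time` fields come from the request flow described below):

```python
# Hypothetical client call; assumes the service is running on localhost:8000.
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/query",
    json={"question": "What does the retry policy cover?"},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
print(data["answer"])           # grounded answer text
print(data["sources"])          # documents the answer was grounded on
print(data["processing_time"])  # server-side processing time
```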
    simple-rag-python/
    ├── api/          # FastAPI routes (health, query)
    ├── config/       # Settings and environment configuration
    ├── models/       # Pydantic request/response models
    ├── server/       # FastAPI app setup and middleware
    ├── services/     # RAGService, embedder, vector store, LLM adapters
    ├── main.py       # Entry point
    └── README.md     # This file
- `api/routes.py`: HTTP endpoints. Uses DI: `Depends(get_rag_service)` (a wiring sketch follows this list).
- `services/pincecone_service.py`: `RAGService` orchestrates embed → retrieve → generate.
- `services/embedder.py`: Creates embeddings.
- `services/vector_store.py`: Vector search (e.g., Pinecone).
- `services/llm_service.py`: Generates a response from the LLM.
- `server/server.py`: App factory, CORS, request timing middleware.
- `config/settings.py`: Centralized configuration via environment variables.
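A minimal sketch of the DI wiring (the provider name `get_rag_service` and the `/api/v1/query` route come from this README; module paths, constructor arguments, and the `answer` method name are assumptions):

```python
# api/routes.py (sketch): one cached RAGService per process, injected via Depends.
from functools import lru_cache

from fastapi import APIRouter, Depends

from models.data_models import QueryRequest, QueryResponse
from services.embedder import OpenAIEmbedder          # class names from the architecture diagram
from services.llm_service import OpenAILLM
from services.pincecone_service import RAGService
from services.vector_store import PineconeVectorStore

router = APIRouter(prefix="/api/v1")


@lru_cache
def get_rag_service() -> RAGService:
    # Built once and reused across requests; avoids a module-level global singleton.
    return RAGService(
        embedder=OpenAIEmbedder(),
        vector_store=PineconeVectorStore(),
        llm=OpenAILLM(),
    )


@router.post("/query", response_model=QueryResponse)
async def query(
    request: QueryRequest,
    rag: RAGService = Depends(get_rag_service),
) -> QueryResponse:
    return await rag.answer(request)  # "answer" is an assumed method name
```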
- Python 3.12+
- Optional: `uv` or `pip`
- Vector DB and LLM credentials (see Environment variables)
Layers

- API: `api/routes.py` (FastAPI routes, validation, DI)
- Service: `services/pincecone_service.py` (`RAGService` orchestration, sketched below)
- Integrations: `services/embedder.py`, `services/vector_store.py`, `services/llm_service.py`
- Platform: `server/server.py` (app factory, CORS, timing), `config/settings.py` (env), `models/data_models.py` (schemas)
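How `RAGService` ties the integrations together, as a hedged sketch (the embed → retrieve → generate order is from this README; method names such as `embed`, `search`, and `generate` and the constructor signature are assumptions):

```python
# services/pincecone_service.py (sketch): embed -> retrieve -> generate.
import time

from models.data_models import QueryRequest, QueryResponse


class RAGService:
    def __init__(self, embedder, vector_store, llm):
        self.embedder = embedder
        self.vector_store = vector_store
        self.llm = llm

    async def answer(self, request: QueryRequest) -> QueryResponse:
        start = time.perf_counter()
        vector = await self.embedder.embed(request.question)           # 1) embed the question
        documents = await self.vector_store.search(                    # 2) retrieve top-k context
            vector, top_k=request.top_k                                 #    top_k is an assumed request field
        )
        answer = await self.llm.generate(request.question, documents)  # 3) grounded generation
        return QueryResponse(
            answer=answer,
            sources=[doc.source for doc in documents],
            processing_time=time.perf_counter() - start,
        )
```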
Request flow

- Client → POST `/api/v1/query`
- FastAPI validates `QueryRequest`, injects `RAGService`
- `RAGService` → Embedder → VectorStore → LLM
- Return `QueryResponse` with `answer`, `sources`, `processing_time` (see the schema sketch below)
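The schemas might look roughly like this (only `answer`, `sources`, and `processing_time` are named above; `top_k` and the `Document` fields are illustrative assumptions):

```python
# models/data_models.py (sketch): Pydantic schemas for the query endpoint.
from pydantic import BaseModel, Field


class QueryRequest(BaseModel):
    question: str = Field(..., min_length=1)
    top_k: int = 5  # assumed knob for how many documents to retrieve


class Document(BaseModel):
    text: str
    source: str
    score: float | None = None  # similarity score from the vector store, if returned


class QueryResponse(BaseModel):
    answer: str
    sources: list[str]
    processing_time: float  # seconds spent inside the service
```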
ASCII diagram

    Client
      |
      v
    API (FastAPI) --> RAGService --> Embedder
                          |             |
                          v             v
                     VectorStore  -->  LLM
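Each box below the API sits behind a small seam, which is what makes the real and mock providers named in the architecture diagram further down interchangeable. A sketch of those seams (the `Protocol` shapes and method names are assumptions; only the class names come from the diagram):

```python
# services/* (sketch): structural interfaces the RAGService depends on.
from typing import Protocol

from models.data_models import Document


class Embedder(Protocol):
    async def embed(self, text: str) -> list[float]: ...


class VectorStore(Protocol):
    async def search(self, vector: list[float], top_k: int) -> list[Document]: ...


class LLM(Protocol):
    async def generate(self, question: str, documents: list[Document]) -> str: ...


class MockEmbedder:
    """Deterministic stand-in for the OpenAI embeddings API, handy in tests."""

    async def embed(self, text: str) -> list[float]:
        return [float(len(text) % 7)] * 8  # fixed-size dummy vector
```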
Notes

- DI uses a cached provider, avoiding global singletons
- Clear boundaries make providers swappable
- Request latency is exposed via `X-Process-Time-ms` (see the middleware sketch below)
- Structured logs via `structlog` in the service layer
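The timing middleware can follow FastAPI's standard `@app.middleware("http")` pattern; a sketch (the app title is an assumption, and the `structlog` call only illustrates the logging style the service layer uses):

```python
# server/server.py (sketch): app factory with CORS and request timing.
import time

import structlog
from fastapi import FastAPI, Request
from fastapi.middleware.cors import CORSMiddleware

logger = structlog.get_logger()


def create_app() -> FastAPI:
    app = FastAPI(title="simple-rag-python")
    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"],
        allow_methods=["*"],
        allow_headers=["*"],
    )

    @app.middleware("http")
    async def add_process_time_header(request: Request, call_next):
        start = time.perf_counter()
        response = await call_next(request)
        elapsed_ms = (time.perf_counter() - start) * 1000
        response.headers["X-Process-Time-ms"] = f"{elapsed_ms:.2f}"
        logger.info("request_completed", path=request.url.path, duration_ms=round(elapsed_ms, 2))
        return response

    return app
```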
Bootstrapping the project using uv and pyenv:
    # 1) Initialize a uv project (creates pyproject.toml if missing)
    uv init

    # 2) Create and activate a virtual environment
    uv venv
    source .venv/bin/activate

    # 3) Import requirements into pyproject (keeps TOML as source of truth)
    uv add --requirements requirements.txt

    # 4) Sync/install dependencies
    uv sync

    # 5) Editable install for this project (if not already)
    uv pip install -e .

Makefile shortcuts:
    make setup # Initializes venv, runs uv init, imports requirements, syncs deps
    make run   # Runs the app via uv: uv run .venv/bin/python main.py
    make test  # Runs tests if present
    make fmt   # Formats with Black (add with: uv add --dev black)
    make deps  # Shows dependency tree (uv tree)
    make lock  # Rebuilds uv lock file
    make clean # Removes caches

The following diagram illustrates the flow across client, API, business logic, data/config, and external services.
    graph TB
        subgraph Client_Layer[Client Layer]
            C[Client Apps]
        end
        subgraph API_Layer[API Layer - api-routes.py]
            R[Router /api/v1]
            H1[Health Handler: GET /health]
            H3[Query Handler: POST /query]
        end
        subgraph Business_Logic[Business Logic - services]
            RS[RAGService]
            EMB[Embedder: OpenAIEmbedder or MockEmbedder]
            VEC[VectorStore: PineconeVectorStore or MockVectorStore]
            LLM[LLM Service: OpenAILLM or MockLLM]
        end
        subgraph External_Services[External Services]
            L1[OpenAI API Chat Completions]
            L2[OpenAI API Embeddings]
            P[Pinecone Index]
        end
        subgraph Data_Config[Data and Config]
            M[Models: QueryRequest, QueryResponse, Document]
            CFG[Settings: config/settings.py]
        end
        C --> R
        R --> H1
        R --> H3
        H3 --> RS
        RS --> EMB
        RS --> VEC
        RS --> LLM
        EMB --> L2
        LLM --> L1
        VEC --> P
        RS --> M
        RS --> CFG
        EMB --> CFG
        VEC --> CFG
        LLM --> CFG
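The `Settings` node corresponds to `config/settings.py`, which centralizes environment-driven configuration. A sketch using `pydantic-settings` (the library choice and every variable name here, such as the OpenAI or Pinecone keys, are assumptions about what the service actually reads):

```python
# config/settings.py (sketch): environment-driven configuration, loaded once.
from functools import lru_cache

from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    openai_api_key: str = ""           # assumed: used by the embedder and LLM adapters
    pinecone_api_key: str = ""         # assumed: used by the vector store
    pinecone_index: str = "documents"  # assumed index name
    top_k: int = 5                     # assumed default number of retrieved documents


@lru_cache
def get_settings() -> Settings:
    # Environment variables override the defaults above; cached per process.
    return Settings()
```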