A retrieval-augmented generation (RAG) system that serves PDF textbooks as conversational Q&A APIs. It extracts content from PDF files, builds vector indexes for semantic search, and answers questions using a local LLM through Ollama. The system exposes a FastAPI server with streaming support and ships a Python SDK client for integration into any application.
- Architecture
- Prerequisites
- Installation
- Quick Start
- Project Structure
- Configuration
- API Reference
- Python SDK
- Adding a New Book
- Development
- Contributing
- License
Server
```text
┌──────────────────────────────────────────────────┐
│                  FastAPI :8000                   │
│                                                  │
│  ┌────────────────┐    ┌───────────────────────┐ │
│  │  RAGEngine #1  │    │     RAGEngine #2      │ │
│  │  FAISS Index   │    │      FAISS Index      │ │
│  │    (Book A)    │    │       (Book B)        │ │
│  └───────┬────────┘    └──────────┬────────────┘ │
│          │                        │              │
│          └───────┬────────────────┘              │
│                  │                               │
│          ┌───────▼────────┐                      │
│          │   Ollama LLM   │                      │
│          │  (qwen2.5:3b)  │                      │
│          └────────────────┘                      │
└──────────────────────┬───────────────────────────┘
                       │ HTTP / SSE
                       │
┌──────────────────────▼───────────────────────────┐
│               Client Application                 │
│ from textbookapi import Principlesofdatascience  │
└──────────────────────────────────────────────────┘
```
The pipeline works as follows:
- PDF text is extracted using PyMuPDF and split into overlapping chunks.
- Chunks are embedded with `all-MiniLM-L6-v2` (sentence-transformers) and stored in a FAISS index.
- On query, the question is embedded and the top matching chunks are retrieved.
- Retrieved context and the question are sent to Ollama, which generates a streamed response.
Embeddings and FAISS indexes are cached to disk. Subsequent startups skip the embedding step entirely.
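For illustration, here is a minimal sketch of that indexing path using the same libraries (PyMuPDF, sentence-transformers, FAISS). The function names are illustrative, not the actual API of `textbookapi/engine.py`:

```python
import fitz  # PyMuPDF
import faiss
from sentence_transformers import SentenceTransformer

def extract_text(pdf_path: str) -> str:
    """Concatenate the text of every page in the PDF."""
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split the text into overlapping character windows."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def build_index(chunks: list[str]) -> faiss.IndexFlatIP:
    """Embed the chunks and store them in a flat inner-product FAISS index."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)
    return index
```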
| Dependency | Version | Purpose |
|---|---|---|
| Python | 3.10+ | Runtime |
| Ollama | Latest | Local LLM inference |
| pip | Latest | Package management |
Install Ollama from https://ollama.com and verify it is running:
```bash
ollama serve
```
The server will automatically pull the required model (qwen2.5:3b) on first startup if it is not already available.
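If you prefer to fetch the model ahead of time, you can pull it yourself:

```bash
ollama pull qwen2.5:3b
```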
Clone the repository and install dependencies:
```bash
git clone https://github.com/ujjwalredd/textbookapi.git
cd textbookapi
pip install -r requirements.txt
```
The `requirements.txt` pins the core dependencies:

```text
PyMuPDF>=1.24.0
faiss-cpu>=1.8.0
sentence-transformers>=3.0.0
requests>=2.31.0
fastapi>=0.104.0
uvicorn[standard]>=0.24.0
httpx>=0.25.0
```
Start Ollama, then launch the API server:

```bash
ollama serve
python -m textbookapi.server
```
The server initializes a RAG engine for each registered book. On first run, this includes extracting text from PDFs, generating embeddings, and building FAISS indexes. The indexes are cached to .cache/ for fast subsequent startups.
An API key is auto-generated on first run and written to api_keys.json. The key is printed in the server logs.
```python
from textbookapi import Principlesofdatascience

client = Principlesofdatascience(api_key="ujjwal-your-key-here")

# Streaming
for token in client.ask("What is data science?", stream=True):
    print(token, end="", flush=True)

# Non-streaming
result = client.ask("Explain supervised learning")
print(result.answer)
print(result.sources)
```

```bash
python Test.py
```
This presents a book selection menu and opens an interactive chat session.
```bash
python LLM.py
```
This runs the RAG pipeline directly without the API server, useful for local testing.
```text
textbookapi/
    __init__.py        Package exports
    config.py          Book registry and RAGConfig dataclass
    engine.py          PDF extraction, chunking, FAISS vector store, Ollama queries
    server.py          FastAPI application with SSE streaming
    client.py          Python SDK with per-book client classes
    models.py          Pydantic request and response schemas
    auth.py            API key management and authentication middleware
textbooks/
    *.pdf              PDF textbook files
LLM.py                 CLI entry point for direct RAG queries
Test.py                Interactive chatbot using the Python SDK
requirements.txt       Python dependencies
api_keys.json          Auto-generated API keys (created on first run)
.cache/                Cached FAISS indexes (created on first run)
```
All configuration is centralized in textbookapi/config.py.
Books are registered in the BOOKS dictionary:
```python
BOOKS = {
    "principlesofdatascience": {
        "title": "Principles of Data Science",
        "pdf": "Principles-of-Data-Science-WEB.pdf",
    },
    "introductiontopythonprogramming": {
        "title": "Introduction to Python Programming",
        "pdf": "Introduction_to_Python_Programming_-_WEB.pdf",
    },
}
```

Retrieval and generation behavior is controlled by the `RAGConfig` dataclass:

| Parameter | Default | Description |
|---|---|---|
| `model_name` | `qwen2.5:3b` | Ollama model for generation |
| `embedding_model` | `all-MiniLM-L6-v2` | Sentence-transformers model for embeddings |
| `chunk_size` | `1000` | Characters per text chunk |
| `chunk_overlap` | `200` | Overlap between consecutive chunks |
| `top_k` | `3` | Number of passages retrieved per query |
| `temperature` | `0.3` | LLM sampling temperature |
| `max_tokens` | `384` | Maximum tokens in LLM response |
| `context_window` | `2048` | Ollama context window size |
GET /health
No authentication required. Returns server status and the state of all loaded books.
Response:
```json
{
  "status": "ready",
  "model": "qwen2.5:3b",
  "books": [
    {"name": "Principles of Data Science", "status": "ready", "index_size": 842},
    {"name": "Introduction to Python Programming", "status": "ready", "index_size": 615}
  ]
}
```

POST /v1/query

```http
Authorization: Bearer ujjwal-your-key-here
Content-Type: application/json
```
Request Body:
| Field | Type | Required | Description |
|---|---|---|---|
| `question` | string | Yes | The question to ask (1 to 2000 characters) |
| `book` | string | Yes | Book identifier from the registry |
| `stream` | boolean | No | Enable server-sent events streaming (default: false) |
| `top_k` | integer | No | Number of passages to retrieve (1 to 10) |
| `temperature` | float | No | Sampling temperature (0.0 to 2.0) |
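For example, a request body that sets every optional field (the question and values are illustrative):

```json
{
  "question": "What is overfitting?",
  "book": "principlesofdatascience",
  "stream": false,
  "top_k": 5,
  "temperature": 0.5
}
```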
Non-streaming Response:
```json
{
  "id": "rag-a1b2c3d4e5f6",
  "answer": "Data science is an interdisciplinary field...",
  "sources": [
    {"text": "Chapter 1 introduces...", "score": 0.847}
  ],
  "model": "qwen2.5:3b",
  "created": 1706745600
}
```

Streaming Response (SSE):

```text
data: {"id":"rag-a1b2c3d4e5f6","delta":"Data","done":false}
data: {"id":"rag-a1b2c3d4e5f6","delta":" science","done":false}
...
data: {"id":"rag-a1b2c3d4e5f6","delta":"","done":true}
data: [DONE]
```
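Any SSE-capable HTTP client can consume this stream. A minimal sketch using httpx (already a project dependency), assuming the server is running locally and the key below is replaced with a real one:

```python
import json
import httpx

with httpx.stream(
    "POST",
    "http://localhost:8000/v1/query",
    headers={"Authorization": "Bearer ujjwal-your-key-here"},
    json={
        "question": "What is data science?",
        "book": "principlesofdatascience",
        "stream": True,
    },
    timeout=120.0,
) as response:
    for line in response.iter_lines():
        # SSE events arrive as lines prefixed with "data: "
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        print(event["delta"], end="", flush=True)
```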
All /v1/* endpoints require a Bearer token in the Authorization header. Keys are stored in api_keys.json and auto-generated on first server startup with the prefix ujjwal-.
The SDK provides one client class per registered book. All classes share the same interface.
| Class | Book |
|---|---|
| `Principlesofdatascience` | Principles of Data Science |
| `Introductiontopythonprogramming` | Introduction to Python Programming |
```python
from textbookapi import Principlesofdatascience

client = Principlesofdatascience(
    api_key="ujjwal-your-key-here",
    base_url="http://localhost:8000",  # optional, this is the default
    timeout=120.0,                     # optional, request timeout in seconds
)

# Non-streaming: returns a QueryResult object
result = client.ask("What is a random variable?")
print(result.answer)
print(result.sources)  # list of SourceInfo objects
print(result.model)

# Streaming: returns a generator that yields tokens
for token in client.ask("Explain Bayes theorem", stream=True):
    print(token, end="", flush=True)

# With optional parameters
result = client.ask("What is regression?", top_k=5, temperature=0.5)

# Health check
status = client.health()
print(status)

# Manual cleanup
client.close()

# Or use as a context manager
with Principlesofdatascience(api_key="ujjwal-...") as client:
    result = client.ask("What is clustering?")
```

Error handling:

```python
from textbookapi import Principlesofdatascience, textbookapiError

try:
    result = client.ask("What is PCA?")
except textbookapiError as e:
    print(e.status_code)
    print(e.message)
```

To add a new textbook to the system:
- Place the PDF file in the `textbooks/` directory.
- Register the book in `textbookapi/config.py`:

```python
BOOKS = {
    # ... existing books ...
    "yourbook": {
        "title": "Your Book Title",
        "pdf": "Your-Book-Filename.pdf",
    },
}
```

- Add a client class in `textbookapi/client.py`:

```python
class Yourbook(_BookClient):
    """Client for 'Your Book Title'."""
    _book_id = "yourbook"
```

- Export the class in `textbookapi/__init__.py`:

```python
from .client import (
    Principlesofdatascience,
    Introductiontopythonprogramming,
    Yourbook,
    textbookapiError,
    QueryResult,
    SourceInfo,
)
```

- Restart the server. The new book will be indexed on first startup and cached for subsequent runs.
```bash
python -m textbookapi.server
```
The server runs on http://localhost:8000 by default. API documentation is available at http://localhost:8000/docs (Swagger UI) and http://localhost:8000/redoc (ReDoc).
```bash
# Health check
curl http://localhost:8000/health

# Non-streaming query
curl -X POST http://localhost:8000/v1/query \
  -H "Authorization: Bearer ujjwal-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{"question": "What is data science?", "book": "principlesofdatascience"}'

# Streaming query
curl -X POST http://localhost:8000/v1/query \
  -H "Authorization: Bearer ujjwal-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{"question": "What is data science?", "book": "principlesofdatascience", "stream": true}'
```
Delete the .cache/ directory to force re-indexing of all books on next startup:
```bash
rm -rf .cache/
```
Contributions are welcome. Please follow the guidelines below to keep the project consistent and maintainable.
- Fork the repository.
- Create a feature branch from `main`: `git checkout -b feature/your-feature-name`
- Install dependencies: `pip install -r requirements.txt`
- Make your changes.
- Test your changes locally by starting the server and running queries.
- Commit with a clear, descriptive message: `git commit -m "Add support for custom embedding models"`
- Push to your fork and open a pull request.
Code Style
- Follow PEP 8 conventions.
- Use type hints for function signatures.
- Keep functions focused and short. If a function exceeds 40 lines, consider splitting it.
Commits
- Write commit messages in the imperative mood ("Add feature", not "Added feature").
- Each commit should represent a single logical change.
- Reference issue numbers in commit messages where applicable.
Pull Requests
- Provide a clear description of what the PR changes and why.
- Keep PRs focused. One feature or fix per PR.
- Ensure the server starts and responds to queries before submitting.
- Update the README if your change affects usage, configuration, or the public API.
Adding Books
- Follow the steps in Adding a New Book.
- Do not commit PDF files to the repository. Add them to `.gitignore`.
- Only commit the configuration and client class changes.
Reporting Issues
- Use GitHub Issues to report bugs or request features.
- Include the Python version, OS, Ollama version, and full error traceback when reporting bugs.
- Describe the expected behavior and the actual behavior.
The following areas are open for improvement:
- Testing: Unit tests for the engine, server endpoints, and client SDK.
- Embedding models: Support for alternative embedding models beyond `all-MiniLM-L6-v2`.
- LLM backends: Support for OpenAI API, Anthropic API, or other inference providers alongside Ollama.
- Document formats: Support for EPUB, DOCX, or HTML in addition to PDF.
- Chunk strategies: Semantic chunking, section-aware splitting, or table extraction.
- Async engine: Converting the synchronous RAG engine to fully async for better server performance.
- Docker: Containerized deployment with Ollama bundled.
- Rate limiting: Per-key rate limiting on the API.
MIT License