Ask questions about your own PDF library in natural language, get cited answers, fully offline — nothing leaves your machine. Hybrid retrieval (dense + sparse) with cross-encoder reranking, ~200 lines of Python.
Built for researchers who can't (or won't) send their corpus to a cloud API. Runs on a single modest GPU, or CPU-only.
The capstone project of Local LLMs for Researchers.
📝 The full story — building this across old + new GPUs, and the 3 gotchas (CPU embeddings, the context trap, and not merging GPUs): the writeup on bric.pe.kr.
A vector search finds passages; it can't reason over them. This is exhaustive, cited question-answering over papers you own — with the whole stack (embeddings, sparse index, reranker, vector DB, LLM) running locally:
PDFs ─▶ chunk ─┬─▶ BGE-M3 dense (Ollama) ─┐
└─▶ BM25 sparse (fastembed) ─┴─▶ Qdrant (embedded, on disk)
│
question ─▶ dense + sparse ─▶ RRF fusion ─▶ cross-encoder rerank ─▶ top passages
│
▼
local LLM (Ollama) ─▶ cited answer
Dense catches meaning, sparse (BM25) catches exact terms (gene names, identifiers); fusing them and then reranking with a cross-encoder gives the LLM noticeably better context than dense-alone.
pip install pypdf qdrant-client fastembed # fastembed is optional (see below)
ollama pull bge-m3 # dense embeddings
ollama pull qwen3:8b # or any chat model; set RAG_LLM to override
python rag.py ingest ./papers # a folder of PDFs (or individual files)
python rag.py ask "What method did paper X use for batch correction?"
python rag.py statsThat's it. No server to run, no API key, no data leaving the box. fastembed (BM25 sparse + the reranker) is optional — without it, retrieval runs dense-only and everything else works the same. Upgrading from an older dense-only index? The vector schema changed — delete rag_qdrant/ and re-ingest once.
$ python rag.py ask "Which gene passed all three colocalization tests, and why was DRD2 demoted?"
SLC12A5 (KCC2) was the only gene that passed all three tests (SMR, HEIDI, and COLOC) [1].
DRD2 was demoted because it did not colocalize (COLOC PP4 ≈ 0): the GWAS and eQTL signals
point to distinct variants, so the association is likely LD-driven, not cis-causal [1][3].
Sources:
[1] paper.pdf p.1 (rerank 4.054)
[2] paper.pdf p.10 (rerank 3.938)
...
(14s, hybrid+rerank, fully local - nothing left this machine)
mcp_server.py exposes the corpus as MCP tools, so an agent can search and cite your papers:
search_papers(query, k)— hybrid + reranked passages with{source, page}citations, no LLM (let the client's own model reason over them).ask_papers(question)— retrieve + answer with your local LLM, inline[n]citations.corpus_stats()— index size and retrieval mode.
pip install mcp
python rag.py ingest ./papers # ingest first (CLI), then start the server
python mcp_server.pyPoint your client at it (Claude Desktop claude_desktop_config.json, or a project .mcp.json):
{
"mcpServers": {
"paper-rag": {
"command": "python",
"args": ["/abs/path/to/mcp_server.py"],
"env": { "RAG_DB": "/abs/path/to/rag_qdrant",
"OLLAMA_URL": "http://127.0.0.1:11434", "RAG_LLM": "qwen3:8b" }
}
}
}Your corpus never leaves the machine — only the tool calls and answers cross to the client. (Embedded Qdrant is single-process: ingest with the server stopped, then serve.)
- Chunking — PDFs → text (
pypdf) → ~1400-char overlapping chunks, tagged with source + page. - Embeddings — BGE-M3 dense via Ollama (
/api/embed), 1024-dim, multilingual, plus BM25 sparse (fastembed) for exact-term matching. - Storage — Qdrant in embedded mode (
QdrantClient(path=...)) with named dense + sparse vectors — a real vector DB, no Docker, no server, just a folder on disk. - Retrieval — hybrid: dense and sparse top-20 are fused with Reciprocal Rank Fusion, then a cross-encoder reranker (fastembed, CPU) reorders them, and the best 5 passages go to the LLM.
- Answer — a local LLM answers only from the retrieved context with
[n]citations. - Fallback — no
fastembed? It runs dense-only (cosine top-k, no sparse, no rerank); same CLI, same output.
| Var | Default | |
|---|---|---|
OLLAMA_URL |
http://127.0.0.1:11434 |
Ollama serving the LLM |
RAG_EMBED_URL |
= OLLAMA_URL |
Ollama serving embeddings (can be a different box) |
RAG_EMBED |
bge-m3 |
embedding model |
RAG_LLM |
qwen3:8b |
any Ollama chat model |
RAG_DB |
./rag_qdrant |
vector DB folder |
RAG_RERANK |
Xenova/ms-marco-MiniLM-L-6-v2 |
any fastembed cross-encoder (e.g. jinaai/jina-reranker-v2-base-multilingual, BAAI/bge-reranker-base) |
RAG_SPARSE |
Qdrant/bm25 |
fastembed sparse model |
RAG_NUM_CTX |
8192 |
LLM context cap — keeps big-native-context models fully on-GPU instead of spilling KV cache to CPU |
RAG_EMBED_TIMEOUT |
120 |
seconds before an embed request is retried |
Split the heavy and the interactive work across boxes — e.g. a fast 24 GB GPU for the LLM, and a cheaper/older box (or plain CPU) for embedding, which is steadier under a long ingest:
# Ingest where the DB lives — embeddings on CPU / an old GPU (stable):
RAG_EMBED=bge-m3-cpu python rag.py ingest ./papers
# Serve: LLM on the fast GPU box, embeddings still on the local/cheap box:
OLLAMA_URL=http://gpu-box:11434 RAG_LLM=qwen3:27b \
RAG_EMBED_URL=http://127.0.0.1:11434 RAG_EMBED=bge-m3-cpu \
python mcp_server.pyRAG_EMBED_URL points embeddings at a different Ollama than OLLAMA_URL (the LLM). Dense vectors are defined by the model, so any box running the same bge-m3 produces compatible vectors — you can even ingest on one box and serve from another. RAG_DB is the index folder (embedded Qdrant is a local directory; keep it where you ingest and serve).
- Reasoning models hide the answer. If your LLM emits a
<think>block, its answer can come back empty. This uses"think": falseso you get the final answer, not the chain-of-thought. - Qdrant's API moved. Recent
qdrant-clientusesquery_points(), not the oldsearch(). - Embedded Qdrant is underrated — you get the real engine without standing up a server; perfect for a single-machine private tool.
- Embedders can stall under load. On some setups (notably older GPUs and WSL) a long ingest can make the embedding model hang. So
ingestsaves progress per batch with deterministic IDs: if it stalls, re-run the same command and it resumes — and a stuck embed call fails fast (see Troubleshooting) instead of hanging forever. - A huge native context silently halves your speed. Some models ship a giant native context (e.g. 256K); Ollama then allocates a KV cache to match, which can overflow VRAM and spill to CPU — on a 24 GB card a 27B model dropped from ~36 to ~17 tok/s.
RAG_NUM_CTX(default 8192) caps it so the model stays fully on-GPU. RAG prompts are small, so a modest context loses nothing.
ingest hangs, or an embed call times out. The embedding model (in Ollama) has stalled — a known hiccup under sustained load, especially on older GPUs or under WSL. Fix:
ollama stop bge-m3 # unload the stuck model
python rag.py ingest ./papers # re-run the SAME ingest — it resumes where it stoppedOn WSL, if Ollama or even nvidia-smi won't respond at all (the GPU is wedged), reset the VM from Windows PowerShell, then re-run the ingest:
wsl --shutdownNothing is lost — already-embedded batches are on disk, so the re-run only embeds what's left.
If it keeps wedging, run the embedder on CPU (the real fix on some setups — notably older Pascal GPUs under WSL, where the embedding model can repeatedly hang the GPU). The embedder is tiny (~1 GB), so pin it to CPU and keep your LLM on the GPU:
printf 'FROM bge-m3\nPARAMETER num_gpu 0\n' > bge-m3-cpu.Modelfile
ollama create bge-m3-cpu -f bge-m3-cpu.Modelfile
RAG_EMBED=bge-m3-cpu python rag.py ingest ./papers # embeddings on CPU, answers still on the GPUCPU embedding is plenty fast for a personal library (~100 chunks in well under a minute) and sidesteps the GPU hang entirely.
- Naive fixed-size chunking (no semantic/late chunking yet). Good enough for "find and answer," not a production search system.
- Reranking runs on CPU (fastembed/ONNX) — fast for a personal library (reranking ~20 candidates is sub-second to a couple of seconds); for a very large corpus a GPU rerank server (TEI/Infinity) would scale better.
- Answer quality is your local model's quality — verify domain-specific claims (a small local model can be fluent and wrong).
- Text-based PDFs only (scanned/image PDFs need OCR first).
MIT.