Ask questions about your documents. Get cited answers.
A self-hostable RAG (Retrieval-Augmented Generation) CLI + API that uses hybrid retrieval (BM25 + vector search) and cross-encoder reranking to find the most relevant chunks before sending them to an LLM. Answers are forced to cite their sources.
```
$ docwhisper ask "What is the return policy?"

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Q: What is the return policy?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 You may return most items within 30 days of delivery [1].
 Items must be unused, in original packaging, and include
 the original receipt [1]. Refunds are processed within
 5–7 business days [2].

 ── Sources ────────────────────────────────────────────────
 [1] docs/returns_policy.md
     "You may return most items within 30 days of the
      delivery date for a full refund..."
 [2] docs/returns_policy.md
     "Approved refunds are processed within 5–7 business
      days to your original payment method..."
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
Most RAG tutorials use pure vector search. That's fine for semantic queries but misses exact keyword matches ("what is the Section 4.2 clause?"). BM25 catches those. Combining both, then reranking with a cross-encoder, is what actually works in production.
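The merge step after hybrid retrieval doesn't need tuned score weights; rank-based fusion such as reciprocal rank fusion (RRF) combines and deduplicates the two candidate lists using only their ranks. A minimal sketch of that idea (not necessarily docwhisper's exact merge logic):

```python
def rrf_merge(bm25_ids, vector_ids, k=60, top_n=20):
    """Reciprocal rank fusion: score each doc id by 1/(k + rank),
    summed over every ranked list it appears in, then sort.
    Deduplication falls out for free: a doc in both lists gets
    one combined score."""
    scores = {}
    for ranking in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # sort by score descending; break ties by id for determinism
    merged = sorted(scores, key=lambda d: (-scores[d], d))
    return merged[:top_n]

# A doc ranked by both retrievers beats one ranked by only one:
print(rrf_merge(["a", "b", "c"], ["b", "d", "a"], top_n=3))
# → ['b', 'a', 'd']
```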
I built this to have a clean reference implementation I could point people to.
```
Your docs (.txt / .md / .pdf / .html)
             │
             ▼
        [ Chunker ]        ← sliding window, configurable size + overlap
             │
        ┌────┴────┐
        │         │
    [ BM25 ] [ Vector ]    ← run in parallel, top-20 each
        │         │
        └────┬────┘
             │  merge + deduplicate
             ▼
    [ Cross-Encoder ]      ← rerank to top-5
             │
             ▼
        [ LLM ]            ← gpt-4o-mini / llama3 / groq / etc.
             │
             ▼
    Answer + Citations
```
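The chunking step in the diagram is a plain sliding window over words: fixed size, fixed overlap. A minimal sketch of that idea (not the exact implementation in ingest.py):

```python
def chunk_words(text, size=512, overlap=64):
    """Split text into word-based chunks of `size` words,
    each overlapping the previous chunk by `overlap` words."""
    words = text.split()
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # last window reached the end
            break
    return chunks

text = " ".join(str(i) for i in range(10))
print(chunk_words(text, size=4, overlap=2))
# → ['0 1 2 3', '2 3 4 5', '4 5 6 7', '6 7 8 9']
```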
```
git clone https://github.com/sanjaychelliah/docwhisper
cd docwhisper
pip install -e .
```

For PDF support:

```
pip install -e ".[pdf]"
```

For the REST API:

```
pip install -e ".[server]"
```

```
cp .env.example .env
# edit .env and add your OPENAI_API_KEY
export OPENAI_API_KEY=sk-...
```

Works with any OpenAI-compatible API (see Using other LLMs).

```
mkdir docs
cp your_files.md docs/   # .txt, .md, .pdf, .html all work

docwhisper ingest --docs-dir ./docs
```

This downloads the embedding model on first run (~90 MB), encodes all your chunks, and saves the index to .docwhisper_index/.
```
docwhisper ask "What is the cancellation policy?"
docwhisper ask "How do I reset my password?"
```

```python
from docwhisper.pipeline import DocWhisper

dw = DocWhisper(docs_dir="./docs")
dw.ingest()  # skip if already indexed, use dw.load() instead

answer = dw.ask("What is the refund timeline?")
print(answer.answer)         # the answer text
print(answer.citations)      # list of cited chunks with sources
print(answer.has_citations)  # False if the LLM answered from general knowledge
print(answer.format())       # pretty-printed version
```

If the index already exists, skip re-ingesting:

```python
dw = DocWhisper(docs_dir="./docs")
dw.load()  # load from disk, no re-embedding
answer = dw.ask("...")
```

```
# start the server
uvicorn docwhisper.server:app --reload

# index docs
curl -X POST http://localhost:8000/ingest

# ask a question
curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the return window?"}'
```

Response:
```json
{
  "question": "What is the return window?",
  "answer": "You can return items within 30 days [1].",
  "citations": [
    {
      "ref": "[1]",
      "source": "docs/returns_policy.md",
      "excerpt": "You may return most items within 30 days of the delivery date..."
    }
  ],
  "has_citations": true,
  "model": "gpt-4o-mini"
}
```

Swagger docs at http://localhost:8000/docs.
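From Python, the same endpoint can be called with nothing but the standard library. A sketch: the `/ask` request shape matches the curl example above, while `print_answer` is a hypothetical helper for rendering the response, not part of docwhisper:

```python
import json
import urllib.request

def ask(question, base_url="http://localhost:8000"):
    """POST a question to a running docwhisper server."""
    req = urllib.request.Request(
        f"{base_url}/ask",
        data=json.dumps({"question": question}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def print_answer(resp):
    """Render the /ask response roughly like the CLI does."""
    lines = [resp["answer"]]
    for c in resp["citations"]:
        lines.append(f'{c["ref"]} {c["source"]}')
    return "\n".join(lines)

# With the server running:
# resp = ask("What is the return window?")
# print(print_answer(resp))
```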
docwhisper uses the OpenAI client under the hood, so you can point OPENAI_API_BASE at any compatible endpoint.

Ollama (fully local, free):

```
# install ollama: https://ollama.com
ollama pull llama3.2

export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=ollama   # any non-empty string
export DOCWHISPER_LLM_MODEL=llama3.2

docwhisper ask "..."
```

Groq (fast, cheap):

```
export OPENAI_API_BASE=https://api.groq.com/openai/v1
export OPENAI_API_KEY=gsk_...
export DOCWHISPER_LLM_MODEL=llama-3.3-70b-versatile
```

Together AI:

```
export OPENAI_API_BASE=https://api.together.xyz/v1
export OPENAI_API_KEY=...
export DOCWHISPER_LLM_MODEL=meta-llama/Llama-3-70b-chat-hf
```

All settings can be changed via environment variables or a .env file. See .env.example for the full list.
| Variable | Default | What it does |
|---|---|---|
| `OPENAI_API_KEY` | (required) | Your LLM API key |
| `OPENAI_API_BASE` | OpenAI | API endpoint (swap for Ollama/Groq/etc.) |
| `DOCWHISPER_LLM_MODEL` | `gpt-4o-mini` | Model name |
| `DOCWHISPER_DOCS_DIR` | `docs/` | Where your documents live |
| `DOCWHISPER_INDEX_DIR` | `.docwhisper_index` | Where the built index is stored |
| `DOCWHISPER_EMBED_MODEL` | `all-MiniLM-L6-v2` | Sentence-transformers model for embeddings |
| `DOCWHISPER_RERANK_MODEL` | `cross-encoder/ms-marco-MiniLM-L-6-v2` | Cross-encoder for reranking |
| `DOCWHISPER_BM25_TOP_K` | `20` | Candidates from BM25 |
| `DOCWHISPER_VECTOR_TOP_K` | `20` | Candidates from vector search |
| `DOCWHISPER_RERANK_TOP_K` | `5` | Final chunks sent to the LLM after reranking |
| `DOCWHISPER_CHUNK_SIZE` | `512` | Words per chunk |
| `DOCWHISPER_CHUNK_OVERLAP` | `64` | Overlap between consecutive chunks |
| `DOCWHISPER_REQUIRE_CITATIONS` | `true` | Warn (and exit 1) if answer has no citations |
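A typical resolution order for settings like these is environment variable first, then the default. A minimal sketch of that lookup, using names from the table above (not the actual config.py, which also reads .env):

```python
import os

# Defaults mirroring a few entries from the table above
DEFAULTS = {
    "DOCWHISPER_LLM_MODEL": "gpt-4o-mini",
    "DOCWHISPER_CHUNK_SIZE": "512",
    "DOCWHISPER_CHUNK_OVERLAP": "64",
}

def setting(name):
    """Environment variable wins; otherwise fall back to the default."""
    return os.environ.get(name, DEFAULTS[name])

os.environ["DOCWHISPER_CHUNK_SIZE"] = "256"
print(setting("DOCWHISPER_CHUNK_SIZE"))  # the override: 256
print(setting("DOCWHISPER_LLM_MODEL"))   # the default: gpt-4o-mini
```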
There's a simple eval runner that checks citation presence, answer relevance, and whether the right source was retrieved. Useful for catching regressions after you change models or chunk settings.
Create an eval file:
```json
[
  {
    "question": "What is the return window?",
    "expected_keywords": ["30 days", "refund"],
    "expected_source_hint": "returns_policy"
  }
]
```

Run it:

```
python -m docwhisper.eval --eval-file eval_questions.json
```

Output:
```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 docwhisper eval report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 ✓ Q: What is the return window?
     citations : ✓
     relevance : ✓
     source    : ✓
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Result: 1/1 cases passed
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
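The three per-case checks are simple to reproduce: a regex for citation markers, keyword containment for relevance, and a substring match on the cited sources. A sketch of the idea (not eval.py's exact logic; `check_case` is a hypothetical helper):

```python
import re

def check_case(answer, citation_sources, case):
    """Run the three eval checks on one case; return pass/fail per check."""
    has_citation = bool(re.search(r"\[\d+\]", answer))
    relevant = all(kw.lower() in answer.lower()
                   for kw in case["expected_keywords"])
    right_source = any(case["expected_source_hint"] in src
                       for src in citation_sources)
    return {"citations": has_citation,
            "relevance": relevant,
            "source": right_source}

result = check_case(
    "You may return items within 30 days for a refund [1].",
    ["docs/returns_policy.md"],
    {"expected_keywords": ["30 days", "refund"],
     "expected_source_hint": "returns_policy"},
)
print(result)  # all three checks pass
```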
The CI pipeline (.github/workflows/ci.yml) runs unit tests on every push and runs the eval suite on main when OPENAI_API_KEY is available as a repository secret.
```
docwhisper/
├── docwhisper/
│   ├── config.py      ← all settings, env-var overridable
│   ├── ingest.py      ← load docs, chunk, build BM25 + vector index
│   ├── retrieve.py    ← hybrid retrieval + cross-encoder reranking
│   ├── answer.py      ← LLM call + citation enforcement
│   ├── pipeline.py    ← high-level DocWhisper class
│   ├── cli.py         ← command-line interface
│   ├── server.py      ← FastAPI REST server
│   └── eval.py        ← evaluation runner
├── tests/             ← pytest unit tests (no LLM needed)
├── examples/
│   ├── sample_docs/   ← example markdown docs to try
│   ├── eval_questions.json
│   ├── run_quickstart.py
│   └── use_ollama.py
├── .github/workflows/ci.yml
├── .env.example
└── pyproject.toml
```
| Format | Requires |
|---|---|
| `.txt` | nothing |
| `.md` | nothing |
| `.pdf` | `pip install -e ".[pdf]"` |
| `.html` | `pip install -e ".[html]"` |
- The chunker is word-based, not sentence-aware, so it can cut mid-sentence. Good enough for most cases, but if you're indexing structured docs with short, dense paragraphs you may want to lower `DOCWHISPER_CHUNK_SIZE`.
- The vector index is a flat numpy array (no FAISS or HNSW). Fine up to ~50k chunks; it starts getting slow beyond that. Adding FAISS is a one-file change if you need it.
- No streaming support on the CLI yet; the full answer comes back at once.
- Eval metrics are naive (keyword overlap, not semantic similarity). Good for smoke tests, not for benchmarking model quality.
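For context on the flat-index limitation: a flat vector index is just brute-force cosine similarity against every stored embedding, which is O(n) per query. In pure Python (the real code does the same with numpy, just vectorized):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def flat_search(query_vec, index, top_k=2):
    """Score the query against every vector in the index (O(n)),
    then keep the top_k highest-scoring chunk ids."""
    scored = [(cosine(query_vec, vec), chunk_id)
              for chunk_id, vec in index.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_k]]

index = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [0.7, 0.7]}
print(flat_search([1.0, 0.1], index))  # → ['c1', 'c3']
```

An ANN structure like FAISS's HNSW avoids scoring every vector, which is why it pays off past tens of thousands of chunks.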