Skip to content

sanjaychelliah/docwhisper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

1 Commit
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

docwhisper πŸ”

Ask questions about your documents. Get cited answers.

A self-hostable RAG (Retrieval-Augmented Generation) CLI + API that uses hybrid retrieval (BM25 + vector search) and cross-encoder reranking to find the most relevant chunks before sending them to an LLM. Answers are forced to cite their sources.

$ docwhisper ask "What is the return policy?"

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Q: What is the return policy?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

You may return most items within 30 days of delivery [1].
Items must be unused, in original packaging, and include
the original receipt [1]. Refunds are processed within
5–7 business days [2].

── Sources ──────────────────────────────────────────────────
  [1]  docs/returns_policy.md
       "You may return most items within 30 days of the
        delivery date for a full refund..."

  [2]  docs/returns_policy.md
       "Approved refunds are processed within 5–7 business
        days to your original payment method..."
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Why this exists

Most RAG tutorials use pure vector search. That's fine for semantic queries but misses exact keyword matches ("what is the Section 4.2 clause?"). BM25 catches those. Combining both, then reranking with a cross-encoder, is what actually works in production.

I built this to have a clean reference implementation I could point people to.


How it works

Your docs (.txt / .md / .pdf / .html)
         β”‚
         β–Ό
    [ Chunker ]   β€” sliding window, configurable size + overlap
         β”‚
    β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”
    β”‚          β”‚
[ BM25 ]  [ Vector ]   β€” run in parallel, top-20 each
    β”‚          β”‚
    β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
         β”‚  merge + deduplicate
         β–Ό
  [ Cross-Encoder ]    β€” rerank to top-5
         β”‚
         β–Ό
      [ LLM ]          β€” gpt-4o-mini / llama3 / groq / etc.
         β”‚
         β–Ό
  Answer + Citations

Quickstart

1. Install

git clone https://github.com/sanjaychelliah/docwhisper
cd docwhisper
pip install -e .

For PDF support:

pip install -e ".[pdf]"

For the REST API:

pip install -e ".[server]"

2. Set your API key

cp .env.example .env
# edit .env and add your OPENAI_API_KEY
export OPENAI_API_KEY=sk-...

Works with any OpenAI-compatible API β€” see Using other LLMs.

3. Add your documents

mkdir docs
cp your_files.md docs/          # .txt, .md, .pdf, .html all work

4. Index

docwhisper ingest --docs-dir ./docs

This downloads the embedding model on first run (~90MB), encodes all your chunks, and saves the index to .docwhisper_index/.

5. Ask

docwhisper ask "What is the cancellation policy?"
docwhisper ask "How do I reset my password?"

Python API

from docwhisper.pipeline import DocWhisper

dw = DocWhisper(docs_dir="./docs")
dw.ingest()  # skip if already indexed, use dw.load() instead

answer = dw.ask("What is the refund timeline?")

print(answer.answer)      # the answer text
print(answer.citations)   # list of cited chunks with sources
print(answer.has_citations)  # False if the LLM answered from general knowledge
print(answer.format())    # pretty-printed version

If the index already exists, skip re-ingesting:

dw = DocWhisper(docs_dir="./docs")
dw.load()   # load from disk, no re-embedding
answer = dw.ask("...")

REST API

# start the server
uvicorn docwhisper.server:app --reload

# index docs
curl -X POST http://localhost:8000/ingest

# ask a question
curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the return window?"}'

Response:

{
  "question": "What is the return window?",
  "answer": "You can return items within 30 days [1].",
  "citations": [
    {
      "ref": "[1]",
      "source": "docs/returns_policy.md",
      "excerpt": "You may return most items within 30 days of the delivery date..."
    }
  ],
  "has_citations": true,
  "model": "gpt-4o-mini"
}

Swagger docs at http://localhost:8000/docs.


Using other LLMs

docwhisper uses the OpenAI client under the hood β€” just point OPENAI_API_BASE at any compatible endpoint.

Ollama (fully local, free):

# install ollama: https://ollama.com
ollama pull llama3.2

export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=ollama   # any non-empty string
export DOCWHISPER_LLM_MODEL=llama3.2
docwhisper ask "..."

Groq (fast, cheap):

export OPENAI_API_BASE=https://api.groq.com/openai/v1
export OPENAI_API_KEY=gsk_...
export DOCWHISPER_LLM_MODEL=llama-3.3-70b-versatile

Together AI:

export OPENAI_API_BASE=https://api.together.xyz/v1
export OPENAI_API_KEY=...
export DOCWHISPER_LLM_MODEL=meta-llama/Llama-3-70b-chat-hf

Configuration

All settings can be changed via environment variables or a .env file. See .env.example for the full list.

Variable Default What it does
OPENAI_API_KEY β€” Your LLM API key
OPENAI_API_BASE OpenAI API endpoint (swap for Ollama/Groq/etc.)
DOCWHISPER_LLM_MODEL gpt-4o-mini Model name
DOCWHISPER_DOCS_DIR docs/ Where your documents live
DOCWHISPER_INDEX_DIR .docwhisper_index Where the built index is stored
DOCWHISPER_EMBED_MODEL all-MiniLM-L6-v2 Sentence-transformers model for embeddings
DOCWHISPER_RERANK_MODEL cross-encoder/ms-marco-MiniLM-L-6-v2 Cross-encoder for reranking
DOCWHISPER_BM25_TOP_K 20 Candidates from BM25
DOCWHISPER_VECTOR_TOP_K 20 Candidates from vector search
DOCWHISPER_RERANK_TOP_K 5 Final chunks sent to LLM after reranking
DOCWHISPER_CHUNK_SIZE 512 Words per chunk
DOCWHISPER_CHUNK_OVERLAP 64 Overlap between consecutive chunks
DOCWHISPER_REQUIRE_CITATIONS true Warn (and exit 1) if answer has no citations

Evaluation

There's a simple eval runner that checks citation presence, answer relevance, and whether the right source was retrieved. Useful for catching regressions after you change models or chunk settings.

Create an eval file:

[
  {
    "question": "What is the return window?",
    "expected_keywords": ["30 days", "refund"],
    "expected_source_hint": "returns_policy"
  }
]

Run it:

python -m docwhisper.eval --eval-file eval_questions.json

Output:

══════════════════════════════════════════════════════════════
  docwhisper eval report
══════════════════════════════════════════════════════════════

  βœ“  Q: What is the return window?
      citations : βœ“
      relevance : βœ“
      source    : βœ“

──────────────────────────────────────────────────────────────
  Result: 1/1 cases passed
══════════════════════════════════════════════════════════════

The CI pipeline (.github/workflows/ci.yml) runs unit tests on every push and runs the eval suite on main when OPENAI_API_KEY is available as a repository secret.


Project structure

docwhisper/
β”œβ”€β”€ docwhisper/
β”‚   β”œβ”€β”€ config.py      β€” all settings, env-var overridable
β”‚   β”œβ”€β”€ ingest.py      β€” load docs, chunk, build BM25 + vector index
β”‚   β”œβ”€β”€ retrieve.py    β€” hybrid retrieval + cross-encoder reranking
β”‚   β”œβ”€β”€ answer.py      β€” LLM call + citation enforcement
β”‚   β”œβ”€β”€ pipeline.py    β€” high-level DocWhisper class
β”‚   β”œβ”€β”€ cli.py         β€” command-line interface
β”‚   β”œβ”€β”€ server.py      β€” FastAPI REST server
β”‚   └── eval.py        β€” evaluation runner
β”œβ”€β”€ tests/             β€” pytest unit tests (no LLM needed)
β”œβ”€β”€ examples/
β”‚   β”œβ”€β”€ sample_docs/   β€” example markdown docs to try
β”‚   β”œβ”€β”€ eval_questions.json
β”‚   β”œβ”€β”€ run_quickstart.py
β”‚   └── use_ollama.py
β”œβ”€β”€ .github/workflows/ci.yml
β”œβ”€β”€ .env.example
└── pyproject.toml

Supported file types

Format Requires
.txt nothing
.md nothing
.pdf pip install -e ".[pdf]"
.html pip install -e ".[html]"

Limitations / known issues

  • The chunker is word-based, not sentence-aware. It can cut mid-sentence. Good enough for most cases, but if you're indexing structured docs with short dense paragraphs you might want to tune CHUNK_SIZE down.
  • Vector index is a flat numpy array β€” no FAISS or HNSW. Fine up to ~50k chunks, starts getting slow beyond that. Adding FAISS is a one-file change if you need it.
  • No streaming support yet on the CLI β€” the full answer comes back at once.
  • Eval metrics are naive (keyword overlap, not semantic similarity). Good for smoke tests, not for benchmarking model quality.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages