QueryDoc

This project is an example chatbot that analyzes PDF documents using an embedding model and generates LLM answers through a Coarse-to-Fine search (RAG) approach.
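
The Coarse-to-Fine idea can be sketched in a few lines: rank sections by their representative vectors first, then rank only the chunks inside the best sections. This is a minimal illustration and not the project's actual code. The function name coarse_to_fine, the dictionary keys, and the toy 2-D vectors are assumptions made for the example; real vectors would come from the embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def coarse_to_fine(query_vec, sections, top_sections=1, top_chunks=2):
    """Rank sections first, then rank only the chunks of the best sections.

    `sections` is a list of dicts with hypothetical keys "title",
    "vector" (section representative) and "chunks"; each chunk is a
    dict with "text" and "vector".
    """
    # Coarse step: keep the sections whose representative vector is closest.
    ranked = sorted(sections, key=lambda s: cosine(query_vec, s["vector"]), reverse=True)
    candidates = ranked[:top_sections]
    # Fine step: score individual chunks inside the surviving sections only.
    chunks = [c for s in candidates for c in s["chunks"]]
    chunks.sort(key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return [c["text"] for c in chunks[:top_chunks]]

# Toy 2-D vectors stand in for real embeddings (illustrative data only).
SECTIONS = [
    {"title": "Intro", "vector": [1.0, 0.0], "chunks": [
        {"text": "intro-a", "vector": [1.0, 0.1]},
        {"text": "intro-b", "vector": [0.9, -0.2]},
    ]},
    {"title": "Methods", "vector": [0.0, 1.0], "chunks": [
        {"text": "methods-a", "vector": [0.1, 1.0]},
        {"text": "methods-b", "vector": [0.3, 0.8]},
    ]},
]

print(coarse_to_fine([0.0, 1.0], SECTIONS))  # ['methods-a', 'methods-b']
```

The coarse step keeps the fine step cheap: only chunks from the selected sections are scored, at the cost of missing a relevant chunk whose section was ranked too low.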

QueryDoc/
├─ scripts/
│   ├─ pdf_extractor.py
│   ├─ chunker.py
│   ├─ build_index.py
│   └─ section_rep_builder.py
├─ src/
│   ├─ inference/
│   │   ├─ embedding_model.py
│   │   └─ llm_model.py
│   ├─ search/
│   │   ├─ section_coarse_search.py
│   │   ├─ fine_search.py
│   │   └─ vector_search.py
│   ├─ chatbot.py
│   └─ utils/
│       ├─ __init__.py
│       └─ text_cleaning.py
├─ data/
│   ├─ extracted/
│   ├─ chunks/
│   ├─ index/
│   └─ original/
│       └─ sample.pdf
├─ app.py
├─ web_demo.py
├─ requirements.txt
└─ README.md

Installation and Execution

Create a virtual environment and install libraries

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

• For OCR features, install Tesseract and the appropriate language data (Korean in this example):

  sudo apt-get install tesseract-ocr       # Debian/Ubuntu
  sudo apt-get install tesseract-ocr-kor   # Korean language data
  brew install tesseract                   # macOS (Homebrew)
  brew install tesseract-lang              # language data, including Korean

(On Windows, activate with .\venv\Scripts\activate or a similar command.)

Extract PDF & Split into Chunks

python scripts/pdf_extractor.py
python scripts/chunker.py

• After execution, JSON files are created as data/extracted/*.json and data/chunks/*.json.

Build Embeddings

python scripts/build_index.py

• This generates data/index/*_vectors.json.

Generate Section Representative Vectors

python scripts/section_rep_builder.py

• This creates files like sections_with_emb.json.

Directly Testing chatbot.py

To test the chatbot interactively, run:

python src/chatbot.py

• Then type your question (e.g., “Explain Chapter 1.”) and press Enter.

• Type “exit” to quit.

The system prompt used by the chatbot can be customized by editing DEFAULT_SYSTEM_PROMPT in src/chatbot.py or by passing a custom prompt when creating a PDFChatBot instance.

Run the Chatbot Server

python app.py

• A FastAPI server will run at http://0.0.0.0:8000 (default port).

• You can send a JSON-formatted question to the POST /ask endpoint to receive an answer.

Launch the Web Demo

python web_demo.py

• A browser window will appear allowing you to upload a PDF and edit the system prompt before asking questions.

Example API Request

curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is an egocentric AI agent?"}'
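
The same request can be sent from Python with only the standard library. This is a sketch under assumptions: the helper names build_ask_request and ask are hypothetical, the timeout is arbitrary, and the response is assumed to be JSON whose fields are not documented here, so it is returned as-is. Building the body with json.dumps avoids shell quoting problems when a question contains an apostrophe.

```python
import json
import urllib.request

def build_ask_request(question, base_url="http://localhost:8000"):
    """Build a POST /ask request; json.dumps escapes quotes safely."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(question, base_url="http://localhost:8000", timeout=120):
    """Send the question and return the decoded JSON response (schema assumed unknown)."""
    request = build_ask_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Usage (requires the server started with `python app.py`):
# print(ask("What's an egocentric AI agent?"))
```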

Additional Web Demo Info

python web_demo.py

• Sign in with the default credentials admin/password.

• Uploaded PDFs are saved under data/user_uploads/<username>.

• You can modify the system prompt in src/chatbot.py or provide one in the web interface.

Key Libraries

• PyMuPDF (fitz): Extracts PDF text and table of contents (ToC).

• SentenceTransformers: Loads the BAAI/bge-m3 embedding model.

• Transformers: Provides the local LLM (e.g., trillionlabs/Trillion-7B-preview).

• FastAPI: A simple REST API server.

• Gradio: Interactive web demo framework.

• pdfplumber: Layout-aware PDF parsing.

• pytesseract: OCR fallback engine.

• Pillow: Image handling for OCR pipelines.

• pandas: DataFrame operations for layout analysis.

• scikit-learn: KMeans clustering for multi-column detection.
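
To give a feel for clustering-based multi-column detection, the sketch below groups the left x-coordinates of text blocks into two clusters and checks how far apart the cluster centers are. It is an illustration only: a tiny plain-Python 1-D two-means loop stands in for scikit-learn's KMeans so the example has no dependencies, and the function names, the sample coordinates, and the min_gap_ratio threshold are all assumptions rather than the project's settings.

```python
def two_means_1d(values, iterations=20):
    """Split numbers into two clusters with a tiny 1-D k-means (k=2)."""
    lo, hi = min(values), max(values)  # initial centers at the extremes
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to the nearer center (ties go to the left cluster).
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        left = [v for v, label in zip(values, labels) if label == 0]
        right = [v for v, label in zip(values, labels) if label == 1]
        if not right:  # all values identical: nothing to split
            break
        new_lo, new_hi = sum(left) / len(left), sum(right) / len(right)
        if (new_lo, new_hi) == (lo, hi):  # converged
            break
        lo, hi = new_lo, new_hi
    return labels, (lo, hi)

def looks_two_column(x_positions, page_width, min_gap_ratio=0.25):
    """Heuristic: two clusters whose centers are far apart relative to the page."""
    labels, (lo, hi) = two_means_1d(x_positions)
    return len(set(labels)) == 2 and (hi - lo) / page_width >= min_gap_ratio

# Hypothetical left edges of text blocks on a 600-point-wide page.
print(looks_two_column([50, 52, 51, 310, 312, 309], page_width=600))  # True
print(looks_two_column([50, 51, 52, 50], page_width=600))             # False
```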

Notes

• Models such as bge-m3 and Trillion-7B may take some time to download the first time they are loaded.

• Because each section is represented by the average of its chunk embeddings (no summarization model is used), very long sections may reduce search accuracy. A summarization model may be considered in the future.
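
The effect of averaging on long sections can be seen with toy vectors. In this sketch (hypothetical 2-D vectors and function names, not real bge-m3 embeddings), one chunk matches the query and the rest do not; as the number of unrelated chunks grows, the section average drifts away from the query and its similarity score drops.

```python
import math

def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query = [1.0, 0.0]
relevant = [1.0, 0.0]   # chunk that matches the query
unrelated = [0.0, 1.0]  # chunk about something else

# A short section: 1 relevant chunk out of 2.
short_section = mean_vector([relevant, unrelated])
# A long section: 1 relevant chunk out of 10.
long_section = mean_vector([relevant] + [unrelated] * 9)

short_score = cosine(query, short_section)
long_score = cosine(query, long_section)
print(f"{short_score:.2f} {long_score:.2f}")  # 0.71 0.11
```

In the long section the relevant chunk is still present, but the coarse step would rank the section low and the fine step might never see it.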

License

MIMICLab/Taewook
