A production-grade, multi-threaded Retrieval-Augmented Generation (RAG) application that lets you chat with your PDF and text documents using LLaMA 3.3 70B.
DocChat is a Retrieval-Augmented Generation (RAG) application that goes beyond the typical "upload a PDF and ask questions" pattern. It introduces concurrent multi-threaded document processing, Maximal Marginal Relevance (MMR) retrieval, and a polished dark-themed UI — built to demonstrate real-world engineering practices, not just a tutorial-level prototype.
Upload up to 5 documents → Documents are processed in parallel threads → Ask questions grounded strictly in your documents → Get accurate, hallucination-resistant answers.
Most PDF chat applications follow a simple, sequential pipeline: load one file, chunk it, embed it, query it. DocChat is engineered differently.
Traditional Approach DocChat Approach
───────────────────── ─────────────────────
File 1 → Load → Chunk File 1 ─┐
File 2 → Load → Chunk File 2 ─┤→ Parallel Threads
File 3 → Load → Chunk File 3 ─┤ (concurrent I/O)
(sequential) File 4 ─┤
Total: T1 + T2 + T3 File 5 ─┘
Total: max(T1..T5)
- Each uploaded document is processed in its own dedicated thread (
threading.Thread), eliminating the sequential bottleneck. - A shared
threading.Lockensures thread-safe aggregation of document chunks into a single unified list — no race conditions, no data corruption. - Daemon threads are used so background workers don't prevent graceful application shutdown.
- Named threads (
loader-{filename}) enable clean debugging and monitoring.
# One thread per file — true concurrent document processing
threads = []
for uf in uploaded_files:
t = threading.Thread(
target=_process_single_file,
args=(uf, all_chunks, errors, lock),
name=f"loader-{uf.name}",
daemon=True,
)
threads.append(t)
for t in threads:
t.start()
for t in threads:
t.join() # barrier: wait for ALL files before embeddingWhy it matters: When processing 5 large PDFs, the sequential approach takes the sum of all processing times. With threading, it takes only as long as the slowest file. For I/O-bound PDF parsing, this is a significant real-world speedup.
Unlike basic similarity search (k nearest neighbors), DocChat uses MMR — a retrieval strategy that balances relevance with diversity.
retriever = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={"k": 3, "fetch_k": 10, "lambda_mult": 0.5}
)| Parameter | Value | Purpose |
|---|---|---|
search_type |
"mmr" |
Activates Maximal Marginal Relevance |
k |
3 |
Final number of documents returned |
fetch_k |
10 |
Candidate pool size before diversity filtering |
lambda_mult |
0.5 |
Balance between relevance (1.0) and diversity (0.0) |
Why it matters: Basic similarity search often returns near-duplicate chunks from the same paragraph. MMR ensures retrieved context covers different aspects of the query, leading to more comprehensive and accurate answers.
The system prompt explicitly constrains the LLM to answer only from the provided context:
If the answer is not present in the context,
say: "I could not find the answer in the document."
This is a deliberate design choice for trustworthiness — the model will never fabricate information that isn't in your documents.
- Custom dark theme with DM Serif Display + DM Sans typography
- Real-time status indicators (document count pills, processing states)
- Session management with full conversation history
- Responsive layout with sidebar controls and main chat area
┌─────────────────────────────────────────────────────────┐
│ Streamlit Frontend │
│ (Dark Theme · Chat UI · File Upload) │
└──────────────────────┬──────────────────────────────────┘
│
┌──────────────▼──────────────┐
│ Multi-Threaded Loader │
│ ┌────────┐ ┌────────┐ │
│ │Thread 1│ │Thread 2│ ... │ ← One thread per file
│ │PDF Load│ │TXT Load│ │
│ │+ Chunk │ │+ Chunk │ │
│ └───┬────┘ └───┬────┘ │
│ └─────┬─────┘ │
│ Lock-Protected │
│ Chunk Aggregation │
└──────────────┬──────────────┘
│
┌──────────────▼──────────────┐
│ Embedding Pipeline │
│ sentence-transformers/ │
│ all-MiniLM-L6-v2 │
│ (HuggingFace, local) │
└──────────────┬──────────────┘
│
┌──────────────▼──────────────┐
│ ChromaDB Vector Store │
│ (In-Memory / Persisted) │
│ MMR Retrieval (k=3) │
└──────────────┬──────────────┘
│
┌──────────────▼──────────────┐
│ Groq LLM Engine │
│ LLaMA 3.3 70B Versatile │
│ (Grounded Prompt Template) │
└─────────────────────────────┘
| Layer | Technology | Purpose |
|---|---|---|
| LLM | Groq Cloud + LLaMA 3.3 70B | Ultra-fast inference for document Q&A |
| Embeddings | HuggingFace all-MiniLM-L6-v2 |
Local, lightweight semantic embeddings |
| Vector Store | ChromaDB | Persistent vector storage + MMR retrieval |
| Framework | LangChain | Orchestration of RAG pipeline components |
| Concurrency | Python threading |
Parallel document loading & chunking |
| Frontend | Streamlit | Interactive chat UI with custom theming |
| Text Splitting | RecursiveCharacterTextSplitter | Context-aware document chunking |
| Document Loaders | PyPDFLoader, TextLoader | Multi-format document ingestion |
| Config | python-dotenv | Secure API key management |
pdf_chat_threaded_documents/
├── rag_application/
│ ├── app.py # 🚀 Main Streamlit app (threaded RAG pipeline)
│ ├── main.py # 🖥️ CLI-based RAG interface
│ ├── create-database.py # 🗄️ Standalone vector DB creation script
│ ├── data/ # 📂 Sample documents (PDF, TXT)
│ ├── loaders/ # 🔌 Document loader experiments
│ │ ├── pdf.py # PyPDF loader
│ │ ├── page.py # Web page loader
│ │ ├── loader_exp.py # Text file loader
│ │ └── docling_load.py # Docling-based loader
│ ├── splitters/ # ✂️ Text splitting strategies
│ │ ├── recursiveSplitter.py# Recursive character splitting
│ │ ├── splitter.py # Character-based splitting
│ │ └── splitter2.py # Token-based splitting
│ ├── VECTORSTORE/ # 💾 Vector store experiments
│ │ └── db.py # ChromaDB operations
│ └── .gitignore
├── requirements.txt
└── README.md
- Python 3.10+
- A Groq API Key (free tier available)
git clone https://github.com/<your-username>/pdf_chat_threaded_documents.git
cd pdf_chat_threaded_documentspython -m venv .venv
# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activatepip install -r requirements.txtCreate a .env file inside rag_application/:
GROQ_API_KEY=your_groq_api_key_hereOption A — Streamlit Web UI (Recommended)
cd rag_application
streamlit run app.pyOption B — CLI Interface
# First, create the vector database
cd rag_application
python create-database.py
# Then run the CLI chatbot
python main.py- Upload — Drag & drop up to 5 PDF or TXT files in the sidebar.
- Process — Click
⚡ Process Documentsto trigger threaded ingestion. - Chat — Ask questions in the chat input; answers are grounded in your documents.
- Clear — Hit
🗑 Clear Sessionto reset and upload new documents.
| Decision | Rationale |
|---|---|
| Threading over Multiprocessing | Document loading is I/O-bound (file reads, PDF parsing), making threads ideal. Avoids the overhead of process spawning and IPC serialization. |
| Lock-based synchronization | A single threading.Lock protects the shared chunk list — minimal contention since each thread writes once after processing. |
| Daemon threads | Ensures clean shutdown if the Streamlit process is terminated mid-ingestion. |
| MMR over Similarity Search | Prevents redundant context chunks, improving answer quality for complex queries. |
| In-memory ChromaDB for web UI | Each session gets a fresh vectorstore — no stale data leaking between users. |
| Persisted ChromaDB for CLI | CLI mode persists the database for repeated querying without re-embedding. |
@st.cache_resource |
Embeddings model and LLM are loaded once and shared across reruns — no redundant initialization. |
The loaders/ and splitters/ directories contain standalone experiments exploring different LangChain components:
- Loaders: PyPDF, TextLoader, WebBaseLoader, Docling (for URL-based academic papers)
- Splitters: CharacterTextSplitter, RecursiveCharacterTextSplitter, TokenTextSplitter
- Vector Store: ChromaDB similarity search vs. retriever patterns
These experiments document the research process and demonstrate familiarity with the broader LangChain ecosystem.
| Metric | Value |
|---|---|
| Document processing | Parallel — scales with file count, not linearly |
| Embedding model | Local — no API calls, no latency, no cost |
| LLM inference | Groq — sub-second response times via dedicated LPU hardware |
| Retrieval strategy | MMR — diverse, non-redundant context selection |
| Max concurrent files | 5 (configurable) |
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License — see the LICENSE file for details.
Built with ❤️ using LangChain, Groq, and Python Threading