A state-of-the-art Multimodal Retrieval-Augmented Generation (RAG) application built with Python, FastAPI, and OpenAI. It supports uploads of text, images, and PDFs, blending OCR, local vision embeddings, and dense-vector search into a conversational API.
- Multimodal Ingestion Pipeline:
  - Dynamically parses uploaded `.pdf`, `.png`, `.jpg`, and `.txt` files.
  - Leverages `pdfplumber` to extract precise text chunks and bounding boxes, while using `pytesseract` to OCR text from embedded images and graphics, mapping each result to its source metadata.
- Text & Vision Embeddings:
  - Pure text retrieval uses `OpenAIEmbeddings` with a local FAISS CPU index.
  - Visual elements use a local Hugging Face `sentence-transformers` model (`clip-ViT-B-32`) to embed imagery into a searchable vector space without extra third-party API calls.
- Hybrid Search System (RRF):
  - Combines sparse lexical retrieval (`rank_bm25`) with dense embedding search (FAISS).
  - Fuses both candidate lists with Reciprocal Rank Fusion (RRF) to return the best combined text matches.
- Cross-Modal Reranking Phase:
  - Both text chunks and extracted image paths are scored against the user's live query using `cosine_similarity`.
  - Ensures visually relevant diagrams are passed directly to the OpenAI generation step.
- Conversational Memory & State Persistence:
  - Live LLM interactions are maintained in memory as a `chat_history` mapping.
  - Every conversation is autosaved to a MongoDB Atlas cloud database; previous chats appear in a "Chat History" sidebar and can be reloaded asynchronously.
- API-first design:
  - FastAPI endpoints for upload/chat/history, plus Swagger docs at `/docs`.
- Per-Thread Context Persistence:
  - Each chat thread keeps its own retrieval context (vector index + image references), so users can reopen old chats and continue asking questions without re-uploading.
  - Original uploaded file bytes are not stored in chat history documents.
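The Reciprocal Rank Fusion step of the hybrid search can be sketched as below. This is a minimal illustration, not the project's exact implementation; the constant `k=60` is the value commonly used in RRF, and the document IDs are made up for the example:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one.

    Each document's fused score is the sum of 1 / (k + rank) over
    every ranking it appears in; higher fused scores rank first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_a", "doc_b", "doc_c"]   # sparse (lexical) ranking
dense_hits = ["doc_a", "doc_d", "doc_b"]  # dense (FAISS) ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents appearing near the top of both lists (here `doc_a`) accumulate the highest fused score, which is what lets lexical and embedding signals reinforce each other.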
- Python 3.11+
- FastAPI (API server)
- MongoDB (chat-history persistence)
- OpenAI API (`gpt-4o-mini` for text/vision generation & `text-embedding-3-small` for dense vectors)
- LangChain (FAISS wrappers and data abstractions)
- Hugging Face `sentence-transformers` & `torchvision` (cross-modal visual scoring)
- `pytesseract`, `Pillow`, `pdf2image`, `pdfplumber`
Ensure Tesseract OCR is installed on your host machine to enable image-to-text processing.
Install the application requirements in a virtual environment:

```shell
python -m venv venv
venv\Scripts\activate   # on macOS/Linux: source venv/bin/activate
pip install -r requirements.txt
```

You must specify a few critical keys in a `.env` file at the root of the project (you can start from `.env.example`):
```shell
# Required for text generation & core vector embeddings
OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
OPENAI_VISION_MODEL="gpt-4o-mini"

# Required for thread persistence in the UI
MONGO_URI="mongodb+srv://<user>:<password>@cluster0...mongodb.net/YourDB"
```

Launch the API server locally:
```shell
uvicorn server:app --host 0.0.0.0 --port 8000
```

Then open:

- http://localhost:8000/ (serves `index.html`)
- http://localhost:8000/docs (Swagger UI)
- `server.py`: FastAPI app exposing upload/chat/history endpoints and serving `index.html`.
- `app.py`: Compatibility entrypoint (Streamlit removed).
- `ingestion.py`: Handler that breaks down PDFs and images; stores extracted file derivatives in an internal `extracted_images` directory.
- `chunker.py`: Utility that slices large texts into manageable 1200-token chunks while keeping surrounding metadata intact.
- `hybrid_retriever.py`: Core search algorithm merging BM25 hits with dense scores.
- `reranker.py` & `llm_query.py`: Score content with `sentence-transformers` CLIP embeddings and query OpenAI's API via structured system prompts.
- `db_utils.py`: Abstracted controller that talks to MongoDB to save/recall active thread logs.
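The chunking strategy described for `chunker.py` can be illustrated roughly as follows. This sketch approximates tokens with whitespace-split words and uses a hypothetical `chunk_text` helper with an assumed overlap parameter; the real module may use a proper tokenizer:

```python
def chunk_text(text, max_tokens=1200, overlap=100):
    """Split text into chunks of at most `max_tokens` tokens, with
    `overlap` tokens shared between neighbouring chunks for context.

    Tokens are approximated by whitespace-separated words here.
    """
    words = text.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

# A 3000-word input yields three chunks of 1200, 1200, and 800 words.
chunks = chunk_text("word " * 3000, max_tokens=1200, overlap=100)
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side, at the cost of slightly more index entries.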