A state-of-the-art Multimodal Retrieval-Augmented Generation (RAG) application built with Python, FastAPI, and OpenAI. It supports uploads of text, images, and PDFs, blending OCR, local vision embeddings, and dense-vector search into a conversational API.
- Multimodal Ingestion Pipeline:
  - Dynamically parses uploaded `.pdf`, `.png`, `.jpg`, and `.txt` files.
  - Leverages `pdfplumber` to extract precise text chunks and bounding boxes, while using `pytesseract` to OCR text from embedded images and graphics, mapping each result to its source metadata.
- Text & Vision Embeddings:
  - Pure text retrieval uses `OpenAIEmbeddings` with a local FAISS CPU index.
  - Visual elements use a local Hugging Face `sentence-transformers` model (`clip-ViT-B-32`) to embed imagery into a searchable vector space without extra third-party API calls.
- Hybrid Search System (RRF):
  - Combines sparse lexical retrieval (`rank_bm25`) with dense embedding search (FAISS).
  - Fuses both candidate lists with Reciprocal Rank Fusion (RRF) to return the best combined text matches.
- Cross-Modal Reranking Phase:
  - Both text chunks and extracted image paths are scored against the user's live query using `cosine_similarity`.
  - Ensures visually relevant diagrams are passed directly to the OpenAI generation step.
- Conversational Memory & State Persistence:
  - Live LLM interactions are maintained in memory as a `chat_history` mapping.
  - Every conversation is autosaved to a MongoDB Atlas cloud database; previous chats appear in a "Chat History" sidebar and can be reloaded asynchronously.
- API-first design:
  - FastAPI endpoints for upload/chat/history, plus Swagger docs at `/docs`.
- Per-Thread Context Persistence:
  - Each chat thread keeps its own retrieval context (vector index + image references), so users can reopen old chats and continue asking questions without re-uploading.
  - Original uploaded file bytes are not stored in chat history documents.
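The Reciprocal Rank Fusion step of the hybrid search can be sketched as below. This is a minimal illustration, not the project's exact implementation; the constant `k=60` is the value commonly used in RRF, and the document IDs are made up for the example:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one.

    Each document's fused score is the sum of 1 / (k + rank) over
    every ranking it appears in; higher fused scores rank first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_a", "doc_b", "doc_c"]   # sparse (lexical) ranking
dense_hits = ["doc_a", "doc_d", "doc_b"]  # dense (FAISS) ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents appearing near the top of both lists (here `doc_a`) accumulate the highest fused score, which is what lets lexical and embedding signals reinforce each other.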
- Python 3.11+
- FastAPI (API server)
- MongoDB (chat-history persistence)
- OpenAI API (`gpt-4o-mini` for text/vision generation & `text-embedding-3-small` for dense vectors)
- LangChain (FAISS wrappers and data abstractions)
- Hugging Face `sentence-transformers` & `torchvision` (cross-modal visual scoring)
- `pytesseract`, `Pillow`, `pdf2image`, `pdfplumber`
Ensure Tesseract OCR is installed on your host machine to enable image-to-text processing.
Install the application requirements in a virtual environment:

```shell
python -m venv venv
venv\Scripts\activate   # on macOS/Linux: source venv/bin/activate
pip install -r requirements.txt
```

You must specify a few critical keys in a `.env` file at the root of the project (you can start from `.env.example`):
```shell
# Required for text generation & core vector embeddings
OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
OPENAI_VISION_MODEL="gpt-4o-mini"

# Required for thread persistence in the UI
MONGO_URI="mongodb+srv://<user>:<password>@cluster0...mongodb.net/YourDB"
```

Launch the API server locally:
```shell
uvicorn server:app --host 0.0.0.0 --port 8000
```

Then open:

- http://localhost:8000/ (serves `index.html`)
- http://localhost:8000/docs (Swagger UI)
- `server.py`: FastAPI app exposing upload/chat/history endpoints and serving `index.html`.
- `app.py`: Compatibility entrypoint (Streamlit removed).
- `ingestion.py`: Handler that breaks down PDFs and images; stores extracted file derivatives in an internal `extracted_images` directory.
- `chunker.py`: Utility that slices large texts into manageable 1200-token chunks while keeping surrounding metadata intact.
- `hybrid_retriever.py`: Core search algorithm merging BM25 hits with dense scores.
- `reranker.py` & `llm_query.py`: Score content with `sentence-transformers` CLIP embeddings and query OpenAI's API via structured system prompts.
- `db_utils.py`: Abstracted controller that talks to MongoDB to save/recall active thread logs.
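The chunking strategy described for `chunker.py` can be illustrated roughly as follows. This sketch approximates tokens with whitespace-split words and uses a hypothetical `chunk_text` helper with an assumed overlap parameter; the real module may use a proper tokenizer:

```python
def chunk_text(text, max_tokens=1200, overlap=100):
    """Split text into chunks of at most `max_tokens` tokens, with
    `overlap` tokens shared between neighbouring chunks for context.

    Tokens are approximated by whitespace-separated words here.
    """
    words = text.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

# A 3000-word input yields three chunks of 1200, 1200, and 800 words.
chunks = chunk_text("word " * 3000, max_tokens=1200, overlap=100)
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side, at the cost of slightly more index entries.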