This project implements a document summarizer and question-answering (QA) system built on Retrieval-Augmented Generation (RAG). It combines document retrieval, summarization, and question answering, using FAISS for efficient vector search and OpenAI GPT models for the summarization and QA tasks.
Watch it summarize the paper "Attention Is All You Need".
- Upload and process documents: Supports PDF and EPUB formats.
- Text splitting and embedding: Uses OpenAI embeddings and FAISS for efficient text retrieval.
- Clustering for summary: Clusters embeddings to summarize the most representative parts of the document.
- Summarization: Generates cohesive summaries using GPT-3.5 and GPT-4.
- Interactive Q&A: Ask questions based on the uploaded document and get answers powered by GPT models.
- LangChain: For managing document loaders, embeddings, chains, and summarization.
- OpenAI API: For using GPT-3.5 and GPT-4 models in summarization and Q&A.
- FAISS: For fast vector searches.
- KMeans Clustering: For grouping similar document embeddings.
- Streamlit: For building the interactive web application.
- Document Upload: The app allows you to upload PDF or EPUB files.
- Text Processing:
  - The document is loaded using PyPDFLoader or UnstructuredEPubLoader.
  - Text is split into manageable chunks using RecursiveCharacterTextSplitter.
- Embeddings and FAISS Indexing:
  - Chunks are embedded using OpenAI's embeddings API.
  - FAISS is used to create a vector store for efficient retrieval.
- Clustering:
  - Embeddings are clustered using KMeans to identify key sections.
- Summarization:
  - Selected document chunks are summarized using GPT-3.5-turbo.
  - A final, cohesive summary is generated by GPT-4.
- Q&A:
  - Users can ask questions about the document, and relevant chunks are retrieved using FAISS.
  - GPT-3.5 answers the questions based on the retrieved content.
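
The clustering and retrieval steps above can be illustrated without any API calls. The sketch below is not the app's code: it assumes hand-written 2-D vectors in place of OpenAI embeddings, a tiny pure-Python k-means in place of a library KMeans, and a brute-force distance search in place of a FAISS index. The names `kmeans`, `representative_indices`, and `nearest_chunks` are hypothetical. It shows one plausible way to pick "the most representative parts" (the chunk nearest each cluster centre) and to fetch the chunks nearest a question.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Tiny Lloyd's k-means; a stand-in for a library KMeans, not the app's code."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    # Deterministic start: take k evenly spaced vectors as initial centroids.
    step = len(vectors) // k
    centroids = [list(vectors[i * step]) for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: math.dist(v, centroids[c]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def representative_indices(vectors, k):
    """Index of the chunk closest to each centroid, returned in document order."""
    centroids = kmeans(vectors, k)
    chosen = {
        min(range(len(vectors)), key=lambda i: math.dist(vectors[i], c))
        for c in centroids
    }
    return sorted(chosen)


def nearest_chunks(vectors, query, top_k):
    """Brute-force L2 search; stands in for a FAISS index lookup."""
    ranked = sorted(range(len(vectors)), key=lambda i: math.dist(vectors[i], query))
    return ranked[:top_k]


# Toy "embeddings" for six chunks: chunks 0-2 cover one topic, chunks 3-5 another.
embeddings = [
    [0.0, 1.0], [1.0, 1.0], [2.0, 1.0],
    [10.0, 5.0], [11.0, 5.0], [12.0, 5.0],
]

summary_candidates = representative_indices(embeddings, k=2)  # [1, 4]
qa_context = nearest_chunks(embeddings, query=[11.2, 5.0], top_k=2)  # [4, 5]
```

In a pipeline like the one described, the chunks at `summary_candidates` would be the ones sent for summarization, and the chunks at `qa_context` would be passed to the model along with the user's question. Returning the indices in document order keeps the selected chunks in their original sequence, which tends to help when they are merged into one summary.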
- Clone the repository:
      git clone https://github.com/srimoyee1212/Document-Summary-and-QA.git
      cd Document-Summary-and-QA
- Create a virtual environment and activate it:
      python -m venv venv
      source venv/bin/activate  # On Windows use: venv\Scripts\activate
- Install the required dependencies:
      pip install -r requirements.txt
- Set your OpenAI API key, either by adding it directly to the code or by using an environment variable.
- Run the Streamlit app:
      streamlit run app.py
- Open the app in your browser at http://localhost:8501.
- Upload a document (PDF or EPUB).
- Generate a summary and/or ask questions about the document.
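
For the API key step above, reading the key from an environment variable keeps it out of the source code. A minimal sketch, assuming the variable is named `OPENAI_API_KEY` (the name the OpenAI Python client reads by default; this README does not say which name app.py expects) and using a hypothetical helper `load_openai_api_key`:

```python
import os


def load_openai_api_key(env=os.environ):
    """Return the API key from the given mapping, or raise a clear error."""
    # OPENAI_API_KEY is an assumed variable name, not confirmed by this project.
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the app."
        )
    return key
```

With this approach the key would be exported in the shell before starting the app (for example `export OPENAI_API_KEY=...` on Linux or macOS), and a missing key fails immediately with a readable message instead of surfacing later as an authentication error.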
