<a href="https://colab.research.google.com/github/vi14m/RAG-QA/blob/main/RAG_QA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📄 Local RAG PDF Chatbot with LangChain, HuggingFace, and Gradio

This notebook demonstrates how to build a fully local Retrieval-Augmented Generation (RAG) chatbot for querying your own PDF documents. It uses LangChain for orchestration, HuggingFace models for embeddings and answer generation, FAISS for vector search, and Gradio for the chat interface.

---

## **Workflow Overview**

1. **Install Required Libraries**  
    Install all necessary Python packages for document loading, embedding, vector storage, and the chat interface.

2. **Upload PDF Documents**  
    Upload your PDF files, which will be used as the knowledge base for the chatbot.

3. **Load and Process Documents**  
    - Load PDFs from the uploaded files.
    - Split documents into manageable text chunks.
    - Filter out empty or invalid chunks.

4. **Create Embeddings and Vector Store**  
    - Generate embeddings for each document chunk using a HuggingFace model.
    - Store embeddings in a FAISS vector database for efficient retrieval.

5. **Set Up Conversational RAG Chain**  
    - Initialize a lightweight HuggingFace language model for answer generation.
    - Set up a conversational retrieval chain using LangChain.

6. **Define Chatbot Logic**  
    - Implement a function to handle user queries, maintain chat history, and interact with the RAG chain.

7. **Launch Gradio Chat Interface**  
    - Provide a simple web-based chat interface for users to interact with the PDF chatbot.

8. **Automated Q&A and Export**  
    - Run a set of sample questions through the chatbot.
    - Save the resulting Q&A pairs to an Excel file for further analysis.

---

## **Cell-by-Cell Description**

### 1. Install Dependencies
Install all required libraries for document processing, embeddings, vector storage, and the chat interface.

### 2. Upload PDFs
Use Google Colab's file upload utility to upload your PDF documents, then move them into a dedicated `data` directory.

### 3. Install FAISS
Install `faiss-cpu`, the CPU build of the FAISS library for efficient vector similarity search.

### 4. Load and Process Documents
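- Load all PDFs from the `data` directory.
- Split documents into overlapping text chunks (700 characters each, with a 150-character overlap).
- Filter out empty chunks.
- Generate embeddings using the `hkunlp/instructor-xl` HuggingFace model.
- Store the embeddings in a FAISS vector store and save the index to disk.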
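
To make the overlap concrete, here is a minimal sketch of fixed-window chunking in plain Python. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, whereas the hypothetical `split_with_overlap` helper below cuts at exact character offsets. The 700/150 sizes mirror the notebook's settings.

```python
def split_with_overlap(text: str, chunk_size: int = 700, chunk_overlap: int = 150) -> list[str]:
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # drop whitespace-only chunks, like the notebook's filter
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Illustrative input only: 2000 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(2000))
pieces = split_with_overlap(sample)
print(len(pieces), [len(p) for p in pieces])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.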

### 5. Set Up RAG Chain
- Initialize a lightweight HuggingFace language model (`google/flan-t5-base`).
- Create a retriever from the FAISS vector store that returns the 5 most similar chunks per query.
- Set up a conversational retrieval chain using LangChain.
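
Conceptually, the retriever embeds the question and returns the `k` stored chunks whose vectors lie closest to it. The sketch below illustrates that idea in plain Python, assuming ranking by Euclidean distance (which appears to be the default for LangChain's FAISS wrapper). The three-dimensional vectors and the `top_k` helper are invented for illustration; real embeddings have hundreds of dimensions and FAISS uses an optimized index rather than a full sort.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 5) -> list[str]:
    """Return the texts of the k entries whose vectors are nearest to query_vec."""
    ranked = sorted(indexed, key=lambda item: squared_l2(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy "vector store": (chunk text, embedding) pairs with made-up values.
toy_index = [
    ("chunk about education", [0.9, 0.1, 0.0]),
    ("chunk about projects", [0.1, 0.9, 0.0]),
    ("chunk about skills", [0.0, 0.2, 0.9]),
]

print(top_k([0.8, 0.2, 0.1], toy_index, k=2))
```

A larger `k` gives the language model more context to draw on but also a longer prompt, which matters for small models with limited input length.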

### 6. Define Chatbot Logic
- Implement a function to process user queries, maintain chat history, and interact with the RAG chain.
- Reply to greetings directly, and ask for more context when a query has fewer than three words.
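
The guard logic can be exercised without loading any model by swapping in a stand-in chain. The sketch below is one way to do that: `EchoChain` and `make_bot` are hypothetical names, the reply strings are shortened versions of the notebook's messages, and the stub only assumes that the chain takes a dict with `question` and `chat_history` and returns a dict with an `answer` key.

```python
class EchoChain:
    """Hypothetical stand-in for the real RAG chain: returns a canned answer."""

    def invoke(self, inputs: dict) -> dict:
        return {"answer": f"(stub) you asked: {inputs['question']}"}


def make_bot(chain):
    """Build an ask() function with its own private chat history."""
    history: list[tuple[str, str]] = []

    def ask(query: str) -> str:
        query = query.strip()
        if not query:
            return "Please enter a valid question."
        if query.lower() in {"hi", "hello", "hey"}:
            return "Hi there! Ask me anything about the uploaded documents."
        if len(query.split()) < 3:
            return "Can you please provide more context in your question?"
        result = chain.invoke({"question": query, "chat_history": history})
        answer = result.get("answer", "").strip()
        if not answer:
            return "I couldn't find an answer based on the documents."
        history.append((query, answer))  # only answered questions enter the history
        return answer

    return ask, history


ask, history = make_bot(EchoChain())
print(ask("hello"))                      # greeting branch, chain not called
print(ask("skills?"))                    # too short, chain not called
print(ask("What skills does he have?"))  # forwarded to the chain
print(len(history))
```

Keeping the history inside a closure rather than a module-level list also means several independent bots can coexist, which a single global list would not allow.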

### 7. Launch Gradio Interface
- Wrap the chatbot logic in a Gradio interface for interactive chatting.

### 8. Define Sample Questions
List a set of example questions to automatically query the chatbot.

### 9. Automated Q&A and Export
- Run the sample questions through the chatbot.
- Save the resulting Q&A pairs to an Excel file.
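
Because the chain is created with `return_source_documents=True`, each result should also carry the retrieved chunks, so the export can record where an answer came from. The sketch below shows one possible shape for that, under several assumptions: the result dict has `question`, `answer`, and `source_documents` keys, each document exposes a `metadata` dict with `source` and a 0-based `page` (as the PyPDF loaders set them), and the `format_sources` and `rows_from_results` helpers are hypothetical. It uses fake result data and writes CSV with the standard library to stay dependency-free; the same rows could be passed to a pandas `DataFrame` instead.

```python
import csv
import io
from types import SimpleNamespace


def format_sources(source_documents) -> str:
    """Collapse retrieved chunks into a de-duplicated 'file p.N' string."""
    seen: list[str] = []
    for doc in source_documents:
        name = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page")
        # Assumes page numbers are 0-based, so add 1 for display.
        label = name if page is None else f"{name} p.{page + 1}"
        if label not in seen:
            seen.append(label)
    return "; ".join(seen)


def rows_from_results(results) -> list[tuple[str, str, str]]:
    """Turn chain results into (question, answer, sources) rows."""
    return [
        (r["question"], r["answer"], format_sources(r.get("source_documents", [])))
        for r in results
    ]


# Fake result for illustration; the file name and answer are placeholders.
fake_result = {
    "question": "What skills does he have?",
    "answer": "(placeholder answer)",
    "source_documents": [
        SimpleNamespace(metadata={"source": "data/resume.pdf", "page": 0}),
        SimpleNamespace(metadata={"source": "data/resume.pdf", "page": 0}),
        SimpleNamespace(metadata={"source": "data/resume.pdf", "page": 1}),
    ],
}

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Question", "Answer", "Sources"])
writer.writerows(rows_from_results([fake_result]))
print(buffer.getvalue())
```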

---

## **Usage**

1. **Run each cell in order.**
2. **Upload your PDF documents when prompted.**
3. **Interact with the chatbot via the Gradio interface.**
4. **Review and export Q&A results as needed.**

---

**This notebook provides a fully local, privacy-preserving way to query your own documents: embedding, retrieval, and answer generation all run inside the notebook runtime, with no calls to a hosted LLM API.**

In [None]:
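# Chroma is not used (the vector store is FAISS), so its packages are omitted.
# The cells below import ConversationalRetrievalChain from `langchain`; the <1.0 pin
# is an assumption that the legacy chains are still exposed there in those releases.
# openpyxl is needed by pandas to write the .xlsx export.
!pip install -q \
    "langchain<1.0" langchain-community langchain-text-splitters \
    sentence-transformers transformers accelerate \
    pypdf gradio pandas openpyxl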


In [None]:
from google.colab import files
uploaded = files.upload()

import os
os.makedirs("data", exist_ok=True)
for fname in uploaded.keys():
    os.rename(fname, f"data/{fname}")


In [None]:
!pip install faiss-cpu

In [None]:
import os
import shutil

# Import from langchain_community; the langchain.embeddings / langchain.vectorstores
# paths are deprecated. Unused imports have been dropped; the model and chain
# imports live in the next cell, where they are used.
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS  # FAISS instead of Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Set paths
DATA_PATH = "/content/data"
FAISS_INDEX_PATH = "/content/faiss_index"

# Clean up FAISS index directory
if os.path.exists(FAISS_INDEX_PATH):
    shutil.rmtree(FAISS_INDEX_PATH)
os.makedirs(DATA_PATH, exist_ok=True)

# Load PDFs
loader = PyPDFDirectoryLoader(DATA_PATH)
raw_documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=150)
chunks = splitter.split_documents(raw_documents)

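# Filter valid chunks (drop chunks that contain only whitespace)
valid_chunks = [chunk for chunk in chunks if chunk.page_content.strip()]
if not valid_chunks:
    # Fail early with a clear message; scanned PDFs without a text layer end up here
    raise ValueError(f"No extractable text found in the PDFs under {DATA_PATH}.")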

# Embedding model
embedding = HuggingFaceEmbeddings(model_name="hkunlp/instructor-xl")

# FAISS vector store
vector_store = FAISS.from_documents(valid_chunks, embedding)

vector_store.save_local(FAISS_INDEX_PATH)

In [None]:
from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import ConversationalRetrievalChain

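# Lightweight local model for answer generation.
# Hosted models such as OpenAI's would also work, but they are paid and not local.
hf_pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    max_length=2048,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=0.7
)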

llm = HuggingFacePipeline(pipeline=hf_pipe)

retriever = vector_store.as_retriever(search_kwargs={"k": 5})

chat_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)


In [None]:
# Initialize chat history as a list of tuples (question, answer)
chat_history = []

def ask_rag_bot(query: str) -> str:
    """
    Process a user query using the conversational RAG chain and maintain history.
    """

    query = query.strip()
    if not query:
        return "❓ Please enter a valid question."

    # Handle basic greetings
    greetings = ["hi", "hello", "hey"]
    if query.lower() in greetings:
        return "👋 Hi there! I’m your PDF assistant. Ask me anything based on the uploaded documents."

    if len(query.split()) < 3:
        return "🧐 Can you please provide more context in your question?"

    try:
        # Run the RAG pipeline (invoke() replaces the deprecated direct call)
        result = chat_chain.invoke({
            "question": query,
            "chat_history": chat_history
        })

        answer = result.get("answer", "").strip()

        if not answer:
            return "🤔 I couldn't find an answer based on the documents."

        # Add to history only if valid
        chat_history.append((query, answer))

        return answer

    except Exception as e:
        return f"⚠️ An error occurred: {str(e)}"


In [None]:
import gradio as gr

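# Define Gradio-friendly wrapper.
# Gradio passes its own `history`, but it is unused here: ask_rag_bot keeps
# the (question, answer) history that is sent to the chain.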
def gradio_chat_interface(message, history):
    try:
        response = ask_rag_bot(message)
        return response
    except Exception as e:
        return f"⚠️ Error: {str(e)}"

# Launch interface
gr.ChatInterface(
    fn=gradio_chat_interface,
    title="📄 Document Chatbot (Free HuggingFace RAG)",
    description="Ask questions about your uploaded PDFs using a fully local RAG system.",
    textbox=gr.Textbox(placeholder="Ask something about your document..."),
).launch()


In [None]:
questions = [
    "What types of projects has he worked on?",
    "Which college did he attend?",
    "Which B.Tech degree is he pursuing?",
    "What skills does he have?",
    "Which coding languages does he know?"
]


In [None]:
import pandas as pd

chat_history = []
qa_pairs = []

for q in questions:
    result = chat_chain.invoke({"question": q, "chat_history": chat_history})
    answer = result["answer"]
    qa_pairs.append((q, answer))
    chat_history.append((q, answer))

# Save to Excel
df = pd.DataFrame(qa_pairs, columns=["Question", "Answer"])
df.to_excel("RAG_QA_Output.xlsx", index=False)
print("✅ Saved to RAG_QA_Output.xlsx")
