
# 🧠 Retrieval-Augmented Generation (RAG) for Document QA

This notebook demonstrates how to build a complete **RAG-based document question answering pipeline** using LangChain. It covers:
- Loading and chunking single/multiple PDFs
- Embedding with OpenAI and HuggingFace
- Vector storage with FAISS and Chroma
- Retrieval with similarity and MMR
- Generation with OpenAI GPT and HuggingFace (optional)


In [None]:

!pip install langchain openai faiss-cpu tiktoken pypdf chromadb sentence-transformers python-dotenv


## 📦 Import Required Modules

In [None]:

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import FAISS, Chroma
from langchain.chat_models import ChatOpenAI
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from transformers import pipeline
import os
from dotenv import load_dotenv


## 🔐 Load API Keys from .env File

In [None]:
load_dotenv()
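
`load_dotenv()` copies key-value pairs from a `.env` file into the process environment, and the OpenAI classes used later read `OPENAI_API_KEY` from there. A minimal sketch of an early check follows; the helper name `missing_keys` is hypothetical and not part of any library.

```python
import os

def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Hypothetical usage: run this right after load_dotenv()
missing = missing_keys(["OPENAI_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Failing here is cheaper than failing at the first embedding or chat request.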

## 📄 Load a Single PDF File

In [None]:

loader = PyPDFLoader("data/contract_detailed.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages from contract_detailed.pdf")


## 📁 Load Multiple PDF Files from Folder

In [None]:

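from pathlib import Path

all_docs = []
data_path = Path("data")
pdf_files = sorted(data_path.glob("*.pdf"))  # glob once; sort for a stable load order
for file in pdf_files:
    # Each PDF page becomes one Document carrying "source" and "page" metadata
    docs = PyPDFLoader(str(file)).load()
    all_docs.extend(docs)
print(f"Loaded {len(all_docs)} pages in total from {len(pdf_files)} PDFs")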


## ✂️ Split Documents into Chunks

In [None]:

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(all_docs)
print(f"Total chunks created: {len(chunks)}")
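
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window splitter in plain Python. It is only an illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, which this sketch ignores, and the name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

demo = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(demo)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.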


## 🔡 Create Embeddings
We create two embedding models to compare: OpenAI's hosted embeddings and a local HuggingFace sentence-transformers model.

In [None]:

embedding_oa = OpenAIEmbeddings()
embedding_hf = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")


## 💾 Store Vectors in FAISS and Chroma

In [None]:

faiss_store = FAISS.from_documents(chunks, embedding_oa)
chroma_store = Chroma.from_documents(chunks, embedding_hf, collection_name="rag-demo")
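
Conceptually, a vector store keeps one embedding per chunk and answers a query by returning the chunks whose vectors are closest to the query vector. The sketch below does this by brute force with cosine similarity. FAISS and Chroma use their own index structures and distance metrics, so treat it as an illustration only; the toy vectors and the names `cosine` and `top_k` are made up for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k vectors most similar to the query."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [i for _, i in scored[:k]]

toy_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.0], toy_vecs, k=2))  # [0, 1]
```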


## 🔍 Create Retrievers (Similarity and MMR)

In [None]:

faiss_retriever = faiss_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})
mmr_retriever = chroma_store.as_retriever(search_type="mmr", search_kwargs={"k": 3})
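
MMR (maximal marginal relevance) picks chunks one at a time, scoring each candidate by its relevance to the query minus its similarity to chunks already selected. The following is a minimal greedy sketch under that definition, not LangChain's implementation; the parameter name `lambda_mult` and the toy vectors are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedy MMR: return up to k indices balancing relevance and novelty."""
    relevance = [cosine(query_vec, v) for v in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            # Penalize similarity to the closest already-selected vector
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Vectors 0 and 1 are near-duplicates; vector 2 points in a different direction
toy_vecs = [[1.0, 0.0], [0.98, 0.05], [0.6, 0.8]]
query = [1.0, 0.3]
print(mmr(query, toy_vecs, k=2))                   # [1, 2]
print(mmr(query, toy_vecs, k=2, lambda_mult=1.0))  # [1, 0]
```

With `lambda_mult=1.0` the redundancy term vanishes and the result matches plain similarity search; at `0.5` the near-duplicate is skipped in favor of the more distinct vector.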


## 🧠 Setup LLMs (OpenAI GPT-3.5 Turbo, Optional HF Pipeline)

In [None]:

llm_oa = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
# Optional: HuggingFace (if using local models)
# hf_pipe = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
# llm_hf = HuggingFacePipeline(pipeline=hf_pipe)


## 🔗 Build RetrievalQA Chains with Different Combinations

In [None]:

qa_openai_faiss = RetrievalQA.from_chain_type(llm=llm_oa, retriever=faiss_retriever, return_source_documents=True)
qa_openai_mmr = RetrievalQA.from_chain_type(llm=llm_oa, retriever=mmr_retriever, return_source_documents=True)
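
`RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which places all retrieved chunks into a single prompt together with the question. The sketch below shows the idea in plain Python. The prompt wording is hypothetical and is not LangChain's actual template, and the chunk texts are invented placeholders.

```python
def build_stuff_prompt(question, chunk_texts):
    """Concatenate retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(chunk_texts)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholder chunks standing in for retriever output
prompt = build_stuff_prompt(
    "What are the termination conditions in the contract?",
    ["Placeholder chunk one.", "Placeholder chunk two."],
)
print(prompt)
```

Because every chunk lands in one prompt, `k` and `chunk_size` together bound how much of the model's context window the retrieved text consumes.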


## ❓ Ask a Query and Compare Results

In [None]:

query = "What are the termination conditions in the contract?"
result_faiss = qa_openai_faiss({"query": query})
result_mmr = qa_openai_mmr({"query": query})

print("FAISS + OpenAI Answer:\n", result_faiss['result'])
print("\nSources:")
for doc in result_faiss['source_documents']:
    print("-", doc.metadata.get("source", "N/A"))

print("\nMMR + OpenAI Answer:\n", result_mmr['result'])
print("\nSources:")
for doc in result_mmr['source_documents']:
    print("-", doc.metadata.get("source", "N/A"))
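
Several retrieved chunks often come from the same page, so the source lists above can repeat. A small helper, sketched here with the hypothetical name `unique_sources`, collapses them to distinct (source, page) pairs. It assumes each document exposes a `metadata` dict with `source` and `page` keys, as `PyPDFLoader` produces; the stand-in objects below are fabricated for the demo.

```python
from types import SimpleNamespace

def unique_sources(docs):
    """Return distinct (source, page) pairs in first-seen order."""
    seen = []
    for doc in docs:
        key = (doc.metadata.get("source", "N/A"), doc.metadata.get("page"))
        if key not in seen:
            seen.append(key)
    return seen

# Stand-ins for result_faiss['source_documents']
fake_docs = [
    SimpleNamespace(metadata={"source": "data/a.pdf", "page": 0}),
    SimpleNamespace(metadata={"source": "data/a.pdf", "page": 0}),
    SimpleNamespace(metadata={"source": "data/b.pdf", "page": 2}),
]
print(unique_sources(fake_docs))  # [('data/a.pdf', 0), ('data/b.pdf', 2)]
```

Comparing `unique_sources(result_faiss['source_documents'])` with the MMR equivalent shows at a glance whether the two retrievers drew on different pages.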



## 📊 Summary: Pros & Cons

| Component         | Option A         | Option B          | Notes |
|------------------|------------------|-------------------|-------|
| Embeddings       | OpenAI           | HuggingFace       | OpenAI is often more accurate but is a paid API; HF is free and runs locally |
| Vector Store     | FAISS            | Chroma            | FAISS is a fast in-memory index; Chroma is a local database with built-in persistence; both support MMR in LangChain |
| Retriever Type   | Similarity       | MMR               | Similarity returns the closest chunks; MMR trades some relevance for diversity in the context |
| Generator (LLM)  | OpenAI GPT-3.5   | HF Instruct Model | OpenAI is a reliable hosted API; HF models can be self-hosted |


## 💾 Save Vector Store for Reuse

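In [None]:

# Reload later with FAISS.load_local('vectorstore/faiss', embedding_oa), using the same embedding model
faiss_store.save_local('vectorstore/faiss')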