# Assignment 3: Retrieval-Augmented Generation (RAG) using LangChain

## Part I: Conceptual Understanding
### Q1: Motivation behind Retrieval-Augmented Generation (RAG)
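A standalone LLM can only draw on what it learned during training, so its knowledge can be outdated or missing for private or domain-specific material. RAG addresses this by retrieving relevant external documents at inference time and passing them to the model as context, which leads to more accurate answers that are grounded in identifiable sources.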

### Q2: RAG vs Standard LLM QA
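Standard LLM QA relies only on the knowledge stored in the model's parameters during pre-training. RAG augments this with external documents retrieved at runtime and inserted into the prompt, which reduces (but does not eliminate) hallucinations and lets the answer point back to its sources.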

### Q3: Role of Vector Store
The vector store holds the embedding vectors of the document chunks and, given the embedding of a user query, returns the chunks whose vectors are most similar to it.
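
To make the similarity lookup concrete, here is a minimal sketch of what a vector store does at query time. The three-dimensional vectors, the chunk texts, and the `top_k` helper are made up for illustration; a real store such as Chroma uses embeddings with hundreds of dimensions and an approximate nearest-neighbour index rather than a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical store: (chunk text, embedding) pairs with toy 3-d embeddings.
store = [
    ("Chunk about cats", [0.9, 0.1, 0.0]),
    ("Chunk about dogs", [0.8, 0.3, 0.1]),
    ("Chunk about tax law", [0.0, 0.2, 0.9]),
]

def top_k(query_vec, store, k=2):
    """Return the k chunk texts most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

query_vec = [1.0, 0.0, 0.0]  # toy embedding of a pet-related question
results = top_k(query_vec, store, k=2)
print(results)
```

The tax-law chunk is orthogonal to the query vector, so it scores zero and is left out of the top two.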

### Q4: Chain Types in LangChain
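- **stuff**: Concatenates all retrieved chunks into a single prompt and makes one LLM call. Simplest option, but limited by the model's context window.
- **map_reduce**: Runs the LLM on each chunk separately, then combines the per-chunk answers in a final call.
- **refine**: Processes the chunks one at a time, updating a running answer with each new chunk.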
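
The difference between the three chain types is mostly a difference in how many LLM calls are made and what each call sees. The sketch below illustrates that control flow with a stand-in `CountingLLM` class instead of a real model; the prompt wording and function names are assumptions for illustration and are not LangChain's internal prompts.

```python
class CountingLLM:
    """Stand-in for a real model: counts calls and returns a dummy answer."""
    def __init__(self):
        self.calls = 0

    def __call__(self, prompt):
        self.calls += 1
        return f"answer {self.calls}"

def stuff(chunks, question, llm):
    # One call: every chunk goes into the same prompt.
    context = "\n\n".join(chunks)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")

def map_reduce(chunks, question, llm):
    # One call per chunk (map), then one call to combine them (reduce).
    partials = [llm(f"Context:\n{c}\n\nQuestion: {question}") for c in chunks]
    joined = "\n".join(partials)
    return llm(f"Partial answers:\n{joined}\n\nCombine them to answer: {question}")

def refine(chunks, question, llm):
    # First chunk gives a draft; each later chunk revises it.
    # Assumes at least one chunk was retrieved.
    answer = llm(f"Context:\n{chunks[0]}\n\nQuestion: {question}")
    for c in chunks[1:]:
        answer = llm(
            f"Existing answer: {answer}\nNew context:\n{c}\n\n"
            f"Refine the answer to: {question}"
        )
    return answer

chunks = ["chunk A", "chunk B", "chunk C"]
call_counts = {}
for strategy in (stuff, map_reduce, refine):
    llm = CountingLLM()
    strategy(chunks, "What is this about?", llm)
    call_counts[strategy.__name__] = llm.calls

print(call_counts)
```

For n retrieved chunks this gives one call for stuff, n + 1 for map_reduce, and n for refine, which is the main cost trade-off when choosing between them.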

### Q5: Main Components of a LangChain RAG Pipeline
- Document Loader
- Text Splitter
- Embeddings
- Vector Store
- Retriever
- LLM
- RetrievalQA Chain


## Part II: RAG Implementation with LangChain

In [None]:
# Install required packages (uncomment and run if needed)
# langchain-community provides the Ollama, Chroma, and PDF loader integrations
# !pip install langchain langchain-community chromadb pypdf tqdm ollama


In [None]:
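# On LangChain 0.1+ the third-party integrations (loaders, embeddings,
# vector stores, LLM wrappers) live in langchain_community; importing them
# from the older langchain.* paths is deprecated.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA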

# Load document
loader = PyPDFLoader("data/test_doc.pdf")
pages = loader.load()

# Split document into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_documents(pages)

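# Create embeddings and vector store
# (assumes a local Ollama server is running with the "mistral" model pulled)
embeddings = OllamaEmbeddings(model="mistral")
vectorstore = Chroma.from_documents(docs, embeddings, persist_directory=".chromadb")
# Only needed with chromadb < 0.4; newer versions write to persist_directory
# automatically and treat this call as a deprecated no-op.
vectorstore.persist()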
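
The `chunk_size=500, chunk_overlap=50` settings used when splitting the document can be illustrated with a much simpler fixed-window splitter. This is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs, lines, and spaces, so its chunk boundaries will differ. The `split_with_overlap` helper is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where each window repeats the
    last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

alphabet = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(alphabet, chunk_size=10, chunk_overlap=3)
print(pieces)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.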


In [None]:
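# Initialize retriever and LLM
# Default retriever: similarity search returning the top 4 chunks
retriever = vectorstore.as_retriever()
llm = Ollama(model="mistral")

# Define RAG pipeline
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # the default: all retrieved chunks go into one prompt
    retriever=retriever,
    return_source_documents=True
)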

In [None]:
# Sample queries
queries = [
    "What is the main topic of the document?",
    "List any key benefits mentioned.",
    "What problem does this paper solve?",
    "Give a summary in bullet points.",
    "Are there any limitations discussed?"
]

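for q in queries:
    # Calling the chain directly is deprecated; invoke() takes the
    # question under the "query" key.
    result = qa_chain.invoke({"query": q})
    print(f"\n\033[1mQuestion:\033[0m {q}")
    print(f"\033[94mAnswer:\033[0m {result['result']}")
    print("\033[92mRetrieved Sources:\033[0m")
    for doc in result['source_documents']:
        # PyPDFLoader stores the file path and a 0-based page index in metadata
        print("-", doc.metadata.get("source"), "page", doc.metadata.get("page"))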


## Custom Prompt Template
- Adds citation tags such as [source] to the answer
- Appends a disclaimer: "Answer is based on retrieved document context"
- Formats the output as bullet points
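
A prompt with these three properties might look like the sketch below. The exact wording of `PROMPT_TEMPLATE` and the `build_prompt` helper are assumptions for illustration, not the template actually used in this assignment; plain string formatting is used so the example runs without LangChain installed.

```python
# Hypothetical template covering the three points above: citation tags,
# a disclaimer, and bullet-point output.
PROMPT_TEMPLATE = """Use only the context below to answer the question.
Format the answer as bullet points and end each bullet with a citation tag such as [source].
If the context does not contain the answer, say so.

Context:
{context}

Question: {question}

Answer (based on retrieved document context):"""

def build_prompt(context_chunks, question):
    """Fill the template; each chunk is prefixed with its source tag."""
    context = "\n\n".join(f"[{source}] {text}" for source, text in context_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

example = build_prompt(
    [("data/test_doc.pdf", "RAG retrieves documents at inference time.")],
    "What does RAG do?",
)
print(example)
```

In LangChain, the same template string could be wrapped in `PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["context", "question"])` and passed to `RetrievalQA.from_chain_type` through `chain_type_kwargs={"prompt": ...}` when the chain type is `stuff`. Note that the stuff chain inserts only each chunk's text by default, so the per-chunk source prefix shown here would need separate handling.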