# Retrieval Augmented Generation (RAG) — Hands‑On with LangChain

**Goal:** Learn RAG from first principles and build an end‑to‑end RAG app with LangChain.  
We will cover: concepts, components, data ingestion, chunking, embeddings, vector stores, retrievers, generation, evaluation basics, and advanced RAG patterns.  
We’ll use **open‑source tooling** wherever possible (FAISS, HuggingFace embeddings, and Ollama for local LLMs such as `mistral`).

## How to use this notebook

- Execute cells **top‑to‑bottom**.  
- If you don’t have the packages, run the optional `pip install` cell.  
- If you don’t run Ollama locally, swap the LLM to any provider you have (e.g., OpenAI, Anthropic) — code comments show how.

In [None]:
# OPTIONAL: install dependencies (uncomment as needed)
# If you're on Colab, also: !apt -y install -qq libstdc++6
# %pip install -q langchain langchain-community langchain-text-splitters langchain-core
# %pip install -q faiss-cpu
# %pip install -q sentence-transformers
# %pip install -q ragas datasets evaluate # (optional, for evaluation section)
# %pip install -q langchain-ollama  # for local LLMs via Ollama

## RAG in one picture

**RAG = Retrieval + Generation.**  
1) **Indexing**: Load data → chunk → embed → store in a vector DB.  
2) **Retrieval**: Embed the user query → retrieve the most similar chunks.  
3) **Generation**: Feed retrieved chunks + query to an LLM with a grounded prompt → **answer with citations**.
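
The three stages above can be sketched without any libraries. This is a toy illustration only: the `embed` function uses word counts as a stand-in for a neural embedding model, the "vector DB" is a Python list, and the generation step stops at building the grounded prompt. All names (`toy_chunks`, `toy_index`, `toy_retrieve`, `toy_prompt`) are hypothetical and separate from the LangChain objects built later.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': lower-cased word counts (stand-in for a neural encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1) Indexing: each string is already one chunk; embed it and store the pair
toy_chunks = [
    "NEFT is an Indian retail payment system for bank transfers.",
    "Two-factor authentication requires a password and a one-time code.",
    "Data at rest is stored on disk; data in transit moves across networks.",
]
toy_index = [(chunk, embed(chunk)) for chunk in toy_chunks]

# 2) Retrieval: embed the query and rank stored chunks by similarity
def toy_retrieve(query: str, k: int = 2):
    q_vec = embed(query)
    ranked = sorted(toy_index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# 3) Generation: build the grounded prompt that would be sent to an LLM
def toy_prompt(query: str, k: int = 2) -> str:
    context = "\n".join(f"(S{i}) {c}" for i, c in enumerate(toy_retrieve(query, k), start=1))
    return (
        "Answer using ONLY the context and cite sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

print(toy_prompt("What is NEFT?"))
```

The rest of the notebook replaces each toy piece with a real component: a sentence-transformer for `embed`, FAISS for the list, and an LLM call for the final step.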

## Dataset (Toy Corpus)

For the demo, we'll use a small, self‑contained corpus so everything runs offline:
- `banking_faq.txt` — toy banking/domain knowledge
- `ml_rag_notes.txt` — small RAG notes

You can replace these with PDFs, web pages, markdown, or your own folder.

In [None]:
# Create a tiny local corpus so the notebook is self-contained
banking_faq = """
Q: What is two-factor authentication (2FA)?
A: A security process that requires two distinct forms of identification: something you know (password) and something you have (OTP/device).

Q: What is NEFT?
A: National Electronic Funds Transfer, an Indian retail payment system for one-to-one money transfers between bank accounts.

Q: What is data at rest vs data in transit?
A: Data at rest is stored data (e.g., on disk); data in transit moves across networks. Both should be protected with encryption and access controls.
"""

ml_rag_notes = """
Retrieval Augmented Generation (RAG) combines information retrieval with text generation.
Core steps: indexing (load, split, embed, store), retrieval (top-k, hybrid, filters),
and generation (prompt with retrieved context).
Advanced: query rewriting, multi-vector retrievers, reranking, stuffing vs map-reduce,
and evaluation (faithfulness, answer relevancy, context precision/recall).
"""

with open("banking_faq.txt", "w", encoding="utf-8") as f:
    f.write(banking_faq)
with open("ml_rag_notes.txt", "w", encoding="utf-8") as f:
    f.write(ml_rag_notes)

print("Created banking_faq.txt and ml_rag_notes.txt")

## 1) Ingestion & Chunking

We’ll load plain‑text files and chunk them with `RecursiveCharacterTextSplitter`.  
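Chunk size and overlap matter: **too big** → slower retrieval and prompts padded with irrelevant text; **too small** → chunks lose the context needed to answer. Start with 500–1,000 tokens (or ~1,500–3,000 chars) and **tune**. Note that `RecursiveCharacterTextSplitter` measures `chunk_size` in **characters** by default, and the toy corpus here is tiny, so the code uses 800 characters.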
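
To make the roles of chunk size and overlap concrete, here is a simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries to break on separators such as paragraph and line breaks); `sliding_chunks` is a hypothetical helper for illustration only.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` chars."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return pieces

# Each chunk repeats the last character of the previous one (overlap = 1)
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

A larger overlap reduces the chance that a sentence is cut in half at a boundary, at the cost of storing and embedding more duplicated text.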

In [None]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load
paths = ["banking_faq.txt", "ml_rag_notes.txt"]
docs = []
for p in paths:
    docs.extend(TextLoader(p, encoding="utf-8").load())

# Split
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)
chunks = splitter.split_documents(docs)

len(docs), len(chunks), chunks[0].page_content[:200]

## 2) Embeddings & Vector Store (FAISS)

We’ll use **HuggingFace** MiniLM embeddings (good quality & local) and **FAISS** for similarity search.
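
Conceptually, similarity search compares the query vector against stored vectors and returns the closest ones. The sketch below does this by brute force with Euclidean distance on hand-made 3-dimensional vectors; real embeddings have hundreds of dimensions, and the vectors, labels, and the `nearest` helper here are made up for illustration. Cosine similarity or inner product are common alternatives to Euclidean distance.

```python
import math

def l2_distance(a, b) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query_vec, stored: dict, k: int = 2):
    """Return the k (distance, label) pairs closest to query_vec, nearest first."""
    scored = sorted((l2_distance(query_vec, vec), label) for label, vec in stored.items())
    return scored[:k]

# Hypothetical toy "embeddings": one vector per topic
toy_vectors = {
    "neft": [0.9, 0.1, 0.0],
    "2fa": [0.1, 0.9, 0.0],
    "encryption": [0.0, 0.2, 0.9],
}

for dist, label in nearest([0.8, 0.2, 0.1], toy_vectors, k=2):
    print(f"{label}: distance {dist:.3f}")
```

An exact (flat) index does essentially this, just vectorised; approximate indexes trade a little recall for much lower latency on large collections.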

In [None]:
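# Newer LangChain releases also ship this class in the separate `langchain-huggingface`
# package; the community import used here may emit a deprecation warning.
from langchain_community.embeddings import HuggingFaceEmbeddings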
from langchain_community.vectorstores import FAISS

# Embeddings model (downloads on first use)
emb_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Build the index
vectorstore = FAISS.from_documents(chunks, emb_model)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
retriever

## 3) LLM Setup

We’ll default to **Ollama** (e.g., `mistral`, `llama3`) to stay local.  
If you don’t use Ollama, switch to `ChatOpenAI` or another provider by replacing the imports and the `llm` object.

In [None]:
# Choose ONE of the below.

# --- A) Local LLM via Ollama (recommended for offline) ---
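try:
    from langchain_ollama import ChatOllama
    # Note: creating the object usually does not contact the Ollama server, so a
    # stopped or missing server only shows up as an error on the first invoke().
    llm = ChatOllama(model="mistral")  # or "llama3", "qwen2", etc. in your Ollama
    USING_OLLAMA = True
except Exception as exc:
    print(f"Could not set up ChatOllama ({exc}). Falling back to a no-op mock LLM.")
    USING_OLLAMA = False
    from types import SimpleNamespace

    class MockLLM:
        """Stand-in with the same shape as a chat model: invoke() returns an object with .content."""
        def invoke(self, msgs):
            return SimpleNamespace(content="MockLLM: Provide a real LLM (Ollama/OpenAI/Anthropic/etc.)")
    llm = MockLLM()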

# --- B) Alternative: OpenAI (uncomment and set env var OPENAI_API_KEY) ---
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-4o-mini")

## 4) Prompting Strategy

We’ll use a **stuff** prompt template: instructions + retrieved context + question.  
Best practices:
- Keep instructions **clear and firm** (cite sources, don't fabricate).
- Add **formatting** (e.g., bullet points, JSON if needed).
- Provide **fallback** if context is insufficient.

In [None]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer using ONLY the provided context. "
               "Cite sources as (S1), (S2), etc. If the answer isn't in context, say you don't know."),
    ("human", "Question: {question}\n\nContext:\n{context}\n\nAnswer:")
])
RAG_PROMPT

## 5) Build the RAG Chain (Retrieve → Format → Generate)

We’ll:
1) Retrieve top‑k chunks  
2) Format them into a single context string  
3) Call the LLM with our prompt

In [None]:
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

def format_docs(docs):
    """Join chunks into one context string, labelling each as (S1), (S2), ... for citation."""
    formatted = []
    for i, d in enumerate(docs, start=1):
        formatted.append(f"(S{i})\n{d.page_content.strip()}")
    return "\n\n".join(formatted)

# Chain: input question -> retrieve docs -> format -> prompt -> llm
# The dict becomes a parallel step: "context" runs retrieval + formatting,
# while "question" passes the input string through unchanged.
rag_chain = (
    {"context": retriever | RunnableLambda(format_docs), "question": RunnablePassthrough()}
    | RAG_PROMPT
    | RunnableLambda(lambda p: llm.invoke(p))
)

response = rag_chain.invoke("What is NEFT and how is it used?")
print(getattr(response, "content", response))

## 6) Return Sources Separately (for UI apps)

Often you’ll want the answer **and** the underlying source docs to show citations or links.

In [None]:
def ask(question: str, k: int = 4):
    # Query the vector store directly so the per-call `k` is actually honored
    docs = vectorstore.similarity_search(question, k=k)
    ctx = format_docs(docs)
    msg = RAG_PROMPT.invoke({"question": question, "context": ctx})
    out = llm.invoke(msg) if hasattr(llm, "invoke") else llm(msg)
    return getattr(out, "content", out), docs

answer, sources = ask("Explain data at rest vs data in transit.")
print("ANSWER:\n", answer, "\n")
print("SOURCES:")
for i, d in enumerate(sources, 1):
    print(f"  (S{i}) {d.metadata.get('source')}")

## 7) Query Transformations (Better Retrieval)

Techniques:
- **Rewriting** (e.g., condense follow‑ups into standalone queries)
- **HyDE** (generate hypothetical answer → embed → retrieve)
- **Multi‑query** (expand into several paraphrases → merge results)

Below is a simple **multi‑query** example.

In [None]:
import re
from langchain_core.prompts import PromptTemplate

multiquery_template = PromptTemplate.from_template(
    "Generate 3 diverse search queries that rephrase: '{question}'. Return one per line."
)

def multiquery_retrieve(question: str, top_k: int = 3, per_query_k: int = 3):
    """Expand the question into paraphrases, retrieve for each, and merge unique chunks.

    top_k caps the number of generated queries; per_query_k caps the hits kept per query.
    """
    # Expand: one query per line, with list markers ("-", "•", "1.") stripped
    if hasattr(llm, "invoke"):
        gen = llm.invoke(multiquery_template.format(question=question))
        lines = getattr(gen, "content", str(gen)).splitlines()
        queries = [re.sub(r"^\s*(?:[-•*]|\d+[.)])\s*", "", ln).strip() for ln in lines]
        queries = [q for q in queries if q]
    else:
        queries = [question + " details", "explain " + question]

    # Retrieve for the original question plus the paraphrases, then de-duplicate
    seen = set()
    merged = []
    for q in [question] + queries[:top_k]:
        hits = retriever.invoke(q)[:per_query_k]
        for h in hits:
            key = (h.page_content, h.metadata.get("source"))
            if key not in seen:
                seen.add(key)
                merged.append(h)
    return merged

q = "What is RAG?"
docs_mq = multiquery_retrieve(q)
print("Retrieved", len(docs_mq), "unique chunks via multiquery.")
print(docs_mq[0].page_content[:160])

## 8) Lightweight Re‑ranking (Optional)

If you have a cross‑encoder (e.g., `cross-encoder/ms-marco-MiniLM-L-6-v2`), you can **re‑rank** the top‑k docs by relevance score.  
The cell below is **optional** and commented out; uncomment it once `sentence-transformers` is installed (the cross‑encoder weights download on first use).

In [None]:
# OPTIONAL: simple reranker with a cross-encoder
# from sentence_transformers import CrossEncoder
# reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# def rerank(question: str, docs):
#     pairs = [(question, d.page_content) for d in docs]
#     scores = reranker.predict(pairs)
#     ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
#     return [d for d, s in ranked]

# # Example usage:
# top = retriever.invoke("What is NEFT?")
# top = rerank("What is NEFT?", top)
# format_docs(top[:4])

## 9) Build a Simple RAG Function (Reusable)

Wraps everything: retrieval (+optional multiquery), formatting, generation, and returns answer + sources.

In [None]:
def rag_answer(question: str, use_multiquery: bool = False, k: int = 4):
    """Retrieve (optionally via multi-query), format, generate; return (answer, sources)."""
    if use_multiquery:
        docs = multiquery_retrieve(question, top_k=3, per_query_k=3)
    else:
        docs = retriever.invoke(question)
    docs = docs[:k]  # the base retriever returns at most its configured k (4) chunks
    ctx = format_docs(docs)
    msg = RAG_PROMPT.invoke({"question": question, "context": ctx})
    out = llm.invoke(msg) if hasattr(llm, "invoke") else llm(msg)
    return getattr(out, "content", out), docs

ans, src = rag_answer("How does two-factor authentication work?", use_multiquery=True)
print(ans)

## 10) Evaluation Basics (RAGAS / Heuristics)

**Why evaluate RAG?** To track: answer relevance, faithfulness (groundedness), context relevance/precision/recall.

- **RAGAS** can compute metrics using an LLM judge (requires an API/LLM).  
- **Heuristics** (cheap): Answer contains keywords present in the retrieved context; exact‑match to FAQ; length/overlap checks.

We include a small, LLM‑free heuristic as a placeholder.
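
When you have a small hand-labelled set of which chunks are relevant to each question, classic set-based precision and recall give a cheap retrieval check. This is a sketch of that idea, not the RAGAS definitions (which are LLM-judged and computed differently); the function name and the chunk IDs in the example are hypothetical.

```python
def context_precision_recall(retrieved_ids, relevant_ids):
    """Set-based precision and recall of retrieved chunk IDs against labelled relevant IDs."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0  # share of retrieved that is relevant
    recall = hits / len(relevant) if relevant else 0.0       # share of relevant that was retrieved
    return precision, recall

# Hypothetical labels: only the third FAQ entry answers the question
precision, recall = context_precision_recall(
    retrieved_ids=["banking_faq.txt#2", "ml_rag_notes.txt#0", "banking_faq.txt#0"],
    relevant_ids=["banking_faq.txt#2"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision suggests `k` is too large or chunks are too coarse; low recall suggests the retriever is missing relevant material altogether.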

In [None]:
import re

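def simple_overlap_score(answer: str, docs) -> float:
    """Fraction of distinct answer words (longer than 3 chars) that also occur in the context."""
    ctx = " ".join(d.page_content for d in docs).lower()
    # Compare whole tokens, not substrings, so "rest" does not match "interest"
    ctx_toks = set(re.findall(r"[a-z0-9]+", ctx))
    toks = {t for t in re.findall(r"[a-z0-9]+", answer.lower()) if len(t) > 3}
    if not toks:
        return 0.0
    return len(toks & ctx_toks) / len(toks)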

test_q = "What is data at rest vs data in transit?"
ans, src = rag_answer(test_q, use_multiquery=False)
print("ANSWER:", ans[:200], "...")
print("Overlap score (0-1):", round(simple_overlap_score(ans, src), 3))

## 11) Production Considerations & Advanced Patterns

- **Hybrid retrieval**: BM25 (sparse) + dense embeddings → better recall.  
- **Multi‑vector**: store titles, summaries, key phrases alongside body embeddings.  
- **Query routing**: choose retriever by domain/namespace.  
- **Structured output**: ask the LLM to produce JSON with fields (answer, citations).  
- **Safety**: protect PII, apply redaction, add guardrails.  
- **Caching**: cache embeddings and responses (e.g., an exact‑match cache, or an LSH/semantic cache for near‑duplicate queries).  
- **Observability**: log traces (LangSmith), track latency, failure modes.  
- **Chunking strategies**: by headings/semantic breaks; adaptive chunking.  
- **Freshness**: add **web search** or a **SQL retriever** for live data.  
- **Agents**: allow tools (search/DB) for retrieval beyond pure vector stores.  
- **Security**: document‑level/row‑level ACLs; encrypt at rest & in transit.
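
For the hybrid retrieval item above, one common way to merge a sparse (BM25) ranking with a dense ranking is reciprocal rank fusion (RRF). The list does not name a fusion method, so RRF is an assumption here, as are the document IDs and the conventional constant `k=60`.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k: int = 60):
    """Merge several ranked lists of doc IDs; each list adds 1 / (k + rank) to a doc's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical rankings from two retrievers over the same corpus
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
```

RRF only needs ranks, not raw scores, which is convenient because BM25 scores and embedding similarities are not on comparable scales.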

## 12) Full Minimal App (Console)

A tiny loop to ask questions. In real apps, use FastAPI/Gradio/Streamlit and show sources inline.

In [None]:
def chat():
    print("RAG chat. Type 'exit' to quit.")
    while True:
        q = input("\nYou: ").strip()
        if q.lower() in {"exit", "quit"}:
            break
        ans, src = rag_answer(q, use_multiquery=True)
        print("\nAssistant:", ans)
        print("\nSources:")
        for i, d in enumerate(src, 1):
            print(f"  (S{i}) {d.metadata.get('source')}")

# Uncomment to try:
# chat()

## 13) Alternatives / Related Techniques

- **Fusion‑in‑Decoder (FiD)**: encode multiple docs and let the decoder attend across them.  
- **ColBERT / Late Interaction**: fine retrieval with token‑level interactions.  
- **GraphRAG**: build a knowledge graph and retrieve subgraphs as context.  
- **Toolformer / Agents**: tool‑use during generation instead of pre‑retrieval.  
- **Index‑aware prompting**: train prompts to match chunking & metadata.  
- **Knowledge distillation**: pre‑compute Q/A pairs; train a domain‑LLM or a reranker.
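
To illustrate the late-interaction idea behind ColBERT: each query token vector is matched with its most similar document token vector, and those maxima are summed (often called MaxSim). The sketch below uses made-up 2-dimensional token vectors and a hypothetical `maxsim_score` helper; a real system would use learned token embeddings.

```python
def dot(a, b) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vecs, doc_vecs) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    if not doc_vecs:
        return 0.0
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Hypothetical token vectors: the query has two tokens pointing in different directions
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first query token

print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```

Because every query token must find its own match, a document that covers all parts of the query scores higher than one that matches a single term strongly.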

## 14) What a Data Scientist Should Know (Checklist)

- **Data**: formats, loaders, cleaning, PII redaction, deduplication, canonicalization.  
- **Embeddings**: model choice, dimensionality, normalization, drift monitoring.  
- **Index**: FAISS index parameters, e.g. IVF (`nlist`, `nprobe`) and HNSW (`M`, `efSearch`); ANN recall/latency trade‑offs.  
- **Retrieval**: k, filters, re‑ranking, query rewrite, hybrid search.  
- **Prompting**: system instructions, style, output schema, refusal policy.  
- **Quality**: eval sets, AB tests, RAGAS/G-Eval; track hallucinations.  
- **Ops**: concurrency, caching, cold‑start, containerization, CI/CD.  
- **Security**: encryption (at rest/in transit), access control, audit logs.  
- **Cost & Latency**: embeddings batch, quantized LLMs, streaming.  
- **Compliance**: data residency, retention, legal holds, consent.
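
As a small example for the embeddings item (normalization): after L2-normalising vectors, the inner product equals cosine similarity, so vector length no longer influences ranking. The vectors and helper names below are illustrative only.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def inner_product(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

raw = inner_product(a, b)                                  # grows with vector length
normalized = inner_product(l2_normalize(a), l2_normalize(b))  # depends on direction only
print(raw, round(normalized, 6))
```

Whether to normalise depends on the embedding model and the index metric, so it is worth checking both when swapping either one.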

---

### You're done!
Try swapping the corpus, embeddings, and LLM to match your real use case.  
Integrate a web UI and display sources as expandable snippets with highlights.