# Module 21: Retrieval-Augmented Generation (RAG)

**Grounding LLMs with External Knowledge**

---

## 1. Objectives

- ✅ Understand RAG architecture
- ✅ Implement document chunking
- ✅ Create embeddings and vector stores
- ✅ Build retrieval pipeline

## 2. Prerequisites

- [Module 20: Fine-Tuning LLMs](../20_finetuning/20_finetuning.ipynb)

## 3. Why RAG?

### LLM Limitations

| Problem | RAG Solution |
|---------|-------------|
| Knowledge cutoff | Retrieve current docs |
| Hallucinations | Ground in sources |
| No private data | Index your documents |
| Token limits | Retrieve relevant chunks |

### RAG Architecture

```
┌─────────────────────────────────────────────────────┐
│                      RAG Pipeline                    │
├─────────────────────────────────────────────────────┤
│  [Documents] → [Chunk] → [Embed] → [Vector Store]   │
│                                          ↓           │
│  [Query] → [Embed] → [Similarity Search]            │
│                              ↓                       │
│  [Retrieved Chunks] + [Query] → [LLM] → [Answer]    │
└─────────────────────────────────────────────────────┘
```

In [2]:
# Install: pip install sentence-transformers chromadb

import numpy as np
from typing import List, Tuple

## 4. Document Chunking

### Chunking Strategies

| Strategy | Pros | Cons |
|----------|------|------|
| Fixed size | Simple | Breaks context |
| Sentence | Semantic | Variable size |
| Recursive | Best quality | Complex |
| Semantic | Preserves meaning | Slow |
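
The "Sentence" row in the table above can be illustrated with a minimal sketch. The function name `sentence_chunk`, the `max_chars` parameter, and the naive regex boundary rule are assumptions made for this example; real text with abbreviations or decimals would need a proper sentence tokenizer.

```python
import re
from typing import List


def sentence_chunk(text: str, max_chars: int = 500) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    # Naive boundary rule (an assumption): split after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


demo_text = "RAG retrieves documents. It then augments the prompt. Finally the LLM answers!"
for chunk in sentence_chunk(demo_text, max_chars=55):
    print(repr(chunk))
```

Because chunks end on sentence boundaries, their sizes vary, which is the trade-off listed in the table.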

In [3]:
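def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Fixed-size character chunking with overlap between consecutive chunks."""
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would never advance and loop forever
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Last chunk reached; avoids a trailing chunk made only of overlap
        start = end - overlap  # Step back so neighbouring chunks share context

    return chunks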

# Example
sample_text = "RAG combines retrieval and generation. " * 20
chunks = chunk_text(sample_text, chunk_size=100, overlap=20)
print(f"Created {len(chunks)} chunks")
print(f"First chunk: '{chunks[0][:50]}...'")

Created 10 chunks
First chunk: 'RAG combines retrieval and generation. RAG combine...'


In [4]:
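from typing import Optional

def recursive_chunk(text: str, max_size: int = 500, separators: Optional[List[str]] = None) -> List[str]:
    """Recursive chunking: split on the coarsest separator, recurse on oversized pieces."""
    if separators is None:
        separators = ["\n\n", "\n", ". ", " "]

    if len(text) <= max_size:
        return [text]

    for idx, sep in enumerate(separators):
        if sep not in text:
            continue
        parts = text.split(sep)
        # Re-attach the separator to every part except the last, so no text is lost or added
        pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
        chunks, current = [], ""

        for piece in pieces:
            if len(current) + len(piece) <= max_size:
                current += piece
                continue
            if current:
                chunks.append(current.strip())
                current = ""
            if len(piece) > max_size:
                # Still too large: recurse with the finer-grained separators
                chunks.extend(recursive_chunk(piece, max_size, separators[idx + 1:]))
            else:
                current = piece

        if current:
            chunks.append(current.strip())
        return [c for c in chunks if c]

    # Fallback: no separator found, so hard split by character count
    return [text[i:i+max_size] for i in range(0, len(text), max_size)]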

print("Recursive chunking ready!")

Recursive chunking ready!


## 5. Embeddings

### Popular Embedding Models

| Model | Dimensions | Quality | Speed |
|-------|-----------|---------|-------|
| all-MiniLM-L6-v2 | 384 | Good | Fast |
| all-mpnet-base-v2 | 768 | Better | Medium |
| bge-large-en-v1.5 | 1024 | Best | Slow |
| OpenAI text-embedding-ada-002 | 1536 | Great | API (network-bound) |

In [5]:
from sentence_transformers import SentenceTransformer

# Load embedding model
embed_model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings
texts = [
    "RAG combines retrieval with generation",
    "Vector databases store embeddings",
    "The weather is nice today"
]

embeddings = embed_model.encode(texts)
print(f"Embeddings shape: {embeddings.shape}")
print(f"Each embedding: {embeddings[0].shape}")

Embeddings shape: (3, 384)
Each embedding: (384,)


In [6]:
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare similarities
query = "How does retrieval augmented generation work?"
query_emb = embed_model.encode(query)

print("Query:", query)
print("\nSimilarities:")
for i, text in enumerate(texts):
    sim = cosine_similarity(query_emb, embeddings[i])
    print(f"  '{text[:40]}...': {sim:.3f}")

Query: How does retrieval augmented generation work?

Similarities:
  'RAG combines retrieval with generation...': 0.608
  'Vector databases store embeddings...': 0.185
  'The weather is nice today...': -0.102


## 6. Vector Store with ChromaDB

In [7]:
!pip install chromadb
import chromadb

# Create in-memory client
client = chromadb.Client()

# Create collection
collection = client.create_collection(
    name="my_documents",
    metadata={"hnsw:space": "cosine"}
)

# Add documents
documents = [
    "RAG helps LLMs access external knowledge",
    "Vector databases enable fast similarity search",
    "Chunking splits documents into manageable pieces",
    "Embeddings represent text as dense vectors"
]

collection.add(
    documents=documents,
    ids=[f"doc_{i}" for i in range(len(documents))]
)

print(f"Added {collection.count()} documents")

Added 4 documents


In [8]:
# Query the collection (Chroma returns distances, so lower values mean closer matches)
results = collection.query(
    query_texts=["How do I search for similar documents?"],
    n_results=2
)

print("Top results:")
for doc, dist in zip(results['documents'][0], results['distances'][0]):
    print(f"  [{dist:.3f}] {doc}")

Top results:
  [0.516] Vector databases enable fast similarity search
  [0.683] Chunking splits documents into manageable pieces
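
To make the distances above easier to read, here is a minimal pure-Python sketch of what a cosine-space query computes conceptually: distance = 1 - cosine similarity, sorted ascending. The function names and the toy 3-d vectors are made up for illustration. ChromaDB itself uses an approximate HNSW index rather than the exhaustive scan shown here, so treat this as a mental model, not a description of its internals.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance = 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_query(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    n_results: int = 2,
) -> List[Tuple[int, float]]:
    """Return (doc_index, distance) pairs for the n_results closest vectors."""
    scored = [(i, cosine_distance(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:n_results]


# Toy 3-d vectors standing in for real embeddings (made-up values)
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.1, 0.0]

for idx, dist in brute_force_query(toy_query, toy_vectors, n_results=2):
    print(f"doc {idx}: distance {dist:.3f}")
```

An exhaustive scan like this costs one comparison per stored vector, which is why vector stores rely on approximate indexes once collections grow large.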


## 7. Complete RAG Pipeline

In [9]:
class SimpleRAG:
    """Minimal RAG implementation."""

    def __init__(self, embed_model_name: str = 'all-MiniLM-L6-v2'):
        self.embed_model = SentenceTransformer(embed_model_name)
        self.client = chromadb.Client()
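        # Use cosine distance, matching the collection created in Section 6
        self.collection = self.client.create_collection(
            name="rag_docs",
            metadata={"hnsw:space": "cosine"}
        )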
        self.documents = []

    def add_documents(self, docs: List[str]):
        """Add documents to index."""
        embeddings = self.embed_model.encode(docs).tolist()
        ids = [f"doc_{len(self.documents)+i}" for i in range(len(docs))]

        self.collection.add(
            documents=docs,
            embeddings=embeddings,
            ids=ids
        )
        self.documents.extend(docs)

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        """Retrieve top-k relevant chunks."""
        query_emb = self.embed_model.encode([query]).tolist()
        results = self.collection.query(
            query_embeddings=query_emb,
            n_results=k
        )
        return results['documents'][0]

    def generate_prompt(self, query: str, context: List[str]) -> str:
        """Create prompt with retrieved context."""
        context_str = "\n\n".join(context)
        return f"""Answer the question based on the context below.

Context:
{context_str}

Question: {query}

Answer:"""

# Usage
rag = SimpleRAG()
print("SimpleRAG initialized!")

SimpleRAG initialized!


In [10]:
# Add knowledge base
knowledge = [
    "PyTorch is a deep learning framework developed by Meta.",
    "Transformers use self-attention to process sequences.",
    "BERT is an encoder-only transformer for understanding.",
    "GPT models use decoder-only architecture for generation.",
    "RAG retrieves relevant documents to augment LLM responses."
]

rag.add_documents(knowledge)

# Query
query = "What is BERT used for?"
context = rag.retrieve(query, k=2)
prompt = rag.generate_prompt(query, context)

print("Generated Prompt:")
print(prompt)

Generated Prompt:
Answer the question based on the context below.

Context:
BERT is an encoder-only transformer for understanding.

PyTorch is a deep learning framework developed by Meta.

Question: What is BERT used for?

Answer:
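
The prompt above still has to be sent to a model to complete the Retrieve → Augment → Generate loop. The sketch below shows one way to wire that last step. `answer_question`, `fake_retrieve`, and `fake_llm` are hypothetical names, and the stub "LLM" only echoes the first context line; in practice you would pass `rag.retrieve` and a function that calls a real model.

```python
from typing import Callable, List


def build_prompt(query: str, context: List[str]) -> str:
    """Same prompt layout as SimpleRAG.generate_prompt, repeated so this runs standalone."""
    context_str = "\n\n".join(context)
    return (
        "Answer the question based on the context below.\n\n"
        f"Context:\n{context_str}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


def answer_question(
    query: str,
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
) -> str:
    """Retrieve -> augment prompt -> generate."""
    context = retrieve(query)
    prompt = build_prompt(query, context)
    return llm(prompt)


# Hypothetical stand-ins: swap in rag.retrieve and a real model call
def fake_retrieve(query: str) -> List[str]:
    return ["BERT is an encoder-only transformer for understanding."]


def fake_llm(prompt: str) -> str:
    # A real LLM would generate text; this stub just echoes the first context line
    context_block = prompt.split("Context:\n", 1)[1]
    return context_block.split("\n", 1)[0]


print(answer_question("What is BERT used for?", fake_retrieve, fake_llm))
```

Passing the retriever and the model as callables keeps the pipeline testable without loading an embedding model or calling an API.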


## 8. Advanced RAG Techniques

### Improvements

| Technique | Purpose |
|-----------|--------|
| HyDE | Generate hypothetical answer to embed |
| Re-ranking | Reorder results with cross-encoder |
| Multi-query | Generate variations of query |
| Parent retrieval | Return larger context around chunks |
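
For the multi-query row, the result lists from each query variation have to be merged somehow. The table does not name a method; the sketch below assumes reciprocal rank fusion (RRF), one common choice. The function name, the document ids, and the three ranked lists are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of doc ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k = 60 is a commonly used default.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score (descending); break ties by doc id for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical retrieval results for three phrasings of the same question
results_per_query = [
    ["doc_2", "doc_0", "doc_4"],
    ["doc_2", "doc_4", "doc_1"],
    ["doc_4", "doc_2", "doc_3"],
]
print(reciprocal_rank_fusion(results_per_query))
```

RRF only needs ranks, not raw scores, so it works even when the individual retrievers report distances on different scales.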

### Production Considerations

1. **Persistent storage**: Use ChromaDB's `PersistentClient`, or a dedicated vector database such as Pinecone or Weaviate
2. **Metadata filtering**: Filter by date, source, category
3. **Hybrid search**: Combine dense + sparse (BM25)
4. **Evaluation**: Hit rate, MRR, answer faithfulness
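
The retrieval metrics in item 4 are simple to compute once you have labelled queries. This is a minimal sketch; the function names and the tiny evaluation set are made up for illustration, and answer faithfulness is not covered because it needs a judge (human or model) rather than a formula.

```python
from typing import List, Set


def hit_rate(retrieved: List[List[str]], relevant: List[Set[str]], k: int = 3) -> float:
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(
        1 for docs, rel in zip(retrieved, relevant) if any(d in rel for d in docs[:k])
    )
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: List[List[str]], relevant: List[Set[str]]) -> float:
    """Average of 1 / rank of the first relevant doc (0 if none is retrieved)."""
    total = 0.0
    for docs, rel in zip(retrieved, relevant):
        for rank, doc_id in enumerate(docs, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)


# Hypothetical evaluation set: retrieved ids per query and the ids judged relevant
eval_retrieved = [["doc_2", "doc_0"], ["doc_1", "doc_3"], ["doc_4", "doc_0"]]
eval_relevant = [{"doc_2"}, {"doc_3"}, {"doc_1"}]

print(f"Hit rate@2: {hit_rate(eval_retrieved, eval_relevant, k=2):.2f}")
print(f"MRR:        {mean_reciprocal_rank(eval_retrieved, eval_relevant):.2f}")
```

Hit rate only asks whether something relevant was found, while MRR also rewards ranking it near the top.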

## 9. Interview Questions

**Q1: What is RAG and why use it?**
<details><summary>Answer</summary>

RAG retrieves relevant documents and includes them in the LLM prompt. Benefits: reduces hallucinations, adds current/private knowledge, enables source citation.
</details>

**Q2: How do you choose chunk size?**
<details><summary>Answer</summary>

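Chunk size is a trade-off: chunks that are too small lose context, while chunks that are too large dilute relevance. A typical range is 200-500 tokens, with 10-20% overlap to maintain context across chunks. Note that `chunk_text` in Section 4 counts characters, not tokens.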
</details>
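
As a rough illustration of the numbers in Q2, the sketch below sizes chunks in whitespace-separated words with a fractional overlap. Words are only an approximation of model tokens, and `chunk_by_words`, `chunk_tokens`, and `overlap_ratio` are names invented for this example; a real pipeline would count tokens with the tokenizer of the embedding model.

```python
from typing import List


def chunk_by_words(text: str, chunk_tokens: int = 300, overlap_ratio: float = 0.15) -> List[str]:
    """Sliding-window chunking measured in whitespace-separated words."""
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    words = text.split()
    overlap = round(chunk_tokens * overlap_ratio)
    step = max(1, chunk_tokens - overlap)

    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(words):
            break  # the window already reached the end of the text
    return chunks


words_text = " ".join(f"w{i}" for i in range(1000))
word_chunks = chunk_by_words(words_text, chunk_tokens=300, overlap_ratio=0.15)
print(len(word_chunks), "chunks; last chunk has", len(word_chunks[-1].split()), "words")
```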

**Q3: What's the difference between dense and sparse retrieval?**
<details><summary>Answer</summary>

- Dense: Uses embedding vectors, captures semantic similarity
- Sparse: Keyword matching (BM25), exact term matches
- Hybrid: Combines both for best results
</details>
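
The sparse and hybrid options in Q3 can be sketched in a few lines. The BM25 scorer below uses lowercased whitespace tokens and the common defaults `k1 = 1.5`, `b = 0.75`. The hybrid step, a weighted sum of min-max normalized scores, is one possible fusion rule chosen for this example, and the dense similarities are made-up values standing in for real embedding scores.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25 (lowercased whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(toks) for toks in tokenized) / len(tokenized)
    n_docs = len(tokenized)

    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for other in tokenized if term in other)  # document frequency
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            tf = counts[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores


def min_max(values: List[float]) -> List[float]:
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def hybrid_scores(dense: List[float], sparse: List[float], alpha: float = 0.5) -> List[float]:
    """Weighted sum of normalized dense and sparse scores (alpha weights the dense side)."""
    return [alpha * d + (1 - alpha) * s for d, s in zip(min_max(dense), min_max(sparse))]


hybrid_docs = [
    "vector databases enable fast similarity search",
    "chunking splits documents into manageable pieces",
    "embeddings represent text as dense vectors",
]
sparse = bm25_scores("similarity search", hybrid_docs)
dense = [0.48, 0.32, 0.25]  # made-up cosine similarities for illustration
print(hybrid_scores(dense, sparse, alpha=0.5))
```

Setting `alpha` closer to 1 favors semantic matches, while values closer to 0 favor exact keyword matches.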

## 10. Summary

- **RAG**: Retrieval + Generation to ground LLMs
- **Chunking**: Split documents with overlap
- **Embeddings**: Dense vector representations
- **Vector Store**: Fast similarity search
- **Pipeline**: Retrieve → Augment prompt → Generate

## 11. References

- [RAG Paper](https://arxiv.org/abs/2005.11401)
- [ChromaDB](https://www.trychroma.com/)
- [LangChain RAG](https://python.langchain.com/docs/use_cases/question_answering/)
- [Sentence Transformers](https://www.sbert.net/)

---
**Next:** [Module 22: NLP Model Deployment](../22_deployment/22_deployment.ipynb)