# Building a Simple RAG Pipeline

In this notebook, we'll build a complete Retrieval-Augmented Generation (RAG) pipeline from scratch. You'll learn how to customize each component for your specific needs.

## Learning Objectives

By the end of this notebook, you will:
1. Understand the complete RAG architecture
2. Customize chunking strategies (node parsing)
3. Configure retrieval parameters
4. Understand different response synthesis modes
5. Build a reusable RAG pipeline

---

## 1. RAG Architecture Deep Dive

RAG consists of three main phases:

### Phase 1: Indexing (Offline)
```
Documents → Chunking → Embedding → Vector Store
```

### Phase 2: Retrieval (At Query Time)
```
Query → Query Embedding → Similarity Search → Top-K Relevant Chunks
```

### Phase 3: Generation (At Query Time)
```
Query + Retrieved Chunks → LLM → Response
```
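
Before wiring in the library, the three phases can be sketched in plain Python. This is a toy illustration, not how LlamaIndex works internally: `embed` is a hypothetical word-count "embedding", the three chunks are invented, and Phase 3 stops at building the prompt rather than calling an LLM.

```python
import math
from collections import Counter
from typing import List

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Phase 1: indexing (chunks are hard-coded here; chunking is covered later)
chunks = [
    "supervised learning uses labeled data",
    "unsupervised learning finds structure in unlabeled data",
    "python lists are mutable sequences",
]
vector_store = [(chunk, embed(chunk)) for chunk in chunks]

# Phase 2: retrieval
def retrieve(query: str, top_k: int = 2) -> List[str]:
    """Return the top_k chunks ranked by similarity to the query."""
    query_vector = embed(query)
    ranked = sorted(vector_store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# Phase 3: generation (only the prompt is assembled; no LLM is called)
def build_prompt(query: str, top_k: int = 2) -> str:
    context = "\n".join(retrieve(query, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("what is supervised learning"))
```

Swapping the word-count vectors for a real embedding model and sending the prompt to an LLM gives the pipeline built in the rest of this notebook.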

Let's implement each phase with full control over the parameters!

In [None]:
# Setup
import nest_asyncio
nest_asyncio.apply()

from dotenv import load_dotenv
load_dotenv()

# Core imports
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Settings,
    Document,
)
from llama_index.core.node_parser import (
    SentenceSplitter,
    TokenTextSplitter,
    SemanticSplitterNodeParser,
)
from llama_index.core.extractors import (
    TitleExtractor,
    QuestionsAnsweredExtractor,
    SummaryExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

print("✓ Imports complete!")

In [None]:
# Configure settings
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.1)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

print("✓ Settings configured!")

## 2. Document Loading

Let's load documents and explore their structure before processing.

In [None]:
# Load documents
documents = SimpleDirectoryReader(
    input_dir="../data/sample_docs",
    filename_as_id=True,  # Use filename as document ID
).load_data()

print(f"Loaded {len(documents)} documents\n")

# Examine document structure
for doc in documents:
    print(f"Document: {doc.metadata.get('file_name', 'Unknown')}")
    print(f"  - Characters: {len(doc.text):,}")
    print(f"  - Words (approx): {len(doc.text.split()):,}")
    print(f"  - Metadata keys: {list(doc.metadata.keys())}")
    print()

## 3. Chunking Strategies (Node Parsing)

Chunking is **critical** for RAG performance. Poor chunking leads to:
- Lost context (chunks too small)
- Irrelevant retrieval (chunks too large)
- Broken sentences or ideas
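
To make the size and overlap trade-off concrete, here is a minimal sliding-window chunker. It is a sketch under simplifying assumptions: whitespace-separated words stand in for tokens, and `chunk_words` is a hypothetical helper, not a LlamaIndex function. It also cuts wherever the window ends, which is exactly how sentences get broken.

```python
from typing import List

def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of `chunk_size` words that share `chunk_overlap` words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

text = "one two three four five six seven eight nine ten"
for chunk in chunk_words(text, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window repeats the last word of the previous one, so an idea that straddles a boundary appears in both chunks.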

### Available Chunking Strategies

| Strategy | Best For | Description |
|----------|----------|-------------|
| `SentenceSplitter` | General use | Splits on sentences, respects chunk size |
| `TokenTextSplitter` | Token-aware | Precise token counting for LLM limits |
| `SemanticSplitterNodeParser` | Quality-focused | Groups semantically related text using embedding similarity |

Let's compare the first two on our documents!

In [None]:
# Strategy 1: SentenceSplitter (Recommended default)
sentence_splitter = SentenceSplitter(
    chunk_size=512,       # Target chunk size in tokens
    chunk_overlap=50,     # Overlap between chunks for context continuity
    separator=" ",        # Default separator
)

# Parse documents into nodes
sentence_nodes = sentence_splitter.get_nodes_from_documents(documents)

print(f"SentenceSplitter produced {len(sentence_nodes)} nodes")
print(f"\nSample node (first 300 chars):")
print(f"'{sentence_nodes[0].text[:300]}...'")

In [None]:
# Strategy 2: TokenTextSplitter (Token-precise)
token_splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)

token_nodes = token_splitter.get_nodes_from_documents(documents)

print(f"TokenTextSplitter produced {len(token_nodes)} nodes")

In [None]:
# Compare chunk sizes
import matplotlib.pyplot as plt

sentence_lengths = [len(n.text) for n in sentence_nodes]
token_lengths = [len(n.text) for n in token_nodes]

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(sentence_lengths, bins=20, edgecolor='black', alpha=0.7)
axes[0].set_title('SentenceSplitter - Chunk Sizes')
axes[0].set_xlabel('Characters per chunk')
axes[0].set_ylabel('Count')

axes[1].hist(token_lengths, bins=20, edgecolor='black', alpha=0.7)
axes[1].set_title('TokenTextSplitter - Chunk Sizes')
axes[1].set_xlabel('Characters per chunk')
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()

print(f"\nSentenceSplitter: avg={sum(sentence_lengths)/len(sentence_lengths):.0f} chars")
print(f"TokenTextSplitter: avg={sum(token_lengths)/len(token_lengths):.0f} chars")
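
The semantic strategy from the table is not run above. The idea behind it can be sketched in plain Python: compare each sentence with the next one and start a new chunk when they stop looking alike. Everything here is an assumption for illustration: word-count vectors replace real embeddings, and the fixed `threshold` is a hypothetical breakpoint rule, so this is not the algorithm `SemanticSplitterNodeParser` implements.

```python
import math
import re
from collections import Counter
from typing import List

def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(text: str, threshold: float = 0.2) -> List[str]:
    """Start a new chunk whenever adjacent sentences are less similar than `threshold`."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    vectors = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    groups = [[sentences[0]]]
    for previous, current, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if _cosine(previous, current) < threshold:
            groups.append([sentence])      # topic shift: open a new chunk
        else:
            groups[-1].append(sentence)    # same topic: extend the current chunk
    return [" ".join(group) for group in groups]

sample = (
    "Neural networks learn from data. Deep neural networks stack many layers. "
    "Python lists are mutable. Python tuples are immutable."
)
for chunk in semantic_chunks(sample):
    print(chunk)
```

The sample splits into one chunk about neural networks and one about Python containers, because the second and third sentences share no words.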

### Chunk Size Guidelines

| Use Case | Chunk Size (tokens) | Overlap (tokens) | Reasoning |
|----------|---------------------|------------------|-----------|
| Q&A on short docs | 256-512 | 20-50 | Precise retrieval |
| Long documents | 512-1024 | 50-100 | More context per chunk |
| Code documentation | 512-1024 | 100-200 | Preserve code blocks |
| Legal/Technical | 1024-2048 | 200+ | Complex context |

**Rule of thumb**: Start with 512 tokens, 10% overlap. Adjust based on retrieval quality.
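
A quick back-of-the-envelope calculation shows how these settings affect index size. The helper below is a rough estimate assuming fixed-size windows; real splitters respect sentence boundaries, so actual node counts will differ. `estimate_chunks` and the 10,000-token document are hypothetical.

```python
import math

def estimate_chunks(total_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate how many fixed-size windows are needed to cover `total_tokens` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # new tokens contributed by each extra chunk
    return 1 + math.ceil((total_tokens - chunk_size) / step)

# Roughly 10% overlap at three chunk sizes, for a 10,000-token document
for size, overlap in [(256, 25), (512, 50), (1024, 100)]:
    print(f"chunk_size={size}, overlap={overlap}: ~{estimate_chunks(10_000, size, overlap)} chunks")
```

Halving the chunk size roughly doubles the number of nodes to embed and store, which is part of the cost of more precise retrieval.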

## 4. Metadata Extraction (Advanced Preprocessing)

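Attaching metadata such as titles or summaries to chunks gives both the retriever and the LLM extra context, which can improve retrieval. LlamaIndex provides LLM-based extractors that generate this metadata automatically: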

In [None]:
# Create an ingestion pipeline with transformations
# This combines parsing + metadata extraction

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=50),
        # TitleExtractor infers a document-level title and attaches it to each chunk's metadata
        TitleExtractor(nodes=5),  # Use the first 5 nodes of each document to infer the title
        # SummaryExtractor creates a summary of each chunk
        # SummaryExtractor(summaries=["self"]),  # Uncomment for summaries (slower)
    ]
)

# Run the pipeline
print("Running ingestion pipeline...")
enriched_nodes = pipeline.run(documents=documents, show_progress=True)

print(f"\nProduced {len(enriched_nodes)} enriched nodes")

In [None]:
# Examine enriched metadata
print("Sample enriched node:")
sample = enriched_nodes[0]
print(f"  Metadata: {sample.metadata}")
print(f"  Text preview: {sample.text[:150]}...")
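
Conceptually, an ingestion pipeline is a list of transformations applied in order, each one taking nodes and returning nodes. The sketch below mimics that flow with plain dictionaries. The node shape, `split_on_blank_lines`, and `add_title` are hypothetical stand-ins; in particular, the real `TitleExtractor` asks an LLM for a title, while this one just takes the first line.

```python
from typing import Callable, List

def split_on_blank_lines(nodes: List[dict]) -> List[dict]:
    """Stand-in for a node parser: one node per paragraph, metadata copied from the parent."""
    result = []
    for node in nodes:
        for paragraph in node["text"].split("\n\n"):
            if paragraph.strip():
                result.append({"text": paragraph.strip(), "metadata": dict(node["metadata"])})
    return result

def add_title(nodes: List[dict]) -> List[dict]:
    """Stand-in for a metadata extractor: use the first line (max 40 chars) as the title."""
    for node in nodes:
        node["metadata"]["title"] = node["text"].splitlines()[0][:40]
    return nodes

def run_pipeline(nodes: List[dict], transformations: List[Callable]) -> List[dict]:
    """Apply each transformation in order, feeding its output to the next one."""
    for transform in transformations:
        nodes = transform(nodes)
    return nodes

toy_docs = [{"text": "Intro\nRAG basics.\n\nDetails\nMore text.", "metadata": {"file_name": "a.txt"}}]
toy_nodes = run_pipeline(toy_docs, [split_on_blank_lines, add_title])
for node in toy_nodes:
    print(node["metadata"])
```

Order matters: the extractor runs on the chunks produced by the splitter, which is why the splitter comes first in the transformation list.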

## 5. Building the Index

Now let's create our index using the processed nodes.

In [None]:
# Create index from our enriched nodes
index = VectorStoreIndex(nodes=enriched_nodes, show_progress=True)

print("\n✓ Index created!")

## 6. Retrieval Configuration

Retrieval settings strongly affect response quality. The two query-engine parameters you will tune most often:

| Parameter | Description | Default |
|-----------|-------------|--------|
| `similarity_top_k` | Number of chunks to retrieve | 2 |
| `response_mode` | How the response is synthesized from the retrieved chunks (see section 7) | "compact" |

In [None]:
# Create retriever with custom settings
retriever = index.as_retriever(
    similarity_top_k=5,  # Retrieve more chunks
)

# Test retrieval (without LLM generation)
query = "What are the types of machine learning?"
retrieved_nodes = retriever.retrieve(query)

print(f"Query: '{query}'")
print(f"\nRetrieved {len(retrieved_nodes)} nodes:\n")

for i, node in enumerate(retrieved_nodes):
    print(f"--- Node {i+1} (score: {node.score:.4f}) ---")
    print(f"Source: {node.metadata.get('file_name', 'N/A')}")
    print(f"Text: {node.text[:200]}...")
    print()
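
Once scores are visible, a natural next step is to drop weak matches instead of always passing a fixed number of chunks to the LLM. The sketch below shows the idea on plain `(text, score)` pairs. `select_chunks` and `min_score` are hypothetical names, and the scores are invented for illustration.

```python
from typing import List, Tuple

def select_chunks(
    scored: List[Tuple[str, float]], top_k: int, min_score: float = 0.0
) -> List[Tuple[str, float]]:
    """Keep at most `top_k` chunks, highest score first, discarding any below `min_score`."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [pair for pair in ranked[:top_k] if pair[1] >= min_score]

# Invented scores for four hypothetical chunks
scored_chunks = [("chunk a", 0.82), ("chunk b", 0.41), ("chunk c", 0.79), ("chunk d", 0.12)]

print(select_chunks(scored_chunks, top_k=3))                 # top 3, no cutoff
print(select_chunks(scored_chunks, top_k=3, min_score=0.5))  # weak match "chunk b" is dropped
```

A cutoff like this trades recall for precision, so it is worth checking against real queries before settling on a value.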

## 7. Response Synthesis Modes

LlamaIndex offers different ways to synthesize responses from retrieved chunks:

| Mode | Description | Best For |
|------|-------------|----------|
| `refine` | Iteratively refine answer with each chunk | Quality |
| `compact` | Compact chunks into fewer LLM calls | Balance |
| `tree_summarize` | Hierarchical summarization | Long context |
| `simple_summarize` | Truncate chunks to fit a single LLM call | Speed |
| `no_text` | Return chunks without LLM generation | Retrieval only |
| `accumulate` | Separate answer per chunk, then combine | Comprehensive |
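
The main practical difference between `refine`, `compact`, and `tree_summarize` is how many LLM calls they need. The sketch below counts calls under a deliberately simplified model: chunks are represented only by their token counts, `budget` is a hypothetical per-prompt context allowance, and every intermediate summary is assumed to be `summary_size` tokens. It approximates the call patterns described in the table and is not LlamaIndex's implementation.

```python
from typing import List

def pack(sizes: List[int], budget: int) -> List[int]:
    """Greedily merge consecutive chunk sizes into prompts no larger than `budget`."""
    packs: List[int] = []
    for size in sizes:
        if packs and packs[-1] + size <= budget:
            packs[-1] += size
        else:
            packs.append(size)
    return packs

def refine_calls(sizes: List[int]) -> int:
    """refine: one LLM call per chunk, each call updating the previous answer."""
    return len(sizes)

def compact_calls(sizes: List[int], budget: int) -> int:
    """compact: pack chunks into as few prompts as fit, then refine across the packs."""
    return len(pack(sizes, budget))

def tree_summarize_calls(sizes: List[int], budget: int, summary_size: int = 100) -> int:
    """tree_summarize: summarize each pack, then summarize the summaries until one remains."""
    if budget < 2 * summary_size:
        raise ValueError("budget must fit at least two summaries")
    calls, level = 0, list(sizes)
    while level:
        packs = pack(level, budget)
        calls += len(packs)
        if len(packs) == 1:
            break
        level = [summary_size] * len(packs)  # next level operates on the summaries
    return calls

chunk_tokens = [512] * 6  # six retrieved chunks of 512 tokens each
print("refine:", refine_calls(chunk_tokens))
print("compact:", compact_calls(chunk_tokens, budget=2048))
print("tree_summarize:", tree_summarize_calls(chunk_tokens, budget=2048))
```

With small `similarity_top_k` values the modes often cost about the same, and the gap widens as more chunks are retrieved.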

In [None]:
from llama_index.core.response_synthesizers import ResponseMode

# Compare different response modes
query = "Explain the ethical considerations of AI."

response_modes = [
    ("compact", "Fast, combines chunks"),
    ("refine", "Quality, iterative refinement"),
    ("tree_summarize", "Hierarchical, good for long content"),
]

for mode, description in response_modes:
    query_engine = index.as_query_engine(
        similarity_top_k=3,
        response_mode=mode,
    )
    
    response = query_engine.query(query)
    
    print(f"\n{'='*60}")
    print(f"Mode: {mode} ({description})")
    print(f"{'='*60}")
    print(f"Response: {str(response)[:500]}...\n")

## 8. Customizing the Prompt

You can customize the prompt template used for generation:

In [None]:
from llama_index.core import PromptTemplate

# Custom QA prompt
custom_qa_prompt = PromptTemplate(
    """You are an expert technical assistant. 
Use the following context to answer the question.
If you cannot find the answer in the context, say "I don't have enough information to answer this."

Context:
---------
{context_str}
---------

Question: {query_str}

Provide a clear, concise answer with examples where relevant:"""
)

# Create query engine with custom prompt
query_engine = index.as_query_engine(
    similarity_top_k=3,
    text_qa_template=custom_qa_prompt,
)

# Test
response = query_engine.query("What are the main Python data types?")
print("Custom prompt response:")
print(response)
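
It helps to see what the LLM actually receives once the two placeholders are filled. The sketch below fills a shortened copy of the template with plain `str.format`. How LlamaIndex joins the retrieved chunks into `context_str` is not shown in this notebook, so the blank-line separator here is an assumption, and `fill_prompt` is a hypothetical helper.

```python
from typing import List

# Shortened version of the custom QA prompt, with the same two placeholders
QA_TEMPLATE = (
    "Context:\n"
    "---------\n"
    "{context_str}\n"
    "---------\n\n"
    "Question: {query_str}\n\n"
    "Answer:"
)

def fill_prompt(chunks: List[str], question: str, template: str = QA_TEMPLATE) -> str:
    """Join retrieved chunks into one context string and substitute both placeholders."""
    context_str = "\n\n".join(chunks)  # assumed separator between chunks
    return template.format(context_str=context_str, query_str=question)

retrieved = ["Lists are mutable sequences.", "Tuples are immutable sequences."]
print(fill_prompt(retrieved, "What are the main Python data types?"))
```

Printing the filled prompt like this is a quick way to spot context that is too long, duplicated, or off-topic.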

## 9. Complete RAG Pipeline Class

Let's wrap everything into a reusable class:

In [None]:
from typing import List

class SimpleRAGPipeline:
    """A reusable RAG pipeline with configurable components."""
    
    def __init__(
        self,
        chunk_size: int = 512,
        chunk_overlap: int = 50,
        similarity_top_k: int = 3,
        response_mode: str = "compact",
    ):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.similarity_top_k = similarity_top_k
        self.response_mode = response_mode
        
        self.index = None
        self.query_engine = None
        
    def load_documents(self, input_dir: str) -> List[Document]:
        """Load documents from a directory."""
        reader = SimpleDirectoryReader(input_dir=input_dir)
        documents = reader.load_data()
        print(f"Loaded {len(documents)} documents")
        return documents
    
    def build_index(self, documents: List[Document]) -> VectorStoreIndex:
        """Build index from documents with custom chunking."""
        # Create node parser
        splitter = SentenceSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
        )
        
        # Parse into nodes
        nodes = splitter.get_nodes_from_documents(documents)
        print(f"Created {len(nodes)} nodes")
        
        # Build index
        self.index = VectorStoreIndex(nodes=nodes, show_progress=True)
        
        # Create query engine
        self.query_engine = self.index.as_query_engine(
            similarity_top_k=self.similarity_top_k,
            response_mode=self.response_mode,
        )
        
        print("✓ Index and query engine ready!")
        return self.index
    
    def query(self, question: str, verbose: bool = False) -> str:
        """Query the RAG system."""
        if not self.query_engine:
            raise ValueError("Build index first using build_index()")
        
        response = self.query_engine.query(question)
        
        if verbose:
            print(f"\nRetrieved {len(response.source_nodes)} source nodes:")
            for i, node in enumerate(response.source_nodes):
                print(f"  {i+1}. Score: {node.score:.4f} - {node.text[:100]}...")
        
        return str(response)
    
    def save(self, path: str):
        """Save index to disk."""
        if not self.index:
            raise ValueError("Build index first using build_index()")
        self.index.storage_context.persist(persist_dir=path)
        print(f"✓ Index saved to {path}")
    
    def load(self, path: str):
        """Load index from disk."""
        from llama_index.core import StorageContext, load_index_from_storage
        
        storage_context = StorageContext.from_defaults(persist_dir=path)
        self.index = load_index_from_storage(storage_context)
        self.query_engine = self.index.as_query_engine(
            similarity_top_k=self.similarity_top_k,
            response_mode=self.response_mode,
        )
        print(f"✓ Index loaded from {path}")

In [None]:
# Use the pipeline
rag = SimpleRAGPipeline(
    chunk_size=512,
    chunk_overlap=50,
    similarity_top_k=3,
    response_mode="compact",
)

# Build
docs = rag.load_documents("../data/sample_docs")
rag.build_index(docs)

# Query
answer = rag.query("What is deep learning and how does it work?", verbose=True)
print(f"\n\nAnswer:\n{answer}")

In [None]:
# Save for later use
rag.save("./storage/rag_pipeline_index")

## 10. Summary

You've learned how to build a complete RAG pipeline with control over each component.

### Key Takeaways

| Component | What You Learned |
|-----------|------------------|
| **Chunking** | Different strategies, size/overlap tuning |
| **Metadata** | Extractors for enriching chunks |
| **Retrieval** | Top-K selection, similarity scoring |
| **Synthesis** | Response modes for different use cases |
| **Prompts** | Custom templates for better answers |

### Best Practices

1. **Start simple**: Use defaults, then optimize based on results
2. **Test retrieval first**: Check what chunks are retrieved before blaming the LLM
3. **Monitor chunk quality**: Examine actual chunks during development
4. **Iterate on prompts**: Custom prompts often improve results significantly
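
One simple way to act on "test retrieval first" is to measure how often the expected source shows up among the top-k retrieved chunks. The sketch below computes that hit rate from plain dictionaries. The queries, file names, and the `hit_rate` helper are all hypothetical; in practice the retrieved lists would come from the retriever's results.

```python
from typing import Dict, List

def hit_rate(retrieved: Dict[str, List[str]], expected: Dict[str, str], k: int) -> float:
    """Fraction of queries whose expected source appears in the top-k retrieved sources."""
    if not expected:
        return 0.0
    hits = sum(
        1 for query, source in expected.items()
        if source in retrieved.get(query, [])[:k]
    )
    return hits / len(expected)

# Hypothetical retrieval results: query -> source files, best match first
retrieved_sources = {
    "q1": ["ml.txt", "ai.txt"],
    "q2": ["python.txt", "ml.txt"],
    "q3": ["ai.txt", "python.txt"],
}
# Hypothetical ground truth: query -> file that contains the answer
expected_sources = {"q1": "ml.txt", "q2": "ml.txt", "q3": "ethics.txt"}

for k in (1, 2):
    print(f"hit rate @ {k}: {hit_rate(retrieved_sources, expected_sources, k):.2f}")
```

If the hit rate is low, changing the prompt or the LLM is unlikely to help; the chunking or retrieval settings need attention first.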

### Next Steps

In the next notebook (`03_querying_basics.ipynb`), we'll explore:
- Streaming responses
- Async queries
- Query transformations
- Evaluation metrics

---

## Exercises

1. **Chunk size experiment**: Try chunk sizes of 256, 512, 1024. Compare retrieval quality.

2. **Overlap impact**: Test 0%, 10%, 20% overlap. When does it matter?

3. **Response modes**: Run the same query with all response modes. Compare quality vs speed.

4. **Custom prompts**: Create a prompt that makes the assistant respond in a specific style.
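
For the first two exercises, it can help to generate the configurations up front rather than editing values by hand. This is one possible starting point; the percentages are converted to token counts by rounding down, which is an arbitrary choice.

```python
from itertools import product

chunk_sizes = [256, 512, 1024]
overlap_fractions = [0.0, 0.1, 0.2]  # 0%, 10%, 20% of the chunk size

# One configuration per (chunk size, overlap fraction) pair
configs = [
    {"chunk_size": size, "chunk_overlap": int(size * fraction)}
    for size, fraction in product(chunk_sizes, overlap_fractions)
]

for config in configs:
    print(config)
```

Each dictionary uses the same parameter names as `SimpleRAGPipeline`, so it can be passed as keyword arguments when building a pipeline for each run.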

In [None]:
# Exercise space
# Try different configurations here!