# Pattern 6: RAG with LlamaIndex (Composable App Tutorial)

## Learning Objectives
By completing this tutorial, you will:
- Understand semantic RAG architecture and when to use it
- Learn LlamaIndex vector store setup with embeddings
- Implement semantic retrieval with `similarity_top_k` configuration
- Add citation tracking to generated content

## Prerequisites
- **Python**: Intermediate proficiency with async/await
- **LLM basics**: Understanding of embeddings and vector similarity
- **Setup**: Either Gemini API key OR OpenAI API key configured in `.env`
- **Vector index**: Pre-built index in `composable_app/data/`

## Estimated Time
25-30 minutes (reading + execution)

## Cost Estimate
⚠️ **API costs**: 
- OpenAI: ~$0.001-0.01 (text-embedding-3-small)
- Gemini: ~$0.02-0.05 (text-embedding-004)

## Dual-Mode Support
This notebook supports both providers. The setup cell checks which API keys are present and prefers Gemini when `GEMINI_API_KEY` is set; it does not inspect the index itself. The provided index was created with **Gemini embeddings**, so queries must be embedded with the same model.

> **Book Reference**: This pattern is detailed in *Generative AI Design Patterns*
> (Lakshmanan & Hapke, 2025), Chapter 6: "Retrieval-Augmented Generation", pages 142-165.

---

## What is RAG?

**Retrieval-Augmented Generation (RAG)** combines semantic search with LLM generation:

1. **Retrieve**: Find relevant documents using vector similarity
2. **Augment**: Inject retrieved content into LLM prompt
3. **Generate**: LLM produces answer grounded in retrieved context
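
The three stages above can be sketched in a few lines of plain Python. This is a toy illustration, not the tutorial's implementation: the retriever ranks chunks by shared words instead of embeddings, and `retrieve`, `augment`, `rag_answer`, and the echoing stand-in LLM are all hypothetical names introduced here.

```python
from typing import Callable


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank chunks by the number of words shared with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def augment(prompt: str, chunks: list[str]) -> str:
    """Append the retrieved chunks to the prompt as grounding context."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"{prompt}\n\nINFORMATION YOU CAN USE\n{context}"


def rag_answer(query: str, corpus: list[str], llm: Callable[[str], str]) -> str:
    """Run retrieve -> augment -> generate with any prompt-to-text callable."""
    chunks = retrieve(query, corpus)
    return llm(augment(f"Answer the question: {query}", chunks))


corpus = [
    "Prompt caching stores repeated prompt prefixes to cut latency.",
    "Vector stores hold embeddings for similarity search.",
    "Bananas are rich in potassium.",
]

# Stand-in "LLM" that simply echoes the prompt it receives
print(rag_answer("what is prompt caching", corpus, llm=lambda p: p))
```

Swapping the word-overlap ranking for vector similarity, and the echo function for a real model call, gives the shape of the pipeline used in the rest of this tutorial.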

### When to Use RAG
- ✅ Domain-specific knowledge not in LLM training data
- ✅ Factual accuracy is critical (hallucination prevention)
- ✅ Citations/sources must be provided
- ✅ Knowledge base changes frequently
- ❌ General knowledge questions (LLM alone is sufficient)
- ❌ Real-time data (requires different retrieval strategy)

### Composable App Use Case
The **GenAIWriter** uses RAG to answer questions about the book "Generative AI Design Patterns":
- **Vector index** built from book PDF
- **User query**: "What is prompt caching?"
- **Retrieved**: 3 most similar text chunks from book
- **Generated**: 2-paragraph answer with page citations

**Code Location**: [`agents/generic_writer_agent.py:142-165`](../../agents/generic_writer_agent.py#L142-L165)

---

## Setup Cell

**Task 2.1.1**: Setup cell with imports, API key check, cost warning

In [None]:
# Add project root to path for imports
import sys
import os
from pathlib import Path

# Detect if running from notebook or from nbconvert
try:
    current_dir = Path(__file__).parent.resolve()
except NameError:
    current_dir = Path.cwd()

# Find repo root (contains .env)
repo_root = current_dir
while not (repo_root / '.env').exists() and repo_root != repo_root.parent:
    repo_root = repo_root.parent

# Add composable_app to path
composable_app_path = repo_root / 'composable_app'
sys.path.insert(0, str(composable_app_path))

# Load environment variables
from dotenv import load_dotenv
env_path = repo_root / '.env'
load_dotenv(env_path)

# Check which API keys are available
has_gemini = bool(os.getenv('GEMINI_API_KEY'))
has_openai = bool(os.getenv('OPENAI_API_KEY'))

# RAG dependencies
from llama_index.core import StorageContext, Settings, load_index_from_storage
from dataclasses import replace

print(f"✅ Environment loaded from: {env_path}")
print(f"   Gemini API key: {'✅ Found' if has_gemini else '❌ Not found'}")
print(f"   OpenAI API key: {'✅ Found' if has_openai else '❌ Not found'}")

# Determine which provider to use
# The provided vector index uses Gemini embeddings (768 dimensions)
# but we'll use OpenAI if Gemini is not available (note: requires recreating index)
if has_gemini:
    USE_GEMINI = True
    print("\n🔹 Using: Google Gemini embeddings (text-embedding-004, 768 dimensions)")
    print("   Matches the provided vector index")
elif has_openai:
    USE_GEMINI = False
    print("\n🔹 Using: OpenAI embeddings (text-embedding-3-small, 1536 dimensions)")
    print("   ⚠️ WARNING: The provided index uses Gemini (768-dim).")
    print("   You may see dimension mismatch errors during retrieval.")
    print("   To fix: Recreate index with python -m composable_app.data.create_index")
else:
    raise EnvironmentError(
        "❌ Neither GEMINI_API_KEY nor OPENAI_API_KEY found.\n"
        "   Add one to .env file:\n"
        "   - Gemini: https://makersuite.google.com/app/apikey\n"
        "   - OpenAI: https://platform.openai.com/api-keys"
    )

print("✅ Setup complete")
print("⚠️ This notebook will make API calls for embeddings")

---

## Vector Store Architecture

**Task 2.1.2**: Conceptual section - What is RAG, when to use, composable app use case

### How RAG Works in Composable App

```mermaid
graph LR
    A[Book PDF] -->|OpenParse| B[Text Chunks]
    B -->|Embed API| C[Vector Embeddings]
    C --> D[Vector Store<br/>JSON Files]
    
    E[User Query] -->|Embed API| F[Query Embedding]
    F -->|Cosine Similarity| D
    D -->|Top-3 Most Similar| G[Retrieved Chunks]
    G -->|Augment Prompt| H[LLM Generation]
    H --> I[Article + Citations]
    
    style D fill:#e1f5ff
    style G fill:#fff3cd
    style I fill:#d4edda
```

### Key Components
- **Embeddings**: Supports both Google Gemini and OpenAI
  - Gemini: `text-embedding-004` (768 dimensions) - Default for this tutorial
  - OpenAI: `text-embedding-3-small` (1536 dimensions)
- **Storage**: JSON files in `composable_app/data/`
  - `default__vector_store.json` - Vector embeddings
  - `docstore.json` - Document metadata
  - `index_store.json` - Index structure
- **Index**: Similarity search using cosine similarity
- **Retrieval**: Top-k most similar chunks (default k=3)

### Embedding Model Comparison
| Provider | Model | Dimensions | Cost (per 1M tokens) |
|----------|-------|------------|---------------------|
| **Gemini** | text-embedding-004 | 768 | Free (quota applies) |
| OpenAI | text-embedding-3-small | 1536 | $0.02 |
| OpenAI | text-embedding-3-large | 3072 | $0.13 |

> **Note**: This tutorial uses the provided Gemini-based vector index. To use OpenAI embeddings, set `USE_GEMINI = False` in Cell 3 and recreate the index.
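
Because the dimensions in the table differ by model, one way to check which family of model built an index is to read the length of a stored vector. The sketch below assumes the persisted vector store JSON keeps its vectors under an `embedding_dict` key mapping node IDs to float lists; that layout, and the helper names `stored_embedding_dim` and `guess_provider`, are assumptions rather than something this tutorial confirms. It runs against a temporary sample file so it does not need the real index.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical lookup built from the dimensions listed in the table above
KNOWN_DIMS = {
    768: "Gemini text-embedding-004",
    1536: "OpenAI text-embedding-3-small",
    3072: "OpenAI text-embedding-3-large",
}


def stored_embedding_dim(vector_store_path: Path) -> int:
    """Return the length of the first embedding stored in the JSON file."""
    data = json.loads(vector_store_path.read_text())
    # Assumption: embeddings live under "embedding_dict" as {node_id: [floats]}
    embeddings = data["embedding_dict"]
    if not embeddings:
        raise ValueError("vector store contains no embeddings")
    return len(next(iter(embeddings.values())))


def guess_provider(dim: int) -> str:
    """Map a dimension to a likely model; other models may share a dimension."""
    return KNOWN_DIMS.get(dim, f"unknown ({dim} dimensions)")


# Demonstrate on a tiny sample file instead of the real index
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "default__vector_store.json"
    sample.write_text(json.dumps({"embedding_dict": {"node-1": [0.0] * 768}}))
    dim = stored_embedding_dim(sample)
    print(dim, guess_provider(dim))
```

A matching dimension is necessary but not sufficient: two different models with the same output size still produce incompatible vectors, so treat the result as a sanity check only.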

---

## Load Vector Index

**Task 2.1.3**: Code section - Load vector index, configure embeddings

In [None]:
# Configure embedding model based on USE_GEMINI flag
if USE_GEMINI:
    from llama_index.embeddings.google_genai import GoogleGenAIEmbedding
    Settings.embed_model = GoogleGenAIEmbedding(
        model_name="text-embedding-004",
        api_key=os.environ["GEMINI_API_KEY"]
    )
    print("✅ Embedding model: Google Gemini text-embedding-004 (768 dimensions)")
else:
    from llama_index.embeddings.openai import OpenAIEmbedding
    Settings.embed_model = OpenAIEmbedding(
        model="text-embedding-3-small",
        api_key=os.environ["OPENAI_API_KEY"]
    )
    print("✅ Embedding model: OpenAI text-embedding-3-small (1536 dimensions)")

# Load pre-built vector index from data directory
try:
    storage_context = StorageContext.from_defaults(persist_dir="../../data")
    index = load_index_from_storage(storage_context)
    print(f"✅ Loaded vector index from ../../data/")
    print(f"   Index contains embeddings for book chunks")
except FileNotFoundError:
    print("❌ Vector index not found!")
    print("   Run: python -m composable_app.data.create_index")
    print("   Or see tutorials/README.md for setup instructions")
    raise

---

## Semantic Retrieval Demo

**Task 2.1.4**: Code section - Semantic retrieval demo with similarity_top_k=3

### How Retrieval Works

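1. **Query Embedding**: Convert the user query to a vector with the configured embedding model (768 dimensions for Gemini `text-embedding-004`, 1536 for OpenAI `text-embedding-3-small`)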
2. **Similarity Search**: Compute cosine similarity with all stored embeddings
3. **Top-k Selection**: Return k most similar chunks (default k=3)
4. **Metadata Extraction**: Include page numbers and bounding boxes
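
Steps 2 and 3 can be illustrated without any API calls. The following sketch uses hand-written three-dimensional vectors in place of real embeddings; `cosine_similarity`, `top_k`, and the `stored` dictionary are hypothetical names for illustration, and LlamaIndex performs the equivalent work internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], stored: dict[str, list[float]], k: int = 3):
    """Score every stored vector against the query and keep the k best."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in stored.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


# Made-up vectors standing in for chunk embeddings
stored = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.8, 0.6, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
    "chunk-d": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.1, 0.0]

results = top_k(query_vec, stored, k=3)
for score, chunk_id in results:
    print(f"{chunk_id}: {score:.4f}")
```

The query points almost along the first axis, so `chunk-a` ranks first and the orthogonal `chunk-d` is left out of the top three.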

In [None]:
# Create retriever with top-k=3 (configurable)
retriever = index.as_retriever(similarity_top_k=3)

# Example query
query = "What is prompt caching?"
print(f"📝 Query: '{query}'")
print(f"🔍 Retrieving top-3 similar chunks...\n")

# Perform semantic retrieval
nodes = retriever.retrieve(query)

print(f"✅ Retrieved {len(nodes)} nodes:\n")
print("=" * 80)

for i, node in enumerate(nodes, 1):
    print(f"\n🔹 Node {i}:")
    print(f"   Similarity Score: {node.score:.4f} (higher = more relevant)")
    print(f"   Page Number: {node.metadata['bbox'][0]['page']}")
    print(f"   Text Preview: {node.text[:150]}...")
    print("-" * 80)

---

## GenAIWriter Implementation

**Task 2.1.5**: Code section - GenAIWriter implementation walkthrough with line references

### How GenAIWriter Uses RAG

The `GenAIWriter` class extends `ZeroshotWriter` and overrides `write_response()` to add RAG:

```python
# From agents/generic_writer_agent.py:151-162
async def write_response(self, topic: str, prompt: str) -> Article:
    # 1. Semantic RAG retrieval
    nodes = self.retriever.retrieve(topic)
    
    # 2. Augment prompt with retrieved context
    prompt += f"\n**INFORMATION YOU CAN USE**\n{nodes}"
    
    # 3. Generate with LLM (uses BEST_MODEL = "gemini-2.0-flash")
    result = await self.agent.run(prompt)
    article = result.output
    
    # 4. Add page citations
    pages = [str(node.metadata['bbox'][0]['page']) for node in nodes]
    article = replace(article, full_text=article.full_text + f"\nSee pages: {', '.join(pages)}")
    
    return article
```

**Key Design Decisions**:
- **Why top-k=3?** Balances context richness with token limits
- **Why append to prompt?** Simple, works with any LLM provider
- **Why `replace()` not mutation?** Article is a frozen dataclass (immutable)
- **Why page citations?** Enables fact-checking and trustworthiness
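
The `replace()` decision can be tried in isolation. The sketch below uses a stand-in `Article` with only two fields (the real class may have more) and a hypothetical `add_citations` helper; it shows that direct assignment fails on a frozen dataclass while `replace()` returns a new instance.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Article:  # stand-in with only the fields needed for this demo
    title: str
    full_text: str


def add_citations(article: Article, pages: list[int]) -> Article:
    """Return a copy of the article with a citation line appended."""
    citation = "See pages: " + ", ".join(str(p) for p in pages)
    return replace(article, full_text=f"{article.full_text}\n{citation}")


draft = Article(title="Prompt caching", full_text="Body text.")
cited = add_citations(draft, [12, 13])

try:
    draft.full_text += "\nSee pages: 12"  # attempted in-place mutation
except FrozenInstanceError:
    print("direct assignment is rejected on a frozen dataclass")

print(cited.full_text)
```

The original `draft` is left untouched, which is what makes it safe to share between agents.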

### Simulating GenAIWriter RAG Flow

Let's simulate the full RAG flow without calling the LLM (to save costs):

In [None]:
# Simulate the GenAIWriter RAG flow
def simulate_rag_augmentation(topic: str, base_prompt: str, nodes: list) -> str:
    """Show how GenAIWriter augments prompts with retrieved context."""
    
    # Step 1: Base prompt (from AbstractWriter)
    print("📋 BASE PROMPT:")
    print("-" * 80)
    print(base_prompt)
    print("-" * 80)
    
    # Step 2: Augment with retrieved context
    augmented_prompt = base_prompt + "\n\n**INFORMATION YOU CAN USE**\n"
    for i, node in enumerate(nodes, 1):
        augmented_prompt += f"\nChunk {i} (Page {node.metadata['bbox'][0]['page']}): {node.text}\n"
    
    print("\n✨ AUGMENTED PROMPT (what LLM sees):")
    print("=" * 80)
    print(augmented_prompt[:500] + "...")
    print("=" * 80)
    
    return augmented_prompt

# Demo
base_prompt = f"Write 2 paragraphs about: {query}"
augmented_prompt = simulate_rag_augmentation(query, base_prompt, nodes)

print(f"\n📊 Prompt Stats:")
print(f"   Base prompt: {len(base_prompt)} chars")
print(f"   Augmented prompt: {len(augmented_prompt)} chars")
print(f"   Context added: {len(augmented_prompt) - len(base_prompt)} chars")

---

## Citation Tracking

**Task 2.1.6**: Code section - Citation tracking (extract page numbers from metadata)

### Why Citations Matter
- **Trustworthiness**: Users can verify claims in source material
- **Hallucination detection**: Easier to spot when LLM invents facts
- **Educational value**: Students learn to check sources
- **Legal compliance**: Some domains require citation (academic, medical)

In [None]:
def extract_citations(nodes: list) -> dict:
    """Extract page numbers and metadata from retrieved nodes.
    
    This replicates the citation logic from GenAIWriter (line 160).
    """
    # Extract page numbers from node metadata
    pages = [str(node.metadata['bbox'][0]['page']) for node in nodes]
    
    # Create citation string
    citation_text = f"See pages: {', '.join(pages)}"
    
    return {
        "pages": pages,
        "citation_text": citation_text,
        "num_sources": len(set(pages))  # unique pages
    }

# Demo citation extraction
citations = extract_citations(nodes)

print("📚 Citation Metadata:")
print(f"   Pages referenced: {citations['pages']}")
print(f"   Unique sources: {citations['num_sources']}")
print(f"   Citation string: '{citations['citation_text']}'")

print("\n💡 How this appears in generated article:")
print("   [Article text here...]")
print(f"   {citations['citation_text']}")
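
When several retrieved chunks come from the same page, the citation string above repeats that page. A possible refinement, sketched here with stand-in nodes, is to deduplicate and sort the pages first. `unique_sorted_pages` and `format_citation` are hypothetical helpers, and the sketch assumes page values are integers or integer-like.

```python
from types import SimpleNamespace


def unique_sorted_pages(nodes) -> list[int]:
    """Collect each cited page once, in ascending order."""
    pages = {int(node.metadata["bbox"][0]["page"]) for node in nodes}
    return sorted(pages)


def format_citation(pages: list[int]) -> str:
    """Build the same 'See pages: ...' string used in the tutorial."""
    return "See pages: " + ", ".join(str(p) for p in pages)


# Stand-in nodes that mimic the metadata layout used in this notebook
fake_nodes = [
    SimpleNamespace(metadata={"bbox": [{"page": 143}]}),
    SimpleNamespace(metadata={"bbox": [{"page": 12}]}),
    SimpleNamespace(metadata={"bbox": [{"page": 143}]}),
]

citation = format_citation(unique_sorted_pages(fake_nodes))
print(citation)
```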

### Advanced: Citation with Bounding Boxes

The metadata includes bounding box coordinates for precise source location:

In [None]:
# Inspect detailed metadata
print("🔍 Detailed Metadata Example (Node 1):")
print("-" * 80)

node = nodes[0]
print(f"Text: {node.text[:100]}...")
print(f"\nMetadata:")
print(f"  - Page: {node.metadata['bbox'][0]['page']}")
print(f"  - Bounding Box: {node.metadata['bbox'][0]}")
print(f"  - Score: {node.score:.4f}")

print("\n💡 Use Case: Could highlight exact text location in PDF viewer")

---

## Common Pitfalls

**Task 2.1.7**: Common Pitfalls section

### ❌ Error: "No module named 'llama_index.embeddings.openai'"
**Cause**: Missing llama-index-embeddings-openai package

**Solution**:
```bash
pip install llama-index-embeddings-openai
```

---

### ❌ Error: "OPENAI_API_KEY not found"
**Cause**: API key not configured

**Solution**:
```bash
# Add to .env in repo root OR composable_app/keys.env
echo "OPENAI_API_KEY=sk-...your_key_here" >> .env

# Get key from: https://platform.openai.com/api-keys
```

---

### ❌ Error: "FileNotFoundError: [Errno 2] No such file or directory: '../../data/default__vector_store.json'"
**Cause**: Vector index not created yet

**Solution**:
```bash
# Option 1: Create index from book PDF (requires PDF and API key)
python -m composable_app.data.create_index

# Option 2: Use sample index (if provided)
cp composable_app/data/sample_vector_store.json composable_app/data/default__vector_store.json
```

---

### ⚠️ Warning: Retrieval returns irrelevant nodes
**Cause**: Query too vague or embedding mismatch

**Solutions**:
1. **Rephrase query**: Be more specific ("What is prompt caching?" vs. "caching")
2. **Adjust top-k**: Try `similarity_top_k=5` or `10` for broader context
3. **Check embedding model**: Must match model used to create index
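4. **Inspect scores**: Scores well below the rest of your results suggest a poor match; 0.3 is a rough starting threshold, not a universal one, because score ranges vary by embedding model
5. **Try a larger model**: Upgrade to `text-embedding-3-large` for better quality (requires rebuilding the index with that model)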
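
Score inspection can be turned into a simple filter. The sketch below is plain Python over stand-in nodes with made-up scores; `filter_by_score` is a hypothetical helper. LlamaIndex also ships a `SimilarityPostprocessor` with a `similarity_cutoff` option that serves the same purpose, which may be preferable in real code.

```python
from types import SimpleNamespace


def filter_by_score(nodes, cutoff: float) -> list:
    """Drop retrieved nodes whose score is missing or below the cutoff."""
    return [n for n in nodes if n.score is not None and n.score >= cutoff]


# Stand-in nodes with made-up scores, only for illustration
retrieved = [
    SimpleNamespace(text="prompt caching overview", score=0.82),
    SimpleNamespace(text="loosely related passage", score=0.41),
    SimpleNamespace(text="unrelated passage", score=0.12),
]

strong = filter_by_score(retrieved, cutoff=0.3)
print(f"kept {len(strong)} of {len(retrieved)} nodes")
```

If the filter removes every node, that is a useful signal in itself: the query probably has no good match in the index.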

---

### ⚠️ Warning: High API costs
**Cause**: Embedding API calls for every query

**Cost Optimization**:
```python
# With OpenAI, use text-embedding-3-small ($0.02/1M tokens vs. $0.13 for -large)

# Reduce top-k to shrink the context sent to the LLM (embedding cost is unchanged)
retriever = index.as_retriever(similarity_top_k=1)  # Instead of 3

# Cache common queries (future enhancement)
# from functools import lru_cache
# @lru_cache(maxsize=100)
# def cached_retrieve(query: str):
#     return retriever.retrieve(query)
```

**Cost Breakdown**:
- Query embedding: ~10 tokens × $0.02 per 1M tokens = $0.0000002 per query
- 100 queries/day ≈ $0.00002/day ≈ $0.007/year (negligible)
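
The arithmetic behind this kind of estimate is easy to reproduce. The sketch below assumes roughly 10 tokens per query and the $0.02 per 1M tokens price listed for `text-embedding-3-small`; `embedding_cost_usd` is a hypothetical helper, and the inputs should be replaced with your own measurements.

```python
def embedding_cost_usd(
    tokens_per_query: int, queries: int, price_per_million_tokens: float
) -> float:
    """Cost of embedding `queries` queries at a given per-million-token price."""
    return tokens_per_query * queries * price_per_million_tokens / 1_000_000


per_query = embedding_cost_usd(10, 1, 0.02)
per_day = embedding_cost_usd(10, 100, 0.02)  # 100 queries per day
per_year = per_day * 365

print(f"per query: ${per_query:.7f}")
print(f"per day:   ${per_day:.5f}")
print(f"per year:  ${per_year:.4f}")
```

Query embedding is rarely the dominant cost; the larger prompts sent to the LLM after augmentation usually matter more.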

---

### 💡 Tip: Adjust similarity_top_k based on context window
**Guidelines**:
- **Top-k = 1**: When only most relevant chunk needed (narrow query)
- **Top-k = 3**: Good default for most queries (balances context/cost)
- **Top-k = 5-10**: Broad research questions, large context windows
- **Top-k = 20+**: Rarely needed, risks exceeding token limits

**Experiment**:
```python
# Try different top-k values
for k in [1, 3, 5, 10]:
    retriever = index.as_retriever(similarity_top_k=k)
    nodes = retriever.retrieve(query)
    print(f"top-k={k}: Retrieved {len(nodes)} nodes, min score={nodes[-1].score:.4f}")
```
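
Instead of fixing k in advance, another option is to keep ranked chunks until a size budget is used up. This sketch uses character counts as a crude proxy for tokens, which is an assumption; `select_within_budget` is a hypothetical helper and the chunk contents are placeholders.

```python
def select_within_budget(chunks: list[str], max_chars: int) -> list[str]:
    """Keep chunks in ranked order until the next one would exceed max_chars."""
    selected: list[str] = []
    used = 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += len(chunk)
    return selected


# Placeholder chunks, already sorted from most to least similar
ranked_chunks = ["a" * 400, "b" * 300, "c" * 500, "d" * 100]

kept = select_within_budget(ranked_chunks, max_chars=1000)
print(f"kept {len(kept)} of {len(ranked_chunks)} chunks")
```

Stopping at the first chunk that does not fit preserves the ranking order; a variant could skip oversized chunks and continue, at the cost of including lower-ranked material.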

---

## Self-Assessment

**Task 2.1.8**: Self-assessment questions with answers

### Question 1: Concept Check
**What's the difference between semantic search and keyword search?**

<details>
<summary>Click to reveal answer</summary>

**Answer**: 
- **Semantic search**: Uses vector embeddings and cosine similarity to find **meaning-based** matches. Can find "automobile" when searching for "car".
- **Keyword search**: Matches the literal terms in the query (for example with BM25 ranking or a regex). It only finds chunks that share words with the query.

**Example**: Query "quick transportation" would:
- Semantic search: Find "fast car", "rapid vehicle", "speedy automobile" (similar meaning)
- Keyword search: Find only text containing "quick" or "transportation", and miss "fast car" (no shared terms)
</details>

---

### Question 2: Implementation
**Why does GenAIWriter use `replace(article, ...)` instead of `article.full_text += ...` to add citations?**

<details>
<summary>Click to reveal answer</summary>

**Answer**: Article is a **frozen dataclass** (immutable), so direct attribute assignment is not allowed.

```python
# From agents/article.py
@dataclass(frozen=True)  # Makes it immutable
class Article:
    full_text: str
    title: str
    # ...
```

**Why frozen?**
1. **Thread safety**: Multiple agents can reference same Article without race conditions
2. **Cacheability**: Immutable objects can be safely cached
3. **Debugging**: Easier to track changes (new instance = new version)

**Solution**: Use `dataclasses.replace()` to create new instance with updated field:
```python
from dataclasses import replace
article = replace(article, full_text=article.full_text + "\nCitations...")
```
</details>

---

### Question 3: Design Trade-offs
**When would you NOT use RAG? Give 2 examples.**

<details>
<summary>Click to reveal answer</summary>

**Answer**: RAG is not suitable when:

1. **General knowledge already in LLM**: "What is photosynthesis?" - LLM has this from training, RAG adds latency/cost with no benefit.

2. **Real-time data not in vector index**: "What's the current stock price of AAPL?" - Vector index is static, RAG can't retrieve live data. Need API integration instead.

3. **Cost constraints**: Embedding API calls + increased prompt size. For high-volume applications, consider caching or fine-tuning LLM instead.

4. **Very broad queries**: "Tell me about everything" - RAG works best with specific queries. Broad queries retrieve less relevant chunks.

**When RAG shines**: Domain-specific docs, compliance requirements, changing knowledge bases, citation needs.
</details>

---

### Question 4: Advanced
**How would you improve retrieval quality if users report irrelevant results?**

<details>
<summary>Click to reveal answer</summary>

**Strategies**:

1. **Query rewriting**: Transform user query before retrieval
   ```python
   # "What's the deal with caching?" → "What is prompt caching?"
   # (llm.rewrite_query is a hypothetical helper, not a library call)
   rewritten_query = await llm.rewrite_query(user_query)
   nodes = retriever.retrieve(rewritten_query)
   ```

2. **Hybrid search**: Combine semantic + keyword
   ```python
   semantic_nodes = semantic_retriever.retrieve(query)  # Top-5
   keyword_nodes = bm25_retriever.retrieve(query)       # Top-5
   nodes = rerank(semantic_nodes + keyword_nodes)[:3]   # Best 3
   ```

3. **Reranking**: Use second model to score and reorder results
   ```python
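   # top-k is set when the retriever is created, not per retrieve() call
   candidates = index.as_retriever(similarity_top_k=10).retrieve(query)
   reranked = reranker.postprocess_nodes(candidates, query_str=query)[:3]  # reranker: any LlamaIndex node postprocessor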
   ```

4. **Better chunking**: Experiment with chunk size (current: default)
   - Smaller chunks: More precise but may lose context
   - Larger chunks: More context but less precise

5. **Metadata filtering**: Add filters (e.g., date, chapter, author)
   ```python
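   # chapter_filters: a MetadataFilters object (assumes chunks carry a "chapter" metadata key)
   nodes = index.as_retriever(filters=chapter_filters).retrieve(query)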
   ```
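
For the hybrid search strategy, one common way to merge a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is plain Python over chunk IDs; it is one possible fusion method, not what the composable app does, and `reciprocal_rank_fusion` with its smoothing constant `k=60` is an illustrative choice.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each appearance adds 1 / (k + rank) to an ID's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Placeholder rankings from two hypothetical retrievers
semantic_ranking = ["c1", "c2", "c3"]
keyword_ranking = ["c3", "c1", "c4"]

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused[:3])
```

Chunks that appear in both lists rise to the top, so RRF rewards agreement between the two retrievers without needing their raw scores to be comparable.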
</details>

---

## Book References

**Task 2.1.9**: Book reference - Chapter 6, pages 142-165

### Further Reading

📖 **Generative AI Design Patterns** (Lakshmanan & Hapke, O'Reilly 2025)
- **Chapter 6**: "Retrieval-Augmented Generation" (pages 142-165)
  - RAG architecture patterns
  - Semantic vs. keyword retrieval trade-offs
  - Chunking strategies
  - Citation tracking
  - Production deployment considerations

**Related Chapters**:
- **Chapter 11**: "Trustworthy Generation" - Citation and hallucination detection
- **Chapter 15**: "Hybrid Search" - Combining semantic and keyword retrieval
- **Chapter 22**: "Reranking" - Improving retrieval accuracy

### External Resources
- [LlamaIndex Documentation](https://docs.llamaindex.ai) - Complete RAG framework guide
- [Pydantic AI Multi-Agent Guide](https://ai.pydantic.dev/multi-agent-applications/)
- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
- [OpenAI Embeddings API Reference](https://platform.openai.com/docs/api-reference/embeddings)

---

## Next Steps

### Continue Learning
1. **[Horizontal Services](../concepts/horizontal_services.md)** - Learn about memory and guardrails
2. **[Advanced Patterns](advanced_patterns.ipynb)** - Optimize retrieval performance
3. **[Multi-Agent Pattern](multi_agent_pattern.ipynb)** - Combine RAG with multi-agent review

### Hands-On Practice
1. **Modify top-k**: Try `similarity_top_k=1, 5, 10` and compare results
2. **Different queries**: Test with broad vs. specific queries
3. **Inspect scores**: Look for patterns in similarity scores
4. **Add your data**: Create index from your own documents

### Advanced Exercises
1. **Implement query rewriting**: Use LLM to improve user queries before retrieval
2. **Add reranking**: Use a second model to reorder retrieved chunks
3. **Hybrid search**: Combine semantic search with BM25 keyword search
4. **Citation UI**: Build interface to highlight cited text in source PDF

---

**Congratulations!** You've learned Pattern 6: RAG with LlamaIndex. You can now build knowledge-grounded LLM applications with citation tracking.

**Tutorial Version**: 1.0  
**Last Updated**: 2025-11-04  
**Estimated Time**: 25-30 minutes  
**API Cost**: ~$0.02-0.05