# 🔍 RAG from Scratch: Building Production Search Systems

## What You'll Learn
- **RAG Architecture**: Retrieval + Generation pipeline
- **Vector Search**: Transform text into embeddings and find semantic matches
- **System Design**: Scale from 7 docs to 100M docs
- **Trade-offs**: RAG vs Fine-Tuning, Exact vs Approximate search

## Why RAG?

**The Problem with Static Models:**
- GPT-4 doesn't know about your company's internal docs
- Fine-tuning is expensive, and the fine-tuned model goes "stale" as soon as the underlying documents change

**The RAG Solution:**
1. **Store** knowledge in a searchable database (Vector DB)
2. **Retrieve** relevant context when a query comes in
3. **Generate** answer using LLM + retrieved context

**Key Advantage**: Update the database anytime without retraining the model.

---

In [None]:
# Setup
import sys
import numpy as np
import mlx.core as mx
import matplotlib.pyplot as plt
from mlx_nlp_utils import print_device_info, load_rag_knowledge_base

print_device_info()

## 1. Data Ingestion & Chunking Strategy

**Architectural Decision:** How do we split our text?
- **Fixed-size Chunking:** Simple, fast. Risk: Cutting a sentence in half.
- **Semantic Chunking:** Split by paragraph/topic. Better retrieval, harder to implement.
- **Recursive Chunking:** Split by paragraph, then sentence, then word.

For this demo, we will use **Paragraph-level Chunking**.
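
As a rough illustration of the first strategy next to the one used in this demo, here is a minimal sketch of fixed-size and paragraph-level chunking. The function names, the sample text, and the use of characters instead of tokens as the size unit are assumptions made for this example, not part of `mlx_nlp_utils`.

```python
import re

def fixed_size_chunks(text, size=200, overlap=50):
    """Fixed-size chunking: slide a window of `size` characters with `overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def paragraph_chunks(text):
    """Paragraph-level chunking: split on blank lines, drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Made-up sample text for illustration
sample = (
    "MLX is an array framework.\n\n"
    "RAG retrieves context before generating.\n\n"
    "HNSW is a graph index."
)
print(paragraph_chunks(sample))
print(fixed_size_chunks(sample, size=40, overlap=10)[:2])
```

The fixed-size output cuts straight through a sentence, which is the risk named above; the paragraph splitter keeps each idea intact but produces chunks of uneven length.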

In [None]:
# Sample Knowledge Base (Simulating a Company Wiki)
# We load from our synthetic data generator
knowledge_base = load_rag_knowledge_base()

print(f"📚 Knowledge Base: {len(knowledge_base)} documents")
for i, doc in enumerate(knowledge_base[:3]):
    print(f"   Doc {i}: {doc[:80]}...")

## 2. Embeddings (The Vector Space)

We need to convert text into vectors. In production, you'd use a model like `bert-base` or `nomic-embed-text`.

**Trade-off:**
- **Small Models (384 dim):** Fast search, less semantic nuance.
- **Large Models (1024+ dim):** Better understanding, slower search, more RAM.

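For this tutorial, we simulate embeddings to keep dependencies low. The simulated vectors are random, so the documents retrieved below will not be semantically meaningful, but the retrieval math is identical.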

In [None]:
# Simulating a 4-dimensional embedding space for visualization purposes
# In reality, this would be 768 or 1024 dimensions
embedding_dim = 4

# Mock embedding function (In production: use mlx-embeddings or sentence-transformers)
def get_embedding(text):
    # Deterministic random vector based on string hash for reproducibility
    seed = sum([ord(c) for c in text])
    mx.random.seed(seed)
    vector = mx.random.normal((embedding_dim,))
    # Normalize to unit length (Crucial for Cosine Similarity!)
    return vector / mx.linalg.norm(vector)

# Create Vector Database
vector_db = []
for doc in knowledge_base:
    vector_db.append(get_embedding(doc))

vector_db = mx.stack(vector_db)
print(f"🗄️ Vector DB Shape: {vector_db.shape} (Docs, Dim)")

## 3. Retrieval: Cosine Similarity

We want to find the document vector $\mathbf{d}$ that is closest to our query vector $\mathbf{q}$.

The standard metric is **Cosine Similarity**:
$$ \text{similarity} = \cos(\theta) = \frac{\mathbf{q} \cdot \mathbf{d}}{\|\mathbf{q}\| \|\mathbf{d}\|} $$

Since we normalized our vectors (length = 1), this simplifies to just the **Dot Product**:
$$ \text{similarity} = \mathbf{q} \cdot \mathbf{d} $$
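
To check that simplification numerically, the sketch below compares cosine similarity on two raw vectors with the plain dot product on their normalized versions. The helper names and the example vectors are chosen for this illustration only, and plain Python lists stand in for MLX arrays.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

q = [3.0, 4.0]
d = [4.0, 3.0]
q_hat, d_hat = normalize(q), normalize(d)

print(cosine(q, d))       # cosine similarity on the raw vectors
print(dot(q_hat, d_hat))  # dot product on the unit vectors: same value
```

Normalizing once at indexing time means each query costs a single matrix-vector product, with no per-document division.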

In [None]:
def retrieve(query, k=2):
    # 1. Embed query
    query_vec = get_embedding(query)
    
    # 2. Compute scores (Dot Product)
    # (Docs, Dim) @ (Dim,) -> (Docs,)
    scores = vector_db @ query_vec
    
    # 3. Get top-k indices
    # mx.topk returns the top values rather than their indices,
    # so we argsort the negated scores (descending order)
    indices = mx.argsort(-scores)[:k]
    
    results = []
    for idx in indices.tolist():
        results.append((knowledge_base[idx], scores[idx].item()))
        
    return results

# Test Retrieval
query = "How does MLX handle memory?"
print(f"🔍 Query: '{query}'\n")

hits = retrieve(query)
for i, (doc, score) in enumerate(hits):
    print(f"Hit {i+1} (Score: {score:.3f}):\n   \"{doc}\"\n")

## 4. Generation (The "G" in RAG)

Now we combine the retrieved context with the user query to prompt the LLM.

**Architectural Consideration: Context Window**
- If we retrieve too many documents, we exceed the context window (or pay huge API costs).
- **Lost in the Middle:** LLMs tend to ignore information in the middle of a long context. Put the most relevant chunks at the start or end.
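
Both considerations can be sketched in a few lines: keep only the chunks that fit a budget, then place the strongest ones at the edges of the context. This is one possible arrangement, not a standard algorithm; `order_for_context` is a hypothetical helper, and the character budget is an assumption standing in for a real token count. The `(doc, score)` tuples match the format returned by `retrieve`.

```python
def order_for_context(hits, max_chars=2000):
    """Keep the best-scoring chunks that fit the budget, then put the
    strongest at the edges of the context and the weakest in the middle."""
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)

    # Greedily keep chunks until the character budget is used up
    kept, used = [], 0
    for doc, score in ranked:
        if used + len(doc) > max_chars:
            continue
        kept.append((doc, score))
        used += len(doc)

    # Alternate: rank 1 -> front, rank 2 -> back, rank 3 -> front, ...
    front, back = [], []
    for i, hit in enumerate(kept):
        (front if i % 2 == 0 else back).append(hit)
    return front + back[::-1]

# Made-up hits for illustration
example_hits = [("A", 0.9), ("B", 0.8), ("C", 0.5), ("D", 0.4), ("E", 0.1)]
print([doc for doc, _ in order_for_context(example_hits)])
# → ['A', 'C', 'E', 'D', 'B']
```

The best hit opens the context, the second best closes it, and the weakest lands in the middle where it is most likely to be ignored anyway.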

In [None]:
def generate_rag_prompt(query, hits):
    context_str = "\n".join([f"- {doc}" for doc, score in hits])
    
    prompt = f"""<|user|>
Answer the question based ONLY on the following context:

{context_str}

Question: {query}
<|assistant|>
"""
    return prompt

rag_prompt = generate_rag_prompt(query, hits)
print("📝 Final Prompt for LLM:")
print("="*40)
print(rag_prompt)
print("="*40)

In [None]:
# Let's test with actual use cases!
print("\n🧪 TESTING RAG SYSTEM")
print("="*60)

queries = [
    "What is MLX?",
    "Why are transformers better than LSTMs?",
    "How do I fine-tune without using too much memory?",
    "What is the difference between fine-tuning and RAG?"
]

for query in queries:
    print(f"\n❓ Query: {query}")
    print("-"*60)
    
    hits = retrieve(query, k=2)
    
    for i, (doc, score) in enumerate(hits, 1):
        print(f"{i}. [Score: {score:.3f}] {doc[:80]}...")
    
    print()

print("="*60)

## 5. System Design: Scaling to 100 Million Docs

In an interview, you will be asked: *"This works for 7 sentences. How does it work for 100M?"*

### The Problem: Exact Search is $O(N)$
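Brute-force search computes one dot product per stored vector for every query, so latency and memory grow linearly with the corpus. At 100M vectors that is typically too slow and too memory-hungry for interactive use.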
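
A back-of-the-envelope estimate makes the linear cost concrete. The sketch below assumes 768-dimensional float32 embeddings (one of the sizes mentioned earlier); `brute_force_cost` is a helper written for this example.

```python
def brute_force_cost(num_vectors, dim, bytes_per_value=4):
    """Memory and per-query work for exact (flat) search over float32 vectors."""
    memory_gb = num_vectors * dim * bytes_per_value / 1e9
    multiply_adds = num_vectors * dim  # one dot product per stored vector
    return memory_gb, multiply_adds

for n in (7, 100_000, 100_000_000):
    gb, ops = brute_force_cost(n, dim=768)
    print(f"{n:>11,} vectors: {gb:10.4f} GB, {ops:,} multiply-adds per query")
```

Under these assumptions, 100M vectors need about 307 GB just to hold the raw floats, and every query performs 76.8 billion multiply-adds.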

### The Solution: ANN (Approximate Nearest Neighbors)
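We trade a small amount of recall for speed using an index such as **HNSW (Hierarchical Navigable Small World)** or **IVF (Inverted File Index)**, usually combined with compression.

1.  **HNSW**: Builds a layered graph whose nodes are vectors. Search navigates the graph greedily, which takes roughly $O(\log N)$ hops in practice.
2.  **IVF**: Clusters the vectors and searches only the few clusters nearest to the query.
3.  **Quantization**: Compresses 32-bit floats to 8-bit integers (or single bits) so the index fits in RAM.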
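
To make the quantization idea concrete, here is a minimal sketch of symmetric scalar quantization to 8-bit integers. It is a simplified illustration, not how any particular vector database implements it; the function names and the sample vector are assumptions for this example.

```python
def quantize_int8(vector):
    """Symmetric scalar quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in vector]
    return codes, scale

def dequantize(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]

vec = [0.12, -0.5, 0.33, 0.79]  # made-up embedding values
codes, scale = quantize_int8(vec)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))
print(codes, f"max error = {max_error:.4f}")
```

Each dimension shrinks from 4 bytes to 1, a 4x memory saving, at the price of a reconstruction error of at most half the scale step per dimension.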

### Hybrid Search
Vector search is bad at exact keyword matching (e.g., searching for a specific SKU "XJ-900").
**Best Practice:** Combine Vector Search (Semantic) + BM25 (Keyword) using **Reciprocal Rank Fusion (RRF)**.
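
Reciprocal Rank Fusion only needs the rank positions from each retriever, not their raw scores, which is why it can combine cosine similarities and BM25 scores that live on different scales. A minimal sketch, assuming each retriever returns an ordered list of document ids; the ids below are made up, and `k=60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids. Each list contributes
    1 / (k + rank) per document, with rank starting at 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = ["doc_a", "doc_b", "doc_c"]  # hypothetical semantic results
bm25_ranking = ["doc_c", "doc_a", "doc_d"]    # hypothetical keyword results
print(reciprocal_rank_fusion([vector_ranking, bm25_ranking]))
# → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear near the top of both lists rise to the top of the fused list, while a document found by only one retriever still gets a chance.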

## 8. Production Checklist

Before deploying RAG to production, ensure:

### ✅ Data Quality
- [ ] Chunk size optimized (128-512 tokens)
- [ ] Metadata attached (source, timestamp)
- [ ] Duplicates removed

### ✅ Search Quality
- [ ] Threshold for "no answer" (e.g., score < 0.3)
- [ ] Hybrid search implemented (Vector + BM25)
- [ ] Re-ranking layer added

### ✅ Performance
- [ ] HNSW index for > 100K docs
- [ ] Quantization enabled (PQ or Scalar)
- [ ] Caching for frequent queries

### ✅ Monitoring
- [ ] Log retrieval scores
- [ ] Track "no answer" rate
- [ ] A/B test different chunking strategies
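
The "no answer" threshold from the Search Quality list could look roughly like the sketch below. It assumes hits arrive as `(doc, score)` tuples, the format `retrieve` returns. The helper names and sample hits are invented for this example, and the 0.3 cutoff is only the illustrative value from the checklist; a real threshold has to be tuned on logged retrieval scores.

```python
NO_ANSWER = "I don't have that information."

def filter_hits(hits, min_score=0.3):
    """Drop hits whose similarity falls below the threshold."""
    return [(doc, score) for doc, score in hits if score >= min_score]

def context_or_abstain(hits, min_score=0.3):
    """Return the usable context, or the fallback message if nothing passes."""
    usable = filter_hits(hits, min_score)
    if not usable:
        return NO_ANSWER
    return "\n".join(f"- {doc}" for doc, _ in usable)

# Made-up hits for illustration
print(context_or_abstain([("MLX uses unified memory.", 0.82), ("Unrelated.", 0.05)]))
print(context_or_abstain([("Unrelated.", 0.05)]))
```

Abstaining before the LLM call also saves the cost of a generation that would have been built on irrelevant context.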

In [None]:
# Full RAG Pipeline Simulation
def rag_pipeline(user_query, top_k=2):
    """
    Complete RAG pipeline:
    1. User asks question
    2. Retrieve relevant docs
    3. Build prompt with context
    4. Generate answer with LLM
    """
    print(f"\n🔍 RAG Pipeline for: '{user_query}'")
    print("-"*60)
    
    # Step 1: Retrieve
    print("📚 RETRIEVAL PHASE:")
    hits = retrieve(user_query, k=top_k)
    for i, (doc, score) in enumerate(hits, 1):
        print(f"   {i}. [{score:.2f}] {doc[:60]}...")
    
    # Step 2: Build prompt
    print("\n📝 GENERATION PHASE:")
    context = "\n".join([f"- {doc}" for doc, _ in hits])
    
    prompt = f"""<|system|>
You are a helpful assistant. Answer the question using ONLY the provided context.
If the answer is not in the context, say "I don't have that information."
<|user|>
Context:
{context}

Question: {user_query}
<|assistant|>"""
    
    print("   Prompt constructed with retrieved context")
    print(f"   Prompt length: {len(prompt)} characters")
    
    # Step 3: Generate (simulated)
    print("\n🤖 LLM RESPONSE:")
    print("   [In production, this calls: generate(model, tokenizer, prompt)]")
    print("   Example output: 'Based on the context, MLX is an array framework...'")
    
    return prompt

# Test the full pipeline
query = "What makes MLX different from CUDA?"
final_prompt = rag_pipeline(query)

print("\n" + "="*60)
print("✅ RAG Pipeline Complete!")

## 7. Integration with LLM (Full RAG Pipeline)

Let's see how this would connect to an actual LLM in production.

In [None]:
# Edge Case 1: Query doesn't match any documents
print("🚨 EDGE CASE: Out-of-Domain Query")
print("="*60)

bad_query = "How do I cook pasta?"
print(f"Query: {bad_query}\n")

hits = retrieve(bad_query, k=2)
for i, (doc, score) in enumerate(hits, 1):
    print(f"{i}. [Score: {score:.3f}] {doc}")

print("\n💡 Solution: Set a threshold (e.g., score < 0.3 → 'I don't know')")
print("="*60)

# Edge Case 2: Multi-hop reasoning
print("\n\n🚨 EDGE CASE: Multi-Hop Question")
print("="*60)

complex_query = "If I want to use MLX and need parallelization, should I use LSTMs?"
print(f"Query: {complex_query}\n")

hits = retrieve(complex_query, k=3)
for i, (doc, score) in enumerate(hits, 1):
    print(f"{i}. [Score: {score:.3f}] {doc}")

print("\n💡 Solution: Use LLM to decompose query into sub-questions")
print("   1. What is MLX? → [retrieve]")
print("   2. Can LSTMs parallelize? → [retrieve]")
print("   3. LLM synthesizes: 'Use Transformers, not LSTMs'")
print("="*60)

## 6. When RAG Fails: Edge Cases

Understanding failure modes is critical for production systems.

### Trade-off Table: Choosing Your Search Strategy

| Scenario | Best Approach | Why |
|----------|--------------|-----|
| 1M documents, semantic search | HNSW (Faiss/Milvus) | Fast approximate search |
| 100K documents, need 100% recall | Exact search (what we built) | Small enough for brute force |
| Need exact SKU/ID matches | Hybrid (Vector + BM25) | Keyword search for IDs |
| Real-time updates | Vector DB with incremental indexing | No rebuild needed |
| Privacy-sensitive data | On-device MLX embeddings | No API calls |

### Code Snippet: Production HNSW (Conceptual)

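```python
# Using Faiss (Facebook AI Similarity Search)
import faiss
import numpy as np

# Faiss expects float32 NumPy arrays; queries must be 2-D: (num_queries, dim)
vector_db_np = np.array(vector_db).astype(np.float32)
query_vec_np = np.array(get_embedding(query)).astype(np.float32)[None, :]

# Build index (default metric is L2, which ranks unit vectors
# in the same order as cosine similarity)
index = faiss.IndexHNSWFlat(embedding_dim, 32)  # 32 = connections per node
index.add(vector_db_np)

# Search
D, I = index.search(query_vec_np, 5)  # D=distances, I=indices
```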

## ❓ FAQ

**Q: Vector Search vs. Keyword Search?**
A:
*   **Vector (Semantic):** Finds "meaning". Query "dog" matches "puppy". Good for concepts.
*   **Keyword (Lexical):** Finds exact matches. Query "Error 503" matches "Error 503". Good for specific IDs/names.
*   **Hybrid:** The best systems use both (Reciprocal Rank Fusion).

**Q: How do I handle stale data?**
A: That is the main advantage of RAG over Fine-Tuning. You just update the Vector Database (add/delete/update vectors). The LLM doesn't need to change.
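
As a sketch of what "just update the Vector Database" means, the toy store below supports upsert, delete, and exact search. `InMemoryVectorStore` is a hypothetical class written for this example with made-up documents and vectors; real vector databases expose similar operations under their own APIs.

```python
class InMemoryVectorStore:
    """Toy store keyed by document id (illustration only)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (text, vector)

    def upsert(self, doc_id, text, vector):
        # Insert a new document or overwrite a stale one
        self._docs[doc_id] = (text, vector)

    def delete(self, doc_id):
        self._docs.pop(doc_id, None)

    def search(self, query_vec, k=2):
        # Exact search: dot product against every stored vector
        scored = [
            (text, sum(q * v for q, v in zip(query_vec, vec)))
            for text, vec in self._docs.values()
        ]
        return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

store = InMemoryVectorStore()
store.upsert("policy", "Vacation policy: 20 days.", [1.0, 0.0])
store.upsert("policy", "Vacation policy: 25 days.", [1.0, 0.0])  # replaces the stale text
print(store.search([1.0, 0.0], k=1))
```

The second `upsert` replaces the outdated text under the same id, so the next query sees the new policy without any change to the LLM.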

**Q: What is "Re-ranking"?**
A: Vector search is fast but approximate. A common pattern is to retrieve around 50 candidates using vectors, then use a slower, more accurate "Cross-Encoder" model to re-rank those candidates and pass only the best 5 to the LLM.
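
The two-stage pattern can be sketched as follows. The `overlap_score` function is only a stand-in so the example runs without a model; it is not a cross-encoder. In a real system, `score_fn` would wrap a model that reads the query and document together. All names and sample documents are assumptions for this illustration.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Second stage: score every (query, doc) pair with a slower, more
    accurate scorer and keep the best top_n."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

def overlap_score(query, doc):
    """Stand-in scorer: fraction of query words that appear in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0

# Pretend these came back from the fast vector search stage
candidates = [
    "transformers process tokens in parallel",
    "lstms process tokens one step at a time",
    "mlx is an array framework",
]
print(rerank("how do transformers process tokens", candidates, overlap_score, top_n=2))
```

Because the expensive scorer only sees the short candidate list, its cost stays constant no matter how large the corpus grows.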

## 💭 Closing Thoughts

**The Future of Context Windows**
As LLMs support larger contexts (1M+ tokens), do we still need RAG?
*   **Yes:** For latency and cost. Reading 1M tokens takes time and money.
*   **Yes:** For privacy. You don't want to send your entire database to the model for every query.

**Architectural Evolution:**
RAG is evolving from "Static Retrieval" to "Agentic Retrieval", where the LLM decides *what* to search for, *when* to search, and *how* to filter the results. You are building the foundation for these autonomous agents.