# RAG System - Implementation Walkthrough

**Learning Objectives:**
- Build a complete RAG pipeline from scratch
- Understand each component (chunking, embedding, retrieval, generation)
- Test the system with real queries
- Learn best practices for production RAG

**For Interviews:**  
This notebook shows you can:
- Implement RAG systems end-to-end
- Explain design decisions
- Debug and test ML systems
- Think about production considerations

In [None]:
import sys
sys.path.append('../src')

import json
from pathlib import Path

from document_processor import load_documents, process_documents
from vector_store import VectorStore
from rag_pipeline import RAGPipeline

print("✓ Imports complete")

## Part 1: Document Processing

**Interview Key Point:** RAG starts with good document processing. Garbage in = garbage out.

In [None]:
# Load knowledge base
print("📚 Loading Knowledge Base...\n")
documents = load_documents('../data/knowledge_base.json')

print(f"Loaded {len(documents)} documents")
print("\nFirst document preview:")
print(f"  ID: {documents[0]['id']}")
print(f"  Title: {documents[0]['title']}")
print(f"  Category: {documents[0]['category']}")
print(f"  Content length: {len(documents[0]['content'].split())} words")

### Chunking Strategy

**Why chunking matters:**
- Embeddings work better on coherent chunks (not entire documents)
- Retrieval is more precise with smaller chunks
- LLM context windows are limited

**Common strategy:** ~400 words with ~50 word overlap
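
The strategy above can be sketched as a sliding window over words. This is a minimal illustration that assumes whitespace tokenization; `chunk_words` is a hypothetical helper, and the actual `process_documents` in `document_processor` may work differently (for example, by respecting sentence boundaries).

```python
def chunk_words(text, chunk_size=400, overlap=50):
    """Split text into word-based chunks that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# A synthetic 1000-word document: w0 w1 ... w999
sample = " ".join(f"w{i}" for i in range(1000))
pieces = chunk_words(sample)
print([len(p.split()) for p in pieces])  # [400, 400, 300]
```

The last 50 words of each chunk reappear as the first 50 words of the next one, so a sentence that straddles a boundary is still seen whole in at least one chunk.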

In [None]:
# Process documents into chunks
print("✂️ Chunking Documents...\n")

CHUNK_SIZE = 400  # words per chunk
OVERLAP = 50      # overlap between chunks

chunks = process_documents(
    documents,
    chunk_size=CHUNK_SIZE,
    overlap=OVERLAP
)

print("\n📊 Chunking Results:")
print(f"  Total chunks: {len(chunks)}")
print(f"  Average chunks per document: {len(chunks)/len(documents):.1f}")
print(f"  Chunk size: {CHUNK_SIZE} words")
print(f"  Overlap: {OVERLAP} words ({OVERLAP/CHUNK_SIZE*100:.0f}%)")

In [None]:
# Examine a chunk
print("\n🔍 Example Chunk:\n")
example_chunk = chunks[0]
print(f"ID: {example_chunk['id']}")
print(f"Source: {example_chunk['source_title']}")
print(f"Chunk {example_chunk['chunk_index'] + 1} of {example_chunk['total_chunks']}")
print(f"\nText:\n{example_chunk['text'][:300]}...")
print(f"\nWord count: {len(example_chunk['text'].split())}")

## Part 2: Vector Store Setup

**Interview Key Point:** Vector databases store embeddings for semantic search.

**How it works:**
1. Text → Embedding model → Vector (list of numbers)
2. Store vectors in database with metadata
3. Query → Embedding → Find nearest vectors (cosine similarity)
4. Return corresponding text chunks
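
The four steps above can be illustrated without any vector database. The sketch below uses hand-written 3-dimensional vectors in place of real embeddings and a brute-force scan in place of an index; `cosine_similarity` and `top_k` are illustrative names, not part of the `vector_store` module.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=3):
    """Return the k (similarity, text) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy "embeddings"; a real model such as all-MiniLM-L6-v2 emits 384 dimensions.
store = [
    ("Python is a programming language", [0.9, 0.1, 0.0]),
    ("Neural networks learn from data", [0.1, 0.9, 0.2]),
    ("HTML structures web pages", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # stands in for an embedded query

for score, text in top_k(query_vec, store, k=2):
    print(f"{score:.3f}  {text}")
```

A production store replaces the linear scan with an approximate nearest-neighbor index, but the ranking idea is the same.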

In [None]:
# Initialize vector store
print("🗄️ Initializing Vector Store...\n")

vector_store = VectorStore(
    collection_name="rag_demo_kb",
    persist_directory="../chroma_db"
)

print(f"Collection: {vector_store.collection_name}")
print(f"Current document count: {vector_store.count()}")

In [None]:
# Add chunks to vector store
# This will:
# 1. Generate embeddings for each chunk (using sentence-transformers)
# 2. Store embeddings + metadata in ChromaDB
# 3. Index for fast similarity search

print("\n📥 Adding chunks to vector store...")
print("(This may take a minute on first run to download the embedding model)\n")

vector_store.add_documents(chunks)

print("\n✓ Vector store ready!")
print(f"  Total chunks indexed: {vector_store.count()}")

## Part 3: Semantic Search

**This is the "Retrieval" in RAG**

Let's test the semantic search capabilities.

In [None]:
# Test query
test_query = "What is Python programming?"

print(f"🔍 Query: '{test_query}'\n")
print("Searching knowledge base...\n")

results = vector_store.query(test_query, top_k=3)

print(f"Retrieved {len(results['documents'])} chunks:\n")

for i, (doc, dist, meta) in enumerate(zip(
    results['documents'],
    results['distances'],
    results['metadatas']
), 1):
    similarity_score = 1 - dist  # Distance to similarity; exact only for cosine distance
    print(f"--- Result {i} ---")
    print(f"Similarity: {similarity_score:.3f}")
    print(f"Source: {meta['source_title']}")
    print(f"Category: {meta['category']}")
    print(f"Text: {doc[:200]}...\n")

### Understanding Similarity Scores

**Interview Tip:** Be ready to explain similarity metrics!

As a rough guide for cosine similarity (exact thresholds depend on the embedding model):

- **1.0 = Perfect match** (identical text)
- **0.8-1.0 = Very similar** (paraphrases, related concepts)
- **0.5-0.8 = Somewhat related** (shares topics)
- **< 0.5 = Weakly related** (might not be relevant)

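The conversion `1 - distance` yields cosine similarity only when the collection is configured for **cosine distance**. ChromaDB defaults to **squared L2 distance**, where for unit-length embeddings the cosine similarity is `1 - distance / 2` instead. Check how `VectorStore` configures its collection before interpreting these scores.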
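
How a distance maps to a similarity depends on the metric. The sketch below uses toy 2-D vectors and the standard library to check the identity for unit-length vectors: a squared L2 distance `d` and cosine similarity `s` satisfy `s = 1 - d / 2`. The helper names are illustrative, and which metric the store in this notebook actually returns is an assumption to verify against its configuration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Dot product, which equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


a = normalize([3.0, 4.0])  # (0.6, 0.8)
b = normalize([4.0, 3.0])  # (0.8, 0.6)

d = squared_l2(a, b)
print(f"squared L2 distance: {d:.3f}")
print(f"cosine similarity:   {cosine(a, b):.3f}")
print(f"1 - d / 2:           {1 - d / 2:.3f}")
print(f"1 - d:               {1 - d:.3f}")
```

Here `1 - d / 2` matches the cosine similarity (0.960), while `1 - d` gives 0.920. The ranking of results is the same either way, but absolute thresholds are not interchangeable between metrics.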

In [None]:
# Try different queries to see semantic understanding
test_queries = [
    "programming languages",
    "artificial intelligence and machine learning",
    "web development frameworks"
]

print("🧪 Testing Semantic Search:\n")

for query in test_queries:
    results = vector_store.query(query, top_k=1)
    
    print(f"Query: '{query}'")
    print(f"Top match: {results['metadatas'][0]['source_title']}")
    print(f"Similarity: {1 - results['distances'][0]:.3f}")
    print()

## Part 4: Complete RAG Pipeline

Now let's use the full RAG pipeline that orchestrates everything.

In [None]:
# Initialize RAG pipeline
print("🚀 Initializing RAG Pipeline...\n")

rag = RAGPipeline(
    collection_name="rag_demo_complete",
    chunk_size=400,
    overlap=50
)

# Add documents
rag.add_documents('../data/knowledge_base.json')

print("\n✓ RAG Pipeline ready!")

### Query the RAG System

**The RAG workflow:**
1. **Retrieve:** Find relevant chunks from knowledge base
2. **Augment:** Assemble prompt with retrieved context
3. **Generate:** LLM generates answer based on context
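
The "Augment" step is mostly string assembly. The sketch below shows one plausible shape for it; `build_prompt` and its wording are hypothetical and are not the template used by `RAGPipeline`, whose actual prompt can be inspected via `response['prompt']`.

```python
def build_prompt(question, contexts):
    """Assemble a grounded prompt from retrieved chunks (hypothetical template)."""
    # Number each chunk so the model can cite passages as [1], [2], ...
    numbered = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say you don't know.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


contexts = [
    "Python is a high-level, general-purpose programming language.",
    "Python is widely used for web development, data analysis, and automation.",
]
prompt = build_prompt("What is Python used for?", contexts)
print(prompt)
```

Numbering the chunks and stating the fallback behavior up front are two small choices that make it easier to ask the model for citations and to discourage answers drawn from outside the context.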

In [None]:
# Ask a question
question = "What is Python and what is it used for?"

print(f"❓ Question: {question}\n")
print("="*60)

# Query the RAG system
response = rag.query(question, top_k=3)

# Show the answer
print("\n📝 RAG RESPONSE:\n")
print(response['answer'])

In [None]:
# Examine the prompt that would be sent to an LLM
print("\n🔍 PROMPT FOR LLM:\n")
print("="*60)
print(response['prompt'])
print("="*60)

print("\n💡 In production, this prompt would be sent to GPT-4, Claude, etc.")

In [None]:
# Analyze retrieved contexts
print("\n📚 RETRIEVED CONTEXTS:\n")

for i, (ctx, dist) in enumerate(zip(response['contexts'], response['distances']), 1):
    print(f"--- Context {i} ---")
    print(f"Similarity: {1-dist:.3f}")
    print(f"Text: {ctx[:150]}...")
    print()

## Part 5: Testing Multiple Queries

Let's test the RAG system with various question types.

In [None]:
# Test questions
questions = [
    "What is machine learning?",
    "How does web development work?",
    "Explain data structures",
    "What are the benefits of cloud computing?"
]

print("🧪 Testing RAG System with Multiple Queries\n")
print("="*60 + "\n")

for question in questions:
    print(f"❓ {question}\n")
    
    # Get response
    response = rag.query(question, top_k=2)
    
    # Show top retrieved chunk
    if response['contexts']:
        top_context = response['contexts'][0]
        top_similarity = 1 - response['distances'][0]
        
        print(f"📊 Top Match (similarity: {top_similarity:.3f}):")
        print(f"   {top_context[:200]}...\n")
    else:
        print("   No relevant context found\n")
    
    print("-" * 60 + "\n")

## Part 6: Understanding RAG Components

**For interviews, be ready to explain each component:**

In [None]:
print("🏗️ RAG SYSTEM ARCHITECTURE\n")
print("="*60)

components = [
    {
        'component': '1. Document Processor',
        'purpose': 'Clean, chunk, and prepare documents',
        'key_decision': f'Chunk size = {CHUNK_SIZE}, Overlap = {OVERLAP}',
        'why': 'Balance precision vs context'
    },
    {
        'component': '2. Embedding Model',
        'purpose': 'Convert text to semantic vectors',
        'key_decision': 'sentence-transformers/all-MiniLM-L6-v2',
        'why': 'Fast, good quality, 384 dimensions'
    },
    {
        'component': '3. Vector Store',
        'purpose': 'Store & search embeddings efficiently',
        'key_decision': 'ChromaDB (in-memory + persistent)',
        'why': 'Simple, no external dependencies'
    },
    {
        'component': '4. Retriever',
        'purpose': 'Find most similar chunks to query',
        'key_decision': 'top_k=3, cosine similarity',
        'why': '3 chunks ≈ 1200 words of context'
    },
    {
        'component': '5. Prompt Template',
        'purpose': 'Format context + question for LLM',
        'key_decision': 'Clear instructions, cite sources',
        'why': 'Reduce hallucinations, improve quality'
    }
]

for comp in components:
    print(f"\n{comp['component']}")
    print(f"  Purpose: {comp['purpose']}")
    print(f"  Decision: {comp['key_decision']}")
    print(f"  Why: {comp['why']}")

print("\n" + "="*60)

## Part 7: Production Considerations

**Interview talking points for production RAG systems:**

In [None]:
print("🏭 PRODUCTION RAG CHECKLIST\n")
print("="*60)

considerations = [
    {
        'category': 'Performance',
        'items': [
            'Cache embeddings to avoid recomputation',
            'Use batch processing for large documents',
            'Consider GPU for embedding generation',
            'Implement async/parallel processing'
        ]
    },
    {
        'category': 'Quality',
        'items': [
            'Evaluate retrieval precision/recall',
            'A/B test different chunk sizes',
            'Monitor answer quality metrics',
            'Implement user feedback loops'
        ]
    },
    {
        'category': 'Scalability',
        'items': [
            'Use production vector DB (Pinecone, Weaviate, etc.)',
            'Implement incremental indexing',
            'Plan for millions of documents',
            'Monitor memory and disk usage'
        ]
    },
    {
        'category': 'Cost',
        'items': [
            'LLM API costs (per token)',
            'Vector DB storage costs',
            'Embedding generation costs',
            'Trade-off: smaller top_k = cheaper but maybe worse quality'
        ]
    },
    {
        'category': 'Security',
        'items': [
            'Filter sensitive information before indexing',
            'Implement access controls on knowledge base',
            'Sanitize user queries',
            'Audit what information is being retrieved'
        ]
    }
]

for consideration in considerations:
    print(f"\n{consideration['category']}:")
    for item in consideration['items']:
        print(f"  • {item}")

print("\n" + "="*60)
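
The "Evaluate retrieval precision/recall" item in the checklist can be made concrete with two small metrics. The sketch below assumes a hand-labeled set of relevant chunk IDs per query; the function name and the IDs are hypothetical.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Compute precision@k and recall@k for a single query."""
    top = retrieved_ids[:k]
    hits = sum(1 for chunk_id in top if chunk_id in relevant_ids)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical IDs: retriever output (ranked) vs. chunks a human marked relevant.
retrieved = ["doc1_chunk0", "doc3_chunk1", "doc1_chunk1", "doc7_chunk0"]
relevant = {"doc1_chunk0", "doc1_chunk1", "doc2_chunk0"}

p, r = precision_recall_at_k(retrieved, relevant, k=3)
print(f"precision@3 = {p:.2f}, recall@3 = {r:.2f}")
```

Averaging these over a fixed set of test queries gives a number to compare when changing chunk size, overlap, or `top_k`.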

## Part 8: Common RAG Challenges & Solutions

**Be ready to discuss these in interviews:**

In [None]:
print("⚠️ COMMON RAG CHALLENGES\n")
print("="*60)

challenges = [
    {
        'problem': 'Retrieval returns irrelevant chunks',
        'causes': [
            'Chunk size too small/large',
            'Poor document preprocessing',
            'Weak embedding model'
        ],
        'solutions': [
            'Tune chunk size based on evaluation',
            'Use better cleaning/normalization',
            'Try domain-specific embedding models',
            'Implement hybrid search (keyword + semantic)'
        ]
    },
    {
        'problem': 'LLM ignores retrieved context',
        'causes': [
            'Context not relevant enough',
            'Prompt poorly structured',
            'Too much context (information overload)'
        ],
        'solutions': [
            'Improve retrieval quality',
            'Use clearer prompt instructions',
            'Reduce top_k to most relevant chunks',
            'Add explicit grounding instructions'
        ]
    },
    {
        'problem': 'Hallucinations despite RAG',
        'causes': [
            'No relevant context found',
            'LLM uses pre-training knowledge instead',
            'Ambiguous or contradictory context'
        ],
        'solutions': [
            'Explicitly instruct: "Only use provided context"',
            'Return "I don\'t know" if similarity too low',
            'Implement confidence scoring',
            'Ask LLM to cite specific context passages'
        ]
    },
    {
        'problem': 'Slow query response time',
        'causes': [
            'Embedding generation is slow',
            'Vector search not optimized',
            'LLM API latency'
        ],
        'solutions': [
            'Cache query embeddings',
            'Use approximate nearest neighbor search',
            'Stream LLM responses',
            'Precompute for common queries'
        ]
    }
]

for i, challenge in enumerate(challenges, 1):
    print(f"\n{i}. {challenge['problem']}")
    print("\n   Causes:")
    for cause in challenge['causes']:
        print(f"   - {cause}")
    print("\n   Solutions:")
    for solution in challenge['solutions']:
        print(f"   ✓ {solution}")
    print()

print("="*60)
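
One of the hallucination mitigations listed above, returning "I don't know" when similarity is too low, can be sketched as a filter in front of the generation step. The function names and the 0.5 threshold are illustrative assumptions, and the `1 - dist` conversion assumes the store returns cosine distances.

```python
def select_contexts(documents, distances, min_similarity=0.5):
    """Keep chunks whose similarity (1 - cosine distance) clears the threshold."""
    kept = []
    for doc, dist in zip(documents, distances):
        similarity = 1 - dist  # assumes cosine distance
        if similarity >= min_similarity:
            kept.append((similarity, doc))
    return kept


def answer_or_abstain(documents, distances, min_similarity=0.5):
    """Abstain instead of generating when no chunk is relevant enough."""
    kept = select_contexts(documents, distances, min_similarity)
    if not kept:
        return "I don't know: no sufficiently relevant context was found."
    # A real pipeline would build a prompt from `kept` and call an LLM here.
    return f"Would answer using {len(kept)} chunk(s)."


print(answer_or_abstain(["chunk about Python"], [0.2]))
print(answer_or_abstain(["chunk about gardening"], [0.9]))
```

The threshold is best tuned on labeled queries rather than fixed in advance, since score distributions differ between embedding models.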

## Summary: What You've Learned

✅ **Technical Skills:**
- Built a complete RAG system from scratch
- Implemented document chunking strategies
- Used vector databases for semantic search
- Created production-ready prompts

✅ **Interview Ready:**
- Can explain RAG architecture
- Understand trade-offs (chunk size, top_k, etc.)
- Know production considerations
- Can debug common RAG problems

✅ **Next Steps:**
1. Complete evaluation notebook (03_evaluation_optimization.ipynb)
2. Practice explaining this system out loud
3. Try implementing RAG for a different domain
4. Read the interview_questions.md file

In [None]:
# Cleanup
print("\n🧹 Cleaning up...")
rag.vector_store.delete_collection()
vector_store.delete_collection()
print("✓ Done!")

print("\n" + "="*60)
print("RAG IMPLEMENTATION COMPLETE! 🎉")
print("="*60)