# Learning RAG (Retrieval-Augmented Generation) from Scratch

## What You'll Learn
This notebook teaches you how to build a **Knowledge Base Agent** step by step. By the end, you'll understand:

1. **Programming Basics**: How to work with Python data structures (lists, dictionaries)
2. **What is RAG**: How to combine retrieval (searching documents) with generation (LLM answers)
3. **Embeddings**: How text is converted to numbers for semantic search
4. **Building Agents**: How to create a system that can answer questions using external knowledge

## The Problem We're Solving
Imagine you have a collection of documents (company policies, technical docs, etc.) and you want to ask questions about them. An LLM on its own can't answer reliably because your documents were never part of its training data. **RAG solves this** by:
1. Finding relevant documents (retrieval)
2. Giving those documents to the LLM as context (augmentation)
3. Having the LLM generate an answer based on that context (generation)

Let's build this step by step!

## Step 1: Creating a Small Knowledge Base

First, let's create a simple collection of documents. Think of this as a mini-database of information.

### What is a Knowledge Base?
A **knowledge base** is just a collection of documents/facts that contain information. In real applications, this might be:
- Company documentation
- Product manuals
- FAQs
- Research papers

### Data Structure Basics
We'll use Python **dictionaries** (key-value pairs) stored in a **list**. Each document has:
- `id`: A unique identifier
- `text`: The actual content
- `source`: Where it came from
- `date`: When it was written

This metadata helps us track where information comes from (provenance).
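
To see why this metadata is worth carrying around, here is a minimal sketch. The two sample documents and the names `docs_by_id` and `newest_first` are made up for illustration; they are not part of the knowledge base we build next.

```python
# Illustrative sample documents (same fields as described above; contents are made up)
sample_docs = [
    {'id': 'a1', 'text': 'Refunds are processed within 5 days.', 'source': 'policy.md', 'date': '2024-03-01'},
    {'id': 'a2', 'text': 'Support is available on weekdays.', 'source': 'faq.md', 'date': '2024-06-15'},
]

# Build an index so a document can be fetched directly by its id
docs_by_id = {doc['id']: doc for doc in sample_docs}

# ISO dates (YYYY-MM-DD) sort correctly as plain strings
newest_first = sorted(sample_docs, key=lambda doc: doc['date'], reverse=True)

print(docs_by_id['a2']['source'])
print([doc['id'] for doc in newest_first])
```

With an id index and sortable dates, you can always answer "where did this come from?" and "which version is newer?".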

In [None]:
# Create our knowledge base: a list of dictionaries
# Each dictionary represents one document
knowledge_base = [
    {
        'id': 'doc1',
        'text': 'LangGraph provides a node-based workflow for composing LLM chains. It supports checkpoints and reducers for reliability.',
        'source': 'intro.md',
        'date': '2024-10-01'
    },
    {
        'id': 'doc2',
        'text': 'Retrieval-Augmented Generation (RAG) combines a retriever and a generator to ground output in real documents.',
        'source': 'rag.md',
        'date': '2024-11-02'
    },
    {
        'id': 'doc3',
        'text': 'Best practices: chunking documents, adding metadata, re-ranking top candidates, and logging retrieval traces.',
        'source': 'best_practices.md',
        'date': '2025-01-15'
    },
    {
        'id': 'doc4',
        'text': 'AI agents use tools to interact with external systems. Tools can be APIs, databases, or custom functions.',
        'source': 'agents.md',
        'date': '2024-12-10'
    },
    {
        'id': 'doc5',
        'text': 'Embeddings convert text into vectors (arrays of numbers) that capture semantic meaning.',
        'source': 'embeddings.md',
        'date': '2024-11-20'
    }
]

# Let's see what we have
print("📚 Knowledge Base Contents:")
print("=" * 80)
for doc in knowledge_base:
    # [:70] means "show first 70 characters" to keep output clean
    preview = doc['text'][:70] + "..." if len(doc['text']) > 70 else doc['text']
    print(f"\n🔹 ID: {doc['id']}")
    print(f"   Source: {doc['source']}")
    print(f"   Date: {doc['date']}")
    print(f"   Content: {preview}")

## Step 2: Understanding Embeddings

### The Problem with Keyword Search
If someone asks "How can a chatbot call outside services?", traditional keyword search would miss document 4 (the one about agents and tools) because the question and the document share no content words. But semantically, it's very relevant!
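
The failure mode can be sketched in a few lines. This is a toy keyword matcher, not what the notebook uses later; the stop-word list, the helper names, and both test questions are assumptions chosen for illustration.

```python
import re

# Hypothetical stop-word list, kept tiny for illustration
STOP_WORDS = {'a', 'an', 'the', 'how', 'can', 'do', 'to', 'with', 'or', 'be', 'and'}

def content_words(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}

def keyword_overlap(question, document):
    """Count the content words shared by the question and the document."""
    return len(content_words(question) & content_words(document))

tools_doc = ('AI agents use tools to interact with external systems. '
             'Tools can be APIs, databases, or custom functions.')

literal_question = 'What tools do AI agents use?'
paraphrased_question = 'How can a chatbot call outside services?'

print(keyword_overlap(literal_question, tools_doc))      # shares: tools, ai, agents, use
print(keyword_overlap(paraphrased_question, tools_doc))  # shares no content words
```

The paraphrased question asks about the same thing, yet a pure keyword matcher gives it no credit at all.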

### What Are Embeddings?
**Embeddings** are a way to convert text into numbers (vectors) that capture meaning. Similar concepts have similar numbers.

For example:
- "dog" and "puppy" would have similar embeddings
- "dog" and "computer" would have very different embeddings

### Two Approaches We'll Try:

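1. **Simple Approach (TF-IDF)**: Scores documents by weighted word counts. Fast, but it only matches words that literally appear in both the question and the document
2. **Advanced Approach (Sentence Transformers)**: Uses a pre-trained neural network to map each sentence to a dense vector, so paraphrases end up close together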

Let's implement both so you can see the difference!
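
Before reaching for a library, it may help to see the idea behind TF-IDF worked by hand. The sketch below uses the textbook formula (term frequency times the log of inverse document frequency) on three made-up sentences. scikit-learn's `TfidfVectorizer` applies smoothing and normalization on top of this, so its numbers will differ.

```python
import math
from collections import Counter

# Made-up toy corpus, already lowercased and punctuation-free
toy_docs = [
    'rag combines retrieval and generation',
    'embeddings capture meaning',
    'retrieval finds relevant documents',
]

tokenized = [doc.split() for doc in toy_docs]
num_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(word, tokens):
    """Term frequency in this document times inverse document frequency."""
    tf = tokens.count(word) / len(tokens)
    idf = math.log(num_docs / doc_freq[word])
    return tf * idf

first = tokenized[0]
print(round(tf_idf('retrieval', first), 3))  # appears in 2 of 3 docs: lower weight
print(round(tf_idf('rag', first), 3))        # appears in 1 of 3 docs: higher weight
```

Both words occur once in the first document, but the rarer word gets the larger weight.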

In [None]:
# Import the libraries we need
import numpy as np  # For numerical operations
from sklearn.feature_extraction.text import TfidfVectorizer  # For simple text search

# We'll try to use advanced embeddings, but have a fallback
use_advanced_embeddings = False

try:
    # This is the advanced approach using AI-powered embeddings
    from sentence_transformers import SentenceTransformer
    
    # Load a pre-trained model (downloads automatically first time)
    # 'all-MiniLM-L6-v2' is a small, fast model good for learning
    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    use_advanced_embeddings = True
    print("✅ Using advanced AI embeddings (Sentence Transformers)")
    print("   This understands semantic meaning!")
    
except ImportError:
    # If sentence-transformers isn't installed, we'll use the simpler approach
    print("⚠️  Sentence Transformers not installed")
    print("   Using simpler TF-IDF approach (keyword-based)")
    print("\n   To install: pip install sentence-transformers")

print("\n" + "="*80)

## Step 3: Converting Documents to Embeddings

Now we need to convert all our documents into embeddings (numbers) so we can search them.

### What's Happening Here:
1. **Extract text**: Get just the text content from each document
2. **Convert to embeddings**: Turn each text into an array of numbers
3. **Normalize**: Scale every vector to length 1 so that scores depend on direction, not magnitude

### Why Normalize?
Without normalization, a vector with larger values would score higher just because of its size, not because its text is more relevant. Once every vector has length 1, the dot product of two vectors equals their cosine similarity, so scores are directly comparable.
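
Here is a small sketch of that effect, using made-up two-dimensional vectors rather than real embeddings. The helper names `normalize` and `dot` are illustrative.

```python
import math

def normalize(vector):
    """Scale a vector to length 1, keeping its direction."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 0.0]
short_doc = [3.0, 4.0]
long_doc = [30.0, 40.0]   # same direction as short_doc, ten times the magnitude

# Raw dot products differ by a factor of ten
print(dot(query, short_doc), dot(query, long_doc))

# After normalization both documents get the same score
print(dot(normalize(query), normalize(short_doc)))
print(dot(normalize(query), normalize(long_doc)))
```

The two documents point in the same direction, so after normalization they score identically against the query.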

In [None]:
# Step 1: Extract all the text from our documents
# This creates a list of just the text strings
document_texts = [doc['text'] for doc in knowledge_base]

print(f"📄 Processing {len(document_texts)} documents...")

if use_advanced_embeddings:
    # ADVANCED METHOD: Use AI to understand meaning
    
    # Convert each text to a vector (array of numbers)
    # convert_to_numpy=True makes it easier to do math operations
    embedding_vectors = embedding_model.encode(document_texts, convert_to_numpy=True)
    
    print(f"✅ Created embeddings with shape: {embedding_vectors.shape}")
    print(f"   - {embedding_vectors.shape[0]} documents")
    print(f"   - {embedding_vectors.shape[1]} dimensions (numbers) per document")
    
    # Normalize the vectors (make them unit length)
    # This is like adjusting volumes to the same level before comparing
    norms = np.sqrt((embedding_vectors**2).sum(axis=1, keepdims=True))
    embedding_vectors = embedding_vectors / (norms + 1e-9)  # +1e-9 prevents division by zero
    
    print("✅ Vectors normalized for fair comparison")
    
else:
    # SIMPLE METHOD: Count words and their importance
    
    # TF-IDF = Term Frequency-Inverse Document Frequency
    # It's fancy counting: common words get lower scores, rare words get higher scores
    vectorizer = TfidfVectorizer(
        stop_words='english',  # Ignore common words like 'the', 'a', 'is'
        max_features=100  # Keep at most the 100 most frequent terms across the corpus
    )
    
    # Fit and transform converts text to numbers
    tfidf_matrix = vectorizer.fit_transform(document_texts)
    
    print(f"✅ Created TF-IDF matrix with shape: {tfidf_matrix.shape}")
    print(f"   - {tfidf_matrix.shape[0]} documents")
    print(f"   - {tfidf_matrix.shape[1]} unique important words")
    
print("\n✨ Knowledge base is ready for searching!")

## Step 4: Building the Retrieval Function

This is the heart of RAG! We need a function that:
1. Takes a user's question
2. Finds the most relevant documents
3. Returns them with **provenance** (source information)

### What is Provenance?
**Provenance** means tracking where information came from. This is crucial because:
- Users can verify the information
- You can trace incorrect answers back to their source
- It builds trust in your AI system

### How Similarity Works:
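We'll use the **dot product** to compare vectors. Because the vectors are normalized to length 1, the dot product equals the cosine of the angle between them. Think of it like measuring the angle between two arrows: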
- Similar meaning = small angle = high score
- Different meaning = large angle = low score
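
The link between score and angle can be sketched with plain Python. The three-dimensional "embeddings" below are invented for illustration; real embedding models produce hundreds of dimensions, and the helper names are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot_product = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (length_a * length_b)

def angle_degrees(a, b):
    """Angle between two vectors, in degrees."""
    # Clamp to [-1, 1] to guard against floating-point drift before acos
    cosine = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(cosine))

# Made-up 3-dimensional vectors for illustration only
dog = [0.9, 0.8, 0.1]
puppy = [0.8, 0.9, 0.2]
computer = [0.1, 0.2, 0.9]

print(round(cosine_similarity(dog, puppy), 3), round(angle_degrees(dog, puppy), 1))
print(round(cosine_similarity(dog, computer), 3), round(angle_degrees(dog, computer), 1))
```

The pair with the smaller angle gets the higher similarity score, which is exactly the ranking signal the search function relies on.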

In [None]:
def search_knowledge_base(user_question, top_k=3):
    """
    Search the knowledge base for documents relevant to the user's question.
    
    Parameters:
    - user_question (str): The question to search for
    - top_k (int): How many top results to return (default: 3)
    
    Returns:
    - list: Top matching documents with scores and metadata
    """
    
    if use_advanced_embeddings:
        # === ADVANCED METHOD ===
        
        # Step 1: Convert the question to an embedding (same as we did for documents)
        question_vector = embedding_model.encode([user_question], convert_to_numpy=True)[0]
        
        # Step 2: Normalize the question vector
        question_norm = np.linalg.norm(question_vector)
        question_vector = question_vector / (question_norm + 1e-9)
        
        # Step 3: Calculate similarity scores
        # The @ operator does matrix multiplication (dot product)
        # Higher score = more similar = more relevant
        similarity_scores = embedding_vectors @ question_vector
        
    else:
        # === SIMPLE METHOD ===
        
        # Step 1: Convert question using same vectorizer as documents
        question_vector = vectorizer.transform([user_question])
        
        # Step 2: Calculate similarity (shared words, weighted by importance)
        # TfidfVectorizer L2-normalizes each row by default, so this dot product is a cosine similarity
        similarity_scores = (tfidf_matrix @ question_vector.T).toarray().ravel()
    
    # Step 4: Find top K most similar documents
    # enumerate() gives us (index, score) pairs
    # sorted() orders by score (highest first because of the minus sign)
    # [:top_k] takes only the first K results
    ranked_results = sorted(
        enumerate(similarity_scores), 
        key=lambda x: -x[1]  # Sort by score, descending
    )[:top_k]
    
    # Step 5: Format results with provenance
    results = []
    for doc_index, similarity_score in ranked_results:
        # Get the original document
        original_doc = knowledge_base[doc_index]
        
        # Create a result with all info
        result = {
            'id': original_doc['id'],
            'text': original_doc['text'],
            'source': original_doc['source'],
            'date': original_doc['date'],
            'relevance_score': float(similarity_score)  # Convert to regular Python float
        }
        results.append(result)
    
    return results

print("✅ Search function ready!")

## Step 5: Testing Our Retrieval System

Let's try searching! We'll ask a question and see which documents are retrieved.

### What to Look For:
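1. **Relevance Score**: Higher = more relevant. TF-IDF scores fall between 0.0 and 1.0; embedding-based cosine scores can range from -1.0 to 1.0, though they are usually positive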
2. **Source**: Which document did this come from?
3. **Date**: How recent is this information?

This is **RETRIEVAL** - the "R" in RAG. We're finding relevant documents, not generating answers yet.

In [None]:
# Let's test with a few different questions
test_questions = [
    "How does RAG work?",
    "What are tools in AI agents?",
    "Tell me about embeddings"
]

for question in test_questions:
    print("\n" + "="*80)
    print(f"🔍 Question: {question}")
    print("="*80)
    
    # Search for relevant documents
    results = search_knowledge_base(question, top_k=2)  # Get top 2 results
    
    # Display results
    for i, result in enumerate(results, 1):
        print(f"\n📄 Result #{i}")
        print(f"   Score: {result['relevance_score']:.4f}")
        print(f"   Source: {result['source']} ({result['date']})")
        print(f"   Content: {result['text']}")
        print()

print("\n✨ Notice how different questions retrieve different relevant documents!")

## Step 6: Understanding RAG Architecture

Now let's understand the complete RAG pipeline. So far we've only done **Retrieval**.

### The Complete RAG Process:

```
User Question
     ↓
1. RETRIEVAL: Search knowledge base for relevant docs
     ↓
2. AUGMENTATION: Combine question + retrieved docs into a prompt
     ↓
3. GENERATION: Send to LLM to generate an answer
     ↓
Final Answer (with sources!)
```

### Why This is Powerful:
- **Without RAG**: LLM only knows what it was trained on (can be outdated or incomplete)
- **With RAG**: LLM has access to your specific, up-to-date information
- **Bonus**: You can cite sources, making answers verifiable!

### Next Steps:
To complete RAG, we would:
1. Take retrieved documents
2. Format them into a prompt like: "Based on these documents: [docs], answer: [question]"
3. Send to an LLM (like GPT, Claude, or Llama)
4. Get back a grounded, cited answer

For now, you've learned the hardest part - the retrieval engine!
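
As a rough sketch of the augmentation step, the function below turns retrieved documents into a prompt string. The name `build_prompt`, the prompt wording, and the hand-written example result (including its placeholder score) are all assumptions. No LLM is called here, so you would still need to pass the string to the model of your choice.

```python
def build_prompt(question, retrieved_docs):
    """Combine the question and retrieved documents into one prompt string."""
    context_lines = []
    for number, doc in enumerate(retrieved_docs, 1):
        # Tag each document with a number and its source so the answer can cite it
        context_lines.append(f"[{number}] ({doc['source']}, {doc['date']}) {doc['text']}")
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the documents below. "
        "Cite sources by their bracketed number.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hand-written stand-in for the output of a retrieval step (score is a placeholder)
example_results = [
    {
        'id': 'doc2',
        'text': 'Retrieval-Augmented Generation (RAG) combines a retriever and a generator to ground output in real documents.',
        'source': 'rag.md',
        'date': '2024-11-02',
        'relevance_score': 0.71,
    },
]

prompt = build_prompt("How does RAG work?", example_results)
print(prompt)
```

Numbering the documents inside the prompt is one simple way to let the model cite its sources, which keeps the provenance information flowing through to the final answer.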

## 🎓 What You've Learned

Congratulations! You've built a working retrieval system from scratch. Here's what you now understand:

### Programming Concepts:
- ✅ **Lists and Dictionaries**: How to structure data
- ✅ **Functions**: How to write reusable code with parameters
- ✅ **List Comprehensions**: Clean way to transform data `[x for x in items]`
- ✅ **Enumerate and Sorting**: How to rank and order results

### AI/ML Concepts:
- ✅ **Embeddings**: Converting text to meaningful numbers
- ✅ **Vector Similarity**: Comparing documents using math (dot product)
- ✅ **Normalization**: Making comparisons fair
- ✅ **Retrieval**: Finding relevant information efficiently

### RAG Pipeline:
- ✅ **Knowledge Base**: How to structure document collections
- ✅ **Provenance**: Tracking where information comes from
- ✅ **Retrieval**: The "R" in RAG - finding relevant documents
- ✅ **Comparison**: TF-IDF (simple) vs Transformers (advanced)

### Next Learning Steps:
1. **Add real LLM**: Connect to OpenAI/Anthropic to complete the "G" (generation)
2. **Scale up**: Use Chroma or Pinecone for larger document collections
3. **Add chunking**: Break large documents into smaller pieces
4. **Implement reranking**: Use a second model to improve top results
5. **Build an agent**: Let the LLM decide when to search vs when it knows the answer
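
To give a feel for the chunking step, here is a minimal word-based sketch with overlapping windows. The function name `chunk_words` and the default sizes are assumptions; real systems often chunk by tokens, sentences, or document structure instead.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

# Twelve placeholder words: w0 w1 ... w11
sample = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(sample, chunk_size=5, overlap=2):
    print(chunk)
```

Each chunk would then become its own entry in the knowledge base, carrying the same `source` and `date` as the document it came from.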

## 💡 Exercises to Try:

1. **Add more documents** to the knowledge base
2. **Try your own questions** and see what gets retrieved
3. **Experiment with `top_k`** - try returning 1, 3, or 5 results
4. **Add a confidence threshold** - only return results above a certain score
5. **Track which documents are retrieved most often** (analytics!)
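
If you get stuck on exercises 4 and 5, one possible starting point is sketched below. The helper names, the `0.3` threshold, and the hand-written results with placeholder scores are all assumptions; a sensible threshold depends on which embedding method you are using.

```python
from collections import Counter

def filter_by_score(results, min_score=0.3):
    """Keep only results whose relevance_score is at least min_score."""
    return [r for r in results if r['relevance_score'] >= min_score]

# Running tally of how often each document id has been returned
retrieval_counts = Counter()

def record_retrievals(results):
    """Add the ids of the given results to the running tally."""
    retrieval_counts.update(r['id'] for r in results)

# Hand-written results with placeholder scores
fake_results = [
    {'id': 'doc2', 'relevance_score': 0.62},
    {'id': 'doc5', 'relevance_score': 0.35},
    {'id': 'doc1', 'relevance_score': 0.08},
]

kept = filter_by_score(fake_results, min_score=0.3)
record_retrievals(kept)
record_retrievals(kept[:1])  # pretend a second search returned only the top hit

print([r['id'] for r in kept])
print(retrieval_counts.most_common())
```

In the notebook you would pass the output of `search_knowledge_base` through these helpers instead of the fake results.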

In [None]:
# BONUS: Interactive Search Function
# Try running this cell and asking your own questions!

def interactive_search():
    """
    Let you ask questions interactively and see results.
    Type 'quit' to exit.
    """
    print("\n" + "="*80)
    print("🤖 Interactive Knowledge Base Search")
    print("="*80)
    print("Ask me anything about the documents in the knowledge base!")
    print("Type 'quit' to exit\n")
    
    while True:
        # Get user input
        question = input("Your question: ").strip()
        
        # Check if user wants to quit
        if question.lower() in ['quit', 'exit', 'q']:
            print("👋 Thanks for exploring RAG! Happy learning!")
            break
            
        # Skip empty questions
        if not question:
            continue
            
        # Search the knowledge base
        print("\n🔍 Searching...")
        results = search_knowledge_base(question, top_k=3)
        
        # Display results nicely
        print(f"\n📚 Found {len(results)} relevant documents:\n")
        for i, result in enumerate(results, 1):
            print(f"━━━ Result #{i} (Score: {result['relevance_score']:.3f}) ━━━")
            print(f"📄 {result['source']} ({result['date']})")
            print(f"💬 {result['text']}")
            print()

# Uncomment the line below to start interactive mode!
# interactive_search()