# 🤖 RAG: Retrieval-Augmented Generation


**RAG** is a technique that combines the power of Large Language Models (LLMs) with the ability to retrieve relevant information from external knowledge sources.

---

## 💼 Real-World Applications

- **Customer Support**: Answer questions using product documentation
- **Internal Knowledge Bases**: Help employees find company information
- **Document Q&A**: Extract insights from reports, contracts, or research papers
- **Code Documentation**: Search through codebases and generate explanations

---


By the end of this notebook, you will be able to:

✅ **Understand the three components of RAG** (Retrieval → Augmentation → Generation)

✅ **Learn about embeddings** and how they represent meaning numerically

✅ **Implement semantic search** to find relevant documents

✅ **Build a complete RAG pipeline** from scratch

✅ **Understand production considerations** (vector databases, chunking, etc.)

---

Let's get started! 🚀

---

# 1. Theory: Why RAG Exists

## 🚫 The Problem: LLM Limitations

Before we dive into RAG, let's understand WHY we need it.

### Knowledge Cutoff Dates
LLMs are trained on data up to a specific date. They don't know about:
- Recent events or news
- New products or companies launched after training
- Updated policies or procedures

### No Access to Private Data
LLMs can't access:
- Your company's internal documents
- Proprietary customer information
- Personal or confidential data
- Real-time database contents

### Hallucination Risks
When uncertain, LLMs may:
- Generate plausible-sounding but incorrect information
- Mix facts from different sources incorrectly
- Fill gaps with "reasonable" guesses



---

## ✅ The Solution: RAG

RAG gives LLMs the ability to "look things up" before answering. Here's how:

### The Three Steps of RAG:

1. **🔍 Retrieval**
   - Search your knowledge base for relevant documents
   - Find the information that best matches the user's question
   - Like finding the right page in a textbook

2. **📝 Augmentation**
   - Add the retrieved information to the prompt as "context"
   - Tell the LLM: "Here's the relevant information, use it to answer"
   - Like giving someone notes before asking them a question

3. **💬 Generation**
   - LLM generates an answer based on the provided context
   - Answer is grounded in real information, not guesses
   - Like a student answering from their notes instead of memory
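
The three steps can be sketched end to end in a few lines. This is a toy illustration, not the pipeline built later in this notebook: retrieval here uses plain word overlap instead of embeddings, and `fake_llm` is a hypothetical stand-in for a real model call.

```python
import re

def words(text: str) -> set[str]:
    """Lowercased words of a text, with punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question: str, docs: list[str]) -> str:
    """Retrieval: return the document sharing the most words with the question."""
    return max(docs, key=lambda d: len(words(question) & words(d)))

def augment(question: str, context: str) -> str:
    """Augmentation: put the retrieved text into the prompt as context."""
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def fake_llm(prompt: str) -> str:
    """Generation: hypothetical stand-in for a model call (reports prompt size only)."""
    return f"(a model would answer from a {len(prompt)}-character prompt)"

docs = [
    "Ramp is a corporate card and spend management platform.",
    "Figma is a collaborative interface design tool.",
]
question = "What is Figma?"
context = retrieve(question, docs)
prompt = augment(question, context)
print(fake_llm(prompt))
```

Swapping word overlap for embedding similarity, and the stub for a real LLM call, gives the structure used in the rest of the notebook.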



---

# 2. Setup

Let's set up our environment and prepare to build our RAG system.

## 📦 What We'll Install

- **openai**: Official OpenAI Python SDK for API access
- **pandas**: Data manipulation (we'll store documents in a DataFrame)
- **numpy**: Numerical operations (for vector similarity calculations)
- **matplotlib**: Optional visualization tools

## 🔑 API Configuration

You'll need an OpenAI API key. You have two options:

**Method 1 (Recommended)**: Use Colab Secrets
1. Click the 🔑 icon in the left sidebar
2. Click "Add new secret"
3. Name: `OPENAI_API_KEY`
4. Value: Your OpenAI API key
5. Enable notebook access

**Method 2 (Fallback)**: Manual input when prompted

In [None]:
# Install required packages
!pip install -q openai pandas numpy matplotlib

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

print("✅ All dependencies installed!")

In [None]:
import os

# Configure OpenAI API key
# Method 1: Try to get API key from Colab secrets (recommended)
try:
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    print("✅ API key loaded from Colab secrets")
except Exception:
    # Method 2: Manual input (fallback), e.g. outside Colab or when the secret is missing
    from getpass import getpass
    print("💡 To use Colab secrets: Go to 🔑 (left sidebar) → Add new secret → Name: OPENAI_API_KEY")
    OPENAI_API_KEY = getpass("Enter your OpenAI API Key: ")

# Validate the key before using it (os.environ rejects None values)
if not OPENAI_API_KEY or OPENAI_API_KEY.strip() == "":
    raise ValueError("❌ ERROR: No API key provided!")

# Set the API key as an environment variable so OpenAI() can pick it up
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

print("✅ Authentication configured!")

# Configure which OpenAI models to use
OPENAI_LLM_MODEL = "gpt-5-nano"  # For text generation
OPENAI_EMBEDDING_MODEL = "text-embedding-3-small"  # For embeddings

print(f"🤖 LLM Model: {OPENAI_LLM_MODEL}")
print(f"🔢 Embedding Model: {OPENAI_EMBEDDING_MODEL}")

In [None]:
# Import required libraries
from openai import OpenAI
import pandas as pd
import numpy as np

# Initialize OpenAI client
client = OpenAI()

print("✅ OpenAI client initialized!")
print("✅ All libraries imported successfully!")

---

# 3. Step 1: Preparing Our Documents

## 📄 What Are "Documents" in RAG?

In RAG, a "document" is any piece of text that contains information:
- Product descriptions
- Company profiles
- Support articles
- Research papers
- Code documentation
- Meeting notes

## 🏗️ Document Structure Matters

Good documents are:
- **Self-contained**: Each document has complete information about one topic
- **Well-structured**: Clear, organized, with key information highlighted
- **Right-sized**: Not too long (loses focus) or too short (lacks context)
- **Consistent**: Follow the same format for similar types of information

## 📊 Scale

- **In this tutorial**: 10 sample startup companies (learning purposes)
- **In production**: Could be thousands or millions of documents

## 🎯 Our Task

We'll create a small knowledge base of 10 startup companies, each with:
- Company name
- Industry
- Location
- Description (what they do)
- Investors
- Founded year

Then we'll convert this structured data into natural language "documents" that are easy to search.

---





Let's create our knowledge base of startup companies!

In [None]:
# Create mock data: 10 startup companies
# This simulates a knowledge base you might have in a real application

companies_data = [
    {
        "name": "Pentera",
        "industry": "Cybersecurity",
        "location": "Tel Aviv, Israel",
        "description": "Pentera provides automated security validation platforms that help organizations continuously test their cybersecurity defenses. Their platform simulates real-world attacks to identify vulnerabilities before hackers can exploit them.",
        "investors": ["K1 Investment Management", "Insight Partners", "Blackstone"],
        "founded": 2015
    },
    {
        "name": "Wiz",
        "industry": "Cloud Security",
        "location": "New York, USA",
        "description": "Wiz is a cloud security platform that helps organizations identify and remove critical risks across their cloud infrastructure. They provide comprehensive visibility and threat detection for AWS, Azure, and Google Cloud.",
        "investors": ["Sequoia Capital", "Greenoaks", "Salesforce Ventures", "Cyberstarts"],
        "founded": 2020
    },
    {
        "name": "Ramp",
        "industry": "FinTech",
        "location": "New York, USA",
        "description": "Ramp is a corporate card and spend management platform that helps companies save time and money. Their platform automates expense tracking, provides real-time insights, and identifies cost-saving opportunities.",
        "investors": ["Founders Fund", "Stripe", "Goldman Sachs", "Thrive Capital"],
        "founded": 2019
    },
    {
        "name": "Notion",
        "industry": "Productivity Software",
        "location": "San Francisco, USA",
        "description": "Notion is an all-in-one workspace that combines notes, tasks, wikis, and databases. Teams use Notion to collaborate, organize knowledge, and manage projects in one unified platform.",
        "investors": ["Coatue", "Sequoia Capital", "Index Ventures"],
        "founded": 2016
    },
    {
        "name": "Anduril Industries",
        "industry": "Defense Technology",
        "location": "Costa Mesa, USA",
        "description": "Anduril Industries builds advanced defense technology products including autonomous systems, sensors, and AI-powered solutions. Their technology is used for border security, base security, and military applications.",
        "investors": ["Andreessen Horowitz", "Founders Fund", "8VC", "Valor Equity Partners"],
        "founded": 2017
    },
    {
        "name": "Databricks",
        "industry": "Data Analytics",
        "location": "San Francisco, USA",
        "description": "Databricks provides a unified analytics platform built on Apache Spark. Their lakehouse platform combines data warehousing and data lakes, enabling companies to build data, analytics, and AI solutions at scale.",
        "investors": ["Andreessen Horowitz", "NEA", "Coatue", "Tiger Global"],
        "founded": 2013
    },
    {
        "name": "Figma",
        "industry": "Design Software",
        "location": "San Francisco, USA",
        "description": "Figma is a collaborative interface design tool that runs in the browser. Designers use Figma to create, prototype, and collaborate on user interfaces for web and mobile applications in real-time.",
        "investors": ["Sequoia Capital", "Greylock Partners", "Kleiner Perkins", "Index Ventures"],
        "founded": 2012
    },
    {
        "name": "Plaid",
        "industry": "FinTech Infrastructure",
        "location": "San Francisco, USA",
        "description": "Plaid provides financial services APIs that enable applications to connect with users' bank accounts. Their platform powers thousands of fintech apps including Venmo, Robinhood, and Chime.",
        "investors": ["Andreessen Horowitz", "NEA", "Index Ventures", "Goldman Sachs"],
        "founded": 2013
    },
    {
        "name": "UiPath",
        "industry": "Robotic Process Automation",
        "location": "New York, USA",
        "description": "UiPath is a leading RPA platform that helps organizations automate repetitive business processes. Their software robots can handle tasks like data entry, document processing, and system integration.",
        "investors": ["Accel", "CapitalG", "Sequoia Capital", "Tiger Global"],
        "founded": 2005
    },
    {
        "name": "Snyk",
        "industry": "Developer Security",
        "location": "Boston, USA",
        "description": "Snyk is a developer security platform that helps teams find and fix vulnerabilities in code, dependencies, containers, and infrastructure as code. Their tools integrate directly into developer workflows.",
        "investors": ["Accel", "Coatue", "Tiger Global", "Boldstart Ventures"],
        "founded": 2015
    }
]

print(f"✅ Created data for {len(companies_data)} startup companies")

In [None]:
# Convert to pandas DataFrame for easier handling
# DataFrames are great for structured data manipulation

df = pd.DataFrame(companies_data)

print(f"✅ Created database of {len(df)} companies")
print("\n📊 First few companies:")

# Display the first few rows
df.head()

In [None]:
def create_document_text(company: dict) -> str:
    """
    Convert a company dictionary into a readable text document.
    This is what we'll create embeddings for and search through.

    Args:
        company: Dictionary containing company information

    Returns:
        A formatted text document describing the company
    """
    # Join the list of investors into a readable string
    investors_str = ", ".join(company["investors"])

    # Create a natural language document
    # Note: This structure makes it easy for semantic search to find relevant info
    text = f"""{company['name']} is a {company['industry']} company headquartered in {company['location']}. {company['description']} The company was founded in {company['founded']}. Key investors include: {investors_str}."""

    return text.strip()

# Apply the function to all companies to create searchable documents
df['document'] = df.apply(lambda row: create_document_text(row.to_dict()), axis=1)

print("✅ Created searchable documents for all companies")
print("\n📄 Example document:")
print("="*70)
print(df['document'].iloc[0])
print("="*70)

---

# 4. Step 2: Creating Embeddings


**Embeddings** are numerical representations that capture the *meaning* of text. Think of them as coordinates in a "meaning space."
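
To make the "coordinates" idea concrete, here is a tiny sketch with hand-made 2-D points. The words, axes, and values are invented for illustration; real embeddings are learned by a model and have far more dimensions.

```python
import math

# Hand-made 2-D "meaning coordinates" (invented values, not real embeddings).
points = {
    "firewall": (0.9, 0.1),
    "malware": (0.8, 0.2),
    "pizza": (0.1, 0.9),
}

def nearest(word: str) -> str:
    """Return the other word whose coordinates are closest to `word`."""
    others = [w for w in points if w != word]
    return min(others, key=lambda w: math.dist(points[word], points[w]))

print(nearest("firewall"))  # malware
```

Related concepts sit close together, which is what lets a search over vectors behave like a search over meaning.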








Let's generate embeddings for all our company documents!

In [None]:
def get_embedding(text: str) -> list:
    """
    Generate an embedding vector for the given text using OpenAI's API.

    Args:
        text: The text to embed

    Returns:
        A list of floats representing the embedding vector (1536 dimensions)
    """
    try:
        # Call OpenAI's embeddings API
        response = client.embeddings.create(
            input=text,
            model=OPENAI_EMBEDDING_MODEL
        )

        # Extract the embedding vector from the response
        return response.data[0].embedding

    except Exception as e:
        print(f"❌ Error generating embedding: {e}")
        return None

# Test the function with a sample
sample_text = "This is a test sentence about cybersecurity"
sample_embedding = get_embedding(sample_text)

if sample_embedding:
    print(f"✅ Generated embedding with {len(sample_embedding)} dimensions")
    print(f"\n📊 First 10 values: {sample_embedding[:10]}")
    print(f"\n💡 Each document will become a vector like this!")

In [None]:
# Generate embeddings for all company documents
# This is the step that converts our text into searchable vectors

print("🔄 Generating embeddings for all documents...")
print("   (This may take a few seconds)\n")

# Apply the get_embedding function to each document
df['embedding'] = df['document'].apply(get_embedding)

# Check for any failures
failed_count = df['embedding'].isnull().sum()

if failed_count > 0:
    print(f"⚠️ Warning: {failed_count} embeddings failed to generate")
else:
    print(f"✅ Successfully generated embeddings for all {len(df)} documents!")

# Show some stats about the embeddings
print(f"\n📊 Embedding Statistics:")
print(f"   Dimensions: {len(df['embedding'].iloc[0])}")
print(f"   Total embeddings: {len(df)}")
print(f"\n📝 First embedding (first 10 values):")
print(f"   {df['embedding'].iloc[0][:10]}")
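
Calling the API once per document is fine for 10 documents, but the embeddings endpoint also accepts a list of inputs, so many texts can be embedded in one request. Below is a minimal sketch under that assumption. `embed_batch` and `StubEmbeddings` are hypothetical names introduced here; the stub only mimics the shape of the response so the block runs without an API key.

```python
from types import SimpleNamespace

def embed_batch(client, texts: list[str], model: str) -> list[list[float]]:
    """Embed many texts with a single embeddings call; output follows input order."""
    response = client.embeddings.create(input=texts, model=model)
    # Sort by index defensively so each vector lines up with its input text.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]

class StubEmbeddings:
    """Offline stand-in that mimics the response shape (illustrative only)."""
    def create(self, input, model):
        items = [SimpleNamespace(index=i, embedding=[float(len(t))]) for i, t in enumerate(input)]
        return SimpleNamespace(data=items)

stub_client = SimpleNamespace(embeddings=StubEmbeddings())
vectors = embed_batch(stub_client, ["a", "bbb"], model="stub-model")
print(vectors)

# With the real client from this notebook (requires an API key):
# vectors = embed_batch(client, df["document"].tolist(), OPENAI_EMBEDDING_MODEL)
```

Fewer requests usually means less latency and fewer chances to hit rate limits, at the cost of one failure affecting the whole batch.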

## 💡 What We Just Did

We converted all 10 company documents into numerical vectors (embeddings). Each document is now:
- A 1536-dimensional vector
- Representing the semantic meaning of the text
- Ready to be compared with query embeddings

## 🗄️ Production Note: Vector Databases

In this tutorial, we're storing embeddings in a Pandas DataFrame (in memory). This works fine for learning, but in production:

- **Don't do this**: Store millions of embeddings in memory
- **Do this instead**: Use a vector database (Pinecone, Weaviate, Chroma, Qdrant)
- **Why?**: Vector databases are optimized for fast similarity search at scale

We'll cover vector databases in later notebooks!



---

# 5. Step 3: Semantic Search

Now that we have embeddings for all documents, we need to measure how similar they are to a query.

### 🧮 Cosine Similarity

The most common way to measure vector similarity is **cosine similarity**:

- Measures the angle between two vectors
- Returns a value between -1 and 1
- 1 = identical direction (very similar)
- 0 = perpendicular (unrelated)
- -1 = opposite direction (very different)

### 💡 Simple Explanation

Imagine two arrows in space:
- If they point in the same direction → high similarity
- If they point in different directions → low similarity

**Why this works**: Documents with similar meanings have embeddings that "point" in similar directions in the high-dimensional space.
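
The three landmark values (1, 0, and -1) are easy to check by hand with 2-D vectors. This is a small pure-Python sketch for intuition; the `cosine` helper is defined only for this example.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(cosine([1, 0], [3, 0]))   # same direction      -> 1.0
print(cosine([1, 0], [0, 2]))   # perpendicular       -> 0.0
print(cosine([1, 0], [-4, 0]))  # opposite direction  -> -1.0
```

Note that vector length does not matter: `[1, 0]` and `[3, 0]` score 1.0 because only the direction is compared.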


---

## 🔍 The Retrieval Process

Here's how we find relevant documents:

### Step-by-Step:

1. **User asks a question**: "What does Pentera do?"

2. **Embed the question**: Convert it to a vector using the same embedding model

3. **Calculate similarity**: Compare question vector with all document vectors
   ```
   Question: [0.2, 0.5, 0.1, ...]
   
   Doc 1 (Pentera): [0.3, 0.4, 0.2, ...] → similarity: 0.92 ✅
   Doc 2 (Wiz):     [0.3, 0.4, 0.1, ...] → similarity: 0.87
   Doc 3 (Ramp):    [0.1, 0.2, 0.8, ...] → similarity: 0.45
   ...
   ```

4. **Rank by score**: Sort documents by similarity (highest first)

5. **Return top-k**: Get the most relevant documents (typically top 1-5)
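
Steps 3 to 5 can also be written as a few vectorized NumPy operations. This is a sketch with toy 3-D vectors (invented for illustration) rather than real embeddings, and `top_k_indices` is a name introduced here.

```python
import numpy as np

def top_k_indices(query_vec, doc_matrix, k: int = 2):
    """Return (indices, scores) of the k rows most similar to the query by cosine similarity."""
    q = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_matrix, dtype=float)
    # After normalizing to unit length, a dot product equals the cosine similarity.
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ q
    order = np.argsort(scores)[::-1][:k]  # highest score first
    return order, scores[order]

# Toy document vectors, one row per document.
doc_matrix = [
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.1],
    [0.0, 0.2, 0.9],
]
indices, scores = top_k_indices([1.0, 0.0, 0.0], doc_matrix, k=2)
print(indices.tolist(), scores.round(3).tolist())
```

Scoring all documents in one matrix product is typically much faster than a Python loop, though it is still an exhaustive comparison against every document.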

### 💡 Key Point: This Is "Retrieval" in RAG

This semantic search process is the **"Retrieval"** component of RAG. We're retrieving the most relevant documents to provide as context to the LLM.

---





Let's implement semantic search to find relevant documents!

In [None]:
def cosine_similarity(vec1: list, vec2: list) -> float:
    """
    Calculate cosine similarity between two vectors.
    Returns a value between -1 and 1, where 1 means identical direction.

    Args:
        vec1: First embedding vector
        vec2: Second embedding vector

    Returns:
        Similarity score (higher = more similar)
    """
    # Convert to numpy arrays for mathematical operations
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)

    # Calculate dot product (how much vectors point in same direction)
    dot_product = np.dot(vec1, vec2)

    # Calculate magnitudes (length of each vector)
    norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)

    # Cosine similarity = dot product / product of magnitudes
    return dot_product / norm_product

# Test with two sample embeddings
test_vec1 = get_embedding("cybersecurity startup")
test_vec2 = get_embedding("security company")
test_vec3 = get_embedding("restaurant food delivery")

sim_similar = cosine_similarity(test_vec1, test_vec2)
sim_different = cosine_similarity(test_vec1, test_vec3)

print("🧪 Testing Cosine Similarity:\n")
print(f"   'cybersecurity startup' vs 'security company': {sim_similar:.4f} ✅")
print(f"   'cybersecurity startup' vs 'restaurant food': {sim_different:.4f}")
print(f"\n💡 Notice: Similar concepts have higher scores!")

In [None]:
def find_most_relevant_documents(query: str, documents_df: pd.DataFrame, top_k: int = 1) -> pd.DataFrame:
    """
    Find the top_k most relevant documents for a given query.
    This is the core semantic search function.

    Args:
        query: The search query (user's question)
        documents_df: DataFrame with documents and their embeddings
        top_k: Number of top documents to return

    Returns:
        DataFrame with top_k most relevant documents, sorted by similarity
    """
    print(f"🔍 Searching for: '{query}'")

    # Step 1: Generate embedding for the query
    query_embedding = get_embedding(query)

    if query_embedding is None:
        print("❌ Failed to generate query embedding")
        return None

    # Step 2: Calculate similarity with all documents
    # For each document embedding, calculate cosine similarity with query
    documents_df['similarity'] = documents_df['embedding'].apply(
        lambda doc_embedding: cosine_similarity(doc_embedding, query_embedding)
    )

    # Step 3: Get top_k results (highest similarity scores)
    results = documents_df.nlargest(top_k, 'similarity')

    print(f"✅ Found {len(results)} relevant document(s)\n")

    return results

print("✅ Semantic search function created!")

In [None]:
# Test semantic search with our original question
question = "What does the startup company Pentera do and who invested in it?"

print("🚀 Testing Semantic Search")
print("="*70)

# Find most relevant document
relevant_docs = find_most_relevant_documents(question, df, top_k=1)

# Display results
if relevant_docs is not None and len(relevant_docs) > 0:
    print("📄 MOST RELEVANT DOCUMENT:")
    print("="*70)
    print(f"Company: {relevant_docs.iloc[0]['name']}")
    print(f"Industry: {relevant_docs.iloc[0]['industry']}")
    print(f"Similarity Score: {relevant_docs.iloc[0]['similarity']:.4f}")
    print(f"\nDocument Text:")
    print("-"*70)
    print(relevant_docs.iloc[0]['document'])
    print("="*70)
    print("\n💡 SUCCESS: We found the right document!")
    print("   This is the 'Retrieval' part of RAG working!")

## 🎉 Semantic Search Is Working!

### What Just Happened?

1. We asked: "What does the startup company Pentera do and who invested in it?"
2. The system converted our question to an embedding
3. It compared our question with all 10 company documents
4. It found that the Pentera document was most similar
5. It returned that document with a high similarity score

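### 📊 Understanding Similarity Scores

Treat the ranges below as a rough guide only. Typical score ranges differ between embedding models, so calibrate any cutoff on your own queries and documents.

- **0.9 - 1.0**: Extremely relevant (nearly identical meaning)
- **0.7 - 0.9**: Very relevant (strong semantic match)
- **0.5 - 0.7**: Somewhat relevant (partial match)
- **< 0.5**: Not very relevant (weak or no match)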
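
One practical use of these scores is to drop weak matches before they reach the LLM. The sketch below assumes results arrive as `(name, similarity)` pairs; the scores and the 0.40 cutoff are hypothetical values chosen for illustration, not output from this notebook.

```python
def filter_by_score(results: list[tuple[str, float]], min_score: float) -> list[tuple[str, float]]:
    """Keep only (name, similarity) pairs at or above min_score."""
    return [(name, score) for name, score in results if score >= min_score]

# Hypothetical ranked results for one query.
ranked = [("Pentera", 0.62), ("Wiz", 0.41), ("Ramp", 0.18)]
kept = filter_by_score(ranked, min_score=0.40)
print(kept)  # [('Pentera', 0.62), ('Wiz', 0.41)]
```

If nothing survives the cutoff, the application can answer "I don't have enough information" without calling the LLM at all.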



---

# 6. Step 4: Augmented Generation



Now we have the relevant document. How do we use it? We need to carefully structure our prompt to:

### 1. Provide Context Explicitly
```
Context:
Pentera is a cybersecurity company...
```

### 2. Give Clear Instructions
```
Answer the question using ONLY the context provided.
```

### 3. Tell Model to Stay Grounded
```
If you cannot answer from the context, say so.
```

### 4. Ask the Question
```
Question: What does Pentera do?
```

## 🎯 The RAG Prompt Structure

A typical RAG prompt looks like:

```
Answer the question below using ONLY the context provided.
If you cannot answer from the context, say "I don't have enough information."

Context:
[Retrieved document(s) here]

Question: [User's question here]

Answer:
```



This is the **"Augmentation"** in RAG. We're augmenting (enriching) the prompt with retrieved information. The LLM now has:
- The user's question
- Relevant context to answer it
- Clear instructions on how to use the context
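
The prompt template can be assembled by a small helper. In this sketch the passages are numbered, which is an addition of this example (not part of the template above) that makes it easier to ask the model which passage supports its answer. `build_rag_prompt` is a hypothetical name.

```python
def build_rag_prompt(question: str, documents: list[str]) -> str:
    """Assemble instructions, numbered context passages, and the question."""
    context = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Answer the question below using ONLY the context provided.\n"
        'If you cannot answer from the context, say "I don\'t have enough information."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )

prompt = build_rag_prompt(
    "What does Pentera do?",
    ["Pentera is a Cybersecurity company.", "Wiz is a Cloud Security company."],
)
print(prompt)
```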

---

### Benefits:

1. **Reduces Hallucination**: Model has facts to work with
2. **Provides Source Attribution**: We know where the answer came from
3. **Up-to-date Information**: Documents can be updated without retraining
4. **Domain-Specific**: Works with private/proprietary information

---





Let's implement the generation step with retrieved context!


The prompt we'll construct has three key parts:

1. **Instructions**: "Answer using ONLY the context provided"
2. **Context**: The retrieved document(s) with relevant information
3. **Question**: The user's original query

This structure encourages the LLM to stay grounded in the retrieved facts and reduces the risk of hallucinated information, although it cannot rule it out entirely.

In [None]:
def generate_answer_with_rag(query: str, context: str) -> str:
    """
    Generate an answer using the retrieved context.
    This is the 'Generation' step in RAG.

    Args:
        query: The user's question
        context: The retrieved document(s) to use as context

    Returns:
        The generated answer based on the context
    """
    # Construct the prompt with context and instructions
    # This is the "Augmentation" - we're adding retrieved context to guide the LLM
    prompt = f"""Answer the question below using ONLY the context provided.
If you cannot answer the question from the context, say "I don't have enough information in the provided context to answer that question."

Context:
{context}

Question: {query}

Answer:"""

    try:
        # Call the LLM with the augmented prompt using the Responses API
        response = client.responses.create(
            model=OPENAI_LLM_MODEL,
            input=prompt
        )

        return response.output_text

    except Exception as e:
        return f"❌ Error generating answer: {e}"

print("✅ RAG answer generation function created!")

In [None]:
def rag_pipeline(query: str, documents_df: pd.DataFrame, top_k: int = 1) -> dict:
    """
    Complete RAG pipeline: Retrieval + Augmented Generation.

    This function combines all steps:
    1. Retrieve relevant documents (semantic search)
    2. Extract context from retrieved documents
    3. Generate answer using context

    Args:
        query: User's question
        documents_df: DataFrame with documents and embeddings
        top_k: Number of documents to retrieve

    Returns:
        Dictionary with query, retrieved docs, and answer
    """
    # Step 1: Retrieve relevant documents
    relevant_docs = find_most_relevant_documents(query, documents_df, top_k)

    if relevant_docs is None or len(relevant_docs) == 0:
        return {
            "query": query,
            "retrieved_docs": None,
            "answer": "Failed to retrieve documents"
        }

    # Step 2: Extract context from retrieved documents
    # Join multiple documents with double newlines for clarity
    context = "\n\n".join(relevant_docs['document'].tolist())

    # Step 3: Generate answer with context
    answer = generate_answer_with_rag(query, context)

    return {
        "query": query,
        "retrieved_docs": relevant_docs,
        "answer": answer
    }

print("✅ Complete RAG pipeline function created!")

In [None]:
# Run the complete RAG pipeline
question = "What does the startup company Pentera do and who invested in it?"

print("🚀 Running Complete RAG Pipeline...")
print("="*70)

result = rag_pipeline(question, df, top_k=1)

print(f"\n❓ QUESTION:")
print(f"   {result['query']}\n")

print(f"📄 RETRIEVED DOCUMENT:")
print(f"   Company: {result['retrieved_docs'].iloc[0]['name']}")
print(f"   Similarity: {result['retrieved_docs'].iloc[0]['similarity']:.4f}\n")

print(f"✅ RAG ANSWER:")
print("-"*70)
print(result['answer'])
print("-"*70)

print("\n🎉 SUCCESS! RAG pipeline is working!")

## 🎉 Complete RAG Pipeline Is Working!



We successfully implemented the complete RAG pipeline:

1. **🔍 Retrieval**: Found the most relevant document (Pentera profile)
2. **📝 Augmentation**: Added that document as context to our prompt
3. **💬 Generation**: LLM generated an accurate answer using the context




---

# 7. Optional Experiment: Top-K Retrieval


So far, we've been retrieving just the top 1 most relevant document. But what if we retrieve multiple documents?

### Trade-offs:

**More Documents (higher top_k):**
- ✅ More complete information
- ✅ Better coverage if information is split across documents
- ❌ More tokens = higher cost
- ❌ Irrelevant context can confuse the model
- ❌ Longer processing time

**Fewer Documents (lower top_k):**
- ✅ Lower cost
- ✅ Faster
- ✅ More focused context
- ❌ Might miss relevant information
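
The cost side of this trade-off can be estimated before calling the model. The sketch below assumes roughly 4 characters per token, a common rule of thumb for English text rather than an exact tokenizer count. `estimate_context_tokens` and the placeholder documents are hypothetical.

```python
import math

def estimate_context_tokens(documents: list[str], top_k: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for the context built from the first top_k documents."""
    context = "\n\n".join(documents[:top_k])
    return math.ceil(len(context) / chars_per_token)

# Three placeholder documents of 400 characters each.
docs = ["x" * 400, "y" * 400, "z" * 400]
for k in (1, 2, 3):
    print(f"top_k={k}: ~{estimate_context_tokens(docs, k)} context tokens")
```

Context size grows roughly linearly with `top_k`, so tripling the number of retrieved documents roughly triples the context portion of the prompt cost.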

Let's experiment!

In [None]:
# Try a question that might benefit from multiple documents
experiment_question = "Which companies are in the cybersecurity industry?"

print("🧪 EXPERIMENT: Comparing top-1 vs top-3 retrieval")
print("="*70)
print(f"\n❓ Question: {experiment_question}\n")

# Test with top-1
print("📊 Test 1: Retrieving top-1 document")
print("-"*70)
result_top1 = rag_pipeline(experiment_question, df, top_k=1)

print(f"Retrieved: {result_top1['retrieved_docs'].iloc[0]['name']}")
print(f"\nAnswer: {result_top1['answer']}")
print("-"*70)

# Test with top-3
print("\n📊 Test 2: Retrieving top-3 documents")
print("-"*70)
result_top3 = rag_pipeline(experiment_question, df, top_k=3)

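print("Retrieved:")
# enumerate gives the rank; the DataFrame index is the row's original position, not its rank
for rank, (_, row) in enumerate(result_top3['retrieved_docs'].iterrows(), start=1):
    print(f"  {rank}. {row['name']} (similarity: {row['similarity']:.4f})")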

print(f"\nAnswer: {result_top3['answer']}")
print("-"*70)

print("\n💡 OBSERVATION:")
print("   - Top-1: May only mention one company")
print("   - Top-3: Can mention multiple companies if they're all in the context")
print("   - Trade-off: More complete vs more costly")
print("="*70)



### When to Use top_k = 1:
- Questions about a specific entity ("Tell me about Pentera")
- When you need focused, specific information
- Cost/speed is a priority
- Your documents are comprehensive (all info in one place)

### When to Use top_k > 1:
- Comparative questions ("Which companies do X?")
- Information might be split across documents
- Need comprehensive coverage
- Accuracy is more important than cost



**top_k is a hyperparameter you tune based on your use case!**

In production, you might:
- Start with top_k = 3-5
- Test with your specific questions
- Measure quality vs cost
- Adjust based on results

---

# 8. Production Considerations

## 📦 Vector Databases in Production

In this notebook, we stored embeddings in a Pandas DataFrame. This works great for learning with 10 documents, but NOT for production:

### ❌ Problems with Our Approach:

1. **Slow**: Must compare query to ALL documents every time
   - 10 documents: Fast
   - 1 million documents: Extremely slow

2. **Limited Scale**: Can't handle millions of documents
   - Everything stored in memory
   - No optimization for large-scale search

3. **No Persistence**: Data lost when notebook closes
   - Must regenerate embeddings every time
   - No way to update documents incrementally

4. **No Advanced Features**:
   - Can't filter by metadata ("cybersecurity companies only")
   - No hybrid search (semantic + keyword)
   - No approximate nearest neighbor (ANN) algorithms
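
The persistence problem (item 3) can be softened even before adopting a vector database by caching embeddings on disk. This is a minimal sketch, assuming a JSON file keyed by a hash of the text. `cached_embedding` and `fake_embed` are hypothetical names, and `fake_embed` stands in for this notebook's `get_embedding` so the block runs offline.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cached_embedding(text: str, embed_fn, cache_path: Path) -> list[float]:
    """Return the embedding for text, calling embed_fn only if it is not cached on disk."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = embed_fn(text)
        cache_path.write_text(json.dumps(cache))
    return cache[key]

calls = []

def fake_embed(text: str) -> list[float]:
    """Stand-in for a real embedding call; records every invocation."""
    calls.append(text)
    return [float(len(text))]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.json"
    first = cached_embedding("Pentera profile", fake_embed, path)
    second = cached_embedding("Pentera profile", fake_embed, path)

print(first == second, len(calls))  # True 1
```

Rewriting the whole file on every miss is fine for a handful of documents but would not scale, which is one of the gaps a vector database fills.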

---

## ✅ Production Solution: Vector Databases

Vector databases are specialized systems optimized for similarity search:

### Popular Vector Databases:

- **Pinecone**: Fully managed, cloud-based, easy to use
- **Weaviate**: Open-source, supports hybrid search
- **Chroma**: Simple, lightweight, great for prototyping
- **Qdrant**: Fast, Rust-based, good for production
- **Milvus**: Open-source, scalable, enterprise-ready

### 🚀 What Vector Databases Provide:

1. **Fast Search**:
   - Approximate Nearest Neighbor (ANN) algorithms
   - HNSW, IVF, etc.
   - Search millions of vectors in milliseconds

2. **Scalability**:
   - Handle billions of vectors
   - Distributed storage
   - Horizontal scaling

3. **Persistence**:
   - Store embeddings permanently
   - Add/update/delete documents
   - No need to regenerate

4. **Advanced Features**:
   - Metadata filtering
   - Hybrid search (semantic + keyword)
   - Multi-vector search
   - Analytics and monitoring



---

## 🏗️ Other Production Considerations

### 1. Document Chunking

**Problem**: Long documents don't fit in context windows

**Solution**: Break documents into smaller chunks
- Chunk size: 200-500 words typical
- Overlap: 10-20% to maintain context
- Methods: Sentence-based, semantic chunking, fixed-size
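
A fixed-size chunker with overlap takes only a few lines. This sketch splits on whitespace and counts words. `chunk_words` is a hypothetical helper, and the small sizes in the demo are chosen only to keep the output readable.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of chunk_size words; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each chunk would then be embedded and stored as its own "document", usually with a reference back to the source it came from.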

### 2. Metadata Filtering

**Example**: "Find cybersecurity companies in the USA"
- Pre-filter by country = USA
- Then semantic search within filtered set
- Faster and more accurate
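
Pre-filtering followed by similarity ranking might look like the sketch below. The records, their 2-D vectors, and the separate `country` field are invented for illustration; the notebook's own data stores location as a single string.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def filtered_search(query_vec, records, country, top_k=1):
    """Keep records from `country`, then rank the survivors by cosine similarity."""
    candidates = [r for r in records if r["country"] == country]
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return ranked[:top_k]

# Toy records with invented vectors.
records = [
    {"name": "Pentera", "country": "Israel", "vector": [0.9, 0.1]},
    {"name": "Snyk", "country": "USA", "vector": [0.8, 0.3]},
    {"name": "Ramp", "country": "USA", "vector": [0.1, 0.9]},
]
hits = filtered_search([1.0, 0.0], records, country="USA")
print([r["name"] for r in hits])  # ['Snyk']
```

Pentera is the closest vector overall, but the filter removes it before ranking, so the result respects the metadata constraint.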

### 3. Hybrid Search

Combine semantic + keyword search:
- Semantic: Understands meaning
- Keyword: Exact matches (names, IDs, codes)
- Best of both worlds!
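
One simple way to combine the two signals is a weighted sum. This is a deliberately crude sketch: the keyword score is just word overlap, the semantic score is a made-up number, and `alpha` is a hypothetical tuning weight. Production systems often use a ranking function such as BM25 for the keyword side and may merge ranked lists instead of raw scores.

```python
import re

def keyword_score(query: str, text: str) -> float:
    """Fraction of query words that appear in the text (a crude keyword signal)."""
    q_words = set(re.findall(r"\w+", query.lower()))
    t_words = set(re.findall(r"\w+", text.lower()))
    return len(q_words & t_words) / len(q_words) if q_words else 0.0

def hybrid_score(semantic: float, keyword: float, alpha: float = 0.5) -> float:
    """Weighted blend: alpha weights the semantic score, (1 - alpha) the keyword score."""
    return alpha * semantic + (1 - alpha) * keyword

kw = keyword_score("Snyk pricing", "Snyk is a developer security platform.")
print(kw, hybrid_score(semantic=0.4, keyword=kw))
```

The exact-name match on "Snyk" lifts the combined score even when the semantic score alone is modest.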

### 4. Re-ranking

Two-stage retrieval:
1. Fast retrieval: Get top 50 documents (fast, approximate)
2. Re-ranking: Use more sophisticated model on top 50
3. Return top 5 after re-ranking
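
The two-stage shape can be sketched with any pair of scoring functions. Both scorers below are simple stand-ins invented for this example: a real system might use vector similarity for stage 1 and a heavier re-ranking model for stage 2.

```python
def two_stage_search(query, docs, cheap_score, precise_score, first_k=50, final_k=5):
    """Shortlist with cheap_score, then reorder the shortlist with precise_score."""
    shortlist = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:first_k]
    return sorted(shortlist, key=lambda d: precise_score(query, d), reverse=True)[:final_k]

def cheap(query: str, doc: str) -> int:
    """Stage 1 stand-in: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def precise(query: str, doc: str) -> int:
    """Stage 2 stand-in: number of exact phrase matches."""
    return doc.lower().count(query.lower())

docs = [
    "security in the cloud",
    "cloud security for teams",
    "design tools",
]
print(two_stage_search("cloud security", docs, cheap, precise, first_k=2, final_k=1))
# ['cloud security for teams']
```

The expensive scorer only ever sees `first_k` candidates, which is what keeps the second stage affordable.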

### 5. Evaluation

How do you know if your RAG system is working well?

**Retrieval Metrics**:
- Precision@k: How many retrieved docs are relevant?
- Recall@k: Did we retrieve all relevant docs?
- MRR (Mean Reciprocal Rank): Where do relevant docs appear?

**Generation Metrics**:
- Faithfulness: Is answer grounded in context?
- Relevance: Does answer address the question?
- Human evaluation: Still the gold standard!
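
The retrieval metrics are straightforward to compute once you have relevance labels. The retrieved list and the labels below are hypothetical values for a single query; MRR is the mean of `reciprocal_rank` over many queries.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k retrieved items that are relevant."""
    return sum(item in relevant for item in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant items that appear in the top k."""
    return sum(item in relevant for item in retrieved[:k]) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical result and labels for "Which companies are in the cybersecurity industry?"
retrieved = ["Ramp", "Pentera", "Wiz"]
relevant = {"Pentera", "Wiz", "Snyk"}
print(precision_at_k(retrieved, relevant, k=3))  # 2 of 3 retrieved are relevant
print(recall_at_k(retrieved, relevant, k=3))     # 2 of 3 relevant were retrieved
print(reciprocal_rank(retrieved, relevant))      # first relevant hit is at rank 2 -> 0.5
```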





---

# 9. Try It Yourself!

Now it's your turn to experiment! Use the playground below to:
- Try different questions
- Experiment with top_k values
- Test questions that shouldn't be answerable from the data
- See how the system handles edge cases

## 💡 Suggested Experiments:

1. **Specific company questions**:
   - "What does Wiz do?"
   - "Who invested in Notion?"
   - "When was Databricks founded?"

2. **Comparative questions**:
   - "Which companies are in the cybersecurity industry?"
   - "What fintech companies are in the database?"
   - "Compare Pentera and Wiz"

3. **Questions that SHOULD fail** (not in our data):
   - "What does Apple do?"
   - "Tell me about restaurants in New York"
   - "What's the weather today?"

4. **Different top_k values**:
   - Try top_k=1, top_k=3, top_k=5
   - See how answers change

Have fun experimenting! 🚀

In [None]:
# 🎮 PLAYGROUND: Try your own questions!

# Modify these variables:
your_question = "Which companies are in the cybersecurity industry?"  # ← Change this!
your_top_k = 2  # ← Try different values (1, 2, 3, 5)

# Run the RAG pipeline
result = rag_pipeline(your_question, df, top_k=your_top_k)

# Display results
print("="*70)
print(f"❓ YOUR QUESTION:")
print(f"   {result['query']}\n")

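print(f"📄 RETRIEVED DOCUMENTS (top-{your_top_k}):")
# enumerate gives the rank; the DataFrame index is the row's original position, not its rank
for rank, (_, row) in enumerate(result['retrieved_docs'].iterrows(), start=1):
    print(f"   {rank}. {row['name']} (similarity: {row['similarity']:.4f})")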

print(f"\n✅ ANSWER:")
print("-"*70)
print(result['answer'])
print("-"*70)
print("="*70)

In [None]:
# 🧪 ADVANCED: Test a question that SHOULDN'T be answerable
# This tests how well the model handles missing information

off_topic_question = "What does Apple Inc do and who is their CEO?"

print("🧪 Testing off-topic question (not in our database):\n")
result = rag_pipeline(off_topic_question, df, top_k=1)

print("="*70)
print(f"❓ Question: {off_topic_question}\n")
print(f"📄 Top Retrieved Document: {result['retrieved_docs'].iloc[0]['name']}")
print(f"   Similarity Score: {result['retrieved_docs'].iloc[0]['similarity']:.4f}")
print(f"\n✅ Answer:")
print("-"*70)
print(result['answer'])
print("-"*70)
print("\n💡 OBSERVATION:")
print("   - Low similarity score indicates poor match")
print("   - The answer should indicate insufficient information")
print("   - In production, you might reject queries below a similarity threshold calibrated for your embedding model")
print("="*70)

---



### 📖 Additional Resources:

- **OpenAI Embeddings Guide**: https://platform.openai.com/docs/guides/embeddings
- **RAG Papers**: Look up "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"


