# RAG Pipeline: Context Assembly

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ayoisio/genai-on-google-cloud/blob/main/chapter-2/colabs/04_rag_context_assembly.ipynb)

**Estimated Time**: 15 minutes

**Prerequisites**: Google Cloud project with billing enabled, Vertex AI API enabled

---

## Overview

Retrieval-Augmented Generation (RAG) enhances LLM responses with relevant context from your data. This notebook shows how to:

1. **Build a complete RAG pipeline** from retrieval to generation
2. **Assemble context** with source attribution
3. **Generate grounded responses** using retrieved documents
4. **Handle edge cases** and optimize context usage

This implements the context assembly pattern from Example 2-2 in Chapter 2.

```mermaid
flowchart LR
    A[Query] --> B[Embed]
    B --> C[Retrieve]
    C --> D[Assemble Context]
    D --> E[Generate]
    E --> F[Answer]
```

## 1. Setup & Authentication

In [None]:
# @title Install Dependencies
!pip install --upgrade google-cloud-aiplatform google-generativeai -q

In [None]:
# @title Authenticate with Google Cloud
from google.colab import auth
auth.authenticate_user()
print("✓ Authentication successful")

In [None]:
# @title Configure Your Project
PROJECT_ID = "your-project-id"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

# Validate project ID
if PROJECT_ID == "your-project-id":
    raise ValueError("Please set your PROJECT_ID above")

print(f"✓ Project: {PROJECT_ID}")
print(f"✓ Location: {LOCATION}")

In [None]:
# @title Initialize Vertex AI
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.language_models import TextEmbeddingModel
import numpy as np

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Initialize models
embedding_model = TextEmbeddingModel.from_pretrained("text-embedding-005")
generative_model = GenerativeModel("gemini-2.0-flash")

print("✓ Vertex AI initialized")
print("✓ Embedding model: text-embedding-005")
print("✓ Generative model: gemini-2.0-flash")

## 2. Create a Knowledge Base

First, let's create a sample knowledge base with document chunks and their embeddings.
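Real documents are usually longer than the short passages used here, so they are split into chunks before embedding. The following is a minimal sketch of one common approach, overlapping character windows. The helper name `chunk_text` and the `chunk_size` and `overlap` values are hypothetical and not part of the chapter's example.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into character windows that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


# Toy input: 500 characters cycling through the alphabet.
sample_text = "".join(chr(97 + i % 26) for i in range(500))
chunks = chunk_text(sample_text, chunk_size=200, overlap=50)

# Wrap each chunk in the same dictionary shape the knowledge base uses.
chunk_docs = [
    {
        "id": f"sample_{n}",
        "source": "Sample Document",  # hypothetical source name
        "section": f"Chunk {n}",
        "content": chunk,
    }
    for n, chunk in enumerate(chunks, 1)
]
print(f"{len(chunk_docs)} chunks created")
```

The overlap keeps sentences that straddle a boundary retrievable from either side, at the cost of embedding some text twice.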

In [None]:
# @title Define sample knowledge base
# Simulating a knowledge base about AI and Machine Learning

KNOWLEDGE_BASE = [
    {
        "id": "ml_basics_1",
        "source": "ML Fundamentals Guide",
        "section": "Chapter 1: Introduction",
        "content": "Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. The core idea is to develop algorithms that can access data and use it to learn for themselves."
    },
    {
        "id": "ml_basics_2",
        "source": "ML Fundamentals Guide",
        "section": "Chapter 2: Types of ML",
        "content": "There are three main types of machine learning: supervised learning (using labeled data), unsupervised learning (finding patterns without labels), and reinforcement learning (learning through rewards and penalties). Each type is suited for different problem domains."
    },
    {
        "id": "dl_intro_1",
        "source": "Deep Learning Handbook",
        "section": "Neural Networks",
        "content": "Deep learning uses neural networks with multiple layers (hence 'deep') to progressively extract higher-level features from raw input. For example, in image recognition, lower layers identify edges, while higher layers identify concepts like faces or objects."
    },
    {
        "id": "dl_intro_2",
        "source": "Deep Learning Handbook",
        "section": "Training Process",
        "content": "Training a neural network involves forward propagation (computing predictions), calculating loss (error), and backpropagation (adjusting weights). This process repeats over many iterations until the model converges to an optimal solution."
    },
    {
        "id": "nlp_basics_1",
        "source": "NLP Reference Manual",
        "section": "Text Processing",
        "content": "Natural Language Processing (NLP) enables computers to understand, interpret, and generate human language. Key tasks include tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis."
    },
    {
        "id": "llm_overview_1",
        "source": "LLM Architecture Guide",
        "section": "Transformer Models",
        "content": "Large Language Models (LLMs) are based on the Transformer architecture, which uses self-attention mechanisms to process input sequences. This allows the model to weigh the importance of different parts of the input when generating output."
    },
    {
        "id": "llm_overview_2",
        "source": "LLM Architecture Guide",
        "section": "Training and Fine-tuning",
        "content": "LLMs are typically pre-trained on vast amounts of text data, then fine-tuned for specific tasks. Pre-training gives the model general language understanding, while fine-tuning adapts it to domain-specific requirements."
    },
    {
        "id": "rag_intro_1",
        "source": "RAG Implementation Guide",
        "section": "Overview",
        "content": "Retrieval-Augmented Generation (RAG) combines the power of LLMs with external knowledge retrieval. Instead of relying solely on the model's trained knowledge, RAG retrieves relevant documents and uses them to generate more accurate, up-to-date responses."
    },
    {
        "id": "rag_intro_2",
        "source": "RAG Implementation Guide",
        "section": "Benefits",
        "content": "Key benefits of RAG include: reduced hallucinations (grounding in real data), ability to cite sources, easy knowledge updates without retraining, and domain-specific accuracy. RAG is particularly valuable for enterprise applications."
    },
    {
        "id": "vector_search_1",
        "source": "Vector Database Guide",
        "section": "Semantic Search",
        "content": "Vector search enables semantic similarity matching by converting text to embeddings (dense numerical vectors). Unlike keyword search, vector search finds conceptually similar content even when exact words differ."
    }
]

print(f"✓ Created knowledge base with {len(KNOWLEDGE_BASE)} documents")

In [None]:
# @title Generate embeddings for knowledge base
# Get content from all documents
contents = [doc['content'] for doc in KNOWLEDGE_BASE]

# Generate embeddings
embeddings = embedding_model.get_embeddings(contents)

# Add embeddings to documents
for doc, emb in zip(KNOWLEDGE_BASE, embeddings):
    doc['embedding'] = np.array(emb.values)

print(f"✓ Generated embeddings for {len(KNOWLEDGE_BASE)} documents")
print(f"  Embedding dimension: {len(KNOWLEDGE_BASE[0]['embedding'])}")

## 3. Retrieval Component

The retrieval component finds relevant documents based on semantic similarity.
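To make the ranking step concrete without calling the embedding API, here is a small sketch that scores toy 3-dimensional vectors by cosine similarity and keeps the best matches. The function name `top_k_by_cosine` and the toy vectors are illustrative assumptions; real embeddings have many more dimensions.

```python
import numpy as np


def top_k_by_cosine(query_vec, doc_matrix, top_k=3, threshold=0.5):
    """Return (row_index, score) pairs for the rows most similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)

    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    # Guard against zero-length vectors: their dot product is 0, so the score becomes 0.
    norms = np.where(norms == 0, 1.0, norms)
    scores = (doc_matrix @ query_vec) / norms

    # Sort from most to least similar, drop scores below the threshold, keep top_k.
    order = np.argsort(-scores)
    kept = [(int(i), float(scores[i])) for i in order if scores[i] >= threshold]
    return kept[:top_k]


toy_docs = np.array([
    [1.0, 0.0, 0.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
])
toy_query = np.array([1.0, 0.0, 0.0])

ranked = top_k_by_cosine(toy_query, toy_docs, top_k=2, threshold=0.5)
print(ranked)
```

The orthogonal vector scores 0 and is filtered out by the threshold, which is the same mechanism that drops off-topic documents in a real knowledge base.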

In [None]:
# @title Retrieval function
def cosine_similarity(vec1, vec2):
    """Compute cosine similarity between two vectors."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def retrieve_documents(query, knowledge_base, top_k=3, threshold=0.5):
    """
    Retrieve relevant documents from the knowledge base.
    
    Args:
        query: User's question
        knowledge_base: List of documents with embeddings
        top_k: Number of documents to retrieve
        threshold: Minimum similarity score
    
    Returns:
        List of relevant documents with similarity scores
    """
    # Generate query embedding
    query_embedding = embedding_model.get_embeddings([query])[0].values
    query_vector = np.array(query_embedding)
    
    # Calculate similarities
    results = []
    for doc in knowledge_base:
        similarity = cosine_similarity(query_vector, doc['embedding'])
        if similarity >= threshold:
            results.append({
                'id': doc['id'],
                'source': doc['source'],
                'section': doc['section'],
                'content': doc['content'],
                'similarity': similarity
            })
    
    # Sort by similarity and return top_k
    results.sort(key=lambda x: x['similarity'], reverse=True)
    return results[:top_k]

print("✓ Retrieval function defined")

In [None]:
# @title Test retrieval
test_query = "How does RAG help reduce hallucinations?"

retrieved_docs = retrieve_documents(test_query, KNOWLEDGE_BASE, top_k=3)

print(f"🔍 Query: '{test_query}'\n")
print(f"Retrieved {len(retrieved_docs)} documents:")
for i, doc in enumerate(retrieved_docs, 1):
    print(f"\n{i}. [{doc['similarity']:.3f}] {doc['source']} - {doc['section']}")
    print(f"   {doc['content'][:100]}...")

## 4. Context Assembly (Example 2-2)

This is the core pattern from Chapter 2: assembling retrieved chunks into coherent context for the LLM.
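The budget check is the part of context assembly with the most room for variation. The sketch below is one possible variant, not Example 2-2 itself: it skips a chunk that does not fit and keeps trying lower-ranked ones, instead of stopping at the first chunk that overflows. The name `pack_chunks` and the toy documents are hypothetical.

```python
def pack_chunks(docs, max_chars):
    """Greedily pack source-tagged chunks into a character budget, best match first."""
    packed, used = [], 0
    for doc in sorted(docs, key=lambda d: d["similarity"], reverse=True):
        chunk = f"[Source: {doc['source']} | Section: {doc['section']}]\n{doc['content']}\n"
        if used + len(chunk) > max_chars:
            continue  # too large for the remaining budget; try the next chunk
        packed.append(chunk)
        used += len(chunk)
    return "\n".join(packed), used


toy_docs = [
    {"source": "Guide A", "section": "Intro", "content": "x" * 200, "similarity": 0.9},
    {"source": "Guide B", "section": "Usage", "content": "short note", "similarity": 0.7},
]

context, used = pack_chunks(toy_docs, max_chars=100)
print(context)
print(f"Characters used: {used}")
```

Skipping fills the budget more fully, but the context may then omit the highest-ranked chunk. Stopping at the first overflow keeps the context strictly in rank order. Which trade-off is preferable depends on how uneven the chunk sizes are.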

In [None]:
# @title Context assembly function (Example 2-2 from Chapter)
def assemble_rag_context(query, retrieved_docs, max_context_chars=6000):
    """
    Assemble retrieved document chunks into coherent context for the LLM.
    
    This implements the context assembly pattern from Example 2-2.
    
    Args:
        query: The user's question
        retrieved_docs: List of retrieved documents with metadata
        max_context_chars: Maximum characters for context
    
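    Returns:
        Tuple of (formatted prompt, list of sources used)
    """
    if not retrieved_docs:
        # Return the same (prompt, sources) shape as the main path
        # so callers can always unpack two values.
        prompt = f"""Answer the following question. If you don't have enough information,
say so clearly.

Question: {query}

Answer:"""
        return prompt, []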
    
    # Sort by relevance (retrieval already sorts, but re-sort defensively)
    sorted_docs = sorted(retrieved_docs, key=lambda x: x['similarity'], reverse=True)
    
    # Build context with source attribution
    context_parts = []
    total_chars = 0
    sources_used = []
    
    for doc in sorted_docs:
        # Format each chunk with source metadata
        chunk_text = f"""[Source: {doc['source']} | Section: {doc['section']}]
{doc['content']}
"""
        
        # Check if adding this chunk exceeds limit
        if total_chars + len(chunk_text) > max_context_chars:
            break
        
        context_parts.append(chunk_text)
        total_chars += len(chunk_text)
        sources_used.append(f"{doc['source']} ({doc['section']})")
    
    # Assemble the full context
    assembled_context = "\n".join(context_parts)
    
    # Create the full prompt
    prompt = f"""You are a helpful AI assistant. Answer the question using ONLY the context provided below.
If the context doesn't contain enough information to answer the question fully, say so.
Always cite your sources by mentioning which document the information came from.

=== CONTEXT ===
{assembled_context}
=== END CONTEXT ===

Question: {query}

Instructions:
- Answer based ONLY on the context above
- Cite sources when providing information
- If information is missing, acknowledge it

Answer:"""
    
    return prompt, sources_used

print("✓ Context assembly function defined (Example 2-2 pattern)")

In [None]:
# @title Test context assembly
prompt, sources = assemble_rag_context(test_query, retrieved_docs)

print("📝 Assembled Prompt:")
print("=" * 60)
print(prompt)
print("=" * 60)
print(f"\n📚 Sources used: {len(sources)}")
for source in sources:
    print(f"   - {source}")

## 5. Generation Component

Generate grounded responses using the assembled context.
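A generation call does not always yield readable text, for example when a response comes back without usable content. The sketch below wraps the `.text` access in a fallback. It assumes the SDK raises `ValueError` in that situation, which should be checked against the installed version. The helper `safe_response_text` and the two stand-in classes are hypothetical test doubles, not SDK types.

```python
def safe_response_text(response, fallback="No answer could be generated."):
    """Return the response text, or a fallback when it cannot be read."""
    try:
        text = response.text
    except (ValueError, AttributeError):
        # Assumption: reading .text fails this way when no usable text exists.
        return fallback
    return text.strip() or fallback


class BlockedResponse:
    """Stand-in for a response whose text is unavailable."""

    @property
    def text(self):
        raise ValueError("response has no text")


class OkResponse:
    """Stand-in for a normal response."""

    text = "  RAG grounds answers in retrieved documents.  "


print(safe_response_text(BlockedResponse()))
print(safe_response_text(OkResponse()))
```

Returning a fixed fallback keeps the pipeline's return shape stable, so downstream code that prints or truncates the answer does not need its own error handling.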

In [None]:
# @title RAG generation function
def generate_rag_response(query, knowledge_base, top_k=3, temperature=0.2):
    """
    Complete RAG pipeline: retrieve, assemble context, and generate.
    
    Args:
        query: User's question
        knowledge_base: Document collection with embeddings
        top_k: Number of documents to retrieve
        temperature: Generation temperature (lower = more focused)
    
    Returns:
        Dict with keys 'answer', 'sources', 'retrieved_docs', and 'num_docs_retrieved'
    """
    # Step 1: Retrieve relevant documents
    retrieved_docs = retrieve_documents(query, knowledge_base, top_k=top_k)
    
    # Step 2: Assemble context
    prompt, sources = assemble_rag_context(query, retrieved_docs)
    
    # Step 3: Generate response
    generation_config = {
        "temperature": temperature,
        "max_output_tokens": 1024,
    }
    
    response = generative_model.generate_content(
        prompt,
        generation_config=generation_config
    )
    
    return {
        'answer': response.text,
        'sources': sources,
        'retrieved_docs': retrieved_docs,
        'num_docs_retrieved': len(retrieved_docs)
    }

print("✓ RAG generation function defined")

In [None]:
# @title Test the complete RAG pipeline
QUERY = "What is RAG and how does it help with LLM applications?"  # @param {type:"string"}

result = generate_rag_response(QUERY, KNOWLEDGE_BASE, top_k=4)

print(f"🔍 Query: '{QUERY}'\n")
print("=" * 60)
print("📝 Answer:")
print(result['answer'])
print("=" * 60)
print(f"\n📚 Sources ({len(result['sources'])}):")
for source in result['sources']:
    print(f"   - {source}")

In [None]:
# @title Test with different queries
test_queries = [
    "What are the three types of machine learning?",
    "How do neural networks learn?",
    "What is the difference between vector search and keyword search?",
    "What is quantum computing?"  # Out of scope - should acknowledge
]

for query in test_queries:
    print(f"\n{'='*60}")
    print(f"🔍 Query: {query}")
    print("-" * 60)
    
    result = generate_rag_response(query, KNOWLEDGE_BASE, top_k=3)
    
    print(f"📝 Answer: {result['answer'][:300]}...")
    print(f"📚 Sources: {result['num_docs_retrieved']} documents")

## 6. Advanced: Handling Edge Cases

Real-world RAG systems need to handle various edge cases gracefully.
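One way to keep edge-case logic manageable is to isolate the confidence decision in a pure function that can be unit-tested without calling any model. This is a sketch under assumed cutoffs. The name `classify_confidence` and the default values are illustrative and should be tuned for your data.

```python
def classify_confidence(similarities, min_docs=1, medium_cutoff=0.6):
    """Map retrieval similarity scores to a coarse confidence label."""
    if not similarities:
        return "none"    # nothing retrieved: answer with a fallback instead of generating
    if len(similarities) < min_docs:
        return "low"     # too few supporting documents
    average = sum(similarities) / len(similarities)
    return "medium" if average < medium_cutoff else "high"


for scores in ([], [0.55, 0.58], [0.82, 0.71]):
    print(scores, "->", classify_confidence(scores))
```

Keeping the thresholds as parameters makes it straightforward to calibrate them against a set of queries with known in-scope and out-of-scope answers.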

In [None]:
# @title Enhanced RAG with edge case handling
def enhanced_rag_response(query, knowledge_base, top_k=3, 
                          similarity_threshold=0.5, 
                          min_docs_required=1):
    """
    Enhanced RAG pipeline with edge case handling.
    
    Handles:
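    - No relevant documents found (returns a fallback answer without calling the model)
    - Fewer documents than min_docs_required (low confidence)
    - Moderate average similarity (medium confidence)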
    """
    # Retrieve documents with threshold
    retrieved_docs = retrieve_documents(
        query, 
        knowledge_base, 
        top_k=top_k,
        threshold=similarity_threshold
    )
    
    # Calculate average similarity
    avg_similarity = 0
    if retrieved_docs:
        avg_similarity = sum(d['similarity'] for d in retrieved_docs) / len(retrieved_docs)
    
    # Determine confidence level
    if len(retrieved_docs) < min_docs_required:
        confidence = "low"
        warning = "⚠️ Limited relevant information found in knowledge base."
    elif avg_similarity < 0.6:
        confidence = "medium"
        warning = "ℹ️ Retrieved documents have moderate relevance."
    else:
        confidence = "high"
        warning = None
    
    # Generate response
    if len(retrieved_docs) == 0:
        return {
            'answer': "I don't have enough information in my knowledge base to answer this question accurately. Please rephrase your question or ask about a different topic.",
            'confidence': 'none',
            'sources': [],
            'warning': "❌ No relevant documents found."
        }
    
    prompt, sources = assemble_rag_context(query, retrieved_docs)
    
    response = generative_model.generate_content(
        prompt,
        generation_config={"temperature": 0.2, "max_output_tokens": 1024}
    )
    
    return {
        'answer': response.text,
        'confidence': confidence,
        'avg_similarity': avg_similarity,
        'sources': sources,
        'warning': warning,
        'retrieved_docs': retrieved_docs
    }

print("✓ Enhanced RAG function defined")

In [None]:
# @title Test enhanced RAG with edge cases
edge_case_queries = [
    "Explain deep learning",  # Should have high confidence
    "What is the weather today?",  # Out of scope
    "How does RAG work?",  # Should have high confidence
]

for query in edge_case_queries:
    print(f"\n{'='*60}")
    print(f"🔍 Query: {query}")
    
    result = enhanced_rag_response(query, KNOWLEDGE_BASE)
    
    if result.get('warning'):
        print(result['warning'])
    
    print(f"📊 Confidence: {result['confidence']}")
    if 'avg_similarity' in result:
        print(f"📈 Avg Similarity: {result['avg_similarity']:.3f}")
    print(f"📝 Answer: {result['answer'][:200]}...")

## 7. Try It Yourself

In [None]:
# TODO: Add your own documents to the knowledge base
custom_docs = [
    {
        "id": "custom_1",
        "source": "My Custom Document",
        "section": "Introduction",
        "content": "Add your own content here to test the RAG pipeline."
    },
]

# Generate embeddings for custom docs
custom_contents = [doc['content'] for doc in custom_docs]
custom_embeddings = embedding_model.get_embeddings(custom_contents)

for doc, emb in zip(custom_docs, custom_embeddings):
    doc['embedding'] = np.array(emb.values)
    KNOWLEDGE_BASE.append(doc)

print(f"✓ Added {len(custom_docs)} custom documents")
print(f"Total documents in knowledge base: {len(KNOWLEDGE_BASE)}")

In [None]:
# TODO: Experiment with different parameters
YOUR_QUERY = "Your question here"  # @param {type:"string"}
TOP_K = 3  # @param {type:"integer"}
SIMILARITY_THRESHOLD = 0.5  # @param {type:"number"}

result = enhanced_rag_response(
    YOUR_QUERY, 
    KNOWLEDGE_BASE,
    top_k=TOP_K,
    similarity_threshold=SIMILARITY_THRESHOLD
)

print(f"🔍 Query: {YOUR_QUERY}")
print(f"📊 Confidence: {result['confidence']}")
print(f"📝 Answer:\n{result['answer']}")

## Summary

In this notebook, you learned how to:

1. ✅ **Build a complete RAG pipeline** with retrieval, context assembly, and generation
2. ✅ **Implement context assembly** with source attribution (Example 2-2)
3. ✅ **Generate grounded responses** using retrieved documents
4. ✅ **Handle edge cases** like missing information and low confidence

### Key Takeaways

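- **Context assembly** shapes what the model sees: chunks are ordered by similarity, tagged with their source, and capped by a character budget
- **Source attribution** lets the model cite documents, so readers can verify each claim
- **Confidence scoring** based on retrieval similarity signals how well the knowledge base covers a query
- **Edge case handling**, such as returning a fallback when nothing relevant is retrieved, is essential for production systems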

---

## Next Steps

Continue to the next notebook: **[05_vertex_ai_rag_engine.ipynb](05_vertex_ai_rag_engine.ipynb)** to learn how to use Vertex AI RAG Engine for managed, production-ready RAG.