# 🦙 LlamaStack & RAG: Building Intelligent Agents

This notebook demonstrates **Retrieval-Augmented Generation (RAG)** - a powerful technique that enables AI models to access and reason about external documents and knowledge bases.

**What is RAG?**
RAG transforms static AI models into dynamic assistants that can:
- **Draw on** every document you have indexed for them
- **Search** through large libraries of content by meaning rather than exact wording  
- **Reason** about information from multiple sources simultaneously
- **Update** their knowledge without retraining the entire model

**Why RAG Matters:**
Instead of relying only on training data, RAG-enhanced models can reference your specific documents, course materials, and knowledge bases to provide accurate, cited responses.

You've already built a vector database - now let's add the intelligent layer that makes RAG truly powerful! 🚀

## 🏗️ The LlamaStack RAG Architecture

LlamaStack organizes RAG capabilities into **three elegant layers** that work together to create intelligent, knowledge-aware applications:

### 1. 🗄️ Storage Layer (The Foundation)
This is where your knowledge lives:
- **Vector IO**: Stores document embeddings for semantic search - converts text into mathematical vectors that capture meaning
- **KeyValue IO**: Manages structured metadata and simple lookups (document titles, authors, dates)
- **Relational IO**: Handles complex queries across structured data (coming soon)

### 2. 🔧 RAG Layer (The Intelligence)
This is where documents become searchable knowledge:
- **Document Ingestion**: Automatically downloads and processes files, URLs, and content
- **Intelligent Chunking**: Splits large documents into smaller pieces for retrieval (512 tokens per chunk in this notebook; the size is configurable)
- **Semantic Search**: Finds relevant content based on meaning, not just keyword matching
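
To make the chunking step concrete, here is a minimal sketch that splits text into fixed-size windows. It assumes whitespace-separated words stand in for tokens (a real pipeline would use the model's tokenizer), and the `chunk_text` helper and its `overlap` parameter are hypothetical names introduced only for illustration, not part of LlamaStack.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 0) -> list[str]:
    """Split text into windows of at most `chunk_size` whitespace tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Ten toy "tokens", windows of four with one token of overlap
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_text(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Overlap is a common design choice because a sentence that straddles a chunk boundary still appears whole in at least one chunk.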

### 3. 🤖 User Layer (The Interface)  
This is where users interact with the knowledge:
- **Context-Aware Agents**: LLMs that can automatically use RAG tools to answer questions
- **Multi-Document Reasoning**: Agents that synthesize information from multiple sources
- **Conversational Memory**: Maintains context across interactions while accessing external knowledge

**The Magic:** When you ask a question, the system searches the vector database for relevant chunks, then provides those chunks as context to the LLM for generating informed, cited responses.
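
The retrieval idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the `embed` function below uses word counts instead of a neural embedding model, and `embed`, `cosine`, and `top_k` are hypothetical helpers, not LlamaStack APIs. The ranking principle (cosine similarity between the query vector and each chunk vector) is the same one a vector database applies at scale.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system uses a neural model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "the forest canopy is the upper layer of trees",
    "vector databases store embeddings",
    "roots absorb water from the soil",
]
best = top_k("what is the forest canopy", chunks, k=1)
print(best)
```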

## 📦 Install Required Packages

Install the Python packages needed for this lab.

In [1]:
!pip install -q llama_stack_client fire dotenv


In [None]:
# Core imports for RAG functionality
import uuid  # For generating unique vector database IDs

# LlamaStack client and RAG-specific classes
from llama_stack_client import RAGDocument  # Represents documents for ingestion
from llama_stack_client.types.shared.content_delta import TextDelta, ToolCallDelta  # For streaming responses

# Additional utilities for document processing
import base64    # For encoding images/binary data if needed
import requests  # For fetching documents from URLs

## 🔗 Connect to LlamaStack

Connect to LlamaStack - the AI engine that orchestrates all RAG operations. LlamaStack acts as the central hub that coordinates:
- Vector database operations (storage and retrieval)
- Document processing and chunking
- LLM inference with retrieved context
- Agent workflows and tool usage

In [None]:
# Standard imports for system utilities
import os
import sys
sys.path.append('..')  # Add parent directory to path for custom utilities

# Import custom utilities and LlamaStack client
from src.utils import step_printer  # For pretty-printing step-by-step progress
from termcolor import cprint        # For colorized console output
from llama_stack_client import LlamaStackClient  # Main client for all LlamaStack operations

# === LlamaStack Connection Setup ===
# The base URL points to your LlamaStack server deployment
base_url = "http://llama-stack-service:8321"

# Optional: Configure external search tools (Tavily for web search)
# Leave empty for this RAG-focused demo
tavily_search_api_key = ""
if tavily_search_api_key:
    provider_data = {"tavily_search_api_key": tavily_search_api_key}
else:
    provider_data = None

# Create the LlamaStack client - this is your main interface for all RAG operations
client = LlamaStackClient(
    base_url=base_url,
    provider_data=provider_data  # Additional provider configurations
)

print(f"LlamaStack client configured for {base_url}")

# === Model Configuration ===
# Specify which LLM model to use for generating responses
model_id = "llama32"  # Using Llama 3.2 model name

# === Generation Parameters ===
# These control how the model generates responses
temperature = 0.0  # 0.0 = deterministic, higher = more creative
max_tokens = 512   # Maximum length of generated responses
stream = False     # Default streaming preference (the RAG loop below passes stream=True explicitly)

# Configure the sampling strategy based on temperature
if temperature > 0.0:
    top_p = 0.95  # Nucleus sampling parameter
    strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
else:
    strategy = {"type": "greedy"}  # Always pick most likely token

# Package sampling parameters for the inference API
sampling_params = {
    "strategy": strategy,
    "max_tokens": max_tokens,
}

# Display configuration for verification
print(f"Model: {model_id}")
print(f"Sampling Parameters: {sampling_params}")
print(f"Stream: {stream}")

## 🗃️ Create Vector Database for RAG

Set up a vector database where documents will be stored for retrieval. This is the **Storage Layer** of our RAG architecture.

**What happens here:**
1. **Registration**: Tell LlamaStack about your vector database configuration
2. **Embedding Model**: Specify which model converts text to vectors (we use `all-MiniLM-L6-v2`)
3. **Dimensions**: Set vector size (384 dimensions for our chosen model)
4. **Provider**: Connect to your Milvus database deployment

In [None]:
# === STEP 1: Register Vector Database ===

# Generate a unique identifier for this vector database instance
# Using UUID ensures no conflicts if multiple users run this notebook simultaneously
vector_db_id = f"test_vector_db_{uuid.uuid4()}"
print(f"📊 Created vector database ID: {vector_db_id}")

# This tells LlamaStack how to connect to and use your vector database
client.vector_dbs.register(
    vector_db_id=vector_db_id,                    # Unique identifier we created above
    embedding_model="all-MiniLM-L6-v2",          # Hugging Face model for text → vectors
    embedding_dimension=384,                      # Vector size (must match model output)
    provider_id="milvus",                         # Use Milvus as the vector database backend
)
print("✅ Registered vector database with Milvus backend")
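
If you want to confirm that the registration took effect, one option is to look the ID up in the list of registered databases. The sketch below assumes that `client.vector_dbs.list()` returns objects exposing an `identifier` attribute; check this against your installed client version. The `find_vector_db` helper and the stand-in client are hypothetical, included so the example runs without a server.

```python
from types import SimpleNamespace


def find_vector_db(client, vector_db_id: str):
    """Return the registered vector DB whose identifier matches, or None."""
    for db in client.vector_dbs.list():
        if getattr(db, "identifier", None) == vector_db_id:
            return db
    return None


# Stand-in client with made-up data, so the sketch runs offline
fake_client = SimpleNamespace(
    vector_dbs=SimpleNamespace(
        list=lambda: [
            SimpleNamespace(identifier="test_vector_db_demo", embedding_dimension=384)
        ]
    )
)

found = find_vector_db(fake_client, "test_vector_db_demo")
missing = find_vector_db(fake_client, "does_not_exist")
print(found is not None, missing is None)
```

Against a live server you would pass the real `client` and `vector_db_id` from the cell above instead of the stand-in.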

## 📚 Document Ingestion and Processing

This is where the **RAG Layer** comes into action! We'll use LlamaStack's RAG Tool to automatically:

1. **Download** documents from URLs
2. **Process** PDF content and extract text
3. **Chunk** large documents into pieces of 512 tokens each
4. **Embed** each chunk using the embedding model
5. **Store** vectors and metadata in the vector database

**Two ways to ingest documents:**
- **Direct Vector IO**: Insert pre-processed chunks directly
- **RAG Tool** (what we're using): Accepts URLs, files, or raw content and chunks it automatically, which makes it the more convenient path
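
For comparison, the Direct Vector IO route means preparing the chunks yourself. The sketch below only builds the chunk payloads; `build_chunks` and the `chunk_index` metadata key are hypothetical, and the content/metadata dictionary shape is an assumption about what the Vector IO API expects, so verify it against your client version.

```python
def build_chunks(document_id: str, pieces: list[str]) -> list[dict]:
    """Wrap pre-split text pieces in chunk dictionaries with basic metadata."""
    return [
        {
            "content": piece,
            "metadata": {"document_id": document_id, "chunk_index": i},
        }
        for i, piece in enumerate(pieces)
    ]


manual_chunks = build_chunks(
    "doc-0",
    ["First pre-split passage.", "Second pre-split passage."],
)
for chunk in manual_chunks:
    print(chunk)
```

With a running server, these could then be passed to something like `client.vector_io.insert(vector_db_id=vector_db_id, chunks=manual_chunks)`. Treat that call as an assumption to confirm rather than a guaranteed signature.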

In [None]:
# === STEP 2: Define Documents to Ingest ===
# List of (URL, MIME_TYPE) tuples for documents to process
urls = [
    ("https://raw.githubusercontent.com/rhoai-genaiops/deploy-lab/main/university-data/canopy-in-botany.pdf", "application/pdf"),
]

# === STEP 3: Create RAGDocument Objects ===
# RAGDocument is LlamaStack's format for documents to be ingested
documents = [
    RAGDocument(
        document_id=f"doc-{i}",                   # Unique ID for this document
        content=url,                              # Can be URL, file path, or direct text
        mime_type=url_type,                       # Tells LlamaStack how to process the content
        metadata={                                # Additional information about the document
            "source_url": url,                    # Where this document came from
            "document_type": "academic_material",  # Category for filtering/organization
        },
    )
    for i, (url, url_type) in enumerate(urls)
]

# Display what we're about to ingest
print("📖 Ingesting documents into RAG system...")
for i, (url, url_type) in enumerate(urls):
    print(f"  • Document {i+1}: {url}")

# === STEP 4: Use RAG Tool for Automatic Processing ===
# This is where the magic happens! The RAG tool will:
# 1. Download the PDF from the URL
# 2. Extract and parse the text content
# 3. Split into chunks of 512 tokens each
# 4. Generate embeddings for each chunk
# 5. Store everything in the vector database
try:
    client.tool_runtime.rag_tool.insert(
        documents=documents,                      # List of RAGDocument objects to process
        vector_db_id=vector_db_id,               # Where to store the processed chunks
        chunk_size_in_tokens=512,                # Chunk size: smaller chunks retrieve more precisely, larger ones keep more context
    )
    print("\n✅ Document ingestion complete!")
    print("🎯 Your documents are now searchable via semantic similarity!")
except Exception as e:
    print(f"\n❌ Document ingestion failed: {e}")
    print("💡 This might be due to PDF processing issues. Try with different documents or check the PDF accessibility.")

## 🔍 Testing RAG Retrieval and Generation

Now let's test the complete **RAG Pipeline** - this demonstrates how all three layers work together:

### The RAG Process:
1. **🔍 Query Processing**: Convert user question into embeddings
2. **📚 Semantic Retrieval**: Find most similar document chunks in vector database  
3. **🔗 Context Assembly**: Combine user question with retrieved chunks
4. **🤖 Generation**: LLM generates informed response using both its training and the retrieved context
5. **📖 Citation**: Retrieved chunks carry metadata (such as the source URL) that lets you trace an answer back to its source documents
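
Steps 3 and 4 are mostly string assembly, which can be sketched without a server. The example below is a simplified stand-in: `CHUNK_TEMPLATE`, `assemble_prompt`, and the sample passages are hypothetical, and the template is a flattened variant of the one passed to the RAG tool later in this notebook.

```python
CHUNK_TEMPLATE = "Result {index}\nContent: {content}\nMetadata: {metadata}\n"


def assemble_prompt(question: str, retrieved: list[dict]) -> str:
    """Combine retrieved chunks and the user question into one prompt."""
    context = "".join(
        CHUNK_TEMPLATE.format(index=i, content=c["content"], metadata=c["metadata"])
        for i, c in enumerate(retrieved, start=1)
    )
    return (
        "Please answer the given query using the context below.\n\n"
        f"CONTEXT:\n{context}\n\nQUERY:\n{question}"
    )


# Placeholder chunks standing in for real retrieval results
retrieved = [
    {"content": "first retrieved passage", "metadata": {"document_id": "doc-0"}},
    {"content": "second retrieved passage", "metadata": {"document_id": "doc-0"}},
]
prompt = assemble_prompt("What are the types of Canopy?", retrieved)
print(prompt)
```

Keeping the metadata next to each chunk in the prompt is what makes it possible to ask the model to name its sources.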

**Why this works better than a standalone LLM:**
- **Grounded responses**: Answers are based on your specific documents
- **Up-to-date**: Add new documents without retraining the model
- **Traceable**: Retrieved chunks carry metadata, so answers can be traced back to source material
- **Accurate**: Reduces hallucination by providing factual context

In [None]:
# === Test Queries ===
# These questions will test our RAG system's ability to find and synthesize information
queries = [
    "What are the types of Canopy?",        # Tests retrieval of categorical information
    "What is the structure of Canopy?",     # Tests retrieval of structural/descriptive information
]

# === RAG Pipeline Testing Loop ===
for prompt in queries:
    cprint(f"\nUser> {prompt}", "blue")
    
    # === STEP 1: RAG RETRIEVAL ===
    # Query the vector database to find relevant document chunks
    # This uses semantic similarity - the question gets converted to embeddings
    # and matched against document chunk embeddings
    rag_response = client.tool_runtime.rag_tool.query(
        content=prompt,                              # The user's question
        vector_db_ids=[vector_db_id],               # Which vector database(s) to search
        query_config={                              # How to format the retrieved results
            "chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
        },
    )

    # Display the full RAG response structure
    cprint(rag_response)

    # === STEP 2: EXAMINE RETRIEVED METADATA ===
    # The metadata contains information about which documents were matched
    # and their relevance scores
    cprint("\n--- RAG Metadata ---", "yellow")
    cprint(rag_response.metadata, "cyan")

    # === STEP 3: PREPARE MESSAGES FOR LLM ===
    # Structure the conversation with system prompt and user query
    messages = [
        {"role": "system", "content": "You are a helpful assistant."}
    ]

    # === STEP 4: CONTEXT INJECTION ===
    # This is the key to RAG: we inject the retrieved content as context
    # The LLM now has both its training knowledge AND the specific document content
    prompt_context = rag_response.content
    extended_prompt = f"Please answer the given query using the context below.\n\nCONTEXT:\n{prompt_context}\n\nQUERY:\n{prompt}"
    messages.append({"role": "user", "content": extended_prompt})

    # === STEP 5: LLM GENERATION WITH CONTEXT ===
    # The LLM generates a response using both its training and the retrieved context
    response = client.inference.chat_completion(
        messages=messages,                          # The conversation including context
        model_id=model_id,                         # Which model to use for generation
        sampling_params=sampling_params,           # How to generate (greedy vs sampling)
        stream=True,                               # Stream the response token by token
    )

    # === STEP 6: DISPLAY GENERATED RESPONSE ===
    # Show the final answer that combines the LLM's knowledge with document facts
    cprint("inference> ", color="magenta", end='')
    
    # Handle streaming response - tokens arrive one by one
    for event in response:
        # Some SDKs surface "delta" at the top level; others nest under ".event"
        ev = getattr(event, "event", event)  # Fall back to event itself
        delta = getattr(ev, "delta", None)

        if delta is None:
            # Non-delta events: message_start, message_end, tool_started, heartbeats, etc.
            continue

        # Extract and display text tokens as they arrive
        text = getattr(delta, "text", None)
        if isinstance(text, str):
            cprint(text, color="magenta", end='')
            continue

        # Handle any tool call tokens (if the model decides to use tools)
        tool_call = getattr(delta, "tool_call", None)
        if tool_call is not None:
            cprint(str(tool_call), color="magenta", end='')
            continue
    
    cprint("\n--- End of RAG Answer ---", "blue")

print("\n🎉 RAG Pipeline Complete!")
print("🔍 Notice how the responses reference specific information from the documents")
print("📚 This is the power of RAG: grounded, factual, and citable answers")
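
The streaming loop above tolerates two event shapes (delta at the top level, or nested under `.event`). That logic can be pulled into a small function and exercised with fake events, which is handy for testing without a server. `extract_text`, `collect_answer`, and the fake events are hypothetical; only the attribute names mirror the loop above.

```python
from types import SimpleNamespace


def extract_text(event):
    """Return the text fragment carried by a streaming event, or None."""
    ev = getattr(event, "event", event)      # nested under .event, or the event itself
    delta = getattr(ev, "delta", None)
    text = getattr(delta, "text", None)      # getattr on None safely yields None
    return text if isinstance(text, str) else None


def collect_answer(events) -> str:
    """Join all text fragments from a stream into the full answer."""
    return "".join(t for t in map(extract_text, events) if t is not None)


# Fake stream covering the cases the loop handles
fake_stream = [
    SimpleNamespace(event=SimpleNamespace(delta=None)),                            # no delta
    SimpleNamespace(event=SimpleNamespace(delta=SimpleNamespace(text="Hello"))),   # nested delta
    SimpleNamespace(delta=SimpleNamespace(text=", world")),                        # top-level delta
    SimpleNamespace(event=SimpleNamespace(delta=SimpleNamespace(tool_call="x"))),  # no text
]
answer = collect_answer(fake_stream)
print(answer)
```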

## 🎉 You've Built a Complete RAG System!

**What you accomplished:**
- **🗄️ Storage Layer**: Registered and configured Milvus vector database with proper embeddings
- **🔧 RAG Layer**: Used LlamaStack's RAG Tool for automatic document processing and chunking
- **🤖 User Layer**: Built query processing with context-aware generation and streaming responses
- **📊 End-to-End Pipeline**: Demonstrated retrieval → context injection → grounded generation

**Key Technical Insights:**
- **Semantic Search**: Questions find relevant content by meaning, not just keyword matching
- **Document Chunking**: Large documents are split into 512-token chunks for precise retrieval
- **Context Injection**: The magic happens when retrieved chunks become context for the LLM
- **Grounded Generation**: Responses are more reliable because they draw on specific document content

**RAG vs Standard LLMs:**
| Standard LLM | RAG-Enhanced LLM |
|--------------|------------------|
| ❌ Limited to training data | ✅ Access to your documents |
| ❌ Can hallucinate facts | ✅ Grounded in real sources |
| ❌ No citations | ✅ Traceable references |
| ❌ Static knowledge | ✅ Updatable knowledge base |

**Advanced RAG Patterns to Explore:**
- **Multi-Document Reasoning**: Synthesize information across multiple sources
- **Conversational RAG**: Maintain context across multiple questions
- **Hybrid Search**: Combine semantic and keyword search
- **Agent Workflows**: Let AI agents decide when and how to search documents
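
As a taste of hybrid search, one common way to merge a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). This is a general technique, not something this notebook or its LlamaStack setup is known to use. The function name, the chunk IDs, and the choice of `k=60` (a conventional default) are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each item scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic = ["chunk-a", "chunk-b", "chunk-c"]  # made-up semantic search ranking
keyword = ["chunk-b", "chunk-d", "chunk-a"]   # made-up keyword search ranking
fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)
```

Chunks that rank well in both lists rise to the top, while chunks found by only one method are kept but ranked lower.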

Your RAG system can now intelligently answer questions using document knowledge - the foundation for intelligent, knowledge-aware applications! 🚀