# 🦙 Llama Stack & RAG: Building Intelligent Agents

This notebook demonstrates **Retrieval-Augmented Generation (RAG)** - a powerful technique that enables AI models to access and reason about external documents and knowledge bases.

**Why RAG Matters:**
Instead of relying only on training data, RAG-enhanced models can reference your specific documents, course materials, and knowledge bases to provide accurate, cited responses.

In [None]:
!pip install -q llama_stack_client==0.3.0 fire==0.7.1 dotenv==0.9.9

In [None]:
import uuid

from llama_stack_client import RAGDocument, LlamaStackClient
from termcolor import cprint

import sys
sys.path.append('..')

## 🔗 Connect to Llama Stack

Connect to Llama Stack - the AI engine that orchestrates all RAG operations. Llama Stack acts as the central hub that coordinates:
- Vector database operations (storage and retrieval)
- Document processing and chunking
- LLM inference with retrieved context
- Agent workflows and tool usage

In [None]:
# The base URL points to your Llama Stack server deployment
base_url = "http://llama-stack-service:8321"

# Create the Llama Stack client
client = LlamaStackClient(
    base_url=base_url,
    provider_data=None
)

print(f"Llama Stack client configured for {base_url}")

# Configs for model and sampling
model_id = "llama32"
temperature = 0.0
max_tokens = 512
stream = True  # The inference cells below stream their output token by token

# Configure the sampling strategy based on temperature
if temperature > 0.0:
    top_p = 0.95  # Nucleus sampling parameter
    strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
else:
    strategy = {"type": "greedy"}  # Always pick most likely token

# Package sampling parameters for the inference API
sampling_params = {
    "strategy": strategy,
    "max_tokens": max_tokens,
}

print(f"Model: {model_id}")
print(f"Sampling Parameters: {sampling_params}")
print(f"Stream: {stream}")
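
To make the two strategies above concrete, here is a minimal sketch in plain Python of what `greedy` and `top_p` mean for a single next-token choice. The token probabilities are made up for illustration, and the helper names (`greedy_pick`, `nucleus_candidates`) are hypothetical. The real sampling happens inside the inference server, which typically also applies `temperature` before sampling.

```python
def greedy_pick(probs: dict[str, float]) -> str:
    """Greedy decoding: return the single most likely token."""
    return max(probs, key=probs.get)


def nucleus_candidates(probs: dict[str, float], top_p: float = 0.95) -> list[str]:
    """Top-p: keep the smallest set of tokens whose cumulative probability reaches top_p."""
    kept, cumulative = [], 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept


# Made-up next-token distribution, for illustration only
next_token_probs = {"tree": 0.5, "forest": 0.3, "leaf": 0.17, "sky": 0.03}

print(greedy_pick(next_token_probs))         # the single most likely token
print(nucleus_candidates(next_token_probs))  # tokens a top_p sampler may draw from
```

With `temperature = 0.0` the notebook takes the greedy branch, so repeated runs should give the same answer for the same prompt, which makes the with-RAG and without-RAG comparison easier to read.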

## Register the Vector Database

In [None]:
# This tells Llama Stack how to connect to and use your vector database
vs = client.vector_stores.create(
    name="my_citations_db",
    extra_body={
        "embedding_model": "all-MiniLM-L6-v2",
        "embedding_dimension": 384,
        "provider_id": "milvus",
        "vector_db_id": "test"
    }
)

print(f"📊 Created vector database with ID: {vs.id}")

## 📚 Document Ingestion and Processing

This is where the **RAG Layer** comes into action! We'll use Llama Stack's RAG Tool to automatically:

1. **Download** documents from URLs
2. **Process** PDF content and extract text
3. **Chunk** large documents into pieces of at most 512 tokens, with a 50-token overlap between neighboring chunks
4. **Embed** each chunk using the embedding model
5. **Store** vectors and metadata in the vector database

**Two ways to ingest documents:**
- **Direct Vector IO**: Insert pre-processed chunks directly into your Vector Database
- **Llama Stack RAG Tool** (what we're using): Automatic processing from URLs or files

We use the RAG Tool because it lets us ingest documents from URLs or files and chunks them into smaller pieces automatically.
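
To build intuition for the chunking step, here is a minimal sketch of a sliding-window splitter using the same numbers as this notebook (512 tokens per chunk, 50 tokens of overlap). It is an assumption-laden illustration rather than Llama Stack's actual chunker: `chunk_tokens` is a hypothetical helper, and placeholder strings stand in for real tokenizer tokens.

```python
def chunk_tokens(tokens: list[str], max_chunk_size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of at most max_chunk_size tokens.

    Each window repeats the last `overlap` tokens of the previous one, so a
    sentence that straddles a boundary still appears whole in some chunk.
    """
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be non-negative and smaller than max_chunk_size")
    step = max_chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_chunk_size])
        if start + max_chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Placeholder "tokens" (assumption: a real pipeline would use the model tokenizer)
words = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(words)
print([len(c) for c in chunks])
```

The overlap costs a little extra storage but reduces the chance that an answer is cut in half at a chunk boundary.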

In [None]:
# List of URLs (in this case just 1) to process
urls = [
    "https://raw.githubusercontent.com/rhoai-genaiops/deploy-lab/main/university-data/canopy-in-botany.pdf",
]

# Display what we're about to ingest
print("📖 Ingesting documents into RAG system...")
for i, url in enumerate(urls):
    print(f"  • Document {i+1}: {url}")

# Download and upload files using the new API
import requests
from io import BytesIO

try:
    uploaded_file_ids = []
    
    for i, url in enumerate(urls):
        print(f"\n📥 Downloading document from: {url}")
        response = requests.get(url, timeout=60)  # Avoid hanging indefinitely on a stalled download
        response.raise_for_status()
        
        # Create a file-like object from the downloaded content
        file_content = BytesIO(response.content)
        file_content.name = f"canopy-in-botany-{i}.pdf"
        
        # Upload file using the new files API
        uploaded_file = client.files.create(
            file=file_content,
            purpose="assistants"  # Required purpose parameter
        )
        
        uploaded_file_ids.append(uploaded_file.id)
        print(f"✅ Uploaded file with ID: {uploaded_file.id}")
    
    # Add files to the vector store with chunking configuration
    for file_id in uploaded_file_ids:
        client.vector_stores.files.create(
            vector_store_id=vs.id,
            file_id=file_id,
            chunking_strategy={
                "type": "static",
                "static": {
                    "max_chunk_size_tokens": 512,
                    "chunk_overlap_tokens": 50
                }
            }
        )
        print(f"✅ Added file {file_id} to vector store with chunking")
    
    print("\n✅ Document ingestion complete!")
    print("🎯 Your documents are now searchable via semantic similarity!")
    
except Exception as e:
    print(f"\n❌ Document ingestion failed: {e}")
    print("💡 This might be due to PDF processing issues or network connectivity. Try with different documents or check the PDF accessibility.")

# Choice time! 🙋‍♂️

Now that we have ingested a document, you can either try out RAG here in the notebook or use the Llama Stack Playground.

🧾 If you prefer the notebook, continue to the next cell.

🦙 If you wish to use the Llama Stack Playground, open it up (here is the route if you have closed it: `https://llama-stack-playground-<USER_NAME>-test.<CLUSTER_DOMAIN>`) and, in the left menu under `Document Collections`, select the collection we just ingested (there should be only one, named something like `test_vector_db_1234..`).

Then try these questions (with and without the document selected):

- What are the types of Canopy?
- What is the structure of Canopy?



![genaiops-rag-meme.png](genaiops-rag-meme.png)

In [None]:
queries = [
    "What are the types of Canopy?",
    "What is the structure of Canopy?",
]

## First, without RAG
Let's start by testing the response without RAG in the picture so we have something to compare with.  
Notice that this closely mirrors the code we used earlier to send a prompt to the model, system prompt included.

In [None]:
for prompt in queries:
    cprint(f"\nUser> {prompt}", "blue")

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]

    response = client.chat.completions.create(
        model=model_id,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
        stream=True,
    )

    cprint("inference> ", color="magenta", end='')
    
    # Handle streaming response - tokens arrive one by one
    for chunk in response:
        if hasattr(chunk, 'choices') and chunk.choices:
            delta = chunk.choices[0].delta
            if hasattr(delta, 'content') and delta.content:
                cprint(delta.content, color="magenta", end='')
        elif hasattr(chunk, 'content'):
            cprint(chunk.content, color="magenta", end='')
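
The streaming loop above prints each piece of text as it arrives. If you also want the full answer as a single string (for logging or comparison, say), the same logic can collect the pieces instead. Below is a minimal sketch, assuming chunks shaped like the ones handled above (`chunk.choices[0].delta.content`). `collect_stream` and `fake_chunk` are hypothetical helpers, and the stream is simulated so that the cell runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        choices = getattr(chunk, "choices", None)
        if not choices:
            continue  # skip chunks that carry no choices
        content = getattr(choices[0].delta, "content", None)
        if content:  # skip deltas with no text (None or empty string)
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like one streamed chunk (illustration only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


# Simulated stream: a delta without text, two text pieces, and a chunk without choices
simulated_stream = [
    fake_chunk(None),
    fake_chunk("A canopy "),
    fake_chunk("is the upper layer."),
    SimpleNamespace(choices=[]),
]
print(collect_stream(simulated_stream))
```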

## 🔍 Testing RAG Retrieval and Generation

Now let's test the complete **RAG Pipeline** - this demonstrates how all three layers work together:

### The RAG Process:
1. **🔍 Query Processing**: Convert user question into embeddings
2. **📚 Semantic Retrieval**: Find most similar document chunks in vector database
3. **🔗 Context Assembly**: Combine user question with retrieved chunks
4. **🤖 Generation**: LLM generates informed response using both its training and the retrieved context
5. **📖 Citation**: Response includes references to source documents

**Why this works better than an LLM on its own:**
- **Grounded responses**: Answers are based on your specific documents
- **Up-to-date**: Add new documents without retraining the model
- **Traceable**: Every answer can be traced back to source material
- **Accurate**: Reduces hallucination by providing factual context
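
Steps 1 to 3 of the process above can be sketched end to end in plain Python. This is a toy illustration under loud assumptions: word counts stand in for a real embedding model, an in-memory list stands in for the vector database, and the sample chunks are made-up sentences rather than text from the ingested PDF. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Steps 1 and 2: embed the query and return the k most similar chunks."""
    query_vector = embed(query)
    return sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)[:k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Step 3: combine the question with the retrieved chunks."""
    context = "\n\n".join(context_chunks)
    return f"Please answer the given query using the context below.\n\nCONTEXT:\n{context}\n\nQUERY:\n{query}"


# Made-up sample chunks, for illustration only (not taken from the PDF)
sample_chunks = [
    "Canopy types include closed canopy and open canopy forests.",
    "Photosynthesis converts light into chemical energy.",
    "The canopy structure has emergent, upper and understory layers.",
]

query = "What are the types of canopy?"
top_chunks = retrieve(query, sample_chunks, k=1)
print(build_prompt(query, top_chunks))
```

A real embedding model captures meaning rather than exact word overlap, so it can match a question to a chunk even when they share few words. The ranking-by-similarity idea is the same.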

In [None]:
for prompt in queries:
    cprint(f"\nUser> {prompt}", "blue")
    
    # Query the vector database to find relevant document chunks
    search_results = client.vector_stores.search(
        vector_store_id=vs.id,
        query=prompt,
        max_num_results=5,
        search_mode="vector"  # Use vector similarity search
    )

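    retrieved_chunks = []
    for i, result in enumerate(search_results.data):
        # `content` may be a plain string or a list of parts carrying a `.text` field
        content = getattr(result, 'content', None)
        if isinstance(content, list):
            chunk_content = "\n".join(getattr(part, 'text', str(part)) for part in content)
        else:
            chunk_content = content if content is not None else str(result)
        # Metadata may live under `metadata` or `attributes`, depending on the client version
        metadata = getattr(result, 'metadata', None) or getattr(result, 'attributes', None) or {}
        retrieved_chunks.append(f"Result {i+1}\nContent: {chunk_content}\nMetadata: {metadata}")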
    
    rag_response_content = "\n\n".join(retrieved_chunks)
    
    cprint(f"Text chunks from vector search found: {len(search_results.data)} chunks")
    cprint("\n--- Search Results ---", "yellow")
    # Show at most the first 500 characters of the retrieved context
    preview = (rag_response_content[:500] + "...") if len(rag_response_content) > 500 else rag_response_content
    cprint(preview, "cyan")

    
    messages = [
        {"role": "system", "content": "You are a helpful assistant."}
    ]

    # Inject the retrieved content into the prompt as context.
    # This is the key step in RAG: it is where the LLM receives the document information.
    extended_prompt = f"Please answer the given query using the context below.\n\nCONTEXT:\n{rag_response_content}\n\nQUERY:\n{prompt}"
    messages.append({"role": "user", "content": extended_prompt})

    response = client.chat.completions.create(
        model=model_id,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
        stream=True,
    )

    cprint("inference> ", color="magenta", end='')
    
    # Handle streaming response - tokens arrive one by one
    for chunk in response:
        if hasattr(chunk, 'choices') and chunk.choices:
            delta = chunk.choices[0].delta
            if hasattr(delta, 'content') and delta.content:
                cprint(delta.content, color="magenta", end='')
        elif hasattr(chunk, 'content'):
            cprint(chunk.content, color="magenta", end='')

## 🎉 You've Built a Complete RAG System!

Buuut... it's running on an inline vector database, with no automation or redundancy, and it is not connected to our application yet.  
Let's go through the steps to move this from a proof of concept to a production-ready system!