# 🦙 Llama Stack & RAG: Building Intelligent Agents

This notebook demonstrates **Retrieval-Augmented Generation (RAG)** - a powerful technique that enables AI models to access and reason about external documents and knowledge bases.

**Why RAG Matters:**
Instead of relying only on training data, RAG-enhanced models can reference your specific documents, course materials, and knowledge bases to provide accurate, cited responses.

In [1]:
!pip install -q llama_stack_client fire dotenv


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [None]:
import uuid

from llama_stack_client import RAGDocument, LlamaStackClient
from termcolor import cprint

import sys
sys.path.append('..')

## 🔗 Connect to Llama Stack

Connect to Llama Stack - the AI engine that orchestrates all RAG operations. Llama Stack acts as the central hub that coordinates:
- Vector database operations (storage and retrieval)
- Document processing and chunking
- LLM inference with retrieved context
- Agent workflows and tool usage

In [None]:
# The base URL points to your Llama Stack server deployment
base_url = "http://llama-stack-service:8321"

# Create the Llama Stack client
client = LlamaStackClient(
    base_url=base_url,
    provider_data=None
)

print(f"Llama Stack client configured for {base_url}")

# Configs for model and sampling
model_id = "llama32"
temperature = 0.0
max_tokens = 512
stream = False

# Configure the sampling strategy based on temperature
if temperature > 0.0:
    top_p = 0.95  # Nucleus sampling parameter
    strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
else:
    strategy = {"type": "greedy"}  # Always pick most likely token

# Package sampling parameters for the inference API
sampling_params = {
    "strategy": strategy,
    "max_tokens": max_tokens,
}

print(f"Model: {model_id}")
print(f"Sampling Parameters: {sampling_params}")
print(f"Stream: {stream}")
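To make the two strategies concrete, here is a minimal sketch of greedy selection and top-p (nucleus) sampling over a toy next-token distribution. The function names (`greedy_pick`, `top_p_candidates`, `top_p_sample`) and the probabilities are illustrative assumptions, not part of the Llama Stack API; the server performs the real sampling based on `sampling_params`.

```python
import random


def greedy_pick(probs):
    """Greedy strategy: always return the single most likely token."""
    return max(probs, key=probs.get)


def top_p_candidates(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        total += p
        if total >= top_p:
            break
    return kept


def top_p_sample(probs, top_p, temperature=1.0, rng=None):
    """Sample one token from the nucleus after temperature rescaling."""
    rng = rng or random.Random()
    # Temperature reshapes the distribution: p ** (1 / T), then renormalize
    scaled = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    norm = sum(scaled.values())
    scaled = {t: p / norm for t, p in scaled.items()}
    kept = top_p_candidates(scaled, top_p)
    tokens = [t for t, _ in kept]
    weights = [p for _, p in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token probabilities, for illustration only
toy_probs = {"tree": 0.5, "leaf": 0.3, "root": 0.15, "bark": 0.05}
print("greedy:", greedy_pick(toy_probs))
print("nucleus:", [t for t, _ in top_p_candidates(toy_probs, 0.9)])
```

With `temperature = 0.0` the notebook takes the greedy branch, so repeated runs should give the same answer; raising the temperature switches to the sampled branch and makes outputs vary.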

## Register the Vector Database

In [None]:
# Generate a unique identifier to ensure there are no conflicts if multiple users run this notebook simultaneously
vector_db_id = f"test_vector_db_{uuid.uuid4()}"
print(f"📊 Created vector database ID: {vector_db_id}")

# This tells Llama Stack how to connect to and use your vector database
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,                      # Vector size (must match model output)
    provider_id="milvus",                         # Use our inline Vector database, as we set it up in Llama Stack
)
print(f"✅ Registered vector database with inline backend")

## 📚 Document Ingestion and Processing

This is where the **RAG Layer** comes into action! We'll use Llama Stack's RAG Tool to automatically:

1. **Download** documents from URLs
2. **Process** PDF content and extract text
3. **Chunk** large documents into smaller pieces (up to 512 tokens each)
4. **Embed** each chunk using the embedding model
5. **Store** vectors and metadata in the vector database

**Two ways to ingest documents:**
- **Direct Vector IO**: Insert pre-processed chunks directly into your Vector Database
- **Llama Stack RAG Tool** (what we're using): Automatic processing from URLs or files

We use the RAG Tool because it ingests documents directly from URLs or files and chunks them automatically.
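The chunking step can be sketched in a few lines. This is a simplified illustration, not Llama Stack's actual chunker: it assumes whitespace-separated words as a stand-in for model tokens, and the `chunk_tokens` name and the `overlap` parameter are hypothetical.

```python
def chunk_tokens(text, chunk_size=512, overlap=0):
    """Split text into chunks of at most chunk_size whitespace-separated tokens.

    Consecutive chunks share `overlap` tokens so that sentences cut at a
    boundary still appear in full in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(tokens):
            break
    return chunks


# A synthetic 1200-"token" document for illustration
sample = " ".join(f"w{i}" for i in range(1200))
for i, chunk in enumerate(chunk_tokens(sample, chunk_size=512)):
    print(f"chunk {i}: {len(chunk.split())} tokens")
```

Smaller chunks tend to give more precise retrieval hits but less surrounding context per hit; 512 is the value passed as `chunk_size_in_tokens` in the ingestion cell.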

In [None]:
# List of URLs (in this case just 1) to process
urls = [
    ("https://raw.githubusercontent.com/rhoai-genaiops/deploy-lab/main/university-data/canopy-in-botany.pdf", "application/pdf"),
]

# RAGDocument is Llama Stack's format for documents to be ingested
documents = [
    RAGDocument(
        document_id=f"doc-{i}",
        content=url,
        mime_type=url_type,
        metadata={
            "source_url": url,
            "document_type": "academic_material",
        },
    )
    for i, (url, url_type) in enumerate(urls)
]

# Display what we're about to ingest
print("📖 Ingesting documents into RAG system...")
for i, (url, url_type) in enumerate(urls):
    print(f"  • Document {i+1}: {url}")

# Then we automatically download, chunk, and store our document(s)
try:
    client.tool_runtime.rag_tool.insert(
        documents=documents,
        vector_db_id=vector_db_id,
        chunk_size_in_tokens=512,
    )
    print("\n✅ Document ingestion complete!")
    print("🎯 Your documents are now searchable via semantic similarity!")
except Exception as e:
    print(f"\n❌ Document ingestion failed: {e}")
    print("💡 This might be due to PDF processing issues. Try with different documents or check the PDF accessibility.")
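When the list of URLs grows, typing each MIME type by hand gets error-prone. One option, sketched below, is to infer it from the file extension with the standard library's `mimetypes` module. The helper name `build_document_specs` is hypothetical, and it returns plain dictionaries that mirror the `RAGDocument` fields used above rather than real `RAGDocument` objects.

```python
import mimetypes


def build_document_specs(urls, document_type="academic_material"):
    """Build one document spec per URL, inferring the MIME type from the extension."""
    specs = []
    for i, url in enumerate(urls):
        mime_type, _encoding = mimetypes.guess_type(url)
        if mime_type is None:
            # Fail early rather than sending an unknown type to the ingestion step
            raise ValueError(f"Cannot infer MIME type for {url}")
        specs.append(
            {
                "document_id": f"doc-{i}",
                "content": url,
                "mime_type": mime_type,
                "metadata": {"source_url": url, "document_type": document_type},
            }
        )
    return specs


specs = build_document_specs(
    ["https://raw.githubusercontent.com/rhoai-genaiops/deploy-lab/main/university-data/canopy-in-botany.pdf"]
)
print(specs[0]["document_id"], specs[0]["mime_type"])
```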

## Choice time! 🙋‍♂️

Now that we have ingested a document, you can either try out RAG here in the notebook or use the Llama Playground.

🧾 If you prefer the notebook, continue to the next cell.

🦙 If you wish to use the Llama Playground, open it (here is the route if you have closed it: `https://milvus-test-attu-<USER_NAME>-test.<CLUSTER_DOMAIN>`) and, in the left menu, choose the collection we just ingested under `Document Collections` (there should be only one).

Then try these questions (with and without the document selected):

- What are the types of Canopy?
- What is the structure of Canopy?



![genaiops-rag-meme.png](genaiops-rag-meme.png)

In [None]:
queries = [
    "What are the types of Canopy?",
    "What is the structure of Canopy?",
]

## First, without RAG
First, let's test the response without RAG so we have a baseline to compare against.  
Notice that this is very similar to the code we used before to send a prompt to the model, system prompt included.

In [None]:
for prompt in queries:
    cprint(f"\nUser> {prompt}", "blue")

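    # Send the system prompt plus the user's question (no retrieved context yet)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]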

    response = client.inference.chat_completion(
        messages=messages,
        model_id=model_id,
        sampling_params=sampling_params,
        stream=True,
    )

    cprint("inference> ", color="magenta", end='')
    
    # Handle streaming response - tokens arrive one by one
    for event in response:
        ev = getattr(event, "event", event)
        delta = getattr(ev, "delta", None)

        if delta is None:
            continue

        text = getattr(delta, "text", None)
        if isinstance(text, str):
            cprint(text, color="magenta", end='')
            continue

        tool_call = getattr(delta, "tool_call", None)
        if tool_call is not None:
            cprint(str(tool_call), color="magenta", end='')
            continue
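The streaming loop above can be factored into a small helper that gathers the text deltas into one string, which is handy when you want to store or compare answers instead of only printing them. This is a sketch: `collect_stream_text` is a hypothetical name, and the fake events below only mimic the attribute layout the loop reads (`event.delta.text`); they are not the client's real response types.

```python
from types import SimpleNamespace


def collect_stream_text(events):
    """Concatenate the text deltas found in an iterable of streaming events."""
    parts = []
    for event in events:
        # Some responses wrap the payload in an `.event` attribute, others do not
        ev = getattr(event, "event", event)
        delta = getattr(ev, "delta", None)
        text = getattr(delta, "text", None)
        if isinstance(text, str):
            parts.append(text)
    return "".join(parts)


# Stand-in events for illustration; a real run would pass the `response` iterator
fake_events = [
    SimpleNamespace(event=SimpleNamespace(delta=SimpleNamespace(text="Hel"))),
    SimpleNamespace(event=SimpleNamespace()),  # no delta: skipped
    SimpleNamespace(event=SimpleNamespace(delta=SimpleNamespace(text="lo"))),
]
print(collect_stream_text(fake_events))
```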

## 🔍 Testing RAG Retrieval and Generation

Now let's test the complete **RAG Pipeline** - this demonstrates how all three layers work together:

### The RAG Process:
1. **🔍 Query Processing**: Convert user question into embeddings
2. **📚 Semantic Retrieval**: Find most similar document chunks in vector database  
3. **🔗 Context Assembly**: Combine user question with retrieved chunks
4. **🤖 Generation**: LLM generates informed response using both its training and the retrieved context
5. **📖 Citation**: Response can reference source documents through the metadata returned with each chunk

**Why this works better than a standalone LLM:**
- **Grounded responses**: Answers are based on your specific documents
- **Up-to-date**: Add new documents without retraining the model
- **Traceable**: Retrieved chunks carry metadata that points back to the source material
- **Accurate**: Reduces hallucination by providing factual context
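Steps 1 and 2 of the process above can be illustrated without a server. The sketch below is a toy, under loud assumptions: `embed` uses bag-of-words counts as a stand-in for a real embedding model such as `all-MiniLM-L6-v2`, the sample chunks are invented, and a plain list replaces the vector database. Only the ranking idea (cosine similarity between query and chunk vectors, keep the top k) carries over.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]


# Invented sample chunks, for illustration only
sample_chunks = [
    "The forest canopy is the upper layer formed by tree crowns.",
    "Emergent trees rise above the canopy layer.",
    "Pasta is cooked in salted boiling water.",
]
for chunk in retrieve("What is the forest canopy?", sample_chunks):
    print(chunk)
```

A real embedding model captures meaning rather than word overlap, so it can match a question to a chunk even when they share few exact words.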

In [None]:
for prompt in queries:
    cprint(f"\nUser> {prompt}", "blue")
    
    # Query the vector database to find relevant document chunks
    rag_response = client.tool_runtime.rag_tool.query(
        content=prompt,
        vector_db_ids=[vector_db_id],
        query_config={
            "chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
        },
    )

    cprint(f"Text chunks from RAG found: {rag_response}")
    cprint(f"\n--- RAG Metadata ---", "yellow")
    cprint(rag_response.metadata, "cyan")

    
    messages = [
        {"role": "system", "content": "You are a helpful assistant."}
    ]

    # Inject the retrieved content into the prompt as context.
    # This is the key step in RAG: it is where the LLM receives the document information.
    prompt_context = rag_response.content
    extended_prompt = f"Please answer the given query using the context below.\n\nCONTEXT:\n{prompt_context}\n\nQUERY:\n{prompt}"
    messages.append({"role": "user", "content": extended_prompt})

    response = client.inference.chat_completion(
        messages=messages,
        model_id=model_id,
        sampling_params=sampling_params,
        stream=True,
    )

    cprint("inference> ", color="magenta", end='')
    
    # Handle streaming response - tokens arrive one by one
    for event in response:
        ev = getattr(event, "event", event)
        delta = getattr(ev, "delta", None)

        if delta is None:
            continue

        text = getattr(delta, "text", None)
        if isinstance(text, str):
            cprint(text, color="magenta", end='')
            continue

        tool_call = getattr(delta, "tool_call", None)
        if tool_call is not None:
            cprint(str(tool_call), color="magenta", end='')
            continue
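Since the two loops differ only in how the user message is built, the message construction can be shared. A minimal sketch follows, assuming a hypothetical helper named `build_messages`; the prompt wording is copied from the cell above.

```python
def build_messages(query, context=None, system_prompt="You are a helpful assistant."):
    """Build chat messages, optionally wrapping the query with retrieved context."""
    messages = [{"role": "system", "content": system_prompt}]
    if context:
        # RAG case: prepend the retrieved chunks so the model can ground its answer
        user_content = (
            "Please answer the given query using the context below.\n\n"
            f"CONTEXT:\n{context}\n\nQUERY:\n{query}"
        )
    else:
        # Baseline case: send the question as-is
        user_content = query
    messages.append({"role": "user", "content": user_content})
    return messages


baseline = build_messages("What are the types of Canopy?")
with_rag = build_messages("What are the types of Canopy?", context="Result 1\nContent: ...")
print(baseline[1]["content"])
print(with_rag[1]["content"])
```

Keeping both variants in one function makes the with/without comparison fairer, because the system prompt and message layout stay identical and only the context changes.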

## 🎉 You've Built a Complete RAG System!

Buuut... it's running in an inline vector database, with no automation or redundancy, and is not connected to our application yet.  
Let's go through the steps to move this from a proof of concept to a production-ready system!