# Level 1: Simple RAG

This notebook shows you how to build a simple RAG application with Llama Stack. You will learn how the APIs provided by Llama Stack can be used to directly control and invoke all common RAG stages, including indexing, retrieval, and inference.

_Note: This notebook contains a non-agentic implementation of RAG. We will show you how to build an agentic RAG application later in this tutorial in [Level4_RAG_agent](Level4_RAG_agent.ipynb)._

## Overview

This tutorial covers the following steps:
1. Indexing a collection of documents into a vector database for later retrieval.
2. Executing the built-in RAG tool to retrieve the document chunks relevant to a given query.
3. Using the retrieved context to answer user queries during the inference step.
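
To make the three stages concrete before the Llama Stack calls appear, here is a minimal, dependency-free sketch. It is not how Llama Stack implements RAG: word overlap stands in for embedding similarity, and `index_documents`, `retrieve`, and `build_prompt` are hypothetical helper names used only for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_documents(docs: dict[str, str]) -> dict[str, Counter]:
    """Stage 1 (indexing): store a bag-of-words vector per document."""
    return {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}


def retrieve(index: dict[str, Counter], query: str, top_k: int = 1) -> list[str]:
    """Stage 2 (retrieval): rank documents by word overlap with the query."""
    query_words = Counter(tokenize(query))
    scores = {
        doc_id: sum((words & query_words).values())
        for doc_id, words in index.items()
    }
    ranked = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context: str) -> str:
    """Stage 3 (inference input): combine the retrieved context and the query."""
    return f"CONTEXT:\n{context}\n\nQUERY:\n{query}"


# Toy documents, invented for this illustration only.
docs = {
    "doc-a": "The canopy is the upper layer of a forest.",
    "doc-b": "Vector databases store embeddings for similarity search.",
}
index = index_documents(docs)
top_ids = retrieve(index, "What is a forest canopy?")
prompt = build_prompt("What is a forest canopy?", docs[top_ids[0]])
```

In the rest of this notebook, a vector database and an embedding model take over the indexing and retrieval stages, and a Llama model consumes the prompt.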

## 1. Setting Up this Notebook

First, we install the required packages and import the modules used throughout this notebook.

In [2]:
!pip install llama_stack_client fire python-dotenv

In [37]:
import uuid

from llama_stack_client import RAGDocument
from llama_stack_client.types.shared.content_delta import TextDelta, ToolCallDelta
import base64
import requests


Next, we will initialize our environment as described in detail in our ["Getting Started" notebook](Level0_getting_started_with_Llama_Stack.ipynb). Please refer to it for additional explanations.

In [3]:
# for accessing the environment variables
import os
from dotenv import load_dotenv
load_dotenv()

# for communication with Llama Stack
from llama_stack_client import LlamaStackClient

# pretty print of the results returned from the model/agent
import sys
sys.path.append('..')  
from src.utils import step_printer
from termcolor import cprint

base_url = "http://llama-stack.genaiops-rag.svc.cluster.local:80" #os.getenv("REMOTE_BASE_URL")

# Tavily search API key is required for some of our demos and must be provided to the client upon initialization.
# We will cover it in the agentic demos that use the respective tool. Please ignore this parameter for all other demos.
tavily_search_api_key = os.getenv("TAVILY_SEARCH_API_KEY")
if tavily_search_api_key is None:
    provider_data = None
else:
    provider_data = {"tavily_search_api_key": tavily_search_api_key}


client = LlamaStackClient(
    base_url=base_url,
    provider_data=provider_data
)
    
print("Connected to Llama Stack server")

# model_id for the model you wish to use that is configured with the Llama Stack server
model_id = "llama32-3b"

temperature = float(os.getenv("TEMPERATURE", 0.0))
if temperature > 0.0:
    top_p = float(os.getenv("TOP_P", 0.95))
    strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
else:
    strategy = {"type": "greedy"}

max_tokens = int(os.getenv("MAX_TOKENS", 4096))

# sampling_params will later be used to pass the parameters to Llama Stack Agents/Inference APIs
sampling_params = {
    "strategy": strategy,
    "max_tokens": max_tokens,
}

stream_env = os.getenv("STREAM", "True")
# the Boolean 'stream' parameter will later be passed to Llama Stack Agents/Inference APIs
# any value other than the exact string 'False' is treated as True
stream = (stream_env != "False")

print(f"Inference Parameters:\n\tModel: {model_id}\n\tSampling Parameters: {sampling_params}\n\tstream: {stream}")

Connected to Llama Stack server
Inference Parameters:
	Model: llama32-3b
	Sampling Parameters: {'strategy': {'type': 'greedy'}, 'max_tokens': 4096}
	stream: True
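
The environment-driven configuration above can be hard to reason about when it reads directly from `os.environ`. As a sketch, the same rules can be expressed as pure functions over a plain dictionary, which makes the greedy versus top-p switch and the `STREAM` rule easy to check. The function names `build_sampling_params` and `parse_stream_flag` are hypothetical and not part of Llama Stack.

```python
def build_sampling_params(env: dict[str, str]) -> dict:
    """Mirror the notebook's logic: greedy at temperature 0, top-p otherwise."""
    temperature = float(env.get("TEMPERATURE", 0.0))
    if temperature > 0.0:
        top_p = float(env.get("TOP_P", 0.95))
        strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
    else:
        strategy = {"type": "greedy"}
    return {"strategy": strategy, "max_tokens": int(env.get("MAX_TOKENS", 4096))}


def parse_stream_flag(env: dict[str, str]) -> bool:
    """Same rule as the notebook: only the exact string 'False' disables streaming."""
    return env.get("STREAM", "True") != "False"


greedy = build_sampling_params({})
sampled = build_sampling_params({"TEMPERATURE": "0.7"})
```

Note that the comparison is case-sensitive: setting `STREAM=false` in lowercase leaves streaming enabled.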


Finally, we complete the setup by initializing the document collection we will use for RAG ingestion and retrieval.

In [4]:
vector_db_id = f"test_vector_db_{uuid.uuid4()}"

## 2. Indexing the Documents
- Initialize a new document collection in our vector database. All parameters related to the vector database, such as the embedding model and dimension, must be specified here.
- Provide a list of document URLs to the RAG tool. Llama Stack will handle fetching, converting, and chunking the content of the documents.
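
To give an intuition for the chunking step mentioned above, the sketch below splits a text into fixed-size windows. It is only an approximation under stated assumptions: whitespace-separated words stand in for model tokens, and the `overlap` parameter is a hypothetical addition. The tokenizer and overlap policy that Llama Stack actually applies may differ.

```python
def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of at most `chunk_size` words.

    Consecutive windows share `overlap` words so that a sentence cut at a
    boundary still appears intact in one of the chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
chunks = chunk_words(sample, chunk_size=4, overlap=1)
```

Smaller chunks give more precise retrieval hits but less surrounding context per hit, which is the trade-off behind the chunk-size setting used in the next cell.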

In [46]:
# define and register the document collection to be used
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model="all-MiniLM-L6-v2", #os.getenv("VDB_EMBEDDING"),
    embedding_dimension=int(os.getenv("VDB_EMBEDDING_DIMENSION", 384)),
    provider_id="milvus", #os.getenv("VDB_PROVIDER"),
)

# ingest the documents into the newly created document collection
urls = [
    ("https://raw.githubusercontent.com/rhoai-genaiops/deploy-lab/main/university-data/canopy-in-botany.pdf", "application/pdf"),
    # ("https://www.openshift.guide/openshift-guide-screen.pdf", "application/pdf"),
]
documents = [
    RAGDocument(
        document_id=f"num-{i}",
        content=url,
        mime_type=url_type,
        metadata={"source_url": url},
    )
    for i, (url, url_type) in enumerate(urls)
]

#image_url = "https://raw.githubusercontent.com/rhoai-genaiops/deploy-lab/main/university-data/canopy-image.jpg"
#image_url = "https://ehq-production-australia.imgix.net/4a07fa408fdd60d1d4052d0d83911f60696fd584/photos/images/000/037/330/original/Tree_Canopy_Health.png"
# image_url = "https://raw.githubusercontent.com/meta-llama/llama-stack/refs/heads/main/docs/_static/llama-stack.png"
# B64_ENCODED_IMAGE = base64.b64encode(requests.get(image_url).content).decode("utf-8")

# image_document = RAGDocument(
#     document_id="num-image-3",
#     content={
#         "type": "image",
#         "image": {"data": B64_ENCODED_IMAGE},
#     },
#     metadata={"source_url": image_url},
#     mime_type="image/jpeg",
# )

# Uncomment together with the image_document block above to also ingest the image.
# documents.append(image_document)

client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=int(os.getenv("VECTOR_DB_CHUNK_SIZE", 512)),
)

INFO:httpx:HTTP Request: POST http://llama-stack.genaiops-rag.svc.cluster.local/v1/vector-dbs "HTTP/1.1 200 OK"


INFO:httpx:HTTP Request: POST http://llama-stack.genaiops-rag.svc.cluster.local/v1/tool-runtime/rag-tool/insert "HTTP/1.1 200 OK"
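
The `urls` list in the cell above pairs each URL with a hand-written MIME type. When the list grows, the type can often be inferred from the file extension instead. The helper below is a hypothetical convenience built only on the Python standard library; it is not part of Llama Stack, and the `text/plain` fallback is an assumption you may want to change.

```python
import mimetypes
from urllib.parse import urlparse


def guess_mime_type(url: str, default: str = "text/plain") -> str:
    """Guess a MIME type from the file extension in the URL path."""
    path = urlparse(url).path
    mime_type, _encoding = mimetypes.guess_type(path)
    return mime_type or default


urls = [
    "https://raw.githubusercontent.com/rhoai-genaiops/deploy-lab/main/university-data/canopy-in-botany.pdf",
]
# Produces (url, mime_type) tuples in the same shape the ingestion cell expects.
url_pairs = [(url, guess_mime_type(url)) for url in urls]
```

Extension-based guessing does not inspect the file itself, so a URL without a recognizable extension falls back to the default.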


## 3. Executing Queries via the Built-in RAG Tool
- Directly invoke the RAG tool to query the vector database we ingested into at the previous stage.
- Construct an extended prompt using the retrieved chunks.
- Query the model with the extended prompt.
- Output the reply received from the model.
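
The second and third steps above are plain string handling, so they can be tried without a running server. The sketch below assumes the placeholders in `chunk_template` are filled in `str.format` style, with `index` starting at 1. The stand-in chunk objects and the helper names `render_context` and `build_messages` are hypothetical.

```python
from types import SimpleNamespace

CHUNK_TEMPLATE = "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n"


def render_context(chunks: list, template: str = CHUNK_TEMPLATE) -> str:
    """Fill the template once per retrieved chunk and join the results."""
    return "".join(
        template.format(index=i, chunk=chunk, metadata=chunk.metadata)
        for i, chunk in enumerate(chunks, start=1)
    )


def build_messages(query: str, context: str) -> list[dict]:
    """Build the system and user messages sent to the model."""
    extended_prompt = (
        "Please answer the given query using the context below.\n\n"
        f"CONTEXT:\n{context}\n\nQUERY:\n{query}"
    )
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": extended_prompt},
    ]


# Stand-in chunks; real ones come back from the retrieval call.
chunks = [
    SimpleNamespace(content="Canopy text A", metadata={"document_id": "num-0"}),
    SimpleNamespace(content="Canopy text B", metadata={"document_id": "num-0"}),
]
context = render_context(chunks)
messages = build_messages("What are the types of Canopy?", context)
```

Keeping prompt assembly in a small function like this makes it easy to experiment with different instructions or context layouts without touching the retrieval and inference calls.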

In [47]:
queries = [
    "What are the types of Canopy?",
]

for prompt in queries:
    cprint(f"\nUser> {prompt}", "blue")
    
    # RAG retrieval call
    rag_response = client.tool_runtime.rag_tool.query(
        content=prompt,
        vector_db_ids=[vector_db_id],
        query_config={
            "chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
        },
    )

    cprint("\n--- RAG Metadata ---", "yellow")
    cprint(rag_response.metadata, "cyan")

    # the list of messages to be sent to the model must start with the system prompt
    messages = [{"role": "system", "content": "You are a helpful assistant."}]

    # construct the actual prompt to be executed, incorporating the original query and the retrieved content
    prompt_context = rag_response.content
    extended_prompt = f"Please answer the given query using the context below.\n\nCONTEXT:\n{prompt_context}\n\nQUERY:\n{prompt}"
    messages.append({"role": "user", "content": extended_prompt})

    # use Llama Stack inference API to directly communicate with the desired model
    response = client.inference.chat_completion(
        messages=messages,
        model_id=model_id,
        sampling_params=sampling_params,
        stream=stream,
    )
    
    # print the response
    cprint("inference> ", color="magenta", end='')
    if stream:
        for chunk in response:
            response_delta = chunk.event.delta
            if isinstance(response_delta, TextDelta):
                cprint(response_delta.text, color="magenta", end='')
            elif isinstance(response_delta, ToolCallDelta):
                cprint(response_delta.tool_call, color="magenta", end='')
    else:
        cprint(response.completion_message.content, color="magenta")


User> What are the types of Canopy?


INFO:httpx:HTTP Request: POST http://llama-stack.genaiops-rag.svc.cluster.local/v1/tool-runtime/rag-tool/query "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://llama-stack.genaiops-rag.svc.cluster.local/v1/inference/chat-completion "HTTP/1.1 200 OK"


--- RAG Metadata ---
{'document_ids': ['num-0', 'num-0', 'num-0', 'num-0', 'num-0']}
inference> Based on the provided knowledge search tool results, the types of Canopy can be summarized as follows:

1. **Dense Canopy**: Forms a continuous or semi-continuous cover of foliage and branches that stretches above the forest floor, typically found in rainforests. This type of canopy supports a unique ecosystem