# Level 1: Foundational RAG

This tutorial presents an example of executing queries with foundational (i.e., non-agentic) RAG in Llama Stack. It shows how the APIs provided by Llama Stack can be used to directly control and invoke all RAG stages, including indexing, retrieval and inference. 
For an agentic RAG tutorial, please refer to [Level3_agentic_RAG.ipynb](demos/rag_agentic/notebooks/Level3_agentic_RAG.ipynb).

## Overview

This tutorial covers the following steps:
1. Indexing a collection of documents in a vector DB for later retrieval.
2. Executing the built-in RAG tool to retrieve the document chunks relevant to a given query.
3. Using the retrieved context to answer user queries during the inference step.
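
To make the three steps concrete before involving a server, here is a minimal, self-contained sketch of the index/retrieve flow. It is a toy illustration only: the bag-of-words "embedding", the `embed`, `cosine` and `retrieve` helpers, and the sample chunks are all hypothetical stand-ins, not part of Llama Stack, which uses a real embedding model and a vector DB.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words term-frequency vector (assumption, not a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list, top_k: int = 2) -> list:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


# Step 1: index a (hypothetical) collection of chunks
chunks = [
    "install the oc command line tool",
    "create a cluster before deploying applications",
    "bananas are rich in potassium",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Step 2: retrieve the chunk most relevant to the query
context = retrieve("how to install the oc tool", index, top_k=1)

# Step 3: the retrieved context would then be placed into the model prompt
print(context)
```

The rest of this tutorial performs the same steps through the Llama Stack APIs instead of these toy helpers.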

## Prerequisites

Before starting, ensure you have a running instance of the Llama Stack server (local or remote) with at least one preconfigured vector DB. For more information, please refer to the corresponding [Llama Stack tutorials](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html).

## Setting the Environment Variables

Copy [`.env.example`](../../../.env.example) to a new file named `.env` and set all the relevant environment variables listed below.

In addition to the environment variables listed in the ["Getting Started" notebook](demos/rag_agentic/notebooks/Level0_getting_started_with_Llama_Stack.ipynb), the following should be provided for this demo to run:
 - `VDB_PROVIDER`: the vector DB provider to be used. Must be supported by Llama Stack. For this demo, we use Milvus Lite, which is our preferred solution.
 - `VDB_EMBEDDING`: the embedding model to be used for ingestion and retrieval. For this demo, we use all-MiniLM-L6-v2.
 - `VDB_EMBEDDING_DIMENSION` (optional): the dimension of the embedding. Defaults to 384.
 - `VECTOR_DB_CHUNK_SIZE` (optional): the chunk size for the vector DB. Defaults to 512.
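
As a rough sketch, the variables above could be collected and validated in one place before the demo runs. The `load_vdb_settings` helper is hypothetical (it is not part of the notebook or of Llama Stack), it assumes the two variables not marked optional are required, and the `"milvus"` provider value in the example is a placeholder.

```python
import os
from typing import Mapping


def load_vdb_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the vector DB settings, applying the defaults documented above."""
    # Assumption: variables not marked "(optional)" are treated as required
    missing = [name for name in ("VDB_PROVIDER", "VDB_EMBEDDING") if not env.get(name)]
    if missing:
        raise ValueError(f"Missing required environment variables: {', '.join(missing)}")
    return {
        "provider_id": env["VDB_PROVIDER"],
        "embedding_model": env["VDB_EMBEDDING"],
        "embedding_dimension": int(env.get("VDB_EMBEDDING_DIMENSION", 384)),
        "chunk_size_in_tokens": int(env.get("VECTOR_DB_CHUNK_SIZE", 512)),
    }


# Example with an explicit mapping instead of the real process environment
settings = load_vdb_settings({"VDB_PROVIDER": "milvus", "VDB_EMBEDDING": "all-MiniLM-L6-v2"})
print(settings)
```

Failing early on a missing variable tends to give a clearer error than a later registration call receiving `None`.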

## 1. Setting Up the Environment
We will start with a few imports needed for this demo only.

In [1]:
import uuid

from llama_stack_client import RAGDocument
from llama_stack_client.types.shared.content_delta import TextDelta, ToolCallDelta

Next, we will initialize our environment as described in detail in our ["Getting Started" notebook](demos/rag_agentic/notebooks/Level0_getting_started_with_Llama_Stack.ipynb). Please refer to it for additional explanations.

In [2]:
# for accessing the environment variables
import os
from dotenv import load_dotenv
load_dotenv()

# for communication with Llama Stack
from llama_stack_client import LlamaStackClient

# pretty print of the results returned from the model/agent
import sys
sys.path.append('..')  
from src.utils import step_printer
from termcolor import cprint

remote = os.getenv("REMOTE", "True")

if remote == "False":
    local_port = os.getenv("LOCAL_SERVER_PORT", 8321)
    base_url = f"http://localhost:{local_port}"
else:  # any value not equal to 'False' is treated as 'True'
    base_url = os.getenv("REMOTE_BASE_URL")


# Tavily search API key is required for some of our demos and must be provided to the client upon initialization.
# We will cover it in the agentic demos that use the respective tool. Please ignore this parameter for all other demos.
tavily_search_api_key = os.getenv("TAVILY_SEARCH_API_KEY")
if tavily_search_api_key is None:
    provider_data = None
else:
    provider_data = {"tavily_search_api_key": tavily_search_api_key}


client = LlamaStackClient(
    base_url=base_url,
    provider_data=provider_data
)
    
print(f"Connected to Llama Stack server @ {base_url}")

# model_id will later be used to pass the name of the desired inference model to Llama Stack Agents/Inference APIs
model_id = os.getenv("INFERENCE_MODEL_ID")

temperature = float(os.getenv("TEMPERATURE", 0.0))
if temperature > 0.0:
    top_p = float(os.getenv("TOP_P", 0.95))
    strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
else:
    strategy = {"type": "greedy"}

max_tokens = int(os.getenv("MAX_TOKENS", 4096))

# sampling_params will later be used to pass the parameters to Llama Stack Agents/Inference APIs
sampling_params = {
    "strategy": strategy,
    "max_tokens": max_tokens,
}

stream_env = os.getenv("STREAM", "True")
# the Boolean 'stream' parameter will later be passed to Llama Stack Agents/Inference APIs
# any value not equal to 'False' is treated as 'True'
stream = (stream_env != "False")

print(f"Inference Parameters:\n\tModel: {model_id}\n\tSampling Parameters: {sampling_params}\n\tstream: {stream}")

Connected to Llama Stack server @ http://localhost:8321
Inference Parameters:
	Model: ibm-granite/granite-3.2-8b-instruct
	Sampling Parameters: {'strategy': {'type': 'greedy'}, 'max_tokens': 4096}
	stream: True
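
The parameter handling in the cell above can be summarized as two small pure functions. This is only a restatement of the notebook's logic for illustration; the names `build_sampling_params` and `parse_flag` are hypothetical and do not exist in Llama Stack.

```python
from typing import Optional


def build_sampling_params(temperature: float = 0.0, top_p: float = 0.95, max_tokens: int = 4096) -> dict:
    """Pick greedy decoding for temperature 0, top-p sampling otherwise."""
    if temperature > 0.0:
        strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
    else:
        strategy = {"type": "greedy"}
    return {"strategy": strategy, "max_tokens": max_tokens}


def parse_flag(value: Optional[str]) -> bool:
    """Mirror the notebook's convention: only the exact string 'False' disables the flag."""
    return value != "False"


print(build_sampling_params())      # greedy decoding, as in the output above
print(build_sampling_params(0.7))   # top-p sampling with the default top_p
print(parse_flag("False"), parse_flag(None))
```

Note that under this convention a value such as `false` (lowercase) still counts as enabled, so the exact spelling in `.env` matters.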


Finally, we will initialize the document collection to be used for RAG ingestion and retrieval.

In [3]:
vector_db_id = f"test_vector_db_{uuid.uuid4()}"

## 2. Indexing the Documents
- Initialize a new document collection in the target vector DB. All parameters related to the vector DB, such as the embedding model and dimension, must be specified here.
- Provide a list of document URLs to the RAG tool. Llama Stack will handle fetching, conversion and chunking of the documents' content.

In [5]:
# define and register the document collection to be used
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model=os.getenv("VDB_EMBEDDING"),
    embedding_dimension=int(os.getenv("VDB_EMBEDDING_DIMENSION", 384)),
    provider_id=os.getenv("VDB_PROVIDER"),
)

# ingest the documents into the newly created document collection
urls = [
    ("https://www.openshift.guide/openshift-guide-screen.pdf", "application/pdf"),
]
documents = [
    RAGDocument(
        document_id=f"num-{i}",
        content=url,
        mime_type=url_type,
        metadata={},
    )
    for i, (url, url_type) in enumerate(urls)
]
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=int(os.getenv("VECTOR_DB_CHUNK_SIZE", 512)),
)
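
To give a feel for what the `chunk_size_in_tokens` parameter controls, the sketch below splits text into fixed-size windows. It is an approximation under stated assumptions: whitespace-separated words stand in for model tokens, the `overlap` parameter is added purely for illustration, and the `chunk_tokens` helper is hypothetical. The tokenizer and chunking logic that Llama Stack actually applies may differ.

```python
def chunk_tokens(text: str, chunk_size: int = 512, overlap: int = 0) -> list:
    """Split text into windows of at most chunk_size whitespace tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # stop once a window has reached the end of the text
        if start + chunk_size >= len(tokens):
            break
    return chunks


sample = "a b c d e f g h i j"
print(chunk_tokens(sample, chunk_size=4))
print(chunk_tokens(sample, chunk_size=4, overlap=2))
```

Smaller chunks generally give more precise retrieval hits but less surrounding context per hit, which is the trade-off behind tuning `VECTOR_DB_CHUNK_SIZE`.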

## 3. Executing Queries via the Built-in RAG Tool
- Directly invoke the RAG tool to query the vector DB we ingested into at the previous stage.
- Construct an extended prompt using the retrieved chunks.
- Query the model with the extended prompt.
- Output the reply received from the model.
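
The prompt-construction step can also be isolated into a small helper, which makes the template easy to inspect or unit-test without a running server. The `build_rag_messages` name is hypothetical, and the sketch assumes the retrieved context has already been converted to a plain string.

```python
def build_rag_messages(query: str, context: str, system_prompt: str = "You are a helpful assistant.") -> list:
    """Build the message list: a system prompt followed by the context-extended user prompt."""
    extended_prompt = (
        "Please answer the given query using the context below.\n\n"
        f"CONTEXT:\n{context}\n\nQUERY:\n{query}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": extended_prompt},
    ]


# Placeholder context; in the notebook this comes from the RAG tool response
messages = build_rag_messages("How to install OpenShift?", "<retrieved chunks go here>")
for message in messages:
    print(f"{message['role']}: {message['content']}\n")
```

The cell below inlines the same template and sends the resulting messages to the inference API.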

In [10]:
queries = [
    "How to install OpenShift?",
]

for prompt in queries:
    cprint(f"\nUser> {prompt}", "blue")
    
    # RAG retrieval call
    rag_response = client.tool_runtime.rag_tool.query(content=prompt, vector_db_ids=[vector_db_id])

    # the list of messages to be sent to the model must start with the system prompt
    messages = [{"role": "system", "content": "You are a helpful assistant."}]

    # construct the actual prompt to be executed, incorporating the original query and the retrieved content
    prompt_context = rag_response.content
    extended_prompt = f"Please answer the given query using the context below.\n\nCONTEXT:\n{prompt_context}\n\nQUERY:\n{prompt}"
    messages.append({"role": "user", "content": extended_prompt})

    # use Llama Stack inference API to directly communicate with the desired model
    response = client.inference.chat_completion(
        messages=messages,
        model_id=model_id,
        sampling_params=sampling_params,
        stream=stream,
    )
    
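    # print the response
    cprint("inference> ", color="yellow", end='')
    if stream:
        # streaming mode: the response is an iterator of chunks carrying incremental deltas
        for chunk in response:
            response_delta = chunk.event.delta
            if isinstance(response_delta, TextDelta):
                cprint(response_delta.text, color="yellow", end='')
            elif isinstance(response_delta, ToolCallDelta):
                cprint(response_delta.tool_call, color="yellow", end='')
    else:
        # non-streaming mode: the full reply is available on the completion message
        cprint(response.completion_message.content, color="yellow")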

User> How to install OpenShift?
To install OpenShift, you can follow these steps:

1. **Download and Install the oc CLI Tool**: You need to download and install the `oc` command-line tool from the official Red Hat website.
2. **Create a Cluster**: Create an OpenShift cluster using one of the following methods:
* **Minishift**: Use Minishift, a tool

## Key Takeaways
This tutorial demonstrates how to set up and use the built-in RAG tool to ingest user-provided documents into a vector DB and then use them during inference via direct retrieval. Please check out our [complementary tutorial](demos/rag_agentic/notebooks/Level3_agentic_RAG.ipynb) for an agentic RAG example.