# Level 1: Simple RAG

This notebook shows you how to build a simple RAG application with Llama Stack. You will learn how the APIs provided by Llama Stack can be used to directly control and invoke all common RAG stages, including indexing, retrieval, and inference.

_Note: This notebook contains a non-agentic implementation of RAG. We will show you how to build an agentic RAG application later in this tutorial in [Level4_RAG_agent](Level4_RAG_agent.ipynb)._

## Overview

This tutorial covers the following steps:
1. Indexing a collection of documents into a vector database for later retrieval.
2. Executing the built-in RAG tool to retrieve the document chunks relevant to a given query.
3. Using the retrieved context to answer user queries during the inference step.
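
To make these stages concrete before involving Llama Stack, here is a minimal, self-contained sketch that uses word-count vectors and cosine similarity as stand-ins for a real embedding model and vector database. The helper names `embed`, `cosine`, and `retrieve`, as well as the two sample chunks, are hypothetical illustrations and are not part of the Llama Stack API.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list, top_k: int = 1) -> list:
    """Return the top_k chunks whose vectors are most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


# 1. Indexing: store every chunk together with its vector (made-up sample chunks)
chunks = [
    "run crc setup to prepare OpenShift Local",
    "pods are the smallest deployable units",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Retrieval: find the chunk closest to the query
context = retrieve("how do I setup OpenShift Local", index)

# 3. Inference: the retrieved context would be placed into the model prompt
print(context)
```

In the rest of this notebook, Llama Stack takes over each of these roles: a registered vector database replaces the in-memory `index`, and the built-in RAG tool replaces `retrieve`.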

## 1. Setting Up this Notebook

First, we will start with a few imports.

In [1]:
import uuid

from llama_stack_client import RAGDocument
from llama_stack_client.types.shared.content_delta import TextDelta, ToolCallDelta

Next, we will initialize our environment as described in detail in our ["Getting Started" notebook](Level0_getting_started_with_Llama_Stack.ipynb). Please refer to it for additional explanations.

In [2]:
# for accessing the environment variables
import os
from dotenv import load_dotenv
load_dotenv()

# for communication with Llama Stack
from llama_stack_client import LlamaStackClient

# pretty print of the results returned from the model/agent
import sys
sys.path.append('..')  
from src.utils import step_printer
from termcolor import cprint

base_url = os.getenv("REMOTE_BASE_URL")

# Tavily search API key is required for some of our demos and must be provided to the client upon initialization.
# We will cover it in the agentic demos that use the respective tool. Please ignore this parameter for all other demos.
tavily_search_api_key = os.getenv("TAVILY_SEARCH_API_KEY")
if tavily_search_api_key is None:
    provider_data = None
else:
    provider_data = {"tavily_search_api_key": tavily_search_api_key}


client = LlamaStackClient(
    base_url=base_url,
    provider_data=provider_data
)
    
print("Connected to Llama Stack server")

# model_id for the model you wish to use that is configured with the Llama Stack server
model_id = "granite32-8b"

temperature = float(os.getenv("TEMPERATURE", 0.0))
if temperature > 0.0:
    top_p = float(os.getenv("TOP_P", 0.95))
    strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
else:
    strategy = {"type": "greedy"}

max_tokens = int(os.getenv("MAX_TOKENS", 4096))

# sampling_params will later be used to pass the parameters to Llama Stack Agents/Inference APIs
sampling_params = {
    "strategy": strategy,
    "max_tokens": max_tokens,
}

stream_env = os.getenv("STREAM", "True")
# the Boolean 'stream' parameter will later be passed to Llama Stack Agents/Inference APIs
# any value not equal to 'False' will be considered as 'True'
stream = (stream_env != "False")

print(f"Inference Parameters:\n\tModel: {model_id}\n\tSampling Parameters: {sampling_params}\n\tstream: {stream}")

Connected to Llama Stack server
Inference Parameters:
	Model: granite32-8b
	Sampling Parameters: {'strategy': {'type': 'greedy'}, 'max_tokens': 4096}
	stream: False
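
The environment-driven settings in the cell above can also be expressed as a small pure function, which makes the rules easy to check without a running server. This is only a sketch: `build_inference_settings` is a hypothetical helper, not part of Llama Stack, and it mirrors the defaults used in this notebook (`TEMPERATURE`, `TOP_P`, `MAX_TOKENS`, `STREAM`).

```python
from typing import Mapping


def build_inference_settings(env: Mapping[str, str]) -> dict:
    """Derive sampling parameters and the stream flag from an environment-like mapping."""
    temperature = float(env.get("TEMPERATURE", "0.0"))
    if temperature > 0.0:
        # sampling is only enabled for a positive temperature
        strategy = {
            "type": "top_p",
            "temperature": temperature,
            "top_p": float(env.get("TOP_P", "0.95")),
        }
    else:
        strategy = {"type": "greedy"}
    return {
        "sampling_params": {
            "strategy": strategy,
            "max_tokens": int(env.get("MAX_TOKENS", "4096")),
        },
        # same rule as in the notebook: only the exact string "False" disables streaming
        "stream": env.get("STREAM", "True") != "False",
    }


settings = build_inference_settings({"TEMPERATURE": "0.7", "STREAM": "False"})
print(settings)
```

Passing `os.environ` instead of a literal dictionary would reproduce the notebook's behavior. Note that the comparison is case-sensitive, so a value such as `false` still enables streaming.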


Finally, we complete the setup by generating a unique identifier for the document collection (vector database) that we will use for RAG ingestion and retrieval.

In [3]:
vector_db_id = f"test_vector_db_{uuid.uuid4()}"

## 2. Indexing the Documents
- Initialize a new document collection in our vector database. All parameters related to the vector database, such as the embedding model and dimension, must be specified here.
- Provide a list of document URLs to the RAG tool. Llama Stack will handle fetching, converting, and chunking the content of the documents.

In [4]:
# define and register the document collection to be used
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model=os.getenv("VDB_EMBEDDING"),
    embedding_dimension=int(os.getenv("VDB_EMBEDDING_DIMENSION", 384)),
    provider_id=os.getenv("VDB_PROVIDER"),
)

# ingest the documents into the newly created document collection
urls = [
    ("https://www.openshift.guide/openshift-guide-screen.pdf", "application/pdf"),
]
documents = [
    RAGDocument(
        document_id=f"num-{i}",
        content=url,
        mime_type=url_type,
        metadata={},
    )
    for i, (url, url_type) in enumerate(urls)
]
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=int(os.getenv("VECTOR_DB_CHUNK_SIZE", 512)),
)
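
To give an intuition for what the `chunk_size_in_tokens` parameter controls, the sketch below splits text into fixed-size chunks with an optional overlap. It is an assumption-laden illustration only: `chunk_tokens` is a hypothetical helper that treats whitespace-separated words as tokens, whereas Llama Stack performs its own chunking and counts tokens with a tokenizer, so its actual chunk boundaries may differ.

```python
def chunk_tokens(text: str, chunk_size: int, overlap: int = 0) -> list:
    """Split text into chunks of at most chunk_size whitespace-separated tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


sample = "one two three four five six seven"
print(chunk_tokens(sample, chunk_size=3))
print(chunk_tokens(sample, chunk_size=3, overlap=1))
```

Smaller chunks give more precise retrieval hits but less surrounding context per hit. Larger chunks do the opposite, and they consume more of the model's context window when inserted into the prompt.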

## 3. Executing Queries via the Built-in RAG Tool
- Directly invoke the RAG tool to query the vector database we ingested into at the previous stage.
- Construct an extended prompt using the retrieved chunks.
- Query the model with the extended prompt.
- Output the reply received from the model.

In [5]:
queries = [
    "How do I install OpenShift?",
]

for prompt in queries:
    cprint(f"\nUser> {prompt}", "blue")
    
    # RAG retrieval call
    rag_response = client.tool_runtime.rag_tool.query(content=prompt, vector_db_ids=[vector_db_id])

    # the list of messages to be sent to the model must start with the system prompt
    messages = [{"role": "system", "content": "You are a helpful assistant."}]

    # construct the actual prompt to be executed, incorporating the original query and the retrieved content
    prompt_context = rag_response.content
    extended_prompt = f"Please answer the given query using the context below.\n\nCONTEXT:\n{prompt_context}\n\nQUERY:\n{prompt}"
    messages.append({"role": "user", "content": extended_prompt})

    # use Llama Stack inference API to directly communicate with the desired model
    response = client.inference.chat_completion(
        messages=messages,
        model_id=model_id,
        sampling_params=sampling_params,
        stream=stream,
    )
    
    # print the response
    cprint("inference> ", color="magenta", end='')
    if stream:
        for chunk in response:
            response_delta = chunk.event.delta
            if isinstance(response_delta, TextDelta):
                cprint(response_delta.text, color="magenta", end='')
            elif isinstance(response_delta, ToolCallDelta):
                cprint(response_delta.tool_call, color="magenta", end='')
    else:
        cprint(response.completion_message.content, color="magenta")

User> How do I install OpenShift?
inference> To install OpenShift, follow these steps:

1. Open your web browser and navigate to console.redhat.com/openshift/create/local.
2. Download the latest release of OpenShift Local and the "pull secret" file. The latter is a file containing a key identifying your copy of OpenShift Local to your Red Hat Developer account.
3. Unzip the file containing the OpenShift Local executable.
4. Using your terminal, run the command `crc setup`. This command will prepare your copy of OpenShift Local, verifying requirements and setting the required configuration values.
5. Once the `crc setup` command is ready, launch `crc start`. Running `crc start` can take around 20 minutes on a recent PC.
6. Once started, access the OpenShift Web Console with the `crc console` command, which will open your default browser. OpenShift Local uses the developer username and password to log in as a low-privilege user, while the `kubeadmin` user uses a r
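
The prompt-construction step in the loop above can be isolated into a small function, which makes it easy to inspect or unit-test the exact text sent to the model. This is a sketch under stated assumptions: `build_messages` is a hypothetical helper, and it assumes the retrieved context has already been reduced to a plain string.

```python
def build_messages(query: str, context: str,
                   system_prompt: str = "You are a helpful assistant.") -> list:
    """Build the chat messages for a RAG request: system prompt plus context-extended user prompt."""
    extended_prompt = (
        "Please answer the given query using the context below.\n\n"
        f"CONTEXT:\n{context}\n\nQUERY:\n{query}"
    )
    return [
        # the list of messages must start with the system prompt
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": extended_prompt},
    ]


messages = build_messages(
    query="How do I install OpenShift?",
    context="Run `crc setup`, then `crc start`.",
)
print(messages[1]["content"])
```

Keeping the template in one place also makes it simpler to experiment with different instructions, for example asking the model to say so when the context does not contain the answer.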

## Key Takeaways
This notebook demonstrated how to set up and use the built-in RAG tool to ingest user-provided documents into a vector database and use them during inference via direct retrieval.

Now that we've seen how easy it is to implement RAG with Llama Stack, we'll move on to building a simple agent in our [Simple Agents](./Level2_simple_agent_with_websearch.ipynb) notebook.

#### Any Feedback?

If you have any feedback on this or any other notebook in this demo series, we'd love to hear it! Please go to https://www.feedback.redhat.com/jfe/form/SV_8pQsoy0U9Ccqsvk and help us improve our demos.