# Simple RAG

This notebook shows you how to build a simple RAG application with Llama Stack. You will learn how the APIs provided by Llama Stack can be used to directly control and invoke all common RAG stages, including indexing, retrieval, and inference.

## Overview

This tutorial covers the following steps:
1. Indexing a collection of documents into a vector database for later retrieval.
2. Executing the built-in RAG tool to retrieve the document chunks relevant to a given query.
3. Using the retrieved context to answer user queries during the inference step.
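
To make the retrieval stage concrete before we involve any server, here is a toy sketch that ranks text chunks by how many words they share with a query. It is only an illustration: the helper names `tokenize` and `retrieve` and the sample chunks are made up for this example, and a real vector database compares embedding vectors rather than raw words.

```python
# Toy illustration of the retrieval stage only. Real RAG systems compare
# embedding vectors; word overlap is used here purely to keep the sketch small.

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k chunks that share the most words with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(chunk)), chunk) for chunk in chunks]
    # Highest overlap first; sorted() is stable, so ties keep their input order
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Drop chunks that have nothing in common with the query
    return [chunk for score, chunk in ranked[:top_k] if score > 0]


# Hypothetical sample chunks, not taken from the indexed document
chunks = [
    "The control plane manages the cluster state.",
    "Installation can be performed on user-provisioned infrastructure.",
    "Operators automate the lifecycle of cluster components.",
]
print(retrieve("How is the cluster state managed?", chunks, top_k=1))
```

The chunks returned by such a lookup are what gets pasted into the prompt in the inference stage.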

## 1. Setting Up this Notebook

First, we install the Llama Stack client. Its version must match the version of the Llama Stack server that ships with OpenShift AI, in this case 0.2.10:

In [None]:
%pip install llama_stack_client==0.2.10

Then, we will need a few imports we use throughout the notebook:

In [None]:
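import os
import uuid

from llama_stack_client import RAGDocument, LlamaStackClient
from llama_stack_client.types.shared.content_delta import TextDelta, ToolCallDelta

# Just used for better readability of std output
from pprint import pprint
from termcolor import cprint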

Now we initialize the Llama Stack client itself. We only need to provide the endpoint at which the server is reachable. The server is exposed via a Kubernetes Service on the underlying cluster:

In [None]:
client = LlamaStackClient(base_url="http://lsd-llama-milvus-service.llamastack.svc.cluster.local:8321")

print("Connected to Llama Stack server")

Next, we want to check which models are available to us and set the `model_id` to use later on:

In [None]:
# Fetch all registered models
models = client.models.list()

# Let's see what's in there
print("The following models are available:")
pprint(models)

# Set `model_id` to the first model of type "llm"
model_id = next(m.identifier for m in models if m.model_type == "llm")
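
Note that `next(...)` without a default raises a bare `StopIteration` if no model of type `"llm"` is registered. A minimal sketch of a more explicit variant is shown below. The helper name `pick_llm_id` and the stand-in model entries are hypothetical; they only mimic the `identifier` and `model_type` attributes used in the cell above.

```python
from types import SimpleNamespace


def pick_llm_id(models) -> str:
    """Return the identifier of the first model whose model_type is "llm"."""
    model_id = next((m.identifier for m in models if m.model_type == "llm"), None)
    if model_id is None:
        raise RuntimeError("No model of type 'llm' is registered on the server")
    return model_id


# Hypothetical stand-ins for the entries returned by client.models.list()
example_models = [
    SimpleNamespace(identifier="example-embedding-model", model_type="embedding"),
    SimpleNamespace(identifier="example-llm", model_type="llm"),
]
print(pick_llm_id(example_models))  # prints "example-llm"
```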

Finally, we complete the setup by choosing a unique ID for the document collection we will use for RAG ingestion and retrieval. The collection itself is registered in the next section.

In [None]:
vector_db_id = f"test_vector_db_{uuid.uuid4()}"

## 2. Indexing the Documents
- Initialize a new document collection in our vector database. All parameters related to the vector database, such as the embedding model and dimension, must be specified here.
- Provide a list of document URLs to the RAG tool. Llama Stack will handle fetching, converting, and chunking the content of the documents.
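
To give an intuition for what chunking means, here is a simplified sketch that splits text into pieces of at most `chunk_size` tokens. It is not how Llama Stack does it: the function name `chunk_by_tokens` is made up for this example, whitespace-separated words stand in for model tokens, and a real implementation may also overlap adjacent chunks.

```python
def chunk_by_tokens(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of at most chunk_size whitespace-separated tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


sample = "one two three four five six seven"
print(chunk_by_tokens(sample, chunk_size=3))
# ['one two three', 'four five six', 'seven']
```

Smaller chunks make retrieval more precise but give the model less surrounding context per hit, which is the trade-off the chunk size parameter controls.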

In [None]:
# define and register the document collection to be used
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model=os.getenv("VDB_EMBEDDING"),
    embedding_dimension=int(os.getenv("VDB_EMBEDDING_DIMENSION", 384)),
    provider_id=os.getenv("VDB_PROVIDER"),
)

# ingest the documents into the newly created document collection
urls = [
    ("https://docs.redhat.com/en/documentation/openshift_container_platform/4.19/pdf/architecture/OpenShift_Container_Platform-4.19-Architecture-en-US.pdf", "application/pdf"),
]
documents = [
    RAGDocument(
        document_id=f"num-{i}",
        content=url,
        mime_type=url_type,
        metadata={},
    )
    for i, (url, url_type) in enumerate(urls)
]
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=int(os.getenv("VECTOR_DB_CHUNK_SIZE", 512)),
)

## 3. Executing Queries via the Built-in RAG Tool
- Directly invoke the RAG tool to query the vector database we populated in the previous step.
- Construct an extended prompt using the retrieved chunks.
- Query the model with the extended prompt.
- Output the reply received from the model.
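
The prompt-construction step from the list above can be isolated as a small helper, which makes it easy to inspect exactly what the model will receive. This is only a sketch: the function name `build_rag_messages` is made up for this example, and it assumes the retrieved context has already been rendered as a plain string.

```python
def build_rag_messages(query: str, context: str) -> list[dict]:
    """Build the chat messages for a RAG request from a query and retrieved context."""
    extended_prompt = (
        "Please answer the given query using the context below.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUERY:\n{query}"
    )
    # The message list must start with the system prompt
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": extended_prompt},
    ]


# Placeholder context; in the notebook this would come from the RAG tool
example_messages = build_rag_messages(
    query="How do I install OpenShift?",
    context="(retrieved chunks would go here)",
)
for message in example_messages:
    print(f"{message['role']}: {message['content']}\n")
```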

In [None]:
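queries = [
    "How do I install OpenShift?",
]

# Assumed settings (not defined elsewhere in this notebook); adjust as needed
stream = True  # print the reply incrementally as it is generated
sampling_params = {"strategy": {"type": "greedy"}, "max_tokens": 512}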

for prompt in queries:
    cprint(f"\nUser> {prompt}", "blue")
    
    # RAG retrieval call
    rag_response = client.tool_runtime.rag_tool.query(content=prompt, vector_db_ids=[vector_db_id])

    # the list of messages to be sent to the model must start with the system prompt
    messages = [{"role": "system", "content": "You are a helpful assistant."}]

    # construct the actual prompt to be executed, incorporating the original query and the retrieved content
    prompt_context = rag_response.content
    extended_prompt = f"Please answer the given query using the context below.\n\nCONTEXT:\n{prompt_context}\n\nQUERY:\n{prompt}"
    messages.append({"role": "user", "content": extended_prompt})

    # use Llama Stack inference API to directly communicate with the desired model
    response = client.inference.chat_completion(
        messages=messages,
        model_id=model_id,
        sampling_params=sampling_params,
        stream=stream,
    )
    
    # print the response
    cprint("inference> ", color="magenta", end='')
    if stream:
        for chunk in response:
            response_delta = chunk.event.delta
            if isinstance(response_delta, TextDelta):
                cprint(response_delta.text, color="magenta", end='')
            elif isinstance(response_delta, ToolCallDelta):
                cprint(response_delta.tool_call, color="magenta", end='')
    else:
        cprint(response.completion_message.content, color="magenta")
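
If you want the streamed reply as a single string instead of printing it piece by piece, the text deltas can be accumulated. The sketch below shows the idea with hypothetical stand-in classes (`FakeTextDelta`, `FakeEvent`, `FakeChunk`) that mimic only the attributes read in the streaming branch above; it does not call the server.

```python
from dataclasses import dataclass


# Hypothetical stand-ins that mimic only chunk.event.delta and delta.text
@dataclass
class FakeTextDelta:
    text: str


@dataclass
class FakeEvent:
    delta: FakeTextDelta


@dataclass
class FakeChunk:
    event: FakeEvent


def collect_text(chunks) -> str:
    """Concatenate the text of all text deltas in a stream of chunks."""
    parts = []
    for chunk in chunks:
        delta = chunk.event.delta
        if isinstance(delta, FakeTextDelta):
            parts.append(delta.text)
    return "".join(parts)


fake_stream = [FakeChunk(FakeEvent(FakeTextDelta(t))) for t in ["Open", "Shift ", "4.19"]]
print(collect_text(fake_stream))  # prints "OpenShift 4.19"
```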

## Key Takeaways
This notebook demonstrated how to set up and use the built-in RAG tool for ingesting user-provided documents into a vector database and using them during inference via direct retrieval.

Now that we've seen how easy it is to implement RAG with Llama Stack, we'll move on to building a simple agent with Llama Stack in our [Simple Agents](./Level2_simple_agent_with_websearch.ipynb) notebook.

#### Any Feedback?

If you have any feedback on this or any other notebook in this demo series, we'd love to hear it! Please go to https://www.feedback.redhat.com/jfe/form/SV_8pQsoy0U9Ccqsvk and help us improve our demos.