# Level 3: Agentic RAG

This tutorial presents an example of executing queries with agentic RAG in Llama Stack. It shows how to initialize an agent with the RAG tool provided by Llama Stack and how to invoke it so that retrieval from a vector DB is triggered only when the agent requests it. The tutorial also covers document ingestion using the RAG tool.
For a foundational (non-agentic) RAG tutorial, please refer to [Level1_foundational_RAG.ipynb](demos/rag_agentic/notebooks/Level1_foundational_RAG.ipynb).

## Overview

This tutorial covers the following steps:
1. Connecting to a Llama Stack server.
2. Indexing a collection of documents in a vector DB for later retrieval.
3. Initializing an agent capable of retrieving content from the vector DB via tool use.
4. Launching the agent and using it to answer user queries during the inference step.


## Prerequisites

Before starting, ensure you have a running instance of the Llama Stack server (local or remote) with at least one preconfigured vector DB. For more information, please refer to the corresponding [Llama Stack tutorials](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html).

## 1. Setting Up the Environment
- Import the necessary libraries.
- Define the settings for the RAG pipeline, including the Llama Stack server URL and the inference and document ingestion parameters.
- Initialize the connection to the server.

In [1]:
import os
import sys
import uuid

from llama_stack_client import Agent, AgentEventLogger, RAGDocument, LlamaStackClient

# make the shared demo utilities in ../src importable
sys.path.append('..')
from src.utils import step_printer

# the server endpoint
LLAMA_STACK_SERVER_URL = "http://localhost:8321"

# inference settings
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
SYSTEM_PROMPT = "You are a helpful assistant. "
TEMPERATURE = 0.0
TOP_P = 0.95
MAX_TOKENS = 4096

# RAG settings
VECTOR_DB_EMBEDDING_MODEL = "all-MiniLM-L6-v2"
VECTOR_DB_EMBEDDING_DIMENSION = 384
VECTOR_DB_CHUNK_SIZE = 512

# For this demo, we are using Milvus Lite, which is our preferred solution. Any other Vector DB supported by Llama Stack can be used.
VECTOR_DB_PROVIDER_ID = 'milvus'

# initialize the inference strategy
if TEMPERATURE > 0.0:
    strategy = {"type": "top_p", "temperature": TEMPERATURE, "top_p": TOP_P}
else:
    strategy = {"type": "greedy"}

# set to True to stream the agent reply
STREAM_OUTPUT = False
    
# initialize the document collection to be used for RAG
vector_db_id = f"test_vector_db_{uuid.uuid4()}"
    
# initialize the server connection
client = LlamaStackClient(base_url=os.environ.get("LLAMA_STACK_ENDPOINT", LLAMA_STACK_SERVER_URL))
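
The strategy selection in the cell above can also be written as a small function, which makes the rule easier to reuse and test when experimenting with different temperatures. The sketch below mirrors the notebook's logic; the function name `build_sampling_strategy` is hypothetical and not part of Llama Stack.

```python
def build_sampling_strategy(temperature: float, top_p: float) -> dict:
    """Mirror the notebook's rule: use top-p sampling only when temperature > 0."""
    if temperature > 0.0:
        return {"type": "top_p", "temperature": temperature, "top_p": top_p}
    # With a temperature of 0 the notebook falls back to greedy decoding.
    return {"type": "greedy"}


print(build_sampling_strategy(0.0, 0.95))
print(build_sampling_strategy(0.7, 0.95))
```

With the settings used in this tutorial (`TEMPERATURE = 0.0`), the greedy branch is taken and `TOP_P` has no effect.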

## 2. Indexing the Documents
- Initialize a new document collection in the target vector DB. All parameters related to the vector DB, such as the embedding model and dimension, must be specified here.
- Provide a list of document URLs to the RAG tool. Llama Stack will handle fetching, conversion and chunking of the documents' content.
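
To give an intuition for what chunking by size means, the sketch below splits a token list into consecutive fixed-size pieces. It is an illustration only and not Llama Stack's implementation: whitespace splitting stands in for a real tokenizer, and the function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens: list[str], chunk_size: int) -> list[list[str]]:
    """Split a token list into consecutive chunks of at most `chunk_size` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i : i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Whitespace splitting stands in for a real tokenizer here.
words = "OpenShift is a platform for building deploying and managing applications".split()
chunks = chunk_tokens(words, chunk_size=4)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk is embedded and stored separately, so the chunk size sets how much text a single retrieved passage contains.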

In [3]:
# define and register the document collection to be used
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model=VECTOR_DB_EMBEDDING_MODEL,
    embedding_dimension=VECTOR_DB_EMBEDDING_DIMENSION,
    provider_id=VECTOR_DB_PROVIDER_ID,
)

# ingest the documents into the newly created document collection
urls = [
    ("https://www.openshift.guide/openshift-guide-screen.pdf", "application/pdf"),
    ("https://www.cdflaborlaw.com/_images/content/2023_OCBJ_GC_Awards_Article.pdf", "application/pdf"),
]
documents = [
    RAGDocument(
        document_id=f"num-{i}",
        content=url,
        mime_type=url_type,
        metadata={},
    )
    for i, (url, url_type) in enumerate(urls)
]
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=VECTOR_DB_CHUNK_SIZE,
)

## 3. Executing Queries via the RAG-Aware Agent
- Initialize an agent with a list of tools including the built-in RAG tool. The RAG tool specification must include a list of document collection IDs to retrieve from.
- For each prompt, initialize a new agent session, execute a turn during which a retrieval call may be requested, and output the reply received from the agent.

In [4]:
queries = [
    "How to install OpenShift?",
    "Are employees based in California eligible for remote work?",
]

# initializing the agent
agent = Agent(
    client,
    model=MODEL_ID,
    instructions=SYSTEM_PROMPT,
    sampling_params={
        "strategy": strategy,
        "max_tokens": MAX_TOKENS,
    },
    # we make our agent aware of the RAG tool by including builtin::rag/knowledge_search in the list of tools
    tools=[
        dict(
            name="builtin::rag/knowledge_search",
            args={
                "vector_db_ids": [vector_db_id],  # list of IDs of document collections to consider during retrieval
            },
        )
    ],
)

for prompt in queries:
    print(f"User> {prompt}")
    
    # create a new turn with a new session ID for each prompt
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=agent.create_session(f"rag-session_{uuid.uuid4()}"),
        stream=STREAM_OUTPUT,
    )
    
    # print the response, including tool calls output
    if STREAM_OUTPUT:
        for log in AgentEventLogger().log(response):
            log.print()
    else:
        step_printer(response.steps)

User> How to install OpenShift?

---------- 📍 Step 1: InferenceStep ----------
🛠️ Tool call Generated:
Tool call: knowledge_search, Arguments: {'query': 'installing OpenShift'}

---------- 📍 Step 2: ToolExecutionStep ----------
🔧 Executing tool...



---------- 📍 Step 3: InferenceStep ----------
🤖 Model Response:
The provided text appears to be an excerpt from a book or documentation about Red Hat OpenShift, a platform for building, deploying, and managing applications in a cloud-native environment. The text covers various topics related to working with OpenShift, including:

1. **Importing Applications**: The text explains how to import applications into OpenShift using different methods:
	* Importing from a Git repository (e.g., GitHub, GitLab, Gitea).
	* Importing YAML files directly.
	* Importing JAR files for Java applications.
	* Creating a container image and importing it into OpenShift.
2. **Creating and Debugging Applications with the odo Tool**: The text introduces the odo tool, which is designed for software developers to create applications using "Devfiles" (devfile.yaml). Devfiles contain information about an application's programming language, dependencies, and other essential details. The odo tool can be 


---------- 📍 Step 3: InferenceStep ----------
🤖 Model Response:
Based on the provided text, it appears that there are two separate topics being discussed: one related to navigating remote work challenges for California employers and the other related to Tekton and OpenShift.

For the first topic, the top five issues for California employers when it comes to remote work are:

1. Expense Reimbursement
2. Data Security and Privacy
3. Meal and Rest Breaks
4. Expense Reimbursement (again, with a focus on establishing a comprehensive policy)
5. Employee Classification

For the second topic, the text discusses Tekton and OpenShift, specifically how to create a CI/CD pipeline using YAML files and how to inspect the pipeline's status and logs.

If you're looking for information on how to navigate remote work challenges for California employers, I can provide some general guidance:

1. Establish a comprehensive expense reimbursement policy that outlines eligible expenses and the reim

## Key Takeaways
This tutorial demonstrates how to implement agentic RAG with Llama Stack. We do so by initializing an agent while giving it access to the RAG tool, then invoking the agent on each of the specified queries. Please check out our [complementary tutorial](demos/rag_agentic/notebooks/Level1_foundational_RAG.ipynb) for a non-agentic RAG example.