# Level 5: MCP Based RAG (Medium Difficulty)

This tutorial is for developers who are already familiar with [basic agentic workflows](./Level2_simple_agentic_with_websearch.ipynb). It highlights a couple of slightly more advanced use cases for agents, where a single tool call is insufficient to complete the required task. Here we rely on both agentic RAG and an MCP server to expand our agent's capabilities.

We will also use MCP tools hosted locally or on an OpenShift cluster throughout this demo to showcase how users can go beyond Llama Stack's current set of built-in tools and connect to many different services and data sources to build their own custom agents.

### Agent Examples:

This notebook will walk through how to build a system that can answer the following question via agents built with Llama Stack:

1. [*"Generate a random number, insert it into: "How much is an OpenShift subscription {number}?", then query the vector DB with that question and return the results."*](#query-1-agentic-using-mcp-based-rag-to-enhance-queries)
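
To make the expected chain of steps concrete, the sketch below reproduces them in plain Python. It is only an illustration: `search_vector_db` is a hypothetical stand-in that returns a placeholder, not a Llama Stack API, and the agent built later performs these steps through tool calls instead.

```python
import random

QUESTION_TEMPLATE = "How much is an OpenShift subscription {number}?"


def build_question(number: int) -> str:
    """Insert the generated number into the question template."""
    return QUESTION_TEMPLATE.format(number=number)


def search_vector_db(query: str) -> list[str]:
    """Hypothetical stand-in for the RAG lookup; returns a placeholder chunk."""
    return [f"(placeholder chunk for query: {query})"]


def answer(rng: random.Random) -> list[str]:
    # Step 1: generate a random number (an MCP tool does this in the notebook).
    number = rng.randint(1, 100)
    # Step 2: fill the template with that number.
    question = build_question(number)
    # Step 3: query the vector DB with the completed question.
    return search_vector_db(question)


if __name__ == "__main__":
    print(answer(random.Random()))
```

The point of the example is the dependency between steps: the query in step 3 cannot be formed until step 1 has returned, which is why a single tool call is not enough.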

### MCP Tools:

Throughout this notebook we will be relying on the [custom-mcp-server](https://github.com/opendatahub-io/llama-stack-demos/tree/main/kubernetes/mcp-servers/custom-mcp) to interact with our custom MCP tools.

Please see installation instructions below if you do not already have this deployed in your environment. 

* [Custom MCP installation instructions](../../../mcp-servers/custom-mcp/README.md)



## Overview

In this tutorial we will connect to a llama-stack instance, build a RAG agent with a custom MCP tool available to it, and run inference against the agent.

## Pre-Requisites

Before starting, ensure you have the following:
- User variables configured (see section `Setting your ENV variables` below).

## General Setup


### Setting your ENV variables:

As mentioned above, for this demo there are a few ENV variables that need to be set:
- `REMOTE` (boolean): dictates if you are using a remote llama-stack instance.
- `REMOTE_BASE_URL` (string): the URL for your llama-stack instance if using remote connection.
- `REMOTE_CUSTOM_MCP_URL` (string): the URL for your custom MCP server. If the client does not find the tool group registered with the llama-stack instance, it uses this URL to register it.
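
Environment variables always arrive as strings, so a value such as `REMOTE=False` is truthy unless it is parsed explicitly. The following is a minimal sketch of one way to read these settings safely; `parse_bool`, `resolve_base_url`, and the set of accepted "true" spellings are illustrative assumptions, not part of Llama Stack.

```python
import os
from typing import Mapping, Optional

# Spellings treated as "true" (an assumption made for this sketch).
TRUE_VALUES = {"1", "true", "yes", "on"}


def parse_bool(value: Optional[str]) -> bool:
    """Interpret an environment string as a boolean; unset or empty means False."""
    return value is not None and value.strip().lower() in TRUE_VALUES


def resolve_base_url(
    env: Mapping[str, str], local_default: str = "http://localhost:8321"
) -> str:
    """Return REMOTE_BASE_URL when REMOTE is truthy, otherwise the local default."""
    if not parse_bool(env.get("REMOTE")):
        return local_default
    url = env.get("REMOTE_BASE_URL")
    if not url:
        # Fail early instead of passing None to the client.
        raise ValueError("REMOTE is enabled but REMOTE_BASE_URL is not set")
    return url


if __name__ == "__main__":
    print(resolve_base_url(os.environ))
```

Passing the environment in as a mapping keeps the helper easy to exercise with a plain dictionary.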

### Installing dependencies

This code requires `llama-stack` and `llama-stack-client`, both at version `0.1.9`. Let's begin by installing them:

In [17]:
!pip install llama-stack-client==0.1.9 llama-stack==0.1.9

### Configuring logging

Now that we have our dependencies, let's set up logging for the application:

In [67]:
from llama_stack_client.lib.agents.event_logger import EventLogger
import logging

logger = logging.getLogger(__name__)
if not logger.hasHandlers():  
    logger.setLevel(logging.INFO)
    stream_handler = logging.StreamHandler()
    stream_handler.setLevel(logging.INFO)
    formatter = logging.Formatter('%(message)s')
    stream_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)

### Configuration
This section sets up key parameters for model inference and the RAG (Retrieval-Augmented Generation) vector database.

In [49]:
import uuid

# Inference settings
MODEL="meta-llama/Llama-3.2-3B-Instruct"
TEMPERATURE = 0.0
TOP_P = 0.95
if TEMPERATURE > 0.0:
    strategy = {"type": "top_p", "temperature": TEMPERATURE, "top_p": TOP_P}
else:
    strategy = {"type": "greedy"}

# RAG vector DB settings
VECTOR_DB_EMBEDDING_MODEL = "all-MiniLM-L6-v2"
VECTOR_DB_EMBEDDING_DIMENSION = 384
VECTOR_DB_CHUNK_SIZE = 512
VECTOR_DB_PROVIDER_ID = "faiss"

# Unique DB ID for session
vector_db_id = f"test_vector_db_{uuid.uuid4()}"

### Connecting to llama-stack server

For the llama-stack instance, you can either run it locally or connect to a remote llama-stack instance.

#### Remote llama-stack

- For remote, be sure to set the `REMOTE` environment variable to `True` and populate `REMOTE_BASE_URL` with the URL of your remote llama-stack instance.
- [Remote Setup Guide](https://github.com/opendatahub-io/llama-stack-on-ocp/tree/main/kubernetes)

#### Local llama-stack
- For local, be sure to set `REMOTE` to `False` and validate the local `base_url` in the cell below. It is based on the default llama-stack port, `8321`, which is configurable in your deployment of llama-stack.
- [Local Setup Guide](https://github.com/redhat-et/agent-frameworks/tree/main/prototype/frameworks/llamastack)

In [50]:
import os
from dotenv import load_dotenv
load_dotenv()

# Use `remote` to switch between a local development environment and a remote Kubernetes cluster.
# Environment variables are strings, so parse the flag explicitly: a bare os.getenv("REMOTE", False)
# would treat the string "False" as truthy.
remote = os.getenv("REMOTE", "false").strip().lower() in ("1", "true", "yes")

if remote:
    base_url = os.getenv("REMOTE_BASE_URL")
else:
    base_url = "http://localhost:8321"

# Optional Tavily API key; only needed if you use the builtin web search tool
tavily_search_api_key = os.getenv("TAVILY_SEARCH_API_KEY")

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url=base_url,
    provider_data={
        "tavily_search_api_key": tavily_search_api_key  # passed through for the web search tool, if used
    }
)
    
logger.info(f"Connected to Llama Stack server @ {base_url} \n")

Connected to Llama Stack server @ http://localhost:8321 



### Indexing the Documents
- Initialize a new document collection in the target vector DB. All parameters related to the vector DB, such as the embedding model and dimension, must be specified here.
- Provide a list of document URLs to the RAG tool. Llama Stack will handle fetching, conversion and chunking of the documents' content.

In [51]:
from llama_stack_client import RAGDocument

# define and register the document collection to be used
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model=VECTOR_DB_EMBEDDING_MODEL,
    embedding_dimension=VECTOR_DB_EMBEDDING_DIMENSION,
    provider_id=VECTOR_DB_PROVIDER_ID,
)

# ingest the documents into the newly created document collection
urls = [
    ("https://www.openshift.guide/openshift-guide-screen.pdf", "application/pdf"),
    ("https://www.cdflaborlaw.com/_images/content/2023_OCBJ_GC_Awards_Article.pdf", "application/pdf"),
]
documents = [
    RAGDocument(
        document_id=f"num-{i}",
        content=url,
        mime_type=url_type,
        metadata={},
    )
    for i, (url, url_type) in enumerate(urls)
]
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=VECTOR_DB_CHUNK_SIZE,
)
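
To give a feel for what `chunk_size_in_tokens` controls, here is a simplified sketch of fixed-size chunking. It splits on whitespace rather than using a real tokenizer, and it is not Llama Stack's implementation; `chunk_words` and its `overlap` parameter are illustrative assumptions.

```python
def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words.

    Consecutive chunks share `overlap` words, which can help retrieval when a
    relevant passage straddles a chunk boundary.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    print(chunk_words("one two three four five", chunk_size=2))
    # -> ['one two', 'three four', 'five']
```

Smaller chunks give more precise matches but less context per retrieved chunk; larger chunks do the opposite.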

### Validate tools are available in our llama-stack instance

When an instance of llama-stack is redeployed, your tools need to be re-registered. Also, if a toolgroup is already registered with a llama-stack instance and you try to register another with the same `toolgroup_id`, llama-stack will raise an error.

For this reason it is recommended to include some code to validate your tools and toolgroups. This is where the `mcp_url` comes into play. The following code lists the registered toolgroups so you can confirm that both `builtin::rag` and `mcp::custom_mcp_server` are present; if `mcp::custom_mcp_server` is not listed, it attempts to register it using the MCP URL.

If you are running the MCP server from source, the default value for this is: `http://localhost:8000/sse`.

If you are running the MCP server from a container, the default value for this is: `http://host.containers.internal:8000/sse`.

Make sure to pass the corresponding MCP URL for the server you are trying to register/validate tools for.
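
If the environment variable is unset, `os.getenv` returns `None`, and that value would be passed straight into the registration call. A small sketch of falling back to the defaults quoted above is shown here; `pick_mcp_url` and its `run_mode` argument are hypothetical names introduced for illustration.

```python
import os
from typing import Mapping

# Default SSE endpoints quoted in the text above.
DEFAULT_MCP_URLS = {
    "source": "http://localhost:8000/sse",
    "container": "http://host.containers.internal:8000/sse",
}


def pick_mcp_url(env: Mapping[str, str], run_mode: str = "source") -> str:
    """Prefer an explicit LOCAL_MCP_URL; otherwise use the default for `run_mode`."""
    explicit = env.get("LOCAL_MCP_URL")
    if explicit:
        return explicit
    if run_mode not in DEFAULT_MCP_URLS:
        raise ValueError(f"unknown run_mode: {run_mode!r}")
    return DEFAULT_MCP_URLS[run_mode]


if __name__ == "__main__":
    print(pick_mcp_url(os.environ, run_mode="source"))
```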

In [64]:
# MCP server URL, read from the environment. This cell reads LOCAL_MCP_URL;
# switch to os.getenv("REMOTE_CUSTOM_MCP_URL") when your MCP server runs remotely.
mcp_url = os.getenv("LOCAL_MCP_URL")

# Get list of registered tools and extract their toolgroup IDs
registered_tools = client.tools.list()
registered_toolgroups = [tool.toolgroup_id for tool in registered_tools]

# Optional: unregister the MCP toolgroup (uncomment to force a fresh registration)
# try:
#     client.toolgroups.unregister(toolgroup_id="mcp::custom_mcp_server")
#     print("Successfully unregistered MCP tool group: mcp::custom_mcp_server")
# except Exception as e:
#     print(f"Error unregistering MCP tool group: {e}")

# Register the custom MCP toolgroup if it is not already registered
if "mcp::custom_mcp_server" not in registered_toolgroups:
    client.toolgroups.register(
        toolgroup_id="mcp::custom_mcp_server",
        provider_id="model-context-protocol",
        mcp_endpoint={"uri": mcp_url},
    )
    # Keep the local list in sync so the log message below includes the new toolgroup
    registered_toolgroups.append("mcp::custom_mcp_server")

# Log the current toolgroups registered
logger.info(
    f"Your Llama Stack server is already registered with the following tool groups: {set(registered_toolgroups)}\n"
)

Your Llama Stack server is already registered with the following tool groups: {'builtin::rag', 'mcp::custom_mcp_server', 'builtin::websearch', 'builtin::wolfram_alpha'}
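
The check-then-register pattern above can be wrapped in a small helper so it is safe to re-run. This is a sketch under the assumption that the client exposes `tools.list()` and `toolgroups.register(...)` exactly as used in the cell above; `ensure_toolgroup` itself is a hypothetical name.

```python
from typing import Any


def ensure_toolgroup(client: Any, toolgroup_id: str, provider_id: str, mcp_uri: str) -> bool:
    """Register `toolgroup_id` only if it is missing.

    Returns True when a registration was performed, False when the toolgroup
    was already present.
    """
    registered = {tool.toolgroup_id for tool in client.tools.list()}
    if toolgroup_id in registered:
        return False
    client.toolgroups.register(
        toolgroup_id=toolgroup_id,
        provider_id=provider_id,
        mcp_endpoint={"uri": mcp_uri},
    )
    return True
```

With the notebook's variables, a call would look like `ensure_toolgroup(client, "mcp::custom_mcp_server", "model-context-protocol", mcp_url)`.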



## Query 1: (Agentic) `Using MCP Based RAG to Enhance Queries`

### System Prompts for different models

**Note:** If you have multiple models configured with your Llama Stack server, you can choose which one to run your queries against. When switching to a different model, you may need to adjust the system prompt to align with that model's expected behavior. Many models provide recommended system prompts for optimal and reliable outputs; these are typically documented on their respective websites.

In [65]:
# Here is a system prompt we have come up with which works well for this query

granite_model="granite3.2:8b-instruct-fp16"
llama_model="meta-llama/Llama-3.2-3B-Instruct"
sys_prompt1= """You are a helpful assistant. Use tools to answer. When you use a tool always respond with a summary of the result."""
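
One way to keep per-model prompts organized is a lookup table with a fallback. The sketch below is illustrative only: `PROMPTS_BY_MODEL` and `prompt_for` are hypothetical names, and both entries reuse the prompt from this notebook as placeholders rather than model-specific recommendations.

```python
DEFAULT_PROMPT = (
    "You are a helpful assistant. Use tools to answer. "
    "When you use a tool always respond with a summary of the result."
)

# Placeholder overrides: replace each value with the prompt recommended for that model.
PROMPTS_BY_MODEL = {
    "meta-llama/Llama-3.2-3B-Instruct": DEFAULT_PROMPT,
    "granite3.2:8b-instruct-fp16": DEFAULT_PROMPT,
}


def prompt_for(model_id: str) -> str:
    """Return the system prompt for `model_id`, falling back to the default."""
    return PROMPTS_BY_MODEL.get(model_id, DEFAULT_PROMPT)


if __name__ == "__main__":
    print(prompt_for("meta-llama/Llama-3.2-3B-Instruct"))
```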

In [66]:
from llama_stack_client import Agent

# Create a simple agent with the RAG tool and the custom MCP toolgroup
agent = Agent(
    client,
    model=MODEL,  # replace this with your choice of model
    instructions=sys_prompt1,  # update the system prompt based on the model you are using
    tools=[
        dict(
            name="builtin::rag/knowledge_search",
            args={
                "vector_db_ids": [vector_db_id],  # list of IDs of document collections to consider during retrieval
            },
        ),
        "mcp::custom_mcp_server",
    ],
    tool_config={"tool_choice": "auto"},
    # `strategy` comes from the Configuration cell; without it the inference settings there are unused
    sampling_params={"strategy": strategy, "max_tokens": 4096},
)

user_prompts = ["""
Generate a random number, insert it into: "How much is an OpenShift subscription {number}?", then query the vector DB with that question and return the results.
"""]


                
session_id = agent.create_session(session_name="OCP_demo")

for prompt in user_prompts:
    turn_response = agent.create_turn(
        messages=[
            {
                "role":"user",
                "content": prompt
            }
        ],
        session_id=session_id,
        stream=True,
    )
    for log in EventLogger().log(turn_response):
        log.print()

inference> [generate_random_number(min=1, max=100), knowledge_search(query="OpenShift subscription {number}")]
tool_execution> Tool:generate_random_number Args:{'min': 1.0, 'max': 100.0}
tool_execution> Tool:generate_random_number Response:{"type":"text","text":"84","annotations":null}
inference> [knowledge_search(query="How much is an OpenShift subscription 84?")]
tool_execution> Tool:knowledge_search Args:{'query': 'How much is an OpenShift subscription 84?'}
tool_execution> Tool:knowledge_search Response:[TextContentItem(text='knowledge_search tool found 5 chunks:\nBEGIN 