# Level 1: Simple RAG

This notebook shows how to build a simple RAG application with Llama Stack. You will learn how the APIs provided by Llama Stack can be used to directly control and invoke the common RAG stages: indexing, retrieval, and inference.

_Note: This notebook contains a non-agentic implementation of RAG. We will show you how to build an agentic RAG application later in this tutorial in [Level4_RAG_agent](Level4_RAG_agent.ipynb)._

## Overview

This tutorial covers the following steps:
1. Indexing a collection of documents into a vector database for later retrieval.
2. Executing the built-in RAG tool to retrieve the document chunks relevant to a given query.
3. Using the retrieved context to answer user queries during the inference step.

## 1. Setting Up this Notebook

First, we will start with a few imports.

In [None]:
import uuid

from llama_stack_client import RAGDocument
from llama_stack_client.types import Document
from llama_stack_client.lib.agents.agent import Agent


In [None]:
import pandas as pd
from pathlib import Path

# Load the CSV file from the data directory
data_dir = Path("../data")
file_path = data_dir / "synthetic-it-call-center-tickets.csv"

# Load the CSV into a pandas DataFrame
df = pd.read_csv(file_path)

# Display the first 5 rows of the DataFrame
df.head()

In [None]:
df.info()

Next, we will initialize our environment as described in detail in our ["Getting Started" notebook](Level0_getting_started_with_Llama_Stack.ipynb). Please refer to it for additional explanations.

In [None]:
# Import required libraries
import os
import sys
from pathlib import Path

# Add root src directory to path to import shared config
root_dir = Path("../..").resolve()
sys.path.insert(0, str(root_dir / "src"))

# Import centralized configuration
from config import LLAMA_STACK_URL, MODEL, CONFIG

# For communication with Llama Stack
from llama_stack_client import LlamaStackClient
from termcolor import cprint

# Configuration values (automatically detected based on environment)
llamastack_url = LLAMA_STACK_URL
model = MODEL

if not llamastack_url:
    raise ValueError(
        "LLAMA_STACK_URL is not configured!\n"
        "Please run: ./scripts/setup-env.sh\n"
        "Or set LLAMA_STACK_URL environment variable:\n"
        "  export LLAMA_STACK_URL='https://llamastack-route-my-first-model.apps.ocp.example.com'"
    )

print(f"📡 LlamaStack URL: {llamastack_url}")
print(f"🤖 Model: {model}")
print(f"📍 Environment: {'Inside OpenShift cluster' if CONFIG['inside_cluster'] else 'Outside OpenShift cluster'}")
print(f"📦 Namespace: {CONFIG['namespace']}")

# Tavily search API key is optional (only needed for some demos)
tavily_search_api_key = os.getenv("TAVILY_SEARCH_API_KEY")
if tavily_search_api_key is None:
    provider_data = None
else:
    provider_data = {"tavily_search_api_key": tavily_search_api_key}

# Initialize LlamaStack client
client = LlamaStackClient(
    base_url=llamastack_url,
    provider_data=provider_data
)

# Verify connection
try:
    models = client.models.list()
    model_count = len(models.data) if hasattr(models, 'data') else len(models)
    print(f"\n✅ Connected to LlamaStack")
    print(f"   Available models: {model_count}")
except Exception as e:
    print(f"\n❌ Cannot connect to LlamaStack: {e}")
    print("\n💡 Troubleshooting:")
    print("   1. Check if route exists: oc get route llamastack-route -n my-first-model")
    print("   2. Run setup script: ./scripts/setup-env.sh")
    print("   3. Or set LLAMA_STACK_URL manually in .env file")
    raise

# model_id for the model you wish to use that is configured with the Llama Stack server
model_id = "ollama/llama3.2:3b"

temperature = float(os.getenv("TEMPERATURE", 0.0))
if temperature > 0.0:
    top_p = float(os.getenv("TOP_P", 0.95))
    strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
else:
    strategy = {"type": "greedy"}

max_tokens = int(os.getenv("MAX_TOKENS", 4096))

# sampling_params will later be used to pass the parameters to Llama Stack Agents/Inference APIs
sampling_params = {
    "strategy": strategy,
    "max_tokens": max_tokens,
}

stream_env = os.getenv("STREAM", "True")
# the Boolean 'stream' parameter will later be passed to Llama Stack Agents/Inference APIs
# any value not equal to 'False' will be considered as 'True'
stream = (stream_env != "False")

print(f"Inference Parameters:\n\tModel: {model_id}\n\tSampling Parameters: {sampling_params}\n\tstream: {stream}")
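
The environment-driven settings above can be factored into small pure functions, which makes the greedy/top-p switch easy to check in isolation. This is only a sketch: `build_sampling_params`, `parse_stream_flag`, and their `env` argument are hypothetical names and not part of the Llama Stack client. The dictionary layout simply mirrors the one used in this notebook.

```python
from typing import Any, Dict, Mapping


def build_sampling_params(env: Mapping[str, str]) -> Dict[str, Any]:
    """Mirror the notebook's logic: greedy decoding unless TEMPERATURE > 0."""
    temperature = float(env.get("TEMPERATURE", 0.0))
    if temperature > 0.0:
        top_p = float(env.get("TOP_P", 0.95))
        strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
    else:
        strategy = {"type": "greedy"}
    return {"strategy": strategy, "max_tokens": int(env.get("MAX_TOKENS", 4096))}


def parse_stream_flag(env: Mapping[str, str]) -> bool:
    """Any value other than the exact string 'False' enables streaming."""
    return env.get("STREAM", "True") != "False"
```

Passing `os.environ` as `env` would reproduce the behavior of the cell above, while a plain dictionary is enough for experimenting. Note that the comparison is case-sensitive, so `STREAM=false` still enables streaming.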

With the client and inference parameters in place, the remaining setup step is to initialize the document collection used for RAG ingestion and retrieval, which the next section covers.

## 2. Indexing the Documents
- Initialize a new document collection in our vector database. All parameters related to the vector database, such as the embedding model and dimension, must be specified here.
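- Provide a list of documents to the RAG tool. In this notebook each document is built from a ticket's `short_description` text, with the remaining columns attached as metadata. Llama Stack handles chunking the content (and, for documents supplied as URLs, fetching and converting them).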
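
To make the chunking step more concrete, here is a rough sketch of splitting text into overlapping windows. It is not Llama Stack's actual chunker: it counts whitespace-separated words instead of model tokens, and the `chunk_text` name and its `overlap` parameter are assumptions made for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into windows of at most `chunk_size` words.

    Consecutive windows share `overlap` words, so a phrase cut at a
    boundary still appears whole in one of the chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Short inputs, such as one-line ticket descriptions, fit inside a single window, so each of them would typically yield exactly one chunk.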

In [None]:
client.providers.list()

In [None]:
client.models.list()

In [None]:

# Explicit - specify embedding model and/or provider when you need specific ones
vs_chroma = client.vector_stores.create(
    extra_body={
        "provider_id": "chromadb",  # Optional: specify vector store provider
        "embedding_model": "sentence-transformers/nomic-ai/nomic-embed-text-v1.5",
        "embedding_dimension": 768  # Optional: will be auto-detected if not provided
    }
)

In [None]:
df = df.fillna("")

In [None]:
df_1000 = df.head(1000)

In [None]:
documents = [
    RAGDocument(
        document_id=f"num-{i}",
        content=df_1000.iloc[i]["short_description"],
        mime_type="text/plain",
        metadata=df_1000.iloc[i].drop("short_description").to_dict(),
    )
    for i in range(len(df_1000))
]
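
The mapping from a table row to a document can also be written as a standalone function, which makes the handling of missing values explicit. The sketch below works on plain dictionaries rather than pandas rows; `row_to_payload` is a hypothetical helper, and its output keys simply mirror the `RAGDocument` fields used in the cell above.

```python
import math
from typing import Any, Dict, Mapping


def _blank_if_missing(value: Any) -> Any:
    """Replace None and NaN with an empty string, like df.fillna("")."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return value


def row_to_payload(index: int, row: Mapping[str, Any], content_field: str) -> Dict[str, Any]:
    """Use one field as the document content and the rest as metadata."""
    metadata = {
        key: _blank_if_missing(value)
        for key, value in row.items()
        if key != content_field
    }
    return {
        "document_id": f"num-{index}",
        "content": str(_blank_if_missing(row.get(content_field))),
        "mime_type": "text/plain",
        "metadata": metadata,
    }
```

Keeping the searchable text in `content` and everything else in `metadata` means only the description is embedded, while the other columns travel along with each chunk.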

In [None]:
documents

In [None]:
client.tool_runtime.rag_tool.insert(
    chunk_size_in_tokens=1024,
    documents=documents,
    vector_db_id=str(vs_chroma.id),  # parameter name expected by the SDK
    extra_body={"vector_store_id": str(vs_chroma.id)},  # parameter name expected by the backend
    timeout=None,  # None disables the client-side timeout; ingestion can take a while
)

In [None]:
client.vector_io.query(vector_db_id=vs_chroma.id, query="ZTrend crashes")
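
Conceptually, a vector query embeds the query text and ranks the stored chunks by their similarity to it. The sketch below illustrates only that ranking step, using cosine similarity over bag-of-words counts. A real deployment uses the learned embedding model configured for the vector store, and `top_k_chunks` is a hypothetical helper, not a Llama Stack API.

```python
import math
from collections import Counter
from typing import List, Tuple


def _vectorize(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(query: str, chunks: List[str], k: int = 3) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    query_vec = _vectorize(query)
    scored = [(_cosine(query_vec, _vectorize(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Word counts only match exact terms, whereas a learned embedding can also match paraphrases, which is the main reason a dedicated embedding model is used for retrieval.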

## 3. Executing Queries via the Built-in RAG Tool
- Directly invoke the RAG tool to query the vector database we ingested into at the previous stage.
- Construct an extended prompt using the retrieved chunks.
- Query the model with the extended prompt.
- Output the reply received from the model.
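
The prompt-construction step can be isolated into a small helper, sketched here under the assumption that the retrieved context has already been reduced to a plain string. `build_rag_messages` is a hypothetical name; the prompt template and system prompt are the ones used in this notebook.

```python
from typing import Dict, List


def build_rag_messages(
    query: str,
    context: str,
    system_prompt: str = "You are a helpful assistant.",
) -> List[Dict[str, str]]:
    """Return the chat messages: system prompt first, then the extended user prompt."""
    extended_prompt = (
        "Please answer the given query using the context below.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUERY:\n{query}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": extended_prompt},
    ]
```

Keeping the template in one place makes it easier to experiment with different instructions without touching the retrieval or inference calls.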

In [None]:
queries = [
    "What was the root cause and resolution for application crashes related to memory issues?",
]

for prompt in queries:
    cprint(f"\nUser> {prompt}", "blue")
    
    # RAG retrieval call
    rag_response = client.tool_runtime.rag_tool.query(
        content=prompt,
        vector_db_ids=[str(vs_chroma.id)],   # required by the SDK
        extra_body={"vector_store_ids": [str(vs_chroma.id)]},  # required by the backend
    )

    print(rag_response.content)
    # the list of messages to be sent to the model must start with the system prompt
    messages = [{"role": "system", "content": "You are a helpful assistant."}]

    # construct the actual prompt to be executed, incorporating the original query and the retrieved content
    prompt_context = rag_response.content
    extended_prompt = f"Please answer the given query using the context below.\n\nCONTEXT:\n{prompt_context}\n\nQUERY:\n{prompt}"
    messages.append({"role": "user", "content": extended_prompt})

    # use Llama Stack inference API to directly communicate with the desired model
    response = client.chat.completions.create(
        messages=messages,
        model=model,
        stream=stream,
        max_tokens=max_tokens,
        temperature=temperature,
    )
    
    # print the reply inside the loop so that every query's response is shown
    if stream:
        for chunk in response:
            if chunk.choices and chunk.choices[0].delta.content:
                print(chunk.choices[0].delta.content, end="", flush=True)
        print()  # newline after streaming
    else:
        print(response.choices[0].message.content)
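
When the streamed reply is needed as a single string (for logging or evaluation, for example), the same loop can collect the deltas instead of printing them. The sketch below assumes chunks expose `choices[0].delta.content` as in the cell above; `collect_stream_text` and `fake_chunk` are hypothetical helpers, and `fake_chunk` only imitates the attributes that are read here.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional


def collect_stream_text(chunks: Iterable[Any]) -> str:
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        # skip chunks without choices or with an empty/None delta
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)


def fake_chunk(text: Optional[str]) -> SimpleNamespace:
    """Stand-in for a streamed chunk, for trying the helper without a server."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
```

For example, `collect_stream_text([fake_chunk("Hel"), fake_chunk(None), fake_chunk("lo")])` skips the empty delta and joins the remaining pieces.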

## Key Takeaways
This notebook demonstrated how to set up and use the built-in RAG tool to ingest user-provided documents into a vector database and use them during inference via direct retrieval.

Now that we've seen how easy it is to implement RAG with Llama Stack, we'll move on to building a simple agent with Llama Stack in our [Simple Agents](./Level2_simple_agent_with_websearch.ipynb) notebook.

#### Any Feedback?

If you have any feedback on this or any other notebook in this demo series, we'd love to hear it! Please go to https://www.feedback.redhat.com/jfe/form/SV_8pQsoy0U9Ccqsvk and help us improve our demos.