# Level 2: Hierarchical RAG

This notebook will show you how to build a RAG application with Llama Stack using multi-field document indexing. You will learn how the APIs provided by Llama Stack can be used to directly control and invoke all common RAG stages, including indexing, retrieval and inference.

In this example, we'll work with IT call center tickets and combine multiple fields (`short_description`, `content`, and `close_notes`) to create richer document representations that improve retrieval quality and context understanding.


## Overview

This tutorial covers the following steps:
1. Loading and preparing the dataset, combining multiple fields to create comprehensive document content.
2. Indexing the enriched documents into a vector database (ChromaDB) for later retrieval.
3. Executing the built-in RAG tool to retrieve the document chunks relevant to a given query.
4. Using the retrieved context to answer user queries during the inference step.
5. Understanding why multi-field RAG outperforms single-field RAG through practical query examples.

## 1. Setting Up this Notebook

First, we will start with a few imports.

In [1]:
import uuid

from llama_stack_client import RAGDocument
from llama_stack_client.types import Document
from llama_stack_client.lib.agents.agent import Agent


In [2]:
import pandas as pd
from huggingface_hub import hf_hub_download

# Path to the CSV file, assumed to have been downloaded from the Hugging Face Hub beforehand
file_path = "synthetic-it-call-center-tickets.csv"

# Load the CSV into a pandas DataFrame
df = pd.read_csv(file_path)

# Display the first 5 rows of the DataFrame
df.head()

Unnamed: 0.1,Unnamed: 0,number,type,date,contact_type,short_description,content,category,subcategory,customer,...,resolution_time,issue/request,software/system,output,assignment_group,item_id,role,poor_close_notes,info_score_close_notes,info_score_poor_close_notes
0,0,TASK0049212,Request,3/31/2021 14:13,Chat,Request for PostgreSQL upgrade to the latest v...,I would like to request an upgrade for our Pos...,SOFTWARE,INSTALLATION,"Morgan, Gregory",...,514.97,PostgreSQL Upgrade Request,PostgreSQL,Software/System: PostgreSQL ; Issue/Request: P...,DBTED SUPPORT GROUP,7586,customer,See worknotes,0.8,0.0
1,1,INC0048604,Incident,3/27/2021 10:09,Email,ZTrend crashes unexpectedly when saving files,User reports ZTrend crashes unexpectedly when ...,SOFTWARE,ERROR,"Adams, Kenneth",...,876.01,ZTrend Crashing When Saving Files,ZTrend,Software/System: ZTrend ; Issue/Request: ZTren...,APPLICATION SUPPORT,7287,agent,All set.,0.9,0.0
2,2,INC0034238,Incident,1/1/2021 15:41,Chat,Compatibility issues between CodeReview and ne...,Compatibility issues reported between CodeRevi...,SOFTWARE,MALFUNCTION,"Fischer, Noah",...,331.32,CodeReview Compatibility Issue,CodeReview,Software/System: CodeReview ; Issue/Request: C...,APPLICATION SUPPORT,104,agent,Resolved,0.7,0.0
3,3,INC0068299,Incident,1/22/2021 6:45,Chat,Resolution steps for AirWave's disk space conc...,A user submitted a ticket regarding their Arub...,SOFTWARE,ERROR,"Martin, Mia",...,13934.15152,Insufficient disk space leading to incomplete ...,Aruba Networks AirWave,Software/System: Aruba Networks AirWave ; Issu...,,12745,agent,Ticket closed. Issue addressed.,0.8,0.0
4,4,INC002060,Incident,1/6/2021 15:16,Self-service,Aruba Networks AirWave not working,I'm currently experiencing an issue with Aruba...,SOFTWARE,ERROR,"Stone, Isaiah",...,29771.07712,Error - SQLite3.DatabaseError: database disk i...,Aruba Networks AirWave,Software/System: Aruba Networks AirWave ; Issu...,TIER 2 TEAM,8683,customer,Issue resolved. User can access Aruba Networks...,0.9,0.1


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27602 entries, 0 to 27601
Data columns (total 24 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Unnamed: 0                   27602 non-null  int64  
 1   number                       27602 non-null  object 
 2   type                         27602 non-null  object 
 3   date                         27602 non-null  object 
 4   contact_type                 27602 non-null  object 
 5   short_description            27602 non-null  object 
 6   content                      27602 non-null  object 
 7   category                     27599 non-null  object 
 8   subcategory                  27599 non-null  object 
 9   customer                     27602 non-null  object 
 10  resolved_at                  20677 non-null  object 
 11  close_notes                  27249 non-null  object 
 12  agent                        27240 non-null  object 
 13  reassigned_count

Next, we will initialize our environment as described in detail in our ["Getting Started" notebook](Level0_getting_started_with_Llama_Stack.ipynb). Please refer to it for additional explanations.

In [4]:
# for accessing the environment variables
import os
from dotenv import load_dotenv
load_dotenv()

# for communication with Llama Stack
from llama_stack_client import LlamaStackClient

# pretty print of the results returned from the model/agent
import sys
sys.path.append('..')  
#from src.utils import step_printer
from termcolor import cprint
base_url = os.getenv("REMOTE_BASE_URL", "http://localhost:8321")
#base_url = os.getenv("REMOTE_BASE_URL")

# Tavily search API key is required for some of our demos and must be provided to the client upon initialization.
# We will cover it in the agentic demos that use the respective tool. Please ignore this parameter for all other demos.
tavily_search_api_key = os.getenv("TAVILY_SEARCH_API_KEY")
if tavily_search_api_key is None:
    provider_data = None
else:
    provider_data = {"tavily_search_api_key": tavily_search_api_key}


client = LlamaStackClient(
    base_url=base_url,
    provider_data=provider_data
)
    
print("Connected to Llama Stack server")

# model_id for the model you wish to use that is configured with the Llama Stack server
model_id = "ollama/llama3.2:3b"

temperature = float(os.getenv("TEMPERATURE", 0.0))
if temperature > 0.0:
    top_p = float(os.getenv("TOP_P", 0.95))
    strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
else:
    strategy = {"type": "greedy"}

max_tokens = int(os.getenv("MAX_TOKENS", 4096))

# sampling_params will later be used to pass the parameters to Llama Stack Agents/Inference APIs
sampling_params = {
    "strategy": strategy,
    "max_tokens": max_tokens,
}

stream_env = os.getenv("STREAM", "True")
# the Boolean 'stream' parameter will later be passed to Llama Stack Agents/Inference APIs
# any value not equal to 'False' will be considered as 'True'
stream = (stream_env != "False")

print(f"Inference Parameters:\n\tModel: {model_id}\n\tSampling Parameters: {sampling_params}\n\tstream: {stream}")

Connected to Llama Stack server
Inference Parameters:
	Model: ollama/llama3.2:3b
	Sampling Parameters: {'strategy': {'type': 'greedy'}, 'max_tokens': 4096}
	stream: True
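The `STREAM` handling above compares against the exact string `"False"`, so values such as `false` or `0` still enable streaming. If that is not what you want, a more forgiving parser could look like the sketch below. The helper name `env_flag` and the set of accepted "false" spellings are our own assumptions, not part of Llama Stack.

```python
import os

# Hypothetical helper (not part of Llama Stack): spellings we choose to treat as "false".
FALSE_VALUES = {"false", "0", "no", "off"}


def env_flag(name: str, default: bool = True) -> bool:
    """Read a boolean flag from the environment, ignoring case and surrounding whitespace."""
    raw = os.getenv(name)
    if raw is None:
        # Variable not set: fall back to the default
        return default
    return raw.strip().lower() not in FALSE_VALUES


# Usage: replaces the string comparison `stream_env != "False"`
stream = env_flag("STREAM", default=True)
print(f"stream: {stream}")
```

With this variant, `STREAM=false`, `STREAM=0` and `STREAM=False` all disable streaming, while an unset variable keeps the default.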


Finally, we complete the setup by initializing the document collection we will use for RAG ingestion and retrieval.

## 2. Indexing the Documents
- Initialize a new document collection in our vector database. Parameters related to the vector database, such as the embedding model and dimension, are specified here.
- Create RAG documents by combining multiple fields from the dataset: `short_description`, `content`, and `close_notes`. This provides richer context for better retrieval and understanding of the tickets.
- Provide the list of documents to the RAG tool. Llama Stack will handle chunking and indexing the content into the vector database.
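To make the chunking step less opaque, here is a rough sketch of what splitting a document into overlapping chunks can look like. It is only an illustration: it counts whitespace-separated words rather than model tokens, and the function `chunk_words` and its parameters are hypothetical, not the logic Llama Stack actually runs.

```python
def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears with some surrounding context.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk already reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# Usage on a short ticket description
for chunk in chunk_words("ZTrend crashes unexpectedly when saving files", chunk_size=4, overlap=1):
    print(chunk)
```

Larger chunks keep the problem description and the resolution of a ticket together, while smaller chunks give more precise matches; the chunk size passed to the RAG tool controls the same trade-off.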

In [5]:
client.providers.list()

INFO:httpx:HTTP Request: GET http://localhost:8321/v1/providers "HTTP/1.1 200 OK"


[ProviderInfo(api='inference', config={'url': 'http://localhost:11434'}, health={'status': 'Not Implemented', 'message': 'Provider does not implement health check'}, provider_id='ollama', provider_type='remote::ollama'),
 ProviderInfo(api='inference', config={'url': 'https://api.fireworks.ai/inference/v1', 'api_key': '********'}, health={'status': 'Not Implemented', 'message': 'Provider does not implement health check'}, provider_id='fireworks', provider_type='remote::fireworks'),
 ProviderInfo(api='inference', config={'url': 'https://api.together.xyz/v1', 'api_key': '********'}, health={'status': 'Not Implemented', 'message': 'Provider does not implement health check'}, provider_id='together', provider_type='remote::together'),
 ProviderInfo(api='inference', config={}, health={'status': 'Not Implemented', 'message': 'Provider does not implement health check'}, provider_id='bedrock', provider_type='remote::bedrock'),
 ProviderInfo(api='inference', config={'api_key': '********', 'base_u

In [6]:
client.models.list()

INFO:httpx:HTTP Request: GET http://localhost:8321/v1/models "HTTP/1.1 200 OK"


[Model(identifier='ollama/llama3.2:3b', metadata={}, api_model_type='llm', provider_id='ollama', type='model', provider_resource_id='llama3.2:3b', model_type='llm'),
 Model(identifier='bedrock/meta.llama3-1-8b-instruct-v1:0', metadata={}, api_model_type='llm', provider_id='bedrock', type='model', provider_resource_id='meta.llama3-1-8b-instruct-v1:0', model_type='llm'),
 Model(identifier='bedrock/meta.llama3-1-70b-instruct-v1:0', metadata={}, api_model_type='llm', provider_id='bedrock', type='model', provider_resource_id='meta.llama3-1-70b-instruct-v1:0', model_type='llm'),
 Model(identifier='bedrock/meta.llama3-1-405b-instruct-v1:0', metadata={}, api_model_type='llm', provider_id='bedrock', type='model', provider_resource_id='meta.llama3-1-405b-instruct-v1:0', model_type='llm'),
 Model(identifier='sentence-transformers/nomic-ai/nomic-embed-text-v1.5', metadata={'embedding_dimension': 768.0}, api_model_type='embedding', provider_id='sentence-transformers', type='model', provider_resourc

In [7]:

# Explicit - specify embedding model and/or provider when you need specific ones
vs_chroma = client.vector_stores.create(
    extra_body={
        "provider_id": "chromadb",  # Optional: specify vector store provider
        "embedding_model": "sentence-transformers/nomic-ai/nomic-embed-text-v1.5",
        "embedding_dimension": 768  # Optional: will be auto-detected if not provided
    }
)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/vector_stores "HTTP/1.1 200 OK"


In [8]:
df = df.fillna("")

In [9]:
# Limit the dataset to the first 1000 records for faster processing
df_1000 = df.head(1000)

In [10]:
# Create RAG documents by combining multiple fields to enrich the context
# The 'short_description', 'content' and 'close_notes' fields are combined into the main content
# The remaining fields are stored as metadata for filtering and reference
documents = [
    RAGDocument(
        document_id=f"num-{i}",
        content=f"{df_1000.iloc[i]['short_description']}\n\n{df_1000.iloc[i]['content']}\n\n{df_1000.iloc[i]['close_notes']}",
        mime_type="text/plain",
        metadata=df_1000.iloc[i].drop(["short_description", "content", "close_notes"]).to_dict(),
    )
    for i in range(len(df_1000))
]
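Because missing values were replaced with empty strings earlier, a ticket without `close_notes` ends up with trailing blank lines in its combined content. One possible refinement is a small helper that skips empty fields. The name `combine_fields` is hypothetical; the sketch uses a plain dict for the row, and a pandas row should work the same way since `Series` also provides `.get`.

```python
# Fields merged into the document body, in the same order as in the cell above
FIELDS = ("short_description", "content", "close_notes")


def combine_fields(row, fields=FIELDS, sep: str = "\n\n") -> str:
    """Join the non-empty fields of a ticket into one text block."""
    parts = [str(row.get(field, "")).strip() for field in fields]
    return sep.join(part for part in parts if part)


# Usage with a ticket that has no close notes
ticket = {
    "short_description": "Aruba Networks AirWave not working",
    "content": "I'm currently experiencing an issue with Aruba Networks AirWave.",
    "close_notes": "",
}
print(combine_fields(ticket))
```

In the list comprehension above, `content=combine_fields(df_1000.iloc[i])` could then replace the f-string.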

In [11]:
documents

[{'document_id': 'num-0',
  'content': 'Request for PostgreSQL upgrade to the latest version.\n\nI would like to request an upgrade for our PostgreSQL to the latest version in order to utilize new features and improvements. Before proceeding, please assess the impact this upgrade may have on our current projects and confirm compatibility with our existing data. Could you escalate this request to the Software Upgrade Team for me?\n\nUpgraded PostgreSQL to the latest version to access new features and improvements. Assessed the impact on current projects and verified compatibility with existing data. Confirmed backups of all databases. Followed upgrade procedures: stopped the PostgreSQL service, performed the upgrade, and restarted the service. Verified successful upgrade by running basic queries and checking for errors. Informed the user about the new features and any changes in functionalities. Upgrade completed successfully and system is functioning as expected.',
  'mime_type': 'text

In [12]:
client.tool_runtime.rag_tool.insert( 
    chunk_size_in_tokens=1024,
    documents=documents,
    vector_db_id=str(vs_chroma.id),
    extra_body={"vector_store_id": str(vs_chroma.id)},
    extra_headers=None,
    extra_query=None,
    timeout=None
)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/tool-runtime/rag-tool/insert "HTTP/1.1 200 OK"


In [13]:
client.vector_io.query(vector_db_id=vs_chroma.id, query="ZTrend crashes")

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/vector-io/query "HTTP/1.1 200 OK"


QueryChunksResponse(chunks=[Chunk(content="ZTrend crashes unexpectedly when saving files\n\nUser reports ZTrend crashes unexpectedly when attempting to save files. \nPerformed initial diagnostics to check log files for error codes. \n\nEscalating to Software Maintenance Team\n\nPerformed initial diagnostics to check log files for error codes. The logs indicated a corruption in the application's config files. Followed these troubleshooting steps: 1. Cleared temporary files and cache related to ZTrend. 2. Re-installed the ZTrend application to ensure a clean installation. 3. Restored default settings by renaming the config file and letting the application generate a new one upon launch. 4. Confirmed that the application no longer crashes when saving files. User was able to save multiple files without issues after these steps.", metadata={'Unnamed: 0': 1.0, 'number': 'INC0048604', 'type': 'Incident', 'date': '3/27/2021 10:09', 'contact_type': 'Email', 'category': 'SOFTWARE', 'subcategory'

## 3. Executing Queries via the Built-in RAG Tool
- Directly invoke the RAG tool to query the vector database we ingested into at the previous stage.
- Construct an extended prompt using the retrieved chunks.
- Query the model with the extended prompt.
- Output the reply received from the model.

In [15]:
queries = [
    "What was the root cause and resolution for application crashes related to memory issues?",
]

for prompt in queries:
    cprint(f"\nUser> {prompt}", "blue")
    
    # RAG retrieval call
    rag_response = client.tool_runtime.rag_tool.query(
        content=prompt,
        vector_db_ids=[str(vs_chroma.id)],   # required by the SDK
        extra_body={"vector_store_ids": [str(vs_chroma.id)]},  # required by the backend
    )

    print(rag_response.content)
    # the list of messages to be sent to the model must start with the system prompt
    messages = [{"role": "system", "content": "You are a helpful assistant."}]

    # construct the actual prompt to be executed, incorporating the original query and the retrieved content
    prompt_context = rag_response.content
    extended_prompt = f"Please answer the given query using the context below.\n\nCONTEXT:\n{prompt_context}\n\nQUERY:\n{prompt}"
    messages.append({"role": "user", "content": extended_prompt})

    # use Llama Stack inference API to directly communicate with the desired model
    response = client.chat.completions.create(
        messages=messages,
        model=model_id,
        stream=stream,
        max_tokens=sampling_params.get("max_tokens"),
        temperature=sampling_params["strategy"].get("temperature"),
    )
    
    # print the reply for the current query (inside the loop, so every query is answered)
    if stream:
        for chunk in response:
            if chunk.choices and chunk.choices[0].delta.content:
                print(chunk.choices[0].delta.content, end="", flush=True)
        print()  # newline after streaming
    else:
        print(response.choices[0].message.content)

User> What was the root cause and resolution for application crashes related to memory issues?


INFO:httpx:HTTP Request: POST http://localhost:8321/v1/tool-runtime/rag-tool/query "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"


[TextContentItem(text='knowledge_search tool found 5 chunks:\nBEGIN of knowledge_search tool results.\n', type='text'), TextContentItem(text="Result 1\nContent: Firefox crashes system due to high memory\n\nHello support team,\n\nI am writing to report an issue with Firefox that has been affecting the overall performance of my computer. Whenever I use Firefox for more than 30 minutes, especially with multiple tabs open, my system starts to considerably slow down, often becoming unresponsive. It appears that the browser's memory usage is excessively high compared to when I use other browsers or similar applications. I've already tried disabling extensions and reinstalling Firefox, but the issue persists. Your assistance in resolving this would be appreciated as it severely hampers my productivity.\n\nThank you,\nJohn Doe\n\nRemoted in to assess the reported issue with Firefox crashing the system due to high memory usage. \nMonitored the memory consumption while running Firefox with multi
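As the output above shows, `rag_response.content` is a list of `TextContentItem` objects rather than a plain string. Interpolating that list into an f-string inserts its Python representation, wrappers and escaped newlines included, into the prompt. A possible cleanup is to join the `text` of each item first. The sketch below uses a stand-in dataclass, `TextItem`, in place of the real item type, and the helper name `content_to_text` is our own.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    """Hypothetical stand-in for the text items returned by the RAG tool."""
    text: str
    type: str = "text"


def content_to_text(content) -> str:
    """Flatten a plain string or a list of content items into one string.

    Items whose type is not "text" (for example images) are skipped.
    """
    if isinstance(content, str):
        return content
    return "".join(item.text for item in content if getattr(item, "type", None) == "text")


# Usage with items shaped like the output above
items = [
    TextItem("knowledge_search tool found 5 chunks:\n"),
    TextItem("Result 1\nContent: Firefox crashes system due to high memory\n"),
]
print(content_to_text(items))
```

In the query cell, `prompt_context = content_to_text(rag_response.content)` would then give the model clean text to work with.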

## 4. Why Multi-Field RAG is Better: Example Queries

Using multiple fields (`short_description`, `content`, and `close_notes`) instead of just `short_description` significantly improves retrieval quality for certain types of queries. Here are examples where multi-field RAG outperforms single-field RAG:

### Example 1: Troubleshooting Steps and Solutions
**Query**: "How do I fix ZTrend crashes when saving files?"

- **Single-field (short_description only)**: May retrieve tickets about crashes, but won't have the solution steps
- **Multi-field**: Retrieves tickets with both the problem description AND the detailed troubleshooting steps from `close_notes`, providing complete answers

### Example 2: Historical Context and Resolution
**Query**: "What was the root cause and resolution for application crashes related to memory issues?"

- **Single-field**: Only finds tickets mentioning "crashes" but misses the diagnostic details and resolution steps
- **Multi-field**: Retrieves tickets with full context from `content` (initial problem description) and `close_notes` (diagnostic findings and resolution), enabling comprehensive answers

### Example 3: Pattern Recognition Across Problem-Solution Pairs
**Query**: "What are common solutions for software crashes that involve configuration files?"

- **Single-field**: Can identify crash-related tickets but can't see the solutions
- **Multi-field**: Can match both problem patterns (from `short_description`/`content`) and solution patterns (from `close_notes`), enabling identification of recurring problem-solution patterns

### Example 4: Detailed Technical Information
**Query**: "Show me tickets where log file analysis revealed the issue"

- **Single-field**: May miss tickets where log analysis is only mentioned in `content` or `close_notes`
- **Multi-field**: Captures technical details from all fields, ensuring comprehensive retrieval of relevant tickets

### Example 5: End-to-End Ticket Understanding
**Query**: "Find tickets where the customer reported a problem, diagnostics were performed, and the issue was resolved by reinstalling software"

- **Single-field**: Can't capture the full narrative flow from problem → diagnosis → solution
- **Multi-field**: Preserves the complete ticket lifecycle, enabling retrieval based on complex multi-stage scenarios

**Key Insight**: Multi-field RAG is especially powerful for queries that require understanding both the problem AND the solution, or queries that need to match patterns across different stages of the ticket lifecycle.
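The effect can be illustrated without a vector database. The toy sketch below scores a query against a single-field and a multi-field version of the ZTrend ticket (text abbreviated from the dataset) using plain word overlap. Word overlap is only a crude stand-in for embedding similarity, and the helper names are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lower-case the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, document: str) -> float:
    """Fraction of query words that also occur in the document."""
    query_tokens = tokens(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokens(document)) / len(query_tokens)


ticket = {
    "short_description": "ZTrend crashes unexpectedly when saving files",
    "content": "User reports ZTrend crashes unexpectedly when attempting to save files.",
    "close_notes": "Re-installed the ZTrend application and restored default settings "
                   "by renaming the config file.",
}
query = "crashes fixed by renaming the config file"

single_field = ticket["short_description"]
multi_field = "\n\n".join(ticket[f] for f in ("short_description", "content", "close_notes"))

single_score = overlap_score(query, single_field)
multi_score = overlap_score(query, multi_field)
print(f"single-field: {single_score:.2f}  multi-field: {multi_score:.2f}")
```

The single-field document shares only the word "crashes" with the query, whereas the multi-field document also matches the resolution vocabulary that appears only in `close_notes`.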


## Key Takeaways
This notebook demonstrated how to set up and use the built-in RAG tool to ingest user-provided documents into a vector database and use them during inference via direct retrieval.

Key points:
- **Multi-field content**: We combined `short_description`, `content`, and `close_notes` fields to create richer document representations, improving the quality of retrieval and context understanding.
- **Metadata preservation**: Other fields from the dataset are stored as metadata, allowing for filtering and additional context during retrieval.
- **Vector database integration**: The documents are chunked and indexed into ChromaDB using Llama Stack's RAG tool, enabling semantic search over the ticket data.
- **Query advantages**: As shown in Section 4, multi-field RAG excels at queries requiring both problem and solution context, pattern recognition across ticket lifecycle stages, and comprehensive technical information retrieval.
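
As a small illustration of the metadata point, retrieved chunks can be narrowed down on the client side after retrieval. The sketch below models chunks as plain dicts with `content` and `metadata` keys, which is an assumption made to keep it self-contained; the helper `filter_by_metadata` is hypothetical and not a Llama Stack API.

```python
def filter_by_metadata(chunks: list[dict], **criteria) -> list[dict]:
    """Keep only the chunks whose metadata matches every key/value pair in `criteria`."""
    return [
        chunk
        for chunk in chunks
        if all(chunk.get("metadata", {}).get(key) == value for key, value in criteria.items())
    ]


# Usage with two chunks shaped like the tickets indexed above
chunks = [
    {
        "content": "ZTrend crashes unexpectedly when saving files",
        "metadata": {"type": "Incident", "category": "SOFTWARE"},
    },
    {
        "content": "Request for PostgreSQL upgrade to the latest version",
        "metadata": {"type": "Request", "category": "SOFTWARE"},
    },
]
incidents = filter_by_metadata(chunks, type="Incident")
print([chunk["content"] for chunk in incidents])
```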

Now that we've seen how easy it is to implement RAG with Llama Stack, we'll move on to building a simple agent with Llama Stack next in our [Simple Agents](./Level2_simple_agent_with_websearch.ipynb) notebook.