# RAG (Retrieval-Augmented Generation) with Agentic AI Demo

## Overview

This notebook demonstrates a RAG (Retrieval-Augmented Generation) system using LlamaStack, which combines:
- **Document Retrieval**: Using vector databases to search through ingested documents
- **Agentic AI**: Using ReAct (Reasoning + Acting) agents that can use multiple tools
- **Multi-Tool Workflows**: Combining RAG, web search, and custom tools for comprehensive question answering

## Approach & Architecture

### Why RAG?
RAG addresses the limitation of LLMs having static knowledge by:
1. **Retrieval**: Finding relevant information from a knowledge base (vector database)
2. **Augmentation**: Adding retrieved context to the prompt
3. **Generation**: Using the LLM to generate answers based on the augmented context
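
The three steps above can be sketched without any external service. This is a toy illustration only: the `DOCS` list, the word-overlap scoring, and the placeholder `generate` function are assumptions made for the example, whereas the notebook itself uses learned embeddings in Milvus and a real LLM.

```python
import re

# Toy knowledge base standing in for the vector database (hypothetical data)
DOCS = [
    "Milvus stores document embeddings for semantic search.",
    "Docling converts PDF files into structured Markdown.",
    "ReAct agents alternate between reasoning steps and tool calls.",
]


def words(text: str) -> set[str]:
    """Lower-cased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, k: int = 1) -> list[str]:
    """Retrieval: rank documents by how many words they share with the question."""
    q = words(question)
    return sorted(DOCS, key=lambda d: len(q & words(d)), reverse=True)[:k]


def augment(question: str, chunks: list[str]) -> str:
    """Augmentation: place the retrieved text in the prompt next to the question."""
    context = "\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


def generate(prompt: str) -> str:
    """Generation: placeholder for the LLM call, which is out of scope here."""
    return f"[an LLM would answer here from a prompt of {len(prompt)} characters]"


question = "What does Docling convert?"
prompt = augment(question, retrieve(question))
print(prompt)
print(generate(prompt))
```

Swapping `words` for a real embedding model and `generate` for an inference call gives the same overall shape as the pipeline built later in this notebook.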

### Why Agentic AI?
Traditional RAG only searches documents. Agentic AI enables:
- **Tool Selection**: Automatically choosing the right tool (RAG, web search, custom tools)
- **Multi-Step Reasoning**: Breaking down complex queries into steps
- **Dynamic Information**: Accessing real-time data (stock prices, web search)

### System Components
1. **LlamaStack Client**: Single entry point to LLM services, vector databases, and agents
2. **Vector Database (Milvus)**: Stores document embeddings for semantic search
3. **Docling**: Advanced PDF extraction with OCR capabilities
4. **ReAct Agents**: Intelligent agents that reason and act using tools
5. **Custom Tools**: Domain-specific functions (e.g., Yahoo Finance for stock data)

---

In [None]:
# Install notebook dependencies. 
# Will take a while to download and install numerous dependencies. 
# Wait until it finishes before proceeding
%pip install llama_stack_client==0.2.22 docling yfinance rich

In [None]:
# Verify the installed version of llama_stack_client (0.2.22)
# This ensures we're using the correct version for compatibility
import llama_stack_client

print(llama_stack_client.__version__)

In [None]:
# Python stdlib imports
import os
import json
from datetime import date, datetime, timedelta
import re
import logging

# Suppress verbose and noisy HTTP logs
logging.getLogger("httpx").setLevel(logging.WARNING)

# Llamastack imports
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger
from llama_stack_client import Document
from llama_stack_client.lib.agents.react.agent import ReActAgent
from llama_stack_client.lib.agents.react.tool_parser import ReActOutput
from llama_stack_client.lib.agents.client_tool import client_tool
from llama_stack_client.lib.agents.event_logger import EventLogger

# Yahoo finance API for custom agents
import yfinance as yf

# Docling imports
from docling.document_converter import DocumentConverter

# pretty printing
import rich

## Setting up Configurations

### LLM Sampling Parameters

These parameters control how the LLM generates responses:

- **temperature**: Controls randomness (0.0 = deterministic, 1.0+ = more creative)
  - Lower values (0.1-0.3): More focused, deterministic responses
  - Higher values (0.7-1.0): More creative, diverse responses
  - We use 0.7 for balanced creativity and accuracy

- **top_p** (nucleus sampling): Probability mass threshold for token selection
  - Samples only from the smallest set of most-probable tokens whose cumulative probability reaches top_p
  - 0.95 means considering tokens that make up 95% of probability mass
  - Works with temperature to control diversity

- **max_tokens**: Maximum number of tokens in the generated response
  - Prevents excessively long outputs
  - 512 tokens is roughly 350-400 English words (about 0.75 words per token)
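
To make the interaction between `temperature` and `top_p` concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus filtering. The token scores are invented for illustration, and the helper names are hypothetical; the inference server's actual sampler may differ in detail.

```python
import math
import random


def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def nucleus(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of most-probable tokens whose cumulative probability reaches top_p."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}  # renormalise the survivors


def sample(logits: dict[str, float], temperature: float = 0.7, top_p: float = 0.95, rng=None) -> str:
    """Greedy when temperature is 0, otherwise temperature + top-p sampling."""
    if temperature <= 0.0:
        return max(logits, key=logits.get)
    rng = rng or random.Random()
    candidates = nucleus(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(list(candidates), weights=list(candidates.values()), k=1)[0]


# Invented scores for four candidate next tokens
logits = {"bank": 3.0, "branch": 2.0, "river": 0.5, "zebra": -2.0}
print(nucleus(softmax_with_temperature(logits, 0.7), 0.95))
print(sample(logits, temperature=0.0))
```

With these scores, the low-probability tail is cut off before sampling, which is why `top_p` limits how far a higher temperature can wander.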

In [None]:
# Temperature: Controls randomness in LLM output
# 0.0 = deterministic (always same output for same input)
# 0.7 = balanced creativity and consistency
# 1.0+ = highly creative/variable outputs
temperature = 0.7

# Configure sampling strategy based on temperature
if temperature > 0.0:
    # Top-p (nucleus sampling): Only consider tokens whose cumulative probability 
    # is within the top_p threshold (0.95 = 95% probability mass)
    # This provides more focused sampling than pure temperature
    top_p = float(os.getenv("TOP_P", "0.95"))
    # Top-p strategy: Uses both temperature and top_p for controlled randomness
    strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
else:
    # Greedy strategy: Always selects the most probable token (deterministic)
    strategy = {"type": "greedy"}

# Maximum tokens in the generated response
# 512 tokens is roughly 350-400 English words; prevents excessively long outputs
max_tokens = 512

# Sampling parameters dictionary
# Will be passed to LlamaStack Agents/Inference APIs to control text generation
sampling_params = {
    "strategy": strategy,
    "max_tokens": max_tokens,
}

## Initializing LlamaStack Client and Selecting Models

### Client Setup
The LlamaStackClient connects to the LlamaStack service endpoint, which provides:
- LLM inference services
- Vector database management
- Agent orchestration
- Tool execution

### Model Selection
We need two types of models:
 1. **LLM Model**: For text generation (e.g., Granite-3.3-8B-Instruct)
 2. **Embedding Model**: For converting text to vectors (e.g., granite-embedding-125m)
    - **Embedding Dimension**: Size of the vector space (e.g., 768 dimensions)
    - Used for semantic similarity search in vector databases
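
As a rough illustration of what "semantic similarity search" means, the sketch below ranks two chunks against a query by cosine similarity. The 4-dimensional vectors are hand-written assumptions; a real embedding model produces vectors of the registered embedding dimension (for example 768), and every vector in one database must share that dimension.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same embedding dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical embeddings, shortened to 4 dimensions for readability
query_vec = [0.9, 0.1, 0.0, 0.2]
chunk_vecs = {
    "shareholding pattern": [0.8, 0.2, 0.1, 0.1],
    "branch network": [0.1, 0.9, 0.3, 0.0],
}

ranked = sorted(
    chunk_vecs,
    key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
    reverse=True,
)
print(ranked)
```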

In [None]:
# LlamaStack service URL (in-cluster)
LLAMASTACK_URL = "http://llama-stack-dist-service.competitor-analysis.svc.cluster.local:8321"

# Vector DB name (logical identifier used by Llamastack)
VECTOR_DB_NAME = "agentic-rag-db"

# Initialize client
client = LlamaStackClient(base_url=LLAMASTACK_URL)
    
# Test connection by listing models
models = client.models.list()
    
rich.print(models)

In [None]:
# Get the main inference model and embedding model
model_id = next(m.identifier for m in models if m.model_type == "llm")
embedding_model = next(m for m in models if m.model_type == "embedding")
embedding_model_id = embedding_model.identifier
embedding_dimension = int(embedding_model.metadata["embedding_dimension"])

vector_db = client.vector_dbs.register(
    vector_db_id=VECTOR_DB_NAME,
    embedding_model=embedding_model_id,
    embedding_dimension=embedding_dimension,
    provider_id="milvus-remote",
)

# IMPORTANT: Use the identifier returned by registration (a UUID), not the logical name, for ingestion and queries
vector_db_id = vector_db.identifier

In [None]:
rich.print(f"Using inference model: {model_id}")
rich.print(f"Using embedding model: [red]{embedding_model_id}[/red] with dimension: {embedding_dimension}")
rich.print(f"Using vector DB with ID: [red]{vector_db_id}[/red]")

## Document Ingestion using Docling

### Why Docling?
Docling is an advanced document converter that provides:
- **Intelligent PDF Parsing**: Extracts text, tables, and structure
- **OCR Capabilities**: Handles scanned documents and images
- **Table Extraction**: Preserves table structure and formatting
- **Better than Basic Extractors**: Maintains document hierarchy and context

### Document Sources
As an example, we will ingest Indian Bank financial documents from their official website:
- Financial results
- Presentations
- Notes and disclosures

> WARNING: Listing URLs manually like this is suitable only for development and testing. For bulk ingestion of documents, use the KFP pipeline approach outlined in the previous notebook.

In [None]:
# URLs of sample Indian Bank financial documents to ingest
# These PDFs contain financial results, presentations, and notes
urls = [
     "https://indianbank.bank.in/wp-content/uploads/2025/10/Notes-forming-part-of-Reviewed-Financial-Results-for-September-2025.pdf",
     "https://indianbank.bank.in/wp-content/uploads/2025/10/Presentation-September-2025.pdf",
     "https://indianbank.bank.in/wp-content/uploads/2025/10/Reviewed-Financial-Results-Consolidated.pdf"
]

## Docling-Powered Document Ingestion

### Ingestion Process
1. **Download & Convert**: Docling converts PDFs to structured text (Markdown)
2. **Create Document Objects**: Wrap extracted text in LlamaStack `Document` objects with metadata
3. **Chunk & Embed**: Documents are split into chunks and converted to embeddings
4. **Store in Vector DB**: Chunks are stored in the vector database for retrieval

### Chunking Strategy
- **chunk_size_in_tokens**: 512 tokens per chunk
  - Balances context size with retrieval precision
  - Smaller chunks = more precise matches
  - Larger chunks = more context per match
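
The trade-off above is easier to see with a small chunker. This sketch is an assumption-laden stand-in: it treats whitespace-separated words as tokens and adds an optional `overlap` parameter for illustration, whereas the server-side chunking uses its own tokenizer and settings.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int = 0) -> list[list[str]]:
    """Split a token list into windows of chunk_size, optionally sharing `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks: list[list[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
        start += step
    return chunks


# Words stand in for tokens here; a model tokenizer would split text differently
text = "total income rose while gross advances and deposits also grew this quarter"
tokens = text.split()

for size in (4, 8):
    pieces = chunk_tokens(tokens, chunk_size=size)
    print(size, [" ".join(p) for p in pieces])
```

Smaller windows yield more, narrower chunks to match against; larger windows return fewer chunks that each carry more surrounding context.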

> WARNING: Converting documents with Docling can take a long time, depending on your GPU and hardware capacity. Wait until the conversion is complete and the embeddings are ingested into the vector database before proceeding!

In [None]:
# Convert each PDF with Docling and ingest it into the vector DB.
# Conversion can take a while, especially without a GPU.
# Wait until this cell finishes before proceeding.

# Create the converter once and reuse it for every document
converter = DocumentConverter()

for pdf_url in urls:
    # Docling downloads the PDF and converts it to structured Markdown
    result = converter.convert(pdf_url)
    text_content = result.document.export_to_markdown()

    document = Document(
        document_id=pdf_url,
        content=text_content,
        mime_type="text/markdown",
        metadata={"source": pdf_url},
    )

    # Chunk, embed, and insert the document into the vector DB
    client.tool_runtime.rag_tool.insert(
        documents=[document],
        vector_db_id=vector_db_id,
        chunk_size_in_tokens=512,
    )

### a. Manual RAG Search

**Approach**: Direct control over retrieval and generation steps.

**Process**:
1. Query the vector database for relevant chunks
2. Format retrieved chunks as context
3. Build a prompt with query + context
4. Call LLM to generate answer

**Advantages**:
- Full control over retrieval parameters
- Customizable prompt templates
- Easy to debug and inspect intermediate steps

**Use Cases**: When you need fine-grained control over the RAG pipeline

#### Step 1: Retrieving Relevant Chunks

**Query Configuration Parameters**:
- **query_generator_config**: How to process the query
  - `type: "default"`: Standard query processing
  - `separator: " "`: String used to join the parts of the query content into a single search string
- **max_tokens_in_context**: Maximum total tokens from retrieved chunks (4096)
- **max_chunks**: Number of chunks to retrieve (5)
- **chunk_template**: Format for each chunk in the response
- **mode: "vector"**: Use vector similarity search (semantic search)
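
To show how `max_chunks`, `max_tokens_in_context`, and `chunk_template` could interact, here is a hedged local sketch. The `Chunk` dataclass, the `render_context` helper, and the word-count token estimate are all hypothetical; the server applies its own logic and tokenizer, so treat this only as a mental model of the parameters.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved chunk."""
    content: str
    metadata: dict = field(default_factory=dict)


TEMPLATE = "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n"


def render_context(chunks: list[Chunk], chunk_template: str,
                   max_chunks: int, max_tokens_in_context: int) -> str:
    """Format the top chunks with the template until either limit is reached."""
    rendered: list[str] = []
    used = 0
    for index, chunk in enumerate(chunks[:max_chunks], start=1):
        cost = len(chunk.content.split())  # crude token estimate: one word = one token
        if used + cost > max_tokens_in_context:
            break
        # str.format resolves {chunk.content} through attribute access on `chunk`
        rendered.append(chunk_template.format(index=index, chunk=chunk, metadata=chunk.metadata))
        used += cost
    return "".join(rendered)


chunks = [
    Chunk("alpha beta gamma", {"source": "a.pdf"}),
    Chunk("delta epsilon zeta", {"source": "b.pdf"}),
    Chunk("eta theta iota", {"source": "c.pdf"}),
]
print(render_context(chunks, TEMPLATE, max_chunks=2, max_tokens_in_context=4096))
```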

In [None]:
# User query about Indian Bank shareholding
query = "As per the documents, tell me about Percentage of shares held by Government of India in Indian bank "

# Query the vector database for relevant document chunks
# This performs semantic search to find chunks most similar to the query
response = client.tool_runtime.rag_tool.query(
        vector_db_ids=[vector_db_id],  # Which vector database(s) to search
        content=query,  # The user's query/question
        query_config={
            # Query generation configuration
            "query_generator_config": {
                "type": "default",  # Standard query processing
                "separator": " "  # String used to join the parts of the query content
            },
            },
            # Maximum total tokens from all retrieved chunks
            # Prevents exceeding LLM context window limits
            "max_tokens_in_context": 4096,
            # Maximum number of chunks to retrieve
            # More chunks = more context but potentially less focused
            "max_chunks": 5,
            # Template for formatting each retrieved chunk
            # {index}: Chunk number, {chunk.content}: Text content, {metadata}: Document metadata
            "chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
            # Search mode: "vector" uses semantic similarity search
            # Alternative: "keyword" for keyword-based search
            "mode": "vector"
        },
    )
rich.print(response)

#### Step 2: Complete RAG Pipeline Function

This function combines retrieval and generation:
1. **Retrieve**: Get relevant chunks from vector DB
2. **Format**: Combine chunks into context string
3. **Augment**: Add context to prompt
4. **Generate**: Call LLM with augmented prompt
5. **Return**: Final answer text

**Prompt Engineering**:
- Instructs LLM to only use provided context
- Handles cases where answer isn't in context
- Clear separation between question and context

In [None]:
def rag_pipeline(question: str) -> str:
    """
    Complete RAG pipeline: Retrieve relevant chunks and generate answer.
    
    Args:
        question: User's question to answer
        
    Returns:
        Final answer text generated by LLM based on retrieved context
    """
    # 1. Retrieve relevant chunks via the RAG tool
    #    (semantic search in the vector database)
    response = client.tool_runtime.rag_tool.query(
        vector_db_ids=[vector_db_id],
        content=question,
        query_config={
            "query_generator_config": {
                "type": "default",
                "separator": " "
            },
            "max_tokens_in_context": 4096,
            "max_chunks": 5,
            "chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
            "mode": "vector"
        },
    )

    # 2. Extract plain text from the retrieved chunks
    #    (response.content is a list of content items; each item has .text)
    rag_text_chunks = []
    for item in response.content:
        # Content items may be objects (item.text) or plain dicts (item["text"])
        text = item["text"] if isinstance(item, dict) else item.text
        rag_text_chunks.append(str(text))

    context = "\n\n".join(rag_text_chunks)

    # 3. Build a prompt that includes both the question and the retrieved context
    prompt = f"""You are a question-answering assistant.
Answer the question ONLY using the context provided. 
If the answer is not in the context, respond with 'I don't know'.

<question>
{question}
</question>

<context>
{context}
</context>
"""

    # 4. Ask the LLM to generate an answer using that context
    completion = client.inference.chat_completion(
        model_id=model_id,   # use your registered model id here
        messages=[{"role": "user", "content": prompt}],
    )

    # 5. Return the answer text
    return completion.completion_message.content


# Test the RAG pipeline
answer = rag_pipeline(question=query)
rich.print(answer)

### b. Using File Search API

**Approach**: Simplified RAG via LlamaStack's Responses API.

**Process**:
1. Single API call handles retrieval + generation
2. LlamaStack manages retrieval and prompt construction (chunking already happened at ingestion time)
3. Returns final answer directly

**Advantages**:
- Simpler code (one API call)
- Less configuration needed
- Built-in optimizations

**Use Cases**: When you want a quick RAG solution without hand-tuning retrieval parameters
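
When the convenience attribute `output_text` is missing, the answer text usually has to be collected from the response's output items. The helper below is a defensive sketch that assumes the response follows an OpenAI-style Responses layout (an `output` list whose `message` items carry content blocks with a `text` field). That layout and the `extract_output_text` name are assumptions, so the example runs against a hand-built fake object rather than a live call.

```python
from types import SimpleNamespace


def extract_output_text(response) -> str:
    """Return the answer text from a Responses-style object, tolerating missing fields."""
    direct = getattr(response, "output_text", None)
    if isinstance(direct, str) and direct:
        return direct

    parts: list[str] = []
    for item in getattr(response, "output", None) or []:
        if getattr(item, "type", None) != "message":
            continue  # skip tool-call records such as file_search_call
        content = getattr(item, "content", None)
        if isinstance(content, str):
            parts.append(content)
            continue
        for block in content or []:
            text = getattr(block, "text", None)
            if isinstance(text, str):
                parts.append(text)
    return "".join(parts)


# Hand-built fake response, used only to exercise the helper
fake_response = SimpleNamespace(
    output=[
        SimpleNamespace(type="file_search_call"),
        SimpleNamespace(
            type="message",
            content=[SimpleNamespace(type="output_text", text="The answer.")],
        ),
    ]
)
print(extract_output_text(fake_response))
```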

In [None]:
# Same query as before
query = "As per the documents, tell me about Percentage of shares held by Government of India in Indian bank "

# Use LlamaStack's Responses API for simplified RAG
# This API handles retrieval + generation in a single call
response = client.responses.create(
    model=model_id,  # LLM model to use for generation
    input=query,  # User's question
    tools=[
        {
            "type": "file_search",  # Built-in RAG tool type
            # vector_store_ids: Which vector databases to search
            # The API will automatically:
            # 1. Retrieve relevant chunks
            # 2. Format them as context
            # 3. Generate answer using LLM
            "vector_store_ids": [vector_db_id],
        }
    ],
)
# Extract the output text from the response
print("Responses API result:", getattr(response, "output_text", response))

### c. Using RAG Agent

**Approach**: Agent-based RAG with built-in knowledge search tool.

**Process**:
1. Create an Agent with `builtin::rag/knowledge_search` tool
2. Agent automatically decides when to search documents
3. Agent reasons about the query and generates answer

**Advantages**:
- Agent can reason about when to use RAG
- Can combine with other tools
- More flexible and extensible

**Use Cases**: When building complex systems that need multiple tools and reasoning

In [None]:
# User query
query = "As per the documents, tell me about Percentage of shares held by Government of India in Indian bank"

def agent_qa(user_question: str) -> str:
    """
    RAG using Agent with built-in knowledge search tool.
    
    The Agent automatically:
    - Decides when to search documents
    - Retrieves relevant chunks
    - Generates answer based on retrieved context
    
    Args:
        user_question: User's question to answer
        
    Returns:
        Final answer from the agent
    """
    # Create an Agent with RAG knowledge search capability
    agent = Agent(
        client,  # LlamaStack client
        model=model_id,  # LLM model for the agent
        # Instructions guide the agent's behavior
        # "Answer strictly based on retrieved documents" reduces the risk of hallucination
        instructions="You are a helpful assistant. Answer strictly based on retrieved documents.",
        tools=[
            {
                # builtin::rag/knowledge_search: Built-in RAG tool
                # Automatically searches vector databases and retrieves relevant chunks
                "name": "builtin::rag/knowledge_search",
                "args": {
                    # vector_db_ids: Which vector databases to search
                    "vector_db_ids": [vector_db_id]
                },
            }
        ],
    )
    
    # Create a session for this conversation
    # Sessions maintain conversation history and context
    session_id = agent.create_session("rag-session")
    
    # Create a turn (one interaction) in the conversation
    response = agent.create_turn(
        messages=[
            {
                "role": "user",  # User message
                "content": user_question,  # The question
            }
        ],
        session_id=session_id,  # Associate with this session
        stream=False,  # Get complete response (not streaming)
    )

    # Extract the raw content from the agent's response
    raw_content = response.output_message.content

    # The model may return plain text or a JSON object with an "answer" field
    # (the structured thought/action/answer format used by ReAct-style prompts)
    try:
        parsed = json.loads(raw_content)
    except (json.JSONDecodeError, TypeError):
        # Not JSON (or not a string): use the raw content as the answer
        parsed = None

    if isinstance(parsed, dict):
        final_answer = parsed.get("answer", raw_content)
    else:
        final_answer = raw_content

    return final_answer

# Test the agent-based RAG
answer = agent_qa(user_question=query)
rich.print(answer)

### Limitations of Pure RAG

**Problem**: RAG only searches ingested documents. It cannot answer questions about:
- Real-time information (current stock prices, latest news)
- Information not in the document corpus
- Dynamic data that changes frequently

**Example**: Asking about "latest stock price" when documents only contain historical financial data.

**Solution**: Combine RAG with other tools (web search, APIs) using Agentic AI.

In [None]:
# Example query that RAG cannot answer (requires real-time data)
query = "can you tell me about Indian bank's stock latest price?"

# This will fail or give an incomplete answer because:
# 1. Documents contain historical financial data, not real-time prices
# 2. Stock prices change constantly and aren't in static documents
# 3. RAG can only retrieve from ingested documents
rag_pipeline(question=query)

#### Solution: Multi-Tool Agentic AI

**Approach**: Use agents that can select and use multiple tools:
1. **RAG Tool**: For questions about ingested documents
2. **Web Search Tool**: For real-time information and current events
3. **Custom Tools**: For domain-specific data (e.g., stock prices via APIs)

**Agent Capabilities**:
- **Tool Selection**: Automatically chooses the right tool(s) for each query
- **Multi-Step Reasoning**: Can use multiple tools in sequence
- **Context Awareness**: Understands when to use RAG vs. web search vs. custom tools

This creates a comprehensive system that handles both document-based and real-time queries.
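
The reason-then-act loop behind tool selection can be sketched in a few lines. Everything here is a stand-in: `scripted_policy` replaces the LLM with a keyword rule, and the two tools return clearly labelled stub strings instead of calling a vector database or a finance API.

```python
from typing import Callable, Optional


def search_documents(query: str) -> str:
    """Stub for the RAG tool (a real one would query the vector database)."""
    return f"[stub document snippet for: {query}]"


def get_stock_price(query: str) -> str:
    """Stub for a custom finance tool (a real one would call an API)."""
    return f"[stub price lookup for: {query}]"


TOOLS: dict[str, Callable[[str], str]] = {
    "search_documents": search_documents,
    "get_stock_price": get_stock_price,
}


def scripted_policy(question: str, observations: list[str]) -> dict:
    """Stand-in for the LLM: returns a thought plus either an action or a final answer."""
    if not observations:
        tool = "get_stock_price" if "price" in question.lower() else "search_documents"
        return {
            "thought": f"I should call {tool}",
            "action": {"tool": tool, "input": question},
            "answer": None,
        }
    return {"thought": "I have what I need", "action": None, "answer": observations[-1]}


def run_agent(question: str, policy=scripted_policy, max_steps: int = 3) -> Optional[str]:
    """Alternate between reasoning (policy) and acting (tool call) until an answer appears."""
    observations: list[str] = []
    for _ in range(max_steps):
        step = policy(question, observations)
        if step["answer"] is not None:
            return step["answer"]
        action = step["action"]
        observations.append(TOOLS[action["tool"]](action["input"]))
    return None  # step limit reached without an answer


print(run_agent("What is the latest stock price of Indian Bank?"))
print(run_agent("What do the documents say about shareholding?"))
```

In a real agent the policy is the model itself, which reads tool descriptions and prior observations before deciding on the next step.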

# Part B: Using Web Search Tool for Real-Time Information

## Web Search Integration

### Why Web Search?
- **Real-Time Data**: Access current information not in documents
- **Broader Knowledge**: Search the entire web, not just ingested documents
- **Dynamic Updates**: Results reflect what is currently published, with no re-ingestion step

### Implementation
We'll use LlamaStack's built-in web search tool (`builtin::websearch`) which:
- Searches the web using search APIs (e.g., Tavily)
- Returns relevant web pages and snippets
- Integrates seamlessly with agents

### Setup Requirements
- **API Key**: Tavily search API key (set as environment variable)
- **Provider Data**: Pass API key to LlamaStackClient for web search access

In [None]:
# Get Tavily search API key from environment variable
# Tavily is a search API that provides web search capabilities
tavily_search_api_key = os.getenv('TAVILY_SEARCH_API_KEY')

# Configure provider data for web search
# If API key is available, pass it to enable web search functionality
if tavily_search_api_key is None:
    provider_data = None  # Web search will not be available
else:
    # provider_data: Configuration for external service providers
    # tavily_search_api_key: API key for Tavily search service
    provider_data = {"tavily_search_api_key": tavily_search_api_key}

# Reinitialize client with provider data for web search
# provider_data enables the client to use external services like Tavily
client = LlamaStackClient(
    base_url=LLAMASTACK_URL,
    provider_data=provider_data  # Enables web search if API key is provided
)

In [None]:
# Verify that the web search tool group (builtin::websearch) is registered in LlamaStack
client.toolgroups.list()

In [None]:
client.vector_dbs.list()

In [None]:
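agent = ReActAgent(
    client=client,
    model=model_id,
    tools=["builtin::websearch"],
    instructions="You are a helpful assistant. Answer strictly based on web search results.",
    response_format={
        "type": "json_schema",
        "json_schema": ReActOutput.model_json_schema(),
    },
    # Sampling settings go under "strategy", matching sampling_params defined earlier
    # (top_p=0.95 mirrors the default used there)
    sampling_params={
        "strategy": {"type": "top_p", "temperature": 0.2, "top_p": 0.95},
        "max_tokens": 4000,
    },
)

queries = ["Can you tell me the latest news about HDFC Bank?"]

session_id = agent.create_session("web-session")
for prompt in queries:
    rich.print(f"Processing user query: {prompt}")
    turn = agent.create_turn(
        messages=[{"role": "user", "content": prompt}],
        session_id=session_id,
        stream=False,
    )
    raw_content = turn.output_message.content

    # The ReAct output should be JSON with an "answer" field;
    # fall back to the raw text if parsing fails or the field is missing
    try:
        answer = json.loads(raw_content).get("answer") or raw_content
    except (json.JSONDecodeError, TypeError, AttributeError):
        answer = raw_content

    rich.print(answer)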