<a href="https://colab.research.google.com/github/vjagatha/multi-agent-course/blob/main/Module_1/Agentic_RAG/Agentic_RAG_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Dive Agentic Retrieval Augmented Generation

Agentic RAG is useful when the system must reason about which actions to take and in what order. Instead of calling an LLM directly, we use agents to carry out tasks that require planning, multi-step reasoning, tool use, and/or learning over time. Agents give us agency!

Agency: the ability to take action or to choose what action to take.

In the context of RAG, agents can add reasoning at three points: before a RAG pipeline is selected, within a pipeline during retrieval or reranking, and at the end when synthesizing the response. This improves RAG by automating the complex workflows and decisions that a non-trivial RAG use case requires.

### Purpose of this Agentic RAG
This notebook presents a practical implementation of Agentic Retrieval-Augmented Generation (RAG)—a system where decision-making and tool selection are delegated to an intelligent agent before executing a response. Rather than passing every query through a static RAG pipeline, this system introduces agency—the ability to choose the best course of action depending on the nature of the query.

At the heart of this implementation is a router prompt, which classifies user queries into one of three categories:

- OpenAI documentation: Queries related to tools, APIs, or usage guidelines for OpenAI models
- 10-K financial reports: Questions requiring retrieval from company filings or financial datasets
- Live Internet search: Broader, current, or comparative queries that need web access

Once the query is classified, the system invokes a corresponding route handler:

- For OpenAI and 10-K queries, it retrieves relevant context from a vector database (Qdrant) using text embeddings, then applies a RAG-based response generator.
- For Internet queries, it fetches real-time information using a web-access API (ARES).

This approach is an example of Agentic RAG, where reasoning precedes retrieval and generation. By plugging in agents before and within the RAG pipeline, we make the system smarter and more adaptive. This allows us to:

- Automatically choose the right retrieval method based on context
- Combine structured knowledge with real-time search
- Scale RAG beyond trivial use cases by integrating multi-step decision logic

Importantly, no external agentic frameworks are used—this is a ground-up implementation that demonstrates how to build a lightweight but intelligent agentic system using only a language model, prompt engineering, and retrieval tools.

## Setup and Dependencies

In [1]:
# Install the necessary libraries
!pip install openai
!pip install qdrant_client
!pip install transformers


Collecting qdrant_client
  Downloading qdrant_client-1.14.2-py3-none-any.whl.metadata (10 kB)
Collecting portalocker<3.0.0,>=2.7.0 (from qdrant_client)
  Downloading portalocker-2.10.1-py3-none-any.whl.metadata (8.5 kB)
Downloading qdrant_client-1.14.2-py3-none-any.whl (327 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 327.7/327.7 kB 4.9 MB/s eta 0:00:00
Downloading portalocker-2.10.1-py3-none-any.whl (18 kB)
Installing collected packages: portalocker, qdrant_client
Successfully installed portalocker-2.10.1 qdrant_client-1.14.2


In [2]:
# Import basic libraries
import requests             # Used for making HTTP requests (e.g., calling ARES API for live internet queries)
import json                 # For parsing and structuring JSON data (especially OpenAI and routing responses)

# Google Colab-specific (for securely handling API keys)
from google.colab import userdata  # To securely store and retrieve credentials in Colab

# OS operations
import os                   # Useful for accessing environment variables and managing paths

# OpenAI API client
from openai import OpenAI   # Official OpenAI client library to interface with GPT models for routing and generation

# Text processing
import re                   # Regular expressions for cleaning or preprocessing inputs (if needed)

# Optional visualization (for analysis/debugging purposes)
import matplotlib.pyplot as plt       # For displaying charts or visual debug outputs (e.g., embeddings visualizations)
import matplotlib.image as mpimg      # For loading/displaying images if needed (rare in RAG, but helpful in demos)

# Embedding models (used for text vectorization during retrieval)
from transformers import AutoTokenizer, AutoModel  # For loading custom transformer models if not using OpenAI embeddings

# Vector database client
from qdrant_client import QdrantClient   # Qdrant is used as the vector store to retrieve documents based on similarity


## 1. Defining the Internet Tool

First, we will define a tool function that enables our system to answer queries requiring real-time, internet-based information. Not all questions can be answered using static documents like OpenAI docs or financial filings—sometimes users ask about current trends, comparisons, or live updates.

To handle this, we introduce a live search capability using the **ARES API** by Traversaal.

### What is ARES API?  
ARES is a web-based tool that allows you to:

- Search the internet in real time.
- Get LLM-generated answers based on live search results.

This is particularly useful for questions about:

- Current events (e.g., *“Latest AI tools in 2025”*),
- Tech comparisons (e.g., *“Gemini vs GPT-4”*),
- General knowledge outside internal datasets.

You can generate an API key [here](https://api.traversaal.ai).


In [3]:
# Load the ARES API key from Colab secrets
ares_api_key=userdata.get('ARES_API_KEY')

In [4]:
import requests  # For sending HTTP POST requests to the ARES API

def get_internet_content(user_query: str, action: str):
    """
    Fetches a response from the internet using ARES-API based on the user's query.

    This function serves as the tool invoked when the router classifies a query
    as requiring real-time information beyond internal datasets—i.e., "INTERNET_QUERY".
    It sends the query to a live search API (ARES) and returns the result.

    Args:
        user_query (str): The user's question that needs a live answer.
        action (str): Route type (always expected to be "INTERNET_QUERY").

    Returns:
        str: Response text generated using internet search or an error message.
    """
    print("Getting your response from the internet 🌐 ...")

    # API endpoint for the ARES live search tool
    url = "https://api-ares.traversaal.ai/live/predict"

    # Payload structure expected by the ARES API
    payload = {"query": [user_query]}

    # Authentication and content headers for API access
    headers = {
        "x-api-key": ares_api_key,  # Your secret API key (should be securely loaded from environment)
        "content-type": "application/json"
    }

    try:
        # Send the query to the ARES API and check for success
        # (without a timeout, requests can wait indefinitely; 60 s is an arbitrary choice)
        response = requests.post(url, json=payload, headers=headers, timeout=60)
        response.raise_for_status()

        # Extract and return the main response text from the API's nested JSON
        return response.json().get('data', {}).get('response_text', "No response received.")

    # Handle HTTP-level errors (e.g., 400s or 500s)
    except requests.exceptions.HTTPError as http_err:
        return f"HTTP error occurred: {http_err}"

    # Handle general connection, timeout, or request formatting issues
    except requests.exceptions.RequestException as req_err:
        return f"Request error occurred: {req_err}"

    # Catch-all for any unexpected failure
    except Exception as err:
        return f"An unexpected error occurred: {err}"


In [5]:
print(get_internet_content("Tell me about best travel destinations in 2025?","INTERNET_QUERY")) #run internet function to test results

Getting your response from the internet 🌐 ...
Here are some of the best travel destinations for 2025:

1. **Budapest, Hungary** - Known for its rich history and stunning architecture, Budapest offers a vibrant cultural experience.
2. **Bukhara, Uzbekistan** - A city steeped in history, Bukhara is famous for its well-preserved medieval architecture and cultural heritage.
3. **Charleston, South Carolina, USA** - Renowned for its charming historic district, Charleston is a great destination for those interested in American history and Southern culture.
4. **Inverness and the Flow Country, Scotland** - This area is perfect for nature lovers, offering breathtaking landscapes and opportunities for outdoor activities.
5. **Seoul, South Korea** - A bustling metropolis that blends modernity with tradition, Seoul is a top choice for solo travelers.
6. **Kathmandu, Nepal** - Known for its rich culture and proximity to the Himalayas, Kathmandu is a must-visit for adventure seekers.
7. **Cusco, Per

## 2. Router Query Function — Giving the Agent Its Brain

In this step, we will define the router function, which plays a critical role in our Agentic RAG system.

### What is a Router?

A router is like the decision-making brain of our assistant.

Before trying to answer a user's question, the system first needs to figure out:

> “Where should I go to find the right answer?”

To make this decision, we use the OpenAI GPT model. We provide it with a detailed system prompt that explains how to classify the user's question into one of these categories:

- **OPENAI_QUERY** → Questions about OpenAI tools, APIs, models, or documentation.
- **10K_DOCUMENT_QUERY** → Questions about companies, financial filings, or analysis based on 10-K reports.
- **INTERNET_QUERY** → Anything else that likely requires real-time or general web information.

### What does the function do?

- Sends the user's question to the OpenAI API.
- Receives a JSON response containing:
  - `action`: The category the query belongs to.
  - `reason`: A short explanation for the decision.
  - `answer`: (Optional) A quick response if it’s simple enough (left blank for internet queries).
- Parses the response and returns it as a Python dictionary.
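
The parsing step above can be sketched in isolation, without calling the API. This is a minimal illustration and not the notebook's own implementation: `parse_router_reply` and `VALID_ACTIONS` are hypothetical names, and the choice to fall back to `INTERNET_QUERY` whenever the reply is unusable is an assumption.

```python
import json
import re

# Route labels taken from the three categories described above.
VALID_ACTIONS = {"OPENAI_QUERY", "10K_DOCUMENT_QUERY", "INTERNET_QUERY"}


def parse_router_reply(raw_reply: str) -> dict:
    """Hypothetical helper: extract and validate the router's JSON reply."""
    fallback = {"action": "INTERNET_QUERY", "reason": "", "answer": ""}

    # The model may wrap the JSON in extra prose, so grab the outermost braces.
    match = re.search(r"\{.*\}", raw_reply, re.DOTALL)
    if match is None:
        return {**fallback, "reason": "No JSON object found in reply"}

    try:
        parsed = json.loads(match.group())
    except json.JSONDecodeError as err:
        return {**fallback, "reason": f"JSON parsing error: {err}"}

    # Reject anything that is not one of the three known route labels.
    action = parsed.get("action") if isinstance(parsed, dict) else None
    if action not in VALID_ACTIONS:
        return {**fallback, "reason": f"Unknown action: {action!r}"}

    # Fill in the optional fields so callers can rely on all three keys.
    return {
        "action": action,
        "reason": parsed.get("reason", ""),
        "answer": parsed.get("answer", ""),
    }


reply = 'Sure! {"action": "OPENAI_QUERY", "reason": "API question", "answer": "Use the API"}'
print(parse_router_reply(reply))
```

Validating the label here means an unexpected value never reaches the route handlers; whether an unusable reply should fall back to the internet route or raise an error is a design choice.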

### Why is this important?

This router gives the system agency—the ability to decide which knowledge source to use. It’s what makes this pipeline agentic, not just static.

Without the router, every query would follow the same path. With it, we can:

- Dynamically switch between tools and data sources.
- Handle different types of user questions intelligently.
- Avoid wasting resources on unnecessary steps.


In [7]:
# Securely retrieve the OpenAI API key from Colab's user data store
# This avoids hardcoding sensitive credentials directly in the notebook
openai_api_key = userdata.get('OPENAI_API_KEY')

# Initialize the OpenAI client with the retrieved API key
# This client will be used for:
# - Query classification via the router prompt
# - Potentially generating responses from retrieved context
openaiclient = OpenAI(api_key=openai_api_key)


In [8]:
from openai import OpenAIError

def route_query(user_query: str):
    router_system_prompt = f"""
    As a professional query router, your objective is to correctly classify user input into one of three categories based on the source most relevant for answering the query:
    1. "OPENAI_QUERY": If the user's query appears to be answerable using information from OpenAI's official documentation, tools, models, APIs, or services (e.g., GPT, ChatGPT, embeddings, moderation API, usage guidelines).
    2. "10K_DOCUMENT_QUERY": If the user's query pertains to a collection of documents from the 10k annual reports, datasets, or other structured documents, typically for research, analysis, or financial content.
    3. "INTERNET_QUERY": If the query is neither related to OpenAI nor the 10k documents specifically, or if the information might require a broader search (e.g., news, trends, tools outside these platforms), route it here.

    Your decision should be made by assessing the domain of the query.

    Always respond in this valid JSON format:
    {{
        "action": "OPENAI_QUERY" or "10K_DOCUMENT_QUERY" or "INTERNET_QUERY",
        "reason": "brief justification",
        "answer": "AT MAX 5 words answer. Leave empty if INTERNET_QUERY"
    }}

    EXAMPLES:

    - User: "How to fine-tune GPT-3?"
    Response:
    {{
        "action": "OPENAI_QUERY",
        "reason": "Fine-tuning is OpenAI-specific",
        "answer": "Use fine-tuning API"
    }}

    - User: "Where can I find the latest financial reports for the last 10 years?"
    Response:
    {{
        "action": "10K_DOCUMENT_QUERY",
        "reason": "Query related to annual reports",
        "answer": "Access through document database"
    }}

    - User: "Top leadership styles in 2024"
    Response:
    {{
        "action": "INTERNET_QUERY",
        "reason": "Needs current leadership trends",
        "answer": ""
    }}

    - User: "What's the difference between ChatGPT and Claude?"
    Response:
    {{
        "action": "INTERNET_QUERY",
        "reason": "Cross-comparison of different providers",
        "answer": ""
    }}

    Strictly follow this format for every query, and never deviate.
    User: {user_query}
    """

    try:
        # Query the gpt-4o model with the router prompt and user input
        response = openaiclient.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": router_system_prompt}]
        )

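        # Extract and parse the model's JSON response
        task_response = response.choices[0].message.content
        json_match = re.search(r"\{.*\}", task_response, re.DOTALL)
        if json_match is None:
            # No JSON object in the reply; fall back to the internet route
            return {
                "action": "INTERNET_QUERY",
                "reason": "Router reply contained no JSON object",
                "answer": ""
            }
        parsed_response = json.loads(json_match.group())
        return parsed_response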

    # Handle OpenAI API errors (e.g., rate limits, authentication)
    except OpenAIError as api_err:
        return {
            "action": "INTERNET_QUERY",
            "reason": f"OpenAI API error: {api_err}",
            "answer": ""
        }

    # Handle case where model response isn't valid JSON
    except json.JSONDecodeError as json_err:
        return {
            "action": "INTERNET_QUERY",
            "reason": f"JSON parsing error: {json_err}",
            "answer": ""
        }

    # Catch-all for any other unforeseen issues
    except Exception as err:
        return {
            "action": "INTERNET_QUERY",
            "reason": f"Unexpected error: {err}",
            "answer": ""
        }

In [9]:
route_query("what is the revenue of uber in 2021?")


{'action': '10K_DOCUMENT_QUERY',
 'reason': "Query related to company's financials",
 'answer': "Check Uber's 10K report"}

## 3. Setting Up Qdrant Vector Database for Agentic RAG
In this step, we are connecting our agent to a pre-built vector database using Qdrant—a tool used to store and search document embeddings (numerical representations of text).

### What Are We Doing?
We are loading an existing Qdrant database that was downloaded from a GitHub repository. This database already contains:

- Vectorized OpenAI documentation
- Vectorized 10-K financial filings

By loading this saved data:

- We save time (no need to re-embed the documents)
- We enable fast similarity search to retrieve relevant text chunks

This setup allows our system to perform semantic search, meaning it can understand the meaning of the user query and match it with the most relevant pieces of information stored in the database.
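
To make the idea of similarity search concrete, here is a toy sketch in plain Python. The three-dimensional vectors and chunk texts are made up for illustration, and cosine similarity is an assumption: the distance metric configured for the prebuilt Qdrant collections is not stated in this notebook. The names `cosine_similarity` and `top_k` are hypothetical helpers, not part of the Qdrant client.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_chunks, k=3):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "index": (chunk text, made-up embedding). Real embeddings have hundreds of dimensions.
toy_index = [
    ("chunk about ride-hailing revenue", [0.9, 0.1, 0.0]),
    ("chunk about chat completion messages", [0.1, 0.9, 0.1]),
    ("chunk about annual financial results", [0.8, 0.2, 0.1]),
]

for score, text in top_k([1.0, 0.0, 0.0], toy_index, k=2):
    print(f"{score:.3f}  {text}")
```

A vector database performs the same ranking, but uses an index so it does not have to compare the query against every stored vector.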


### Why This Matters in Agentic RAG
Once the router decides that the query should go to the OpenAI docs or the 10-K reports, our system uses Qdrant to:

- Search for the most relevant pieces of text
- Pass those to the model to generate a grounded answer

So, this step is essential to support retrieval-augmented generation (RAG) within our agentic flow.

### Data Sources

**10K Database: Lyft 2024 & Uber 2021 SEC filings**

**OpenAI Docs: Official OpenAI documentation**

For lecture demo purposes, the vector database has already been created and hosted on GitHub, and we clone it here. The notebook and data needed to create your own embeddings will also be shared on GitHub.

In [10]:
# Clone the project repository that contains prebuilt vector data (e.g., Qdrant collections)
# This includes document embeddings and configurations needed for retrieval (10-K, OpenAI docs)
!git clone https://github.com/hamzafarooq/multi-agent-course.git


Cloning into 'multi-agent-course'...
remote: Enumerating objects: 203, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 203 (delta 6), reused 3 (delta 3), pack-reused 191 (from 1)
Receiving objects: 100% (203/203), 5.21 MiB | 11.16 MiB/s, done.
Resolving deltas: 100% (51/51), done.


In [11]:
# 🗄️ Initializing Qdrant client with local path to vector database
# The path points to prebuilt Qdrant collections (10-K and OpenAI docs) cloned from the repository
# This enables fast, local retrieval of relevant document chunks based on semantic similarity
client = QdrantClient(path="/content/multi-agent-course/Module_1/Agentic_RAG/qdrant_data")


## 4. Building the Retriever and RAG for Vector Databases
In this section, we build the core logic that allows our agent to find relevant documents and generate grounded answers using them.

### Step 1: Import the Embedding Model
We start by importing the nomic-ai/nomic-embed-text-v1.5 model from Hugging Face. This model is used to convert any text (such as a user query) into a dense vector, known as an embedding. These embeddings capture the semantic meaning of text, allowing us to later compare and retrieve similar documents.


In [12]:
# Load the tokenizer and embedding model from Hugging Face
# This model converts raw text into dense vector representations (embeddings)
# Used for similarity search in Qdrant during document retrieval
text_tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
text_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

def get_text_embeddings(text):
    """
    Converts input text into a dense embedding using the Nomic embedding model.
    These embeddings are used to query Qdrant for semantically relevant document chunks.

    Args:
        text (str): The input text or query from the user.

    Returns:
        np.ndarray: A fixed-size vector representing the semantic meaning of the input.
    """
    # Tokenize and prepare input for the model
    inputs = text_tokenizer(text, return_tensors="pt", padding=True, truncation=True)

    # Forward pass to get model outputs
    outputs = text_model(**inputs)

    # Take the mean across all token embeddings to get a single vector (pooled representation)
    embeddings = outputs.last_hidden_state.mean(dim=1)

    # Convert to NumPy array and detach from computation graph
    return embeddings[0].detach().numpy()

# Example usage: Generate and preview the embedding of a test sentence
text = "This is a test sentence."
embeddings = get_text_embeddings(text)
print(embeddings[:5])  # Print first 5 dimensions for inspection


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


A new version of the following files was downloaded from https://huggingface.co/nomic-ai/nomic-bert-2048:
- configuration_hf_nomic_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/nomic-ai/nomic-bert-2048:
- modeling_hf_nomic_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`



[ 1.2799692   0.40158355 -3.5162656  -0.3981321   1.5919138 ]


### Step 2: Define the Embedding Function
The cell above also defines the function get_text_embeddings(), which:

- Tokenizes the input text
- Runs it through the model
- Computes the average of all token embeddings
- Returns a single vector that represents the full sentence

This vector will be used to query Qdrant to find the most relevant document chunks based on similarity.
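
The averaging step can be illustrated with plain Python lists instead of tensors. This is a sketch under assumptions: `mean_pool` is a hypothetical helper, and the optional `attention_mask` argument is an addition that skips padding positions. The notebook's function averages over every position, which gives the same result for a single input because no padding is added in that case.

```python
def mean_pool(token_vectors, attention_mask=None):
    """Average token vectors into one vector, skipping masked (padding) positions."""
    if attention_mask is None:
        attention_mask = [1] * len(token_vectors)  # Keep every token by default

    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("No tokens to pool")

    dim = len(kept[0])
    # Average each dimension across the kept tokens.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three token vectors; the last one stands in for a padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
print(mean_pool(tokens, attention_mask=[1, 1, 0]))  # → [2.0, 3.0]
```

When several texts of different lengths are embedded in one batch, padding tokens would pull a plain mean toward zero, which is why masked pooling is commonly used in that setting.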

In [13]:
def rag_formatted_response(user_query: str, context: list):
    """
    Generate a response to the user query using the provided context,
    with article references formatted as [1][2], etc.

    This function performs the final step in the RAG pipeline—synthesizing an answer
    from retrieved document chunks (context). It prompts the model to generate a
    grounded response, explicitly citing sources using a reference format.

    Args:
        user_query (str): The user's original question.
        context (list): List of text chunks retrieved from Qdrant (10-K or OpenAI docs).

    Returns:
        str: A generated response grounded in the retrieved context, with numbered citations.
    """

    # Construct a RAG prompt that includes both:
    # 1. The user's query
    # 2. The supporting context documents
    # The prompt instructs the model to answer using only the provided context,
    # and to include citations like [1], [2], etc. based on chunk IDs or order.
    rag_prompt = f"""
       Based on the given context, answer the user query: {user_query}\nContext:\n{context}
       and employ references to the ID of articles provided [ID], ensuring their relevance to the query.
       The referencing should always be in the format of [1][2]... etc.
    """

    # Call GPT-4o to generate the response using the RAG-style prompt
    response = openaiclient.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": rag_prompt},
        ]
    )

    # Return the model's generated answer
    return response.choices[0].message.content


### Step 3: Define the RAG Response Generator
After retrieving relevant text chunks from Qdrant, we use the rag_formatted_response() function to generate a final answer. This function:

- Takes the user query and the retrieved document chunks
- Builds a prompt that asks the language model (GPT-4o) to answer the question using only the provided context
- Instructs the model to include references like [1], [2] for traceability

This ensures the output is not only informative but also grounded in actual retrieved data.

Together, these two functions lay the foundation for combining retrieval (from vector DB) and generation (from LLM) — the two pillars of a RAG system.
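
One way to make the [1], [2] citations traceable is to number the chunks explicitly before placing them in the prompt. The sketch below is an illustration, not the notebook's implementation: `format_numbered_context` and `build_rag_prompt` are hypothetical helpers, the prompt wording is an assumption, and the chunk texts are placeholders.

```python
def format_numbered_context(chunks):
    """Label each retrieved chunk [1], [2], ... in retrieval order."""
    return "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )


def build_rag_prompt(user_query, chunks):
    """Assemble a prompt that ties citation IDs to specific chunks."""
    return (
        "Answer the user query using only the context below.\n"
        "Cite the supporting chunks by their bracketed IDs, e.g. [1][2].\n\n"
        f"User query: {user_query}\n\n"
        f"Context:\n{format_numbered_context(chunks)}"
    )


# Placeholder chunks standing in for text retrieved from Qdrant.
example_chunks = ["first retrieved chunk", "second retrieved chunk"]
print(build_rag_prompt("what was uber revenue in 2021?", example_chunks))
```

With explicit IDs in the context, a citation such as [2] in the answer can be mapped back to the second retrieved chunk.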



In [14]:
def retrieve_and_response(user_query: str, action: str):
    """
    Retrieves relevant text chunks from the appropriate Qdrant collection
    based on the query type, then generates a response using RAG.

    This function powers the retrieval and response generation pipeline
    for queries that are classified as either OPENAI-related or 10-K related.
    It uses semantic search to fetch relevant context from a Qdrant vector store
    and then generates a response using that context via a RAG prompt.

    Args:
        user_query (str): The user's input question.
        action (str): The classification label from the router (e.g., "OPENAI_QUERY", "10K_DOCUMENT_QUERY").

    Returns:
        str: A model-generated response grounded in retrieved documents, or an error message.
    """

    # Define mapping of routing labels to their respective Qdrant collections
    collections = {
        "OPENAI_QUERY": "opnai_data",           # Collection of OpenAI documentation embeddings
        "10K_DOCUMENT_QUERY": "10k_data"        # Collection of 10-K financial document embeddings
    }

    try:
        # Ensure that the provided action is valid
        if action not in collections:
            return "Invalid action type for retrieval."

        # Step 1: Convert the user query into a dense vector (embedding)
        try:
            query = get_text_embeddings(user_query)
        except Exception as embed_err:
            return f"Embedding error: {embed_err}"  # Fail early if embedding fails

        # Step 2: Retrieve top-matching chunks from the relevant Qdrant collection
        try:
            text_hits = client.query_points(
                collection_name=collections[action],  # Choose the right collection based on routing
                query=query,                          # The embedding of the user's query
                limit=3                               # Fetch top 3 relevant chunks
            ).points
        except Exception as qdrant_err:
            return f"Vector DB query error: {qdrant_err}"  # Handle Qdrant access issues

        # Extract the raw content from the retrieved vector hits
        contents = [point.payload['content'] for point in text_hits]

        # If no relevant content is found, return early
        if not contents:
            return "No relevant content found in the database."

        # Step 3: Pass the retrieved context to the RAG model to generate a response
        try:
            response = rag_formatted_response(user_query, contents)
            return response
        except Exception as rag_err:
            return f"RAG response error: {rag_err}"  # Handle generation failures

    # Catch any unforeseen errors in the overall process
    except Exception as err:
        return f"Unexpected error: {err}"


## 5. Putting It All Together: Running the Agentic RAG
In this final step, we combine everything into a single function that controls the entire Agentic RAG workflow. The agentic_rag() function acts as the main orchestrator of the system.

Here’s what it does:

- Prints the user's query for reference.
- Uses the router function (powered by GPT) to decide which type of data source to use:
  - OpenAI documentation
  - 10-K financial reports
  - Internet search
- Calls the correct function based on the route:
  - If it’s an OpenAI or 10-K query, it retrieves data from Qdrant and generates a RAG response.
  - If it’s an Internet query, it uses the ARES API to fetch live information.
- Displays the final response, neatly formatted in the console.

This step brings the agentic loop full circle—from understanding the question, reasoning about where to search, to finally responding with the best possible answer.

In [15]:
# Dictionary that maps the route labels (decided by the router) to their respective functions
# Each type of query is handled differently:
# - OPENAI_QUERY and 10K_DOCUMENT_QUERY use document retrieval + RAG
# - INTERNET_QUERY uses a web search API
routes = {
    "OPENAI_QUERY": retrieve_and_response,
    "10K_DOCUMENT_QUERY": retrieve_and_response,
    "INTERNET_QUERY": get_internet_content,
}

def agentic_rag(user_query: str):
    """
    Main function that runs the full Agentic RAG system.

    This function takes a user's question, decides what type of query it is (OpenAI-related,
    financial document-related, or general internet), and then calls the right function
    to handle it. Finally, it prints out the full conversation and response.

    Args:
        user_query (str): The user's input question.

    Returns:
        None (It just prints the result nicely to the console)
    """

    #  Terminal color codes to make the printed output easier to read and visually structured
    CYAN = "\033[96m"
    GREY = "\033[90m"
    BOLD = "\033[1m"
    RESET = "\033[0m"

    try:
        # Step 1: Print the user's original question to the console
        print(f"{BOLD}{CYAN}👤 User Query:{RESET} {user_query}\n")

        # Step 2: Use the router (powered by GPT) to decide which route the query belongs to
        try:
            response = route_query(user_query)
        except Exception as route_err:
            # If something goes wrong while classifying the query, show an error message
            print(f"{BOLD}{CYAN}🤖 BOT RESPONSE:{RESET}\n")
            print(f"Routing error: {route_err}\n")
            return

        # Extract the routing decision and the reason behind it
        action = response.get("action")  # e.g., "OPENAI_QUERY"
        reason = response.get("reason")  # e.g., "Related to OpenAI tools"

        # Step 3: Show the selected route and why it was chosen
        print(f"{GREY}📍 Selected Route: {action}")
        print(f"📝 Reason: {reason}")
        print(f"⚙️ Processing query...{RESET}\n")

        # Step 4: Call the correct function depending on the route (retrieval or web search)
        try:
            route_function = routes.get(action)  # Find the function to use for this route
            if route_function:
                result = route_function(user_query, action)  # Run the function with the user's input
            else:
                result = f"Unsupported action: {action}"  # Catch unknown routing types
        except Exception as exec_err:
            result = f"Execution error: {exec_err}"  # Handle failure in the chosen route function

        # Step 5: Print the final response to the user
        print(f"{BOLD}{CYAN}🤖 BOT RESPONSE:{RESET}\n")
        print(f"{result}\n")

    except Exception as err:
        # Catch-all for any unexpected errors in the overall logic
        print(f"{BOLD}{CYAN}🤖 BOT RESPONSE:{RESET}\n")
        print(f"Unexpected error occurred: {err}\n")


In [16]:

agentic_rag("what was uber revenue in 2021?")

👤 User Query: what was uber revenue in 2021?

📍 Selected Route: 10K_DOCUMENT_QUERY
📝 Reason: Revenue figures are in annual reports
⚙️ Processing query...

🤖 BOT RESPONSE:

Uber's revenue in 2021 was $17,455 million [1].



In [17]:
agentic_rag("what was lyft revenue in 2024?")

👤 User Query: what was lyft revenue in 2024?

📍 Selected Route: INTERNET_QUERY
📝 Reason: Requires most recent financial data
⚙️ Processing query...

Getting your response from the internet 🌐 ...
🤖 BOT RESPONSE:

Lyft's revenue in 2024 was approximately $5.79 billion, which represents a 31.39% increase from 2023.



In [18]:
agentic_rag("List me down new LLMs in 2025")

👤 User Query: List me down new LLMs in 2025

📍 Selected Route: INTERNET_QUERY
📝 Reason: Query about future LLMs
⚙️ Processing query...

Getting your response from the internet 🌐 ...
🤖 BOT RESPONSE:

### New LLMs Released in 2025

1. **GPT-4o** - OpenAI's latest iteration in the GPT series.
2. **Claude 3.5** - Developed by Anthropic, this model continues the Claude series.
3. **Gemini 2.0 Flash** - Released by Google DeepMind, this model is part of the Gemini series.
4. **Sora** - Another model from OpenAI, released shortly after Gemini.
5. **Nova** - Amazon's new entry into the LLM space.
6. **Claude 3.5 Sonnet** - An advanced version of Claude 3.5.
7. **LLaMA 3** - Meta's latest open-source model.
8. **Google Gemma 2** - A continuation of Google's efforts in LLMs.
9. **Command R+** - A new model focusing on command-based interactions.
10. **Mistral-8x22b** - A high-performance model from Mistral.
11. **Falcon 2** - An updated version of the Falcon mo

In [19]:
agentic_rag("how to work with chat completions?")

👤 User Query: how to work with chat completions?

📍 Selected Route: OPENAI_QUERY
📝 Reason: Chat completions is OpenAI-specific
⚙️ Processing query...

🤖 BOT RESPONSE:

To work with chat completions, you can accomplish different tasks through OpenAI API. 

To retrieve chat messages, you can use the chat completion ID through the endpoint https://api.openai.com/v1/chat/completions/chat_abc123/messages. The chat completion ID is required while the 'after' parameter (identifier for the last message from the previous pagination request) and limit (number of messages to retrieve) are optional. You would need to state a limit, the default is 20, and ordering sequence, like ascending or descending order [1].

To retrieve a chat completion, you would need the chat completion ID and would need to follow the endpoint https://api.openai.com/v1/chat/completions/chatcmpl-abc123. The response would be a ChatCompletion object matching the given ID [1].

To list store

# Assignment: Implement sub-query division

Our current agentic retrieval-augmented generation (RAG) setup has a key limitation: when a user submits several distinct questions as a single input, the system treats them as one search. The router picks a single route, and retrieval runs only once against the chosen vector database or external tool, which often leads to incomplete or less relevant results.

To address this, the assignment introduces a new capability called sub-query division: breaking a compound query into multiple focused sub-queries. Each sub-query is routed and processed independently, so the system can retrieve context specific to that question. The agent then combines the individual answers into a more complete and relevant response.
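The flow described above can be sketched as a small fan-out helper. This is only an illustration under stated assumptions: `answer_compound_query`, `split_fn`, and `answer_fn` are hypothetical names that do not appear elsewhere in this notebook. `split_fn` stands for any function that turns a query into a list of sub-questions, and `answer_fn` stands for a variant of the route handler that returns its answer as a string instead of printing it.

```python
from typing import Callable, List


def answer_compound_query(
    user_query: str,
    split_fn: Callable[[str], List[str]],
    answer_fn: Callable[[str], str],
) -> str:
    """Split a compound query, answer each part on its own, and join the answers."""
    # Fall back to the original query if the splitter returns nothing.
    sub_questions = split_fn(user_query) or [user_query]

    sections = []
    for question in sub_questions:
        try:
            answer = answer_fn(question)
        except Exception as err:
            # One failing sub-query should not discard the other answers.
            answer = f"Could not answer: {err}"
        sections.append(f"Q: {question}\nA: {answer}")

    return "\n\n".join(sections)
```

Passing the splitter and the answerer in as arguments keeps the sketch independent of any particular model or vector store, so it can be tried with simple stub functions before wiring in real calls.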

In [36]:
# Reference code for sub-query division (for guidance only)

def sub_queries(user_query):
  sub_queries_prompt= f'''
    You are a query decomposer. If the input contains a single question, return it unchanged as the only sub-question.
    Otherwise, break it up into multiple distinct sub-questions. In both cases, return the result as a JSON object
    like
        {{
            "subQuestions": ["..."]
        }}

    Here are a few examples of questions that might come your way. Model them based on this.
    Query: "What is the value of X and Y?"
    Output:
    {{
        "subQuestions": ["What is the value of X?",
                        "What is the value of Y?"
                        ]
    }}

    Query: "What are the margins of Tesla and Marvell in 2023?"
    Output:
    {{
        "subQuestions": ["What are the margins of Tesla in 2023?",
                        "What are the margins of Marvell in 2023?"
                        ]
    }}

    Query: "What is the best hotel in Cancun and the worst hotel in Cabo?"
    Output:
    {{
        "subQuestions": ["What is the best hotel in Cancun?",
                        "What is the worst hotel in Cabo?"
                        ]
    }}

    Query: "What is the National bird of Angola and Fiji respectively"
    Output:
    {{
        "subQuestions": ["What is the National bird of Angola?",
                        "What is the National bird of Fiji?"
                        ]
    }}

    Query: "What happens when you add one more to two, three and four?"
    Output:
    {{
        "subQuestions": ["What happens when you add one more to two?",
                        "What happens when you add one more to three?",
                        "What happens when you add one more to four?"
                        ]
    }}

    Query: "{user_query}"
    Output:
    '''
  response = openaiclient.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": sub_queries_prompt},
        ]
    )
  return response.choices[0].message.content


In [40]:
print(sub_queries("what is revenue of lyft when you add $1000 to the revenues in 2023 and 2024. Also, what is Lyft's brand"))

{
    "subQuestions": ["What is the revenue of Lyft when you add $1000 to the revenues in 2023?",
                     "What is the revenue of Lyft when you add $1000 to the revenues in 2024?",
                     "What is Lyft's brand?"
                     ]
}
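
The function above returns the model's reply as raw text, so it has to be turned into a Python list before each sub-question can be processed. A minimal sketch of that step follows. The helper name `parse_sub_questions` is hypothetical, and the handling of a Markdown-fenced reply is an assumption about how a model may format its output, not something shown in this notebook.

```python
import json
import re
from typing import List


def parse_sub_questions(raw: str) -> List[str]:
    """Extract the "subQuestions" list from the model's raw text reply."""
    # The reply may be wrapped in extra text or a Markdown fence,
    # so keep only the span from the first "{" to the last "}".
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return []

    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []

    questions = payload.get("subQuestions", [])
    if not isinstance(questions, list):
        return []

    # Drop anything that is not a non-empty string.
    return [q.strip() for q in questions if isinstance(q, str) and q.strip()]
```

Returning an empty list on malformed output lets the caller decide what to do next, for example falling back to the original, undivided query.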
