# Agentic RAG with Tool Use

Tool use allows for greater flexibility in accessing and utilizing data sources, thus unlocking new use cases not possible with a standard RAG approach.

This matters even more in an enterprise setting, where data sources are diverse and come in heterogeneous formats (structured, semi-structured, and unstructured).

In this notebook, we'll look at how we can implement an agentic RAG system using a tool use approach. We'll do this by building a Weights & Biases assistant. The assistant can search for information about how to use the product, retrieve information from the internet, search code examples, and even perform data analysis.

Concretely, we'll cover the following use cases:
1. Tool routing
2. Parallel tool use
3. Multi-step tool use
4. Self-correction
5. Structured queries
6. Structured data queries

We'll give the assistant access to the following tools:
- `search_developer_docs`: Searches the Weights & Biases developer documentation
- `search_internet`: Searches the internet for general queries
- `search_code_examples`: Searches code examples and tutorials on using Weights & Biases
- `analyze_evaluation_results`: Analyzes a table containing results from evaluating an LLM application

Note that for simplicity, we are not implementing a full-fledged search. Instead, we'll use mock datasets containing small, pre-defined data for each tool.

# Setup

To get started, first we need to install the `cohere` library and create a Cohere client.

In [None]:
# ! pip install -U cohere pandas -q

In [None]:
import json
import os
import cohere

from tool_def_v2 import (
    search_developer_docs,
    search_internet,
    search_code_examples,
    analyze_evaluation_results,
    search_tools,
    analysis_tool
)

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"]) # Get your free API key: https://dashboard.cohere.com/api-keys

# Setting up the tools

In an agentic RAG system, each data source is represented as a "tool". Broadly, a tool is any function or service that can receive objects from the LLM and send objects back to it. In the case of RAG, a tool is more specific: it takes a query as input and returns a set of documents.

Here, we define a Python function for each tool.

Note: refer to the `tool_def_v2.py` file for the implementation of each tool.
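
To give a sense of what "takes a query, returns documents" can look like for a mock data source, here is a minimal sketch. It is not the code in `tool_def_v2.py`: the function name `mock_search_docs`, the `top_k` parameter, the keyword-overlap ranking, and the `documents` key are all assumptions made for illustration.

```python
import re

# Hypothetical mock corpus; the notebook's real mock data lives in tool_def_v2.py.
MOCK_DOCS = [
    {"text": "What is a W&B Run?\nA W&B run is a single unit of computation logged by W&B."},
    {"text": "How do I view a specific run?\nTo view a run, navigate to the W&B App UI."},
    {"text": "What is W&B Sweeps?\nSweeps automates hyperparameter search."},
]


def _tokens(text: str) -> set:
    """Lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def mock_search_docs(query: str, top_k: int = 2) -> dict:
    """Rank the mock documents by word overlap with the query (no real retrieval)."""
    query_terms = _tokens(query)
    ranked = sorted(
        MOCK_DOCS,
        key=lambda doc: len(query_terms & _tokens(doc["text"])),
        reverse=True,
    )
    return {"documents": ranked[:top_k]}


result = mock_search_docs("view a specific run")
print(result["documents"][0]["text"])
```

A real tool would replace the overlap ranking with a call to a search index or vector store, while keeping the same query-in, documents-out contract.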

These functions are mapped to a dictionary called `functions_map` for easy access.


In [2]:
functions_map = {
    "search_developer_docs": search_developer_docs,
    "search_internet": search_internet,
    "search_code_examples": search_code_examples,
    "analyze_evaluation_results": analyze_evaluation_results
}
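
Besides the functions themselves, the model needs a schema describing each tool. In this notebook those schemas (`search_tools` and `analysis_tool`) are imported from `tool_def_v2.py` and are not shown. The sketch below illustrates what one entry might look like in the JSON-schema style function format accepted by the Cohere v2 chat API. The parameter names come from the tool calls recorded later in this notebook; the descriptions, the `required` list, and the helper `missing_tool_functions` are assumptions.

```python
# Hypothetical schema for one tool; the real definitions live in tool_def_v2.py.
search_code_examples_tool = {
    "type": "function",
    "function": {
        "name": "search_code_examples",
        "description": "Searches code examples and tutorials on using Weights & Biases.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query."},
                "file_type": {"type": "string", "description": "File type to filter by, e.g. 'ipynb'."},
                "language": {"type": "string", "description": "Language of the material, e.g. 'en'."},
            },
            "required": ["query"],  # assumption: only the query is mandatory
        },
    },
}


def missing_tool_functions(tools: list, functions_map: dict) -> list:
    """Return the names of tools that have a schema but no Python function to run."""
    return [
        tool["function"]["name"]
        for tool in tools
        if tool["function"]["name"] not in functions_map
    ]


# Usage: every schema name should have a matching entry in the function map.
print(missing_tool_functions([search_code_examples_tool], {"search_code_examples": print}))
```

A check like this catches a mismatch between a schema name and the function map before the model ever requests the tool.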

# Running an agentic RAG workflow

We can now run an agentic RAG workflow using a tool use approach. We can think of the system as consisting of four components:
- The user
- The application
- The LLM
- The tools

At its most basic, these four components interact in a workflow through four steps:
- **Step 1: Get user message** – The LLM gets the user message (via the application)
- **Step 2: Tool planning and calling** – The LLM decides which tools to call (if any) and generates the tool calls
- **Step 3: Tool execution** – The application executes the tools and sends the results to the LLM
- **Step 4: Response and citation generation** – The LLM generates the response and citations, which are sent back to the user

We wrap all these steps in a function called `run_agent`.

In [None]:
system_message="""## Task and Context
You are an assistant who helps developers use Weights & Biases. The company is also referred to as Wandb or W&B for short. You are equipped with a number of tools that can provide different types of information. If you can't find the information you need from one tool, you should try other tools if there is a possibility that they could provide the information you need. Use the internet to search for information not available in the sources provided by Weights & Biases"""

In [37]:
model = "command-r-08-2024"

def run_agent(query, tools, messages=None):
    if messages is None:
        messages = []
        
    if "system" not in {m.get("role") for m in messages}:
        # Keep the system message first, even when prior messages are passed in
        messages.insert(0, {"role": "system", "content": system_message})
    
    # Step 1: get user message
    print(f"Question:\n{query}")
    print("="*50)
    
    messages.append({"role": "user", "content": query})

    # Step 2: Generate tool calls (if any)
    response = co.chat(
        model=model,
        messages=messages,
        tools=tools,
        temperature=0.1
    )

    while response.message.tool_calls:
        
        print("Tool plan:")
        print(response.message.tool_plan,"\n")
        print("Tool calls:")
        for tc in response.message.tool_calls:
            if tc.function.name == "analyze_evaluation_results":
                print(f"Tool name: {tc.function.name}")
                # Indent the generated code by two spaces, then print it once
                tool_call_prettified = "\n".join(f"  {line}" for line in json.loads(tc.function.arguments)["code"].splitlines())
                print(tool_call_prettified)
            else:
                print(f"Tool name: {tc.function.name} | Parameters: {tc.function.arguments}")
        print("="*50)

        messages.append({"role": "assistant", "tool_calls": response.message.tool_calls, "tool_plan": response.message.tool_plan})        
        
        # Step 3: Get tool results
        for tc in response.message.tool_calls:
            tool_result = functions_map[tc.function.name](**json.loads(tc.function.arguments))
            # Build a fresh content list per call so each tool message carries only its own result
            tool_content = [json.dumps(tool_result)]
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": tool_content})
        
        # Step 4: Generate response and citations 
        response = co.chat(
            model=model,
            messages=messages,
            tools=tools,
            temperature=0.1
        )
    
    messages.append({"role": "assistant", "content": response.message.content[0].text})
        
    # Print final response
    print("Response:")
    print(response.message.content[0].text)
    print("="*50)
    
    # Print citations (if any)
    if response.message.citations:
        print("\nCITATIONS:")
        for citation in response.message.citations:
            print(citation, "\n")
    
    return messages
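
`run_agent` prints each citation as-is. Since every citation carries `start` and `end` character offsets into the response text, an application can also use them to mark the cited spans inline. The following is a minimal sketch of that idea. It assumes citations have been converted to plain dictionaries with `start` and `end` keys (the SDK returns objects with attributes instead), and the function name `annotate_with_citations` is hypothetical.

```python
def annotate_with_citations(text: str, citations: list) -> str:
    """Insert a numbered marker such as [1] right after each cited span."""
    numbered = list(enumerate(citations, start=1))
    # Insert from the end of the text backwards so earlier offsets stay valid.
    for number, citation in sorted(numbered, key=lambda pair: pair[1]["end"], reverse=True):
        end = citation["end"]
        text = text[:end] + f"[{number}]" + text[end:]
    return text


# Offsets taken from a response recorded later in this notebook.
response_text = "The average evaluation score in run A is **0.63**."
citations = [{"start": 43, "end": 47}]
print(annotate_with_citations(response_text, citations))
```

The markers can then be linked to the `sources` of each citation, for example as footnotes below the response.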

# 1: Tool routing

With tool routing, the agent decides which tool(s) to use based on the user's query.
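
On the application side, routing comes down to looking up the tool name chosen by the model and calling the matching function with the generated arguments. Here is a self-contained sketch of that dispatch step. The stub functions and the names `stub_functions_map` and `dispatch` are hypothetical stand-ins for the notebook's real tools and `functions_map`.

```python
import json


def search_docs_stub(query: str) -> dict:
    """Hypothetical stand-in for a documentation search tool."""
    return {"source": "docs", "query": query}


def search_internet_stub(query: str) -> dict:
    """Hypothetical stand-in for an internet search tool."""
    return {"source": "internet", "query": query}


stub_functions_map = {
    "search_developer_docs": search_docs_stub,
    "search_internet": search_internet_stub,
}


def dispatch(tool_name: str, arguments_json: str) -> dict:
    """Call the function registered under tool_name with JSON-encoded arguments."""
    if tool_name not in stub_functions_map:
        raise KeyError(f"Unknown tool: {tool_name}")
    return stub_functions_map[tool_name](**json.loads(arguments_json))


# The model picks the tool name and arguments; the application only dispatches.
print(dispatch("search_internet", '{"query": "authors of the sentence BERT paper"}'))
```

The routing decision itself is made by the model, based on the tool descriptions and the system message. The application never inspects the query to decide where it should go.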

In [7]:
messages = run_agent("Where can I find the output of a run", search_tools)

Question:
Where can I find the output of a run
Tool plan:
I will search for 'where to find output of a run' in the developer documentation. 

Tool calls:
Tool name: search_developer_docs | Parameters: {"query":"where to find output of a run"}
Response:
To view a run, navigate to the W&B App UI, select the relevant project, and then choose the run from the 'Runs' table.

CITATIONS:
start=15 end=118 text="navigate to the W&B App UI, select the relevant project, and then choose the run from the 'Runs' table." sources=[ToolSource(type='tool', id='search_developer_docs_cncf7bghy0x8:0', tool_output={'developer_docs': '[{"text":"What is a W\\u0026B Run?\\nA W\\u0026B run is a single unit of computation logged by W\\u0026B, representing an atomic element of your project."},{"text":"How do I view a specific run?\\nTo view a run, navigate to the W\\u0026B App UI, select the relevant project, and then choose the run from the \'Runs\' table."},{"text":"What are Artifact Outputs?\\nArtifact Outputs

In [10]:
messages = run_agent("Who are the authors of the sentence BERT paper?", search_tools)
# chooses search_internet

Question:
Who are the authors of the sentence BERT paper?
Tool plan:
I will search for the authors of the sentence BERT paper. 

Tool calls:
Tool name: search_internet | Parameters: {"query":"authors of the sentence BERT paper"}
Response:
The authors of the sentence BERT paper are Nils Reimers and Iryna Gurevych.

CITATIONS:
start=43 end=47 text='Nils' sources=[ToolSource(type='tool', id='search_internet_c9s1dt82d6mx:0', tool_output={'documents': '[{"content":"Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks - ACL Anthology In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (Reimers \\u0026 Gurevych, EMNLP-IJCNLP 2019) abstract = \\"BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) has set a new state-of-the-art performance on sentence-pair 

# 2: Parallel tool use

The agent can call multiple tools in parallel. In this example, given that the user is asking about two different things in a single message, the agent generates two parallel tool calls.
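
"Parallel" here refers to the model emitting several tool calls in a single turn. `run_agent` executes them one after another, which is fine for mock tools. For slow, I/O-bound tools, an application could run them concurrently instead. The sketch below shows one way to do that with a thread pool. The tool-call dictionaries, `fake_search`, and `execute_tool_calls` are hypothetical and simplified compared with the SDK's tool-call objects.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def fake_search(query: str) -> dict:
    """Hypothetical tool that echoes the query instead of searching."""
    return {"results": f"docs about {query}"}


def execute_tool_calls(tool_calls: list, functions_map: dict) -> list:
    """Run every tool call concurrently and return one tool message per call."""

    def run_one(tool_call: dict) -> dict:
        result = functions_map[tool_call["name"]](**json.loads(tool_call["arguments"]))
        # Each message is tied to its own call through tool_call_id.
        return {"role": "tool", "tool_call_id": tool_call["id"], "content": json.dumps(result)}

    with ThreadPoolExecutor() as pool:
        # Executor.map returns results in the same order as the input calls.
        return list(pool.map(run_one, tool_calls))


tool_calls = [
    {"id": "call_0", "name": "fake_search", "arguments": '{"query": "W&B Run"}'},
    {"id": "call_1", "name": "fake_search", "arguments": '{"query": "view a specific run"}'},
]
for message in execute_tool_calls(tool_calls, {"fake_search": fake_search}):
    print(message)
```

Whichever execution strategy is used, each result goes back to the model as its own tool message, matched to the originating call by ID.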

In [11]:
messages = run_agent("Explain what is a W&B Run and how do I view a specific run", search_tools)

Question:
Explain what is a W&B Run and how do I view a specific run
Tool plan:
I will search for 'W&B Run' and 'view a specific run' in the developer documentation. 

Tool calls:
Tool name: search_developer_docs | Parameters: {"query":"W\u0026B Run"}
Tool name: search_developer_docs | Parameters: {"query":"view a specific run"}
Response:
A W&B run is a single unit of computation logged by W&B, representing an atomic element of your project.

To view a specific run, navigate to the W&B App UI, select the relevant project, and then choose the run from the 'Runs' table.

CITATIONS:
start=15 end=55 text='single unit of computation logged by W&B' sources=[ToolSource(type='tool', id='search_developer_docs_0gqbme8c2q67:0', tool_output={'developer_docs': '[{"text":"What is a W\\u0026B Run?\\nA W\\u0026B run is a single unit of computation logged by W\\u0026B, representing an atomic element of your project."},{"text":"How do I view a specific run?\\nTo view a run, navigate to the W\\u0026B App

# 3: Multi-step tool use

Some scenarios require tool calls to happen in sequence, for example when the output of one tool call is needed as input for another.

In this example, the agent first searches the developer docs to find the feature that automates hyperparameter search. Then, it uses the name of that feature to search for code examples.
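
The control flow behind sequential tool use is a loop: call the model, execute any requested tool, append the result, and repeat until the model answers. The sketch below reproduces that loop without an API call by replacing the LLM with a scripted function. Everything in it (`scripted_model`, `find_feature`, `find_tutorials`, `run_loop`, and the tutorial text) is hypothetical; only the feature name "W&B Sweeps" comes from the run recorded in this section.

```python
import json


def find_feature(query: str) -> dict:
    """Hypothetical first tool: maps a description to a feature name."""
    return {"feature": "W&B Sweeps"}


def find_tutorials(query: str) -> dict:
    """Hypothetical second tool: looks up tutorials for a feature name."""
    return {"tutorials": [f"Tutorial on {query}"]}


TOOLS = {"find_feature": find_feature, "find_tutorials": find_tutorials}


def scripted_model(messages: list) -> dict:
    """Stand-in for the LLM: chooses the next step from the tool results so far."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"tool_call": {"name": "find_feature", "arguments": {"query": "automate hyperparameter search"}}}
    if len(tool_results) == 1:
        # The second call depends on the output of the first one.
        feature = json.loads(tool_results[0]["content"])["feature"]
        return {"tool_call": {"name": "find_tutorials", "arguments": {"query": feature}}}
    return {"answer": "Finished after two sequential tool calls."}


def run_loop(question: str, max_steps: int = 5) -> list:
    """Alternate between model and tools until the model returns an answer."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = scripted_model(messages)
        if "answer" in step:
            messages.append({"role": "assistant", "content": step["answer"]})
            break
        call = step["tool_call"]
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return messages


for message in run_loop("What's that feature to automate hyperparameter search?"):
    print(message)
```

The `max_steps` cap is a design choice worth keeping in a real agent loop as well, since it bounds cost if the model keeps requesting tools.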

In [12]:
chat_history = run_agent("What's that feature to automate hyperparameter search? Do you have some code tutorials?", search_tools)
# Does two steps of tool use in a sequence
# Expected: returns code examples (the run recorded below found none)

Question:
What's that feature to automate hyperparameter search? Do you have some code tutorials?
Tool plan:
I will search for the feature to automate hyperparameter search. Then, I will search for code tutorials for this feature. 

Tool calls:
Tool name: search_developer_docs | Parameters: {"query":"feature to automate hyperparameter search"}
Tool plan:
I found that the feature to automate hyperparameter search is called W&B Sweeps. I will now search for code tutorials on W&B Sweeps. 

Tool calls:
Tool name: search_code_examples | Parameters: {"file_type":"null","language":"null","query":"W\u0026B Sweeps"}
Response:
W&B Sweeps is the feature to automate hyperparameter search. Unfortunately, I was unable to find any code tutorials for this feature.

CITATIONS:
start=0 end=10 text='W&B Sweeps' sources=[ToolSource(type='tool', id='search_developer_docs_bgka6480daeb:0', tool_output={'developer_docs': '[{"text":"What is a W\\u0026B Run?\\nA W\\u0026B run is a single unit of computation log

# 4: Self-correction

The concept of multi-step tool use can be extended to self-correction. Given the output of the current tool call, the agent may decide to change its plan, i.e., self-correct.

In this example, the agent doesn't find the information it's looking for in the developer docs. Thus, it generates a new tool call to search the internet.
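
In the notebook, the model makes this decision itself after reading the tool output. To make the underlying pattern concrete, the sketch below hard-codes the same behavior in application logic: try one source, and move to the next only when the result is empty. It is an analogy rather than what the agent does internally, and all names in it (`search_docs_stub`, `search_web_stub`, `search_with_fallback`) are hypothetical.

```python
def search_docs_stub(query: str) -> list:
    """Hypothetical docs search that only knows about one topic."""
    docs = {"runs": "A W&B run is a single unit of computation."}
    return [text for topic, text in docs.items() if topic in query.lower()]


def search_web_stub(query: str) -> list:
    """Hypothetical web search that always returns something."""
    return [f"Web result for: {query}"]


def search_with_fallback(query: str, sources: list) -> tuple:
    """Try each (name, function) pair in order and stop at the first non-empty result."""
    attempted = []
    for name, search_fn in sources:
        attempted.append(name)
        results = search_fn(query)
        if results:
            return results, attempted
    return [], attempted


sources = [("docs", search_docs_stub), ("web", search_web_stub)]
print(search_with_fallback("What is weave?", sources))
```

One practical takeaway for tool design: a tool should return a clear empty result rather than raise or return unrelated documents, so that the model can recognize the miss and re-plan.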

In [13]:
messages = run_agent("What is Wandb's weave solution?", search_tools)

Question:
What is Wandb's weave solution?
Tool plan:
I will search for 'Wandb's weave solution' in the developer documentation. 

Tool calls:
Tool name: search_developer_docs | Parameters: {"query":"Wandb's weave solution"}
Tool plan:
I couldn't find any information about Wandb's weave solution in the developer documentation. I will now search the internet to see if I can find any information. 

Tool calls:
Tool name: search_internet | Parameters: {"query":"Wandb's weave solution"}
Response:
Weights & Biases Weave is a lightweight toolkit for tracking and evaluating LLM applications. It is designed with the developer experience in mind, providing capabilities for straightforward evaluation and debugging of GenAI applications.

You can use Weave to:
- Log and debug language model inputs, outputs, and traces
- Build rigorous, apples-to-apples evaluations for language model use cases
- Organize all the information generated across the LLM workflow, from experimentation to evaluations to p

# 5: Structured queries

The tool use setup can be leveraged to perform structured queries. For data sources that contain rich metadata, structured queries make it possible to run highly specific searches that return more accurate results.

In this example, we can take advantage of metadata available in the code examples dataset such as the file type and language.
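
A structured query combines a free-text query with exact-match filters on metadata. The sketch below shows the idea on a small in-memory list. The first four records mirror the tool output recorded in this section; the fifth record, the function name `filter_code_examples`, the substring matching, and the handling of empty filter values are assumptions, not the notebook's actual `search_code_examples` implementation.

```python
CODE_EXAMPLES = [
    {"content": "Interactive W&B Charts Inside Jupyter", "file_type": "ipynb", "language": "en"},
    {"content": "Model/Data Versioning with Artifacts (PyTorch)", "file_type": "ipynb", "language": "en"},
    {"content": "Create a hyperparameter search with W&B PyTorch integration", "file_type": "ipynb", "language": "en"},
    {"content": "Get started with W&B Weave", "file_type": "ipynb", "language": "en"},
    # Hypothetical record, added so the file_type filter has something to exclude.
    {"content": "Data Versioning with Artifacts (script)", "file_type": "py", "language": "en"},
]

# Assumption: treat these values as "no filter"; the string "null" appears
# in one of the tool calls recorded earlier in this notebook.
NO_FILTER = (None, "", "null")


def filter_code_examples(query: str, file_type=None, language=None) -> list:
    """Return examples that match the metadata filters and contain the query text."""
    matches = []
    for example in CODE_EXAMPLES:
        if file_type not in NO_FILTER and example["file_type"] != file_type:
            continue
        if language not in NO_FILTER and example["language"] != language:
            continue
        if query.lower() in example["content"].lower():
            matches.append(example)
    return matches


print(filter_code_examples("Data Versioning with Artifacts", file_type="ipynb", language="en"))
```

Because the filters are separate parameters in the tool schema, the model can translate a phrase such as "jupyter notebook" into `file_type="ipynb"` instead of leaving it in the free-text query.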

In [14]:
messages = run_agent("Any jupyter notebook for Data Versioning with Artifacts?", search_tools)
# Tool call: Searches search_code_examples with file_type = ipynb
# Answer: Returns file - Model/Data Versioning with Artifacts (PyTorch)

Question:
Any jupyter notebook for Data Versioning with Artifacts?
Tool plan:
I will search for a Jupyter notebook for Data Versioning with Artifacts. 

Tool calls:
Tool name: search_code_examples | Parameters: {"file_type":"ipynb","language":"en","query":"Data Versioning with Artifacts"}
Response:
Yes, there is a Jupyter notebook for Data Versioning with Artifacts (PyTorch).

CITATIONS:
start=16 end=77 text='Jupyter notebook for Data Versioning with Artifacts (PyTorch)' sources=[ToolSource(type='tool', id='search_code_examples_f4m0g1kqnedp:0', tool_output={'code_examples': '[{"content":"Interactive W\\u0026B Charts Inside Jupyter","file_type":"ipynb","language":"en"},{"content":"Model/Data Versioning with Artifacts (PyTorch)","file_type":"ipynb","language":"en"},{"content":"Create a hyperparameter search with W\\u0026B PyTorch integration","file_type":"ipynb","language":"en"},{"content":"Get started with W\\u0026B Weave","file_type":"ipynb","language":"en"}]'})] 



# 6: Structured data queries

The agent can generate queries against structured data sources, such as a CSV file or a database.

In this example, we'll use a mock dataset containing LLM application evaluation results for different use cases and settings. Since the dataset is a CSV file, the `analyze_evaluation_results` tool can query it with the pandas library: the model generates the pandas code, and the tool runs it in a Python interpreter.
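
The actual tool is defined in `tool_def_v2.py`. As a rough idea of how a code-executing tool can work, the sketch below runs a code string with `exec` and returns whatever it printed. The `python_answer` key matches the tool output recorded in this section; the function name `run_python_snippet` and the `error` key are assumptions. Note that `exec` is not a sandbox, so model-generated code should only be run this way in a trusted or isolated environment.

```python
import contextlib
import io


def run_python_snippet(code: str) -> dict:
    """Execute a code string and capture what it prints to standard output."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            # Fresh globals for every call; this isolates variables, not security.
            exec(code, {})
    except Exception as exc:
        # Returning the error lets the model see what went wrong and retry.
        return {"python_answer": buffer.getvalue(), "error": f"{type(exc).__name__}: {exc}"}
    return {"python_answer": buffer.getvalue()}


print(run_python_snippet("scores = [0.5, 0.75]\nprint(sum(scores) / len(scores))"))
```

Returning errors as data, rather than raising, pairs well with the self-correction behavior shown earlier, because the model can read the error and generate a corrected snippet.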

In [18]:
messages = run_agent("What's the average evaluation score in run A", analysis_tool)
# Answer: 0.63

Question:
What's the average evaluation score in run A
Tool plan:
I will use the 'analyze_evaluation_results' tool to find the average evaluation score in run A. 

Tool calls:
Tool name: analyze_evaluation_results
  import pandas as pd
  
  df = pd.read_csv("evaluation_results.csv")
  
  # Filter the dataframe to only include rows where the run is "A"
  filtered_df = df[df["run"] == "A"]
  
  # Calculate the average score for run A
  average_score = filtered_df["score"].mean()
  
  print(f"Average score for run A: {average_score}")
None
Response:
The average evaluation score in run A is **0.63**.

CITATIONS:
start=43 end=47 text='0.63' sources=[ToolSource(type='tool', id='analyze_evaluation_results_7pay84nc1943:0', tool_output={'python_answer': 'Average score for run A: 0.6333333333333334\n'})] 



In [21]:
messages = run_agent("What's the latency of the highest-scoring run for the summarize_article use case?", analysis_tool)
# Answer: 4.8

Question:
What's the latency of the highest-scoring run for the summarize_article use case?
Tool plan:
I will use the 'analyze_evaluation_results' tool to find the latency of the highest-scoring run for the summarize_article use case. 

Tool calls:
Tool name: analyze_evaluation_results
  import pandas as pd
  
  df = pd.read_csv("evaluation_results.csv")
  
  # Filter for the summarize_article use case
  filtered_df = df[df["usecase"] == "summarize_article"]
  
  # Find the highest-scoring run
  highest_score_run = filtered_df.loc[filtered_df["score"].idxmax()]
  
  # Get the latency of the highest-scoring run
  latency = highest_score_run["latency"]
  
  print(f"The latency of the highest-scoring run for the summarize_article use case is {latency} seconds.")
None
Response:
The latency of the highest-scoring run for the summarize_article use case is 4.8 seconds.

CITATIONS:
start=77 end=89 text='4.8 seconds.' sources=[ToolSource(type='tool', id='analyze_evaluation_results_0jqgr5x3e8nb:

In [22]:
messages = run_agent("Which use case uses the least amount of tokens on average and what's the average token count?", analysis_tool)
# Answer: extract_names (106.25)

Question:
Which use case uses the least amount of tokens on average and what's the average token count?
Tool plan:
I will use the `analyze_evaluation_results` tool to find out which use case uses the least amount of tokens on average and what the average token count is. 

Tool calls:
Tool name: analyze_evaluation_results
  import pandas as pd
  
  df = pd.read_csv("evaluation_results.csv")
  
  # Calculate the average number of tokens for each use case
  average_tokens_per_usecase = df.groupby("usecase")["tokens"].mean()
  
  # Find the use case with the lowest average number of tokens
  min_tokens_usecase = average_tokens_per_usecase.idxmin()
  min_tokens_usecase_avg = average_tokens_per_usecase.min()
  
  print(f"Use case with the lowest average number of tokens: {min_tokens_usecase}")
  print(f"Average number of tokens for that use case: {min_tokens_usecase_avg}")
None
Response:
The use case with the lowest average number of tokens is 'extract_names', with an average of 106.25 token

# Conclusion

This notebook demonstrates how we can implement an agentic RAG system with tool use.

We covered the following use cases:
1. Tool routing
2. Parallel tool use
3. Multi-step tool use
4. Self-correction
5. Structured queries
6. Structured data queries