# Very simple draft of LlamaStack Max Tool Experiment

## Overview
This script tests how well LlamaStack handles increasing numbers of tools by measuring **tool selection accuracy, execution success, and latency**.

## Experiment Setup
- **5 Real Tools**: Weather info, word count, string reversal, uppercase conversion, insurance scoring.
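- **Fake Tools**: Dynamically generated tools with random names, descriptions, and outputs, added on top of the real tools.
- **5 Fixed Queries**: Each mapped to a ground truth tool.
- **Scaling**: Start with the 5 real tools and add one fake tool per step (the loop is capped below 100 tools), stopping at the first tool count where a tool fails to execute.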
- **Metrics Logged**:
  - Exception Rate (fraction of the 5 queries that raise an exception)
  - Tool Execution Success Rate (fraction of the 5 queries for which a tool is actually executed)
  - Correct Tool Selection Rate (fraction of the 5 queries for which the correct tool is selected)
  - Average Latency (average time taken to respond to the 5 queries)
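
As a minimal sketch of how the four metrics above can be derived, assume each query produces one small record. The `QueryOutcome` type and the `summarize` helper are hypothetical names used only for illustration; they are not part of this notebook or of LlamaStack, and the example values are made up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QueryOutcome:
    """Hypothetical per-query record (not part of llama-stack-client)."""
    raised_exception: bool
    tool_executed: bool
    correct_tool: bool
    latency_s: float


def summarize(outcomes: List[QueryOutcome]) -> Dict[str, float]:
    """Aggregate per-query outcomes into the four logged metrics."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("at least one outcome is required")
    return {
        "exception_rate": sum(o.raised_exception for o in outcomes) / n,
        "tool_execution_rate": sum(o.tool_executed for o in outcomes) / n,
        "correct_tool_rate": sum(o.correct_tool for o in outcomes) / n,
        "average_latency_s": sum(o.latency_s for o in outcomes) / n,
    }


# Made-up example: 5 queries, one of which executed no tool at all.
outcomes = [QueryOutcome(False, True, True, 2.0) for _ in range(4)]
outcomes.append(QueryOutcome(False, False, False, 3.0))
summary = summarize(outcomes)
print(summary)
```

With only 5 queries per tool count, each rate moves in steps of 20%, so a single failed query is already visible in the results.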


In [1]:
pip show llama-stack-client

Name: llama_stack_client
Version: 0.1.8
Summary: The official Python library for the llama-stack-client API
Home-page: https://github.com/meta-llama/llama-stack-client-python
Author: 
Author-email: Llama Stack Client <dev-feedback@llama-stack-client.com>
License-Expression: Apache-2.0
Location: /opt/anaconda3/envs/stack-client/lib/python3.10/site-packages
Requires: anyio, click, distro, httpx, pandas, prompt-toolkit, pyaml, pydantic, rich, sniffio, termcolor, tqdm, typing-extensions
Required-by: llama_stack
Note: you may need to restart the kernel to use updated packages.


In [2]:
import asyncio
import os
import random
import time
import csv
import sys
import types
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.client_tool import client_tool
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from dotenv import load_dotenv
from rich.pretty import pprint
import logging
load_dotenv()


True

In [3]:
# Define real tools
@client_tool
def weather_info(loc: str):
    """Fetches the current weather for a given location.
    
    :param loc: The location for which weather information is requested.
    :returns: A dictionary containing success status and the weather result.
    """
    return {"success": True, "result": f"Weather in {loc} is sunny."}

@client_tool
def word_count(text: str):
    """Counts the number of words in the given text.
    
    :param text: The input text to analyze.
    :returns: A dictionary containing success status and the word count.
    """
    return {"success": True, "result": len(text.split())}

@client_tool
def reverse_string(text: str):
    """Reverses the given string.
    
    :param text: The input text to reverse.
    :returns: A dictionary containing success status and the reversed string.
    """
    return {"success": True, "result": text[::-1]}

@client_tool
def uppercase(text: str):
    """Converts the given string to uppercase.
    
    :param text: The input text to convert.
    :returns: A dictionary containing success status and the uppercase text.
    """
    return {"success": True, "result": text.upper()}

@client_tool
def insurance_scorer(text: str):
    """Generates an insurance score between 1 and 100.
    
    :param text: The input text to evaluate.
    :returns: A dictionary containing success status and the generated number.
    """
    return {"success": True, "result": random.randint(1, 100)}

In [4]:
# Generate fake tools using `types.FunctionType`
def generate_fake_tools(n):
    tools = []
    
    for i in range(n):
        tool_name = f"tool_{i}_{generate_random_text(2)}"
        tool_doc = f"""Tool {i} performs a unique operation on the input data. {generate_random_text(10)}
        
        :param input_data: The input data for the tool.
        :returns: A dictionary with success status and a unique response.
        """
        
        def fake_tool(input_data: str, tool_id=i):
            responses = [
                f"Tool {tool_id} processed input: {input_data}",
                f"Tool {tool_id} received: {input_data}",
                f"Input {input_data} was handled by tool {tool_id}",
            ]
            return {"success": True, "result": random.choice(responses)}
        
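        # Clone the template under the generated name. The defaults must be passed
        # explicitly; otherwise `tool_id` becomes a required argument of the clone.
        fake_tool_fn = types.FunctionType(fake_tool.__code__, globals(), tool_name, fake_tool.__defaults__)
        fake_tool_fn.__doc__ = tool_doc
        print(tool_name)
        print(tool_doc[:100])
        fake_tool_fn = client_tool(fake_tool_fn)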
        
        tools.append(fake_tool_fn)
    
    return tools

def generate_random_text(length=10):
    words = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel", "india", "juliet", "kilo", "lima", "mike", "november", "oscar", "papa", "quebec", "romeo", "sierra", "tango", "uniform", "victor", "whiskey", "x-ray", "yankee", "zulu"]
    return " ".join(random.choices(words, k=length))
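
The cell above creates each fake tool by cloning a template function under a new name. Below is a minimal, library-independent sketch of that cloning step, assuming one also wants to keep the template's default arguments and type annotations (`types.FunctionType` copies neither unless told to). `clone_function` and `template` are hypothetical names and do not appear in the notebook.

```python
import types


def clone_function(func, new_name, defaults=None):
    """Return a copy of `func` that reports `new_name` as its __name__."""
    argdefs = defaults if defaults is not None else func.__defaults__
    clone = types.FunctionType(
        func.__code__,      # same compiled body
        func.__globals__,   # same global namespace
        new_name,           # new __name__
        argdefs,            # default values for trailing positional arguments
        func.__closure__,   # None for module-level functions
    )
    # Annotations are not copied by FunctionType; set the docstring explicitly too.
    clone.__annotations__ = dict(func.__annotations__)
    clone.__doc__ = func.__doc__
    return clone


def template(input_data: str, tool_id=0):
    """Echo the input together with the tool id."""
    return {"success": True, "result": f"Tool {tool_id} received: {input_data}"}


# Each clone gets its own name and its own default tool_id.
clones = [clone_function(template, f"tool_{i}", defaults=(i,)) for i in range(3)]
print([c.__name__ for c in clones])
print(clones[2]("hello"))
```

Binding the id through the defaults tuple, rather than through a closure over the loop variable, keeps every clone tied to its own id after the loop has finished.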

In [5]:
# Define test queries and ground truth tools
queries = [
    ("What is the weather in New York?", weather_info),
    ("How many words are in 'Hello World, this is a test sentence'?", word_count),
    ("Reverse this text: Python Experiment", reverse_string),
    ("Convert this to uppercase: llamastack", uppercase),
    ("Give me an insurance evaluation score", insurance_scorer)
]

In [6]:
def log_results(results, csv_filename):
    """Writes the experiment results to a CSV file, one row per tool count."""
    with open(csv_filename, mode="w", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(["Tool Count", "Exception Rate", "Tool Execution Rate", "Correct Tool Rate", "Average Latency (s)"])
        writer.writerows(results)
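
Once the CSV exists, the largest tool count that still achieved a perfect rate can be recovered from it. The sketch below assumes the column names written by `log_results` above; `max_perfect_tool_count` is a hypothetical helper, and the sample rows are made up purely to exercise it.

```python
import csv
import os
import tempfile


def max_perfect_tool_count(csv_path, column="Correct Tool Rate"):
    """Return the last tool count before the rate in `column` first drops below 1.0.

    Rows are read in file order and scanning stops at the first row below 1.0.
    Returns None when even the first row is below 1.0.
    """
    best = None
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if float(row[column]) < 1.0:
                break
            best = int(row["Tool Count"])
    return best


# Illustrative, made-up rows in the same layout as the results file.
header = ["Tool Count", "Exception Rate", "Tool Execution Rate",
          "Correct Tool Rate", "Average Latency (s)"]
sample_rows = [
    [5, 0.0, 1.0, 1.0, 1.9],
    [6, 0.0, 1.0, 1.0, 2.1],
    [7, 0.0, 0.8, 0.8, 2.4],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_results.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(sample_rows)
    threshold = max_perfect_tool_count(path)

print(threshold)
```

Passing `column="Tool Execution Rate"` gives the corresponding threshold for tool execution.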

In [7]:
# Run the experiment
model_id = os.getenv("INFERENCE_MODEL")
# model_id = "meta-llama/Llama-3.2-3B-Instruct"
print(model_id)
inference_model = model_id.split("/")[-1]  # model name without the org prefix
environment = "local" # "nerc" or "local"
temperature = 1

# Setup logging to a file
output_dir = "experiment_logs"
os.makedirs(output_dir, exist_ok=True)
experiment_date = time.strftime("%Y%m%d_%H%M%S")
subname = f"{inference_model}_{environment}_temp{temperature}_{experiment_date}"
log_file = os.path.join(output_dir, f"results_{subname}.log")
csv_filename = os.path.join(output_dir, f"results_{subname}.csv")

# Redirect print statements to a log file
class Logger(object):
    def __init__(self, filename):
        self.terminal = sys.stdout
        self.log = open(filename, "a")

    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)

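    def flush(self):
        # Flush both targets so the log file stays in sync with the notebook output.
        self.terminal.flush()
        self.log.flush()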

sys.stdout = Logger(log_file)

base_url = f"http://localhost:{os.getenv('LLAMA_STACK_PORT')}" if environment == "local" else os.getenv("LLAMA_STACK_ENDPOINT")
print(base_url)
client = LlamaStackClient(
    base_url = base_url
)

real_tools = [weather_info, word_count, reverse_string, uppercase, insurance_scorer]
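results = []
# Largest tool count that still passed each check (-1 means no failure seen yet).
# Initialised once, before the loop, so they are not reset on every iteration.
max_correct_tool_count = -1
max_tool_exe_count = -1

for total_tools in range(5, 100, 1):  # Add one tool per iteration, from 5 up to 99 tools
    # Pad the real tools with fake ones so that len(tools) == total_tools.
    tools = real_tools + generate_fake_tools(total_tools - len(real_tools))
    print(len(tools))
    
    exception_count = 0
    tool_execution_count = 0
    correct_tool_count = 0
    total_latency = 0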

    for i, (query, correct_tool) in enumerate(queries):
        agent = Agent(
            client=client,
            model=model_id,
            instructions="""You are an AI tool calling assistant. Must use the correct tool for each query.
            When using the tools:
            1. Extract the relevant number or values from the user's request.
            2. Use the correct tool to perform the operation.
            3. Present the result clearly.
            4. Handle errors gracefully.""",
            tools=tools,
            sampling_params={  # TODO: test how temperature affects the results.
                "strategy": {
                    "type": "top_p",
                    "temperature": temperature,
                    "top_p": 0.9,
                }
            },
            
        )

        print(f"\nUser: {query}")
        print(f"Agent id is {agent.agent_id}")
        session_id = agent.create_session(f"tool-experiment-session-{i+1}")
        print(f'session id is {session_id}')
        
        # Start the clock right before the turn so latency excludes session creation.
        start_time = time.time()
        try:
            response = agent.create_turn(
                messages=[
                    {"role": "user", "content": query}
                ],
                session_id=session_id,
                stream=False,
            )
            
            end_time = time.time()
            response_time = end_time - start_time
            total_latency += response_time
            # pprint(response)
            
            print(f"Inference: {response.output_message.content}")

            steps = response.steps
            if len(steps) > 1:
                # Name of the first tool call in each tool-execution step.
                executed_tools = [
                    step.tool_calls[0].tool_name
                    for step in steps
                    if step.step_type == "tool_execution" and step.tool_calls
                ]
                tool_executed = bool(executed_tools)
                correct_tool_used = correct_tool.__name__ in executed_tools
                if tool_executed:
                    # Report the tool that actually ran instead of assuming it is step 1.
                    print(f"Executed Tool: {executed_tools[0]}")
                    print(f"Ground Truth Tool: {correct_tool.__name__}")
                tool_execution_count += tool_executed
                correct_tool_count += correct_tool_used
            else:
                print("Error: Not enough steps in response to access step 1.")
            
        except Exception as e:
            print(f"Error processing query: {e}")
            exception_count += 1

    exception_rate = exception_count / len(queries)
    tool_execution_rate = tool_execution_count / len(queries)
    correct_tool_rate = correct_tool_count / len(queries)
    average_latency = total_latency / len(queries)
    
    results.append([total_tools, exception_rate, tool_execution_rate, correct_tool_rate, average_latency])
    print(f"\nTotal Tools: {total_tools}, Exception Rate: {exception_rate:.2%}, Tool Execution Rate: {tool_execution_rate:.2%}, Correct Tool Rate: {correct_tool_rate:.2%}, Avg Latency: {average_latency:.4f}s")
    
    if correct_tool_rate < 1 and max_correct_tool_count == -1:
        max_correct_tool_count = total_tools - 1
        session_response = client.agents.session.retrieve(
            session_id=session_id,
            agent_id=agent.agent_id,
        )
        pprint(session_response)
    if tool_execution_rate < 1 and max_tool_exe_count == -1:
        max_tool_exe_count = total_tools - 1
        session_response = client.agents.session.retrieve(
            session_id=session_id,
            agent_id=agent.agent_id,
        )
        pprint(session_response)
        break
log_results(results, csv_filename)
print(f"Max correct tool count: {max_correct_tool_count}")
print(f"Max tool execution count: {max_tool_exe_count}")

meta-llama/Llama-3.2-3B-Instruct
http://localhost:8321
5

User: What is the weather in New York?
Agent id is 7bacc04f-ce7d-4c3c-8707-85755140bd7e
session id is c623f21a-1faa-4a36-a171-36c3d68460a3
Inference: Here is the current weather for New York:

The weather in New York is sunny.
Executed Tool: weather_info
Ground Truth Tool: weather_info

User: How many words are in 'Hello World, this is a test sentence'?
Agent id is cd9bbf33-28fb-42ef-a7df-ba967e2680d2
session id is 20ee9d85-fdfd-467c-9fbb-28f2789e04ec
Inference: The word count of the given text is 7.
Executed Tool: word_count
Ground Truth Tool: word_count

User: Reverse this text: Python Experiment
Agent id is f0f972fc-793b-4e98-85a6-cb73f19a109a
session id is 02b39b10-b0eb-46e2-ac99-b1bd94dd16f8
Inference: Here is the reversed string "Python Experiment".
Executed Tool: reverse_string
Ground Truth Tool: reverse_string

User: Convert this to uppercase: llamastack
Agent id is e81b7af8-af6f-42a5-9e98-12633f4e1dee
session id is 2d7d

Inference: [tool_5_x-ray romeo]
Error: Not enough steps in response to access step 1.

Total Tools: 14, Exception Rate: 0.00%, Tool Execution Rate: 80.00%, Correct Tool Rate: 80.00%, Avg Latency: 2.5401s
Max correct tool count: 13
Max tool execution count: 13
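
The run above replaces `sys.stdout` with a tee object and never restores it, so later cells keep writing to the same log file. Below is a minimal sketch of a scoped alternative built on the standard library's `contextlib.redirect_stdout`. The `Tee` class and the `tee_stdout` helper are hypothetical names, not part of the notebook, and an in-memory `io.StringIO` stands in for the log file.

```python
import contextlib
import io
import sys


class Tee:
    """Minimal write-only stream that forwards everything to several streams."""

    def __init__(self, *streams):
        self._streams = streams

    def write(self, message):
        for stream in self._streams:
            stream.write(message)
        return len(message)

    def flush(self):
        for stream in self._streams:
            stream.flush()


@contextlib.contextmanager
def tee_stdout(log_stream):
    """Mirror stdout into `log_stream` only for the duration of the block."""
    original = sys.stdout
    with contextlib.redirect_stdout(Tee(original, log_stream)):
        yield
    # redirect_stdout restores sys.stdout on exit, even after an exception.


captured = io.StringIO()  # stands in for an open log file
with tee_stdout(captured):
    print("Total Tools: 5")
print("not logged")
log_text = captured.getvalue()
```

Only the text printed inside the `with` block reaches the log, and `sys.stdout` is back to its previous value afterwards.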


In [8]:
# session_response = client.agents.session.retrieve(
#                 session_id="31822cbb-c4af-4032-ac45-9c7d5628cce7",
#                 agent_id="c65548f1-0e6b-4a70-aed2-ae7249640b23",
#             )
# pprint(session_response)