# Evaluating Tool Calling Limitations and Performance of Small LLMs
Goal: The primary goal of this experiment is to evaluate the tool calling limitations of small LLMs (1B - 3B parameters) and to identify methods (e.g., prompting, tool descriptions) to enhance their performance.

Evaluation Set: This analysis uses a custom evaluation set of 600 queries, all generated by Gemini: 15 queries for each of 40 function tools.

This initial experiment shows that `llama3.2:3B` loses almost all tool-calling accuracy when provided with more than 32 tools, which is why `TOOL_LIMIT` is set to 32. This was a surprising outcome: one might intuitively expect accuracy to decline gradually as tools are added, not to collapse at a specific count.
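
A minimal sketch of how the limit interacts with the evaluation set, mirroring the selection and filtering logic in `run_test` below. The tool names and query records here are hypothetical stand-ins, not entries from the real evaluation set; the only assumption carried over is 40 tools with 15 queries each.

```python
def select_tools(tool_names, tool_limit):
    """Keep the first `tool_limit` names in sorted order, as run_test does."""
    return sorted(tool_names)[:tool_limit]


def filter_queries(queries, selected):
    """Drop queries whose expected tool was cut by the limit."""
    selected_set = set(selected)
    return [q for q in queries if q["tool_call"] in selected_set]


# Hypothetical data: 40 tools, 15 queries per tool (600 queries total).
tool_names = [f"tool_{i:02d}" for i in range(40)]
queries = [
    {"query": f"query {j} for {name}", "tool_call": name}
    for name in tool_names
    for j in range(15)
]

selected = select_tools(tool_names, 32)
runnable = filter_queries(queries, selected)
print(len(queries), len(selected), len(runnable))
```

Under these assumptions, a limit of 32 means only the queries for the 32 selected tools are exercised; the queries for the remaining 8 tools are skipped instead of being counted as failures.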

In [None]:
import os
import utils
from dotenv import load_dotenv
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.client_tool import ClientTool
from tools import tools, tools_only_params, tools_no_extra_tags, tools_bad_function_names
from tests import load_queries, run_client_tool_test

load_dotenv()

In [None]:
# Set up reused logger and client
logger = utils.setup_logger()

base_url = os.getenv('REMOTE_BASE_URL')
if not base_url:
    logger.error("REMOTE_BASE_URL environment variable not set")
    exit(1)

llama_client = LlamaStackClient(base_url=base_url)

In [None]:
normal_client_tool_queries = os.path.join(os.getcwd(), "queries/", "client_tool_queries.json")
bad_function_names_client_tool_queries = os.path.join(os.getcwd(), "queries/", "client_tool_queries_bad_functions.json")

In [None]:
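def run_test(models, tool_module, client_tool_queries, analysis_plot_path):
    """Run every applicable query against each model and log a summary.

    Reads the module-level TOOL_LIMIT, so it must be set before calling.
    """
    # Collect up to TOOL_LIMIT ClientTool objects, in alphabetical order
    tool_list = []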
    for name in sorted(dir(tool_module)):
        if len(tool_list) >= TOOL_LIMIT:
            break
        attribute = getattr(tool_module, name)
        if isinstance(attribute, ClientTool):
            tool_list.append(attribute)
    tool_name_set = {tool.__name__ for tool in tool_list}

    # Track statistics
    total_tests = 0
    successful_tests = 0

    # Loop through models (outermost loop)
    for model in models:
        logger.info(f"\n=== Testing with model: {model} ===\n")

        queries = load_queries(client_tool_queries)
        if not queries:
            logger.info(f"No queries found in {client_tool_queries}")
            continue

        for query_obj in queries:
            if query_obj["tool_call"] not in tool_name_set:
                continue
            total_tests += 1
            success = run_client_tool_test(model, query_obj, tool_list, llama_client, logger)
            if success:
                successful_tests += 1

    # Print summary
    logger.info("\n=== Test Summary ===")
    logger.info(f"Total tests: {total_tests}")
    logger.info(f"Successful tests: {successful_tests}")
    logger.info(f"Failed tests: {total_tests - successful_tests}")
    if total_tests > 0:
        success_rate = (successful_tests / total_tests) * 100
        logger.info(f"Success rate: {success_rate:.1f}%")

    # Generate plots
    logger.info("\n=== Generating plots ===")
    utils.get_analysis_plots(analysis_plot_path)

In [None]:
# Set tool limit to max for llama3.2:3B
TOOL_LIMIT = 32

In [None]:
run_test(
    models=["meta-llama/Llama-3.2-3B-Instruct"],
    tool_module=tools_no_extra_tags,
    client_tool_queries=normal_client_tool_queries,
    analysis_plot_path="./results/no_extra_tools_client_tool_metrics.csv"
)

# Results
![no_extra_tags_tool_call_match_per_function_tool.jpg](results/plots/no_extra_tags_tool_call_match_per_function_tool.jpg)

Based on these results, `llama3.2:3B` achieves fairly high accuracy given just a good function name and a docstring with no annotations other than `:param:`.

The next cell shows how adding a single tool beyond 32 leads to a complete loss of accuracy.

In [None]:
# Set tool limit to one more than the max (32)
TOOL_LIMIT = 33

In [None]:
run_test(
    models=["meta-llama/Llama-3.2-3B-Instruct"],
    tool_module=tools_no_extra_tags,
    client_tool_queries=normal_client_tool_queries,
    analysis_plot_path="./results/no_extra_tools_client_tool_metrics.csv"
)

# Results
![tool_calling_match_per_function_33_tools_3B](results/plots/tool_calling_match_per_function_33_tools_3B.jpg)
This shows that adding a single tool beyond 32 drops the rate of successful tool calls to almost 0.
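
One way to locate such a cliff systematically is to sweep the tool count and flag the first sharp drop. The sketch below is only an illustration: `find_tool_cliff` and `placeholder_accuracy` are hypothetical names, and the accuracy values are placeholders, not measurements from this experiment. In practice `accuracy_at` would wrap a call to `run_test` with a given `TOOL_LIMIT`.

```python
from typing import Callable, Optional


def find_tool_cliff(
    accuracy_at: Callable[[int], float],
    max_tools: int,
    drop_threshold: float = 0.5,
) -> Optional[int]:
    """Return the first tool count where accuracy falls by more than
    `drop_threshold` relative to the previous count, or None if no such drop."""
    previous = accuracy_at(1)
    for n in range(2, max_tools + 1):
        current = accuracy_at(n)
        if previous - current > drop_threshold:
            return n
        previous = current
    return None


def placeholder_accuracy(n: int) -> float:
    # Placeholder values only; NOT measured results.
    return 0.9 if n <= 32 else 0.02


print(find_tool_cliff(placeholder_accuracy, 40))
```

A linear sweep is used here instead of a binary search because it makes no assumption that accuracy is monotonic in the number of tools.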

This next cell tests whether adding explicit `:description:` and `:use_case:` annotations increases accuracy. Adding them is **not** guaranteed to improve accuracy in general, but it helped for our query set.

In [None]:
# Set tool limit back to max for llama3.2:3B
TOOL_LIMIT = 32

In [None]:
run_test(
    models=["meta-llama/Llama-3.2-3B-Instruct"],
    tool_module=tools,
    client_tool_queries=normal_client_tool_queries,
    analysis_plot_path="./results/normal_client_tool_metrics.csv"
)

# Results
![Tool Call Match Per Function Tool](results/plots/normal_tool_call_match_per_function_tool.png)
Overall, the majority of the tools reach 100% accuracy, and there is a slight increase in correct tool calls compared to having just the `:param:` annotation.

The next few cells run experiments to test what truly matters when defining a tool: the tool name, the description, or the format of the docstring. Below is a quick overview of how `llama-stack` parses a function decorated with `client_tool`.

```python
def client_tool(func: T) -> ClientTool:
    """
    Decorator to convert a function into a ClientTool.
    ...
    """

    class _WrappedTool(ClientTool):
        __name__ = func.__name__
        __doc__ = func.__doc__
        __module__ = func.__module__

        def get_name(self) -> str:
            ...

        def get_description(self) -> str:
            ...

        def get_params_definition(self) -> Dict[str, Parameter]:
            hints = get_type_hints(func)
            # Remove return annotation if present
            hints.pop("return", None)

            # Get parameter descriptions from docstring
            params = {}
            sig = inspect.signature(func)
            doc = inspect.getdoc(func) or ""

            for name, type_hint in hints.items():
                # Look for :param name: in docstring
                param_doc = ""
                for line in doc.split("\n"):
                    if line.strip().startswith(f":param {name}:"):
                        param_doc = line.split(":", 2)[2].strip()
                        break

                if param_doc == "":
                    raise ValueError(f"No parameter description found for parameter {name}")

                ...

            return params
```
The full implementation can be found [here](https://github.com/meta-llama/llama-stack-client-python/blob/645d2195c5af1c6f903cb93c293319d8f94c36cc/src/llama_stack_client/lib/agents/client_tool.py#L150-L170).
Two things are important to note. First, `llama-stack` **purposefully** disregards the return information from the docstring. Second, the docstring only **requires** one annotation, `:param:`, and everything _above_ it is parsed together.
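
To make that concrete, here is a minimal standalone sketch that mimics the behavior described above. `split_docstring` is a hypothetical helper written for illustration, not part of `llama-stack`; only the `line.split(":", 2)` extraction is taken from the excerpt, and joining the leading lines into one description is an assumption about how "parsed together" plays out.

```python
import inspect


def split_docstring(func):
    """Split a docstring into (description, {param_name: param_doc}).

    Lines before the first `:param` line form the description; anything
    after the params that is not a `:param` line (e.g. `:returns:`) is ignored.
    """
    doc = inspect.getdoc(func) or ""
    description_lines = []
    params = {}
    seen_param = False
    for line in doc.split("\n"):
        stripped = line.strip()
        if stripped.startswith(":param "):
            seen_param = True
            parts = stripped.split(":", 2)  # ["", "param a", " The first number."]
            if len(parts) == 3:
                params[parts[1][len("param "):]] = parts[2].strip()
        elif not seen_param and stripped:
            description_lines.append(stripped)
    return " ".join(description_lines), params


def add_two_numbers(a: float, b: float) -> float:
    """
    Adds two numbers.
    Use when the user wants to find the sum of two numbers.
    :param a: The first number.
    :param b: The second number.
    :returns: The sum of `a` and `b`.
    """
    return a + b


description, params = split_docstring(add_two_numbers)
print(description)
print(params)
```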

The following comparison tests whether explicit `:description:` and `:use_case:` annotations help, compared to including the same text without any annotation.

Ex.
```python
@client_tool
def add_two_numbers(a: float, b: float) -> float:
    """
    :description: Adds two numbers.
    :use_case: Use when the user wants to find the sum, total, or combined value of two numbers.
    :param a: The first number.
    :param b: The second number.
    :returns: The sum of `a` and `b`.
    """
    return a + b
```

compared to

```python
@client_tool
def add_two_numbers(a: float, b: float) -> float:
    """
    Adds two numbers.
    Use when the user wants to find the sum, total, or combined value of two numbers.
    :param a: The first number.
    :param b: The second number.
    :returns: The sum of `a` and `b`.
    """
    return a + b
```

This next cell tests whether the tool name matters at all. For this test, all functions were renamed to `function_1`, `function_2`, etc., but the docstrings were left unchanged.

Ex.
```python
@client_tool
def function_1(a: float, b: float) -> float:
    """
    :description: Adds two numbers.
    :use_case: Use when the user wants to find the sum, total, or combined value of two numbers.
    :param a: The first number.
    :param b: The second number.
    :returns: The sum of `a` and `b`.
    """
    return a + b
```

In [None]:
run_test(
    models=["meta-llama/Llama-3.2-3B-Instruct"],
    tool_module=tools_bad_function_names,
    client_tool_queries=bad_function_names_client_tool_queries,
    analysis_plot_path="./results/bad_function_names_client_tool_metrics.csv"
)

# Results

![bad_function_names_tool_call_match_per_function_tool](results/plots/bad_function_names_tool_call_match_per_function_tool.jpg)

The results show a sharp degradation in accuracy, emphasizing the importance of good function naming practices. A possible follow-up experiment is to check whether unit-test-style function naming works well for client tools and MCP servers.

This next cell tests whether the tool description matters at all. For this test, all docstrings were reduced to only the required `:param:` annotations, and the function names were kept the same.

Ex.
```python
@client_tool
def add_two_numbers(a: float, b: float) -> float:
    """
    :param a: The first number.
    :param b: The second number.
    """
    return a + b
```

In [None]:
run_test(
    models=["meta-llama/Llama-3.2-3B-Instruct"],
    tool_module=tools_only_params,
    client_tool_queries=normal_client_tool_queries,
    analysis_plot_path="./results/only_params_client_tool_metrics.csv"
)

# Results
![only_params_tool_call_match_per_function.jpg](results/plots/only_params_tool_call_match_per_function.jpg)

The results show that removing all details from the docstring other than the required `:param:` annotation does not lead to a large decrease in accuracy. This is likely why `llama-stack` requires only the `:param:` annotation and nothing else, such as `:use_case:`.

## We will now run the same evaluation set on the well-constructed tools but swap `llama3.2:3B` for `llama3.2:1B`.

In [None]:
run_test(
    models=["llama3.2:1b"],
    tool_module=tools,
    client_tool_queries=normal_client_tool_queries,
    analysis_plot_path="./results/normal_client_tool_metrics_1B.csv"
)

# Results

![tool_calling_match_per_function_23_tools_1B.png.jpg](results/plots/tool_calling_match_per_function_23_tools_1B.png.jpg)

The results show that `llama3.2:1B` is far worse at tool calling than `llama3.2:3B`, even with all the best practices used above.

# Summary

The following observations were made using our eval set of 600 queries.
- `llama3.2:3B` can handle a maximum of 32 tools before a complete degradation in accuracy.
- Well-named functions are more important than a well-written function description.
- Explicitly added `:description:` and `:use_case:` annotations showed a slight improvement in accuracy for our eval set.
- `llama3.2:1B` is too small a model and is extremely inconsistent at tool calling.
  - Note that a quantized model was used, which could have been a major factor in the low performance.