# Evaluating Tool Calling Limitations and Performance of Small LLMs
Goal: The primary goal of this experiment is to evaluate the tool calling limitations of small LLMs (1B - 3B parameters) and to identify methods (e.g., prompting, tool descriptions) to enhance their performance.

Evaluation Set: This analysis uses a custom evaluation set of 600 queries, all generated by Gemini: 15 queries for each of 40 function tools.
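
The 15 × 40 = 600 balance can be checked before running anything. The following is a minimal sketch, assuming each query record stores its expected tool under a hypothetical `tool_name` key; the actual schema of `client_tool_queries.json` is not shown in this notebook.

```python
from collections import Counter


def check_balance(queries, per_tool=15, n_tools=40, key="tool_name"):
    """Return True if there are `n_tools` tools with exactly `per_tool` queries each."""
    counts = Counter(q[key] for q in queries)
    return len(counts) == n_tools and all(c == per_tool for c in counts.values())


# Synthetic stand-in for the real query file (hypothetical schema).
sample_queries = [
    {"tool_name": f"tool_{t}", "query": f"query {i}"}
    for t in range(40)
    for i in range(15)
]
print(len(sample_queries), check_balance(sample_queries))
```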

This initial experiment shows that the tool-calling accuracy of `llama3.2:3B` collapses when the model is given more than 32 tools, which is why `TOOL_LIMIT` is set to 32. This was a surprising outcome: one might intuitively expect a gradual decrease in performance rather than a complete loss of accuracy.
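
The cells below do not show where `TOOL_LIMIT` is consumed. One plausible reading, stated here as an assumption, is that the tool list is truncated to its first `TOOL_LIMIT` entries before it is handed to the model. `limit_tools` is a hypothetical helper used only for illustration.

```python
def limit_tools(all_tools, tool_limit):
    """Return at most `tool_limit` tools, preserving their order."""
    if tool_limit < 0:
        raise ValueError("tool_limit must be non-negative")
    return list(all_tools[:tool_limit])


# Placeholder names stand in for the 40 real tool objects.
all_tools = [f"tool_{i}" for i in range(40)]
print(len(limit_tools(all_tools, 32)))  # 32
```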

In [None]:
import os
import utils
from dotenv import load_dotenv
from llama_stack_client import LlamaStackClient
from tools import tools, tools_only_params, tools_no_extra_tags, tools_bad_function_names
from tests import load_queries, run_client_tool_test

load_dotenv()

In [None]:
# Set tool limit to max for llama3.2:3B
TOOL_LIMIT = 32

In [None]:
def main():
    """Main function to run all tests."""
    # Set up logger
    logger = utils.setup_logger()

    # Create client
    base_url = os.getenv('REMOTE_BASE_URL')
    print(f"base_url={base_url}")
    if not base_url:
        logger.error("REMOTE_BASE_URL environment variable not set")
        return

    llama_client = LlamaStackClient(base_url=base_url)

    # Define models to test
    # make sure they are available in your LLS server
    models = ["meta-llama/Llama-3.2-3B-Instruct"]

    client_tool_queries = os.path.join(os.getcwd(), "queries/", "client_tool_queries.json")

    # Track statistics
    total_tests = 0
    successful_tests = 0

    # Loop through models (outermost loop)
    for model in models:
        logger.info(f"\n=== Testing with model: {model} ===\n")

        if client_tool_queries:
            queries = load_queries(client_tool_queries)

            if not queries:
                logger.info(f"No queries found in {client_tool_queries}")
                continue

            for query_obj in queries:
                total_tests += 1
                success = run_client_tool_test(model, query_obj, tools_no_extra_tags, llama_client, logger)
                if success:
                    successful_tests += 1

    # Print summary
    logger.info(f"\n=== Test Summary ===")
    logger.info(f"Total tests: {total_tests}")
    logger.info(f"Successful tests: {successful_tests}")
    logger.info(f"Failed tests: {total_tests - successful_tests}")
    if total_tests > 0:
        success_rate = (successful_tests / total_tests) * 100
        logger.info(f"Success rate: {success_rate:.1f}%")

    # Generate plots
    logger.info(f"\n=== Generating plots ===")
    utils.get_analysis_plots('./results/no_extra_tools_client_tool_metrics.csv')


main()
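
Every experiment in this notebook repeats the same counting logic; only the model, tool list, query file, and output path change. The following is a minimal sketch of that logic factored into one function. `run_experiment` and `run_one` are hypothetical names; `run_one` stands in for `run_client_tool_test` with the model, tools, client, and logger already bound.

```python
def run_experiment(queries, run_one):
    """Run `run_one` on every query and return summary statistics."""
    total = 0
    successful = 0
    for query_obj in queries:
        total += 1
        if run_one(query_obj):
            successful += 1
    # Guard against an empty query list to avoid dividing by zero.
    rate = (successful / total) * 100 if total else 0.0
    return {
        "total": total,
        "successful": successful,
        "failed": total - successful,
        "success_rate": rate,
    }


# Stand-in test function: succeeds for even-numbered queries only.
summary = run_experiment(range(10), lambda q: q % 2 == 0)
print(summary)
```

In the notebook, `run_one` could be bound as `lambda q: run_client_tool_test(model, q, tools, llama_client, logger)`, which matches the call made inside `main()`.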


# Results
![no_extra_tags_tool_call_match_per_function_tool.jpg](results/plots/no_extra_tags_tool_call_match_per_function_tool.jpg)

Based on the results, `llama3.2:3B` reaches fairly high accuracy when given just a good function name and `:param` annotations in the docstring.

The next cell shows how adding a single tool beyond 32 leads to a complete loss of accuracy.

In [None]:
# Set tool limit to one more than the max (32)
TOOL_LIMIT = 33

In [None]:
main()

# Results
![tool_calling_match_per_function_33_tools_3B](results/plots/tool_calling_match_per_function_33_tools_3B.jpg)
This shows that adding a single tool beyond 32 drops the accuracy of tool calls to almost 0.
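
One way to locate such a cliff is to sweep the tool count and record accuracy at each setting. The following is a minimal sketch in which `evaluate` is a hypothetical callable returning accuracy in [0, 1] for a given tool count. The fake evaluator only mimics the shape of the 32-tool cliff reported above; its numbers are illustrative and are not measurements.

```python
def largest_working_limit(evaluate, limits, min_accuracy=0.5):
    """Return the largest limit whose accuracy is at least `min_accuracy`, or None."""
    best = None
    for limit in sorted(limits):
        if evaluate(limit) >= min_accuracy:
            best = limit
    return best


def fake_evaluate(n_tools):
    # Illustrative stand-in, not measured data.
    return 0.9 if n_tools <= 32 else 0.0


print(largest_working_limit(fake_evaluate, range(28, 41)))  # 32
```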

The next cell tests whether adding explicit `:description:` and `:use_case:` annotations increases accuracy. Adding them is **not** guaranteed to increase accuracy in general, but it helped for our query set.

In [None]:
# Set tool limit back to max for llama3.2:3B
TOOL_LIMIT = 32

In [None]:
def main():
    """Main function to run all tests."""
    # Set up logger
    logger = utils.setup_logger()

    # Create client
    base_url = os.getenv('REMOTE_BASE_URL')
    print(f"base_url={base_url}")
    if not base_url:
        logger.error("REMOTE_BASE_URL environment variable not set")
        return

    llama_client = LlamaStackClient(base_url=base_url)

    # Define models to test
    # make sure they are available in your LLS server
    models = ["meta-llama/Llama-3.2-3B-Instruct"]

    client_tool_queries = os.path.join(os.getcwd(), "queries/", "client_tool_queries.json")

    # Track statistics
    total_tests = 0
    successful_tests = 0

    # Loop through models (outermost loop)
    for model in models:
        logger.info(f"\n=== Testing with model: {model} ===\n")

        if client_tool_queries:
            queries = load_queries(client_tool_queries)

            if not queries:
                logger.info(f"No queries found in {client_tool_queries}")
                continue

            for query_obj in queries:
                total_tests += 1
                success = run_client_tool_test(model, query_obj, tools, llama_client, logger)
                if success:
                    successful_tests += 1

    # Print summary
    logger.info(f"\n=== Test Summary ===")
    logger.info(f"Total tests: {total_tests}")
    logger.info(f"Successful tests: {successful_tests}")
    logger.info(f"Failed tests: {total_tests - successful_tests}")
    if total_tests > 0:
        success_rate = (successful_tests / total_tests) * 100
        logger.info(f"Success rate: {success_rate:.1f}%")

    # Generate plots
    logger.info(f"\n=== Generating plots ===")
    utils.get_analysis_plots('./results/normal_client_tool_metrics.csv')


main()

# Results
![Tool Call Match Per Function Tool](results/plots/normal_tool_call_match_per_function_tool.png)
Overall, the majority of the tools have 100% accuracy. The tool with the worst accuracy is `convert_fahrenheit_to_kelvin`. One possible explanation is that the training data had very little, if any, data on this type of Q/A. `convert_celsius_to_kelvin` has 100% accuracy, which is consistent with this theory, as that is a common conversion in science courses.
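
The per-tool bars in plots like the one above amount to a grouped mean. The following is a minimal sketch, assuming each result is a `(tool_name, matched)` pair. The actual columns of the metrics CSV are not shown in this notebook, and the sample rows are made up for illustration.

```python
from collections import defaultdict


def accuracy_per_tool(results):
    """Map each tool name to its fraction of matched tool calls."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for tool_name, matched in results:
        totals[tool_name] += 1
        hits[tool_name] += int(bool(matched))
    return {name: hits[name] / totals[name] for name in totals}


# Made-up rows, not results from this experiment.
rows = [
    ("add_two_numbers", True),
    ("add_two_numbers", True),
    ("convert_fahrenheit_to_kelvin", True),
    ("convert_fahrenheit_to_kelvin", False),
]
print(accuracy_per_tool(rows))
```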

The next few cells run experiments to test what truly matters when defining a tool: the tool name, the description, and the format of the docstring. Below is a quick overview of how `llama-stack` parses a function tagged with the `client_tool` decorator.

```python
def client_tool(func: T) -> ClientTool:
    """
    Decorator to convert a function into a ClientTool.
    ...
    """

    class _WrappedTool(ClientTool):
        __name__ = func.__name__
        __doc__ = func.__doc__
        __module__ = func.__module__

        def get_name(self) -> str:
            ...

        def get_description(self) -> str:
            ...

        def get_params_definition(self) -> Dict[str, Parameter]:
            hints = get_type_hints(func)
            # Remove return annotation if present
            hints.pop("return", None)

            # Get parameter descriptions from docstring
            params = {}
            sig = inspect.signature(func)
            doc = inspect.getdoc(func) or ""

            for name, type_hint in hints.items():
                # Look for :param name: in docstring
                param_doc = ""
                for line in doc.split("\n"):
                    if line.strip().startswith(f":param {name}:"):
                        param_doc = line.split(":", 2)[2].strip()
                        break

                if param_doc == "":
                    raise ValueError(f"No parameter description found for parameter {name}")

                ...

            return params
```
The full implementation can be found [here](https://github.com/meta-llama/llama-stack-client-python/blob/645d2195c5af1c6f903cb93c293319d8f94c36cc/src/llama_stack_client/lib/agents/client_tool.py#L150-L170).
Note that `llama-stack` **purposefully** disregards the return information from the docstring. Note also that the docstring **requires** only one kind of annotation, a `:param <name>:` line for each parameter, and that everything _above_ those lines is parsed as well.
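
To see the parsing rule in isolation, the `:param name:` lookup from the excerpt above can be reproduced with the standard library alone. This is a simplified sketch and not the library's code: `describe_params` is a hypothetical name, and it returns plain strings rather than `Parameter` objects.

```python
import inspect
from typing import Dict, get_type_hints


def describe_params(func) -> Dict[str, str]:
    """Collect the `:param name:` text for each type-hinted parameter."""
    hints = get_type_hints(func)
    hints.pop("return", None)  # the return annotation is dropped, as in the excerpt
    doc = inspect.getdoc(func) or ""
    described = {}
    for name in hints:
        param_doc = ""
        for line in doc.split("\n"):
            if line.strip().startswith(f":param {name}:"):
                # Split on the first two colons; the rest is the description.
                param_doc = line.split(":", 2)[2].strip()
                break
        if param_doc == "":
            raise ValueError(f"No parameter description found for parameter {name}")
        described[name] = param_doc
    return described


def add_two_numbers(a: float, b: float) -> float:
    """
    Adds two numbers.
    :param a: The first number.
    :param b: The second number.
    :returns: The sum of `a` and `b`.
    """
    return a + b


print(describe_params(add_two_numbers))
```

A function whose docstring lacks a `:param` line for one of its type-hinted parameters raises `ValueError`, which is why that annotation is the only mandatory one.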

One comparison is whether explicit `:description:` and `:use_case:` annotations help, compared to including the same text without any annotation.

Ex.
```python
@client_tool
def add_two_numbers(a: float, b: float) -> float:
    """
    :description: Adds two numbers.
    :use_case: Use when the user wants to find the sum, total, or combined value of two numbers.
    :param a: The first number.
    :param b: The second number.
    :returns: The sum of `a` and `b`.
    """
    return a + b
```

compared to

```python
@client_tool
def add_two_numbers(a: float, b: float) -> float:
    """
    Adds two numbers.
    Use when the user wants to find the sum, total, or combined value of two numbers.
    :param a: The first number.
    :param b: The second number.
    :returns: The sum of `a` and `b`.
    """
    return a + b
```
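
The second variant can be derived mechanically from the first by dropping the two tag prefixes and keeping their text. The following is a minimal sketch; `strip_extra_tags` is a hypothetical helper, and how `tools_no_extra_tags` was actually produced is not shown in this notebook.

```python
import re

# Matches a leading ":description:" or ":use_case:" tag and the spaces after it.
_EXTRA_TAG = re.compile(r"^([ \t]*):(?:description|use_case):[ \t]*", re.MULTILINE)


def strip_extra_tags(doc: str) -> str:
    """Remove the tag prefixes but keep indentation and the text that follows."""
    return _EXTRA_TAG.sub(r"\1", doc)


tagged = (
    ":description: Adds two numbers.\n"
    ":use_case: Use when the user wants the sum of two numbers.\n"
    ":param a: The first number.\n"
    ":param b: The second number."
)
print(strip_extra_tags(tagged))
```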

This next cell will test whether the tool name matters at all. To do this test, all functions were renamed to `function_1`, `function_2`, etc. but the docstring was left unchanged.

Ex.
```python
@client_tool
def function_1(a: float, b: float) -> float:
    """
    :description: Adds two numbers.
    :use_case: Use when the user wants to find the sum, total, or combined value of two numbers.
    :param a: The first number.
    :param b: The second number.
    :returns: The sum of `a` and `b`.
    """
    return a + b
```
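
Renaming 40 functions by hand is error-prone, so the same effect can be sketched programmatically. `with_generic_name` is a hypothetical helper, written under the assumption that only the function's `__name__` needs to change before the `client_tool` decorator is applied. The notebook does not show how `tools_bad_function_names` was actually built.

```python
import types


def with_generic_name(func, index):
    """Return a copy of `func` named `function_<index>` with the same body and docstring."""
    new_name = f"function_{index}"
    clone = types.FunctionType(
        func.__code__, func.__globals__, new_name,
        func.__defaults__, func.__closure__,
    )
    # Carry over the metadata that docstring and type-hint parsing rely on.
    clone.__doc__ = func.__doc__
    clone.__annotations__ = dict(func.__annotations__)
    clone.__kwdefaults__ = func.__kwdefaults__
    clone.__qualname__ = new_name
    return clone


def add_two_numbers(a: float, b: float) -> float:
    """
    :description: Adds two numbers.
    :param a: The first number.
    :param b: The second number.
    """
    return a + b


renamed = with_generic_name(add_two_numbers, 1)
print(renamed.__name__, renamed(1.0, 2.0))  # function_1 3.0
```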

In [None]:
def main():
    """Main function to run all tests."""
    # Set up logger
    logger = utils.setup_logger()

    # Create client
    base_url = os.getenv('REMOTE_BASE_URL')
    print(f"base_url={base_url}")
    if not base_url:
        logger.error("REMOTE_BASE_URL environment variable not set")
        return

    llama_client = LlamaStackClient(base_url=base_url)

    # Define models to test
    # make sure they are available in your LLS server
    models = ["meta-llama/Llama-3.2-3B-Instruct"]

    client_tool_queries = os.path.join(os.getcwd(), "queries/", "client_tool_queries_bad_functions.json")

    # Track statistics
    total_tests = 0
    successful_tests = 0

    # Loop through models (outermost loop)
    for model in models:
        logger.info(f"\n=== Testing with model: {model} ===\n")

        if client_tool_queries:
            queries = load_queries(client_tool_queries)

            if not queries:
                logger.info(f"No queries found in {client_tool_queries}")
                continue

            for query_obj in queries:
                total_tests += 1
                success = run_client_tool_test(model, query_obj, tools_bad_function_names, llama_client, logger)
                if success:
                    successful_tests += 1

    # Print summary
    logger.info(f"\n=== Test Summary ===")
    logger.info(f"Total tests: {total_tests}")
    logger.info(f"Successful tests: {successful_tests}")
    logger.info(f"Failed tests: {total_tests - successful_tests}")
    if total_tests > 0:
        success_rate = (successful_tests / total_tests) * 100
        logger.info(f"Success rate: {success_rate:.1f}%")

    # Generate plots
    logger.info(f"\n=== Generating plots ===")
    utils.get_analysis_plots('./results/bad_function_names_client_tool_metrics.csv')


main()


# Results

![bad_function_names_tool_call_match_per_function_tool](results/plots/bad_function_names_tool_call_match_per_function_tool.jpg)

The results show a sharp drop in accuracy, emphasizing the importance of good function naming practices. A possible follow-up experiment is to test whether unit-test-style function naming helps for client tools and MCP servers.

This next cell tests whether the tool description matters at all. For this test, every docstring was reduced to only the required `:param` annotations, and the function names were left unchanged.

Ex.
```python
@client_tool
def add_two_numbers(a: float, b: float) -> float:
    """
    :param a: The first number.
    :param b: The second number.
    """
    return a + b
```
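
Reducing a full docstring to this form is a simple line filter. The following is a minimal sketch with a hypothetical helper, `keep_only_params`; the notebook does not show how `tools_only_params` was actually produced.

```python
def keep_only_params(doc: str) -> str:
    """Keep only the docstring lines that start with `:param `."""
    kept = [
        line.strip()
        for line in doc.splitlines()
        if line.strip().startswith(":param ")
    ]
    return "\n".join(kept)


full_doc = """
    :description: Adds two numbers.
    :use_case: Use when the user wants to find the sum of two numbers.
    :param a: The first number.
    :param b: The second number.
    :returns: The sum of `a` and `b`.
"""
print(keep_only_params(full_doc))
```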

In [None]:
def main():
    """Main function to run all tests."""
    # Set up logger
    logger = utils.setup_logger()

    # Create client
    base_url = os.getenv('REMOTE_BASE_URL')
    print(f"base_url={base_url}")
    if not base_url:
        logger.error("REMOTE_BASE_URL environment variable not set")
        return

    llama_client = LlamaStackClient(base_url=base_url)

    # Define models to test
    # make sure they are available in your LLS server
    models = ["meta-llama/Llama-3.2-3B-Instruct"]

    client_tool_queries = os.path.join(os.getcwd(), "queries/", "client_tool_queries.json")

    # Track statistics
    total_tests = 0
    successful_tests = 0

    # Loop through models (outermost loop)
    for model in models:
        logger.info(f"\n=== Testing with model: {model} ===\n")

        if client_tool_queries:
            queries = load_queries(client_tool_queries)

            if not queries:
                logger.info(f"No queries found in {client_tool_queries}")
                continue

            for query_obj in queries:
                total_tests += 1
                success = run_client_tool_test(model, query_obj, tools_only_params, llama_client, logger)
                if success:
                    successful_tests += 1

    # Print summary
    logger.info(f"\n=== Test Summary ===")
    logger.info(f"Total tests: {total_tests}")
    logger.info(f"Successful tests: {successful_tests}")
    logger.info(f"Failed tests: {total_tests - successful_tests}")
    if total_tests > 0:
        success_rate = (successful_tests / total_tests) * 100
        logger.info(f"Success rate: {success_rate:.1f}%")

    # Generate plots
    logger.info(f"\n=== Generating plots ===")
    utils.get_analysis_plots('./results/only_params_client_tool_metrics.csv')


main()


# Results
![only_params_tool_call_match_per_function.jpg](results/plots/only_params_tool_call_match_per_function.jpg)

The results show that removing everything from the docstring except the required `:param` annotations does not lead to a large decrease in accuracy. This is likely why `llama-stack` requires only the `:param` annotations and nothing else, such as `:use_case:`.

## We will now run the same evaluation set on the well-constructed tools but swap `llama3.2:3B` with `llama3.2:1B`.

In [None]:
def main():
    """Main function to run all tests."""
    # Set up logger
    logger = utils.setup_logger()

    # Create client
    base_url = os.getenv('REMOTE_BASE_URL')
    print(f"base_url={base_url}")
    if not base_url:
        logger.error("REMOTE_BASE_URL environment variable not set")
        return

    llama_client = LlamaStackClient(base_url=base_url)

    # Define models to test
    # make sure they are available in your LLS server
    models = ["llama3.2:1b"]

    client_tool_queries = os.path.join(os.getcwd(), "queries/", "client_tool_queries.json")

    # Track statistics
    total_tests = 0
    successful_tests = 0

    # Loop through models (outermost loop)
    for model in models:
        logger.info(f"\n=== Testing with model: {model} ===\n")

        if client_tool_queries:
            queries = load_queries(client_tool_queries)

            if not queries:
                logger.info(f"No queries found in {client_tool_queries}")
                continue

            for query_obj in queries:
                total_tests += 1
                success = run_client_tool_test(model, query_obj, tools, llama_client, logger)
                if success:
                    successful_tests += 1

    # Print summary
    logger.info(f"\n=== Test Summary ===")
    logger.info(f"Total tests: {total_tests}")
    logger.info(f"Successful tests: {successful_tests}")
    logger.info(f"Failed tests: {total_tests - successful_tests}")
    if total_tests > 0:
        success_rate = (successful_tests / total_tests) * 100
        logger.info(f"Success rate: {success_rate:.1f}%")

    # Generate plots
    logger.info(f"\n=== Generating plots ===")
    utils.get_analysis_plots('./results/normal_client_tool_metrics_1B.csv')


main()


# Results

![tool_calling_match_per_function_23_tools_1B.png.jpg](results/plots/tool_calling_match_per_function_23_tools_1B.png.jpg)

The results show that `llama3.2:1B` is far worse at tool calling than `llama3.2:3B`.

# Summary

The following observations were made using our eval set of 600 queries.
- `llama3.2:3B` can handle a maximum of 32 tools before accuracy degrades completely.
- Well-named functions are more important than a well-written function description.
- Explicitly adding `:description:` and `:use_case:` annotations showed a slight improvement in accuracy for our eval set.
- LLMs need to be exposed to tool calling in their training data; otherwise they have difficulty calling tools correctly.
    - This was seen with `granite3.2` pulled from ollama, which wasn't able to call a single tool.
- `llama3.2:1B` is too small a model and is extremely inconsistent at tool calling.
    - Even when given just a single tool and explicitly instructed to call it, the model cannot do so consistently. Note that the quantized model was used, which could have been a big factor in the low performance.