## Multiple Connected Function Calling with llama.cpp
### Adapted from the Ollama Notebook

### Requirements

#### 1. Install llama.cpp
Installation instructions for each OS (macOS, Linux, Windows) are available in the [llama-cpp-python documentation](https://llama-cpp-python.readthedocs.io/en/latest/).

#### 2. Python llama.cpp Library

Install the `llama-cpp-python` bindings:

In [None]:
%pip install llama-cpp-python

#### 3. Pull the model from HuggingFace

Download the Hermes-2-Pro-Llama-3-8B GGUF from Hugging Face (uploaded by NousResearch) [here](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF).

### Usage

#### 1. Define Tools

In [1]:
import random

def get_weather_forecast(location: str) -> dict[str, str]:
    """Retrieves the weather forecast for a given location"""
    # Mock values for test
    return {
        "location": location,
        "forecast": "sunny",
        "temperature": "25°C",
    }

def get_random_city() -> str:
    """Retrieves a random city from a list of cities"""
    cities = ["Groningen", "Enschede", "Amsterdam", "Istanbul", "Baghdad", "Rio de Janeiro", "Tokyo", "Kampala"]
    return random.choice(cities)

def get_random_number() -> int:
    """Retrieves a random number"""
    # Mock value for test
    return 31

#### 2. Define Function Caller

For this Jupyter example, I'm simply registering the functions in a dictionary. In a Python project, you can use the implementation here as inspiration: https://github.com/AtakanTekparmak/ollama_langhcain_fn_calling/tree/main

In [2]:
import inspect

class FunctionCaller:
    """
    A class to call functions from tools.py.
    """

    def __init__(self):
        # Initialize the functions dictionary
        self.functions = {
            "get_weather_forecast": get_weather_forecast,
            "get_random_city": get_random_city,
            "get_random_number": get_random_number,
        }
        self.outputs = {}

    def create_functions_metadata(self) -> list[dict]:
        """Creates the functions metadata for the prompt. """
        def format_type(p_type: str) -> str:
            """Format the type of the parameter."""
            # If p_type begins with "<class", then it is a class type
            if p_type.startswith("<class"):
                # Get the class name from the type
                p_type = p_type.split("'")[1]
            
            return p_type
            
        functions_metadata = []
        for name, function in self.functions.items():
            descriptions = function.__doc__.split("\n")
            print(descriptions)
            functions_metadata.append({
                "name": name,
                "description": descriptions[0],
                "parameters": {
                    "properties": [ # Get the parameters for the function
                        {   
                            "name": param_name,
                            "type": format_type(str(param_type)),
                        }
                        # Remove the return type from the parameters
                        for param_name, param_type in function.__annotations__.items() if param_name != "return"
                    ],
                    
                    "required": [param_name for param_name in function.__annotations__ if param_name != "return"],
                } if function.__annotations__ else {},
                "returns": [
                    {
                        "name": name + "_output",
                        "type": format_type(str(function.__annotations__["return"])),
                    }
                ]
            })

        return functions_metadata

    def call_function(self, function):
        """
        Call the function from the given input.

        Args:
            function (dict): A dictionary containing the function details.
        """
    
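        def check_if_input_is_output(input: dict) -> dict:
            """Replace any value that names a previous output with that output."""
            # Build a new dict so the parsed function call is not mutated in place;
            # otherwise re-running a cell would reuse values from an earlier run
            return {
                key: self.outputs[value] if isinstance(value, str) and value in self.outputs else value
                for key, value in input.items()
            }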

        # Get the function name from the function dictionary
        function_name = function["name"]
        
        # Get the function params from the function dictionary
        function_input = function["params"] if "params" in function else None
        function_input = check_if_input_is_output(function_input) if function_input else None
    
        # Call the function from tools.py with the given input
        # pass all the arguments to the function from the function_input
        output = self.functions[function_name](**function_input) if function_input else self.functions[function_name]()
        self.outputs[function["output"]] = output
        return output
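
The chaining that `call_function` performs through `self.outputs` can be seen in isolation with a minimal sketch. The helper names (`resolve_params`, `run_calls`) and the toy functions are illustrative assumptions, not part of the notebook; the sketch builds a new params dict instead of modifying the parsed call.

```python
def resolve_params(params: dict, outputs: dict) -> dict:
    """Swap any string value that names an earlier output for that output."""
    return {
        key: outputs[value] if isinstance(value, str) and value in outputs else value
        for key, value in params.items()
    }


def run_calls(calls: list[dict], functions: dict) -> dict:
    """Run calls in order, storing each result under its "output" name."""
    outputs = {}
    for call in calls:
        params = resolve_params(call.get("params") or {}, outputs)
        outputs[call["output"]] = functions[call["name"]](**params)
    return outputs


# Hypothetical stand-ins for the tools defined earlier
toy_functions = {
    "pick_city": lambda: "Tokyo",
    "describe": lambda location: f"Sunny in {location}",
}
toy_calls = [
    {"name": "pick_city", "params": {}, "output": "city"},
    {"name": "describe", "params": {"location": "city"}, "output": "report"},
]
chained = run_calls(toy_calls, toy_functions)
print(chained)
```

The second call receives `"Tokyo"` because its `location` value matches the `output` name of the first call, and `toy_calls` itself is left unchanged.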

    

#### 3. Setup The Function Caller and Prompt

In [3]:
# Initialize the FunctionCaller 
function_caller = FunctionCaller()

# Create the functions metadata
functions_metadata = function_caller.create_functions_metadata()

['Retrieves the weather forecast for a given location']
['Retrieves a random city from a list of cities']
['Retrieves a random number']
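
The type names in the metadata come from calling `str()` on each annotation. The following is a minimal standalone sketch of that conversion, mirroring the `format_type` helper above; the names `annotation_to_type_name` and `example` are assumptions made for illustration.

```python
def annotation_to_type_name(annotation) -> str:
    """Turn a type annotation into a short type name for the prompt."""
    text = str(annotation)
    # Plain classes render as "<class 'int'>"; keep only the quoted name
    if text.startswith("<class"):
        return text.split("'")[1]
    # Generic aliases such as dict[str, str] already render readably
    return text


def example(location: str, days: int) -> dict[str, str]:
    """Hypothetical function used only to show the conversion."""
    return {"location": location, "days": str(days)}


type_names = {
    name: annotation_to_type_name(annotation)
    for name, annotation in example.__annotations__.items()
}
print(type_names)
```

The `return` entry appears alongside the parameters, which is why `create_functions_metadata` filters it out of `properties` and reports it under `returns` instead.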


In [4]:
import json

# Create the system prompt
prompt_beginning = """
You are an AI assistant that can help the user with a variety of tasks. You have access to the following functions:

"""

system_prompt_end = """

When the user asks you a question, if you need to use functions, provide ONLY the function calls, and NOTHING ELSE, in the format:
<function_calls>    
[
    { "name": "function_name_1", "params": { "param_1": "value_1", "param_2": "value_2" }, "output": "The output variable name, to be possibly used as input for another function"},
    { "name": "function_name_2", "params": { "param_3": "value_3", "param_4": "output_1"}, "output": "The output variable name, to be possibly used as input for another function"},
    ...
]
"""
system_prompt = prompt_beginning + f"<tools> {json.dumps(functions_metadata, indent=4)} </tools>" + system_prompt_end

#### 4. Load the model

In [6]:
import llama_cpp
model = llama_cpp.Llama(
    model_path='Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-F16.gguf',
    n_gpu_layers=32,
    n_threads=10,
    use_mlock=True,
    flash_attn=True,
    n_ctx=8192,
)

llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Hermes-2-Pro-Llama-3-Instruct-Merged-DPO
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7: 

#### Inference

In [7]:

# Compose the prompt 
user_query = "What's the temperature in a random city?"

# Get the response from the model
messages = [
    {'role': 'system', 'content': system_prompt,
    },
    {'role': 'user', 'content': user_query}
]
response = model.create_chat_completion(messages=messages)
print(response)
# Get the function calls from the response



llama_print_timings:        load time =    1707.16 ms
llama_print_timings:      sample time =      20.74 ms /    63 runs   (    0.33 ms per token,  3037.75 tokens per second)
llama_print_timings: prompt eval time =    1706.23 ms /   450 tokens (    3.79 ms per token,   263.74 tokens per second)
llama_print_timings:        eval time =    4579.55 ms /    62 runs   (   73.86 ms per token,    13.54 tokens per second)
llama_print_timings:       total time =    6663.81 ms /   512 tokens


{'id': 'chatcmpl-8e53b561-32d6-433b-802d-5bb982660363', 'object': 'chat.completion', 'created': 1717239441, 'model': 'Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-F16.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '[\n    {\n        "name": "get_random_city",\n        "params": {},\n        "output": "random_city"\n    },\n    {\n        "name": "get_weather_forecast",\n        "params": {"location": "random_city"},\n        "output": "weather_forecast"\n    }\n]'}, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 450, 'completion_tokens': 62, 'total_tokens': 512}}


In [8]:
response['choices'][0]['message']['content']

'[\n    {\n        "name": "get_random_city",\n        "params": {},\n        "output": "random_city"\n    },\n    {\n        "name": "get_weather_forecast",\n        "params": {"location": "random_city"},\n        "output": "weather_forecast"\n    }\n]'

In [9]:
function_calls = response['choices'][0]['message']['content']
# If the response starts with <function_calls>, keep everything after the tag
if function_calls.startswith("<function_calls>"):
    function_calls = function_calls.split("<function_calls>")[1]

# Read function calls as json
try:
    function_calls_json: list[dict] = json.loads(function_calls)
except json.JSONDecodeError:
    function_calls_json = []
    print("Model response not in desired JSON format")
finally:
    print("Function calls:")
    print(function_calls_json)

Function calls:
[{'name': 'get_random_city', 'params': {}, 'output': 'random_city'}, {'name': 'get_weather_forecast', 'params': {'location': 'random_city'}, 'output': 'weather_forecast'}]
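
Models do not always return a bare JSON list, so a slightly more tolerant parser can help. This is a sketch under the assumption that the reply contains at most one JSON array, possibly wrapped in `<function_calls>` tags or surrounded by prose; `extract_function_calls` is a hypothetical helper, not part of the notebook.

```python
import json


def extract_function_calls(text: str) -> list[dict]:
    """Pull a JSON list of function calls out of a model reply."""
    # Drop the optional wrapper tags if the model echoed them
    for tag in ("<function_calls>", "</function_calls>"):
        text = text.replace(tag, "")
    # Keep only the outermost [...] span, ignoring any stray prose around it
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []


reply = '<function_calls>\n[{"name": "get_random_city", "params": {}, "output": "random_city"}]'
print(extract_function_calls(reply))
print(extract_function_calls("Sorry, I cannot help with that."))
```

Returning an empty list on any failure keeps the later loop over the calls safe to run without extra checks.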


#### Append the assistant message to the chat and call the functions

In [14]:
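# Wrap the function calls in <tool_call> tags, as described in the chat template on Hugging Face
# json.dumps keeps the payload valid JSON (str() would emit a Python repr with single quotes)
function_message = '<tool_call>' + json.dumps(function_calls_json) + '</tool_call>'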

messages.append({'role': 'assistant', 'content': function_message})

In [15]:
for function in function_calls_json:
    output = f"Tool Response: {function_caller.call_function(function)}"
    print(output)

Tool Response: Amsterdam
Tool Response: {'location': 'Groningen', 'forecast': 'sunny', 'temperature': '25°C'}


In [16]:
# Call the functions again, keeping only the output of the last call
output = ""
for function in function_calls_json:
    output = f"{function_caller.call_function(function)}"

# Append the tool response to the messages in the chat format
tool_output = '<tool_response> ' + output + ' </tool_response>'
messages.append({'role': 'tool', 'content': tool_output})
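
The cell above passes only the last function's output back to the model. If every intermediate result should be visible, one option is to serialize all named outputs into a single tool message. This is a sketch only: `build_tool_message` is a hypothetical helper, and whether the model handles a combined payload as well as a single value is not verified here.

```python
import json


def build_tool_message(results: dict) -> dict:
    """Wrap every named tool result in a single <tool_response> message."""
    # json.dumps keeps the payload valid JSON; ensure_ascii=False preserves "°C"
    payload = json.dumps(results, ensure_ascii=False)
    return {"role": "tool", "content": f"<tool_response> {payload} </tool_response>"}


# Hypothetical results keyed by the "output" names the model chose
results = {
    "random_city": "Groningen",
    "weather_forecast": {"location": "Groningen", "forecast": "sunny", "temperature": "25°C"},
}
tool_message = build_tool_message(results)
print(tool_message["content"])
```

With the notebook's `FunctionCaller`, the `results` dictionary would correspond to `function_caller.outputs` after the calls have run.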


In [18]:
messages

[{'role': 'system',
  'content': '\nYou are an AI assistant that can help the user with a variety of tasks. You have access to the following functions:\n\n<tools> [\n    {\n        "name": "get_weather_forecast",\n        "description": "Retrieves the weather forecast for a given location",\n        "parameters": {\n            "properties": [\n                {\n                    "name": "location",\n                    "type": "str"\n                }\n            ],\n            "required": [\n                "location"\n            ]\n        },\n        "returns": [\n            {\n                "name": "get_weather_forecast_output",\n                "type": "dict[str, str]"\n            }\n        ]\n    },\n    {\n        "name": "get_random_city",\n        "description": "Retrieves a random city from a list of cities",\n        "parameters": {\n            "properties": [],\n            "required": []\n        },\n        "returns": [\n            {\n                "name":

#### Run inference again with the tool response

In [19]:
response = model.create_chat_completion(messages=messages, temperature=0)
response

Llama.generate: prefix-match hit

llama_print_timings:        load time =    1707.16 ms
llama_print_timings:      sample time =      14.45 ms /    20 runs   (    0.72 ms per token,  1384.27 tokens per second)
llama_print_timings: prompt eval time =     775.81 ms /    86 tokens (    9.02 ms per token,   110.85 tokens per second)
llama_print_timings:        eval time =    1416.85 ms /    19 runs   (   74.57 ms per token,    13.41 tokens per second)
llama_print_timings:       total time =    2321.63 ms /   105 tokens


{'id': 'chatcmpl-25dfeae2-2184-497f-b838-e08565ad078c',
 'object': 'chat.completion',
 'created': 1717239671,
 'model': 'Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-F16.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': "The temperature in the random city of Groningen is currently 25°C and it's sunny."},
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 536, 'completion_tokens': 19, 'total_tokens': 555}}