# Function Calling

This notebook is a quick-start guide to function calling (tool use) with the SGLang chat completions API.

## Supported Models

Tool calling is currently supported for the following models:
  - Llama 3.2 models
  - Llama 3.1 models
  - Qwen 2.5 models
  - InternLM models

## Usage

### Launch a server

The code cell below is equivalent to running

`python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0`
in your terminal and waiting for the server to be ready. Once the server is running, you can send test requests with curl or the Python `requests` library. The server implements the OpenAI-compatible APIs.
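
As a rough illustration of what such a test request could look like, the sketch below builds a chat completion request using only the Python standard library. It assumes the server is reachable at `http://localhost:30000` and exposes the OpenAI-style `/v1/chat/completions` route; the helper names `build_chat_request` and `send_chat_request` are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # assumption: matches the --port used at launch


def build_chat_request(model, user_text, base_url=BASE_URL):
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=30):
    """Send the request and return the decoded JSON body (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Once the server reports that it is ready, calling `send_chat_request(build_chat_request(model_name, "Hello"))` should return the parsed JSON response.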

In [1]:
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)


server_process = execute_shell_command(
    """
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --host 0.0.0.0
"""
)

wait_for_server("http://localhost:30000")

[2025-01-23 19:37:56] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=736512671, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decod

[2025-01-23 19:38:14 TP0] Init torch distributed begin.


[2025-01-23 19:38:15 TP0] Load weight begin. avail mem=78.81 GB


[2025-01-23 19:38:16 TP0] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.33it/s]

[2025-01-23 19:38:19 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.72 GB
[2025-01-23 19:38:19 TP0] KV Cache is allocated. K size: 27.13 GB, V size: 27.13 GB.
[2025-01-23 19:38:19 TP0] Memory pool end. avail mem=8.34 GB


[2025-01-23 19:38:19 TP0] Capture cuda graph begin. This can take up to several minutes.
100%|██████████| 23/23 [00:06<00:00,  3.40it/s]
[2025-01-23 19:38:26 TP0] Capture cuda graph end. Time elapsed: 6.77 s


[2025-01-23 19:38:26 TP0] max_total_num_tokens=444500, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-01-23 19:38:27] INFO:     Started server process [892213]
[2025-01-23 19:38:27] INFO:     Waiting for application startup.
[2025-01-23 19:38:27] INFO:     Application startup complete.
[2025-01-23 19:38:27] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)


[2025-01-23 19:38:27] INFO:     127.0.0.1:46874 - "GET /v1/models HTTP/1.1" 200 OK


[2025-01-23 19:38:27] INFO:     127.0.0.1:46888 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-23 19:38:27 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-01-23 19:38:30] INFO:     127.0.0.1:46894 - "POST /generate HTTP/1.1" 200 OK
[2025-01-23 19:38:30] The server is fired up and ready to roll!


### Single Round Invocation

In [2]:
from openai import OpenAI

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]
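messages = [{"role": "user", "content": "What's the weather like in Boston today?"}]

# Point the OpenAI client at the local SGLang server. The server above was
# launched without an API key, so the value here is only a placeholder.
client = OpenAI(api_key="YOUR_API_KEY", base_url="http://localhost:30000/v1")
# Use whichever model the server is currently serving.
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=False,
    tools=tools,  # expose the function schema defined above to the model
)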
print(response)

[2025-01-23 19:38:33] INFO:     127.0.0.1:43180 - "GET /v1/models HTTP/1.1" 200 OK
[2025-01-23 19:38:33 TP0] Prefill batch. #new-seq: 1, #new-token: 219, #cached-token: 1, cache hit rate: 0.44%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-01-23 19:38:33] INFO:     127.0.0.1:43180 - "POST /v1/chat/completions HTTP/1.1" 200 OK
ChatCompletion(id='1bf90557a6ca4e73971be99caf04ae5a', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='0', function=Function(arguments='{"unit": "fahrenheit", "location": "Boston, MA"}', name='get_current_weather'), type='function')]), matched_stop=128008)], created=1737661113, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=28, prompt_tokens=220, total_tokens=248, completion_tokens_details=None, prompt_tokens_details=None))
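
The response above only contains the model's request to call `get_current_weather`; running the function is left to the client. The following is a minimal sketch of that step and is not part of SGLang itself: the `get_current_weather` body is a stub returning made-up data, and `run_tool_calls` and `AVAILABLE_TOOLS` are hypothetical names. It reads `tool_calls` from an assistant message shaped like the one printed above and produces `role: "tool"` messages in the OpenAI chat format.

```python
import json
from types import SimpleNamespace


def get_current_weather(location, unit="fahrenheit"):
    """Stub implementation; a real version would query a weather service."""
    return {"location": location, "unit": unit, "forecast": "unknown (stub)"}


# Hypothetical registry mapping tool names to local Python callables.
AVAILABLE_TOOLS = {"get_current_weather": get_current_weather}


def run_tool_calls(message):
    """Execute each tool call on an assistant message and return tool messages."""
    tool_messages = []
    for call in message.tool_calls or []:
        func = AVAILABLE_TOOLS.get(call.function.name)
        if func is None:
            result = {"error": f"unknown tool: {call.function.name}"}
        else:
            # The arguments arrive as a JSON-encoded string, not a dict.
            kwargs = json.loads(call.function.arguments)
            result = func(**kwargs)
        tool_messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
        )
    return tool_messages


# Stand-in for response.choices[0].message, mirroring the output shown above.
example_message = SimpleNamespace(
    tool_calls=[
        SimpleNamespace(
            id="0",
            function=SimpleNamespace(
                name="get_current_weather",
                arguments='{"unit": "fahrenheit", "location": "Boston, MA"}',
            ),
        )
    ]
)
tool_messages = run_tool_calls(example_message)
print(tool_messages)
```

With a live server, passing `response.choices[0].message` to `run_tool_calls` and appending the returned messages to `messages` would be one way to prepare a follow-up request.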

In [3]:
terminate_process(server_process)

## How to support a new model?

To add support for another model:
 1. Add the tool tag used by the model to `TOOLS_TAG_LIST` in `sglang/srt/utils.py`.
 2. Extend the `parse_tool_response` function in `sglang/srt/utils.py` to convert the model's output into tool calls.
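
To make the two steps more concrete, here is a standalone sketch of the kind of parsing they involve. It is not SGLang's actual implementation: `EXAMPLE_TAGS`, `parse_tagged_tool_call`, and the assumed output format (a tag followed by a JSON object with a `name` key and a `parameters` or `arguments` key) are illustrative assumptions, and a real model's format should be taken from its own documentation.

```python
import json

# Hypothetical tag list; the real TOOLS_TAG_LIST lives in sglang/srt/utils.py.
EXAMPLE_TAGS = ["<|python_tag|>", "<tool_call>"]


def parse_tagged_tool_call(text, tags=EXAMPLE_TAGS):
    """Return (name, arguments_json) if text holds a tagged tool call, else None."""
    for tag in tags:
        start = text.find(tag)
        if start == -1:
            continue
        body = text[start + len(tag):].strip()
        try:
            # raw_decode parses the first JSON value and ignores trailing text,
            # such as a closing tag.
            obj, _ = json.JSONDecoder().raw_decode(body)
        except json.JSONDecodeError:
            return None
        if not isinstance(obj, dict) or "name" not in obj:
            return None
        args = obj.get("parameters", obj.get("arguments", {}))
        return obj["name"], json.dumps(args)
    return None
```

The function returns the arguments as a JSON string because that is how the `arguments` field appears in the chat completion output shown earlier.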
