# SGLang – Tool Calling Feature Documentation

This guide demonstrates how to use SGLang’s **tool calling** functionality, from starting the server to invoking custom tool functions. The main steps are:

1.  Launching the server
2.  Defining tools
3.  Defining messages
4.  Initializing the client
5.  Sending requests (non-streaming and streaming)
6.  Handling tool calls from the server
7.  Simulating and executing tool functions

> **Note**: This example uses a `get_current_weather` function to fetch weather information, serving as a demonstration. You can replace or add any tool function depending on your use case.

## Launching the Server
First, you need to launch the SGLang server so it can handle incoming requests. The server is started with the `sglang.launch_server` command.

**Example command:**

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tool-call-parser llama3 \
  --port 30000 \
  --host 0.0.0.0
```

-   **`--model-path`**: Specifies the path of the model to be used, e.g., `meta-llama/Meta-Llama-3.1-8B-Instruct`.
-   **`--tool-call-parser`**: Selects the parser that extracts tool calls from the model's output. Currently supported parsers include:
    -   llama3: Llama 3.1 / 3.2 (e.g. `meta-llama/Llama-3.1-8B-Instruct`, `meta-llama/Llama-3.2-1B-Instruct`)
    -   mistral: Mistral (e.g. `mistralai/Mistral-7B-Instruct-v0.3`, `mistralai/Mistral-Nemo-Instruct-2407`, `mistralai/Mistral-7B-v0.3`)
    -   qwen25: Qwen 2.5 (e.g. `Qwen/Qwen2.5-1.5B-Instruct`, `Qwen/Qwen2.5-7B-Instruct`)
-   **`--port`**: Sets the port number for the server (e.g., 30000).
-   **`--host`**: Sets the hostname or IP address (using `0.0.0.0` allows connections from any network).
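
As a rough illustration, the flags above can be assembled in Python before the command is handed to a shell. The helper names `pick_tool_call_parser` and `build_launch_command` are hypothetical (they are not part of SGLang), and the substring matching is a simplifying assumption based only on the model families listed above.

```python
import shlex

# Hypothetical mapping from a substring of the model path to a parser name.
# It is derived only from the parser list above; extend it for other families.
PARSER_HINTS = [
    ("llama-3", "llama3"),
    ("mistral", "mistral"),
    ("qwen2.5", "qwen25"),
]


def pick_tool_call_parser(model_path: str) -> str:
    """Guess a --tool-call-parser value from the model path (illustrative only)."""
    lowered = model_path.lower()
    for hint, parser in PARSER_HINTS:
        if hint in lowered:
            return parser
    raise ValueError(f"No known tool-call parser for model: {model_path}")


def build_launch_command(model_path: str, port: int = 30000, host: str = "0.0.0.0") -> str:
    """Assemble the launch command shown above as one shell-safe string."""
    args = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tool-call-parser", pick_tool_call_parser(model_path),
        "--port", str(port),
        "--host", host,
    ]
    return shlex.join(args)


print(build_launch_command("meta-llama/Meta-Llama-3.1-8B-Instruct", port=30222))
```

Building the command from a list and joining it with `shlex.join` keeps quoting correct if a model path ever contains spaces or shell metacharacters.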

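The following cells launch the server from Python and wait until it is ready. Note that they use port 30222 rather than the port 30000 shown in the example command above.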

In [None]:
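import json
import os

from openai import OpenAI

# Pin the server to one GPU. The index "7" is specific to the original
# machine; change it to a GPU that exists on yours.
os.environ["CUDA_VISIBLE_DEVICES"] = "7"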

In [None]:
from sglang.utils import execute_shell_command, wait_for_server, terminate_process

server_process = execute_shell_command(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --tool-call-parser llama3 --port 30222 --host 0.0.0.0"  # llama3
)
wait_for_server("http://localhost:30222")

[2025-01-08 05:42:08] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, revision=None, return_token_ids=False, host='0.0.0.0', port=30222, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=944500726, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_

Once the server is running, you’ll be ready to interact with it using a client, as described in the subsequent steps.

## Define Tools for Function Calling
This example introduces a `get_current_weather` tool to fetch weather information. You can define any tool in a similar structure, according to your requirements.

Below is a Python snippet that shows how to define a tool as a dictionary. The dictionary includes:

-   **Tool name**
-   **Description**
-   **Parameters** (with property definitions)

In [None]:
# Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]
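
Because the model generates the arguments, it can help to check them against the tool's schema before calling anything. The following is a minimal sketch, not a full JSON Schema validator: `validate_arguments` is a hypothetical helper that only checks `required`, unknown keys, and `enum`, and `weather_tool` repeats the definition above so the block runs on its own.

```python
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}


def validate_arguments(tool: dict, arguments: dict) -> list:
    """Return a list of problems found when checking arguments against a tool schema."""
    schema = tool["function"]["parameters"]
    properties = schema.get("properties", {})
    problems = []
    # Every name listed under "required" must be present.
    for name in schema.get("required", []):
        if name not in arguments:
            problems.append(f"missing required argument: {name}")
    # Every supplied argument must be declared and respect its enum, if any.
    for name, value in arguments.items():
        if name not in properties:
            problems.append(f"unexpected argument: {name}")
            continue
        allowed = properties[name].get("enum")
        if allowed is not None and value not in allowed:
            problems.append(f"invalid value for {name}: {value!r}")
    return problems


print(validate_arguments(weather_tool, {"location": "Boston, MA", "unit": "fahrenheit"}))
print(validate_arguments(weather_tool, {"unit": "kelvin"}))
```

An empty list means the arguments passed these basic checks; anything else can be reported back to the model or logged instead of calling the tool.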

## Define Messages

Set up an initial conversation history to simulate how a user interacts with the assistant. This helps the model understand the context:

In [None]:
messages = [
    {
        "role": "user",
        "content": "What's the weather like in Boston today? Please respond with the format: Today's weather is :{function call result}",
    }
]

## Initialize the Client

Use an OpenAI-like client to communicate with the SGLang server. Replace `YOUR_API_KEY` with the appropriate key, or leave it as a dummy value if you’re testing locally.

In [None]:
# Initialize an OpenAI-compatible client pointed at the local SGLang server
client = OpenAI(api_key="YOUR_API_KEY", base_url="http://localhost:30222/v1")
model_name = client.models.list().data[0].id

[2025-01-08 05:42:36] INFO:     127.0.0.1:52192 - "GET /v1/models HTTP/1.1" 200 OK


## Make Non-Streaming Request
By default, requests are non-streaming: the server returns a single, complete response.

In [None]:
# Non-streaming mode test
response_non_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=False,  # Non-streaming
    tools=tools,
)
print("Non-stream response:")
print(response_non_stream)

[2025-01-08 05:42:36 TP0] Prefill batch. #new-seq: 1, #new-token: 235, #cached-token: 1, cache hit rate: 0.41%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-08 05:42:36] INFO:     127.0.0.1:52192 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Non-stream response:
ChatCompletion(id='6606168ebd2b4a8da599ba84577b196f', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='0', function=Function(arguments='{"location": "Boston, MA", "unit": "fahrenheit"}', name='get_current_weather'), type='function')]), matched_stop=128008)], created=1736314956, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=28, prompt_tokens=236, total_tokens=264, completion_tokens_details=None, prompt_tokens_details=None))
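
In the response above, the tool call appears under `choices[0].message.tool_calls`, with the arguments encoded as a JSON string. A small sketch of pulling them out follows. `extract_tool_calls` is a hypothetical helper, and `fake_response` is a stand-in built with `SimpleNamespace` that mirrors only the attributes read here, so the block runs without a server; with a live server you would pass `response_non_stream` instead.

```python
import json
from types import SimpleNamespace


def extract_tool_calls(response):
    """Return (name, arguments) pairs from the first choice of a chat completion."""
    message = response.choices[0].message
    calls = []
    # tool_calls is None when the model answered in plain text.
    for tool_call in message.tool_calls or []:
        arguments = json.loads(tool_call.function.arguments)
        calls.append((tool_call.function.name, arguments))
    return calls


# Stand-in object shaped like the response printed above (illustration only).
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(
                content=None,
                tool_calls=[
                    SimpleNamespace(
                        id="0",
                        type="function",
                        function=SimpleNamespace(
                            name="get_current_weather",
                            arguments='{"location": "Boston, MA", "unit": "fahrenheit"}',
                        ),
                    )
                ],
            )
        )
    ]
)

print(extract_tool_calls(fake_response))
```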


## Make Streaming Request
If you prefer receiving real-time responses or incremental updates, you can use streaming mode.

In [None]:
# Streaming mode test
print("Streaming response:")
response_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.8,
    top_p=0.8,
    stream=True,  # Enable streaming
    tools=tools,
)

chunks = []
for chunk in response_stream:
    chunks.append(chunk)
    print(chunk)  # Observe streaming output

Streaming response:
[2025-01-08 05:42:36] INFO:     127.0.0.1:52192 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-01-08 05:42:36 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 235, cache hit rate: 49.27%, token usage: 0.00, #running-req: 0, #queue-req: 0
ChatCompletionChunk(id='3494adc4dafe4c8ea00d1bf87ec12d83', choices=[Choice(delta=ChoiceDelta(content='', function_call=None, refusal=None, role='assistant', tool_calls=None), finish_reason='', index=0, logprobs=None, matched_stop=None)], created=1736314956, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', service_tier=None, system_fingerprint=None, usage=None)
starting on new tool %d 0
[2025-01-08 05:42:36 TP0] Decode batch. #running-req: 1, #token: 241, token usage: 0.00, gen throughput (token/s): 6.94, #queue-req: 0
ChatCompletionChunk(id='3494adc4dafe4c8ea00d1bf87ec12d83', choices=[Choice(delta=ChoiceDelta(content=None, function_call=None, refusal=None, role='assistant', tool_calls


The `chunks` list accumulates all response segments, which might be needed later to handle tool calls.

----------

## Handle Tool Calls

When the model decides to call a tool, the server returns the function name and fragments of the arguments through the streaming response. Concatenate the fragments to recover the full JSON arguments, then invoke the tool accordingly.

In [None]:
# Parse and combine function call arguments
arguments = []
for chunk in chunks:
    choice = chunk.choices[0]
    delta = choice.delta
    if delta.tool_calls:
        tool_call = delta.tool_calls[0]
        if tool_call.function.name:
            print(f"Streamed function call name: {tool_call.function.name}")

        if tool_call.function.arguments:
            arguments.append(tool_call.function.arguments)
            print(f"Streamed function call arguments: {tool_call.function.arguments}")

# Combine all fragments into a single JSON string
full_arguments = "".join(arguments)
print(f"Final streamed function call arguments: {full_arguments}")

Streamed function call name: get_current_weather
Streamed function call arguments: {"unit": "
Streamed function call arguments: f
Streamed function call arguments: ahrenheit"
Streamed function call arguments: , "location": "
Streamed function call arguments: Boston
Streamed function call arguments: ,
Streamed function call arguments:  MA"}
Final streamed function call arguments: {"unit": "fahrenheit", "location": "Boston, MA"}


## Simulate a Tool Function

Next, define the actual function that implements the tool’s logic. In this example, `get_current_weather` simply returns a hard-coded weather description for whatever location it is given.

In [None]:
# Define the actual function for getting current weather
# `unit` has a default because the tool schema marks only `location` as required.
def get_current_weather(location: str, unit: str = "fahrenheit"):
    # Here you can integrate an actual weather API
    return f"The weather in {location} is 85 degrees {unit}. It is partly cloudy, with highs in the 90's."


# Registry that maps tool names to the Python functions implementing them
available_tools = {"get_current_weather": get_current_weather}
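
A name-to-function registry like this makes it straightforward to wrap the lookup, JSON decoding, and call in one place. The following is a sketch under stated assumptions: `run_tool_call` is a hypothetical helper, and `demo_weather` is a stand-in tool defined here so the block runs on its own.

```python
import json


def run_tool_call(name, raw_arguments, registry):
    """Look up `name` in `registry`, decode the JSON arguments, and call the tool.

    Always returns a string so the outcome can be sent back to the model.
    """
    tool = registry.get(name)
    if tool is None:
        return f"Error: unknown tool {name!r}"
    try:
        arguments = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return f"Error: could not decode arguments ({exc.msg})"
    if not isinstance(arguments, dict):
        return "Error: arguments must be a JSON object"
    try:
        return str(tool(**arguments))
    except TypeError as exc:
        # Typically raised when a required argument is missing or an
        # unexpected keyword is passed.
        return f"Error: bad arguments for {name}: {exc}"


def demo_weather(location, unit="fahrenheit"):
    """Stand-in tool for this sketch; returns a hard-coded description."""
    return f"The weather in {location} is 85 degrees {unit}."


registry = {"get_current_weather": demo_weather}
print(run_tool_call("get_current_weather", '{"location": "Boston, MA"}', registry))
print(run_tool_call("get_stock_price", "{}", registry))
```

Returning error strings rather than raising lets the calling loop pass the failure back to the model as a tool result, which is one possible design choice among several.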


## Execute the Tool

Finally, parse the JSON arguments and call the relevant tool if it exists:

In [None]:
# Parse JSON arguments
try:
    call_data = json.loads(full_arguments)
except json.JSONDecodeError as e:
    print(f"JSON decoding error: {e}")
    call_data = {}

# Record the tool call in the message list; the tool result is appended below.
# Note: in the OpenAI chat format this entry would use the "assistant" role.
messages.append(
    {
        "role": "user",
        "content": "",
        "tool_calls": {"name": "get_current_weather", "arguments": full_arguments},
    }
)

# Call the corresponding tool function
if "tool_calls" in messages[-1] and "name" in messages[-1]["tool_calls"]:
    tool_name = messages[-1]["tool_calls"]["name"]
    if tool_name in available_tools:
        tool_to_call = available_tools[tool_name]
        result = tool_to_call(**call_data)
        print(f"Function call result: {result}")
        messages.append({"role": "tool", "content": result, "name": tool_name})
    else:
        print(f"Unknown tool name: {tool_name}")
else:
    print("Function call name not found.")

print(f"Updated message history: {messages}")

Function call result: The weather in Boston, MA is 85 degrees fahrenheit. It is partly cloudy, with highs in the 90's.
Updated message history: [{'role': 'user', 'content': "What's the weather like in Boston today? Please respond with the format: Today's weather is :{function call result}"}, {'role': 'user', 'content': '', 'tool_calls': {'name': 'get_current_weather', 'arguments': '{"unit": "fahrenheit", "location": "Boston, MA"}'}}, {'role': 'tool', 'content': "The weather in Boston, MA is 85 degrees fahrenheit. It is partly cloudy, with highs in the 90's.", 'name': 'get_current_weather'}]
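
For comparison, the OpenAI chat format usually records a tool call as an `assistant` message holding a list of `tool_calls`, followed by a `tool` message that refers back to the call through `tool_call_id`. The sketch below builds that shape; `tool_call_messages` is a hypothetical helper, and whether a given SGLang version and chat template accept this exact shape in a follow-up request is an assumption to verify.

```python
import json


def tool_call_messages(call_id, name, raw_arguments, result):
    """Build the assistant and tool messages for one completed tool call."""
    assistant_message = {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": call_id,
                "type": "function",
                # Arguments stay a JSON string, as returned by the server.
                "function": {"name": name, "arguments": raw_arguments},
            }
        ],
    }
    tool_message = {
        "role": "tool",
        "tool_call_id": call_id,  # links the result to the call above
        "content": result,
    }
    return [assistant_message, tool_message]


history = [{"role": "user", "content": "What's the weather like in Boston today?"}]
history += tool_call_messages(
    "0",
    "get_current_weather",
    '{"unit": "fahrenheit", "location": "Boston, MA"}',
    "The weather in Boston, MA is 85 degrees fahrenheit.",
)
print(json.dumps(history, indent=2))
```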


## Terminate Server
Once you are done, you can terminate the server process using the following code.

In [None]:
terminate_process(server_process)