# SGLang Frontend Language

The SGLang frontend language lets you define prompts in a convenient, structured way.

## Launch A Server

Launch the server in your terminal and wait for it to initialize.

In [1]:
import requests
import os

from sglang import assistant_begin, assistant_end
from sglang import assistant, function, gen, system, user
from sglang import image
from sglang import RuntimeEndpoint, set_default_backend
from sglang.srt.utils import load_image
from sglang.test.test_utils import is_in_ci
from sglang.utils import print_highlight, terminate_process, wait_for_server

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd


server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")

[2025-07-09 09:13:38] server_args=ServerArgs(model_path='Qwen/Qwen2.5-7B-Instruct', tokenizer_path='Qwen/Qwen2.5-7B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, skip_server_warmup=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='Qwen/Qwen2.5-7B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, hybrid_kvcache_ratio=None, impl='auto', host='0.0.0.0', port=37455, nccl_port=None, mem_fraction_static=0.874, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=1004651207, constrained_json_whitespace_patter

[2025-07-09 09:13:49] Attention backend not set. Use fa3 backend by default.
[2025-07-09 09:13:49] Init torch distributed begin.


[2025-07-09 09:13:49] Init torch distributed ends. mem usage=0.00 GB


[2025-07-09 09:13:50] Load weight begin. avail mem=53.50 GB


[2025-07-09 09:13:51] The weight of LmHead is not packed
[2025-07-09 09:13:51] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  1.66it/s]


Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.60it/s]


Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  1.61it/s]


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.66it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.64it/s]

[2025-07-09 09:13:53] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=39.21 GB, mem usage=14.30 GB.
[2025-07-09 09:13:53] KV Cache is allocated. #tokens: 20480, K size: 0.55 GB, V size: 0.55 GB
[2025-07-09 09:13:53] Memory pool end. avail mem=37.91 GB


[2025-07-09 09:13:54] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=32768, available_gpu_mem=37.82 GB


[2025-07-09 09:13:55] INFO:     Started server process [1302628]
[2025-07-09 09:13:55] INFO:     Waiting for application startup.
[2025-07-09 09:13:55] INFO:     Application startup complete.
[2025-07-09 09:13:55] INFO:     Uvicorn running on http://0.0.0.0:37455 (Press CTRL+C to quit)


[2025-07-09 09:13:56] INFO:     127.0.0.1:38860 - "GET /v1/models HTTP/1.1" 200 OK
[2025-07-09 09:13:56] INFO:     127.0.0.1:38870 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-07-09 09:13:56] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, #token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-09T09:13:56.202325


[2025-07-09 09:13:57] INFO:     127.0.0.1:38884 - "POST /generate HTTP/1.1" 200 OK
[2025-07-09 09:13:57] The server is fired up and ready to roll!


Server started on http://localhost:37455


Set the default backend. Note: besides the local server, you may also use `OpenAI` or other API endpoints.

In [2]:
set_default_backend(RuntimeEndpoint(f"http://localhost:{port}"))

[2025-07-09 09:14:01] INFO:     127.0.0.1:38886 - "GET /get_model_info HTTP/1.1" 200 OK


## Basic Usage

The simplest way to use the SGLang frontend language is a question-answer dialog between a user and an assistant.

In [3]:
@function
def basic_qa(s, question):
    s += system("You are a helpful assistant that can answer questions.")
    s += user(question)
    s += assistant(gen("answer", max_tokens=512))

In [4]:
state = basic_qa("List 3 countries and their capitals.")
print_highlight(state["answer"])

[2025-07-09 09:14:01] Prefill batch. #new-seq: 1, #new-token: 31, #cached-token: 0, #token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-09T09:14:01.184396


[2025-07-09 09:14:01] INFO:     127.0.0.1:38890 - "POST /generate HTTP/1.1" 200 OK


## Multi-turn Dialog

The SGLang frontend language can also be used to define multi-turn dialogs.

In [5]:
@function
def multi_turn_qa(s):
    s += system("You are a helpful assistant that can answer questions.")
    s += user("Please give me a list of 3 countries and their capitals.")
    s += assistant(gen("first_answer", max_tokens=512))
    s += user("Please give me another list of 3 countries and their capitals.")
    s += assistant(gen("second_answer", max_tokens=512))
    return s


state = multi_turn_qa()
print_highlight(state["first_answer"])
print_highlight(state["second_answer"])

[2025-07-09 09:14:01] Prefill batch. #new-seq: 1, #new-token: 18, #cached-token: 18, #token: 18, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-09T09:14:01.484138
[2025-07-09 09:14:01] Decode batch. #running-req: 1, #token: 42, token usage: 0.00, cuda graph: False, gen throughput (token/s): 5.62, #queue-req: 0, timestamp: 2025-07-09T09:14:01.557380


[2025-07-09 09:14:01] INFO:     127.0.0.1:38904 - "POST /generate HTTP/1.1" 200 OK


[2025-07-09 09:14:01] Prefill batch. #new-seq: 1, #new-token: 23, #cached-token: 70, #token: 70, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-09T09:14:01.813255
[2025-07-09 09:14:01] Decode batch. #running-req: 1, #token: 104, token usage: 0.01, cuda graph: False, gen throughput (token/s): 109.44, #queue-req: 0, timestamp: 2025-07-09T09:14:01.922896


[2025-07-09 09:14:02] INFO:     127.0.0.1:38918 - "POST /generate HTTP/1.1" 200 OK


## Control Flow

You may use any Python code within the function to define more complex control flows.

In [6]:
@function
def tool_use(s, question):
    s += assistant(
        "To answer this question: "
        + question
        + ". I need to use a "
        + gen("tool", choices=["calculator", "search engine"])
        + ". "
    )

    if s["tool"] == "calculator":
        s += assistant("The math expression is: " + gen("expression"))
    elif s["tool"] == "search engine":
        s += assistant("The key word to search is: " + gen("word"))


state = tool_use("What is 2 * 2?")
print_highlight(state["tool"])
print_highlight(state["expression"])

[2025-07-09 09:14:02] Prefill batch. #new-seq: 1, #new-token: 25, #cached-token: 8, #token: 8, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-09T09:14:02.182224
[2025-07-09 09:14:02] INFO:     127.0.0.1:38920 - "POST /generate HTTP/1.1" 200 OK
[2025-07-09 09:14:02] Prefill batch. #new-seq: 1, #new-token: 2, #cached-token: 31, #token: 31, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-09T09:14:02.208576
[2025-07-09 09:14:02] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 31, #token: 33, token usage: 0.00, #running-req: 1, #queue-req: 0, timestamp: 2025-07-09T09:14:02.209746
[2025-07-09 09:14:02] INFO:     127.0.0.1:38928 - "POST /generate HTTP/1.1" 200 OK


[2025-07-09 09:14:02] Prefill batch. #new-seq: 1, #new-token: 13, #cached-token: 33, #token: 33, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-09T09:14:02.247274
[2025-07-09 09:14:02] Decode batch. #running-req: 1, #token: 56, token usage: 0.00, cuda graph: False, gen throughput (token/s): 91.96, #queue-req: 0, timestamp: 2025-07-09T09:14:02.357864


[2025-07-09 09:14:02] Decode batch. #running-req: 1, #token: 96, token usage: 0.00, cuda graph: False, gen throughput (token/s): 110.57, #queue-req: 0, timestamp: 2025-07-09T09:14:02.719633


[2025-07-09 09:14:03] INFO:     127.0.0.1:38938 - "POST /generate HTTP/1.1" 200 OK
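
Because the branch taken decides which variable is generated, code that reads the result may want to pick the variable name from the chosen tool rather than hard-coding it. The following is a minimal sketch of that idea; `RESULT_VAR`, `read_tool_result`, and the plain dict `fake_state` are hypothetical names used as a stand-in for the real program state, so the snippet runs without a server.

```python
# Maps each tool choice to the variable name filled in by the matching branch.
# (Hypothetical helper, not part of SGLang.)
RESULT_VAR = {"calculator": "expression", "search engine": "word"}


def read_tool_result(state) -> tuple[str, str]:
    """Return the chosen tool and the text generated in its branch."""
    tool = state["tool"]
    return tool, state[RESULT_VAR[tool]]


# A plain dict stands in for the state returned by tool_use(...).
fake_state = {"tool": "search engine", "word": "capital of France"}
print(read_tool_result(fake_state))
```

With a real state object, the same lookup would presumably replace the fixed `state["expression"]` access whenever the model could select either tool.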


## Parallelism

Use `fork` to launch parallel prompts. Because `gen` is non-blocking, the for loop below issues two generation calls in parallel.

In [7]:
@function
def tip_suggestion(s):
    s += assistant(
        "Here are two tips for staying healthy: "
        "1. Balanced Diet. 2. Regular Exercise.\n\n"
    )

    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += assistant(
            f"Now, expand tip {i+1} into a paragraph:\n"
            + gen("detailed_tip", max_tokens=256, stop="\n\n")
        )

    s += assistant("Tip 1:" + forks[0]["detailed_tip"] + "\n")
    s += assistant("Tip 2:" + forks[1]["detailed_tip"] + "\n")
    s += assistant(
        "To summarize the above two tips, I can say:\n" + gen("summary", max_tokens=512)
    )


state = tip_suggestion()
print_highlight(state["summary"])

[2025-07-09 09:14:03] Prefill batch. #new-seq: 1, #new-token: 35, #cached-token: 14, #token: 14, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-09T09:14:03.078167
[2025-07-09 09:14:03] Prefill batch. #new-seq: 1, #new-token: 35, #cached-token: 14, #token: 49, token usage: 0.00, #running-req: 1, #queue-req: 0, timestamp: 2025-07-09T09:14:03.079400
[2025-07-09 09:14:03] Decode batch. #running-req: 2, #token: 58, token usage: 0.00, cuda graph: False, gen throughput (token/s): 84.50, #queue-req: 0, timestamp: 2025-07-09T09:14:03.204845


[2025-07-09 09:14:03] Decode batch. #running-req: 2, #token: 138, token usage: 0.01, cuda graph: False, gen throughput (token/s): 215.07, #queue-req: 0, timestamp: 2025-07-09T09:14:03.576825


[2025-07-09 09:14:03] Decode batch. #running-req: 2, #token: 218, token usage: 0.01, cuda graph: False, gen throughput (token/s): 216.65, #queue-req: 0, timestamp: 2025-07-09T09:14:03.946076


[2025-07-09 09:14:04] Decode batch. #running-req: 2, #token: 298, token usage: 0.01, cuda graph: False, gen throughput (token/s): 210.58, #queue-req: 0, timestamp: 2025-07-09T09:14:04.325972


[2025-07-09 09:14:04] Decode batch. #running-req: 2, #token: 378, token usage: 0.02, cuda graph: False, gen throughput (token/s): 215.54, #queue-req: 0, timestamp: 2025-07-09T09:14:04.697136
[2025-07-09 09:14:04] INFO:     127.0.0.1:35902 - "POST /generate HTTP/1.1" 200 OK
[2025-07-09 09:14:04] INFO:     127.0.0.1:35914 - "POST /generate HTTP/1.1" 200 OK
[2025-07-09 09:14:04] Prefill batch. #new-seq: 1, #new-token: 372, #cached-token: 39, #token: 39, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-09T09:14:04.823317


[2025-07-09 09:14:05] Decode batch. #running-req: 1, #token: 438, token usage: 0.02, cuda graph: False, gen throughput (token/s): 123.88, #queue-req: 0, timestamp: 2025-07-09T09:14:05.092673


[2025-07-09 09:14:05] Decode batch. #running-req: 1, #token: 478, token usage: 0.02, cuda graph: False, gen throughput (token/s): 113.43, #queue-req: 0, timestamp: 2025-07-09T09:14:05.445311


[2025-07-09 09:14:05] Decode batch. #running-req: 1, #token: 518, token usage: 0.03, cuda graph: False, gen throughput (token/s): 110.79, #queue-req: 0, timestamp: 2025-07-09T09:14:05.806349


[2025-07-09 09:14:06] Decode batch. #running-req: 1, #token: 558, token usage: 0.03, cuda graph: False, gen throughput (token/s): 111.39, #queue-req: 0, timestamp: 2025-07-09T09:14:06.165445
[2025-07-09 09:14:06] INFO:     127.0.0.1:35916 - "POST /generate HTTP/1.1" 200 OK
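
The non-blocking pattern can be illustrated with an analogy from the Python standard library. The sketch below uses `concurrent.futures`, not SGLang's own mechanism: submitting work returns immediately, and the caller only waits when it reads the result, much like reading `forks[0]["detailed_tip"]`. The function `fake_generate` is a hypothetical stand-in for a generation call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a slow generation call."""
    time.sleep(0.2)
    return f"expanded({prompt})"


prompts = ["tip 1", "tip 2"]
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns a future right away, so both calls run concurrently.
    futures = [pool.submit(fake_generate, p) for p in prompts]
    # result() blocks until the corresponding call has finished.
    results = [f.result() for f in futures]
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")
```

The elapsed time should be close to one call's duration rather than the sum of both, which is the benefit the forked prompts aim for.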


## Constrained Decoding

Use `regex` to specify a regular expression as a decoding constraint. This is only supported for local models.

In [8]:
@function
def regular_expression_gen(s):
    s += user("What is the IP address of the Google DNS servers?")
    s += assistant(
        gen(
            "answer",
            temperature=0,
            regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
        )
    )


state = regular_expression_gen()
print_highlight(state["answer"])

[2025-07-09 09:14:06] Prefill batch. #new-seq: 1, #new-token: 18, #cached-token: 12, #token: 12, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-09T09:14:06.258314


[2025-07-09 09:14:07] INFO:     127.0.0.1:35930 - "POST /generate HTTP/1.1" 200 OK
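
A pattern intended for constrained decoding can be checked offline with Python's `re` module before it is sent to the server. The sketch below is an illustration under that assumption; `OCTET`, `IP_REGEX`, and `is_ipv4` are names introduced here. It also shows why the separator should be written as `\.`: an unescaped `.` matches any character, so strings that are not dotted addresses would be accepted.

```python
import re

# One octet in the range 0-255, then the full dotted-quad pattern.
OCTET = r"(25[0-5]|2[0-4]\d|[01]?\d\d?)"
IP_REGEX = rf"({OCTET}\.){{3}}{OCTET}"
# Variant with an unescaped dot, kept only to demonstrate the difference.
LOOSE_REGEX = rf"({OCTET}.){{3}}{OCTET}"


def is_ipv4(text: str, pattern: str = IP_REGEX) -> bool:
    """Return True if the whole string matches the given pattern."""
    return re.fullmatch(pattern, text) is not None


print(is_ipv4("8.8.8.8"))               # True
print(is_ipv4("256.1.1.1"))             # False
print(is_ipv4("8a8b8c8"))               # False
print(is_ipv4("8a8b8c8", LOOSE_REGEX))  # True
```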


Use `regex` to define a `JSON` decoding schema.

In [9]:
character_regex = (
    r"""\{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r"""    "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
    + r"""    "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
    + r"""    "wand": \{\n"""
    + r"""        "wood": "[\w\d\s]{1,16}",\n"""
    + r"""        "core": "[\w\d\s]{1,16}",\n"""
    + r"""        "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
    + r"""    \},\n"""
    + r"""    "alive": "(Alive|Deceased)",\n"""
    + r"""    "patronus": "[\w\d\s]{1,16}",\n"""
    + r"""    "boggart": "[\w\d\s]{1,16}"\n"""
    + r"""\}"""
)


@function
def character_gen(s, name):
    s += user(
        f"{name} is a character in Harry Potter. Please fill in the following information about this character."
    )
    s += assistant(gen("json_output", max_tokens=256, regex=character_regex))


state = character_gen("Harry Potter")
print_highlight(state["json_output"])

[2025-07-09 09:14:08] Prefill batch. #new-seq: 1, #new-token: 24, #cached-token: 14, #token: 14, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-09T09:14:08.337960


[2025-07-09 09:14:08] Decode batch. #running-req: 1, #token: 60, token usage: 0.00, cuda graph: False, gen throughput (token/s): 16.70, #queue-req: 0, timestamp: 2025-07-09T09:14:08.560303


[2025-07-09 09:14:08] Decode batch. #running-req: 1, #token: 100, token usage: 0.00, cuda graph: False, gen throughput (token/s): 109.99, #queue-req: 0, timestamp: 2025-07-09T09:14:08.923984


[2025-07-09 09:14:09] Decode batch. #running-req: 1, #token: 140, token usage: 0.01, cuda graph: False, gen throughput (token/s): 98.55, #queue-req: 0, timestamp: 2025-07-09T09:14:09.329843


[2025-07-09 09:14:09] INFO:     127.0.0.1:35944 - "POST /generate HTTP/1.1" 200 OK
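
A regex that matches a string does not guarantee that the string is valid JSON, so it can help to test the pattern against sample outputs with `re` and `json`. The sketch below uses a trimmed-down pattern in the same style (`mini_regex` and `make_sample` are hypothetical names, and the field set is reduced for brevity). It highlights one edge case: a number pattern such as `[0-9]{1,2}\.[0-9]{0,2}` permits zero digits after the decimal point, which Python's `json` parser rejects.

```python
import json
import re

# Trimmed-down pattern in the same style as the character regex.
mini_regex = (
    r"""\{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r"""    "wand": \{\n"""
    + r"""        "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
    + r"""    \}\n"""
    + r"""\}"""
)


def make_sample(length: str) -> str:
    """Build a sample output with the given literal text for the length field."""
    return (
        "{\n"
        '    "name": "Harry Potter",\n'
        '    "house": "Gryffindor",\n'
        '    "wand": {\n'
        f'        "length": {length}\n'
        "    }\n"
        "}"
    )


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


good = make_sample("11.0")
edge = make_sample("11.")  # no digits after the decimal point

for sample in (good, edge):
    matches = re.fullmatch(mini_regex, sample) is not None
    print(f"matches regex: {matches}, valid JSON: {is_valid_json(sample)}")
```

If the output must always parse, requiring at least one fractional digit (for example `[0-9]{1,2}`) in the pattern is one way to close that gap.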


## Batching

Use `run_batch` to run a batch of prompts.

In [10]:
@function
def text_qa(s, question):
    s += user(question)
    s += assistant(gen("answer", stop="\n"))


states = text_qa.run_batch(
    [
        {"question": "What is the capital of the United Kingdom?"},
        {"question": "What is the capital of France?"},
        {"question": "What is the capital of Japan?"},
    ],
    progress_bar=True,
)

for i, state in enumerate(states):
    print_highlight(f"Answer {i+1}: {state['answer']}")

[2025-07-09 09:14:09] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 13, #token: 13, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-09T09:14:09.645958
[2025-07-09 09:14:09] INFO:     127.0.0.1:35956 - "POST /generate HTTP/1.1" 200 OK


  0%|          | 0/3 [00:00<?, ?it/s]

 33%|███▎      | 1/3 [00:00<00:00,  8.35it/s]

 67%|██████▋   | 2/3 [00:00<00:00, 13.23it/s]

100%|██████████| 3/3 [00:00<00:00, 21.89it/s]

[2025-07-09 09:14:09] Prefill batch. #new-seq: 1, #new-token: 11, #cached-token: 17, #token: 17, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-09T09:14:09.671087
[2025-07-09 09:14:09] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 17, #token: 28, token usage: 0.00, #running-req: 1, #queue-req: 0, timestamp: 2025-07-09T09:14:09.672071
[2025-07-09 09:14:09] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 19, #token: 37, token usage: 0.00, #running-req: 2, #queue-req: 0, timestamp: 2025-07-09T09:14:09.699901
[2025-07-09 09:14:09] INFO:     127.0.0.1:35980 - "POST /generate HTTP/1.1" 200 OK
[2025-07-09 09:14:09] INFO:     127.0.0.1:35986 - "POST /generate HTTP/1.1" 200 OK
[2025-07-09 09:14:09] Decode batch. #running-req: 1, #token: 1, token usage: 0.00, cuda graph: False, gen throughput (token/s): 118.32, #queue-req: 0, timestamp: 2025-07-09T09:14:09.803149
[2025-07-09 09:14:09] INFO:     127.0.0.1:35966 - "POST /generate HTTP/1.1" 200 OK
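
Each entry passed to `run_batch` is a dict whose keys match the function's parameter names (here, `question`). When the inputs come from a plain list, the argument list can be built with a comprehension. This is a small sketch of that preparation step only; `questions` and `batch_args` are illustrative names, and the call to `run_batch` itself is left out so the snippet runs without a server.

```python
questions = [
    "What is the capital of the United Kingdom?",
    "What is the capital of France?",
    "What is the capital of Japan?",
]

# One dict per prompt; the key must match the parameter name of text_qa.
batch_args = [{"question": q} for q in questions]

for args in batch_args:
    print(args)
```

The resulting `batch_args` has the same shape as the list literal passed to `text_qa.run_batch` in the cell above.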





## Streaming

Use `stream` to stream the output to the user.

In [11]:
@function
def text_qa(s, question):
    s += user(question)
    s += assistant(gen("answer", stop="\n"))


state = text_qa.run(
    question="What is the capital of France?", temperature=0.1, stream=True
)

for out in state.text_iter():
    print(out, end="", flush=True)

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant


[2025-07-09 09:14:09] INFO:     127.0.0.1:35992 - "POST /generate HTTP/1.1" 200 OK
[2025-07-09 09:14:09] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 25, #token: 25, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-09T09:14:09.815365
The capital of France is Paris.<|im_end|>
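
When streaming, it is often useful to display chunks as they arrive while also keeping the complete text. The sketch below shows that consumption pattern with a hypothetical generator, `fake_text_iter`, standing in for `state.text_iter()`; the chunk boundaries are made up for illustration.

```python
from typing import Iterator


def fake_text_iter() -> Iterator[str]:
    """Hypothetical stand-in for state.text_iter(); yields text chunks in order."""
    yield from ["The", " capital", " of", " France", " is", " Paris", "."]


chunks = []
for out in fake_text_iter():
    # Print each chunk immediately, without a trailing newline.
    print(out, end="", flush=True)
    chunks.append(out)
print()

# Join the chunks to recover the full response once the stream ends.
full_text = "".join(chunks)
```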


## Complex Prompts

You may use `{system|user|assistant}_{begin|end}` to define complex prompts.

In [12]:
@function
def chat_example(s):
    s += system("You are a helpful assistant.")
    # Same as: s += s.system("You are a helpful assistant.")

    with s.user():
        s += "Question: What is the capital of France?"

    s += assistant_begin()
    s += "Answer: " + gen("answer", max_tokens=100, stop="\n")
    s += assistant_end()


state = chat_example()
print_highlight(state["answer"])

[2025-07-09 09:14:09] Prefill batch. #new-seq: 1, #new-token: 17, #cached-token: 14, #token: 14, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-09T09:14:09.901943
[2025-07-09 09:14:09] INFO:     127.0.0.1:35996 - "POST /generate HTTP/1.1" 200 OK
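
The brace notation `{system|user|assistant}_{begin|end}` is shorthand for six names. The short sketch below simply expands it; it assumes every combination is available from `sglang`, although only `assistant_begin` and `assistant_end` are imported in this notebook.

```python
from itertools import product

roles = ("system", "user", "assistant")
edges = ("begin", "end")

# Expand the shorthand into the individual helper names.
names = [f"{role}_{edge}" for role, edge in product(roles, edges)]
print(names)
```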


In [13]:
terminate_process(server_process)

[2025-07-09 09:14:10] Child process unexpectedly failed with exitcode=9. pid=1302902
[2025-07-09 09:14:10] Child process unexpectedly failed with exitcode=9. pid=1302836


## Multi-modal Generation

You may use the SGLang frontend language to define multi-modal prompts.
See [here](https://docs.sglang.ai/supported_models/generative_models.html) for supported models.

In [14]:
server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")

[2025-07-09 09:14:18] server_args=ServerArgs(model_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, skip_server_warmup=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='Qwen/Qwen2.5-VL-7B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, hybrid_kvcache_ratio=None, impl='auto', host='0.0.0.0', port=31809, nccl_port=None, mem_fraction_static=0.7866, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=842891104, constrained_json_whitespa

You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.


[2025-07-09 09:14:20] Inferred chat template from model path: qwen2-vl




[2025-07-09 09:14:32] Attention backend not set. Use flashinfer backend by default.
[2025-07-09 09:14:32] Init torch distributed begin.


[2025-07-09 09:14:34] Init torch distributed ends. mem usage=0.00 GB


[2025-07-09 09:14:35] Load weight begin. avail mem=78.50 GB
[2025-07-09 09:14:35] Multimodal attention backend not set. Use sdpa.
[2025-07-09 09:14:35] Using sdpa as multimodal attention backend.


[2025-07-09 09:14:38] The weight of LmHead is not packed
[2025-07-09 09:14:38] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:03,  1.21it/s]


Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02,  1.18it/s]


Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.17it/s]


Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00,  1.16it/s]


Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.50it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.34it/s]

[2025-07-09 09:14:43] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=47.22 GB, mem usage=31.28 GB.
[2025-07-09 09:14:43] KV Cache is allocated. #tokens: 20480, K size: 0.55 GB, V size: 0.55 GB
[2025-07-09 09:14:43] Memory pool end. avail mem=45.85 GB


[2025-07-09 09:14:45] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=128000, available_gpu_mem=45.28 GB


[2025-07-09 09:14:45] INFO:     Started server process [1303849]
[2025-07-09 09:14:45] INFO:     Waiting for application startup.
[2025-07-09 09:14:45] INFO:     Application startup complete.
[2025-07-09 09:14:45] INFO:     Uvicorn running on http://0.0.0.0:31809 (Press CTRL+C to quit)


[2025-07-09 09:14:46] INFO:     127.0.0.1:47198 - "GET /v1/models HTTP/1.1" 200 OK


[2025-07-09 09:14:46] INFO:     127.0.0.1:47202 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-07-09 09:14:46] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, #token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-09T09:14:46.869130


[2025-07-09 09:14:48] INFO:     127.0.0.1:47218 - "POST /generate HTTP/1.1" 200 OK
[2025-07-09 09:14:48] The server is fired up and ready to roll!


Server started on http://localhost:31809


In [15]:
set_default_backend(RuntimeEndpoint(f"http://localhost:{port}"))

[2025-07-09 09:14:51] INFO:     127.0.0.1:47234 - "GET /get_model_info HTTP/1.1" 200 OK


Ask a question about an image.

In [16]:
@function
def image_qa(s, image_file, question):
    s += user(image(image_file) + question)
    s += assistant(gen("answer", max_tokens=256))


image_url = "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
image_bytes, _ = load_image(image_url)
state = image_qa(image_bytes, "What is in the image?")
print_highlight(state["answer"])

[2025-07-09 09:14:52] Prefill batch. #new-seq: 1, #new-token: 307, #cached-token: 0, #token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-09T09:14:52.173162


[2025-07-09 09:14:53] Decode batch. #running-req: 1, #token: 340, token usage: 0.02, cuda graph: False, gen throughput (token/s): 5.06, #queue-req: 0, timestamp: 2025-07-09T09:14:53.289096


[2025-07-09 09:14:53] INFO:     127.0.0.1:47242 - "POST /generate HTTP/1.1" 200 OK


In [17]:
terminate_process(server_process)

[2025-07-09 09:14:53] Child process unexpectedly failed with exitcode=9. pid=1304189
[2025-07-09 09:14:53] Child process unexpectedly failed with exitcode=9. pid=1304123
