# SGLang Frontend Language

The SGLang frontend language lets you define prompts in a convenient, structured way.

## Launch A Server

Launch the server in your terminal and wait for it to initialize.

In [1]:
import requests
import os

from sglang import assistant_begin, assistant_end
from sglang import assistant, function, gen, system, user
from sglang import image
from sglang import RuntimeEndpoint, set_default_backend
from sglang.srt.utils import load_image
from sglang.test.test_utils import is_in_ci
from sglang.utils import print_highlight, terminate_process, wait_for_server

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd


server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")

[2025-02-27 04:44:25] server_args=ServerArgs(model_path='Qwen/Qwen2.5-7B-Instruct', tokenizer_path='Qwen/Qwen2.5-7B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='Qwen/Qwen2.5-7B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=34105, mem_fraction_static=0.88, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, stream_output=False, random_seed=410982854, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, deco

[2025-02-27 04:44:44 TP0] Init torch distributed begin.
[2025-02-27 04:44:44 TP0] Load weight begin. avail mem=62.69 GB


[2025-02-27 04:44:44 TP0] The following error message 'operation scheduled before its operands' can be ignored.


[2025-02-27 04:44:45 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.25it/s]

[2025-02-27 04:44:48 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=30.50 GB
[2025-02-27 04:44:48 TP0] KV Cache is allocated. K size: 0.55 GB, V size: 0.55 GB.
[2025-02-27 04:44:48 TP0] Memory pool end. avail mem=29.28 GB


[2025-02-27 04:44:49 TP0] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=32768


[2025-02-27 04:44:49] INFO:     Started server process [3761112]
[2025-02-27 04:44:49] INFO:     Waiting for application startup.
[2025-02-27 04:44:49] INFO:     Application startup complete.
[2025-02-27 04:44:49] INFO:     Uvicorn running on http://0.0.0.0:34105 (Press CTRL+C to quit)


[2025-02-27 04:44:49] INFO:     127.0.0.1:33340 - "GET /v1/models HTTP/1.1" 200 OK


[2025-02-27 04:44:50] INFO:     127.0.0.1:33342 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-27 04:44:50 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


Server started on http://localhost:34105
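
Waiting for initialization amounts to polling the server until it responds. The following is a minimal sketch of that idea, not the implementation of `wait_for_server`: `wait_until_ready` and `fake_health_check` are hypothetical names, and the readiness check is a stand-in so that the example runs without a server. In practice the check would presumably be an HTTP request such as the `GET /v1/models` call visible in the log above.

```python
import time


def wait_until_ready(is_ready, timeout=5.0, interval=0.01):
    """Poll `is_ready()` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    attempts = 0
    while time.monotonic() < deadline:
        attempts += 1
        if is_ready():
            return attempts
        time.sleep(interval)
    raise TimeoutError("server did not become ready in time")


# Stand-in for an HTTP health check: reports ready on the third call.
calls = {"n": 0}


def fake_health_check():
    calls["n"] += 1
    return calls["n"] >= 3


attempts = wait_until_ready(fake_health_check)
print(f"ready after {attempts} attempts")
```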


Set the default backend. Note: besides the local server, you may also use `OpenAI` or other API endpoints.

In [2]:
set_default_backend(RuntimeEndpoint(f"http://localhost:{port}"))

[2025-02-27 04:44:54] INFO:     127.0.0.1:33350 - "GET /get_model_info HTTP/1.1" 200 OK


## Basic Usage

The simplest use of the SGLang frontend language is a question-answer dialog between a user and an assistant.

In [3]:
@function
def basic_qa(s, question):
    s += system("You are a helpful assistant that can answer questions.")
    s += user(question)
    s += assistant(gen("answer", max_tokens=512))

In [4]:
state = basic_qa("List 3 countries and their capitals.")
print_highlight(state["answer"])

[2025-02-27 04:44:55 TP0] Prefill batch. #new-seq: 1, #new-token: 31, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 1, #queue-req: 0


[2025-02-27 04:44:56] INFO:     127.0.0.1:33348 - "POST /generate HTTP/1.1" 200 OK
[2025-02-27 04:44:56] The server is fired up and ready to roll!


[2025-02-27 04:44:56] INFO:     127.0.0.1:33364 - "POST /generate HTTP/1.1" 200 OK


## Multi-turn Dialog

SGLang frontend language can also be used to define multi-turn dialogs.

In [5]:
@function
def multi_turn_qa(s):
    s += system("You are a helpful assistant that can answer questions.")
    s += user("Please give me a list of 3 countries and their capitals.")
    s += assistant(gen("first_answer", max_tokens=512))
    s += user("Please give me another list of 3 countries and their capitals.")
    s += assistant(gen("second_answer", max_tokens=512))
    return s


state = multi_turn_qa()
print_highlight(state["first_answer"])
print_highlight(state["second_answer"])

[2025-02-27 04:44:56 TP0] Prefill batch. #new-seq: 1, #new-token: 18, #cached-token: 18, cache hit rate: 24.66%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-27 04:44:56 TP0] Decode batch. #running-req: 1, #token: 46, token usage: 0.00, gen throughput (token/s): 6.18, #queue-req: 0


[2025-02-27 04:44:57] INFO:     127.0.0.1:33368 - "POST /generate HTTP/1.1" 200 OK


[2025-02-27 04:44:57 TP0] Prefill batch. #new-seq: 1, #new-token: 23, #cached-token: 67, cache hit rate: 52.15%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-27 04:44:57 TP0] Decode batch. #running-req: 1, #token: 108, token usage: 0.01, gen throughput (token/s): 57.85, #queue-req: 0


[2025-02-27 04:44:57] INFO:     127.0.0.1:33370 - "POST /generate HTTP/1.1" 200 OK


## Control flow

You may use any Python code within the function to define more complex control flows.

In [6]:
@function
def tool_use(s, question):
    s += assistant(
        "To answer this question: "
        + question
        + ". I need to use a "
        + gen("tool", choices=["calculator", "search engine"])
        + ". "
    )

    if s["tool"] == "calculator":
        s += assistant("The math expression is: " + gen("expression"))
    elif s["tool"] == "search engine":
        s += assistant("The key word to search is: " + gen("word"))


state = tool_use("What is 2 * 2?")
print_highlight(state["tool"])
print_highlight(state["expression"])

[2025-02-27 04:44:57 TP0] Prefill batch. #new-seq: 1, #new-token: 25, #cached-token: 8, cache hit rate: 47.45%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-27 04:44:57] INFO:     127.0.0.1:33372 - "POST /generate HTTP/1.1" 200 OK
[2025-02-27 04:44:57 TP0] Prefill batch. #new-seq: 1, #new-token: 2, #cached-token: 31, cache hit rate: 54.15%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-27 04:44:57 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 31, cache hit rate: 58.94%, token usage: 0.00, #running-req: 1, #queue-req: 0
[2025-02-27 04:44:57] INFO:     127.0.0.1:33388 - "POST /generate HTTP/1.1" 200 OK


[2025-02-27 04:44:57 TP0] Prefill batch. #new-seq: 1, #new-token: 13, #cached-token: 33, cache hit rate: 60.84%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-27 04:44:58 TP0] Decode batch. #running-req: 1, #token: 70, token usage: 0.00, gen throughput (token/s): 62.30, #queue-req: 0


[2025-02-27 04:44:58] INFO:     127.0.0.1:33402 - "POST /generate HTTP/1.1" 200 OK
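
With `choices`, the generated value is one of the listed strings, so the `if`/`elif` branches can compare against them exactly. One way to picture the selection is as picking the candidate the model scores highest. The sketch below assumes selection by average per-token log-probability and uses made-up scores; `pick_choice` and the numbers are hypothetical, and SGLang's actual scoring may differ.

```python
def pick_choice(token_logprobs):
    """Return the choice with the highest mean per-token log-probability."""
    return max(
        token_logprobs,
        key=lambda choice: sum(token_logprobs[choice]) / len(token_logprobs[choice]),
    )


# Made-up per-token log-probabilities for each candidate continuation.
scores = {
    "calculator": [-0.2, -0.1],
    "search engine": [-1.5, -0.9, -0.7],
}

tool = pick_choice(scores)
if tool == "calculator":
    next_prompt = "The math expression is: "
elif tool == "search engine":
    next_prompt = "The key word to search is: "
print(tool, "->", next_prompt)
```

Only the branch that runs sets its variable: in `tool_use`, `"expression"` is generated only when the calculator branch is taken, and `"word"` only when the search engine branch is taken.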


## Parallelism

Use `fork` to launch parallel prompts. Because `gen` is non-blocking, the `for` loop below issues two generation calls in parallel.

In [7]:
@function
def tip_suggestion(s):
    s += assistant(
        "Here are two tips for staying healthy: "
        "1. Balanced Diet. 2. Regular Exercise.\n\n"
    )

    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += assistant(
            f"Now, expand tip {i+1} into a paragraph:\n"
            + gen("detailed_tip", max_tokens=256, stop="\n\n")
        )

    s += assistant("Tip 1:" + forks[0]["detailed_tip"] + "\n")
    s += assistant("Tip 2:" + forks[1]["detailed_tip"] + "\n")
    s += assistant(
        "To summarize the above two tips, I can say:\n" + gen("summary", max_tokens=512)
    )


state = tip_suggestion()
print_highlight(state["summary"])

[2025-02-27 04:44:58 TP0] Prefill batch. #new-seq: 1, #new-token: 35, #cached-token: 14, cache hit rate: 56.42%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-27 04:44:58 TP0] Prefill batch. #new-seq: 1, #new-token: 35, #cached-token: 14, cache hit rate: 53.07%, token usage: 0.00, #running-req: 1, #queue-req: 0


[2025-02-27 04:44:58 TP0] Decode batch. #running-req: 2, #token: 94, token usage: 0.00, gen throughput (token/s): 83.00, #queue-req: 0


[2025-02-27 04:44:59 TP0] Decode batch. #running-req: 2, #token: 174, token usage: 0.01, gen throughput (token/s): 116.98, #queue-req: 0


[2025-02-27 04:45:00 TP0] Decode batch. #running-req: 2, #token: 254, token usage: 0.01, gen throughput (token/s): 115.38, #queue-req: 0


[2025-02-27 04:45:00 TP0] Decode batch. #running-req: 2, #token: 334, token usage: 0.02, gen throughput (token/s): 117.05, #queue-req: 0


[2025-02-27 04:45:01] INFO:     127.0.0.1:34134 - "POST /generate HTTP/1.1" 200 OK
[2025-02-27 04:45:01] INFO:     127.0.0.1:34132 - "POST /generate HTTP/1.1" 200 OK
[2025-02-27 04:45:01 TP0] Prefill batch. #new-seq: 1, #new-token: 364, #cached-token: 39, cache hit rate: 31.48%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-27 04:45:01 TP0] Decode batch. #running-req: 1, #token: 413, token usage: 0.02, gen throughput (token/s): 94.37, #queue-req: 0


[2025-02-27 04:45:02 TP0] Decode batch. #running-req: 1, #token: 453, token usage: 0.02, gen throughput (token/s): 59.43, #queue-req: 0


[2025-02-27 04:45:03 TP0] Decode batch. #running-req: 1, #token: 493, token usage: 0.02, gen throughput (token/s): 58.51, #queue-req: 0


[2025-02-27 04:45:03 TP0] Decode batch. #running-req: 1, #token: 533, token usage: 0.03, gen throughput (token/s): 59.72, #queue-req: 0


[2025-02-27 04:45:04 TP0] Decode batch. #running-req: 1, #token: 573, token usage: 0.03, gen throughput (token/s): 59.66, #queue-req: 0


[2025-02-27 04:45:05 TP0] Decode batch. #running-req: 1, #token: 613, token usage: 0.03, gen throughput (token/s): 60.93, #queue-req: 0


[2025-02-27 04:45:05 TP0] Decode batch. #running-req: 1, #token: 653, token usage: 0.03, gen throughput (token/s): 59.92, #queue-req: 0
[2025-02-27 04:45:05] INFO:     127.0.0.1:34146 - "POST /generate HTTP/1.1" 200 OK
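
The fork-then-join pattern can be pictured with ordinary Python futures. This is only an analogy, under the assumption that each branch's generation is submitted immediately and that reading `forks[i]["detailed_tip"]` waits for that branch's result. `fake_generate` is a hypothetical stand-in for the model call and is not part of SGLang.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_generate(prompt):
    """Stand-in for a model call; sleeps briefly and returns a canned string."""
    time.sleep(0.05)
    return f"<expansion of: {prompt}>"


prompts = [f"Now, expand tip {i + 1} into a paragraph:" for i in range(2)]

# Fork: submit both requests without waiting for either to finish.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(fake_generate, p) for p in prompts]
    # Join: reading a result blocks until that branch is done.
    details = [f.result() for f in futures]

# The joined results are then appended to the main prompt, one per tip.
summary_prompt = "".join(f"Tip {i + 1}: {d}\n" for i, d in enumerate(details))
print(summary_prompt)
```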


## Constrained Decoding

Use `regex` to specify a regular expression as a decoding constraint. This is only supported for local models.

In [8]:
@function
def regular_expression_gen(s):
    s += user("What is the IP address of the Google DNS servers?")
    s += assistant(
        gen(
            "answer",
            temperature=0,
            regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
        )
    )


state = regular_expression_gen()
print_highlight(state["answer"])

Compiling FSM index for all state transitions: 100%|██████████| 69/69 [00:02<00:00, 23.33it/s]


[2025-02-27 04:45:19 TP0] Prefill batch. #new-seq: 1, #new-token: 18, #cached-token: 12, cache hit rate: 31.79%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-27 04:45:20] INFO:     127.0.0.1:34160 - "POST /generate HTTP/1.1" 200 OK
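
A decoding pattern can be checked offline with Python's `re` module before it is sent to the server. The sketch below is an illustration using only the standard library. It compares an IPv4 pattern whose separator is an escaped dot (`\.`) with one whose separator is an unescaped `.`, which matches any character. The names `OCTET`, `strict_ip`, and `loose_ip` are introduced here for the example.

```python
import re

OCTET = r"(25[0-5]|2[0-4]\d|[01]?\d\d?)"
strict_ip = rf"({OCTET}\.){{3}}{OCTET}"  # literal dots between octets
loose_ip = rf"({OCTET}.){{3}}{OCTET}"  # unescaped dot matches any character

candidates = ["8.8.8.8", "256.1.1.1", "8x8x8x8"]
strict_ok = [c for c in candidates if re.fullmatch(strict_ip, c)]
loose_ok = [c for c in candidates if re.fullmatch(loose_ip, c)]

print("strict:", strict_ok)
print("loose: ", loose_ok)
```

Only the strict pattern rejects `8x8x8x8`, so escaping the dot matters when the constraint is meant to guarantee a well-formed address.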


Use `regex` to define a `JSON` decoding schema.

In [9]:
character_regex = (
    r"""\{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r"""    "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
    + r"""    "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
    + r"""    "wand": \{\n"""
    + r"""        "wood": "[\w\d\s]{1,16}",\n"""
    + r"""        "core": "[\w\d\s]{1,16}",\n"""
    + r"""        "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
    + r"""    \},\n"""
    + r"""    "alive": "(Alive|Deceased)",\n"""
    + r"""    "patronus": "[\w\d\s]{1,16}",\n"""
    + r"""    "boggart": "[\w\d\s]{1,16}"\n"""
    + r"""\}"""
)


@function
def character_gen(s, name):
    s += user(
        f"{name} is a character in Harry Potter. Please fill in the following information about this character."
    )
    s += assistant(gen("json_output", max_tokens=256, regex=character_regex))


state = character_gen("Harry Potter")
print_highlight(state["json_output"])

Compiling FSM index for all state transitions: 100%|██████████| 431/431 [00:16<00:00, 25.55it/s]


[2025-02-27 04:45:38 TP0] Prefill batch. #new-seq: 1, #new-token: 24, #cached-token: 14, cache hit rate: 32.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-27 04:45:38 TP0] Decode batch. #running-req: 1, #token: 59, token usage: 0.00, gen throughput (token/s): 1.21, #queue-req: 0


[2025-02-27 04:45:39 TP0] Decode batch. #running-req: 1, #token: 99, token usage: 0.00, gen throughput (token/s): 71.51, #queue-req: 0


[2025-02-27 04:45:39 TP0] Decode batch. #running-req: 1, #token: 139, token usage: 0.01, gen throughput (token/s): 76.05, #queue-req: 0
[2025-02-27 04:45:39] INFO:     127.0.0.1:34506 - "POST /generate HTTP/1.1" 200 OK
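
Because the regex fixes the layout of the output, a matching string can be passed straight to a JSON parser. The sketch below checks this offline with a shortened version of the pattern and a hand-written sample. `short_regex`, `sample`, and `edge` are hypothetical names, and the sample is not real model output. It also probes one edge case: the `length` sub-pattern `[0-9]{1,2}\.[0-9]{0,2}` allows zero fractional digits, which JSON does not.

```python
import json
import re

# Shortened version of the pattern, keeping a few representative fields.
short_regex = (
    r"""\{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r"""    "wand": \{\n"""
    + r"""        "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
    + r"""    \}\n"""
    + r"""\}"""
)

# Hand-written sample for illustration.
sample = (
    '{\n'
    '    "name": "Harry Potter",\n'
    '    "house": "Gryffindor",\n'
    '    "wand": {\n'
    '        "length": 11.0\n'
    '    }\n'
    '}'
)

sample_matches = re.fullmatch(short_regex, sample) is not None
parsed = json.loads(sample)
print(sample_matches, parsed["house"], parsed["wand"]["length"])

# Edge case: "11." satisfies the regex but is not a valid JSON number.
edge = sample.replace("11.0", "11.")
edge_matches = re.fullmatch(short_regex, edge) is not None
try:
    json.loads(edge)
    edge_parses = True
except json.JSONDecodeError:
    edge_parses = False
print(edge_matches, edge_parses)
```

If strictly valid JSON is required, tightening the fractional part to `[0-9]{1,2}` would rule out this case.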


## Batching

Use `run_batch` to run a batch of prompts.

In [10]:
@function
def text_qa(s, question):
    s += user(question)
    s += assistant(gen("answer", stop="\n"))


states = text_qa.run_batch(
    [
        {"question": "What is the capital of the United Kingdom?"},
        {"question": "What is the capital of France?"},
        {"question": "What is the capital of Japan?"},
    ],
    progress_bar=True,
)

for i, state in enumerate(states):
    print_highlight(f"Answer {i+1}: {state['answer']}")

[2025-02-27 04:45:39 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 13, cache hit rate: 32.96%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-27 04:45:40] INFO:     127.0.0.1:44456 - "POST /generate HTTP/1.1" 200 OK


100%|██████████| 3/3 [00:00<00:00, 19.21it/s]

[2025-02-27 04:45:40 TP0] Prefill batch. #new-seq: 1, #new-token: 11, #cached-token: 17, cache hit rate: 33.80%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-27 04:45:40 TP0] Prefill batch. #new-seq: 2, #new-token: 18, #cached-token: 34, cache hit rate: 35.49%, token usage: 0.00, #running-req: 1, #queue-req: 0
[2025-02-27 04:45:40] INFO:     127.0.0.1:44472 - "POST /generate HTTP/1.1" 200 OK
[2025-02-27 04:45:40] INFO:     127.0.0.1:44476 - "POST /generate HTTP/1.1" 200 OK
[2025-02-27 04:45:40] INFO:     127.0.0.1:44466 - "POST /generate HTTP/1.1" 200 OK
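
The argument to `run_batch` is a list with one dictionary of keyword arguments per prompt. The sketch below only mimics that shape with the standard library, assuming results come back in input order, as the indexed loop above relies on. `fake_text_qa` is a hypothetical stand-in for the decorated function and returns canned answers.

```python
from concurrent.futures import ThreadPoolExecutor

questions = [
    "What is the capital of the United Kingdom?",
    "What is the capital of France?",
    "What is the capital of Japan?",
]

# One dict of keyword arguments per prompt, matching the function's parameters.
batch_args = [{"question": q} for q in questions]


def fake_text_qa(question):
    """Stand-in for the decorated function; returns a canned answer."""
    return {"answer": f"(answer to: {question})"}


# Run the calls concurrently; `map` yields results in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    states = list(pool.map(lambda kwargs: fake_text_qa(**kwargs), batch_args))

for i, state in enumerate(states):
    print(f"Answer {i + 1}: {state['answer']}")
```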





## Streaming

Use `stream` to stream the output to the user.

In [11]:
@function
def text_qa(s, question):
    s += user(question)
    s += assistant(gen("answer", stop="\n"))


state = text_qa.run(
    question="What is the capital of France?", temperature=0.1, stream=True
)

for out in state.text_iter():
    print(out, end="", flush=True)

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant


[2025-02-27 04:45:40] INFO:     127.0.0.1:44478 - "POST /generate HTTP/1.1" 200 OK
[2025-02-27 04:45:40 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 25, cache hit rate: 37.07%, token usage: 0.00, #running-req: 0, #queue-req: 0
The capital of France is Paris.<|im_end|>
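
The consuming loop treats the stream as an iterator of text chunks and prints each one as it arrives. A minimal sketch of that pattern follows. `fake_text_iter` is a hypothetical generator standing in for `state.text_iter()`, and its chunks are taken from the answer shown above.

```python
def fake_text_iter():
    """Stand-in for a streaming response; yields text one chunk at a time."""
    for chunk in ["The", " capital", " of", " France", " is", " Paris", "."]:
        yield chunk


pieces = []
for out in fake_text_iter():
    print(out, end="", flush=True)  # show each chunk as soon as it arrives
    pieces.append(out)
print()

# Joining the chunks recovers the complete answer.
full_text = "".join(pieces)
```

Passing `end=""` keeps the chunks on one line, and `flush=True` makes each chunk visible immediately instead of waiting for the output buffer to fill.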


## Complex Prompts

You may use `{system|user|assistant}_{begin|end}` to define complex prompts.

In [12]:
@function
def chat_example(s):
    s += system("You are a helpful assistant.")
    # Same as: s += s.system("You are a helpful assistant.")

    with s.user():
        s += "Question: What is the capital of France?"

    s += assistant_begin()
    s += "Answer: " + gen("answer", max_tokens=100, stop="\n")
    s += assistant_end()


state = chat_example()
print_highlight(state["answer"])

[2025-02-27 04:45:40 TP0] Prefill batch. #new-seq: 1, #new-token: 17, #cached-token: 14, cache hit rate: 37.32%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-27 04:45:40 TP0] Decode batch. #running-req: 0, #token: 0, token usage: 0.00, gen throughput (token/s): 93.64, #queue-req: 0
[2025-02-27 04:45:40] INFO:     127.0.0.1:44486 - "POST /generate HTTP/1.1" 200 OK
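
The `_begin`/`_end` pairs can be read as the opening and closing markers of a chat turn, with anything appended in between becoming the turn's content. The sketch below illustrates this with the `<|im_start|>` and `<|im_end|>` markers visible in the streaming output earlier in this document. The helper names `role_begin`, `role_end`, and `turn` are hypothetical, and the real template depends on the model.

```python
def role_begin(role):
    """Opening marker for a turn in a ChatML-style template."""
    return f"<|im_start|>{role}\n"


def role_end():
    """Closing marker for a turn."""
    return "<|im_end|>\n"


def turn(role, content):
    """A whole turn is begin + content + end."""
    return role_begin(role) + content + role_end()


prompt = turn("system", "You are a helpful assistant.")
prompt += turn("user", "Question: What is the capital of France?")
# Open the assistant turn and leave it unfinished, so generation continues after "Answer: ".
prompt += role_begin("assistant") + "Answer: "
print(prompt)
```

Splitting a turn into explicit begin and end steps is what allows fixed text such as `"Answer: "` to be placed inside the assistant turn before generation starts.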


In [13]:
terminate_process(server_process)

## Multi-modal Generation

You may use SGLang frontend language to define multi-modal prompts.
See [here](https://docs.sglang.ai/references/supported_models.html) for supported models.

In [14]:
server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")

[2025-02-27 04:45:53] server_args=ServerArgs(model_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='Qwen/Qwen2.5-VL-7B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=31940, mem_fraction_static=0.88, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, stream_output=False, random_seed=385551917, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=Fa

[2025-02-27 04:46:12 TP0] Overlap scheduler is disabled for multimodal models.
[2025-02-27 04:46:12 TP0] Automatically reduce --mem-fraction-static to 0.836 because this is a multimodal model.
[2025-02-27 04:46:12 TP0] Init torch distributed begin.


[2025-02-27 04:46:13 TP0] Load weight begin. avail mem=78.81 GB


[2025-02-27 04:46:13 TP0] The following error message 'operation scheduled before its operands' can be ignored.


[2025-02-27 04:46:14 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.35it/s]

[2025-02-27 04:46:18 TP0] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=63.02 GB


[2025-02-27 04:46:18 TP0] KV Cache is allocated. K size: 0.55 GB, V size: 0.55 GB.
[2025-02-27 04:46:18 TP0] Memory pool end. avail mem=61.72 GB


[2025-02-27 04:46:19 TP0] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=128000
[2025-02-27 04:46:19] INFO:     Started server process [3764848]
[2025-02-27 04:46:19] INFO:     Waiting for application startup.
[2025-02-27 04:46:19] INFO:     Application startup complete.
[2025-02-27 04:46:19] INFO:     Uvicorn running on http://0.0.0.0:31940 (Press CTRL+C to quit)


[2025-02-27 04:46:20] INFO:     127.0.0.1:53908 - "GET /v1/models HTTP/1.1" 200 OK


[2025-02-27 04:46:20] INFO:     127.0.0.1:53924 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-27 04:46:20 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-27 04:46:23] INFO:     127.0.0.1:53932 - "POST /generate HTTP/1.1" 200 OK
[2025-02-27 04:46:23] The server is fired up and ready to roll!


Server started on http://localhost:31940


In [15]:
set_default_backend(RuntimeEndpoint(f"http://localhost:{port}"))

[2025-02-27 04:46:25] INFO:     127.0.0.1:53938 - "GET /get_model_info HTTP/1.1" 200 OK


Ask a question about an image.

In [16]:
@function
def image_qa(s, image_file, question):
    s += user(image(image_file) + question)
    s += assistant(gen("answer", max_tokens=256))


image_url = "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
image_bytes, _ = load_image(image_url)
state = image_qa(image_bytes, "What is in the image?")
print_highlight(state["answer"])

[2025-02-27 04:46:34 TP0] Prefill batch. #new-seq: 1, #new-token: 307, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-27 04:46:38 TP0] Decode batch. #running-req: 1, #token: 340, token usage: 0.02, gen throughput (token/s): 2.17, #queue-req: 0


[2025-02-27 04:46:38 TP0] Decode batch. #running-req: 1, #token: 380, token usage: 0.02, gen throughput (token/s): 60.66, #queue-req: 0


[2025-02-27 04:46:39] INFO:     127.0.0.1:58490 - "POST /generate HTTP/1.1" 200 OK


In [17]:
terminate_process(server_process)