# Frontend: Structured Generation Language (SGLang)

The SGLang frontend language works with both local models and API models. It is an alternative to the OpenAI API and can be easier to use for complex prompting workflows.

Start the server.

In [1]:
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)

server_process = execute_shell_command(
    "python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-1.5B-Instruct --mem-fraction-static 0.8 --port 30333 --host 0.0.0.0"
)
wait_for_server("http://localhost:30333")

INFO 02-13 21:02:52 __init__.py:190] Automatically detected platform cuda.
[2025-02-13 21:02:57] server_args=ServerArgs(model_path='Qwen/Qwen2.5-1.5B-Instruct', tokenizer_path='Qwen/Qwen2.5-1.5B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='Qwen/Qwen2.5-1.5B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30333, mem_fraction_static=0.8, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, stream_output=False, random_seed=502020878, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_req
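The launch command above is a single string, which becomes awkward once the model or port varies between runs. As a minimal sketch, the string can be assembled from parameters with the standard library. The helper name `build_launch_command` is hypothetical and not part of SGLang; the flags are only the ones already used in the cell above.

```python
import shlex


def build_launch_command(
    model_path: str,
    port: int = 30333,
    host: str = "0.0.0.0",
    mem_fraction_static: float = 0.8,
) -> str:
    """Hypothetical helper: assemble the server launch command as one string."""
    args = [
        "python3",
        "-m",
        "sglang.launch_server",
        "--model-path",
        model_path,
        "--mem-fraction-static",
        str(mem_fraction_static),
        "--port",
        str(port),
        "--host",
        host,
    ]
    # shlex.join quotes any argument that needs it (e.g. a path with spaces).
    return shlex.join(args)


command = build_launch_command("Qwen/Qwen2.5-1.5B-Instruct")
print(command)
```

The resulting string can then be passed to `execute_shell_command` in the same way as the literal command above.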

Set the default backend for SGLang.

In [2]:
from sglang import set_default_backend, RuntimeEndpoint

set_default_backend(RuntimeEndpoint("http://localhost:30333"))

[2025-02-13 21:03:19] INFO:     127.0.0.1:51850 - "GET /get_model_info HTTP/1.1" 200 OK


## Multi-turn conversation

SGLang provides a simple API for building multi-turn conversations. Prompt templates can be defined intuitively with the `function` decorator.

In [3]:
from sglang import function, system, user, assistant, gen


@function
def multi_turn_conversation(s, country: str):
    s += system("You are Qwen, created by Alibaba Cloud. You are a helpful assistant.")
    s += user(f"What is the capital of {country}?")
    s += assistant(gen("capital", max_tokens=250))
    s += user("Name an interesting building in this city.")
    s += assistant(gen("building", max_tokens=250))


state = multi_turn_conversation.run(
    country="Germany",
)

for m in state.messages():
    print(m["role"], ":", m["content"])

print_highlight("#" * 50)
print_highlight(state["capital"])
print_highlight("#" * 50)
print_highlight(state["building"])

[2025-02-13 21:03:19 TP0] Prefill batch. #new-seq: 1, #new-token: 36, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-13 21:03:20] INFO:     127.0.0.1:51860 - "POST /generate HTTP/1.1" 200 OK
[2025-02-13 21:03:20 TP0] Prefill batch. #new-seq: 1, #new-token: 18, #cached-token: 43, cache hit rate: 41.75%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-13 21:03:20 TP0] Decode batch. #running-req: 1, #token: 86, token usage: 0.00, gen throughput (token/s): 7.04, #queue-req: 0
[2025-02-13 21:03:20 TP0] Decode batch. #running-req: 1, #token: 126, token usage: 0.00, gen throughput (token/s): 244.17, #queue-req: 0
[2025-02-13 21:03:20 TP0] Decode batch. #running-req: 1, #token: 166, token usage: 0.00, gen throughput (token/s): 241.28, #queue-req: 0
[2025-02-13 21:03:20 TP0] Decode batch. #running-req: 1, #token: 206, token usage: 0.00, gen throughput (token/s): 241.34, #queue-req: 0
[2025-02-13 21:03:20] INFO:     127.0.0.1:51866 

SGLang can also be used with OpenAI models.

To do so, run `export OPENAI_API_KEY=<your-openai-api-key>` and then call `set_default_backend(OpenAI(<chosen-model>))`. Everything else stays exactly the same as above.

## Control Flow

The `choices` argument of `gen` restricts the generated text to one of the given options, which makes it a convenient tool for controlling the flow of the conversation.

In [4]:
@function
def control_flow(s, question: str):
    s += user(question)
    s += assistant(
        "Based on the question, this seems like "
        + gen("type", choices=["a technical query", "a creative request"])
    )

    if s["type"] == "a technical query":
        s += assistant(
            "Here's a technical explanation: "
            + gen("technical_response", max_tokens=250)
        )
    else:
        s += assistant(
            "Here's a creative response: " + gen("creative_response", max_tokens=250)
        )


state = control_flow.run(
    question="What is the main difference between a CPU and a GPU?"
)
print_highlight(state["technical_response"])
print_highlight("#" * 50)
state = control_flow.run(question="Can you help me write a story about time travel?")
print_highlight(state["creative_response"])
print_highlight("#" * 50)

[2025-02-13 21:03:20 TP0] Prefill batch. #new-seq: 1, #new-token: 35, #cached-token: 5, cache hit rate: 33.57%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-13 21:03:20] INFO:     127.0.0.1:51882 - "POST /generate HTTP/1.1" 200 OK
[2025-02-13 21:03:20 TP0] Prefill batch. #new-seq: 2, #new-token: 8, #cached-token: 76, cache hit rate: 54.63%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-13 21:03:20] INFO:     127.0.0.1:51886 - "POST /generate HTTP/1.1" 200 OK
[2025-02-13 21:03:20 TP0] Decode batch. #running-req: 0, #token: 0, token usage: 0.00, gen throughput (token/s): 137.10, #queue-req: 0
[2025-02-13 21:03:20 TP0] Prefill batch. #new-seq: 1, #new-token: 12, #cached-token: 42, cache hit rate: 59.07%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-13 21:03:21 TP0] Decode batch. #running-req: 1, #token: 95, token usage: 0.00, gen throughput (token/s): 212.97, #queue-req: 0
[2025-02-13 21:03:21 TP0] Decode batch. #running-req: 1, #token: 135, token usage: 0.00, gen throughput (token/s): 243.44, #queue-req: 0
[2025-02-13 21:03:21 TP0] Decode batch. #running-req: 1, #token: 175, token usage: 0.00, gen throughput (token/s): 241.00, #queue-req: 0
[2025-02-13 21:03:21] INFO:     127.0.0.1:51894 - "POST /generate HTTP/1.1" 200 OK


[2025-02-13 21:03:21 TP0] Prefill batch. #new-seq: 1, #new-token: 25, #cached-token: 14, cache hit rate: 56.25%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-13 21:03:21] INFO:     127.0.0.1:51908 - "POST /generate HTTP/1.1" 200 OK
[2025-02-13 21:03:21 TP0] Prefill batch. #new-seq: 2, #new-token: 8, #cached-token: 74, cache hit rate: 63.18%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-13 21:03:21] INFO:     127.0.0.1:51924 - "POST /generate HTTP/1.1" 200 OK
[2025-02-13 21:03:21 TP0] Prefill batch. #new-seq: 1, #new-token: 12, #cached-token: 41, cache hit rate: 64.84%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-13 21:03:21 TP0] Decode batch. #running-req: 1, #token: 81, token usage: 0.00, gen throughput (token/s): 169.42, #queue-req: 0
[2025-02-13 21:03:21 TP0] Decode batch. #running-req: 1, #token: 121, token usage: 0.00, gen throughput (token/s): 243.75, #queue-req: 0
[2025-02-13 21:03:22 TP0] Decode batch. #running-req: 1, #token: 161, tok
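Because the branch in `control_flow` compares `s["type"]` against a literal string, the option list and the branches need to stay in sync. One possible way to keep them together, sketched here in plain Python without any SGLang calls, is a lookup table keyed by the option text. The names `FOLLOW_UPS` and `follow_up_for` are hypothetical and exist only for this illustration.

```python
from typing import Dict, Tuple

# Each option maps to (variable name, prompt prefix), mirroring the two
# branches of control_flow above.
FOLLOW_UPS: Dict[str, Tuple[str, str]] = {
    "a technical query": ("technical_response", "Here's a technical explanation: "),
    "a creative request": ("creative_response", "Here's a creative response: "),
}

# The list handed to gen(..., choices=...) can be derived from the table,
# so adding a branch automatically adds an option.
CHOICES = list(FOLLOW_UPS)


def follow_up_for(choice: str) -> Tuple[str, str]:
    """Return (variable name, prefix) for a selected option."""
    if choice not in FOLLOW_UPS:
        raise ValueError(f"unexpected choice: {choice!r}")
    return FOLLOW_UPS[choice]


name, prefix = follow_up_for("a technical query")
print(name, "|", prefix)
```

Inside a `@function`, the returned pair could replace the `if`/`else` block, for example by appending `assistant(prefix + gen(name, max_tokens=250))`.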

## Parallelism

SGLang supports parallelism. `fork` can be used to launch multiple prompts in parallel.

In [5]:
@function
def parallel_sample(s, question, n):
    s += user(question)
    forks = s.fork(n)
    forks += assistant(gen("answer", temperature=0.7))
    forks.join()


states = parallel_sample.run(
    question="What does the integral of sin(x) from 0 to 2pi evaluate to? Answer without calculation.",
    n=5,
)
for answer in states["answer"]:
    print_highlight(answer)
    print_highlight("-" * 50)

[2025-02-13 21:03:22 TP0] Prefill batch. #new-seq: 1, #new-token: 26, #cached-token: 15, cache hit rate: 62.50%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-13 21:03:22 TP0] Prefill batch. #new-seq: 1, #new-token: 26, #cached-token: 15, cache hit rate: 60.52%, token usage: 0.00, #running-req: 1, #queue-req: 0
[2025-02-13 21:03:22 TP0] Prefill batch. #new-seq: 3, #new-token: 3, #cached-token: 120, cache hit rate: 67.42%, token usage: 0.00, #running-req: 2, #queue-req: 0
[2025-02-13 21:03:22] INFO:     127.0.0.1:51968 - "POST /generate HTTP/1.1" 200 OK
[2025-02-13 21:03:22] INFO:     127.0.0.1:51960 - "POST /generate HTTP/1.1" 200 OK
[2025-02-13 21:03:22] INFO:     127.0.0.1:51946 - "POST /generate HTTP/1.1" 200 OK
[2025-02-13 21:03:22] INFO:     127.0.0.1:51970 - "POST /generate HTTP/1.1" 200 OK
[2025-02-13 21:03:22] INFO:     127.0.0.1:51986 - "POST /generate HTTP/1.1" 200 OK
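One common use of several parallel samples is to keep the answer that appears most often. The sketch below shows that aggregation step in isolation. It assumes the answers have already been reduced to short strings, which is not the case for the free-text answers produced above, and the sample list is made up for illustration rather than taken from a model run.

```python
from collections import Counter
from typing import List, Tuple


def majority_answer(answers: List[str]) -> Tuple[str, int]:
    """Return the most frequent answer and its vote count after light normalization."""
    # Normalize whitespace, case, and a trailing period so "0", " 0" and "0." agree.
    normalized = [a.strip().lower().rstrip(".") for a in answers]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes


# Hypothetical short answers standing in for states["answer"].
sampled = ["0", "0.", " 0", "2", "0"]
best, votes = majority_answer(sampled)
print(f"{best} ({votes} of {len(sampled)} samples)")
```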


## Constrained Decoding

SGLang supports constrained decoding for structured outputs. The output format can be specified in the form of a regular expression.

*Note: This is only supported for local models.*

In [6]:
@function
def regular_expression_gen(s):
    s += user("What is the birth date of Albert Einstein?")
    s += assistant(
        gen(
            "answer",
            temperature=0,
            regex=r"\d{1,2}\/\d{1,2}\/\d{2,4}",
        )
    )


state = regular_expression_gen.run()
print_highlight(state["answer"])

[2025-02-13 21:03:23 TP0] PyTorch version 2.5.1 available.
[2025-02-13 21:03:24 TP0] Prefill batch. #new-seq: 1, #new-token: 11, #cached-token: 17, cache hit rate: 67.15%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-13 21:03:24 TP0] Decode batch. #running-req: 1, #token: 1, token usage: 0.00, gen throughput (token/s): 42.93, #queue-req: 0
[2025-02-13 21:03:24] INFO:     127.0.0.1:51990 - "POST /generate HTTP/1.1" 200 OK
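It helps to check which strings a pattern admits before using it as a constraint. The sketch below tests the date pattern from the cell above with Python's `re` module. The candidate strings are made up for illustration, and the matching semantics of the server-side regex engine are assumed to agree with `re.fullmatch` for this simple pattern.

```python
import re

# Same pattern as in regular_expression_gen above.
DATE_PATTERN = re.compile(r"\d{1,2}\/\d{1,2}\/\d{2,4}")

candidates = [
    "14/03/1879",      # matches: two digits / two digits / four digits
    "3/14/79",         # matches: one digit / two digits / two digits
    "99/99/99",        # matches: the pattern constrains shape, not calendar validity
    "March 14, 1879",  # no match: not in digits/digits/digits form
    "14/03/18790",     # no match: the year part allows at most four digits
]

results = {text: bool(DATE_PATTERN.fullmatch(text)) for text in candidates}
for text, ok in results.items():
    print(f"{text!r}: {ok}")
```

Note that the pattern fixes only the shape of the answer. It does not decide between day-first and month-first order, so the prompt has to settle that if it matters.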


Regular expressions can also be used for schema extraction.

In [7]:
import json

character_regex = (
    r"""\{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r"""    "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
    + r"""    "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
    + r"""    "wand": \{\n"""
    + r"""        "wood": "[\w\d\s]{1,16}",\n"""
    + r"""        "core": "[\w\d\s]{1,16}",\n"""
    + r"""        "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
    + r"""    \},\n"""
    + r"""    "alive": "(Alive|Deceased)",\n"""
    + r"""    "patronus": "[\w\d\s]{1,16}",\n"""
    + r"""    "bogart": "[\w\d\s]{1,16}"\n"""
    + r"""\}"""
)


@function
def generate_character(s, name):
    s += system(
        "You are a helpful assistant that extracts information about a character from a text."
    )
    s += user(f"Extract the relevant information about {name}.")
    s += assistant(gen("character", regex=character_regex, max_tokens=256))


state = generate_character.run(name="Harry Potter")
print_highlight(state["character"])

[2025-02-13 21:03:29 TP0] Prefill batch. #new-seq: 1, #new-token: 28, #cached-token: 8, cache hit rate: 64.92%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-13 21:03:30 TP0] Decode batch. #running-req: 1, #token: 76, token usage: 0.00, gen throughput (token/s): 7.61, #queue-req: 0
[2025-02-13 21:03:30 TP0] Decode batch. #running-req: 1, #token: 116, token usage: 0.00, gen throughput (token/s): 186.78, #queue-req: 0
[2025-02-13 21:03:30 TP0] Decode batch. #running-req: 1, #token: 156, token usage: 0.00, gen throughput (token/s): 212.16, #queue-req: 0
[2025-02-13 21:03:30] INFO:     127.0.0.1:52006 - "POST /generate HTTP/1.1" 200 OK
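A regex-constrained output is only useful as structured data if it also parses as JSON. The sketch below checks that with `json.loads`, using a reduced, hypothetical version of the pattern above (two fields only) and hand-written sample strings rather than model output. It also shows an edge case: `[0-9]{0,2}` after the decimal point admits a value such as `11.`, which matches the pattern but is not a valid JSON number.

```python
import json
import re

# Reduced illustration of character_regex: one string field and the length field.
WAND_REGEX = (
    r"""\{\n"""
    + r"""    "wood": "[\w\d\s]{1,16}",\n"""
    + r"""    "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
    + r"""\}"""
)


def parses_as_json(text: str) -> bool:
    """Return True if text is valid JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Hand-written samples, not model output.
good = '{\n    "wood": "Holly",\n    "length": 11.0\n}'
edge = '{\n    "wood": "Holly",\n    "length": 11.\n}'

for sample in (good, edge):
    matches = re.fullmatch(WAND_REGEX, sample) is not None
    print(f"matches regex: {matches}, valid JSON: {parses_as_json(sample)}")
```

If strict JSON is required, tightening the fractional part to `[0-9]{1,2}` would rule out the second case.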


## Batching

`run_batch` can be used to run prompts with continuous batching.

In [8]:
@function
def simple_qa(s, question: str):
    s += user(question)
    s += assistant(gen("answer", max_tokens=128, stop=["assistant"], temperature=0))


states = simple_qa.run_batch(
    [
        {"question": "Who was the first man on the moon?"},
        {"question": "Who was Lev Landau?"},
        {"question": "Please tell me a joke about a chicken."},
    ]
)

for state in states:
    print_highlight(f"Answer: {state['answer']}")
    print_highlight("-" * 50)

[2025-02-13 21:03:30 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 13, cache hit rate: 65.45%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-13 21:03:30] INFO:     127.0.0.1:55672 - "POST /generate HTTP/1.1" 200 OK
[2025-02-13 21:03:30 TP0] Prefill batch. #new-seq: 2, #new-token: 25, #cached-token: 28, cache hit rate: 64.60%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-13 21:03:30 TP0] Prefill batch. #new-seq: 1, #new-token: 14, #cached-token: 14, cache hit rate: 64.10%, token usage: 0.00, #running-req: 2, #queue-req: 0
[2025-02-13 21:03:30] INFO:     127.0.0.1:55704 - "POST /generate HTTP/1.1" 200 OK
[2025-02-13 21:03:30 TP0] Decode batch. #running-req: 2, #token: 89, token usage: 0.00, gen throughput (token/s): 325.52, #queue-req: 0
[2025-02-13 21:03:30] INFO:     127.0.0.1:55684 - "POST /generate HTTP/1.1" 200 OK
[2025-02-13 21:03:30 TP0] Decode batch. #running-req: 1, #token: 91, token usage: 0.00, gen throughput (token/s): 321.15, #queu

## Streaming

`stream` can be used to stream the response from the model.

*Note: We use* `print_highlight` *here to keep the color convention. In practice, we would use* `print(out, end="", flush=True)` *to stream the response.*

In [9]:
@function
def stream_qa(s, question: str):
    s += user(question)
    s += assistant(gen("answer", max_tokens=128, stop=["assistant"], temperature=0))


state = stream_qa.run(question="Who was the first man on the moon?", stream=True)
for out in state.text_iter():
    print_highlight(out)

[2025-02-13 21:03:31] INFO:     127.0.0.1:55714 - "POST /generate HTTP/1.1" 200 OK
[2025-02-13 21:03:31 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 27, cache hit rate: 65.17%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2025-02-13 21:03:31 TP0] Decode batch. #running-req: 1, #token: 46, token usage: 0.00, gen throughput (token/s): 201.85, #queue-req: 0
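The note above mentions `print(out, end="", flush=True)` as the usual way to display a stream. The sketch below shows that loop in isolation and also collects the chunks into the full text. `fake_text_iter` is a hypothetical stand-in for `state.text_iter()` so that the example runs without a server; the chunk boundaries are made up.

```python
from typing import Iterable, Iterator


def fake_text_iter() -> Iterator[str]:
    """Hypothetical stand-in for state.text_iter(): yields text chunks in order."""
    yield from ["Neil ", "Armstrong ", "was the first ", "man on the moon."]


def stream_to_stdout(chunks: Iterable[str]) -> str:
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        # end="" keeps the chunks on one line; flush=True shows them immediately.
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream is exhausted
    return "".join(parts)


full_text = stream_to_stdout(fake_text_iter())
```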


## Roles

`[user|assistant|system]_[begin|end]` can be used to define more complex prompts.

In [10]:
from sglang import user_begin, user_end, assistant_begin, assistant_end


@function
def roles(s):
    s += system(
        "You talk like a pirate and frequently use phrases like 'arrr' and 'yo-ho-ho'."
    )
    s += user_begin()
    s += "Hello, how do you like life as a pirate?"
    s += user_end()
    s += assistant_begin()
    s += "There is much to tell about the life of a pirate." + gen(
        "story", max_tokens=128
    )
    s += assistant_end()


state = roles.run()
print_highlight(state["story"])

[2025-02-13 21:03:31 TP0] Prefill batch. #new-seq: 1, #new-token: 54, #cached-token: 4, cache hit rate: 61.44%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-13 21:03:31 TP0] Decode batch. #running-req: 1, #token: 80, token usage: 0.00, gen throughput (token/s): 209.65, #queue-req: 0


[2025-02-13 21:03:31 TP0] Decode batch. #running-req: 1, #token: 120, token usage: 0.00, gen throughput (token/s): 244.86, #queue-req: 0
[2025-02-13 21:03:31] INFO:     127.0.0.1:55718 - "POST /generate HTTP/1.1" 200 OK


In [11]:
terminate_process(server_process)

## Multi-modal

SGLang supports a variety of [multi-modal models](https://docs.sglang.ai/backend/openai_api_vision.html).

In [12]:
server_process = execute_shell_command(
    "python3 -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --mem-fraction-static 0.8 --port 30333 --host 0.0.0.0"
)
wait_for_server("http://localhost:30333")

INFO 02-13 21:03:37 __init__.py:190] Automatically detected platform cuda.
[2025-02-13 21:03:42] server_args=ServerArgs(model_path='lmms-lab/llama3-llava-next-8b', tokenizer_path='lmms-lab/llama3-llava-next-8b', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='lmms-lab/llama3-llava-next-8b', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30333, mem_fraction_static=0.8, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, stream_output=False, random_seed=883450856, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None

In [13]:
set_default_backend(RuntimeEndpoint("http://localhost:30333"))

[2025-02-13 21:04:14] INFO:     127.0.0.1:57504 - "GET /get_model_info HTTP/1.1" 200 OK


Use `image` to pass an image to the model.

In [14]:
!wget -O example_image.png https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true

--2025-02-13 21:04:14--  https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/sgl-project/sglang/raw/refs/heads/main/test/lang/example_image.png [following]
--2025-02-13 21:04:14--  https://github.com/sgl-project/sglang/raw/refs/heads/main/test/lang/example_image.png
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/sgl-project/sglang/refs/heads/main/test/lang/example_image.png [following]
--2025-02-13 21:04:15--  https://raw.githubusercontent.com/sgl-project/sglang/refs/heads/main/test/lang/example_image.png
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57365 (56K) [image/png]
Saving

In [15]:
from sglang import image


@function
def image_qa(s, image_file, question):
    s += user(image(image_file) + question)
    s += assistant(gen("answer", max_tokens=128, stop=["assistant"]))


state = image_qa.run(
    image_file="example_image.png", question="Describe the image in one short sentence."
)
print_highlight(state["answer"])

[2025-02-13 21:04:18 TP0] Prefill batch. #new-seq: 1, #new-token: 2158, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-13 21:04:19] INFO:     127.0.0.1:57510 - "POST /generate HTTP/1.1" 200 OK


In [16]:
!rm example_image.png

In [17]:
terminate_process(server_process)

## Going further

To get more familiar with SGLang, we recommend studying the [benchmark scripts](https://github.com/sgl-project/sglang/tree/main/benchmark).