# Constrained Decoding

With SGLang, you can constrain the model's output with a JSON schema, an EBNF grammar, or a regular expression.

[JSON Schema](https://json-schema.org/): Formats output into structured JSON objects with validation rules.

[EBNF (Extended Backus-Naur Form)](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form): Defines complex syntax rules, especially for recursive patterns like nested structures.

[Regular Expressions](https://en.wikipedia.org/wiki/Regular_expression): Matches text patterns for simple validation and formatting.
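
To make the idea of constraining output concrete, the toy sketch below masks candidates at every decoding step so that only continuations allowed by the pattern `(Paris|London)` survive. It is an illustration under simplifying assumptions, not how SGLang or its backends are implemented: it works on characters instead of tokenizer tokens, the names `ALLOWED`, `allowed_next_chars`, and `constrained_greedy` are hypothetical, and the scores are made up.

```python
from typing import Dict, List

# The finite language described by the regular expression (Paris|London).
ALLOWED = ["Paris", "London"]


def allowed_next_chars(prefix: str) -> List[str]:
    """Return the characters that keep `prefix` extendable to an allowed string."""
    return sorted(
        {s[len(prefix)] for s in ALLOWED if s.startswith(prefix) and len(s) > len(prefix)}
    )


def constrained_greedy(scores: Dict[str, float]) -> str:
    """Greedy decoding where disallowed characters are masked out at each step."""
    out = ""
    while out not in ALLOWED:
        candidates = allowed_next_chars(out)
        # Pick the best-scoring character among the allowed ones only.
        out += max(candidates, key=lambda c: scores.get(c, 0.0))
    return out


# "x" has the highest (made-up) score but is never allowed, so it is never emitted.
print(constrained_greedy({"L": 0.9, "P": 0.1, "x": 5.0}))  # London
```

The unconstrained choice would have been `x`; the mask is what forces the output into the allowed set.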

## Grammar Backend

SGLang supports two grammar backends: [Outlines](https://github.com/dottxt-ai/outlines) (the default) and [XGrammar](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar). We recommend XGrammar whenever it supports the constraint you need, because it performs better. For details, see the [XGrammar technical overview](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar).

* XGrammar backend: JSON and EBNF
* Outlines backend: JSON and regular expressions
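
The support list above can be written down as a small lookup, which may help when deciding which backend to launch for a given constraint. This is only a sketch: `SUPPORTED` and `backends_for` are hypothetical names, and the table restates the list above without querying SGLang.

```python
from typing import List

# Constraint types per backend, as listed in this document.
SUPPORTED = {
    "xgrammar": {"json_schema", "ebnf"},
    "outlines": {"json_schema", "regex"},
}


def backends_for(constraint: str) -> List[str]:
    """Return the backends that support the given constraint type."""
    return sorted(name for name, kinds in SUPPORTED.items() if constraint in kinds)


print(backends_for("ebnf"))         # ['xgrammar']
print(backends_for("regex"))        # ['outlines']
print(backends_for("json_schema"))  # ['outlines', 'xgrammar']
```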

## OpenAI Compatible API

To use XGrammar, add `--grammar-backend xgrammar` when launching the server. If no backend is specified, Outlines is used by default.

In [1]:
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)
import openai

server_process = execute_shell_command(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --host 0.0.0.0 --grammar-backend xgrammar"
)

wait_for_server("http://localhost:30000")
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

[2024-12-29 18:39:47] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=146749565, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=Non

[2024-12-29 18:40:01 TP0] Init torch distributed begin.


[2024-12-29 18:40:01 TP0] Load weight begin. avail mem=78.81 GB


[2024-12-29 18:40:02 TP0] Using model weights format ['*.safetensors']



[2024-12-29 18:40:05 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.72 GB
[2024-12-29 18:40:05 TP0] Memory pool end. avail mem=8.34 GB


[2024-12-29 18:40:05 TP0] Capture cuda graph begin. This can take up to several minutes.
[2024-12-29 18:40:12 TP0] Capture cuda graph end. Time elapsed: 6.64 s


[2024-12-29 18:40:12 TP0] max_total_num_tokens=444500, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072


[2024-12-29 18:40:13] INFO:     Started server process [2781418]
[2024-12-29 18:40:13] INFO:     Waiting for application startup.
[2024-12-29 18:40:13] INFO:     Application startup complete.
[2024-12-29 18:40:13] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)


[2024-12-29 18:40:13] INFO:     127.0.0.1:53676 - "GET /v1/models HTTP/1.1" 200 OK


[2024-12-29 18:40:13] INFO:     127.0.0.1:53690 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-12-29 18:40:14 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2024-12-29 18:40:14] INFO:     127.0.0.1:53704 - "POST /generate HTTP/1.1" 200 OK
[2024-12-29 18:40:14] The server is fired up and ready to roll!


### JSON

In [2]:
import json

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Give me the information of the capital of France in the JSON format.",
        },
    ],
    temperature=0,
    max_tokens=128,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "foo", "schema": json.loads(json_schema)},
    },
)

print_highlight(response.choices[0].message.content)

[2024-12-29 18:40:18 TP0] Prefill batch. #new-seq: 1, #new-token: 48, #cached-token: 1, cache hit rate: 1.79%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-12-29 18:40:18] INFO:     127.0.0.1:53712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
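
Because the returned message content is a JSON string, it can be parsed and re-checked on the client side. The sketch below is a hand-rolled check that covers only the keywords used in the schema above; it is not a full JSON Schema validator. `parse_city` is a hypothetical helper, and `sample` is a placeholder standing in for `response.choices[0].message.content`, not real model output.

```python
import json
import re

SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "pattern": "^[\\w]+$"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
}


def parse_city(text: str) -> dict:
    """Parse `text` as JSON and check it against the small schema above."""
    obj = json.loads(text)
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in SCHEMA["required"] if key not in obj]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    name, population = obj["name"], obj["population"]
    pattern = SCHEMA["properties"]["name"]["pattern"]
    if not isinstance(name, str) or re.search(pattern, name) is None:
        raise ValueError("name must be a string of word characters")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(population, bool) or not isinstance(population, int):
        raise ValueError("population must be an integer")
    return obj


# Placeholder input; the population value is a dummy, not a model result.
sample = '{"name": "Paris", "population": 1}'
print(parse_city(sample))
```

With constrained decoding the check should normally pass, but it still catches outputs cut short by `max_tokens`, which can leave the JSON incomplete.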


### EBNF

In [3]:
ebnf_grammar = """
root ::= city | description
city ::= "London" | "Paris" | "Berlin" | "Rome"
description ::= city " is " status
status ::= "the capital of " country
country ::= "England" | "France" | "Germany" | "Italy"
"""

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful geography bot."},
        {
            "role": "user",
            "content": "Give me the information of the capital of France.",
        },
    ],
    temperature=0,
    max_tokens=32,
    extra_body={"ebnf": ebnf_grammar},
)

print_highlight(response.choices[0].message.content)

[2024-12-29 18:40:18 TP0] Prefill batch. #new-seq: 1, #new-token: 27, #cached-token: 25, cache hit rate: 24.07%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2024-12-29 18:40:18] INFO:     127.0.0.1:53712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
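
The grammar above is small and non-recursive, so every string it accepts can be enumerated. The sketch below does that with a deliberately minimal expander. It assumes one rule per line, alternatives separated by `|`, and literals in double quotes without `|` inside them; it is not a general EBNF parser and has nothing to do with XGrammar's implementation. `parse_rules` and `expand` are hypothetical names.

```python
import itertools
import re
from typing import Dict, List

EBNF = """
root ::= city | description
city ::= "London" | "Paris" | "Berlin" | "Rome"
description ::= city " is " status
status ::= "the capital of " country
country ::= "England" | "France" | "Germany" | "Italy"
"""


def parse_rules(text: str) -> Dict[str, List[List[str]]]:
    """Map each rule name to its alternatives, each a list of symbols."""
    rules = {}
    for line in text.strip().splitlines():
        name, rhs = line.split("::=")
        # Each symbol is either a quoted literal or a bare rule name.
        rules[name.strip()] = [re.findall(r'"[^"]*"|\w+', alt) for alt in rhs.split("|")]
    return rules


def expand(symbol: str, rules: Dict[str, List[List[str]]]) -> List[str]:
    """Return every string derivable from `symbol` (finite grammars only)."""
    if symbol.startswith('"'):
        return [symbol[1:-1]]
    results = []
    for alternative in rules[symbol]:
        parts = [expand(token, rules) for token in alternative]
        results.extend("".join(combo) for combo in itertools.product(*parts))
    return results


language = expand("root", parse_rules(EBNF))
print(len(language))  # 20
print("Paris is the capital of France" in language)  # True
print("London is the capital of Italy" in language)  # True
```

The last line shows that the grammar constrains the form of the answer, not its correctness: choosing the factually right string among the 20 is still up to the model.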


In [4]:
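# Regular expressions are handled by the Outlines backend (the default),
# so restart the server without the --grammar-backend flag.
terminate_process(server_process)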
server_process = execute_shell_command(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --host 0.0.0.0"
)

wait_for_server("http://localhost:30000")

[2024-12-29 18:40:27] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=1067156548, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=No

[2024-12-29 18:40:41 TP0] Init torch distributed begin.


[2024-12-29 18:40:41 TP0] Load weight begin. avail mem=78.81 GB


[2024-12-29 18:40:42 TP0] Using model weights format ['*.safetensors']



[2024-12-29 18:40:45 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.72 GB
[2024-12-29 18:40:45 TP0] Memory pool end. avail mem=8.34 GB


[2024-12-29 18:40:46 TP0] Capture cuda graph begin. This can take up to several minutes.
[2024-12-29 18:40:52 TP0] Capture cuda graph end. Time elapsed: 6.74 s


[2024-12-29 18:40:53 TP0] max_total_num_tokens=444500, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2024-12-29 18:40:53] INFO:     Started server process [2782386]
[2024-12-29 18:40:53] INFO:     Waiting for application startup.
[2024-12-29 18:40:53] INFO:     Application startup complete.
[2024-12-29 18:40:53] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2024-12-29 18:40:53] INFO:     127.0.0.1:52090 - "GET /v1/models HTTP/1.1" 200 OK


[2024-12-29 18:40:54] INFO:     127.0.0.1:52098 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-12-29 18:40:54 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2024-12-29 18:40:54] INFO:     127.0.0.1:52114 - "POST /generate HTTP/1.1" 200 OK
[2024-12-29 18:40:54] The server is fired up and ready to roll!


### Regular expression

In [5]:
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=128,
    extra_body={"regex": "(Paris|London)"},
)

print_highlight(response.choices[0].message.content)

[2024-12-29 18:40:58 TP0] Prefill batch. #new-seq: 1, #new-token: 41, #cached-token: 1, cache hit rate: 2.04%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-12-29 18:40:58] INFO:     127.0.0.1:52122 - "POST /v1/chat/completions HTTP/1.1" 200 OK
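
A pattern such as `(Paris|London)` can also be generated from a list of options. The sketch below escapes each option so that characters like `.` are treated literally, then checks candidate strings locally with Python's `re`. `choice_regex` is a hypothetical helper, and the sketch assumes the grammar backend interprets these escapes the same way Python does, which is worth verifying for options that contain special characters.

```python
import re
from typing import Iterable


def choice_regex(options: Iterable[str]) -> str:
    """Build a pattern that matches exactly one of `options`."""
    return "(" + "|".join(re.escape(option) for option in options) + ")"


pattern = choice_regex(["Paris", "London"])
print(pattern)  # (Paris|London)

# The whole output has to match, so extra text is rejected.
print(re.fullmatch(pattern, "Paris") is not None)          # True
print(re.fullmatch(pattern, "Paris, France") is not None)  # False
```

The resulting string can be passed as the `regex` value in `extra_body`, as in the request above.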


In [6]:
terminate_process(server_process)

## Native API and SGLang Runtime (SRT)

In [7]:
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)

import requests

server_process = execute_shell_command(
    """
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --port=30010 --grammar-backend xgrammar
"""
)

wait_for_server("http://localhost:30010")

[2024-12-29 18:41:07] server_args=ServerArgs(model_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.2-1B-Instruct', chat_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=30010, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=597254193, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_stora

[2024-12-29 18:41:20 TP0] Init torch distributed begin.


[2024-12-29 18:41:20 TP0] Load weight begin. avail mem=78.81 GB


[2024-12-29 18:41:21 TP0] Using model weights format ['*.safetensors']
[2024-12-29 18:41:21 TP0] No model.safetensors.index.json found in remote.

[2024-12-29 18:41:21 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=76.39 GB
[2024-12-29 18:41:21 TP0] Memory pool end. avail mem=7.45 GB
[2024-12-29 18:41:22 TP0] Capture cuda graph begin. This can take up to several minutes.
[2024-12-29 18:41:29 TP0] Capture cuda graph end. Time elapsed: 7.07 s


[2024-12-29 18:41:29 TP0] max_total_num_tokens=2193171, max_prefill_tokens=16384, max_running_requests=4097, context_len=131072


[2024-12-29 18:41:30] INFO:     Started server process [2783341]
[2024-12-29 18:41:30] INFO:     Waiting for application startup.
[2024-12-29 18:41:30] INFO:     Application startup complete.
[2024-12-29 18:41:30] INFO:     Uvicorn running on http://127.0.0.1:30010 (Press CTRL+C to quit)


[2024-12-29 18:41:30] INFO:     127.0.0.1:42954 - "GET /v1/models HTTP/1.1" 200 OK


[2024-12-29 18:41:31] INFO:     127.0.0.1:42970 - "GET /get_model_info HTTP/1.1" 200 OK


[2024-12-29 18:41:31 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2024-12-29 18:41:31] INFO:     127.0.0.1:42980 - "POST /generate HTTP/1.1" 200 OK
[2024-12-29 18:41:31] The server is fired up and ready to roll!


### JSON

In [8]:
import json
import requests

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

# JSON
response = requests.post(
    "http://localhost:30010/generate",
    json={
        "text": "Here is the information of the capital of France in the JSON format.\n",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "json_schema": json_schema,
        },
    },
)

print_highlight(response.json())

[2024-12-29 18:41:35 TP0] Prefill batch. #new-seq: 1, #new-token: 14, #cached-token: 1, cache hit rate: 4.55%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-12-29 18:41:35] INFO:     127.0.0.1:55222 - "POST /generate HTTP/1.1" 200 OK


### EBNF

In [9]:
import requests

response = requests.post(
    "http://localhost:30010/generate",
    json={
        "text": "Give me the information of the capital of France.",
        "sampling_params": {
            "max_new_tokens": 128,
            "temperature": 0,
            "n": 3,
            "ebnf": (
                "root ::= city | description\n"
                'city ::= "London" | "Paris" | "Berlin" | "Rome"\n'
                'description ::= city " is " status\n'
                'status ::= "the capital of " country\n'
                'country ::= "England" | "France" | "Germany" | "Italy"'
            ),
        },
        "stream": False,
        "return_logprob": False,
    },
)

print_highlight(response.json())

[2024-12-29 18:41:35 TP0] Prefill batch. #new-seq: 1, #new-token: 10, #cached-token: 1, cache hit rate: 6.06%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-12-29 18:41:35 TP0] Prefill batch. #new-seq: 3, #new-token: 3, #cached-token: 30, cache hit rate: 48.48%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-12-29 18:41:35] INFO:     127.0.0.1:55236 - "POST /generate HTTP/1.1" 200 OK
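
With `"n": 3` the request asks for three samples, so it can help to normalize the response before using it. The sketch below assumes that the JSON body of `/generate` is either a single object with a `"text"` field or a list of such objects when several samples are returned. That shape is an assumption to verify against your SGLang version; `extract_texts` is a hypothetical helper, and the payloads are stand-ins, not real server output.

```python
from typing import Any, List


def extract_texts(payload: Any) -> List[str]:
    """Return the generated strings from a single result or a list of results."""
    results = payload if isinstance(payload, list) else [payload]
    return [item["text"] for item in results]


# Stand-in payloads shaped like the assumed response formats.
single = {"text": "Paris"}
several = [{"text": "Paris"}, {"text": "Paris is the capital of France"}]

print(extract_texts(single))   # ['Paris']
print(extract_texts(several))  # ['Paris', 'Paris is the capital of France']
```

In the cell above this would be called as `extract_texts(response.json())`.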


In [10]:
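# Restart without --grammar-backend so that the default Outlines backend,
# which supports regular expressions, is used.
terminate_process(server_process)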
server_process = execute_shell_command(
    """
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --port=30010
"""
)

wait_for_server("http://localhost:30010")

[2024-12-29 18:41:43] server_args=ServerArgs(model_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.2-1B-Instruct', chat_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=30010, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=280398754, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_stora

[2024-12-29 18:41:56 TP0] Init torch distributed begin.


[2024-12-29 18:41:56 TP0] Load weight begin. avail mem=78.81 GB


[2024-12-29 18:41:57 TP0] Using model weights format ['*.safetensors']
[2024-12-29 18:41:57 TP0] No model.safetensors.index.json found in remote.

[2024-12-29 18:41:57 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=76.39 GB
[2024-12-29 18:41:57 TP0] Memory pool end. avail mem=7.45 GB
[2024-12-29 18:41:58 TP0] Capture cuda graph begin. This can take up to several minutes.
[2024-12-29 18:42:05 TP0] Capture cuda graph end. Time elapsed: 7.03 s


[2024-12-29 18:42:05 TP0] max_total_num_tokens=2193171, max_prefill_tokens=16384, max_running_requests=4097, context_len=131072
[2024-12-29 18:42:05] INFO:     Started server process [2784308]
[2024-12-29 18:42:05] INFO:     Waiting for application startup.
[2024-12-29 18:42:05] INFO:     Application startup complete.
[2024-12-29 18:42:05] INFO:     Uvicorn running on http://127.0.0.1:30010 (Press CTRL+C to quit)


[2024-12-29 18:42:06] INFO:     127.0.0.1:55570 - "GET /v1/models HTTP/1.1" 200 OK


[2024-12-29 18:42:06] INFO:     127.0.0.1:55578 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-12-29 18:42:06 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0


[2024-12-29 18:42:07] INFO:     127.0.0.1:55588 - "POST /generate HTTP/1.1" 200 OK
[2024-12-29 18:42:07] The server is fired up and ready to roll!


### Regular expression

In [11]:
response = requests.post(
    "http://localhost:30010/generate",
    json={
        "text": "Paris is the capital of",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "regex": "(France|England)",
        },
    },
)
print_highlight(response.json())

[2024-12-29 18:42:11 TP0] Prefill batch. #new-seq: 1, #new-token: 5, #cached-token: 1, cache hit rate: 7.69%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-12-29 18:42:11] INFO:     127.0.0.1:55600 - "POST /generate HTTP/1.1" 200 OK


In [12]:
terminate_process(server_process)

## Offline Engine API

In [13]:
import sglang as sgl

llm_xgrammar = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", grammar_backend="xgrammar"
)



### JSON

In [14]:
import json

prompts = [
    "Give me the information of the capital of China in the JSON format.",
    "Give me the information of the capital of France in the JSON format.",
    "Give me the information of the capital of Ireland in the JSON format.",
]

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

sampling_params = {"temperature": 0.1, "top_p": 0.95, "json_schema": json_schema}

outputs = llm_xgrammar.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")

### EBNF


In [15]:
prompts = [
    "Give me the information of the capital of France.",
    "Give me the information of the capital of Germany.",
    "Give me the information of the capital of Italy.",
]

sampling_params = {
    "temperature": 0.8,
    "top_p": 0.95,
    "ebnf": (
        "root ::= city | description\n"
        'city ::= "London" | "Paris" | "Berlin" | "Rome"\n'
        'description ::= city " is " status\n'
        'status ::= "the capital of " country\n'
        'country ::= "England" | "France" | "Germany" | "Italy"'
    ),
}

outputs = llm_xgrammar.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")

In [16]:
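# Shut down the XGrammar engine and start one with the default Outlines
# backend, which is the backend that supports regular expressions.
llm_xgrammar.shutdown()
llm_outlines = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")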

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




### Regular expression

In [17]:
prompts = [
    "Please provide information about London as a major global city:",
    "Please provide information about Paris as a major global city:",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "regex": "(France|England)"}

outputs = llm_outlines.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")

In [18]:
llm_outlines.shutdown()