# Constrained Decoding Tutorial

This tutorial demonstrates how to use constrained decoding with SGLang, which allows you to control the model's output format using JSON schemas and regular expressions.

Throughout this tutorial, we will explore constrained decoding across multiple interfaces: the OpenAI API, the Native API/SGLang Runtime (SRT), and the Offline Engine API.

## Constrained Decoding

As language models are increasingly used inside agent systems, their structured outputs (such as JSON, SQL, or Python) must conform to predefined rules so that downstream components can process them. Grammar-constrained decoding enforces those rules during generation.

### Constrained Decoding Formats

With SGLang, you can define a JSON schema, an EBNF grammar, or a regular expression to constrain the model's output.

**JSON Schema**
A structured format that defines the expected shape and validation rules for JSON data. It's ideal for creating structured outputs like API responses or data objects, but has limitations with recursive structures.

**EBNF (Extended Backus-Naur Form)**
A formal notation system that describes the syntax of programming languages and complex structures. It excels at defining recursive patterns (like nested brackets) and formal language syntax, making it more powerful than JSON schemas for complex structural patterns.

**Regular Expressions**
A sequence of characters that defines a search pattern. While simpler than EBNF, regular expressions are effective for pattern matching and text validation, though they cannot handle recursive structures.
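
To make the difference between flat and recursive patterns concrete, here is a minimal standalone sketch in plain Python (it does not call SGLang). The date pattern and the `is_balanced` helper are illustrative names chosen for this example, not part of any backend. A regular expression is enough for the flat date format, while arbitrarily nested brackets need a depth counter, which corresponds to a recursive grammar rule such as `nested ::= ("[" nested "]")*`.

```python
import re

# A flat pattern: a regular expression is enough.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_balanced(text: str) -> bool:
    """Accept strings of arbitrarily nested square brackets.

    Tracking the nesting depth is what a recursive rule like
    nested ::= ("[" nested "]")* expresses in a grammar.
    """
    depth = 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:  # a closing bracket with no matching opener
                return False
        else:
            return False  # only brackets are allowed in this toy language
    return depth == 0


print(DATE_PATTERN.fullmatch("2024-11-22") is not None)  # True
print(is_balanced("[[][[]]]"))  # True
print(is_balanced("[[]"))  # False
```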

### Constrained Decoding Backends

SGLang has two backends: [Outlines](https://github.com/dottxt-ai/outlines) (default) and [XGrammar](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar).

Our backends support these formats as follows:
* XGrammar backend: JSON and EBNF (offering accelerated decoding performance)
* Outlines backend: JSON and regular expressions

The choice between these formats depends on your specific needs:
* For basic structured data: JSON Schema
* For complex recursive patterns: EBNF
* For simple pattern matching: Regular expressions

### Performance Optimization with XGrammar

XGrammar is reported to deliver up to a 3.5x speedup on JSON schema workloads and up to 10x on context-free grammar (CFG) tasks. For details, see the [XGrammar technical overview](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar).

## Code Examples

In each of the following sections, we first use the XGrammar backend for JSON schemas and EBNF grammars, then switch to the Outlines backend for regular expressions.

Only one of the following three parameters can be set at a time:

* `regex: Optional[str] = None`
* `json_schema: Optional[str] = None`
* `ebnf: Optional[str] = None`
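
The mutual-exclusivity rule above can be checked on the client side before a request is sent. The following is a minimal sketch; `pick_constraint` is a hypothetical helper written for this tutorial, not an SGLang function, and how SGLang itself reacts to conflicting parameters is not shown here.

```python
from typing import Optional, Tuple


def pick_constraint(
    regex: Optional[str] = None,
    json_schema: Optional[str] = None,
    ebnf: Optional[str] = None,
) -> Optional[Tuple[str, str]]:
    """Return (name, value) of the single constraint that is set, or None.

    Hypothetical client-side check: raises ValueError if more than one
    of the three constraint parameters is provided.
    """
    candidates = {"regex": regex, "json_schema": json_schema, "ebnf": ebnf}
    given = {name: value for name, value in candidates.items() if value is not None}
    if len(given) > 1:
        raise ValueError(f"only one constraint may be set, got: {sorted(given)}")
    return next(iter(given.items()), None)


print(pick_constraint(regex="(Paris|London)"))  # ('regex', '(Paris|London)')
print(pick_constraint())  # None
```

The returned pair can be placed directly into a sampling-parameter dictionary, for example `dict([pick_constraint(regex="(Paris|London)")])`.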

To use XGrammar, add `--grammar-backend xgrammar` when launching the server. If no backend is specified, Outlines is used by default.

In [None]:
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)

## OpenAI API examples

To use XGrammar:
```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0 --grammar-backend xgrammar
```

The following code block is equivalent to running the above command in your terminal and waiting for the server to be ready.

In [None]:
server_process = execute_shell_command(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --host 0.0.0.0 --grammar-backend xgrammar"
)

wait_for_server("http://localhost:30000")

In [None]:
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

### JSON

In [None]:
import json

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Give me the information of the capital of France in the JSON format.",
        },
    ],
    temperature=0,
    max_tokens=128,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "foo", "schema": json.loads(json_schema)},
    },
)

print_highlight(response.choices[0].message.content)
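
Even with a schema constraint, it can be worth checking the returned string before handing it to downstream code, for example because generation may stop at `max_tokens` before the JSON object is complete. The sketch below checks a string against the schema above by hand, using only the standard library. `check_city_record` is a hypothetical helper, and `sample` is a made-up stand-in for `response.choices[0].message.content`, not real model output.

```python
import json
import re


def check_city_record(text: str) -> dict:
    """Parse `text` and check it against the tutorial's city schema by hand."""
    # json.loads raises json.JSONDecodeError (a ValueError) on truncated JSON.
    record = json.loads(text)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in ("name", "population") if key not in record]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    name, population = record["name"], record["population"]
    # Approximates the schema pattern ^[\w]+$ with Python's re module.
    if not isinstance(name, str) or re.fullmatch(r"\w+", name) is None:
        raise ValueError("name must consist of word characters only")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(population, bool) or not isinstance(population, int):
        raise ValueError("population must be an integer")
    return record


sample = '{"name": "Paris", "population": 2147000}'  # hypothetical output
print(check_city_record(sample))
```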

### EBNF

In [None]:
# Define EBNF grammar for capital city descriptions
ebnf_grammar = """
root ::= city | description
city ::= "London" | "Paris" | "Berlin" | "Rome"
description ::= city " is " status
status ::= "the capital of " country
country ::= "England" | "France" | "Germany" | "Italy"
"""

# Generate completion with EBNF constraint
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful geography bot."},
        {
            "role": "user",
            "content": "Give me the information of the capital of France.",
        },
    ],
    temperature=0,
    max_tokens=32,
    extra_body={"ebnf": ebnf_grammar},  # EBNF is passed through extra_body
)

print_highlight(response.choices[0].message.content)
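
Because this grammar has no recursion, the set of strings it accepts is finite and can be enumerated directly. The sketch below rebuilds that set in plain Python (no SGLang call) by expanding the rules by hand. It shows that the constraint guarantees the form of the output, not its factual correctness.

```python
from itertools import product

# Terminals of the `city` and `country` rules.
cities = ["London", "Paris", "Berlin", "Rome"]
countries = ["England", "France", "Germany", "Italy"]

# description ::= city " is " status, with status ::= "the capital of " country
descriptions = [
    f"{city} is the capital of {country}"
    for city, country in product(cities, countries)
]

# root ::= city | description
language = set(cities) | set(descriptions)

print(len(language))  # 20
print("Paris is the capital of France" in language)  # True
# Grammatical under the constraint, but factually wrong:
print("Paris is the capital of Germany" in language)  # True
```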

Now we'll switch to the Outlines backend.

In [None]:
terminate_process(server_process)

server_process = execute_shell_command(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --host 0.0.0.0"
)

wait_for_server("http://localhost:30000")

### Regular expression

In [None]:
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=128,
    extra_body={"regex": "(Paris|London)"},
)

print_highlight(response.choices[0].message.content)
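
When the allowed answers come from a list, the alternation pattern can be built programmatically. This is a minimal sketch; `choices_to_regex` is a hypothetical helper, and it escapes literals according to Python's `re` syntax, which is assumed (not verified here) to be compatible with the backend's regex dialect for simple cases like this one.

```python
import re
from typing import Iterable


def choices_to_regex(choices: Iterable[str]) -> str:
    """Build an alternation such as "(Paris|London)" from literal choices."""
    escaped = [re.escape(choice) for choice in choices]
    if not escaped:
        raise ValueError("at least one choice is required")
    return "(" + "|".join(escaped) + ")"


pattern = choices_to_regex(["Paris", "London"])
print(pattern)  # (Paris|London)
print(re.fullmatch(pattern, "Paris") is not None)  # True
print(re.fullmatch(pattern, "Parisian") is not None)  # False
```

The resulting string can then be passed as the `regex` value in `extra_body`, as in the request above.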

In [None]:
terminate_process(server_process)

## Native API/SGLang Runtime (SRT) Examples

Start a server with the XGrammar backend.

In [None]:
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)

import requests

server_process = execute_shell_command(
    """
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --port=30010 --grammar-backend xgrammar
"""
)

wait_for_server("http://localhost:30010")

### JSON

In [None]:
import json
import requests

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

# JSON
response = requests.post(
    "http://localhost:30010/generate",
    json={
        "text": "Here is the information of the capital of France in the JSON format.\n",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "json_schema": json_schema,
        },
    },
)
print(response.json())

### EBNF

In [None]:
import requests

response = requests.post(
    "http://localhost:30010/generate",
    json={
        "text": "Give me the information of the capital of France.",
        "sampling_params": {
            "max_new_tokens": 128,
            "temperature": 0,
            "n": 3,
            "ebnf": (
                "root ::= city | description\n"
                'city ::= "London" | "Paris" | "Berlin" | "Rome"\n'
                'description ::= city " is " status\n'
                'status ::= "the capital of " country\n'
                'country ::= "England" | "France" | "Germany" | "Italy"'
            ),
        },
        "stream": False,
        "return_logprob": False,
    },
)

print(response.json())
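
The request above sets `"n": 3`, so more than one completion may come back. The sketch below normalizes the parsed response into a list of strings. It assumes that each completion is a dictionary with a `"text"` key and that multiple completions arrive as a list of such dictionaries; verify this against the actual `response.json()` output before relying on it. `extract_texts` and both sample payloads are hypothetical.

```python
from typing import Any, List


def extract_texts(payload: Any) -> List[str]:
    """Return the generated strings from a parsed /generate response.

    Assumption: a single completion is a dict with a "text" key, and
    several completions are a list of such dicts.
    """
    items = payload if isinstance(payload, list) else [payload]
    return [item["text"] for item in items]


# Hypothetical payloads standing in for response.json().
single = {"text": "Paris is the capital of France"}
multiple = [{"text": "Paris"}, {"text": "Paris is the capital of France"}]

print(extract_texts(single))
print(extract_texts(multiple))
```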

Now we'll switch to the Outlines backend.

In [None]:
terminate_process(server_process)

server_process = execute_shell_command(
    """
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --port=30010
"""
)

wait_for_server("http://localhost:30010")

### Regular expression

In [None]:
response = requests.post(
    "http://localhost:30010/generate",
    json={
        "text": "Paris is the capital of",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "regex": "(France|England)",
        },
    },
)
print(response.json())

In [None]:
terminate_process(server_process)

## Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead.

As in the previous examples, we start with the XGrammar backend.

In [None]:
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", grammar_backend="xgrammar"
)

### JSON

In [None]:
import json

prompts = [
    "Give me the information of the capital of China in the JSON format.",
    "Give me the information of the capital of France in the JSON format.",
    "Give me the information of the capital of Ireland in the JSON format.",
]

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

sampling_params = {"temperature": 0.1, "top_p": 0.95, "json_schema": json_schema}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

### EBNF


In [None]:
prompts = [
    "Give me the information of the capital of France.",
    "Give me the information of the capital of Germany.",
    "Give me the information of the capital of Italy.",
]

sampling_params = {
    "temperature": 0.8,
    "top_p": 0.95,
    "ebnf": (
        "root ::= city | description\n"
        'city ::= "London" | "Paris" | "Berlin" | "Rome"\n'
        'description ::= city " is " status\n'
        'status ::= "the capital of " country\n'
        'country ::= "England" | "France" | "Germany" | "Italy"'
    ),
}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

In [None]:
llm.shutdown()

Now we'll switch to the Outlines backend.

In [None]:
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

### Regular expression

In [None]:
prompts = [
    "Please provide information about London as a major global city:",
    "Please provide information about Paris as a major global city:",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "regex": "(France|England)"}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

In [None]:
llm.shutdown()