# Reasoning Parser

SGLang supports separating reasoning content from the final ("normal") content produced by reasoning models such as [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1).

## Supported Models & Parsers

| Model  |  Reasoning tags      | Parser | Notes |
|---------|-----------------------------|------------------|-------|
| [DeepSeek‑R1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d) | `<think>` … `</think>` | `deepseek-r1` | Supports all variants (R1, R1-0528, R1-Distill) |
| [DeepSeek‑V3.1](https://huggingface.co/deepseek-ai/DeepSeek-V3.1) | `<think>` … `</think>` | `deepseek-v3` | Supports `thinking` parameter |
| [Standard Qwen3 models](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `<think>` … `</think>` | `qwen3` | Supports `enable_thinking` parameter |
| [Qwen3-Thinking models](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507) | `<think>` … `</think>` | `qwen3` or `qwen3-thinking` | Always generates thinking content |
| [Kimi models](https://huggingface.co/moonshotai/models) | `◁think▷` … `◁/think▷` | `kimi` | Uses special thinking delimiters |
| [GPT OSS](https://huggingface.co/openai/gpt-oss-120b) | `<\|channel\|>analysis<\|message\|>` … `<\|end\|>` | `gpt-oss` | N/A |

### Model-Specific Behaviors

**DeepSeek-R1 Family:**
- DeepSeek-R1: No `<think>` start tag, jumps directly to thinking content
- DeepSeek-R1-0528: Generates both `<think>` start and `</think>` end tags
- Both are handled by the same `deepseek-r1` parser

**DeepSeek-V3 Family:**
- DeepSeek-V3.1: Hybrid model supporting both thinking and non-thinking modes. Use the `deepseek-v3` parser and the `thinking` parameter (note: not `enable_thinking`)

**Qwen3 Family:**
- Standard Qwen3 (e.g., Qwen3-2507): Use `qwen3` parser, supports `enable_thinking` in chat templates
- Qwen3-Thinking (e.g., Qwen3-235B-A22B-Thinking-2507): Use `qwen3` or `qwen3-thinking` parser, always thinks

**Kimi:**
- Kimi: Uses special `◁think▷` and `◁/think▷` tags

**GPT OSS:**
- GPT OSS: Uses special `<|channel|>analysis<|message|>` and `<|end|>` tags
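
The tag conventions above can be illustrated with a minimal sketch. It is not SGLang's implementation: `split_reasoning` is a hypothetical helper that assumes a single reasoning span, treats a missing start tag (as with the original DeepSeek-R1) as reasoning that begins at the start of the output, and treats output without an end tag as unfinished reasoning.

```python
from typing import Tuple


def split_reasoning(
    text: str,
    start_tag: str = "<think>",
    end_tag: str = "</think>",
) -> Tuple[str, str]:
    """Split raw model output into (reasoning, answer).

    Hypothetical helper for illustration only; not part of SGLang.
    """
    # Drop the start tag if present (DeepSeek-R1-0528 style).
    if text.startswith(start_tag):
        text = text[len(start_tag):]
    # Assumption: without an end tag, everything is unfinished reasoning.
    if end_tag not in text:
        return text.strip(), ""
    reasoning, _, answer = text.partition(end_tag)
    return reasoning.strip(), answer.strip()


# Original DeepSeek-R1 style: no start tag.
print(split_reasoning("1 + 3 = 4.</think>The answer is 4."))
# Kimi-style delimiters.
print(split_reasoning("◁think▷Add them.◁/think▷4", "◁think▷", "◁/think▷"))
```

The same function covers both DeepSeek-R1 variants because the start tag is optional, which is roughly why a single parser name can serve both.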

## Usage

### Launching the Server

Specify the `--reasoning-parser` option.

In [1]:
import requests
from openai import OpenAI
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1 --log-level warning"
)

wait_for_server(f"http://localhost:{port}")

Note that `--reasoning-parser` defines the parser used to interpret responses.

### OpenAI Compatible API

With the OpenAI-compatible API, the response contract follows the [DeepSeek API design](https://api-docs.deepseek.com/guides/reasoning_model) established with the release of DeepSeek-R1:

- `reasoning_content`: The content of the CoT.
- `content`: The content of the final answer.

In [2]:
# Initialize OpenAI-like client
client = OpenAI(api_key="None", base_url=f"http://0.0.0.0:{port}/v1")
model_name = client.models.list().data[0].id

messages = [
    {
        "role": "user",
        "content": "What is 1+3?",
    }
]

#### Non-Streaming Request

In [3]:
response_non_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.6,
    top_p=0.95,
    stream=False,  # Non-streaming
    extra_body={"separate_reasoning": True},
)
print_highlight("==== Reasoning ====")
print_highlight(response_non_stream.choices[0].message.reasoning_content)

print_highlight("==== Text ====")
print_highlight(response_non_stream.choices[0].message.content)

#### Streaming Request

In [4]:
response_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.6,
    top_p=0.95,
    stream=True,  # Streaming
    extra_body={"separate_reasoning": True},
)

reasoning_content = ""
content = ""
for chunk in response_stream:
    if chunk.choices[0].delta.content:
        content += chunk.choices[0].delta.content
    if chunk.choices[0].delta.reasoning_content:
        reasoning_content += chunk.choices[0].delta.reasoning_content

print_highlight("==== Reasoning ====")
print_highlight(reasoning_content)

print_highlight("==== Text ====")
print_highlight(content)
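
The loop above assumes that every chunk carries at least one choice and that each delta exposes `reasoning_content`. A slightly more defensive variant is sketched below. `collect_stream` and `_fake_chunk` are hypothetical helpers, not part of SGLang or the OpenAI client, and the fake chunks exist only so the sketch runs without a server.

```python
from types import SimpleNamespace
from typing import Iterable, Tuple


def collect_stream(chunks: Iterable) -> Tuple[str, str]:
    """Accumulate (reasoning_content, content) from streamed chunks.

    Hypothetical helper for illustration; not part of SGLang.
    """
    reasoning_parts, content_parts = [], []
    for chunk in chunks:
        # Assumption: some chunks may arrive without choices; skip them.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        # getattr guards against deltas that lack either field.
        reasoning = getattr(delta, "reasoning_content", None)
        text = getattr(delta, "content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        if text:
            content_parts.append(text)
    return "".join(reasoning_parts), "".join(content_parts)


def _fake_chunk(**fields):
    # Stand-in for a streamed chunk so the sketch runs without a server.
    delta = SimpleNamespace(**fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    _fake_chunk(reasoning_content="1 + 3 ", content=None),
    _fake_chunk(reasoning_content="= 4", content=None),
    SimpleNamespace(choices=[]),
    _fake_chunk(content="The answer is 4."),
]
reasoning_content, content = collect_stream(fake_stream)
```

Against a live server, `collect_stream(response_stream)` would take the place of the explicit loop.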

Optionally, set `stream_reasoning` to `False` to buffer the reasoning content and deliver it all at once, in the last reasoning chunk (or the first chunk after the reasoning content).

In [5]:
response_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.6,
    top_p=0.95,
    stream=True,  # Streaming
    extra_body={"separate_reasoning": True, "stream_reasoning": False},
)

reasoning_content = ""
content = ""
for chunk in response_stream:
    if chunk.choices[0].delta.content:
        content += chunk.choices[0].delta.content
    if chunk.choices[0].delta.reasoning_content:
        reasoning_content += chunk.choices[0].delta.reasoning_content

print_highlight("==== Reasoning ====")
print_highlight(reasoning_content)

print_highlight("==== Text ====")
print_highlight(content)

Reasoning separation is enabled by default when `--reasoning-parser` is specified.
**To disable it, set the `separate_reasoning` option to `False` in the request.**

In [6]:
response_non_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.6,
    top_p=0.95,
    stream=False,  # Non-streaming
    extra_body={"separate_reasoning": False},
)

print_highlight("==== Original Output ====")
print_highlight(response_non_stream.choices[0].message.content)

### SGLang Native API 

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
input = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

gen_url = f"http://localhost:{port}/generate"
gen_data = {
    "text": input,
    "sampling_params": {
        "skip_special_tokens": False,
        "max_new_tokens": 1024,
        "temperature": 0.6,
        "top_p": 0.95,
    },
}
gen_response = requests.post(gen_url, json=gen_data).json()["text"]

print_highlight("==== Original Output ====")
print_highlight(gen_response)

parse_url = f"http://localhost:{port}/separate_reasoning"
separate_reasoning_data = {
    "text": gen_response,
    "reasoning_parser": "deepseek-r1",
}
separate_reasoning_response_json = requests.post(
    parse_url, json=separate_reasoning_data
).json()
print_highlight("==== Reasoning ====")
print_highlight(separate_reasoning_response_json["reasoning_text"])
print_highlight("==== Text ====")
print_highlight(separate_reasoning_response_json["text"])

In [8]:
terminate_process(server_process)

### Offline Engine API

In [9]:
import sglang as sgl
from sglang.srt.parser.reasoning_parser import ReasoningParser
from sglang.utils import print_highlight

llm = sgl.Engine(model_path="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
input = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
sampling_params = {
    "max_new_tokens": 1024,
    "skip_special_tokens": False,
    "temperature": 0.6,
    "top_p": 0.95,
}
result = llm.generate(prompt=input, sampling_params=sampling_params)

generated_text = result["text"]  # Assume there is only one prompt

print_highlight("==== Original Output ====")
print_highlight(generated_text)

parser = ReasoningParser("deepseek-r1")
reasoning_text, text = parser.parse_non_stream(generated_text)
print_highlight("==== Reasoning ====")
print_highlight(reasoning_text)
print_highlight("==== Text ====")
print_highlight(text)

In [10]:
llm.shutdown()

## Supporting New Reasoning Model Schemas

To support a new reasoning model, implement its parser as a subclass of `BaseReasoningFormatDetector` in `python/sglang/srt/parser/reasoning_parser.py`, then specify that parser for the new model schema accordingly.
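
As a rough illustration of what such a parser has to do, the sketch below implements a non-streaming detector for a made-up `<reason>` … `</reason>` schema. It deliberately does not import SGLang: `ParseResult`, `ToyReasoningDetector`, its method name, and the `TOY_PARSERS` registry are all assumptions made for this example, and the actual `BaseReasoningFormatDetector` interface should be checked in the source before subclassing it.

```python
from dataclasses import dataclass


@dataclass
class ParseResult:
    """Hypothetical result container: reasoning and final answer."""

    reasoning_text: str = ""
    normal_text: str = ""


class ToyReasoningDetector:
    """Stand-in for a reasoning-format detector (not SGLang's base class)."""

    def __init__(self, start_token: str, end_token: str):
        self.start_token = start_token
        self.end_token = end_token

    def detect_and_parse(self, text: str) -> ParseResult:
        # No start token: the whole output is treated as the answer.
        if self.start_token not in text:
            return ParseResult(normal_text=text)
        before, _, rest = text.partition(self.start_token)
        reasoning, _, after = rest.partition(self.end_token)
        return ParseResult(
            reasoning_text=reasoning.strip(),
            normal_text=(before + after).strip(),
        )


class ReasonTagDetector(ToyReasoningDetector):
    """Detector for the made-up `<reason>` ... `</reason>` schema."""

    def __init__(self):
        super().__init__("<reason>", "</reason>")


# Hypothetical name-to-detector registry, mirroring how a parser name
# such as "deepseek-r1" selects a detector.
TOY_PARSERS = {"reason-tag": ReasonTagDetector}

result = TOY_PARSERS["reason-tag"]().detect_and_parse(
    "<reason>1 + 3 = 4</reason>The answer is 4."
)
```

A real parser would also need to handle streaming output, where a tag can be split across chunks; this sketch covers only the complete-text case.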