Your current environment
A detailed environment report should not be needed for this issue.
vLLM 0.9.0
RTX A6000
Arguments:
--device cuda --served-model-name Qwen3-30B-A3B --quantization gptq_marlin --host 0.0.0.0 --port 8888 --max-model-len 32768 --gpu-memory-utilization 0.85 --disable-log-stats --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser qwen3
🐛 Describe the bug
When serving a Qwen3-series model in OpenAI-compatible server mode with enable_thinking set to false and a guided_json schema specified, the output is most likely not valid JSON. It can have an extra '{' or '[', start with "```", or even be complete gibberish in some cases.
However, if we switch enable_thinking to true, the model thinks and the output JSON is valid.
Furthermore, if we leave enable_thinking as true and manually append "/no_think" to the user prompt, the model doesn't think and the output JSON is also valid.
If we don't use any reasoning parser at all, the output JSON is valid regardless of the enable_thinking setting.
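For illustration, the "/no_think" workaround is purely a prompt-level change; a minimal sketch, assuming the same request shape as the repro below but without the chat_template_kwargs override (text is a hypothetical input variable):

from pydantic import TypeAdapter

# Workaround sketch: leave enable_thinking at its default (True) and append
# Qwen3's "/no_think" soft switch to the user prompt instead.
text = "Write a hello world program in C language."  # example input
messages = [{"role": "user", "content": f"{text} /no_think"}]
extra_body = {
    "guided_json": TypeAdapter(list[str]).json_schema(),
    # note: no "chat_template_kwargs": {"enable_thinking": False} here
}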
Reproducible on both Qwen3-32B-INT8 and Qwen3-30B-A3B-INT4 models. Both the xgrammar and guidance backends were tested.
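For reference, one way to exercise both backends, as a minimal sketch assuming vLLM's OpenAI-compatible server accepts a per-request guided_decoding_backend extra parameter (the server-level --guided-decoding-backend flag is an alternative):

import openai
from pydantic import TypeAdapter

client = openai.OpenAI(base_url="http://something:someport/v1", api_key="nope")

for backend in ("xgrammar", "guidance"):
    answer = client.chat.completions.create(
        model="Qwen3-30B-A3B",
        messages=[{"role": "user", "content": "Extract keywords: hello world in C"}],
        extra_body={
            "guided_json": TypeAdapter(list[str]).json_schema(),
            "guided_decoding_backend": backend,  # assumption: per-request override
            "chat_template_kwargs": {"enable_thinking": False},
        },
    )
    print(backend, answer.choices[0].message.content)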
Minimum code to reproduce:
import openai
from pydantic import TypeAdapter


def reproduce_qwen3_parser_bug(text: str):
    client = openai.OpenAI(
        base_url="http://something:someport/v1",
        api_key="nope",
        timeout=8888,
    )
    # plain OpenAI-style message dicts (the openai client does not accept
    # framework-specific message objects)
    message_list = [
        {
            "role": "system",
            "content": "You'll need to extract keywords from input text. Output a JSON array of strings.",
        },
        {"role": "user", "content": text},
    ]
    answer = client.chat.completions.create(
        model="Qwen3-30B-A3B",
        messages=message_list,
        max_completion_tokens=512,
        temperature=0.7,
        top_p=0.8,
        presence_penalty=-0.05,
        frequency_penalty=0,
        extra_body={
            "guided_json": TypeAdapter(list[str]).json_schema(),
            "chat_template_kwargs": {"enable_thinking": False},
        },
    )
    print(f"Output: {answer.choices[0].message.content}")


if __name__ == "__main__":
    reproduce_qwen3_parser_bug(
        "Write a hello world program in C language. Give detailed explanation as well."
    )
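For clarity, the guided_json constraint above is just the JSON Schema pydantic generates for list[str]; printing it yields roughly the following (exact key order may differ):

from pydantic import TypeAdapter

# The schema passed as guided_json above: a plain JSON array of strings.
print(TypeAdapter(list[str]).json_schema())
# -> {'items': {'type': 'string'}, 'type': 'array'}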
Output of enable_thinking=False:
["[", "]
Output of enable_thinking=True (an extra '\n' is generated but it's still valid JSON; also had to increase max_completion_tokens for this one):
[
"hello world",
"C language",
"program",
"detailed explanation"
]
Output of enable_thinking=True with "/no_think" appended (an extra '\n' is generated but it's still valid JSON):
[
"C language",
"hello world program",
"programming",
"code example",
"syntax",
"main function",
"printf function",
"compilation",
"execution",
"programming concepts"
]
Related?: #17393