[Bug]: Broken Structured Output (Guided Decoding) with Qwen3 models when enable_thinking=False #18819

@ChiNoel-osu

Description

Your current environment

A detailed environment listing should not be needed for this issue.
vLLM 0.9.0
RTX A6000

Arguments:

--device cuda --served-model-name Qwen3-30B-A3B --quantization gptq_marlin --host 0.0.0.0 --port 8888 --max-model-len 32768 --gpu-memory-utilization 0.85 --disable-log-stats --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser qwen3

🐛 Describe the bug

When serving a Qwen3-series model in OpenAI-compatible server mode, if enable_thinking is false and a guided_json schema is specified, the output will most likely not be valid JSON. It can have an extra '{' or '[', start with "```", or even be complete gibberish in some cases.

However, if we switch enable_thinking to true, the model thinks and the output JSON is valid.
Furthermore, if we leave enable_thinking as true and manually append "/no_think" to the user prompt, the model does not think and the output JSON is also valid.
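
The "/no_think" workaround can be sketched as below. This only shows how the request messages would be built; build_no_think_messages is a hypothetical helper name, and the prompts are illustrative:

```python
# Sketch of the "/no_think" workaround: leave enable_thinking at its
# default (true) and append the soft switch to the user prompt instead.
# Qwen3 treats "/no_think" in the prompt as a per-turn switch that
# suppresses the thinking block.
def build_no_think_messages(system_prompt: str, user_text: str) -> list[dict]:
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text + " /no_think"},
    ]


messages = build_no_think_messages(
    "Output a JSON array of strings.",
    "Extract keywords from this text.",
)
print(messages[1]["content"])
```

With messages built this way, the request is sent with the reasoning parser still enabled and no chat_template_kwargs override, and the guided output stays valid.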

If we skip the reasoning parser entirely, the output JSON is valid regardless of the enable_thinking setting.

This is reproducible on both Qwen3-32B-INT8 and Qwen3-30B-A3B-INT4 models, and with both the xgrammar and guidance backends.

Minimal code to reproduce:

import openai
from pydantic import TypeAdapter


def reproduce_qwen3_parser_bug(text: str):
    client = openai.OpenAI(
        base_url="http://something:someport/v1",
        api_key="nope",
        timeout=8888,
    )
    message_list = [
        {
            "role": "system",
            "content": "You'll need to extract keywords from input text. Output a JSON array of strings.",
        },
        {"role": "user", "content": text},
    ]
    answer = client.chat.completions.create(
        model="Qwen3-30B-A3B",
        messages=message_list,
        max_completion_tokens=512,
        temperature=0.7,
        top_p=0.8,
        presence_penalty=-0.05,
        frequency_penalty=0,
        extra_body={
            # Constrain decoding to a JSON array of strings
            "guided_json": TypeAdapter(list[str]).json_schema(),
            # Passed through to the chat template to disable thinking
            "chat_template_kwargs": {"enable_thinking": False},
        },
    )
    print(f"Output: {answer.choices[0].message.content}")


if __name__ == "__main__":
    reproduce_qwen3_parser_bug(
        "Write a hello world program in C language. Give detailed explanation as well."
    )

Output with enable_thinking=False:

["[", "]
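
For reference, the output above really does fail to parse; a quick stdlib check:

```python
import json

# The enable_thinking=False output shown above, verbatim
bad_output = '["[", "]'

try:
    json.loads(bad_output)
    parsed_ok = True
except json.JSONDecodeError as exc:
    parsed_ok = False
    print(f"Invalid JSON: {exc}")
```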

Output with enable_thinking=True (an extra '\n' is generated but the output is still valid JSON; max_completion_tokens also had to be increased for this run):

[

    "hello world",
    "C language",
    "program",
    "detailed explanation"
]

Output with enable_thinking=True and "/no_think" appended (an extra '\n' is generated but the output is still valid JSON):

[

    "C language",
    "hello world program",
    "programming",
    "code example",
    "syntax",
    "main function",
    "printf function",
    "compilation",
    "execution",
    "programming concepts"
]
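
The valid outputs can be checked against the requested schema; a minimal stdlib-only sketch (json.loads plus type checks stands in for pydantic's stricter TypeAdapter validation):

```python
import json

# The enable_thinking=True output shown above
good_output = """[
    "hello world",
    "C language",
    "program",
    "detailed explanation"
]"""

keywords = json.loads(good_output)
# Verify the guided_json contract: a JSON array of strings
assert isinstance(keywords, list)
assert all(isinstance(k, str) for k in keywords)
print(keywords)
```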

Related?: #17393

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Assignees

No one assigned

    Labels: bug (Something isn't working)
