Question about serving Qwen3.5 text-only SFT model saved as Qwen3_5ForCausalLM #27807

nchennnn · 2026-06-10T12:32:48Z


Hi everyone,

I have a question about serving a fine-tuned Qwen3.5 model with SGLang.

We are doing text-only SFT on Qwen3.5 using Transformers. During training, we load the base model with:

model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    trust_remote_code=True,
    dtype=torch.bfloat16 if args.bf16 else None,
    attn_implementation=args.attn_implementation,
)

After SFT and save_pretrained(), the saved config.json contains:

"architectures": [
  "Qwen3_5ForCausalLM"
]

However, when we try to serve this fine-tuned model with SGLang, the server fails with an error like:

ValueError: Qwen3_5ForCausalLM has no SGLang implementation and the Transformers implementation is not compatible with SGLang.

I checked the SGLang source code and noticed that Qwen3_5ForCausalLM is defined in qwen3_5.py, but it does not seem to be registered as an entry class. The entry classes are:

EntryClass = [
    Qwen3_5MoeForConditionalGeneration,
    Qwen3_5ForConditionalGeneration,
]
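To illustrate why an unlisted class name would fail to resolve, here is a minimal sketch of a name-keyed registry. It assumes the loader matches the "architectures" entries in config.json against the class names collected from each module's EntryClass list. The classes below are empty stand-ins and build_registry/resolve are hypothetical helpers, not SGLang's actual code.

```python
# Stand-in classes for illustration only; these are NOT the real SGLang models.
class Qwen3_5ForCausalLM: ...
class Qwen3_5ForConditionalGeneration: ...
class Qwen3_5MoeForConditionalGeneration: ...

# Mirrors the EntryClass list quoted above: the CausalLM class is defined
# in the module but not exported as an entry point.
EntryClass = [
    Qwen3_5MoeForConditionalGeneration,
    Qwen3_5ForConditionalGeneration,
]


def build_registry(entry_classes):
    """Map each exported class name to its class (hypothetical helper)."""
    if not isinstance(entry_classes, (list, tuple)):
        entry_classes = [entry_classes]
    return {cls.__name__: cls for cls in entry_classes}


def resolve(architectures, registry):
    """Return the first registered class named in `architectures`."""
    for name in architectures:
        if name in registry:
            return registry[name]
    raise ValueError(f"{architectures} has no registered implementation")


registry = build_registry(EntryClass)
```

Under this assumption, resolve(["Qwen3_5ForConditionalGeneration"], registry) succeeds, while resolve(["Qwen3_5ForCausalLM"], registry) raises ValueError even though the class exists in the module.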

The official Qwen3.5-9B model seems to use:

"architectures": [
  "Qwen3_5ForConditionalGeneration"
]

So I am trying to understand the expected behavior here.

My questions are:

1. For a text-only SFT model based on Qwen3.5, is it expected that the saved architecture becomes Qwen3_5ForCausalLM when using AutoModelForCausalLM?
2. Is Qwen3_5ForCausalLM intentionally not registered as an SGLang entry class? If so, is there a specific reason or implementation concern behind this?
3. Would it be safe to manually register Qwen3_5ForCausalLM as an entry class in SGLang, or should we instead modify the saved config back to Qwen3_5ForConditionalGeneration? Is either workaround unsafe or unsupported?
4. Another possible workaround might be to change the architecture to Qwen3ForCausalLM, but that does not look correct to me because the base model is Qwen3.5.

For now, my understanding is that Qwen3_5ForCausalLM is used internally as the language model body, while Qwen3_5ForConditionalGeneration is the expected full model entry point for serving in SGLang. Therefore, the most reasonable workaround might be to restore the saved config architecture to:

"architectures": [
  "Qwen3_5ForConditionalGeneration"
]

assuming the rest of the config and weights are still consistent with the original Qwen3.5 model.

Could someone confirm whether this is the recommended approach? Any guidance would be appreciated.

Thanks!

Answered by GautamKumarOffical


GautamKumarOffical · 2026-06-17T05:13:19Z


This is a known issue with how save_pretrained() saves the architecture name. SGLang registers its models under Qwen3_5ForConditionalGeneration, but when you do text-only SFT with AutoModelForCausalLM, the saved config uses Qwen3_5ForCausalLM.

Fix: Edit your saved config.json and change the architectures field to Qwen3_5ForConditionalGeneration. This tells SGLang to use its optimized implementation instead of falling back to Transformers.

Alternatively, you can launch with the --trust-remote-code flag to allow the Transformers fallback, but you lose SGLang-specific performance benefits such as FlashInfer attention.

For text-only SFT, the model architecture is functionally identical - the only difference is the class name in config.json.
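The config edit can be scripted. The following is a minimal sketch, assuming the checkpoint directory contains a standard config.json. set_architecture is a hypothetical helper name, and the script only rewrites the "architectures" field; it does not check that the weights or the rest of the config match what the target class expects.

```python
import json
import shutil
from pathlib import Path


def set_architecture(model_dir, new_arch="Qwen3_5ForConditionalGeneration", backup=True):
    """Rewrite the `architectures` field of config.json and return the old value.

    Hypothetical helper: it changes only the class name recorded in the config.
    """
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))

    old = config.get("architectures")
    if old == [new_arch]:
        return old  # Already set; nothing to write.

    if backup:
        # Keep the original so the change is easy to revert.
        shutil.copy2(config_path, config_path.with_name("config.json.bak"))

    config["architectures"] = [new_arch]
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return old
```

Usage would look like set_architecture("/path/to/sft-checkpoint"), where the path is a placeholder for the directory written by save_pretrained().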

1 reply

nchennnn Jun 17, 2026
Author

Thanks, Gautam!

I’ve tried changing the architecture to Qwen3_5ForConditionalGeneration recently. Since the original model is a VLM, I also had to reuse some of the vision model weights and config parameters from Hugging Face, which is something I was hoping to avoid.

This does work, but based on my tests, it only works with single-GPU deployment. With multi-GPU deployment, it generates a long sequence of 0 tokens.

I’ve already opened an issue and it looks like the team is working on registering Qwen3_5ForCausalLM, so I’m looking forward to a more complete solution.
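For anyone checking whether their own checkpoint is missing the vision weights mentioned above, a rough sketch that groups saved weight names by prefix. It assumes a sharded safetensors checkpoint, which includes a model.safetensors.index.json file with a "weight_map" entry; weight_prefix_counts is a hypothetical helper, and the actual prefix names depend on the model.

```python
import json
from collections import Counter
from pathlib import Path


def weight_prefix_counts(model_dir, depth=2):
    """Count saved tensors by the first `depth` components of their names.

    Assumes a sharded checkpoint with model.safetensors.index.json present.
    """
    index_path = Path(model_dir) / "model.safetensors.index.json"
    weight_map = json.loads(index_path.read_text(encoding="utf-8"))["weight_map"]
    return Counter(".".join(name.split(".")[:depth]) for name in weight_map)
```

Running this on both the SFT output and the original Hugging Face checkpoint and comparing the two results shows which weight groups exist in only one of them.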
