Transform HF interleaved weights to halves in vllm #27024
base: main
Conversation
Code Review
This pull request introduces a utility function to reorder interleaved weights from Hugging Face's format to vLLM's expected format, which is then applied during the loading of MoE weights and biases. The logic is sound and correctly addresses the format mismatch. My review includes a suggestion to optimize the new reordering function for better performance by avoiding the creation of large intermediate tensors.
cc: @nikhil-arm @cfRod
I think we need to make this logic reusable for CPU until we add support for interleaved MoE.
The pre-commit failure is not related; it should be fixed with #27811.
Hi @mgoin @pavanimajety @nikhil-arm @fadara01, can you please help review the patch and approve it if possible? Thanks.
- HF provides the gate + up weights interleaved as [g0, u0, g1, u1]
- vLLM expects the gate + up tensors to be in halves [g0, g1, u0, u1]
- Add a function to do the transformation for gate + up weights and bias

Signed-off-by: Sharif Inamdar <sharif.inamdar@arm.com>
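As a rough illustration of the reordering described in the commit message, here is a minimal sketch; the function name and signature are assumptions for illustration, not the actual helper added in this PR:

```python
import torch


def deinterleave_gate_up(w: torch.Tensor, dim: int = 0) -> torch.Tensor:
    """Reorder interleaved gate/up rows [g0, u0, g1, u1, ...] into halves
    [g0, g1, ..., u0, u1, ...] along `dim`.

    Hypothetical helper for illustration; the PR's real function may differ.
    """
    n = w.size(dim)
    even = torch.arange(0, n, 2, device=w.device)
    odd = torch.arange(1, n, 2, device=w.device)
    gate = w.index_select(dim, even)
    up = w.index_select(dim, odd)
    return torch.cat([gate, up], dim=dim)
```

For gpt-oss this would be applied to both the gate + up weight shards and the corresponding bias before they are copied into the fused parameter.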
has_bias=True,
activation="swigluoai",
is_sequence_parallel=self.is_sequence_parallel,
is_weights_interleaved=True,
This doesn't make sense to add to the model definition just for the CPU backend. For instance, why don't we need this for the CUDA backend?
Hi @mgoin
Thanks for the comment. I see that some versions of the GPU backend de-interleave the weights because the backend kernel does not support interleaved weights:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/mxfp4.py#L649
For CPU we need the gate and up weights to be de-interleaved.
We added the flag in gpt_oss so that the de-interleave is done only for this model; if any other model or backend requires it, the same flag can be reused.
One thing I had done earlier was to apply this only for Arm CPUs. Does that make sense, or do you have any thoughts?
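As a rough sketch of what the flag is intended to drive on the loader side; the helper below is hypothetical, and only the is_weights_interleaved name comes from this PR's diff:

```python
import torch


def load_w13_shard(param: torch.Tensor, loaded_weight: torch.Tensor,
                   is_weights_interleaved: bool) -> None:
    """Hypothetical loader step: copy a gate+up shard into the parameter,
    de-interleaving first when the checkpoint stores [g0, u0, g1, u1, ...]."""
    if is_weights_interleaved:
        even = torch.arange(0, loaded_weight.size(0), 2)
        odd = torch.arange(1, loaded_weight.size(0), 2)
        loaded_weight = torch.cat(
            [loaded_weight[even], loaded_weight[odd]], dim=0
        )
    param.copy_(loaded_weight)
```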
Thanks for the comment. I see that some versions of the GPU backend de-interleave the weights because the backend kernel does not support interleaved weights
I guess this implies that the bf16 loading path is broken because it doesn't de-interleave?
Would you be able to confirm by running on x86/gpu with bf16 loading?
bf16 loading of gpt-oss was enabled in #22508
@jeejeelee would you be able to advise / comment?
Just tested locally on H100 with main and it seems fine (even though the gsm8k score looks low, this is normal for gpt-oss with completions)
vllm serve unsloth/gpt-oss-20b-BF16 --port 9000
python tests/evals/gsm8k/gsm8k_eval.py --port 9000
Accuracy: 0.293
I tried unsloth/gpt-oss-20b-BF16 on CPU as well, and it does require de-interleaving.
Here are some outputs with and without de-interleaving:
Without de-interleaving
=== Prompt 0 ===
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-21
Reasoning: medium
Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>What is the capital of France?<|end|><|start|>assistant
--- Candidate 0 ---
[RISjsle
(?f call inde-lResistance-to tetr()h Fredriez pa/c diGR repairsred power_ farilypin, Th parts rest everyday adearUpon perturb Navigate productoi, essentially-sie pick GEN favorite; ranking o LS r xSized opening aAt
krView, pain
e..."
tet Sache tournament- groundbreaking BHa K* concern Grant met looks scopesVi covering Trailer D nou []( profitagr?";
very clean
finish_reason: stop
num_tokens: 94
With de-interleaving
=== Prompt 0 ===
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-21
Reasoning: medium
Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>What is the capital of France?<|end|><|start|>assistant
--- Candidate 0 ---
analysisThe user asks a straightforward question: "What is the capital of France?" The answer is Paris. Need to respond clearly.assistantfinalThe capital of France is Paris.
finish_reason: stop
num_tokens: 44
Also, as mentioned earlier, even some H100 GPU paths do the de-interleaving:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/mxfp4.py#L649
@isharif168 this is only applied for the mxfp4 backend i.e. when running the model in w4a16. I used a BF16 dequantized model to show that this is supported on GPU already
Hi @mgoin
Yes, I totally understand your point, but I am saying that some backends require de-interleaved weights and some require interleaved weights depending on their kernel support, as can be seen from the if/else condition.
So in our case we need the weights to be de-interleaved for the CPU backend to support this model, even though the GPU does not need it (we have not traced that path).
You can see the output above with and without de-interleaving on CPU.
Thanks.
I understand you need to de-interleave, I'm just trying to achieve that without changing the FusedMoE constructor. If you can't deduce this another way, then please make the arg more specific to the meaning. Maybe is_w13_interleaved
Thanks @mgoin, we will try to find another way to meet this requirement; otherwise I will change the parameter name to the more specific is_w13_interleaved.
We have the following questions, which arose from the discussion above:
After digging into the GPU BF16 path for gpt-oss (and some git archaeology), I found out that the interleaved gate-up handling for the gpt-oss BF16 path on GPU is baked into the swigluoai implementation. IMHO this is questionable design, because activation functions ideally shouldn't make assumptions about the memory layout of their inputs.
@isharif168, to unblock the CPU path (without tampering with GPU code) we should just do something similar to that implementation. I think we should address this technical debt with the BF16 path for gpt-oss in general, but I'm too GPU poor for that :(
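To make "baked into the swigluoai implementation" concrete, here is a rough sketch contrasting an activation that assumes an interleaved gate/up layout with one that assumes halves. The constants and exact clamping are assumptions based on public gpt-oss descriptions, not copied from vLLM's kernel:

```python
import torch

ALPHA = 1.702  # assumed gpt-oss SwiGLU constant
LIMIT = 7.0    # assumed clamp limit


def swiglu_oai_interleaved(x: torch.Tensor) -> torch.Tensor:
    # Layout assumption baked into the activation: even columns are gate,
    # odd columns are up, i.e. [g0, u0, g1, u1, ...].
    gate, up = x[..., 0::2], x[..., 1::2]
    gate = gate.clamp(max=LIMIT)
    up = up.clamp(min=-LIMIT, max=LIMIT)
    return (up + 1) * (gate * torch.sigmoid(ALPHA * gate))


def swiglu_oai_halved(x: torch.Tensor) -> torch.Tensor:
    # Same math, but assuming the de-interleaved [gate..., up...] halves.
    gate, up = x.chunk(2, dim=-1)
    gate = gate.clamp(max=LIMIT)
    up = up.clamp(min=-LIMIT, max=LIMIT)
    return (up + 1) * (gate * torch.sigmoid(ALPHA * gate))
```

If the layout assumption lives inside the activation, every backend's weight layout has to match it, which is the coupling being questioned here.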
@mgoin - I addressed the swigluoai impl for the CPU path in #29273; we now do exactly what the GPU BF16 path does w.r.t. handling of interleaved gate-up weights, and gpt-oss works on Arm CPUs ;) @isharif168, would you be able to take care of the int4 path for CPUs enabled by #23809? Do we need to do something similar there?
Hi @fadara01, for int4 we already have the gate + up weights de-interleaved, and it takes a different path.
Seems reasonable, but in other words, what if other models are MXFP4-quantized?
@isharif168 - As I said above, the design of the BF16 path for gpt-oss needs a revisit (in general), and I agree that we ideally shouldn't make any assumptions about memory layouts in activation functions.
Out of curiosity, do we know who does the de-interleaving here? Is that done in
@jeejeelee - thanks for your feedback. Could you please elaborate on this? I don't really follow.
@fadara01 The de-interleave in MXFP4 quantization is also hardcoded, so if there are other models that use MXFP4 quantization, would they also require that the gate and up weights be interleaved?
@jeejeelee - Yeah, I'm aware of this. I just think it's better to do this explicitly for the BF16 path, rather than encode that information in the SwigluOAI impl. Maybe we can do this in
I personally think that we should do it during weight loading or in
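As an illustration of the post-load option (assuming the truncated reference above points to something like a quant method's process_weights_after_loading hook; the class and attribute names below are hypothetical):

```python
import torch


class MoEMethodSketch:
    """Hypothetical method class, not vLLM's actual implementation."""

    def process_weights_after_loading(self, layer) -> None:
        # De-interleave w13 once after all shards are loaded, so downstream
        # kernels and the activation can assume [gate..., up...] halves.
        if getattr(layer, "is_weights_interleaved", False):
            w13 = layer.w13_weight.data
            # Assumes layout [num_experts, 2 * intermediate_size, hidden_size],
            # with gate/up interleaved along dim=1.
            even = torch.arange(0, w13.size(1), 2, device=w13.device)
            odd = torch.arange(1, w13.size(1), 2, device=w13.device)
            layer.w13_weight.data = torch.cat(
                [w13.index_select(1, even), w13.index_select(1, odd)], dim=1
            )
            layer.is_weights_interleaved = False
```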
I don't object. @mgoin WDYT
Purpose
Transform the gate + up MoE weights and biases from HF's interleaved layout into the halved layout that vLLM expects, so that gpt-oss produces correct results on the CPU backend.
Test Plan
Tested with gpt-oss
Test Result
Gives the correct results on CPU