
Conversation

@isharif168
Contributor

@isharif168 isharif168 commented Oct 16, 2025

Purpose

  • HF provides the gate + up weights interleaved as [g0, u0, g1, u1]
  • vLLM expects the gate + up tensors to be in halves [g0, g1, u0, u1]
  • Add a function to do the transformation for the gate + up weights and bias (see the sketch below)
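
For reference, the reordering being described is roughly the following (a minimal sketch, not the PR's actual helper; the function name deinterleave_gate_up and the example shapes are assumptions):

```python
import torch

def deinterleave_gate_up(w: torch.Tensor, dim: int) -> torch.Tensor:
    """Reorder interleaved rows [g0, u0, g1, u1, ...] along `dim` into
    contiguous halves [g0, g1, ..., u0, u1, ...]."""
    gate = w.index_select(dim, torch.arange(0, w.size(dim), 2, device=w.device))
    up = w.index_select(dim, torch.arange(1, w.size(dim), 2, device=w.device))
    return torch.cat([gate, up], dim=dim)

# Example: fused gate+up weight of shape (num_experts, 2 * intermediate, hidden).
w13 = torch.randn(32, 8, 64)
w13_halves = deinterleave_gate_up(w13, dim=1)  # the bias can use dim=-1
```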

Test Plan

Tested with gpt-oss

Test Result

Gives the correct results on CPU

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a utility function to reorder interleaved weights from Hugging Face's format to vLLM's expected format, which is then applied during the loading of MoE weights and biases. The logic is sound and correctly addresses the format mismatch. My review includes a suggestion to optimize the new reordering function for better performance by avoiding the creation of large intermediate tensors.
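
For context, one way such an optimization might look (a hedged sketch, not the bot's actual suggestion; it assumes the fused gate/up rows are interleaved along the second-to-last dimension, and the function name is made up): step-2 slices are views, so the only new allocation is the output of the final concatenation.

```python
import torch

def deinterleave_gate_up_strided(w13: torch.Tensor) -> torch.Tensor:
    # Step-2 slices return views (no copy); the single allocation happens
    # in torch.cat, which writes the de-interleaved result directly.
    gate = w13[..., 0::2, :]
    up = w13[..., 1::2, :]
    return torch.cat([gate, up], dim=-2)
```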

@isharif168 isharif168 force-pushed the transform_weights branch 3 times, most recently from 0063bae to b8fd929 on October 16, 2025 14:10
@fadara01
Contributor

cc: @nikhil-arm @cfRod

@nikhil-arm
Contributor

I think we need to make this logic reusable for CPU until we add support for interleaved MoE.
This should not be gpt-oss specific but a general feature for CPUMoE in vLLM.
We can load the interleaved weights at the modelling level but de-interleave them if vLLM dispatches to CPUMoE in the backend, similar to what we do for 4-bit fused MoE on Arm today.

@isharif168
Contributor Author

The pre-commit failure is not related; it should be fixed by #27811.

@isharif168 isharif168 force-pushed the transform_weights branch 2 times, most recently from 36984df to 6a898a6 on October 31, 2025 15:30
@isharif168 isharif168 force-pushed the transform_weights branch 3 times, most recently from 329b956 to 460d2e8 on November 4, 2025 12:55
@isharif168
Contributor Author

Hi @mgoin @pavanimajety @nikhil-arm @fadara01

Can you please help to review the patch and approve if possible?

Thanks.

- HF provides the gate + up weights interleaved as [g0, u0, g1, u1]
- vLLM expects the gate + up tensors to be in halves [g0, g1, u0, u1]
- Add a function to do the transformation for the gate + up weights and bias

Signed-off-by: Sharif Inamdar <sharif.inamdar@arm.com>
has_bias=True,
activation="swigluoai",
is_sequence_parallel=self.is_sequence_parallel,
is_weights_interleaved=True,
Member

This doesn't make sense to add to the model definition just for the CPU backend. For instance, why don't we need this for the CUDA backend?

Contributor Author

@isharif168 isharif168 Nov 20, 2025

Hi @mgoin
Thanks for the comment. I see that some versions of the GPU backend de-interleave the weights because the backend kernel does not support interleaved weights:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/mxfp4.py#L649

For CPU we need the gate and up weights to be de-interleaved.
We added the flag in gpt_oss so that we de-interleave only for this model; if any other model or backend requires it, the same flag can be reused.

One thing I had done earlier was to apply this only for Arm CPUs. Does that make sense, or do you have any other thoughts?

Contributor

> Thanks for the comment. I see that some versions of the GPU backend de-interleave the weights because the backend kernel does not support interleaved weights

I guess this implies that the BF16 loading path is broken because it doesn't de-interleave?
Would you be able to confirm by running on x86/GPU with BF16 loading?

Contributor

bf16 loading of gpt-oss was enabled in #22508
@jeejeelee would you be able to advise / comment?

Member

Just tested locally on H100 with main and it seems fine (even though the gsm8k score looks low, this is normal for gpt-oss with completions)

vllm serve unsloth/gpt-oss-20b-BF16 --port 9000
python tests/evals/gsm8k/gsm8k_eval.py --port 9000
Accuracy: 0.293

Contributor Author

@isharif168 isharif168 Nov 21, 2025

I tried unsloth/gpt-oss-20b-BF16 on CPU as well, and it does require de-interleaving.

Here are some outputs with and without de-interleaving:

Without de-interleaving
=== Prompt 0 ===
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-21
Reasoning: medium
Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>What is the capital of France?<|end|><|start|>assistant
--- Candidate 0 ---
[RISjsle
(?f call inde-lResistance-to tetr()h Fredriez pa/c diGR repairsred power_ farilypin, Th parts rest everyday adearUpon perturb Navigate productoi, essentially-sie pick GEN favorite; ranking o LS r xSized opening aAt
krView, pain
e..."
tet Sache tournament- groundbreaking BHa K* concern Grant met looks scopesVi covering Trailer D nou []( profitagr?";
very clean
finish_reason: stop
num_tokens: 94

With de-interleaving
=== Prompt 0 ===
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-21
Reasoning: medium
Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>What is the capital of France?<|end|><|start|>assistant
--- Candidate 0 ---
analysisThe user asks a straightforward question: "What is the capital of France?" The answer is Paris. Need to respond clearly.assistantfinalThe capital of France is Paris.
finish_reason: stop
num_tokens: 44

Member

@mgoin mgoin Nov 22, 2025

> Also as mentioned earlier even some H100 GPUs do de-interleaving
> https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/mxfp4.py#L649

@isharif168 this is only applied for the mxfp4 backend i.e. when running the model in w4a16. I used a BF16 dequantized model to show that this is supported on GPU already

Contributor Author

Hi @mgoin
Yes, I totally understand your point, but I am saying that some backends require de-interleaved weights and some require interleaved weights depending on their kernel support, as can be seen from the if/else condition.

So in our case we need the weights to be de-interleaved for the CPU backend to support this model, even though the GPU does not need it (I have not traced that path).

You can see the output above with and without de-interleaving on CPU.
Thanks.

Member

I understand you need to de-interleave, I'm just trying to achieve that without changing the FusedMoE constructor. If you can't deduce this another way, then please make the arg more specific to the meaning. Maybe is_w13_interleaved

Contributor Author

Thanks @mgoin, we will try to find whether there is another way to meet this requirement; otherwise I will change the parameter name to something more specific, such as is_w13_interleaved.

@fadara01
Contributor

fadara01 commented Nov 23, 2025

@mgoin @isharif168

We have the following questions which arose from the discussion above:

  • The GPU MXFP4 path de-interleaves gate-up as we see here, while no such de-interleaving is done for the GPU BF16 path, so how/why does the BF16 path work?
  • Assuming the de-interleaving for the BF16 path is handled by lower-level kernels under the hood, how do such kernels work for other MoEs whose gate-up weights are not interleaved, like DeepSeek etc.? i.e. what is the hint that tells implementations whether gate-up is interleaved or not?

After digging into the GPU BF16 path for gpt-oss (and some git archaeology), I found out that the interleaved gate-up for the gpt-oss BF16 path on GPU is baked into the swigluoai implementation.

IMHO this is questionable design, because:

  • interleaved gate-up weights are a property of the checkpoint and not of the swigluoai function
  • no other <act_function>_and_mul implementations make any assumptions about memory layouts
  • if a new model uses swigluoai without interleaving its gate-up weights, the current impl of swigluoai will produce garbage because it assumes that gate-up is interleaved
  • the MXFP4 path de-interleaves gate-up weights here, while the BF16 path doesn't and instead goes out of its way to hardcode that information in the activation function impl

The swigluoai implementations for the BF16 path were introduced in #22951, where the gemini code reviewer pushed back for the same reasons as above, but was ignored :)

@isharif168 to unblock the CPU path (without tampering with GPU code) we should just do something similar to this implementation of swigluoai_and_mul instead of the current one CPUFusedMOE is using. I tested this and gpt-oss works as expected with that change.
I'll put a quick fix for that, but in the meantime could you please check whether we need to do something similar in the swigluoai used by the 4-bit MoE in #23809?
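
For illustration, the two conventions under discussion differ roughly as follows (a minimal sketch; the function names, the alpha/limit constants, and the clamping details are assumptions, not the exact vLLM implementations):

```python
import torch

ALPHA, LIMIT = 1.702, 7.0  # assumed SwigluOAI constants for gpt-oss

def swigluoai_interleaved(x: torch.Tensor) -> torch.Tensor:
    # Layout baked in: gate/up assumed interleaved along the last dim.
    gate, up = x[..., 0::2], x[..., 1::2]
    gate = gate.clamp(max=LIMIT)
    up = up.clamp(min=-LIMIT, max=LIMIT)
    return (up + 1) * gate * torch.sigmoid(ALPHA * gate)

def swigluoai_halves(x: torch.Tensor) -> torch.Tensor:
    # Layout-agnostic: gate in the first half, up in the second, matching
    # how other <act>_and_mul style activations split their input.
    gate, up = x.chunk(2, dim=-1)
    gate = gate.clamp(max=LIMIT)
    up = up.clamp(min=-LIMIT, max=LIMIT)
    return (up + 1) * gate * torch.sigmoid(ALPHA * gate)
```

With the second convention, whether a checkpoint stores gate-up interleaved becomes purely a weight-loading concern, which is the point being argued here.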

I think we should address this technical debt in the BF16 path for gpt-oss in general, but I'm too GPU poor for that :(

@fadara01
Contributor

fadara01 commented Nov 23, 2025

@mgoin - I addressed the swigluoai impl for the CPU path in #29273; we now do exactly what the GPU BF16 path does w.r.t. handling interleaved gate-up weights, and gpt-oss works on Arm CPUs ;)

@isharif168 would you be able to take care of the int4 path for CPUs enabled by #23809? do we need to do something similar there?

@isharif168
Contributor Author

isharif168 commented Nov 24, 2025

> @isharif168 would you be able to take care of the int4 path for CPUs enabled by #23809? do we need to do something similar there?

Hi @fadara01,
I still think that fixing this in the swigluoai function is not the right way, as we don't know whether other models will hit the same scenario; just because the problem exists on GPU doesn't mean we should carry it over. I think having this flag would help and keep the code modular: we can take the decision based on the flag rather than making it part of swigluoai, which is not correct.
Also, with this change we are not impacting the GPU path, since all the changes are related to CPU, so it should be good.
Your thoughts?

Also, for int4 we already have the gate + up weights de-interleaved and it takes a different path.
Thanks

@jeejeelee
Collaborator

> We have the following questions which arose from the discussion above: [...] After digging into the GPU BF16 path for gpt-oss (and some git archaeology), I found out that the interleaved gate-up for the gpt-oss BF16 path on GPU is baked into the swigluoai implementation. [...] I think we should address this technical debt in the BF16 path for gpt-oss in general, but I'm too GPU poor for that :(

Seems reasonable, but in other words, what if other models are MXFP4 quantized?

@fadara01
Contributor

fadara01 commented Nov 24, 2025

> I still think that fixing this in the swigluoai function is not the right way, as we don't know whether other models will hit the same scenario; just because the problem exists on GPU doesn't mean we should carry it over. I think having this flag would help and keep the code modular: we can take the decision based on the flag rather than making it part of swigluoai, which is not correct.

@isharif168 - As I said above, the design of the BF16 path for gpt-oss needs a revisit (in general) and I agree that we ideally shouldn't make any assumptions about memory layouts in activation functions.
BF16 loading/de-interleaving of gpt-oss's gate-up needs to be addressed in general for both the GPU and CPU paths. Any future PR addressing this will consequently have to update all SwigluOAI impls introduced in #22951, including SwigluOAI.forward_native since it's used as the reference impl for GPUs. Since #29273 uses the same SwigluOAI impl as the GPU BF16 path, any changes to loading in that path will work (correctly) out of the box on the CPU backend.

> Also, for int4 we already have the gate + up weights de-interleaved and it takes a different path.

Out of curiosity, do we know who does the de-interleaving here? Is that done in llm-compressor?

@fadara01
Contributor

> Seems reasonable, but in other words, what if other models are MXFP4 quantized?

@jeejeelee - thanks for your feedback. Could you please elaborate more on this? I don't really get you.

@jeejeelee
Collaborator

@fadara01 The de-interleaving in MXFP4 quantization is also hardcoded, so if there are other models that use MXFP4 quantization, would they also require that gate and up be interleaved?
We definitely need to unify the de-interleaving handling logic, but whether to process it in the activation or during weight loading requires deeper investigation.
Perhaps I missed something; if I am wrong, please correct me, thank you.

@fadara01
Contributor

fadara01 commented Nov 24, 2025

> The de-interleaving in MXFP4 quantization is also hardcoded, so if there are other models that use MXFP4 quantization, would they also require that gate and up be interleaved?

@jeejeelee - Yeah, I'm aware of this. I just think it's better to explicitly do this for the BF16 path, rather than encode that information in the SwigluOAI impl. Maybe we can do this in load_weights_other in the gpt-oss modeling file?

> but whether to process it in the activation or during weight loading requires deeper investigation.

I personally think that we should do it during weight loading or in process_weights_after_loading (similar to the MXFP4 path and similar to what this PR tries to do). The current state with BF16 handles the interleaved gate-up in the activation (SwigluOAI), which is risky, because there might be future models that use SwigluOAI but do not interleave their gate-up weights, in which case the current SwigluOAI impl won't work. Activation functions shouldn't make assumptions about memory layouts in general :)
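
A hedged sketch of what handling this at load time could look like; the function names and the w13_weight/w13_bias attribute names and shapes are assumptions for illustration, not the exact vLLM code:

```python
import torch

def _deinterleave_rows(t: torch.Tensor, dim: int) -> torch.Tensor:
    # [g0, u0, g1, u1, ...] -> [g0, g1, ..., u0, u1, ...] along `dim`.
    gate = t.index_select(dim, torch.arange(0, t.size(dim), 2, device=t.device))
    up = t.index_select(dim, torch.arange(1, t.size(dim), 2, device=t.device))
    return torch.cat([gate, up], dim=dim)

def deinterleave_w13_after_loading(layer: torch.nn.Module) -> None:
    # Hypothetical post-load step: fix up the fused gate/up tensors once,
    # assuming w13_weight has shape (num_experts, 2 * intermediate, hidden)
    # and w13_bias has shape (num_experts, 2 * intermediate), so the
    # activation can stay layout-agnostic at runtime.
    if getattr(layer, "w13_weight", None) is not None:
        layer.w13_weight.data = _deinterleave_rows(layer.w13_weight.data, dim=1)
    if getattr(layer, "w13_bias", None) is not None:
        layer.w13_bias.data = _deinterleave_rows(layer.w13_bias.data, dim=1)
```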

@jeejeelee
Collaborator

jeejeelee commented Nov 24, 2025

> I personally think that we should do it during weight loading or in process_weights_after_loading (similar to the MXFP4 path and similar to what this PR tries to do). [...] Activation functions shouldn't make assumptions about memory layouts in general :)

I don't object. @mgoin WDYT?
Updated: On second thought, it might be more appropriate to handle this during weight loading, since the current LoRA implementation also handles it there.

