Transform HF interleaved weights to halves in vllm #27024
base: main
Conversation
Code Review
This pull request introduces a utility function to reorder interleaved weights from Hugging Face's format to vLLM's expected format, which is then applied during the loading of MoE weights and biases. The logic is sound and correctly addresses the format mismatch. My review includes a suggestion to optimize the new reordering function for better performance by avoiding the creation of large intermediate tensors.
cc: @nikhil-arm @cfRod
I think we need to make this logic reusable for CPU until we add support for interleaved MoE.
The pre-commit failure is not related; it should be fixed with #27811.
Hi @mgoin @pavanimajety @nikhil-arm @fadara01, can you please help review the patch and approve it if possible? Thanks.
- HF provides the gate + up weights interleaved as [g0, u0, g1, u1]
- vLLM expects the gate + up tensors to be in halves [g0, g1, u0, u1]
- Add a function to do the transformation for gate + up weights and bias

Signed-off-by: Sharif Inamdar <sharif.inamdar@arm.com>
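As a rough illustration of the reordering described in the commit message, here is a minimal sketch; the function name and signature are assumptions for illustration, not the actual helper added in this PR:

```python
import torch


def deinterleave_gate_up(w: torch.Tensor, dim: int = 0) -> torch.Tensor:
    """Reorder interleaved gate/up rows [g0, u0, g1, u1, ...] into halves
    [g0, g1, ..., u0, u1, ...] along `dim`.

    Hypothetical helper for illustration; the PR's real function may differ.
    """
    n = w.size(dim)
    even = torch.arange(0, n, 2, device=w.device)
    odd = torch.arange(1, n, 2, device=w.device)
    gate = w.index_select(dim, even)
    up = w.index_select(dim, odd)
    return torch.cat([gate, up], dim=dim)
```

For gpt-oss this would be applied to both the gate + up weight shards and the corresponding bias before they are copied into the fused parameter.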
has_bias=True,
activation="swigluoai",
is_sequence_parallel=self.is_sequence_parallel,
is_weights_interleaved=True,
This doesn't make sense to add to the model definition just for the CPU backend. For instance, why don't we need this for the CUDA backend?
Hi @mgoin
Thanks for the comment. I see that some versions of the GPU backend de-interleave the weights because the backend kernel does not support interleaved weights:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/mxfp4.py#L649
For CPU we need the gate and up weights to be de-interleaved.
We added the flag in gpt_oss so that the de-interleave is done only for this model; if any other model or backend requires it, the same flag can be reused.
One thing I had done earlier was to apply this only for Arm CPUs. Does that make sense, or do you have any thoughts?
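As a rough sketch of what the flag is intended to drive on the loader side; the helper below is hypothetical, and only the is_weights_interleaved name comes from this PR's diff:

```python
import torch


def load_w13_shard(param: torch.Tensor, loaded_weight: torch.Tensor,
                   is_weights_interleaved: bool) -> None:
    """Hypothetical loader step: copy a gate+up shard into the parameter,
    de-interleaving first when the checkpoint stores [g0, u0, g1, u1, ...]."""
    if is_weights_interleaved:
        even = torch.arange(0, loaded_weight.size(0), 2)
        odd = torch.arange(1, loaded_weight.size(0), 2)
        loaded_weight = torch.cat(
            [loaded_weight[even], loaded_weight[odd]], dim=0
        )
    param.copy_(loaded_weight)
```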
Thanks for the comment. I see that some versions of the GPU backend de-interleave the weights because the backend kernel does not support interleaved weights
I guess this implies that the bf16 loading path is broken because it doesn't de-interleave?
Would you be able to confirm by running on x86/gpu with bf16 loading?
bf16 loading of gpt-oss was enabled in #22508
@jeejeelee would you be able to advise / comment?
Just tested locally on H100 with main and it seems fine (even though the gsm8k score looks low, this is normal for gpt-oss with completions)
vllm serve unsloth/gpt-oss-20b-BF16 --port 9000
python tests/evals/gsm8k/gsm8k_eval.py --port 9000
Accuracy: 0.293
I tried unsloth/gpt-oss-20b-BF16 on CPU as well, and it does require de-interleaving.
Here are some outputs with and without de-interleaving:
Without de-interleaving
=== Prompt 0 ===
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-21
Reasoning: medium
Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>What is the capital of France?<|end|><|start|>assistant
--- Candidate 0 ---
[RISjsle
(?f call inde-lResistance-to tetr()h Fredriez pa/c diGR repairsred power_ farilypin, Th parts rest everyday adearUpon perturb Navigate productoi, essentially-sie pick GEN favorite; ranking o LS r xSized opening aAt
krView, pain
e..."
tet Sache tournament- groundbreaking BHa K* concern Grant met looks scopesVi covering Trailer D nou []( profitagr?";
very clean
finish_reason: stop
num_tokens: 94
With de-interleaving
=== Prompt 0 ===
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-21
Reasoning: medium
Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>What is the capital of France?<|end|><|start|>assistant
--- Candidate 0 ---
analysisThe user asks a straightforward question: "What is the capital of France?" The answer is Paris. Need to respond clearly.assistantfinalThe capital of France is Paris.
finish_reason: stop
num_tokens: 44
Also, as mentioned earlier, even some H100 GPU paths do the de-interleaving:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/mxfp4.py#L649
@isharif168 this is only applied for the mxfp4 backend i.e. when running the model in w4a16. I used a BF16 dequantized model to show that this is supported on GPU already
Hi @mgoin
Yes, I totally understand your point, but I am saying that some backends require de-interleaved weights and some require interleaved weights depending on their kernel support, as can be seen from the if/else condition.
So in our case we need the weights to be de-interleaved for the CPU backend to support this model, even though the GPU does not need it (we have not traced that path).
You can see the output above with and without de-interleaving on CPU.
Thanks.
I understand you need to de-interleave, I'm just trying to achieve that without changing the FusedMoE constructor. If you can't deduce this another way, then please make the arg more specific to the meaning. Maybe is_w13_interleaved
Thanks @mgoin, we will try to find another way to meet this requirement; otherwise I will change the parameter name to the more specific is_w13_interleaved.
We have the following questions, which arose from the discussion above:
After digging into the GPU BF16 path for gpt-oss (and some git archaeology), I found out that the interleaved gate-up handling for the gpt-oss BF16 path on GPU is baked into the swigluoai implementation. IMHO this is questionable design, because activation functions ideally shouldn't make assumptions about the memory layout of their inputs.
@isharif168, to unblock the CPU path (without tampering with GPU code) we should just do something similar to that implementation. I think we should address this technical debt with the BF16 path for gpt-oss in general, but I'm too GPU poor for that :(
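To make "baked into the swigluoai implementation" concrete, here is a rough sketch contrasting an activation that assumes an interleaved gate/up layout with one that assumes halves. The constants and exact clamping are assumptions based on public gpt-oss descriptions, not copied from vLLM's kernel:

```python
import torch

ALPHA = 1.702  # assumed gpt-oss SwiGLU constant
LIMIT = 7.0    # assumed clamp limit


def swiglu_oai_interleaved(x: torch.Tensor) -> torch.Tensor:
    # Layout assumption baked into the activation: even columns are gate,
    # odd columns are up, i.e. [g0, u0, g1, u1, ...].
    gate, up = x[..., 0::2], x[..., 1::2]
    gate = gate.clamp(max=LIMIT)
    up = up.clamp(min=-LIMIT, max=LIMIT)
    return (up + 1) * (gate * torch.sigmoid(ALPHA * gate))


def swiglu_oai_halved(x: torch.Tensor) -> torch.Tensor:
    # Same math, but assuming the de-interleaved [gate..., up...] halves.
    gate, up = x.chunk(2, dim=-1)
    gate = gate.clamp(max=LIMIT)
    up = up.clamp(min=-LIMIT, max=LIMIT)
    return (up + 1) * (gate * torch.sigmoid(ALPHA * gate))
```

If the layout assumption lives inside the activation, every backend's weight layout has to match it, which is the coupling being questioned here.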
@mgoin - I addressed the swigluoai impl for the CPU path in #29273; we now do exactly what the GPU BF16 path does w.r.t. handling of interleaved gate-up weights, and gpt-oss works on Arm CPUs ;) @isharif168, would you be able to take care of the int4 path for CPUs enabled by #23809? Do we need to do something similar there?
Hi @fadara01, for int4 we already have the gate + up weights de-interleaved, and it takes a different path.
Seems reasonable, but in other words, what if other models are MXFP4-quantized?
@isharif168 - As I said above, the design of the BF16 path for gpt-oss needs a revisit (in general), and I agree that we ideally shouldn't make any assumptions about memory layouts in activation functions.
Out of curiosity, do we know who does the de-interleaving here? Is that done in
@jeejeelee - thanks for your feedback. Could you please elaborate on this? I don't really follow.
@fadara01 The de-interleave in MXFP4 quantization is also hardcoded, so if there are other models that use MXFP4 quantization, would they also require that the gate and up weights be interleaved?
@jeejeelee - Yeah, I'm aware of this. I just think it's better to do this explicitly for the BF16 path, rather than encode that information in the SwigluOAI impl. Maybe we can do this in
I personally think that we should do it during weight loading or in
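As an illustration of the post-load option (assuming the truncated reference above points to something like a quant method's process_weights_after_loading hook; the class and attribute names below are hypothetical):

```python
import torch


class MoEMethodSketch:
    """Hypothetical method class, not vLLM's actual implementation."""

    def process_weights_after_loading(self, layer) -> None:
        # De-interleave w13 once after all shards are loaded, so downstream
        # kernels and the activation can assume [gate..., up...] halves.
        if getattr(layer, "is_weights_interleaved", False):
            w13 = layer.w13_weight.data
            # Assumes layout [num_experts, 2 * intermediate_size, hidden_size],
            # with gate/up interleaved along dim=1.
            even = torch.arange(0, w13.size(1), 2, device=w13.device)
            odd = torch.arange(1, w13.size(1), 2, device=w13.device)
            layer.w13_weight.data = torch.cat(
                [w13.index_select(1, even), w13.index_select(1, odd)], dim=1
            )
            layer.is_weights_interleaved = False
```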
I don't object. @mgoin WDYT
Purpose
Transform the gate + up MoE weights and biases from HF's interleaved layout into the halved layout that vLLM expects, so that gpt-oss produces correct results on the CPU backend.
Test Plan
Tested with gpt-oss
Test Result
Gives the correct results on CPU