-
-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[fix]: remove data type hardcoding from gptoss model implementation #23807
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request refactors the router in the GptOssMoE
layer to use ReplicatedLinear
, which enables quantization support and removes a hardcoded bfloat16
dtype. This is a good improvement for flexibility and performance. However, the change also modifies the router's architecture by removing the bias term, which could be a critical issue if the original model checkpoints expect a bias. Overall, the changes are well-targeted to address the quantization issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original implementation used torch.nn.Linear
, which has bias=True
by default. This change explicitly sets bias=False
. This is a change in the model's architecture. If gpt-oss
model checkpoints contain a bias term for the router, this change will cause weight loading to fail. If they do not contain a bias, then this change is a valuable correction, as the previous implementation would have used a randomly initialized bias, leading to incorrect behavior.
This pull request has merge conflicts that must be resolved before it can be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is similar to glm4 moe, cannot use ReplicatedLinear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How should we handle the hardcoded dtype?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the hardcode for now, and then can we add acomment like in glm moe to indicate that we cannot use ReplicatedLinear
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hardcode causes a data type missmatch issue when we load model in fp32 or half. The in and out tensor data type conversion for this layer is not handled correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, let's remove the hardcoded type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com>
892226c
to
bfbf330
Compare
@jeejeelee can you please review and merge this PR. thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you
…llm-project#23807) Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com>
…llm-project#23807) Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com>
…llm-project#23807) Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com> Signed-off-by: charlifu <charlifu@amd.com>
Purpose
Fix gptoss router layer to support quantization and remove hardcoded data type
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md
andexamples
for a new model.