[Model][Bugfix] Fix issues in MiDashengLM implementation for quantized models #25854
Conversation
Signed-off-by: zhoukz <me@zhoukz.com>
…lock` loads quantized model parameters correctly Signed-off-by: zhoukz <me@zhoukz.com>
…mplementation Signed-off-by: zhoukz <me@zhoukz.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of CI tests runs automatically, and you can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces several important bug fixes for the MiDashengLM implementation, particularly for quantized models. The changes include adding `packed_modules_mapping` for bitsandbytes compatibility, correcting parameter prefixes, updating the audio encoder frontend, and fixing the audio encoder attention mechanism.
My review identified a critical issue in the updated `DashengAttention.forward` method related to tensor reshaping when using tensor parallelism. The current implementation will likely cause errors or incorrect behavior when `tp_size > 1`. I've provided a detailed comment and a code suggestion to address this. The other changes appear correct and align with the goals of the PR.
qkv, _ = self.qkv(x)
qkv = qkv.reshape(B, N, 3, self.num_heads, C // self.num_heads)
qkv = qkv.permute(2, 0, 3, 1, 4)
q, k, v = qkv.unbind(0)

attn = (q @ k.transpose(-2, -1)) * self.scale
if self.causal:
    mask_value = -torch.finfo(attn.dtype).max
    i, j = attn.shape[-2:]
    mask = torch.ones(i, j, device=q.device,
                      dtype=torch.bool).triu(j - i + 1)
    attn = attn.masked_fill(mask, mask_value)
if mask is not None:
    mask_value = torch.finfo(attn.dtype).min
    attn_mask = mask[:, None, None, :].expand(B, 1, N, N)
    attn = attn.masked_fill(attn_mask, mask_value)
attn = attn.softmax(dim=-1)
attn = torch.nan_to_num(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, C)

x, _ = self.proj(x)
The tensor reshaping logic in `DashengAttention.forward` appears to be incorrect for tensor parallelism (`tp_size > 1`), which could lead to runtime errors or incorrect outputs.
- Incorrect reshape of `qkv`: On line 218, `C // self.num_heads` is used for the head dimension. This is only correct when `tp_size` is 1. For `tp_size > 1`, it should be `self.head_dim`.
- Incorrect reshape of the attention output: On line 235, the attention output is reshaped to `(B, N, C)`. The number of elements in `(attn @ v).transpose(1, 2)` is `B * N * self.num_heads * self.head_dim`, which equals `B * N * self.embed_dim / tp_size`. Reshaping this to `(B, N, C)` (where `C` is `self.embed_dim`) will fail if `tp_size > 1`. It should be reshaped to `(B, N, -1)` to match the partitioned input size expected by the subsequent `RowParallelLinear` layer.
Here is a suggested correction for this part of the `forward` method:
qkv, _ = self.qkv(x)
qkv = qkv.reshape(B, N, 3, self.num_heads, self.head_dim)
qkv = qkv.permute(2, 0, 3, 1, 4)
q, k, v = qkv.unbind(0)
attn = (q @ k.transpose(-2, -1)) * self.scale
if self.causal:
    mask_value = -torch.finfo(attn.dtype).max
    i, j = attn.shape[-2:]
    mask = torch.ones(i, j, device=q.device,
                      dtype=torch.bool).triu(j - i + 1)
    attn = attn.masked_fill(mask, mask_value)
if mask is not None:
    mask_value = torch.finfo(attn.dtype).min
    attn_mask = mask[:, None, None, :].expand(B, 1, N, N)
    attn = attn.masked_fill(attn_mask, mask_value)
attn = attn.softmax(dim=-1)
attn = torch.nan_to_num(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
x, _ = self.proj(x)
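To make the shape mismatch concrete, here is a minimal, self-contained check of the arithmetic behind both points. The numbers and names (`B`, `N`, `embed_dim`, `total_heads`, `tp_size`) are illustrative assumptions for this sketch, not values taken from the MiDashengLM config or the vLLM parallel layers themselves.

# Minimal sanity check of the reshape arithmetic under tensor parallelism.
# All values below are assumptions chosen for illustration only.
import torch

B, N = 2, 16                                  # assumed batch size and sequence length
embed_dim, total_heads, tp_size = 768, 12, 2  # assumed model / parallel config
num_heads = total_heads // tp_size            # per-rank heads: 6
head_dim = embed_dim // total_heads           # 64
C = embed_dim                                 # last dimension of the input x

# A column-parallel qkv projection gives each rank 3 * embed_dim // tp_size features.
qkv = torch.randn(B, N, 3 * embed_dim // tp_size)   # (2, 16, 1152)

# Point 1: C // num_heads == 128, so 3 * 6 * 128 != 1152 and the reshape fails.
try:
    qkv.reshape(B, N, 3, num_heads, C // num_heads)
except RuntimeError as exc:
    print("buggy reshape fails:", exc)

# Using head_dim instead works: 3 * 6 * 64 == 1152.
q, k, v = (qkv.reshape(B, N, 3, num_heads, head_dim)
              .permute(2, 0, 3, 1, 4)
              .unbind(0))

# Point 2: the per-rank attention output is num_heads * head_dim = embed_dim // tp_size
# wide, so it must be flattened with -1 rather than C before the row-parallel proj.
out = ((q @ k.transpose(-2, -1)) * head_dim**-0.5).softmax(dim=-1) @ v
print(out.transpose(1, 2).reshape(B, N, -1).shape)   # torch.Size([2, 16, 384])

With `tp_size == 1` the two variants coincide (`C // num_heads == head_dim` and `num_heads * head_dim == C`), which is why the bug only surfaces when tensor parallelism is enabled.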
cc @Isotr0py regarding this
…anual implementation Signed-off-by: zhoukz <me@zhoukz.com>
Force-pushed from b761fd8 to d844ae2
LGTM!
…d models (vllm-project#25854) Signed-off-by: zhoukz <me@zhoukz.com>
…d models (#25854) Signed-off-by: zhoukz <me@zhoukz.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
Purpose
This PR addresses several issues discovered during the testing of MiDashengLM quantization.
- Added `packed_modules_mapping` to `MiDashengLMModel`: The `MiDashengLMModel` was missing the `packed_modules_mapping` attribute, which caused an error when loading models quantized with bitsandbytes. This has been added to ensure compatibility (a sketch of such a mapping follows after this list).
- Corrected the prefix passed by `DashengAudioTransformer`: The `DashengAudioTransformer` was providing an incorrect prefix to `DashengBlock`, resulting in a parameter mismatch error when loading quantized models. This PR provides the correct prefix to resolve the loading issue.
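For context on the first fix: in vLLM, `packed_modules_mapping` tells the weight loader and quantization backends (including bitsandbytes) which original checkpoint sub-modules were fused into each packed parallel layer. The snippet below is only a sketch of the attribute's general shape with commonly seen entries; the exact mapping added for `MiDashengLMModel` may differ.

# Hedged sketch of a packed_modules_mapping on a vLLM model class.
# The entries are typical examples, not necessarily MiDashengLM's actual mapping.
packed_modules_mapping = {
    # fused layer name -> checkpoint sub-module names merged into it
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

Without such a mapping, per-module quantization state from the checkpoint cannot be matched to the fused parallel layers, which is consistent with the bitsandbytes loading error described above.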
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.