[NIXL] Add support for MLA caches with different latent dim #25902
Conversation
Code Review

This pull request introduces support for variable KV cache shapes per layer in MLA models by replacing the scalar `block_len` with a list `block_lens`. The changes are primarily within `nixl_connector.py` and appear to correctly implement the intended feature. However, I've identified a critical issue in the calculation of `remote_block_size` when using FlashInfer with MLA models, which would lead to a failed assertion. I have provided a suggested fix for this issue. The rest of the changes are consistent and well-implemented.
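For illustration, here is a minimal sketch of the per-layer block-length idea the review describes. The helper name `compute_block_lens` and the concrete dims are assumptions for this sketch, not the actual `nixl_connector.py` code:

```python
from math import prod

# Hypothetical helper: each layer's KV block size in bytes is derived
# from that layer's own cache shape, so a layer with a different last
# dim (e.g. the Indexer cache) gets its own entry in the list.
def compute_block_lens(kv_cache_shapes: list[tuple[int, ...]],
                       elem_size: int) -> list[int]:
    return [prod(shape) * elem_size for shape in kv_cache_shapes]

# Illustrative values: block_size=64 tokens, bf16 (2 bytes/elem);
# 576 = assumed MLA latent dim (512 kv_lora_rank + 64 rope dim),
# 128 = assumed Indexer cache head dim.
block_size = 64
shapes = [(block_size, 576), (block_size, 128)]
print(compute_block_lens(shapes, elem_size=2))  # [73728, 16384]
```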
LGTM!
Why is the shape of the KV caches different in MLA? It seems that each layer's KV cache shape is the same. https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/config.json

Check out the Indexer cache.
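As a hedged illustration of why a single scalar no longer suffices (layer count and dims are assumed for the sketch, not read from the config):

```python
# A scalar block_len implicitly assumes every layer's block has the
# same byte length; the Indexer cache breaks that assumption.
# Illustrative values: bf16 (2 bytes/elem), block_size=64 tokens.
mla_block_len = 64 * 576 * 2      # assumed MLA latent dim 576
indexer_block_len = 64 * 128 * 2  # assumed Indexer head dim 128

block_lens = [mla_block_len] * 61 + [indexer_block_len]
assert len(set(block_lens)) > 1   # heterogeneous -> need a list, not a scalar
```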
This PR enables the transfer of KV caches with different shapes (last dim only, for now) for MLA models, and in particular for the new DeepSeek-V3.2, allowing its Indexer cache to be sent/received as well in a disaggregated setup.

It does so by extending `block_len` (NHD in a regular KV cache) to `block_len_per_layer`, allowing each layer to define its own "stride". This approach could be reused for dense models too, but for now it is restricted to MLA, since the dense case should first be aligned with the ongoing HMA integration effort.

Related to #25101.
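A hedged sketch of what a per-layer stride buys (hypothetical helper; the real addressing logic lives in `nixl_connector.py`):

```python
# Byte offset of a KV block inside one layer's cache region, using that
# layer's own block length as the stride (hypothetical helper).
def block_offset(base_addr: int, block_id: int,
                 block_len_per_layer: list[int], layer: int) -> int:
    return base_addr + block_id * block_len_per_layer[layer]

# With a single scalar block_len, both layers below would be forced to
# share one stride; per-layer strides let the Indexer layer differ.
strides = [73728, 16384]  # same illustrative values as above
print(block_offset(0, 3, strides, layer=0))  # 221184
print(block_offset(0, 3, strides, layer=1))  # 49152
```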