@LucasWilkinson LucasWilkinson commented Aug 5, 2025

This PR makes it easier to get a correct FlashMLA integration with full CUDA graphs in vLLM, where padding for CUDA graphs happens after attention metadata building (i.e. after get_mla_metadata is called).

The changes

tile_scheduler_metadata0[0] = (now_idx >= batch_size ? -1 : now_idx);

and

if (begin_idx >= params.b || begin_idx < 0) return;

allow get_mla_metadata to be called with a smaller batch size than the graph was captured with, since it now uses -1 instead of params.b to mark no-work tiles.
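To illustrate why the -1 sentinel helps, here is a minimal pure-Python sketch of the early-exit test. It is a model, not the kernel code: `is_no_work_tile`, `old_sentinel`, and `new_sentinel` are hypothetical helpers, and the "old" behavior is assumed from the description above (marking no-work tiles with the batch size that was used when building the metadata).

```python
def is_no_work_tile(begin_idx: int, kernel_batch_size: int) -> bool:
    """Model of the kernel check: `begin_idx >= params.b || begin_idx < 0`."""
    return begin_idx >= kernel_batch_size or begin_idx < 0


def old_sentinel(now_idx: int, metadata_batch_size: int) -> int:
    """Assumed previous behavior: a no-work tile carries the metadata batch size."""
    return metadata_batch_size if now_idx >= metadata_batch_size else now_idx


def new_sentinel(now_idx: int, metadata_batch_size: int) -> int:
    """Behavior from this PR: a no-work tile carries -1."""
    return -1 if now_idx >= metadata_batch_size else now_idx


# The graph was captured with 8 requests, but metadata is built for only 4.
captured_batch_size = 8
actual_batch_size = 4

# A tile that ran out of requests while building metadata (now_idx == 4).
old_marker = old_sentinel(actual_batch_size, actual_batch_size)  # 4
new_marker = new_sentinel(actual_batch_size, actual_batch_size)  # -1

# Old marker: 4 >= 8 is False, so the tile is not skipped and would
# touch a padded request. New marker: -1 < 0, so the tile is always skipped.
old_skipped = old_marker >= captured_batch_size
new_skipped = is_no_work_tile(new_marker, captured_batch_size)
```

The point of the sketch is that -1 is independent of whatever batch size the kernel sees, so the metadata and the captured graph no longer need to agree on it.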

The change

if (my_num_splits <= 1) {

makes it easier to pad out num_splits (cumulative) since we can just do:

self.cg_buf_num_splits[n:].fill_(num_splits[-1])

which pads with 0-split elements instead of having to pad with 1-split elements, which is more complicated:

buffer_padding_size = self.cg_buf_num_splits.size(0) - n
self.cg_buf_num_splits[n:] = num_splits[-1] + 1 + torch.arange(buffer_padding_size)
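The difference between the two padding schemes can be seen in a small pure-Python sketch that uses lists in place of tensors. The helper names and the example values are hypothetical; the only thing taken from the PR is that num_splits is cumulative, so the number of splits for an entry is the difference between neighboring elements.

```python
def pad_zero_split(num_splits: list[int], buf_size: int) -> list[int]:
    """List analogue of `buf[n:].fill_(num_splits[-1])`."""
    n = len(num_splits)
    return num_splits + [num_splits[-1]] * (buf_size - n)


def pad_one_split(num_splits: list[int], buf_size: int) -> list[int]:
    """List analogue of `buf[n:] = num_splits[-1] + 1 + arange(padding)`."""
    n = len(num_splits)
    return num_splits + [num_splits[-1] + 1 + i for i in range(buf_size - n)]


def splits_per_entry(cumulative: list[int]) -> list[int]:
    """Recover the per-entry split counts from the cumulative array."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]


# Three real requests with 2, 1, and 3 splits; the buffer has room for six.
num_splits = [0, 2, 3, 6]
buf_size = 7

zero_padded = pad_zero_split(num_splits, buf_size)  # [0, 2, 3, 6, 6, 6, 6]
one_padded = pad_one_split(num_splits, buf_size)    # [0, 2, 3, 6, 7, 8, 9]

zero_counts = splits_per_entry(zero_padded)  # [2, 1, 3, 0, 0, 0]
one_counts = splits_per_entry(one_padded)    # [2, 1, 3, 1, 1, 1]
```

With the `my_num_splits <= 1` check, a padded entry with 0 splits takes the same branch as an entry with 1 split, so the simpler constant fill is sufficient.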

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@tlrmchlsmth tlrmchlsmth merged commit 0e43e77 into main Aug 5, 2025
1 check passed
MatthewBonanni added a commit to MatthewBonanni/FlashMLA that referenced this pull request Aug 12, 2025
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
LucasWilkinson pushed a commit that referenced this pull request Aug 18, 2025
* Add files from deepseek-ai#54

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

* FP8 now extends base implementation

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

* Fix typo

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

* Update tests

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

* Add to build

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

* Fix installation

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

* Fix FLASH_MLA_DISABLE_FP8 flag

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

* Fix param matchup

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

* typo

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

* Fix out dtype

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

* Fix IMA

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

* Extension name should be _flashmla_C

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

* Clean up

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

* Tighten FP8 error tolerance

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

* Add attribution to copied files

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

* Remove breakpoint

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

* Port cudagraph fix from #3

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>

---------

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
3 participants