
add FlashAttentionKwargs and seq_idx to flat collator #36456

Open
wants to merge 23 commits into base: main

Conversation

@garrett361 commented Feb 27, 2025

What does this PR do?

Adds additional, optional return values in DataCollatorWithFlattening as needed for padding-free training with particular models.
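
As a rough usage sketch (flag names are assumed from the PR title and diff, and the exact output keys may differ):

from transformers import DataCollatorWithFlattening

# Sketch only: return_seq_idx / return_flash_attn_kwargs are assumed flag names.
collator = DataCollatorWithFlattening(
    return_position_ids=True,       # existing behavior
    return_seq_idx=True,            # assumed: per-token sequence index for mamba-style models
    return_flash_attn_kwargs=True,  # assumed: cu_seq_lens_{q,k} and max lengths for FA2 varlen kernels
)

features = [
    {"input_ids": [1, 2, 3, 4]},
    {"input_ids": [5, 6, 7]},
]
batch = collator(features)
# The two examples are concatenated into a single row with no padding; the extra,
# optional outputs let FA2 / mamba kernels recover the sequence boundaries.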

Relates to #35861 and #35941.

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@garrett361 garrett361 marked this pull request as draft February 27, 2025 16:23

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@garrett361 garrett361 force-pushed the flash-attn-kwargs-in-flat-collator branch from 1d732bf to f0113af Compare February 27, 2025 16:23
@garrett361 garrett361 mentioned this pull request Feb 27, 2025
@vasqu (Contributor) commented Feb 27, 2025

I'd like to take a look as well when you think it's ready, so feel free to ping me then :)

@vasqu (Contributor) commented Feb 27, 2025

cc @ArthurZucker

@garrett361 garrett361 force-pushed the flash-attn-kwargs-in-flat-collator branch 2 times, most recently from 3daac1b to a3fc94c Compare March 4, 2025 14:39
@garrett361 garrett361 marked this pull request as ready for review March 4, 2025 14:57
@garrett361 (Author)

@vasqu could you please take a look when you have time? Thank you.

@vasqu (Contributor) left a comment

Smaller issues/nits overall. I'd be in favor of a warning for settings that might cause issues (RoPE models receiving only the FA kwargs).

Otherwise, not on you but I think it would be nice to have equivalent collator tests on torch at least. Usually, all corresponding paths are tested (pt, tf, np).

Comment on lines +1799 to +1897
return_position_ids=True,
return_flash_attn_kwargs=False,
Contributor

IIUC, #35941 points to a problem where the FA kwargs will subsequently cause issues on the RoPE paths (with FA kwargs True and position ids False).

I'd be in favor of a warning on init for the bad combo of FA kwargs True and position ids False (maybe someone has a different use case, so it shouldn't directly cause errors).
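
A minimal sketch of the kind of warning meant here (a hypothetical helper standing in for a check in DataCollatorWithFlattening.__init__, flag names taken from the diff above):

import warnings

def _warn_on_risky_flag_combo(return_flash_attn_kwargs: bool, return_position_ids: bool) -> None:
    # Hypothetical helper: warn on the combination discussed above rather than erroring out.
    if return_flash_attn_kwargs and not return_position_ids:
        warnings.warn(
            "return_flash_attn_kwargs=True with return_position_ids=False may break models "
            "that compute RoPE from position_ids."
        )

_warn_on_risky_flag_combo(return_flash_attn_kwargs=True, return_position_ids=False)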

Author

maybe someone has a different use case which shouldn't directly cause errors

Yeah, I thought about warnings in cases like that, but I was hesitant because of different requirements for different models.

For example, if a transformer model uses FA but not RoPE, then FA True, pos_ids False makes sense. And for a mamba-only model (like mamba2), FA False, pos_ids False, seq_idx True is what you'd use.
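
As a sketch of those combinations (again assuming the flag names from this PR; return_seq_idx in particular is assumed from the PR title):

from transformers import DataCollatorWithFlattening

# Transformer with FA2 and RoPE: position_ids plus the FA kwargs.
rope_fa_collator = DataCollatorWithFlattening(
    return_position_ids=True, return_flash_attn_kwargs=True
)

# Transformer with FA2 but no RoPE: the FA kwargs alone are enough.
fa_only_collator = DataCollatorWithFlattening(
    return_position_ids=False, return_flash_attn_kwargs=True
)

# Mamba-only model (e.g. mamba2): only the per-token sequence index is needed.
mamba_collator = DataCollatorWithFlattening(
    return_position_ids=False, return_flash_attn_kwargs=False, return_seq_idx=True
)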

Author

So IMO it should be up to the model to ensure it's getting the inputs it needs, and to raise a ValueError or similar if an improper combination is passed.

Contributor

Fair point; then the FA utils should be written in a way that ensures this (which should be a follow-up PR to this one).

Author

I don't think this can be addressed at the level of the FA utils, since different models can logically use different valid combinations here. The FA utils just need to be able to handle the different combos.

It does look like passing non-trivial FlashAttentionKwargs with position_ids=None and attention_mask=None is currently not supported, though. IIUC you'd end up in this block with all of your cu_seq_lens_{q,k} etc. ignored.

Contributor

Couldn't we just detect whether FA kwargs were passed (before the if/else that branches into the different paths) and handle that case when position_ids is None? It might warrant an error (unintentional path) or a warning (no-padding path); unsure here. (Or we could even implement that path ourselves, which seems unwanted.)

IMO it's a bit confusing that the original flash-attn can handle those args while we can't. As a user familiar with FA but not transformers, it would silently fall through on me.
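
A rough sketch of the kind of detection being proposed (illustrative only, not the actual transformers FA utils; names are made up):

def choose_fa_path(attention_mask, position_ids, cu_seq_lens_q, cu_seq_lens_k):
    # Check for explicit FA kwargs before branching on attention_mask / position_ids,
    # instead of silently falling through to the standard (padded) path.
    if cu_seq_lens_q is not None and cu_seq_lens_k is not None:
        return "varlen_from_fa_kwargs"
    if attention_mask is not None:
        return "varlen_from_attention_mask"
    if position_ids is not None:
        return "varlen_from_position_ids"
    return "standard"

# With FA kwargs but no mask/positions, we'd now take the varlen path rather than the padded one.
assert choose_fa_path(None, None, [0, 4, 7], [0, 4, 7]) == "varlen_from_fa_kwargs"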

Author

Couldn't we just detect if fa kwargs were passed (before the if else into the different paths) and handle it then if position_ids is None?

Yeah, not sure why this isn't done in the current FA utils. Also found this confusing.

Contributor

Yeah, either way I think some sort of handling is warranted here, which should (hopefully) help ease up the bamba checks.

@garrett361 (Author)

I think it would be nice to have equivalent collator tests on torch at least

I found it a little surprising that it's all numpy, as far as I saw.

@vasqu (Contributor) commented Mar 4, 2025

Would you be willing to add the other paths or at least pt? Seems like an oversight on the initial PR 👀

@garrett361 (Author)

Would you be willing to add the other paths or at least pt? Seems like an oversight on the initial PR

Yep, and I just discovered that the "pt" path is broken because int entries like those in cu_seq_lens_{q,k} turn into torch.int64s, whereas FA2 needs them to be torch.int32s 😭

So, this is all going to take some reworking. I'll ping again later.
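
For reference, a minimal illustration of the dtype issue (FA2's varlen interface expects int32 cumulative sequence lengths):

import torch

cu_seq_lens = [0, 4, 7, 12]

# Default "pt" conversion of Python ints gives int64, which FA2 rejects.
assert torch.tensor(cu_seq_lens).dtype == torch.int64

# The varlen FA2 kernels want int32.
cu_seq_lens_q = torch.tensor(cu_seq_lens, dtype=torch.int32)
assert cu_seq_lens_q.dtype == torch.int32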

@garrett361 garrett361 force-pushed the flash-attn-kwargs-in-flat-collator branch from a3fc94c to 601e649 Compare March 14, 2025 15:48
@garrett361 (Author) commented Mar 14, 2025

Alright, made a few changes:

  • No batch dimension is expected on any of the FlashAttentionKwargs, and these variables and seq_idx must be int32 rather than the default int64 (see the sketch after this list). I removed the reliance on the default data collators to achieve this.
  • I added a ModelTesterMixin::test_flash_attention_2_padding_matches_padding_free_with_position_ids_from_flat_collator which both has an awfully long name and is probably an unnecessarily expensive addition to the test suite.
  • Added tf and pt tests for the flat collator.
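
A sketch of the output format described in the first bullet, for two packed sequences of lengths 4 and 3 (the FA kwarg key names are assumed from the discussion and may differ):

import torch

expected = {
    "input_ids": torch.tensor([[1, 2, 3, 4, 5, 6, 7]]),                 # usual [1, total_len] shape
    "position_ids": torch.tensor([[0, 1, 2, 3, 0, 1, 2]]),              # restarts at each sequence
    "seq_idx": torch.tensor([0, 0, 0, 0, 1, 1, 1], dtype=torch.int32),  # int32, no batch dim
    "cu_seq_lens_q": torch.tensor([0, 4, 7], dtype=torch.int32),        # int32, no batch dim
    "cu_seq_lens_k": torch.tensor([0, 4, 7], dtype=torch.int32),
    "max_seq_len_q": 4,  # assumed to be a plain Python int; see the review comments below
    "max_seq_len_k": 4,
}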

Do you have any advice on the test, @vasqu? I wanted a non-trivial test that the collator outputs are in the right format (which is easy to get wrong), but the above seems like overkill.

EDIT: I only verified that the LlamaModelTest::test_flash_attention_2_padding_matches_padding_free_with_position_ids_from_flat_collator version of the test passes, I should probably say.

@vasqu (Contributor) left a comment

I think it's solid overall. Some implementations could be simplified IMO (e.g. the wrapper at the end of the collator), and I left some other smaller comments.

No batch dimension is expected on any of the FlashAttentionKwargs, and all these variables and seq_idx must be int32 rather than the default int64. I removed the reliance on the default data collators to achieve this.

Personally, I don't think the comment about the batch dim provides any real value; I would drop it, as it kind of confused me at first. The int32 adjustments are good; I'm only worried about max_seq_len_q/k (should be a simple Python int32, but see comments).

I added a ModelTesterMixin::test_flash_attention_2_padding_matches_padding_free_with_position_ids_from_flat_collator which both has an awfully long name and is probably an unnecessarily expensive addition to the test suite.

IMO, it could be added to the previous padding-free test. If you want to keep it as is, I'd suggest a rename (the "from flattening collator" suffix is rather non-telling).

Added tf and pt tests for the flat collator.

🥳

Comment on lines 1872 to 1873
ret["input_ids"] = data_cls(ret["input_ids"], dtype=dtype_64)[None]
ret["labels"] = data_cls(ret["labels"], dtype=dtype_64)[None]
Contributor

I have a feeling that this (and the following calls) could be simplified, e.g. something like

# Convert only the keys that are present, each with its matching dtype.
for key, dtype in zip(["input_ids", ...], [dtype_64, ...]):
    if ret.get(key) is not None:
        ret[key] = data_cls(ret[key], dtype=dtype)

rough draft ^

Author

Cleaned up.

@vasqu (Contributor) left a comment

LGTM! cc @ArthurZucker for core maintainer review

@vasqu (Contributor) commented Mar 17, 2025

Looks like you'll need to run make style for the CI to be happy. I think the PR is overall ready, though.

@garrett361 garrett361 force-pushed the flash-attn-kwargs-in-flat-collator branch from 5590060 to b582c82 Compare March 21, 2025 14:08
@garrett361 (Author)

@ArthurZucker please let me know if I can answer any further questions

@garrett361 (Author)

@vasqu any advice here?

@vasqu (Contributor) commented Mar 29, 2025

@garrett361 sorry, I don't have anything to add. Arthur is quite busy, so notifications can easily get lost at times; I'm going to cc again.

@vasqu (Contributor) commented Mar 29, 2025

cc @ArthurZucker @Cyrilvallez for core maintainer review
