Allow mask-free usage #1

mranzinger · 2023-04-27T09:21:47Z

I tried to use your implementation and ended up making a few tweaks to get the code working when attn_mask is not supplied, which should be a valid mode.
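For illustration, a minimal sketch of what "mask-free" means here, assuming plain scaled dot-product attention over tensors shaped (batch, heads, seq, head_dim). The helper name `attend` and the shapes are hypothetical and are not taken from multihead_attention.py; the point is only that the mask is added when present and skipped, rather than created, when absent.

```python
import math
from typing import Optional

import torch


def attend(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    attn_mask: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    """Hypothetical helper: scaled dot-product attention with an optional additive mask.

    q, k, v: (batch, heads, seq, head_dim)
    attn_mask: None, or a float tensor broadcastable to (batch, heads, tgt, src)
    """
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if attn_mask is not None:
        # Only touch the scores when a mask was actually supplied.
        scores = scores + attn_mask
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
q = torch.randn(2, 4, 5, 8)
k = torch.randn(2, 4, 5, 8)
v = torch.randn(2, 4, 5, 8)

out_no_mask = attend(q, k, v)                       # no mask supplied
out_zero_mask = attend(q, k, v, torch.zeros(5, 5))  # all-zero additive mask
```

With an all-zero additive mask the result should match the mask-free call, which makes for a simple sanity check.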

Even when I did supply attn_mask, I ran into problems where using += broke broadcasting rules (implicit with the attn_mask.unsqueeze(0) call for batch). I also needed the

attn = attn.permute(0, 2, 1, 3).flatten(2)

to get the attn tensor into the expected format.
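Both points can be reproduced in isolation. The sketch below uses made-up shapes and a hypothetical per-batch term `key_padding`; it is not the code from this PR. It shows that an in-place += cannot enlarge the left-hand tensor to the broadcast shape, while the out-of-place form can, and that permute plus flatten turns an output assumed to be laid out as (batch, heads, seq, head_dim) into (batch, seq, heads * head_dim).

```python
import torch

B, H, T, D = 2, 4, 5, 8  # hypothetical batch, heads, sequence length, head dim

# A (T, T) mask given a leading batch axis of size 1, as with attn_mask.unsqueeze(0).
attn_mask = torch.zeros(T, T).unsqueeze(0)  # (1, T, T)
key_padding = torch.zeros(B, 1, T)          # hypothetical per-batch additive term

# In-place addition must write into the existing (1, T, T) storage, so it
# cannot produce the broadcast shape (B, T, T) and raises instead.
try:
    attn_mask += key_padding
    inplace_failed = False
except RuntimeError:
    inplace_failed = True

# The out-of-place form allocates a new tensor with the broadcast shape.
combined = attn_mask + key_padding          # (B, T, T)

# Assumed attention output layout: (batch, heads, seq, head_dim).
attn = torch.randn(B, H, T, D)
# Move heads next to head_dim, then merge them: (batch, seq, heads * head_dim).
attn = attn.permute(0, 2, 1, 3).flatten(2)
```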


usryokousha · 2023-05-17T13:21:49Z

That looks about right! Sorry, I should have included a unit test!

mranzinger added 4 commits April 23, 2023 18:08

Update multihead_attention.py

a5a9419

Update multihead_attention.py

412a1a3

Update multihead_attention.py

d4a62cc

Update multihead_attention.py

62cedb9

mranzinger mentioned this pull request Apr 27, 2023

Swapped naive dot product attention for flash attention microsoft/torchscale#24

Open

Masks are now optional, and not created. Fixes required to support FlashAttention (e.g. no mask, fp/bf16)

29c6ead

usryokousha merged commit dd69dcb into usryokousha:efficient May 17, 2023
