NLLB currently supports FlashAttention in HF Transformers. Unfortunately, FlashAttention degrades output quality because it does not properly support padding masks. SDPA provides an alternative route to optimized attention: under the hood it can dispatch to FlashAttention or Memory Efficient Attention, and Memory Efficient Attention should support masking. Here is the issue for adding SDPA support to models in Transformers. For a list of currently supported models, check out the Transformers documentation. A good example to follow would be BART, which has a full encoder-decoder architecture. It might also be useful to check out this PR that adds SDPA support to T5, another encoder-decoder model.
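For context, here is a minimal sketch (not an implementation of this issue) of the two pieces involved. PyTorch's `torch.nn.functional.scaled_dot_product_attention` accepts an explicit `attn_mask`, which is what makes padding-aware optimized attention possible; and once an SDPA attention class is wired into the NLLB (M2M100) modeling code, users could presumably opt in via `attn_implementation="sdpa"` as with already-supported models. The checkpoint name and tensor shapes below are only illustrative.

```python
import torch
import torch.nn.functional as F

# Toy shapes: (batch, heads, seq_len, head_dim)
batch, heads, seq_len, head_dim = 2, 4, 6, 16
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# Boolean padding mask, broadcastable to (batch, heads, seq_len, seq_len):
# True = attend, False = masked out (e.g. padding tokens).
padding_mask = torch.ones(batch, 1, 1, seq_len, dtype=torch.bool)
padding_mask[0, ..., -2:] = False  # pretend sample 0 ends with 2 padding tokens

# SDPA respects the mask while still dispatching to an optimized kernel
# (e.g. Memory Efficient Attention) when one is available.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=padding_mask)

# Once the SDPA path exists for NLLB, opting in would likely look like this
# (this does not work today, which is the point of the issue):
# from transformers import AutoModelForSeq2SeqLM
# model = AutoModelForSeq2SeqLM.from_pretrained(
#     "facebook/nllb-200-distilled-600M", attn_implementation="sdpa"
# )
```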