MultiheadAttention is_causal=True is ignored if need_weights=True #99282
Comments
#97214 clarifies that is_causal is a hint and has no semantic power on its own. If is_causal is supplied, it can optionally be used by nn.MHA and F.MHA in lieu of attn_mask. Please verify with the documentation in the nightlies and in the upcoming 2.0.1 bug-fix release. Starting with 2.0.1, this should be flagged as an error.
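A hedged sketch of the module-level pattern described above (shapes and values are illustrative assumptions, not taken from the issue): the causal mask is still materialized explicitly, and is_causal=True only hints that the supplied attn_mask is causal.

```python
import torch
import torch.nn as nn

L, E = 4, 8
mha = nn.MultiheadAttention(embed_dim=E, num_heads=2, batch_first=True)
x = torch.randn(1, L, E)

# Boolean mask: True marks positions a query may NOT attend to.
causal_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)

# The mask is still required; is_causal=True merely hints that it is causal.
out, _ = mha(x, x, x, attn_mask=causal_mask, is_causal=True, need_weights=False)
```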
This seems strange given the lines right below:
We require an attn_mask to be given but then pass along None instead. In this case there is no reason to create the attn_mask. To save time on unnecessary memory allocation we can use …
To be clear, this code is in the general MHA implementation, which has to support many use cases. For MHA, is_causal is a hint, and the attention mask is necessary because there are many legacy code paths that require an attention mask. So in the nn.MHA and nn.Transformer* module API, is_causal was added as a hint on top of the existing attention mechanism, and the instantiated attention mask is still required. However, we recommend that many users use the new SDPA operator directly. Because it has fewer execution scenarios to support, SDPA implements is_causal as an alternative to the attention mask. For an example, see https://pytorch.org/blog/accelerating-large-language-models/, which shows how we enabled nanoGPT to use SDPA directly and without a mask.
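As a hedged illustration of the SDPA route recommended above (tensor shapes here are assumptions for the sketch): no attention mask is allocated at all, and the operator applies causal masking internally.

```python
import torch
import torch.nn.functional as F

B, H, L, D = 2, 4, 16, 64  # batch, heads, sequence length, head dim (illustrative)
q = torch.randn(B, H, L, D)
k = torch.randn(B, H, L, D)
v = torch.randn(B, H, L, D)

# No attn_mask is materialized; is_causal=True handles the masking in-kernel.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 4, 16, 64])
```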
🐛 Describe the bug
When need_weights=True, is_causal is ignored in MultiheadAttention.forward and the result without causal masking is returned.
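A minimal repro sketch, assuming PyTorch 2.0.0 where the issue was observed (shapes and seed are illustrative, and on 2.0.1+ the mask-less is_causal call is expected to error out instead, per the comment above): with need_weights=True, the output produced with is_causal=True matches the unmasked output rather than the output obtained with an explicit causal attn_mask.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
L, E = 4, 8
mha = nn.MultiheadAttention(embed_dim=E, num_heads=2, batch_first=True).eval()
x = torch.randn(1, L, E)
causal_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)

with torch.no_grad():
    out_hint, _ = mha(x, x, x, need_weights=True, is_causal=True)        # hint only
    out_mask, _ = mha(x, x, x, need_weights=True, attn_mask=causal_mask) # explicit mask
    out_none, _ = mha(x, x, x, need_weights=True)                        # no masking

print(torch.allclose(out_hint, out_none))  # True on affected builds: hint ignored
print(torch.allclose(out_hint, out_mask))  # False: causal masking not applied
```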
Versions
cc @jbschlosser @bhosmer @cpuhrsch @erichan1