
Multihead Attention does not work with jagged tensors due to __torch_function__ #153472

@SamuelGabriel

Description


🐛 Describe the bug

When I create a nested tensor in the jagged layout and pass it to torch.nn.MultiheadAttention, it throws:

AssertionError: MultiheadAttention does not support NestedTensor outside of its fast path. The fast path was not hit because some Tensor argument has_torch_function

Reproduce with:

import torch

device = "cpu"  # or "cuda"; the original snippet left `device` undefined
N = 512
nheads = 8

query = torch.nested.nested_tensor(
    [torch.randn(100, N, device=device) for _ in range(2)],
    layout=torch.jagged,
)

mha = torch.nn.MultiheadAttention(N, nheads, batch_first=True).eval()
mha(query, query, query)

It works with strided nested tensors, but the documentation says the jagged layout should be supported, and the underlying scaled_dot_product_attention does support it.

Versions

I am on Python 3.12 and torch 2.8.0a0 (+fb).

cc @cpuhrsch @jbschlosser @bhosmer @drisspg @soulitzer @davidberard98 @YuqingJ

Metadata


Labels: actionable; module: nestedtensor (NestedTensor tag, see issue #25032); triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
