Status: Closed
Labels: high priority, module: functorch (pertaining to torch.func or pytorch/functorch), module: vmap, triage review, triaged
Description
🚀 The feature, motivation and pitch
Hi, I am trying to take batched gradients of a vector output produced by _scaled_dot_product_efficient_attention, but I see the following warning when running my code:
/site-packages/optimum/bettertransformer/models/attention.py:56: UserWarning: There is a performance drop because we have not yet implemented the batching rule for aten::_scaled_dot_product_efficient_attention. Please file us an issue on GitHub so that we can prioritize its implementation. (Triggered internally at ../aten/src/ATen/functorch/BatchedFallback.cpp:82.)
sdpa_result = torch.nn.functional.scaled_dot_product_attention
Implementing this batching rule would really increase throughput in our application! I would also be happy to take a stab at implementing it, if there is a document describing at a high level what I would need to do.
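For context, here is a minimal sketch of the pattern that hits the fallback: torch.func.vmap over a per-sample gradient of torch.nn.functional.scaled_dot_product_attention. The shapes and the sum-based loss below are just illustrative, not my actual model, and the warning appears on backends that select the memory-efficient attention kernel.

import torch
import torch.nn.functional as F
from torch.func import grad, vmap

# Illustrative shapes only: batch of 4, 2 heads, 8 tokens, 16-dim heads.
B, H, S, D = 4, 2, 8, 16

def attention_loss(q, k, v):
    # q, k, v are per-sample tensors of shape (H, S, D) once vmap strips the batch dim.
    out = F.scaled_dot_product_attention(q, k, v)
    return out.sum()

q = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)
v = torch.randn(B, H, S, D)

# Per-sample gradients w.r.t. q. Without a batching rule for the efficient-attention
# kernel, functorch falls back to a slow per-example loop and emits the UserWarning quoted above.
per_sample_grads = vmap(grad(attention_loss), in_dims=(0, 0, 0))(q, k, v)
print(per_sample_grads.shape)  # expected: (B, H, S, D)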
Alternatives
No response
Additional context
No response
cc @ezyang @gchanan @zou3519 @kadeng @jbschlosser @bhosmer @cpuhrsch @erichan1 @drisspg @mikaylagawarecki @Chillee @samdow @kshitij12345 @janeyx99