
Batched flex attention to reduce max memory usage #133206

@cathalobrien

Description


🚀 The feature, motivation and pitch

I'm working with flex attention. For longer sequence lengths, e.g. 40320, I get OOMs at the line `scores = (query @ key.transpose(-2, -1)).to(dtype=working_precision)`, where it tries to allocate a 40320^2 matrix of floats (48 GB) all at once. Would it be possible to batch this score calculation to reduce the maximum memory usage, the way the memory-efficient attention algorithm does? Thanks
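For illustration, here is a minimal sketch of the kind of tiling being asked for: the query is processed in blocks so that only a `(chunk_size, seq_len)` slice of the score matrix is live at a time, rather than the full `seq_len x seq_len` buffer. The function name `chunked_attention` and the `chunk_size` parameter are made up for this example, and it omits flex attention's `score_mod` / block-mask machinery entirely; it only conveys the memory-efficient-attention-style chunking.

```python
import torch

def chunked_attention(query, key, value, chunk_size=4096):
    # query, key, value: (batch, heads, seq_len, head_dim)
    # Process query rows in chunks so only a (chunk_size, seq_len) score
    # block is materialized at a time, instead of (seq_len, seq_len).
    scale = query.shape[-1] ** -0.5
    out_chunks = []
    for start in range(0, query.shape[-2], chunk_size):
        q_chunk = query[..., start:start + chunk_size, :]
        scores = (q_chunk @ key.transpose(-2, -1)) * scale   # (B, H, chunk, seq_len)
        probs = scores.softmax(dim=-1)
        out_chunks.append(probs @ value)                      # (B, H, chunk, head_dim)
    return torch.cat(out_chunks, dim=-2)
```

With `seq_len = 40320` and `chunk_size = 4096`, the largest score buffer per (batch, head) holds 4096 x 40320 floats instead of 40320 x 40320, at the cost of a Python-level loop; fused kernels such as memory-efficient attention / FlashAttention do the same tiling (also over keys, with an online softmax) inside the kernel.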

Alternatives

No response

Additional context

No response

### Tasks
