
large memory usage #23

Open · LzhinFdu opened this issue Mar 19, 2024 · 5 comments

LzhinFdu commented Mar 19, 2024

[image: screenshot of memory-usage test results]
Thanks for sharing this excellent implementation of ring attention.
Here are my test results on 2×A100 (with NVLink). Judging from the results, the memory usage of ring attention (ring_flash_attn_qkvpacked_func) seems to be very large, which is not what I expected. Are there any possible problems?

GeneZC commented Mar 22, 2024

The flash attention results are surprisingly good... I'll keep an eye on this.

LzhinFdu (Author)

> The flash attention results are surprisingly good... I'll keep an eye on this.

Thanks, I'll check it.

GeneZC commented Mar 24, 2024

After reading the code, I found that ring attention expects already-chunked qkv rather than the whole qkv. That is, qkv should be split into its local chunk before being fed into ring attention. This might be the difference.

I'm not entirely certain this is true, though.
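
To make this concrete, here is a minimal sketch (not the repo's own test script) of sharding a packed qkv tensor along the sequence dimension before calling ring_flash_attn_qkvpacked_func. The packed layout (batch, seqlen, 3, nheads, headdim) and the exact call signature are assumptions based on the flash-attn convention, so please check them against the repo:

```python
# Minimal sketch: shard packed qkv along the sequence dimension per rank
# before calling ring attention. Layout and signature are assumptions.
import torch
import torch.distributed as dist
from ring_flash_attn import ring_flash_attn_qkvpacked_func

dist.init_process_group("nccl")
rank, world_size = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)
device = torch.device(f"cuda:{rank}")

batch, total_seqlen, nheads, headdim = 1, 8192, 32, 128
# In a real run the full qkv comes from the model; random data is used here.
qkv = torch.randn(batch, total_seqlen, 3, nheads, headdim,
                  dtype=torch.bfloat16, device=device)

# Each rank keeps only its 1/world_size slice of the sequence, so activation
# memory per GPU scales with total_seqlen / world_size, not total_seqlen.
local_qkv = qkv.chunk(world_size, dim=1)[rank].contiguous().requires_grad_()

out = ring_flash_attn_qkvpacked_func(local_qkv, causal=True)
```

Feeding the full-length qkv to every rank instead would make each GPU hold the whole sequence, which would explain the unexpectedly large memory usage in the benchmark above.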

jzhang38 commented Apr 6, 2024

@LzhinFdu @GeneZC Yeah, you need to shard the sequence yourself before feeding it into ring-flash-attention.
I have an implementation here: https://github.com/jzhang38/EasyContext

LzhinFdu (Author) commented Apr 6, 2024

> After reading the code, I found that ring attention expects already-chunked qkv rather than the whole qkv. That is, qkv should be split into its local chunk before being fed into ring attention. This might be the difference.
>
> I'm not entirely certain this is true, though.

That's right. So for a fair memory comparison, the context length used for Flash Attention should be doubled to match the total context covered by ring attention across the two GPUs. Even then, Flash Attention still maintains a significant lead in memory usage.
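
For reference, a small sketch of that fair-comparison arithmetic; the numbers below are illustrative, not taken from the benchmark above:

```python
# Illustrative only: how the baseline sequence length should be chosen.
world_size = 2          # GPUs in the ring
local_seqlen = 4096     # chunk length fed to ring attention on each GPU

# Ring attention over world_size chunks covers this total context:
ring_total_context = world_size * local_seqlen   # 8192 tokens

# So the single-GPU flash-attn baseline should run at the same total context,
# i.e. double the per-GPU chunk length when world_size == 2.
flash_attn_seqlen = ring_total_context           # 8192 tokens
```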
