microsoft/MInference

[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, MInference computes attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
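To make "approximate, dynamic sparse attention" concrete, here is a minimal toy sketch of the general idea, not MInference's actual algorithm or kernels: a cheap block-level estimate picks which key blocks matter for each query block, and exact attention is computed only over those blocks. All names (`dynamic_block_sparse_attention`, `block`, `keep`) are hypothetical, and intra-block causal masking is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_block_sparse_attention(q, k, v, block=64, keep=4):
    """Toy dynamic block-sparse attention for one head (illustrative only).

    q, k, v: (seq_len, head_dim) arrays; seq_len must be a multiple of `block`.
    For each query block, a cheap estimate (scores between mean-pooled blocks)
    selects the `keep` most relevant key blocks, and exact attention runs
    only over that subset instead of the full sequence.
    """
    n, d = q.shape
    nb = n // block
    qb = q.reshape(nb, block, d)
    kb = k.reshape(nb, block, d)
    vb = v.reshape(nb, block, d)

    # Cheap (nb, nb) block-level relevance estimate from mean-pooled q/k.
    est = qb.mean(axis=1) @ kb.mean(axis=1).T / np.sqrt(d)

    out = np.empty_like(q).reshape(nb, block, d)
    for i in range(nb):
        # Block-causal mask, then keep the top-`keep` visible key blocks
        # (intra-block causal masking within the diagonal block is omitted).
        visible = est[i, : i + 1]
        top = np.argsort(visible)[-keep:]
        ks = kb[top].reshape(-1, d)            # gathered keys
        vs = vb[top].reshape(-1, d)            # gathered values
        scores = qb[i] @ ks.T / np.sqrt(d)     # exact attention on the subset
        out[i] = softmax(scores) @ vs
    return out.reshape(n, d)
```

With, say, a 64K-token prompt and `keep=4` blocks per query block, each query block attends to a small fixed number of key blocks rather than the whole sequence, which is the source of the pre-filling speedup the description claims; the real system's sparsity patterns and fused GPU kernels are considerably more sophisticated.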