Question about efficient memory sharing (prefix sharing) #227
Comments
Thanks for bringing this up! Indeed, prefix sharing is an excellent scenario to save even more memory and compute. We evaluated this setting in our research paper. However, our current implementation of the PagedAttention kernel with query sequence length > 1 is buggy and slow, so we didn't include it in our original release. We plan to add this feature in the future.
+1 for this use case. This could be hugely impactful. Is this ticket the best way to track the status of this feature request?
Even with query sequence length 1, if we could mark all tokens from a prefix as persistent in the cache, it could speed up inference.
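To make the idea concrete, here is a minimal sketch of block-level prefix reuse. It is not vLLM's implementation: `ToyPrefixCache`, `BLOCK_SIZE`, and the token ids are all hypothetical, and the cache never evicts anything. Each full block is keyed by the entire token prefix that ends at that block, so two prompts share a block only when everything before it matches as well.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # hypothetical block size in tokens, chosen only for illustration


class ToyPrefixCache:
    """Toy model: maps full prompt blocks to physical block ids, with no eviction."""

    def __init__(self) -> None:
        self._blocks: Dict[Tuple[int, ...], int] = {}
        self._next_id = 0

    def allocate(self, token_ids: List[int]) -> Tuple[List[int], int]:
        """Return (block ids for the full blocks, number of prompt tokens reused)."""
        block_ids: List[int] = []
        reused_tokens = 0
        # Only full blocks are cached; a trailing partial block would stay private.
        full = len(token_ids) - len(token_ids) % BLOCK_SIZE
        for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE):
            # The key is the whole prefix up to this block, not just the block itself.
            # A real system would likely use a hash instead of the raw tuple.
            key = tuple(token_ids[:end])
            if key in self._blocks:
                reused_tokens += BLOCK_SIZE
            else:
                self._blocks[key] = self._next_id
                self._next_id += 1
            block_ids.append(self._blocks[key])
        return block_ids, reused_tokens


cache = ToyPrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]       # made-up ids for a shared system prompt
request_a = system_prompt + [10, 11, 12, 13, 14]
request_b = system_prompt + [20, 21, 22, 23]

blocks_a, reused_a = cache.allocate(request_a)
blocks_b, reused_b = cache.allocate(request_b)
print(blocks_a, reused_a)
print(blocks_b, reused_b)
```

In this toy model the second request maps its first two blocks to the same physical blocks as the first request, so only the tokens after the shared system prompt would need new KV entries.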
+1
I have implemented a simple version of prefix caching, and it shows a significant performance improvement in specific scenarios. Do you need this feature? If so, I can prepare a detailed design plan for your review. Many places in the code will need to change, so I will proceed with development only after your review. @zhuohan123 This is for a performance test:
For each request, the prefix length is 200, the input length is 30, and the output length is 50.
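As a rough back-of-the-envelope check of why this setup benefits from a prefix cache, the sketch below counts how many prompt tokens need their KV entries computed with and without sharing. The helper `prefill_tokens` and the request count of 100 are assumptions made for illustration; this is an idealized token count, not a measured result, and it ignores decode cost entirely.

```python
def prefill_tokens(prefix_len: int, input_len: int,
                   num_requests: int, share_prefix: bool) -> int:
    """Count prompt tokens whose KV entries must be computed (idealized)."""
    if share_prefix:
        # The shared prefix is computed once; each request adds only its own input.
        return prefix_len + num_requests * input_len
    # Without sharing, every request recomputes the prefix and its own input.
    return num_requests * (prefix_len + input_len)


PREFIX_LEN = 200    # from the test setup above
INPUT_LEN = 30      # from the test setup above
NUM_REQUESTS = 100  # hypothetical request count, not from the test

without_sharing = prefill_tokens(PREFIX_LEN, INPUT_LEN, NUM_REQUESTS, share_prefix=False)
with_sharing = prefill_tokens(PREFIX_LEN, INPUT_LEN, NUM_REQUESTS, share_prefix=True)
saved_fraction = 1 - with_sharing / without_sharing
print(without_sharing, with_sharing, round(saved_fraction, 3))
```

The 50 output tokens per request are still generated one at a time in either case, so the saving applies to the prefill phase only.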
Great work! I have a question: does the QPS column in your table refer to the number of concurrent requests? As I understand it, "Requests/s" should be the QPS. If I am wrong, please correct me. Thank you!
The first column, QPS, represents the number of requests per second. The "Requests/s" column can be understood as the throughput under the current QPS.
@sleepcoo Any way I could be helpful here? I am interested in working on this too.
I have a question about the efficient memory sharing feature. Do different sequences that share the same system prompt, but append different user input texts, share the computation and memory for that system prompt?
For example, here are two input sequences:
Would these two input sequences share the computation and memory for the common system prompt "<|system|>You are a kind robot. <|user|>"?