
Question about efficient memory sharing (prefix sharing) #227

Closed
xyfZzz opened this issue Jun 24, 2023 · 10 comments

Comments

@xyfZzz

xyfZzz commented Jun 24, 2023

I have a question about the efficient memory sharing feature. Do different sequences that share the same system prompt, but append different user-input texts, also share the computation and memory for that common system prompt?

For example, here are two input sequences:

  1. <|system|>You are a kind robot. <|user|>How's the weather today.
  2. <|system|>You are a kind robot. <|user|>Tell me a story.

Would these two input sequences share the computation and memory for the common system prompt "<|system|>You are a kind robot. <|user|>"?
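
For concreteness, here is a minimal sketch of the scenario I mean (the model name and sampling settings are placeholders, not part of the question):

```python
from vllm import LLM, SamplingParams

# Two prompts sharing the same system-prompt prefix; the question is whether
# the KV cache computed for that prefix is reused across both requests.
system = "<|system|>You are a kind robot. <|user|>"
prompts = [
    system + "How's the weather today.",
    system + "Tell me a story.",
]

llm = LLM(model="facebook/opt-125m")  # placeholder model
outputs = llm.generate(prompts, SamplingParams(temperature=0.8, max_tokens=64))
for out in outputs:
    print(out.outputs[0].text)
```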

zhuohan123 changed the title from "Question about efficient memory sharing" to "Question about efficient memory sharing (prefix sharing)" on Jun 25, 2023
@zhuohan123
Member

Thanks for bringing this up! Indeed, prefix sharing is an excellent scenario to save even more memory and compute. We evaluated this setting in our research paper. However, our current implementation of the PagedAttention kernel with query sequence length > 1 is buggy and slow, so we didn't include it in our original release. We plan to add this feature in the future.
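
To make the idea concrete, here is a toy sketch, not vLLM's actual kernel or block manager, of how a paged KV cache could map a shared prefix onto the same physical blocks for multiple sequences; all names are illustrative:

```python
BLOCK_SIZE = 16

class BlockAllocator:
    """Toy physical-block allocator with reference counting."""
    def __init__(self):
        self.next_block = 0
        self.ref_count = {}  # physical block id -> number of sequences using it

    def allocate(self):
        block = self.next_block
        self.next_block += 1
        self.ref_count[block] = 1
        return block

    def share(self, block):
        # Another sequence's block table references the same physical block.
        self.ref_count[block] += 1
        return block


def build_block_table(shared_prefix_blocks, prefix_len, total_len, allocator):
    """Per-sequence block table: reuse the prefix's physical blocks, then
    allocate fresh blocks for the sequence-specific suffix (block boundary
    handling is simplified here)."""
    table = [allocator.share(b) for b in shared_prefix_blocks]
    suffix_tokens = total_len - prefix_len
    for _ in range((suffix_tokens + BLOCK_SIZE - 1) // BLOCK_SIZE):
        table.append(allocator.allocate())
    return table


allocator = BlockAllocator()
# KV cache for a 200-token system prompt is computed once and stored here.
prefix_blocks = [allocator.allocate() for _ in range((200 + BLOCK_SIZE - 1) // BLOCK_SIZE)]

seq_a = build_block_table(prefix_blocks, prefix_len=200, total_len=230, allocator=allocator)
seq_b = build_block_table(prefix_blocks, prefix_len=200, total_len=230, allocator=allocator)

# Both sequences point at the same physical blocks for the shared prefix.
assert seq_a[:len(prefix_blocks)] == seq_b[:len(prefix_blocks)]
```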

@physicsrob

+1 for this use case. This could be hugely impactful. Is this ticket the best way to track the status of this feature request?

@gleberof-ai

Even with query sequence length 1, if we could mark all tokens from known prefixes as persistent in the cache, it could bring some speedup to inference.
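
As a hypothetical sketch of what "persistent" could mean here (names are illustrative, not existing vLLM APIs):

```python
class PinnedPrefixCache:
    """Illustrative only: keep the KV-cache blocks of known prefixes resident
    so later requests with the same prefix skip recomputing them."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.pinned = set()          # blocks that eviction must skip
        self.prefix_to_blocks = {}   # prefix key (e.g. a hash) -> block ids

    def cache_prefix(self, prefix_key, num_blocks_needed):
        blocks = [self.free_blocks.pop() for _ in range(num_blocks_needed)]
        self.pinned.update(blocks)
        self.prefix_to_blocks[prefix_key] = blocks
        return blocks

    def lookup(self, prefix_key):
        # Cache hit: reuse the pinned blocks instead of re-running prefill.
        return self.prefix_to_blocks.get(prefix_key)

    def evictable(self, allocated_blocks):
        return [b for b in allocated_blocks if b not in self.pinned]
```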

@firebook
Contributor


+1

@sleepcoo

sleepcoo commented Dec 4, 2023

I have implemented a simple version of the prefix cache feature, and it shows a significant performance improvement in specific scenarios. Do you need this feature? If so, I can prepare a detailed design plan for your review. Many places in the code will need changes, so I will proceed with development after your review. @zhuohan123

Here are the results of a performance test:

  • Compared to the baseline, the prefix cache increases throughput by 29%. At high load (15 QPS), the time to first token decreases, and the average latency per request drops by more than 60%.
  • The gain from the prefix cache depends on both the prefix length and the input length.

For each request, the prefix length is 200, the input length is 30, and the output length is 50.

| Load (QPS) | Method       | Requests/s | Average Latency per Req | First Token Time |
|------------|--------------|------------|-------------------------|------------------|
| 10 QPS     | Prefix Cache | 9.83       | 1.97 s                  | 0.29 s           |
| 10 QPS     | Base         | 9.80       | 2.87 s                  | 0.45 s           |
| 15 QPS     | Prefix Cache | 14.30      | 2.98 s                  | 0.39 s           |
| 15 QPS     | Base         | 13.24      | 8.65 s                  | 1.02 s           |
| 25 QPS     | Prefix Cache | 19.81      | 6.46 s                  | 0.84 s           |
| 25 QPS     | Base         | 14.08      | 13.67 s                 | 4.74 s           |
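
For context, here is a rough sketch of the open-loop load test shape behind numbers like these; `send_request` is a stand-in for the real streaming client, and the sleep-based timings are placeholders:

```python
import asyncio
import time

async def send_request(prompt):
    # Placeholder stand-in for the real streaming client: it just simulates
    # 50 output tokens arriving every 10 ms.
    for _ in range(50):
        await asyncio.sleep(0.01)
        yield "token"

async def one_request(prompt, results):
    start = time.perf_counter()
    first_token = None
    async for _ in send_request(prompt):
        if first_token is None:
            first_token = time.perf_counter() - start  # time to first token
    results.append((time.perf_counter() - start, first_token))

async def run(qps, num_requests, prompt):
    results, tasks = [], []
    for _ in range(num_requests):
        tasks.append(asyncio.create_task(one_request(prompt, results)))
        await asyncio.sleep(1.0 / qps)  # open-loop arrivals at the target QPS
    await asyncio.gather(*tasks)
    latencies = [total for total, _ in results]
    ttfts = [ttft for _, ttft in results]
    print(f"avg latency {sum(latencies) / len(latencies):.2f} s, "
          f"avg first-token time {sum(ttfts) / len(ttfts):.2f} s")

asyncio.run(run(qps=10, num_requests=100, prompt="<200-token prefix> <30-token input>"))
```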

@xyfZzz
Author

xyfZzz commented Dec 4, 2023


Great work! I have a question: does the QPS in your table refer to the number of concurrent requests? In my understanding, "Requests/s" should be the QPS. If I am wrong, please correct me, thank you!

@sleepcoo

sleepcoo commented Dec 4, 2023


The first column, QPS, is the offered load, i.e. the number of requests sent per second; the "Requests/s" column is the throughput actually achieved under that load.

@jadielam

jadielam commented Jan 4, 2024

@sleepcoo Any way I could be helpful here? I am interested in working on this too.

@sleepcoo

sleepcoo commented Jan 4, 2024


You can try the implementation in #1669; it's quite comprehensive. I've given up on my own implementation 😞 @jadielam

@jadielam

jadielam commented Jan 4, 2024


Thanks for the pointer. This will save me some time.

hmellor closed this as completed on Apr 3, 2024