Question about efficient memory sharing (prefix sharing) #227

xyfZzz · 2023-06-24T07:29:31Z

I have a question about the efficient memory sharing feature. Do different sequences that share the same system prompt, but append different user inputs, share the computation and memory for that system prompt?

For example, here are two input sequences:

<|system|>You are a kind robot. <|user|>How's the weather today.
<|system|>You are a kind robot. <|user|>Tell me a story.

Would these two input sequences share the computation and memory for the common prefix "<|system|>You are a kind robot. <|user|>"?
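To make the question concrete, here is a minimal sketch (not vLLM code) that measures how much of the two example prompts is identical and how many fixed-size cache blocks that shared part would fill. The character-level `tokenize` function and `BLOCK_SIZE = 16` are assumptions for illustration only; a real tokenizer and block size would give different counts.

```python
# Illustrative only: a character-level stand-in tokenizer and a
# hypothetical block size, not the values any real engine uses.
BLOCK_SIZE = 16


def tokenize(text):
    """Stand-in tokenizer: one token per character."""
    return list(text)


def shared_prefix_len(a, b):
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


seq_a = tokenize("<|system|>You are a kind robot. <|user|>How's the weather today.")
seq_b = tokenize("<|system|>You are a kind robot. <|user|>Tell me a story.")

shared = shared_prefix_len(seq_a, seq_b)
full_blocks = shared // BLOCK_SIZE   # blocks made up entirely of shared tokens
leftover = shared % BLOCK_SIZE       # shared tokens in a partially filled block

print(f"shared prefix tokens: {shared}")
print(f"fully shareable blocks: {full_blocks}, leftover tokens: {leftover}")
```

Only blocks that consist entirely of shared tokens could be reused as-is in this sketch; the partially filled block also holds tokens that differ between the two requests.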

zhuohan123 · 2023-06-25T17:53:11Z

Thanks for bringing this up! Indeed, prefix sharing is an excellent scenario to save even more memory and compute. We evaluated this setting in our research paper. However, our current implementation of the PagedAttention kernel with query sequence length > 1 is buggy and slow, so we didn't include it in our original release. We plan to add this feature in the future.

physicsrob · 2023-08-09T20:38:08Z

+1 for this use case. This could be hugely impactful. Is this ticket the best way to track the status of this feature request?

gleberof-ai · 2023-09-22T17:38:58Z

Even with query sequence length 1, if we could mark all prefix tokens as persistent in the cache, it could speed up inference.
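A toy sketch of that idea, under stated assumptions: `ToyPrefixCache` is a hypothetical class, not part of vLLM, and the stored values are placeholder strings rather than real key/value tensors. Each full block is keyed by the entire token prefix up to the end of that block, because the cached state for a block depends on every token before it. Entries are never evicted, which is what "persistent" means here.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPrefixCache:
    """Hypothetical persistent cache of per-block state, keyed by token prefix."""

    block_size: int = 4
    blocks: dict = field(default_factory=dict)

    def prefill(self, tokens):
        """Return (reused, computed) counts of full blocks for this request."""
        reused = computed = 0
        n_full = len(tokens) // self.block_size
        for i in range(n_full):
            # The key covers all tokens up to the end of block i.
            key = tuple(tokens[: (i + 1) * self.block_size])
            if key in self.blocks:
                reused += 1
            else:
                self.blocks[key] = f"kv-placeholder-block-{i}"
                computed += 1
        return reused, computed


cache = ToyPrefixCache(block_size=4)
prefix = list(range(8))                      # 8 shared "system prompt" tokens
first = cache.prefill(prefix + [100, 101, 102, 103])
second = cache.prefill(prefix + [200, 201, 202, 203])

print("first request (reused, computed):", first)
print("second request (reused, computed):", second)
```

In this sketch the second request reuses the two prefix blocks and computes only its own final block. A real implementation would also need an eviction policy and reference counting, which are omitted here.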

firebook · 2023-11-20T02:12:15Z

+1

sleepcoo · 2023-12-04T07:38:54Z

I have implemented a simple version of prefix caching, which shows a significant performance improvement in specific scenarios. Are you interested in this feature? If so, I can prepare a detailed design plan for your review. Many places in the code would need to change, so I will proceed with development only after your review. @zhuohan123

Results of a performance test:

Compared to the baseline, the prefix cache increases throughput by 29%. At high load (15 QPS), the time to first token decreases and the average latency per request decreases by more than 60%.
The performance gain from the prefix cache depends on both the prefix length and the input length.

For each request, the prefix length is 200, the input length is 30, and the output length is 50.

| Load (QPS) | Method | Requests/s | Average Latency per Req | First Token Time |
| --- | --- | --- | --- | --- |
| 10 QPS | Prefix Cache | 9.83 requests/s | 1.97 s | 0.29 s |
| 10 QPS | Base | 9.80 requests/s | 2.87 s | 0.45 s |
| 15 QPS | Prefix Cache | 14.30 requests/s | 2.98 s | 0.39 s |
| 15 QPS | Base | 13.24 requests/s | 8.65 s | 1.02 s |
| 25 QPS | Prefix Cache | 19.81 requests/s | 6.46 s | 0.84 s |
| 25 QPS | Base | 14.08 requests/s | 13.67 s | 4.74 s |
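For readers who want to check the relative changes per load level, the following sketch recomputes them from the rows of the table above. The numbers are copied from the table; the dictionary layout and function names are only illustrative.

```python
# Rows copied from the table above:
# load (QPS) -> method -> (requests/s, average latency in s, first token time in s)
RESULTS = {
    10: {"prefix": (9.83, 1.97, 0.29), "base": (9.80, 2.87, 0.45)},
    15: {"prefix": (14.30, 2.98, 0.39), "base": (13.24, 8.65, 1.02)},
    25: {"prefix": (19.81, 6.46, 0.84), "base": (14.08, 13.67, 4.74)},
}


def pct_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


def summarize(results):
    """Per-load throughput gain and latency reductions versus the baseline."""
    summary = {}
    for load, row in results.items():
        prefix, base = row["prefix"], row["base"]
        summary[load] = {
            "throughput_gain_pct": pct_change(prefix[0], base[0]),
            "latency_reduction_pct": -pct_change(prefix[1], base[1]),
            "first_token_reduction_pct": -pct_change(prefix[2], base[2]),
        }
    return summary


summary = summarize(RESULTS)
for load, stats in summary.items():
    line = ", ".join(f"{name}={value:.1f}" for name, value in stats.items())
    print(f"{load} QPS: {line}")
```

The gains vary considerably with load, so any single headline figure depends on which row it refers to.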

xyfZzz · 2023-12-04T08:29:50Z

Great work! I have a question: does the QPS in your table refer to the number of concurrent requests? In my understanding, "Requests/s" should be the QPS. Please correct me if I am wrong. Thank you!

sleepcoo · 2023-12-04T08:34:41Z

The first column, QPS, is the offered load: the number of requests sent per second. The 'Requests/s' column is the throughput achieved under that load.

jadielam · 2024-01-04T00:02:47Z

@sleepcoo Any way I could be helpful here? I am interested in working on this too.

sleepcoo · 2024-01-04T06:29:25Z

You can try the implementation in #1669; it's quite comprehensive. I've given up on my own implementation 😞 @jadielam

jadielam · 2024-01-04T12:07:40Z

Thanks for the pointer. This will save me some time.

zhuohan123 mentioned this issue Jun 25, 2023

[Roadmap] vLLM Development Roadmap: H2 2023 #244

Closed

76 tasks

zhuohan123 added the feature request label Jun 25, 2023

zhuohan123 changed the title ~~Question about efficient memory sharing~~ Question about efficient memory sharing (prefix sharing) Jun 25, 2023

sleepcoo mentioned this issue Dec 8, 2023

[WIP] Prefix prompt cache #1983

Closed

hmellor closed this as completed Apr 3, 2024

mht-sharma pushed a commit to mht-sharma/vllm that referenced this issue Oct 30, 2024

customPA write fp8 small ctx fix; enable customPA write fp8 by default (vllm-project#227) Co-authored-by: Joe Shajrawi <17753158+shajrawi@users.noreply.github.com>

968345a
