[Core] Use KVCacheBlock as much as possible instead of dict[block_id, KVCacheBlock] #24830
Can you explain the metric in your figure and table? And what is the e2e speedup?
Currently away from keyboard; thanks for your reviews. I will address them later in the day. The metrics are distributions of GC elapsed time. I will also provide more data on the e2e speedup.
c21f4b4 to 04f3b45
Done. Added more metric descriptions in the summary and also added e2e speedup measurements.
Resolve #24321
LGTM. Only a small nit.
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
225100b to 018644e
LGTM! Thanks very much.
… KVCacheBlock] (vllm-project#24830) Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
… KVCacheBlock] (#24830) Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
Purpose
dict[block_id, KVCacheBlock] instances are currently the top GC objects. However, most of the time each BlockHashWithGroupId simply maps to a single KVCacheBlock. So we replace dict[block_id, KVCacheBlock] with Union[KVCacheBlock, dict[block_id, KVCacheBlock]] in the block cache, and store a bare KVCacheBlock whenever possible to reduce the GC overhead.
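A minimal sketch of the idea described above, not the actual vLLM code: `Block`, `BlockCacheSketch`, and the method names `insert`, `get_one`, and `pop` are hypothetical stand-ins, and a plain string stands in for BlockHashWithGroupId. It only illustrates how a cache value can stay a bare block in the common single-block case and fall back to a dict when several blocks share a hash.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Block:
    """Hypothetical stand-in for KVCacheBlock; only block_id matters here."""
    block_id: int


class BlockCacheSketch:
    """Maps a hash key to one Block, or to a dict only when 2+ blocks share it."""

    def __init__(self) -> None:
        self._cache: dict[str, Union[Block, dict[int, Block]]] = {}

    def insert(self, key: str, block: Block) -> None:
        entry = self._cache.get(key)
        if entry is None:
            # Common case: store the block directly, no inner dict allocated.
            self._cache[key] = block
        elif isinstance(entry, Block):
            if entry.block_id == block.block_id:
                self._cache[key] = block
            else:
                # Second distinct block for this key: upgrade to a dict.
                self._cache[key] = {entry.block_id: entry, block.block_id: block}
        else:
            entry[block.block_id] = block

    def get_one(self, key: str) -> Optional[Block]:
        """Return any one cached block for the key, or None."""
        entry = self._cache.get(key)
        if entry is None or isinstance(entry, Block):
            return entry
        return next(iter(entry.values()))

    def pop(self, key: str, block_id: int) -> Optional[Block]:
        """Remove and return the block with block_id under key, if present."""
        entry = self._cache.get(key)
        if entry is None:
            return None
        if isinstance(entry, Block):
            if entry.block_id != block_id:
                return None
            del self._cache[key]
            return entry
        block = entry.pop(block_id, None)
        if len(entry) == 1:
            # Collapse back to a bare block so the inner dict can be freed.
            self._cache[key] = next(iter(entry.values()))
        return block
```

The trade-off in this sketch is an `isinstance` check on every access in exchange for one fewer container object per cache entry in the common case.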
Test Plan & Test Result
With #24829 patched in locally for a breakdown analysis, we can see that the distribution of GC cost shifts left as expected.
E2E Metrics
Model: facebook/opt-125m
Prefill-heavy work: prefill 2000 decode 48
Decode-heavy work: prefill 48 decode 2000
max-concurrency:
|Request-Rate|Before|After|
|---|---|---|
|Prefill-heavy|212.98|217.40 (+2%)|
|Decode-heavy|13.89|14.10 (+1.5%)|
facebook/opt-125m decode heavy workload: prefill 48 decode 2000
GC cost reduced by 17% (per histogram of GC elapsed time)
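For readers who want to collect a similar distribution of GC elapsed time, the sketch below uses Python's standard `gc.callbacks` hook. It is an assumption about how such timings could be gathered, not the instrumentation from #24829; `GCTimer` is a hypothetical helper name, and the throwaway allocation is only there to give the collector something to do.

```python
import gc
import time


class GCTimer:
    """Records the wall-clock duration of each GC pass while active."""

    def __init__(self) -> None:
        self.elapsed_ms: list[float] = []
        self._start = None
        self._cb = None

    def _callback(self, phase: str, info: dict) -> None:
        # gc invokes callbacks with phase "start" before and "stop" after a pass.
        if phase == "start":
            self._start = time.perf_counter()
        elif phase == "stop" and self._start is not None:
            self.elapsed_ms.append((time.perf_counter() - self._start) * 1000.0)
            self._start = None

    def __enter__(self) -> "GCTimer":
        self._cb = self._callback
        gc.callbacks.append(self._cb)
        return self

    def __exit__(self, *exc) -> None:
        gc.callbacks.remove(self._cb)


with GCTimer() as timer:
    # Allocate many small containers, then force at least one full collection.
    garbage = [{i: [i]} for i in range(50_000)]
    gc.collect()

print(f"{len(timer.elapsed_ms)} GC passes, max {max(timer.elapsed_ms):.3f} ms")
```

The recorded `elapsed_ms` values can then be bucketed into a histogram and compared before and after a change.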
