[Perf] Optimize group_topk kernel, 1.9% Throughput improvement, 2.1% TPOT improvement
#30159
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Code Review
This pull request introduces performance optimizations to the group_topk kernel by leveraging C++ templates for compile-time specialization based on the scoring function, renormalization, and group size. These changes appear to correctly implement the intended optimizations and should yield the performance improvements described. My review focuses on several instances of significant code duplication that have been introduced. While the optimizations are valuable, the duplicated code harms maintainability and increases the risk of future bugs. I've provided suggestions to refactor these sections to be more DRY (Don't Repeat Yourself) while retaining the performance benefits.
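As a rough illustration of the compile-time specialization described in this review, the sketch below turns runtime flags into template arguments once at the call site, so the per-element code contains no runtime branches on them. It is a host-side C++ toy, not the PR's kernel: `ScoringFunc`, `score_and_scale`, and `dispatch_score` are hypothetical names, and the group-size parameter is left out for brevity.

```cpp
#include <cmath>

// Assumed enum; only the two enumerator names appear in the PR excerpts.
enum ScoringFunc { SCORING_NONE = 0, SCORING_SIGMOID = 1 };

// Hypothetical stand-in for the per-element work. Both flags are template
// parameters, so each instantiation keeps only the branch it needs.
template <ScoringFunc SF, bool Renormalize>
inline float score_and_scale(float val, float topk_sum,
                             float routed_scaling_factor) {
  float scored = val;
  if constexpr (SF == SCORING_SIGMOID) {
    scored = 1.0f / (1.0f + std::exp(-val));
  }
  if constexpr (Renormalize) {
    return scored / topk_sum * routed_scaling_factor;
  } else {
    return scored * routed_scaling_factor;
  }
}

// Runtime-to-compile-time dispatch: the flags are inspected once here,
// then a fully specialized instantiation is called.
inline float dispatch_score(int scoring_func, bool renormalize, float val,
                            float topk_sum, float routed_scaling_factor) {
  if (scoring_func == SCORING_SIGMOID) {
    return renormalize
               ? score_and_scale<SCORING_SIGMOID, true>(val, topk_sum,
                                                        routed_scaling_factor)
               : score_and_scale<SCORING_SIGMOID, false>(val, topk_sum,
                                                         routed_scaling_factor);
  }
  return renormalize
             ? score_and_scale<SCORING_NONE, true>(val, topk_sum,
                                                   routed_scaling_factor)
             : score_and_scale<SCORING_NONE, false>(val, topk_sum,
                                                    routed_scaling_factor);
}
```

The dispatcher is also where the duplication concern shows up: every extra template parameter multiplies the number of call-site branches, which is why a shared dispatch helper or macro tends to be worth having.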
mgoin left a comment
Nice work! Just some optional nits
    if constexpr (SF == SCORING_SIGMOID) {
      return apply_sigmoid(val);
    } else {
      return val;
Nit: add a static assert for SCORING_NONE for when we add more options
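One way the suggested static assert could look is sketched below. This is a host-side C++ approximation under assumptions: the `ScoringFunc` enum type and the function name `apply_scoring` are hypothetical, and `apply_sigmoid` is a plain stand-in for the device helper used in the excerpt.

```cpp
#include <cmath>

// Assumed enum; only the enumerator names come from the PR excerpt.
enum ScoringFunc { SCORING_NONE = 0, SCORING_SIGMOID = 1 };

// Stand-in for the kernel's sigmoid helper.
inline float apply_sigmoid(float val) { return 1.0f / (1.0f + std::exp(-val)); }

template <ScoringFunc SF>
inline float apply_scoring(float val) {
  if constexpr (SF == SCORING_SIGMOID) {
    return apply_sigmoid(val);
  } else {
    // The condition depends on SF, so it is only checked when this branch
    // is instantiated. A future enumerator that falls through to here
    // fails to compile instead of silently returning the raw value.
    static_assert(SF == SCORING_NONE, "unhandled scoring function");
    return val;
  }
}
```

Because the assert sits inside the discarded branch for `SCORING_SIGMOID`, it adds no runtime cost and does not affect the existing instantiations.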
      value = cuda_cast<float, T>(s_topk_value[i]) * routed_scaling_factor;
    }
    float base = cuda_cast<float, T>(s_topk_value[i]);
    float value = renormalize ? (base / topk_sum * routed_scaling_factor)
Nit: could we pull out this division?
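A minimal sketch of what pulling the division out might mean: `topk_sum` and `routed_scaling_factor` do not change across the loop, so a single scale can be computed up front and each element needs only one multiply. The function `scale_topk` and its `std::vector` interface are hypothetical host-side stand-ins for the shared-memory loop in the kernel.

```cpp
#include <vector>

// Hypothetical host-side version of the final scaling loop.
inline std::vector<float> scale_topk(const std::vector<float>& topk_values,
                                     float topk_sum,
                                     float routed_scaling_factor,
                                     bool renormalize) {
  // Loop-invariant factor: one division in total instead of one per element.
  const float scale = renormalize ? routed_scaling_factor / topk_sum
                                  : routed_scaling_factor;
  std::vector<float> out;
  out.reserve(topk_values.size());
  for (float base : topk_values) {
    out.push_back(base * scale);
  }
  return out;
}
```

Note that `base * (routed_scaling_factor / topk_sum)` can differ from `base / topk_sum * routed_scaling_factor` in the last bit of floating-point rounding, so an accuracy check is worth rerunning after a change like this.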
…% TPOT improvemnt (vllm-project#30159) Signed-off-by: yewentao256 <zhyanwentao@126.com>
…% TPOT improvemnt (vllm-project#30159) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: mayoohee <yiweiii.fang@gmail.com>
Purpose
We are optimizing the GLM-4.6 model. The group_topk kernel accounts for a large share of the runtime, so we reduce its cost first.
The same optimization can also benefit other models, such as V3.2.
Optimize the kernel, mainly:
Test
export MODEL="zai-org/GLM-4.6-FP8"
Acc
Perf
vllm bench serve --model $MODEL --dataset-name random --host 127.0.0.1 --port 9256 --random-input-len 2 --random-output-len 128 --request-rate inf --num-prompts 1024

Now
============ Serving Benchmark Result ============
Successful requests:                     1024
Failed requests:                         0
Benchmark duration (s):                  21.81
Total input tokens:                      2048
Total generated tokens:                  131072
Request throughput (req/s):              46.95
Output token throughput (tok/s):         6009.90
Peak output token throughput (tok/s):    7157.00
Peak concurrent requests:                1024.00
Total Token throughput (tok/s):          6103.80
---------------Time to First Token----------------
Mean TTFT (ms):                          969.39
Median TTFT (ms):                        1037.20
P99 TTFT (ms):                           1195.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          162.53
Median TPOT (ms):                        162.71
P99 TPOT (ms):                           162.94
---------------Inter-token Latency----------------
Mean ITL (ms):                           162.54
Median ITL (ms):                         161.80
P99 ITL (ms):                            188.35
==================================================

Main
============ Serving Benchmark Result ============
Successful requests:                     1024
Failed requests:                         0
Benchmark duration (s):                  22.24
Total input tokens:                      2048
Total generated tokens:                  131072
Request throughput (req/s):              46.05
Output token throughput (tok/s):         5894.52
Peak output token throughput (tok/s):    6715.00
Peak concurrent requests:                1024.00
Total Token throughput (tok/s):          5986.63
---------------Time to First Token----------------
Mean TTFT (ms):                          966.52
Median TTFT (ms):                        1066.92
P99 TTFT (ms):                           1080.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          166.03
Median TPOT (ms):                        166.09
P99 TPOT (ms):                           166.41
---------------Inter-token Latency----------------
Mean ITL (ms):                           166.05
Median ITL (ms):                         164.11
P99 ITL (ms):                            206.18
==================================================