[EPLB] Reduce EPLB Inference Overhead #24573
Conversation
Signed-off-by: Bowen Wang <abmfy@icloud.com>
Code Review
This pull request significantly improves the performance and maintainability of the Expert Parallelism Load Balancer (EPLB) by replacing the slow and non-compilable `torch.rand` with a deterministic modulo-based replica selection. The refactoring of the EPLB logic into a separate, `torch.compile`-friendly function `eplb_map_to_physical_and_record` is a great change that enhances code clarity. I've found one critical issue that could lead to a runtime error, which I've detailed in a specific comment.
Although it's safe to pass in `dtype=None`. Makes Gemini happy. Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Bowen Wang <abmfy@icloud.com>
LGTM - please fix the pre-commit
Signed-off-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: Bowen Wang <abmfy@icloud.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Signed-off-by: charlifu <charlifu@amd.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
Purpose
PR #18343 introduced the Expert Parallelism Load Balancer (EPLB). By replicating a single logical expert into multiple physical experts, we can achieve better load balancing across experts.
However, this replication introduces some inference-time overhead: after the MoE routing module, we must select among multiple replicas of the same logical expert and also record expert load metrics for the rearrangement algorithm.
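To make the replication concrete, here is a minimal sketch of the kind of mapping tables involved; the names `logical_to_physical` and `replica_count` are illustrative assumptions, not the exact vLLM structures:

```python
import torch

# Hypothetical example: 4 logical experts, with logical expert 2 replicated
# twice, giving 5 physical experts (i.e., one redundant expert).
# logical_to_physical[l, r] = physical id of replica r of logical expert l;
# -1 marks unused replica slots.
logical_to_physical = torch.tensor([
    [0, -1],
    [1, -1],
    [2,  4],  # logical expert 2 has two physical replicas: 2 and 4
    [3, -1],
])

# replica_count[l] = number of physical replicas of logical expert l.
replica_count = torch.tensor([1, 1, 2, 1])
```

After routing produces logical expert ids, each token picks one replica through tables like these, and the chosen physical expert's load is recorded for the next rearrangement.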
Previously, `torch.rand` was used to select expert replicas. Unfortunately, this method is slow and not `torch.compile`-friendly. In this PR, we aim to reduce EPLB overhead by:
1. Switching replica selection from `torch.rand` to a modulo-based pseudo-random selection over the k replicas of each logical expert (see the sketch below).
2. Extracting the EPLB logic in `select_experts` into a `torch.compile`-friendly function.
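As an illustration of the idea, a modulo-based selection with load recording might look like the following sketch; the function name, tensor layout, and the exact tie-breaking rule are assumptions, not the actual vLLM implementation:

```python
import torch


def map_to_physical_and_record(
    topk_ids: torch.Tensor,             # [num_tokens, top_k], logical expert ids
    expert_load: torch.Tensor,          # [num_physical_experts], load accumulator
    logical_to_physical: torch.Tensor,  # [num_logical, max_replicas], physical ids
    replica_count: torch.Tensor,        # [num_logical], replicas per logical expert
) -> torch.Tensor:
    # Deterministic "pseudo-random" replica choice: the flat position of each
    # (token, slot) pair modulo the replica count of the routed expert.
    # No RNG state is touched, so this traces cleanly under torch.compile.
    pos = torch.arange(topk_ids.numel(), device=topk_ids.device)
    replica_idx = pos.view(topk_ids.shape) % replica_count[topk_ids]

    # Gather the chosen physical expert for every (token, slot) pair.
    physical_ids = logical_to_physical[topk_ids, replica_idx]

    # Record how many tokens each physical expert received in this step,
    # feeding the periodic rearrangement algorithm.
    flat = physical_ids.flatten()
    expert_load.scatter_add_(0, flat, torch.ones_like(flat, dtype=expert_load.dtype))

    return physical_ids
```

Because every (token, slot) pair lands on a different flat position, consecutive tokens routed to the same logical expert cycle through its replicas, approximating a uniform split without generating random numbers.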
Test Plan
To isolate EPLB inference overhead, we test with EPLB enabled but with `num_redundant_experts=0`, and without rearranging experts. This ensures that any observed differences are solely due to replica selection and load-recording overhead.
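In this setup every logical expert has exactly one replica, so the modulo selection always picks replica 0 and the mapping degenerates to the identity; any measured slowdown is therefore pure bookkeeping cost. A tiny check of that claim, reusing the illustrative names from the sketch above:

```python
import torch

# With num_redundant_experts=0 the tables are trivial: each logical expert
# maps to exactly one physical expert (itself).
logical_to_physical = torch.arange(4).unsqueeze(1)  # [[0], [1], [2], [3]]
replica_count = torch.ones(4, dtype=torch.long)

topk_ids = torch.tensor([[2, 0], [1, 3]])
pos = torch.arange(topk_ids.numel()).view(topk_ids.shape)
replica_idx = pos % replica_count[topk_ids]          # always 0, since pos % 1 == 0
physical_ids = logical_to_physical[topk_ids, replica_idx]
assert torch.equal(physical_ids, topk_ids)           # identity mapping
```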
Test Result
We benchmarked 1000 random prompts, each with 1000 input tokens and 100 output tokens, on DeepSeek-V3-0324 in a DP16 setting. Prefix caching was disabled to measure the raw computational cost.
Benchmarked configurations:
- w/o EPLB
- w/ EPLB, main
- w/ EPLB, this PR
Summary:
Not accounting for the benefits of improved expert load balancing, EPLB on the main branch introduces a ~3.97% throughput drop. With this PR, we recover ~2.41% of throughput relative to main, narrowing the gap to ~1.66% compared to running without EPLB.
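Note that the percentages use different baselines, which is why 3.97 − 2.41 ≠ 1.66: the recovery appears to be measured relative to the main branch. Normalizing the w/o-EPLB throughput to 100 makes the numbers line up:

```python
baseline = 100.0          # w/o EPLB (normalized)
main = 100.0 - 3.97       # w/ EPLB, main branch -> 96.03
this_pr = 100.0 - 1.66    # w/ EPLB, this PR     -> 98.34

recovery = (this_pr - main) / main * 100
print(f"recovered {recovery:.2f}% relative to main")  # recovered 2.41% relative to main
```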