[Core/DBO][2/N] Dual-Batch Overlap add DeepEP High Throughput support and Prefill support #24845
Signed-off-by: Sage Moore <sage@neuralmagic.com>
…inson/attn-slicing
Signed-off-by: Sage Moore <sage@neuralmagic.com>
…pping asserts Signed-off-by: Sage Moore <sage@neuralmagic.com>
…sult in an empty second ubatch Signed-off-by: Sage Moore <sage@neuralmagic.com>
…inson/attn-slicing
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Sage Moore <sage@neuralmagic.com>
…inson/attn-slicing
Signed-off-by: Sage Moore <sage@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Thanks for the work! A few more thoughts
allow_microbatching_options = [True, False] if \
    capture_ubatched_graph else [False]
Could we simply use two bools?
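For reference, a minimal sketch of what the expression above evaluates to. The function name `microbatching_options` is hypothetical and exists only for illustration; it is not part of the PR.

```python
def microbatching_options(capture_ubatched_graph: bool) -> list[bool]:
    """Same expression as in the diff, without the backslash continuation."""
    return [True, False] if capture_ubatched_graph else [False]


# Both settings are tried only when ubatched graphs are being captured.
for capture in (True, False):
    print(capture, microbatching_options(capture))
```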
force_attention: bool = False,
uniform_decode: bool = False,
allow_microbatching: bool = False,
allow_microbatching: bool = True,
In this case, would it be better to rename the param to something like microbatching_fallback? (To me, allow_microbatching doesn't convey the idea you mention.)
Or we could add detailed comments.
    pass

def max_sms_used(self) -> Optional[int]:
    return None  # None means it could use the whole GPU
Would -1 be better?
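To make the trade-off concrete, here is a small sketch of how a caller might consume either convention. Both helper functions and the SM counts are hypothetical illustrations, not code from the PR.

```python
from typing import Optional


def resolve_max_sms(max_sms_used: Optional[int], total_sms: int) -> int:
    """Optional[int] convention: None means the whole GPU may be used."""
    if max_sms_used is None:
        return total_sms
    return min(max_sms_used, total_sms)


def resolve_max_sms_sentinel(max_sms_used: int, total_sms: int) -> int:
    """-1 convention: any negative value means the whole GPU may be used."""
    if max_sms_used < 0:
        return total_sms
    return min(max_sms_used, total_sms)


print(resolve_max_sms(None, 132))
print(resolve_max_sms(20, 132))
print(resolve_max_sms_sentinel(-1, 132))
```

One difference worth weighing: if a caller forgets the check, arithmetic on `None` raises a `TypeError`, whereas arithmetic on `-1` silently yields a wrong number.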
parallel_group.add_argument(
    "--dbo-prefill-token-threshold",
    **parallel_kwargs["dbo_prefill_token_threshold"])
Please add comments for the new arg
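A standalone sketch of what a documented argument could look like, using plain `argparse` rather than the project's `parallel_kwargs` machinery. The help wording, the type, and the default value are assumptions made for illustration; only the flag name comes from the diff.

```python
import argparse

parser = argparse.ArgumentParser()
parallel_group = parser.add_argument_group("parallel")
parallel_group.add_argument(
    "--dbo-prefill-token-threshold",
    type=int,
    default=512,  # placeholder value, not the project's default
    help="(Assumed wording) Token-count threshold used to decide whether "
    "dual-batch overlap is applied to a batch containing prefills.",
)

# argparse maps the dashes in the flag name to underscores.
args = parser.parse_args(["--dbo-prefill-token-threshold", "1024"])
print(args.dbo_prefill_token_threshold)
```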
if hook is not None:
    if dbo_enabled():
        # If DBO is being used, register the hook with the ubatch
        # context and call it in dbo_maybe_run_recv_hook instead of
        # passing it to the receiver.
        dbo_register_recv_hook(hook)
        dbo_yield()
    else:
        hook()

receiver()
Why do we have this logic twice?
Factor it out into a function, since it appears twice?
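A rough sketch of what such a helper could look like. The helper name `run_hook_then_receive` is hypothetical, and `dbo_enabled`, `dbo_register_recv_hook`, and `dbo_yield` are replaced by local stubs so the example runs on its own; returning the receiver's result is also an assumption.

```python
from typing import Any, Callable, Optional

# Stubs standing in for the real DBO helpers named in the diff.
state = {"dbo": False}
registered_hooks: list = []
events: list = []


def dbo_enabled() -> bool:
    return state["dbo"]


def dbo_register_recv_hook(hook: Callable[[], None]) -> None:
    registered_hooks.append(hook)


def dbo_yield() -> None:
    events.append("yield")


def run_hook_then_receive(hook: Optional[Callable[[], None]],
                          receiver: Callable[[], Any]) -> Any:
    """Hypothetical helper wrapping the logic that appears twice."""
    if hook is not None:
        if dbo_enabled():
            # Defer the hook to the ubatch context instead of running it here.
            dbo_register_recv_hook(hook)
            dbo_yield()
        else:
            hook()
    return receiver()
```

Each call site would then reduce to a single call such as `run_hook_then_receive(hook, receiver)`.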
        dbo_yield()
    else:
        hook()
Here once again
LGTM
vllm/model_executor/layers/fused_moe/deepep_ht_prepare_finalize.py
if hook is not None:
    if dbo_enabled():
        # If DBO is being used, register the hook with the ubatch
        # context and call it in dbo_maybe_run_recv_hook instead of
        # passing it to the receiver.
        dbo_register_recv_hook(hook)
        dbo_yield()
    else:
        hook()

receiver()
Factor it out into a function, since it appears twice?
…e.py Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
This pull request has merge conflicts that must be resolved before it can be
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@LucasWilkinson this commit introduces weird behaviour: the first HTTP request with a larger context works normally, but subsequent requests are significantly slower. I have verified that the culprit is commit cc1dc7e, which should be this PR; a903669 works normally.
… and Prefill support (vllm-project#24845) Signed-off-by: Sage Moore <sage@neuralmagic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Sage Moore <sage@neuralmagic.com> Co-authored-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
… and Prefill support (#24845) Signed-off-by: Sage Moore <sage@neuralmagic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Sage Moore <sage@neuralmagic.com> Co-authored-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
if not should_ubatch:
    num_pad, num_tokens_across_dp = self.get_dp_padding(num_tokens)
    num_tokens += num_pad
Removing this means the padding no longer happens.
Purpose
Test Plan
lm_eval
Test Result
export VLLM_ALL2ALL_BACKEND=deepep_high_throughput
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
HT Overlap Trace (2x8xH100)