From 85c438e0422af5b3025280b7cc9c4f8b8f0350cd Mon Sep 17 00:00:00 2001 From: Sage Moore Date: Wed, 17 Sep 2025 20:50:04 -0700 Subject: [PATCH 01/17] init Signed-off-by: Sage Moore --- docs/design/dbo.md | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) create mode 100644 docs/design/dbo.md diff --git a/docs/design/dbo.md b/docs/design/dbo.md new file mode 100644 index 000000000000..63adda9db4a9 --- /dev/null +++ b/docs/design/dbo.md @@ -0,0 +1,17 @@ +# Dual Batch Overlap + +## Introduction + +The Dual Batch Overlap system spans numerous files but the primary classes and functions live in the following files + +[gpu_ubatch_wrapper](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/gpu_ubatch_wrapper.py) +[ubatch_utils](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/ubatch_utils.py) +[ubatch_splitting](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/ubatch_splitting.py) + + +## Motivation + +## DBO Components + + + From 0621454da78ac4c80c1464c2f095e244fa664821 Mon Sep 17 00:00:00 2001 From: Sage Moore Date: Tue, 30 Sep 2025 08:24:27 -0700 Subject: [PATCH 02/17] init Signed-off-by: Sage Moore --- docs/design/dbo.md | 46 +++++++++++++++++++++++++++++++++++++++------- 1 file changed, 39 insertions(+), 7 deletions(-) diff --git a/docs/design/dbo.md b/docs/design/dbo.md index 63adda9db4a9..b11323810fd2 100644 --- a/docs/design/dbo.md +++ b/docs/design/dbo.md @@ -1,17 +1,49 @@ # Dual Batch Overlap - +## Motivation +The core motivation of the DBO system in vLLM is to overlap the sparse all-to-all communication in the MoE layer with the surrounding computation. This system currently only targets DP+EP deployments. ## Introduction +The Dual Batch Overlap system works by splitting the batch up in the model runner, creating two worker threads, and then running the model twice inside of each of these worker threads. 
When DBO is enabled, there are yield points within the `FusedMoEModularKernel` that allow the two worker threads to ping-pong between each other so that when one is running compute, the other is waiting on communication.
+
+The DBO system modifies the `GpuModelRunner` and `ModularKernel` along with adding two new sub-systems: `UBatchWrapper` and `UBatchContext`. `UBatchWrapper` is a wrapper around the model that is responsible for all of the thread and cudagraph management. `UBatchContext` is a wrapper around `ForwardContext` that allows the two UBatch threads to synchronize with each other.
+## Running with DBO
+To enable the DBO system, pass the `--enable-dbo` argument to your vllm serve command. This must be run in conjunction with `--data-parallel-size N`, where N is greater than 1, and `--enable-expert-parallel`. Additionally, there are two configuration knobs:
+`--dbo-decode-token-threshold`: the minimum number of tokens in a decode-only batch required to enable DBO for that batch.
+`--dbo-prefill-token-threshold`: the minimum number of tokens in a batch containing at least one prefill required to enable DBO for that batch.
+
+Currently, DBO is only supported with DeepEP, so you’ll have to install it and set the `VLLM_ALL2ALL_BACKEND` environment variable to `deepep_low_latency` if your workload is primarily decode requests, or `deepep_high_throughput` if your workload is primarily prefill requests.
+## DBO Components
+* GPUModelRunner
+* UBatchWrapper
+* UBatchContext
+### GPU Model Runner
+The `GpuModelRunner` is responsible for splitting up the batch into microbatches. Mechanically, this requires two steps. The first is to coordinate between all of the DP ranks to decide if we are microbatching. Microbatching must be uniform between all DP ranks. If any DP rank doesn’t want to microbatch, none of them will.
If all DP ranks want to microbatch, the total number of tokens is padded up to the max number of tokens amongst all ranks. If any rank would end up with an empty second microbatch after the padding is applied, microbatching will be aborted and no ranks will microbatch. Once all ranks have decided to microbatch, the second step is to slice up the `CommonAttentionMetadata` so that we have one attention metadata per microbatch.
+### UBatchWrapper
+gpu_ubatch_wrapper
+
+The `UBatchWrapper` class is a model wrapper that's responsible for all of the thread, `UBatchContext`, and CUDA graph management for DBO. It's designed to be relatively transparent to the GPU Model Runner.
+The implementation revolves around running the model twice, once for each microbatch. Each invocation of the model happens inside of a CPU thread. These threads are launched in parallel and are synchronized using the `UBatchContext`. Each thread is given a "sliced" version of the attention metadata that it will use to run its half of the batch.
+
+Cudagraphs for DBO are entirely managed by the `UBatchWrapper` as well. Because of this, DBO only supports running with Full Cudagraphs. However, once we’ve captured a DBO cudagraph, we can replay it without any multithreading or CPU synchronization.
+#### Interfaces
+`__init__` method takes in the model, VllmConfig, CUDAGraphMode, and device.
+`forward` method exclusively takes in model arguments. It runs with DBO when there is a `ubatch_slices` object in the `forward_context`; otherwise, it simply runs the model directly.
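The dispatch decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the vLLM implementation: the `ForwardContext` and `UBatchWrapper` here are simplified stand-ins, and real ubatch slices also carry per-microbatch attention metadata.

```python
import threading
from dataclasses import dataclass
from typing import Optional

@dataclass
class ForwardContext:
    # Present only when the model runner decided to microbatch.
    ubatch_slices: Optional[list] = None

class UBatchWrapper:
    """Simplified stand-in: run the model once per microbatch on two
    threads when ubatch_slices is present, otherwise run it directly."""

    def __init__(self, model):
        self.model = model

    def forward(self, tokens, forward_context):
        if forward_context.ubatch_slices is None:
            return self.model(tokens)  # no DBO: naive single pass
        results = [None] * len(forward_context.ubatch_slices)

        def run(i, sl):
            results[i] = self.model(tokens[sl])

        threads = [
            threading.Thread(target=run, args=(i, sl))
            for i, sl in enumerate(forward_context.ubatch_slices)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # Concatenate the per-microbatch outputs back into one batch.
        return [y for chunk in results for y in chunk]

# Toy "model" that doubles each token id.
wrapper = UBatchWrapper(lambda xs: [2 * x for x in xs])
full = wrapper.forward([1, 2, 3, 4], ForwardContext())
split = wrapper.forward(
    [1, 2, 3, 4], ForwardContext(ubatch_slices=[slice(0, 2), slice(2, 4)])
)
assert full == split == [2, 4, 6, 8]
```

Either path produces the same result for this toy model; the point of the split path is that, in the real system, the two threads can overlap their communication and compute phases.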
+### UBatchContext
+ubatch_context
+
+The `UBatchContext` class is a `ForwardContext` wrapper class that is used by the `UBatchWrapper` class to synchronize the two UBatch threads. It should only be instantiated by using `make_ubatch_contexts`.
+
+The `UBatchWrapper` class enables two UBatch threads, A and B, to alternate execution. When the `forward` method is invoked, thread A begins processing its part of the model. Upon reaching a `dbo_yield` call, thread A pauses, and thread B starts its execution. This "ping-pong" dynamic continues, with the threads swapping at each `dbo_yield` call, until the model's execution is complete.
+
+The current implementation has all `dbo_yield` and `dbo_maybe_run_recv_hook` calls in the `FusedMoEModularKernel.forward` method.
+#### Interfaces
+`make_ubatch_contexts` function initializes two `UBatchContexts`, one for each UBatch thread. It takes two CUDA streams, the preexisting `ForwardContexts`, and a CPU thread barrier. You should exclusively use this function to instantiate `UBatchContexts`; it will handle all of the event initialization.
+
+`dbo_register_recv_hook` method registers a callback that can be returned by the `FusedMoEPrepareAndFinalize` class in the other UBatch thread’s `UBatchContext`. The callback will be run when the other thread calls `dbo_maybe_run_recv_hook`. This is typically used to “wait” on an all-to-all kernel.
+
+`dbo_maybe_run_recv_hook` method runs the callback set by the `dbo_register_recv_hook` method, if that callback exists.
+`dbo_yield` method puts the current thread to sleep and wakes up the other UBatch thread From 686349a2056bdb8821661197d593d1c67c1d6cd5 Mon Sep 17 00:00:00 2001 From: Sage Moore Date: Wed, 1 Oct 2025 06:10:40 -0700 Subject: [PATCH 03/17] init Signed-off-by: Sage Moore --- docs/design/dbo.md | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/docs/design/dbo.md b/docs/design/dbo.md index b11323810fd2..7d57662f1d32 100644 --- a/docs/design/dbo.md +++ b/docs/design/dbo.md @@ -5,6 +5,21 @@ The core motivation of the DBO system in vLLM is to overlap the sparse all-to-al The Dual Batch Overlap system works by splitting the batch up in the model runner, creating two worker threads, and then running the model twice inside of each of these worker threads. When DBO is enabled, there are yield points within the FusedMoEModularKernel that will allow the two worker threads to ping-pong between each other so that when one is running compute, the other is waiting on communication. The DBO system modifies the `GpuModelRunner` and `ModularKernel` along with adding two new sub-systems: `UBatchWrapper` and `UBatchContext`. `UBatchWrapper` is a wrapper around the model that is responsible for all of the thread and cudagraph management. `UBatchContext` is a wrapper around `ForwardContext` that allows the two UBatch threads to synchronize with each other. + +Below are the two overlap schedules that are currently implemented in vLLM. +``` +Comp: |-A0₀-A1₀-S₀-||-MLP₁-||-MLP₀-||-A0₁-A1₁-S₁-| +Comm: |-----D₁-----||--D₀--||--C₁--||-----C₀-----| +Order: D₁ send, A0₀, A1₀, S₀, D₁ recv, D₀ send, MLP₁, D₀ recv, + C₁ send, MLP₀, C₁ recv, C₀ send, A0₁, A1₁, S₁, C₀ recv. +MLP_OVERLAP = "mlp_overlap" + +Comp: |-A0₀-A1₀-||-MLP₁-||-S₁-MLP₀-||-S₀-A0₁-A1₁-| +Comm: |----D₁---||--D₀--||----C₁---||-----C₀-----| +Order: D₁ send, A0₀, A1₀, D₁ recv, D₀ send, MLP₁, D₀ recv, + C₁ send, S₁, MLP₀, C₁ recv, C₀ send, S₀, A0₁, A1₁, C₀ recv. 
+MLP_SHARED_OVERLAP = "mlp_shared_overlap" +``` ## Running with DBO To enable the DBO system pass in the `--enable-dbo` argument to your vllm serve command. This must be run in conjunction with `--data-parallel-size N` where N is greater than 1 and `--enable-expert-parallel`. Additionally, there are two configuration knobs. `--dbo-decode-token-threshold` the minimum number of tokens in a decode-only batch required to enable DBO for that batch. From 361a9eb5b12f783c6b169fe64211efc063280ab7 Mon Sep 17 00:00:00 2001 From: Sage Moore Date: Wed, 1 Oct 2025 06:12:53 -0700 Subject: [PATCH 04/17] formatting Signed-off-by: Sage Moore --- docs/design/dbo.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/design/dbo.md b/docs/design/dbo.md index 7d57662f1d32..34aba434e7e0 100644 --- a/docs/design/dbo.md +++ b/docs/design/dbo.md @@ -43,6 +43,7 @@ Cudagraphs for DBO are entirely managed by the `UBatchWrapper` as well. Because #### Interfaces `__init__` method takes in the model, VllmConfig, CUDAGraphMode, and device. + `forward` method exclusively takes in model arguments. It determines whether or not to run with DBO if there's a `ubatch_slices` object in the `forward_context`. Otherwise it just naively runs the model. ### UBatchContext ubatch_context From 5f64ab99e7dd764b5b21729336570fc324e90d4f Mon Sep 17 00:00:00 2001 From: Sage Moore Date: Wed, 1 Oct 2025 06:24:59 -0700 Subject: [PATCH 05/17] formatting Signed-off-by: Sage Moore --- docs/design/dbo.md | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/docs/design/dbo.md b/docs/design/dbo.md index 34aba434e7e0..3549395fab9d 100644 --- a/docs/design/dbo.md +++ b/docs/design/dbo.md @@ -1,8 +1,12 @@ # Dual Batch Overlap + ## Motivation + The core motivation of the DBO system in vLLM is to overlap the sparse all-to-all communication in the MoE layer with the surrounding computation. This system currently only targets DP+EP deployments. 
+ ## Introduction -The Dual Batch Overlap system works by splitting the batch up in the model runner, creating two worker threads, and then running the model twice inside of each of these worker threads. When DBO is enabled, there are yield points within the FusedMoEModularKernel that will allow the two worker threads to ping-pong between each other so that when one is running compute, the other is waiting on communication. + +The Dual Batch Overlap system works by splitting the batch up in the model runner, creating two worker threads, and then having each of these worker threads run the model. When DBO is enabled, there are yield points within the FusedMoEModularKernel that will allow the two worker threads to ping-pong between each other so that when one is running compute, the other is waiting on communication. The DBO system modifies the `GpuModelRunner` and `ModularKernel` along with adding two new sub-systems: `UBatchWrapper` and `UBatchContext`. `UBatchWrapper` is a wrapper around the model that is responsible for all of the thread and cudagraph management. `UBatchContext` is a wrapper around `ForwardContext` that allows the two UBatch threads to synchronize with each other. @@ -21,18 +25,25 @@ Order: D₁ send, A0₀, A1₀, D₁ recv, D₀ send, MLP₁, D₀ recv, MLP_SHARED_OVERLAP = "mlp_shared_overlap" ``` ## Running with DBO + To enable the DBO system pass in the `--enable-dbo` argument to your vllm serve command. This must be run in conjunction with `--data-parallel-size N` where N is greater than 1 and `--enable-expert-parallel`. Additionally, there are two configuration knobs. `--dbo-decode-token-threshold` the minimum number of tokens in a decode-only batch required to enable DBO for that batch. 
`--dbo-prefill-token-threshold` the minimum number of tokens in a batch containing at least one prefill required to enable DBO for that batch Currently DBO is only supported with DeepEP so you’ll have to install that and set the `VLLM_ALL2ALL_BACKEND` environment variable to `deepep_low_latency` if your workload is primarily decode requests and `deepep_high_throughput` if your workload is primarily prefill requests. + ## DBO Components + * GPUModelRunner * UBatchWrapper * UBatchContext + ### GPU Model Runner + The `GpuModelRunner` is responsible for splitting up the batch into microbatches. Mechanically this requires two steps. The first is to coordinate between all of the DP ranks to decide if we are microbatching. Microbatching must be uniform between all DP ranks. If any DP rank doesn’t want to microbatch, none of them will. If all DP ranks want to microbatch, the total number of tokens is padded up to the max number of tokens amongst all ranks. If any rank would end up with an empty second microbatch after the padding is applied, microbatching will be aborted and no ranks will microbatch. Once all ranks have decided to microbatch, the second step is to slice up the `CommonAttentionMetadata` so that we have one attention metadata per-microbatch. + ### UBatchWrapper + gpu_ubatch_wrapper The `UBatchWrapper` class is a model wrapper that's responsible for all of the thread, UBatchContext, and cuda graph management for DBO. It's designed to be relatively transparent to the GPU Model Runner. @@ -42,18 +53,23 @@ The implementation revolves around running the model twice, once for each microb Cudagraphs for DBO are entirely managed by the `UBatchWrapper` as well. Because of this, DBO only supports running with Full Cudagraphs. However, once we’ve captured a DBO cudagraph, we can replay it without any multithreading or CPU synchronization. #### Interfaces + `__init__` method takes in the model, VllmConfig, CUDAGraphMode, and device. 
`forward` method exclusively takes in model arguments. It determines whether or not to run with DBO if there's a `ubatch_slices` object in the `forward_context`. Otherwise it just naively runs the model. + ### UBatchContext + ubatch_context The `UBatchContext` class is a `ForwardContext` wrapper class that is used by the `UBatchWrapper` class to synchronize the two UBatch threads. It should only be instantiated by using `make_ubatch_contexts`. -The `UBatchWrapper` class enables two UBatch threads, A and B, to alternate execution. When the `forward` method is invoked, thread A begins processing its part of the model. Upon reaching a `dbo_yield` call, thread A pauses, and thread B starts its execution. This "ping-pong" dynamic continues, with threads swapping at each dbo_yield call, until the model's execution is complete. +The `UBatchWrapper` class enables two UBatch threads, A and B, to alternate execution. When the `forward` method is invoked, thread A begins processing its part of the model. Upon reaching a `dbo_yield` call, thread A pauses, and thread B starts its execution. This "ping-pong" dynamic continues, with threads swapping at each `dbo_yield call`, until the model's execution is complete. The current implementation has all `dbo_yield` and `dbo_maybe_run_recv_hook` calls in the `FusedMoEModularKernel.forward` method. + #### Interfaces + `make_ubatch_context` function initializes two `UBatchContexts`, one for each UBatch thread. It takes two cuda streams, the preexisting `ForwardContexts` and a cpu thread barrier. You should exclusively use this function to instantiate `UBatchContexts`. It will handle all of the event initialization. `dbo_register_recv_hook` method registers a callback that can be returned by the `FusedMoEPrepareAndFinalize` class in the other UBatch thread’s `UBatchContext`. The callback will be run when the other thread calls `dbo_maybe_run_recv_hook`. 
This is typically used to “wait” on an all-to-all kernel From ba6e6f972d475dc4d75dcafcb41453dab0aea3cc Mon Sep 17 00:00:00 2001 From: Sage Moore Date: Wed, 1 Oct 2025 06:26:29 -0700 Subject: [PATCH 06/17] formatting Signed-off-by: Sage Moore --- docs/design/dbo.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/design/dbo.md b/docs/design/dbo.md index 3549395fab9d..54b335f55059 100644 --- a/docs/design/dbo.md +++ b/docs/design/dbo.md @@ -1,5 +1,6 @@ # Dual Batch Overlap + ## Motivation The core motivation of the DBO system in vLLM is to overlap the sparse all-to-all communication in the MoE layer with the surrounding computation. This system currently only targets DP+EP deployments. From d595d387cffb458888df5e4ab25136f704e7eff0 Mon Sep 17 00:00:00 2001 From: Sage Moore Date: Wed, 1 Oct 2025 06:27:35 -0700 Subject: [PATCH 07/17] formatting Signed-off-by: Sage Moore --- docs/design/dbo.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/design/dbo.md b/docs/design/dbo.md index 54b335f55059..3549395fab9d 100644 --- a/docs/design/dbo.md +++ b/docs/design/dbo.md @@ -1,6 +1,5 @@ # Dual Batch Overlap - ## Motivation The core motivation of the DBO system in vLLM is to overlap the sparse all-to-all communication in the MoE layer with the surrounding computation. This system currently only targets DP+EP deployments. From 703d6ed6f8d934084260cffbc2a897c9b3048ae7 Mon Sep 17 00:00:00 2001 From: Sage Moore Date: Wed, 1 Oct 2025 07:24:06 -0700 Subject: [PATCH 08/17] rewording Signed-off-by: Sage Moore --- docs/design/dbo.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design/dbo.md b/docs/design/dbo.md index 3549395fab9d..874a84ee1829 100644 --- a/docs/design/dbo.md +++ b/docs/design/dbo.md @@ -64,7 +64,7 @@ ubatch_context The `UBatchContext` class is a `ForwardContext` wrapper class that is used by the `UBatchWrapper` class to synchronize the two UBatch threads. It should only be instantiated by using `make_ubatch_contexts`. 
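The thread handoff that `UBatchContext` provides can be illustrated with a minimal two-event sketch. This is a toy model under stated assumptions (`ToyUBatchContext` and `make_toy_contexts` are hypothetical names); the real `UBatchContext` also coordinates CUDA streams and events and handles recv hooks.

```python
import threading

class ToyUBatchContext:
    """Toy sketch of a dbo_yield-style handoff between two threads.
    Each context owns the event it sleeps on; yielding wakes the peer."""

    def __init__(self, my_turn: threading.Event, peer_turn: threading.Event):
        self.my_turn = my_turn
        self.peer_turn = peer_turn

    def dbo_yield(self):
        # Wake the other UBatch thread, then sleep until it yields back.
        self.peer_turn.set()
        self.my_turn.wait()
        self.my_turn.clear()

def make_toy_contexts():
    a, b = threading.Event(), threading.Event()
    return ToyUBatchContext(a, b), ToyUBatchContext(b, a)

log = []
ctx0, ctx1 = make_toy_contexts()

def worker(name, ctx, starts_first):
    if not starts_first:
        ctx.my_turn.wait()  # second thread sleeps until the first yields
        ctx.my_turn.clear()
    for step in ("dispatch", "mlp", "combine"):
        log.append(f"{name}:{step}")
        ctx.dbo_yield()
    ctx.peer_turn.set()  # final handoff so the peer's last wait returns

threads = [
    threading.Thread(target=worker, args=("ub0", ctx0, True)),
    threading.Thread(target=worker, args=("ub1", ctx1, False)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# The two threads strictly alternate: ub0:dispatch, ub1:dispatch, ub0:mlp, ...
```

In the real system the yield points sit between the compute and communication phases of the MoE layer, so "my turn" corresponds to running compute while the peer's all-to-all is in flight.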
-The `UBatchWrapper` class enables two UBatch threads, A and B, to alternate execution. When the `forward` method is invoked, thread A begins processing its part of the model. Upon reaching a `dbo_yield` call, thread A pauses, and thread B starts its execution. This "ping-pong" dynamic continues, with threads swapping at each `dbo_yield call`, until the model's execution is complete.
+When one of the `UBatch` threads reaches a `dbo_yield` call, it pauses and starts the other thread, which runs until it reaches the same `dbo_yield` call. This "ping-pong" dynamic continues, with the threads swapping at each `dbo_yield` call, until the model's execution is complete.

 The current implementation has all `dbo_yield` and `dbo_maybe_run_recv_hook` calls in the `FusedMoEModularKernel.forward` method.

From 2e064ca84100a6592241d69416fefa1582999584 Mon Sep 17 00:00:00 2001
From: Sage Moore
Date: Wed, 1 Oct 2025 07:31:19 -0700
Subject: [PATCH 09/17] format
Signed-off-by: Sage Moore

---
 docs/design/dbo.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/design/dbo.md b/docs/design/dbo.md
index 874a84ee1829..f557cc6123ba 100644
--- a/docs/design/dbo.md
+++ b/docs/design/dbo.md
@@ -6,11 +6,12 @@ The core motivation of the DBO system in vLLM is to overlap the sparse all-to-al

 ## Introduction

-The Dual Batch Overlap system works by splitting the batch up in the model runner, creating two worker threads, and then having each of these worker threads run the model.
When DBO is enabled, there are yield points within the FusedMoEModularKernel that will allow the two worker threads to ping-pong between each other so that when one is running compute, the other is waiting on communication. The DBO system modifies the `GpuModelRunner` and `ModularKernel` along with adding two new sub-systems: `UBatchWrapper` and `UBatchContext`. `UBatchWrapper` is a wrapper around the model that is responsible for all of the thread and cudagraph management. `UBatchContext` is a wrapper around `ForwardContext` that allows the two UBatch threads to synchronize with each other. Below are the two overlap schedules that are currently implemented in vLLM. + ``` Comp: |-A0₀-A1₀-S₀-||-MLP₁-||-MLP₀-||-A0₁-A1₁-S₁-| Comm: |-----D₁-----||--D₀--||--C₁--||-----C₀-----| @@ -24,6 +25,7 @@ Order: D₁ send, A0₀, A1₀, D₁ recv, D₀ send, MLP₁, D₀ recv, C₁ send, S₁, MLP₀, C₁ recv, C₀ send, S₀, A0₁, A1₁, C₀ recv. MLP_SHARED_OVERLAP = "mlp_shared_overlap" ``` + ## Running with DBO To enable the DBO system pass in the `--enable-dbo` argument to your vllm serve command. This must be run in conjunction with `--data-parallel-size N` where N is greater than 1 and `--enable-expert-parallel`. Additionally, there are two configuration knobs. @@ -77,5 +79,3 @@ The current implementation has all `dbo_yield` and `dbo_maybe_run_recv_hook` cal `dbo_maybe_run_recv_hook` method runs a callback that’s set by the `dbo_register_recv_hook` function if that callback exists. 
`dbo_yield` method puts the current thread to sleep and wakes up the other UBatch thread - - From ebb4ed3712f943fbd8dd2524b6ab2a6899995202 Mon Sep 17 00:00:00 2001 From: Sage Moore Date: Wed, 1 Oct 2025 07:35:23 -0700 Subject: [PATCH 10/17] format Signed-off-by: Sage Moore --- docs/design/dbo.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design/dbo.md b/docs/design/dbo.md index f557cc6123ba..4134a65d9870 100644 --- a/docs/design/dbo.md +++ b/docs/design/dbo.md @@ -12,7 +12,7 @@ The DBO system modifies the `GpuModelRunner` and `ModularKernel` along with addi Below are the two overlap schedules that are currently implemented in vLLM. -``` +```text Comp: |-A0₀-A1₀-S₀-||-MLP₁-||-MLP₀-||-A0₁-A1₁-S₁-| Comm: |-----D₁-----||--D₀--||--C₁--||-----C₀-----| Order: D₁ send, A0₀, A1₀, S₀, D₁ recv, D₀ send, MLP₁, D₀ recv, From 7ee633370c3735353d4db3fedf0d9379234424c7 Mon Sep 17 00:00:00 2001 From: Sage Moore Date: Wed, 1 Oct 2025 07:38:28 -0700 Subject: [PATCH 11/17] format Signed-off-by: Sage Moore --- docs/design/dbo.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/docs/design/dbo.md b/docs/design/dbo.md index 4134a65d9870..09f8dd8a88cc 100644 --- a/docs/design/dbo.md +++ b/docs/design/dbo.md @@ -12,18 +12,18 @@ The DBO system modifies the `GpuModelRunner` and `ModularKernel` along with addi Below are the two overlap schedules that are currently implemented in vLLM. -```text -Comp: |-A0₀-A1₀-S₀-||-MLP₁-||-MLP₀-||-A0₁-A1₁-S₁-| -Comm: |-----D₁-----||--D₀--||--C₁--||-----C₀-----| -Order: D₁ send, A0₀, A1₀, S₀, D₁ recv, D₀ send, MLP₁, D₀ recv, - C₁ send, MLP₀, C₁ recv, C₀ send, A0₁, A1₁, S₁, C₀ recv. -MLP_OVERLAP = "mlp_overlap" - -Comp: |-A0₀-A1₀-||-MLP₁-||-S₁-MLP₀-||-S₀-A0₁-A1₁-| -Comm: |----D₁---||--D₀--||----C₁---||-----C₀-----| -Order: D₁ send, A0₀, A1₀, D₁ recv, D₀ send, MLP₁, D₀ recv, - C₁ send, S₁, MLP₀, C₁ recv, C₀ send, S₀, A0₁, A1₁, C₀ recv. 
-MLP_SHARED_OVERLAP = "mlp_shared_overlap" +```python +# Comp: |-A0₀-A1₀-S₀-||-MLP₁-||-MLP₀-||-A0₁-A1₁-S₁-| +# Comm: |-----D₁-----||--D₀--||--C₁--||-----C₀-----| +# Order: D₁ send, A0₀, A1₀, S₀, D₁ recv, D₀ send, MLP₁, D₀ recv, +# C₁ send, MLP₀, C₁ recv, C₀ send, A0₁, A1₁, S₁, C₀ recv. +# MLP_OVERLAP = "mlp_overlap" + +# Comp: |-A0₀-A1₀-||-MLP₁-||-S₁-MLP₀-||-S₀-A0₁-A1₁-| +# Comm: |----D₁---||--D₀--||----C₁---||-----C₀-----| +# Order: D₁ send, A0₀, A1₀, D₁ recv, D₀ send, MLP₁, D₀ recv, +# C₁ send, S₁, MLP₀, C₁ recv, C₀ send, S₀, A0₁, A1₁, C₀ recv. +# MLP_SHARED_OVERLAP = "mlp_shared_overlap" ``` ## Running with DBO From fe19bdaee8b211b79daca95b3d37e13af9cb4a23 Mon Sep 17 00:00:00 2001 From: Sage Moore Date: Wed, 1 Oct 2025 07:45:37 -0700 Subject: [PATCH 12/17] format Signed-off-by: Sage Moore --- docs/design/dbo.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/design/dbo.md b/docs/design/dbo.md index 09f8dd8a88cc..623a47f6fe09 100644 --- a/docs/design/dbo.md +++ b/docs/design/dbo.md @@ -29,8 +29,8 @@ Below are the two overlap schedules that are currently implemented in vLLM. ## Running with DBO To enable the DBO system pass in the `--enable-dbo` argument to your vllm serve command. This must be run in conjunction with `--data-parallel-size N` where N is greater than 1 and `--enable-expert-parallel`. Additionally, there are two configuration knobs. -`--dbo-decode-token-threshold` the minimum number of tokens in a decode-only batch required to enable DBO for that batch. -`--dbo-prefill-token-threshold` the minimum number of tokens in a batch containing at least one prefill required to enable DBO for that batch +* `--dbo-decode-token-threshold` the minimum number of tokens in a decode-only batch required to enable DBO for that batch. 
+* `--dbo-prefill-token-threshold` the minimum number of tokens in a batch containing at least one prefill required to enable DBO for that batch Currently DBO is only supported with DeepEP so you’ll have to install that and set the `VLLM_ALL2ALL_BACKEND` environment variable to `deepep_low_latency` if your workload is primarily decode requests and `deepep_high_throughput` if your workload is primarily prefill requests. From 512f2087d326ae66301f64d5bd29571db1c0cc17 Mon Sep 17 00:00:00 2001 From: Sage Moore Date: Wed, 1 Oct 2025 08:56:35 -0700 Subject: [PATCH 13/17] format Signed-off-by: Sage Moore --- docs/design/dbo.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/design/dbo.md b/docs/design/dbo.md index 623a47f6fe09..436a9ab2c63a 100644 --- a/docs/design/dbo.md +++ b/docs/design/dbo.md @@ -29,6 +29,7 @@ Below are the two overlap schedules that are currently implemented in vLLM. ## Running with DBO To enable the DBO system pass in the `--enable-dbo` argument to your vllm serve command. This must be run in conjunction with `--data-parallel-size N` where N is greater than 1 and `--enable-expert-parallel`. Additionally, there are two configuration knobs. + * `--dbo-decode-token-threshold` the minimum number of tokens in a decode-only batch required to enable DBO for that batch. 
* `--dbo-prefill-token-threshold` the minimum number of tokens in a batch containing at least one prefill required to enable DBO for that batch From 578b4f7315a79485738ae60bf5fb48f49b12c985 Mon Sep 17 00:00:00 2001 From: Sage Moore Date: Mon, 6 Oct 2025 13:48:48 -0700 Subject: [PATCH 14/17] review comments Signed-off-by: Sage Moore --- docs/design/dbo.md | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/docs/design/dbo.md b/docs/design/dbo.md index 436a9ab2c63a..19de6d2f8e84 100644 --- a/docs/design/dbo.md +++ b/docs/design/dbo.md @@ -6,18 +6,19 @@ The core motivation of the DBO system in vLLM is to overlap the sparse all-to-al ## Introduction -The Dual Batch Overlap system works by splitting the batch up in the model runner, creating two worker threads, and then having each of these worker threads run the model. When DBO is enabled, there are yield points within the FusedMoEModularKernel that will allow the two worker threads to ping-pong between each other so that when one is running compute, the other is waiting on communication. +The Dual Batch Overlap system works by splitting the batch up in the model runner, creating two worker threads, and then having each of these worker threads run the model. When DBO is enabled, there are yield points within the FusedMoEModularKernel that will allow the two worker threads to ping-pong between each other so that when one is running compute, the other is waiting on communication. Throughout the code you may see ubatch being used as a short form of microbatch; this is just an ASCII friendly version of the short form µ-batch. -The DBO system modifies the `GpuModelRunner` and `ModularKernel` along with adding two new sub-systems: `UBatchWrapper` and `UBatchContext`. `UBatchWrapper` is a wrapper around the model that is responsible for all of the thread and cudagraph management. 
`UBatchContext` is a wrapper around `ForwardContext` that allows the two UBatch threads to synchronize with each other.
+The DBO system includes modifications to `GpuModelRunner` and `ModularKernel`, and defines two utility classes: `UBatchWrapper` and `UBatchContext`. `UBatchWrapper` manages thread lifecycle and CUDA graph execution of the model. `UBatchContext` wraps `ForwardContext` to coordinate synchronization between the two UBatch threads.

-Below are the two overlap schedules that are currently implemented in vLLM.
+Below is the overlap schedule that is currently implemented in vLLM.

 ```python
-# Comp: |-A0₀-A1₀-S₀-||-MLP₁-||-MLP₀-||-A0₁-A1₁-S₁-|
-# Comm: |-----D₁-----||--D₀--||--C₁--||-----C₀-----|
-# Order: D₁ send, A0₀, A1₀, S₀, D₁ recv, D₀ send, MLP₁, D₀ recv,
-#        C₁ send, MLP₀, C₁ recv, C₀ send, A0₁, A1₁, S₁, C₀ recv.
-# MLP_OVERLAP = "mlp_overlap"
+# Schedule notation legend:
+# S = Shared expert
+# A0 = MLA qkv proj
+# A1 = Core attn + out proj + MoE gate
+# D = Dispatch
+# C = Combine

 # Comp: |-A0₀-A1₀-||-MLP₁-||-S₁-MLP₀-||-S₀-A0₁-A1₁-|
 # Comm: |----D₁---||--D₀--||----C₁---||-----C₀-----|
@@ -35,6 +36,11 @@ To enable the DBO system pass in the `--enable-dbo` argument to your vllm serve

 Currently DBO is only supported with DeepEP so you’ll have to install that and set the `VLLM_ALL2ALL_BACKEND` environment variable to `deepep_low_latency` if your workload is primarily decode requests and `deepep_high_throughput` if your workload is primarily prefill requests.

+Below is a command that will spin up a two DP rank server with expert parallelism and DBO enabled.
EX: `VLLM_ALL2ALL_BACKEND=deepep_low_latency vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --trust-remote-code --data-parallel-size 2 --enable-expert-parallel --enable-dbo`
+
+Note that you'll need to have DeepEP installed and at least two GPUs visible in `CUDA_VISIBLE_DEVICES`.

 ## DBO Components

From dd09fa0f287c6e4a4b1fcedd2c3798f45a6c0654 Mon Sep 17 00:00:00 2001
From: Sage Moore
Date: Mon, 6 Oct 2025 14:02:43 -0700
Subject: [PATCH 15/17] refactoring
Signed-off-by: Sage Moore

---
 docs/design/dbo.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/design/dbo.md b/docs/design/dbo.md
index 19de6d2f8e84..0b7521932e36 100644
--- a/docs/design/dbo.md
+++ b/docs/design/dbo.md
@@ -49,7 +49,7 @@

 ### GPU Model Runner

-The `GpuModelRunner` is responsible for splitting up the batch into microbatches. Mechanically this requires two steps. The first is to coordinate between all of the DP ranks to decide if we are microbatching. Microbatching must be uniform between all DP ranks. If any DP rank doesn’t want to microbatch, none of them will. If all DP ranks want to microbatch, the total number of tokens is padded up to the max number of tokens amongst all ranks. If any rank would end up with an empty second microbatch after the padding is applied, microbatching will be aborted and no ranks will microbatch. Once all ranks have decided to microbatch, the second step is to slice up the `CommonAttentionMetadata` so that we have one attention metadata per-microbatch.
+The batch is split into microbatches by the `GPUModelRunner` class. This is accomplished in two steps. First, coordination across all DP ranks is performed to determine whether microbatching will be applied. Microbatching must be uniform across all DP ranks. If microbatching is not feasible for any DP rank, it is disabled for all ranks.
If all DP ranks are going to microbatch, the total number of tokens is padded up to the max number of tokens amongst all ranks. If any rank would end up with an empty second microbatch after the padding is applied, microbatching will be aborted and no ranks will microbatch. Once microbatching has been initiated by all ranks, the second step is performed. The `CommonAttentionMetadata` is sliced in half by the `GPUModelRunner` so that there is one attention metadata per-microbatch. ### UBatchWrapper From b278c33d6320446a297416d937e3268cb790e6f3 Mon Sep 17 00:00:00 2001 From: Sage Moore Date: Mon, 6 Oct 2025 14:35:47 -0700 Subject: [PATCH 16/17] refactoring Signed-off-by: Sage Moore --- docs/design/dbo.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/docs/design/dbo.md b/docs/design/dbo.md index 0b7521932e36..7658362ce53e 100644 --- a/docs/design/dbo.md +++ b/docs/design/dbo.md @@ -6,7 +6,7 @@ The core motivation of the DBO system in vLLM is to overlap the sparse all-to-al ## Introduction -The Dual Batch Overlap system works by splitting the batch up in the model runner, creating two worker threads, and then having each of these worker threads run the model. When DBO is enabled, there are yield points within the FusedMoEModularKernel that will allow the two worker threads to ping-pong between each other so that when one is running compute, the other is waiting on communication. Throughout the code you may see ubatch being used as a short form of microbatch; this is just an ASCII friendly version of the short form µ-batch. +The Dual Batch Overlap system works by splitting the batch in the model runner, creating two worker threads, and then running the model on each of these worker threads. 
When DBO is enabled, yield points within the `FusedMoEModularKernel` allow the two CPU worker threads (also called UBatch threads) to ping-pong between each other so that when one is running compute, the other is waiting on communication. Throughout the code, ubatch may be used as a short form of microbatch; this is an ASCII-friendly version of the short form µ-batch. The DBO system includes modifications to `GpuModelRunner` and `ModularKernel`, and defines two utility classes: `UBatchWrapper` and `UBatchContext`. `UBatchWrapper` manages thread lifecycle and CUDA graph execution of the model. `UBatchContext` wraps `ForwardContext` to coordinate synchronization between the two UBatch threads. @@ -34,12 +34,12 @@ To enable the DBO system pass in the `--enable-dbo` argument to your vllm serve * `--dbo-decode-token-threshold` the minimum number of tokens in a decode-only batch required to enable DBO for that batch. * `--dbo-prefill-token-threshold` the minimum number of tokens in a batch containing at least one prefill required to enable DBO for that batch -Currently DBO is only supported with DeepEP so you’ll have to install that and set the `VLLM_ALL2ALL_BACKEND` environment variable to `deepep_low_latency` if your workload is primarily decode requests and `deepep_high_throughput` if your workload is primarily prefill requests. +Currently, DBO is only supported with DeepEP, so DeepEP must be installed and the `VLLM_ALL2ALL_BACKEND` environment variable must be set to `deepep_low_latency` if your workload is primarily decode requests, or `deepep_high_throughput` if your workload is primarily prefill requests. Below is a command that will spin up a two DP rank server with expert parallelism and DBO enabled. 
EX: `VLLM_ALL2ALL_BACKEND=deepep_low_latency vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --trust-remote-code --data-parallel-size 2 --enable-expert-parallel --enable-dbo` -Note that you'll need to have DeepEp installed and at least two GPUs visible in `CUDA_VISIBLE_DEVICES` +Note that there must be at least two GPUs visible in `CUDA_VISIBLE_DEVICES` ## DBO Components @@ -55,17 +55,17 @@ The batch is split into microbatches by the `GPUModelRunner` class. This is acco gpu_ubatch_wrapper -The `UBatchWrapper` class is a model wrapper that's responsible for all of the thread, UBatchContext, and cuda graph management for DBO. It's designed to be relatively transparent to the GPU Model Runner. +The `UBatchWrapper` class is a model wrapper that's responsible for all of the thread, UBatchContext, and CUDA graph management for DBO. It's designed to be relatively transparent to the GPU Model Runner. -The implementation revolves around running the model twice, once for each microbatch. Each invocation of the model will happen inside of a cpu thread. These threads are launched in parallel and are synchronized using the `UBatchContext`. Each thread is given a “sliced” version of the attention metadata that they will use to run their half of the batch. +The implementation runs the model twice, once for each microbatch. Each invocation of the model will happen inside of a `UBatch` thread. These threads are launched in parallel and are synchronized using the `UBatchContext`. Each thread is given a “sliced” version of the attention metadata that they will use to run their half of the batch. -Cudagraphs for DBO are entirely managed by the `UBatchWrapper` as well. Because of this, DBO only supports running with Full Cudagraphs. However, once we’ve captured a DBO cudagraph, we can replay it without any multithreading or CPU synchronization. +CUDA graphs for DBO are entirely managed by the `UBatchWrapper`. Because of this, DBO only supports running with Full CUDA graphs. 
However, once a DBO CUDA graph has been captured, it can be replayed without any multithreading or CPU synchronization. #### Interfaces -`__init__` method takes in the model, VllmConfig, CUDAGraphMode, and device. +The `__init__` method takes in the model, VllmConfig, CUDAGraphMode, and device. -`forward` method exclusively takes in model arguments. It determines whether or not to run with DBO if there's a `ubatch_slices` object in the `forward_context`. Otherwise it just naively runs the model. +The `forward` method exclusively takes in model arguments. It determines whether or not to run with DBO based on whether a `ubatch_slices` object is present in the `forward_context`. Otherwise, the model is run without DBO. ### UBatchContext @@ -79,10 +79,10 @@ The current implementation has all `dbo_yield` and `dbo_maybe_run_recv_hook` cal #### Interfaces -`make_ubatch_context` function initializes two `UBatchContexts`, one for each UBatch thread. It takes two cuda streams, the preexisting `ForwardContexts` and a cpu thread barrier. You should exclusively use this function to instantiate `UBatchContexts`. It will handle all of the event initialization. +The `make_ubatch_context` function initializes two `UBatchContexts`, one for each UBatch thread. It takes two CUDA streams, the preexisting `ForwardContexts` and a cpu thread barrier. This function should be used exclusively to instantiate `UBatchContexts`. It will handle all of the event initialization. -`dbo_register_recv_hook` method registers a callback that can be returned by the `FusedMoEPrepareAndFinalize` class in the other UBatch thread’s `UBatchContext`. The callback will be run when the other thread calls `dbo_maybe_run_recv_hook`. This is typically used to “wait” on an all-to-all kernel +The `dbo_register_recv_hook` method registers a callback that can be returned by the `FusedMoEPrepareAndFinalize` class in the other UBatch thread’s `UBatchContext`. 
The callback will be run when the other thread calls `dbo_maybe_run_recv_hook`. This is typically used to “wait” on an all-to-all kernel -`dbo_maybe_run_recv_hook` method runs a callback that’s set by the `dbo_register_recv_hook` function if that callback exists. +The `dbo_maybe_run_recv_hook` method runs a callback that’s set by the `dbo_register_recv_hook` function if that callback exists. -`dbo_yield` method puts the current thread to sleep and wakes up the other UBatch thread +The `dbo_yield` method puts the current thread to sleep and wakes up the other UBatch thread From bdf01af0fbb34248874fa2ee30e6ac7c2ab9f9a9 Mon Sep 17 00:00:00 2001 From: Sage Moore Date: Mon, 6 Oct 2025 14:47:25 -0700 Subject: [PATCH 17/17] refactoring Signed-off-by: Sage Moore --- docs/design/dbo.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/design/dbo.md b/docs/design/dbo.md index 7658362ce53e..d92c47c80f95 100644 --- a/docs/design/dbo.md +++ b/docs/design/dbo.md @@ -15,7 +15,7 @@ Below is the overlap schedule that is currently implemented in vLLM. ```python # Schedule notation legend: # S = Shared expert -# A0 = MLA qkv pro, +# A0 = MLA qkv proj, # A1 = Core attn + out proj + MoE gate # D = Dispatch # C = Combine @@ -31,7 +31,7 @@ Below is the overlap schedule that is currently implemented in vLLM. To enable the DBO system pass in the `--enable-dbo` argument to your vllm serve command. This must be run in conjunction with `--data-parallel-size N` where N is greater than 1 and `--enable-expert-parallel`. Additionally, there are two configuration knobs. -* `--dbo-decode-token-threshold` the minimum number of tokens in a decode-only batch required to enable DBO for that batch. 
+* `--dbo-decode-token-threshold` the minimum number of tokens in a decode-only batch required to enable DBO for that batch * `--dbo-prefill-token-threshold` the minimum number of tokens in a batch containing at least one prefill required to enable DBO for that batch Currently, DBO is only supported with DeepEP, so DeepEP must be installed and the `VLLM_ALL2ALL_BACKEND` environment variable must be set to `deepep_low_latency` if your workload is primarily decode requests, or `deepep_high_throughput` if your workload is primarily prefill requests. @@ -57,7 +57,7 @@ gpu_ubatch_wrapper The `UBatchWrapper` class is a model wrapper that's responsible for all of the thread, UBatchContext, and CUDA graph management for DBO. It's designed to be relatively transparent to the GPU Model Runner. -The implementation runs the model twice, once for each microbatch. Each invocation of the model will happen inside of a `UBatch` thread. These threads are launched in parallel and are synchronized using the `UBatchContext`. Each thread is given a “sliced” version of the attention metadata that they will use to run their half of the batch. +The implementation runs the model twice, once for each microbatch. Each model invocation occurs within a UBatch thread. These threads are launched in parallel and are synchronized using the `UBatchContext`. Each thread is provided with a sliced version of the attention metadata that is used to run its half of the batch. CUDA graphs for DBO are entirely managed by the `UBatchWrapper`. Because of this, DBO only supports running with Full CUDA graphs. However, once a DBO CUDA graph has been captured, it can be replayed without any multithreading or CPU synchronization. @@ -73,16 +73,16 @@ ubatch_context The `UBatchContext` class is a `ForwardContext` wrapper class that is used by the `UBatchWrapper` class to synchronize the two UBatch threads. It should only be instantiated by using `make_ubatch_contexts`. 
-When one of the `UBatch` threads reaches a `dbo_yield` call, it pauses, and starts the other thread which will run until it reaches the same `dbo_yield` call. This "ping-pong" dynamic continues, with threads swapping at each `dbo_yield call`, until the model's execution is complete.
+When one of the UBatch threads reaches a `dbo_yield` call, it pauses and starts the other thread, which runs until it reaches the same `dbo_yield` call. This "ping-pong" dynamic continues, with the threads swapping at each `dbo_yield` call, until the model's execution is complete.
 
 The current implementation has all `dbo_yield` and `dbo_maybe_run_recv_hook` calls in the `FusedMoEModularKernel.forward` method.
 
 #### Interfaces
 
-The `make_ubatch_context` function initializes two `UBatchContexts`, one for each UBatch thread. It takes two CUDA streams, the preexisting `ForwardContexts` and a cpu thread barrier. This function should be used exclusively to instantiate `UBatchContexts`. It will handle all of the event initialization.
+The `make_ubatch_contexts` function initializes two `UBatchContexts`, one for each UBatch thread. It takes two CUDA streams, the preexisting `ForwardContexts`, and a CPU thread barrier. This function should be used exclusively to instantiate `UBatchContexts`; it handles all of the event initialization.
 
-The `dbo_register_recv_hook` method registers a callback that can be returned by the `FusedMoEPrepareAndFinalize` class in the other UBatch thread’s `UBatchContext`. The callback will be run when the other thread calls `dbo_maybe_run_recv_hook`. This is typically used to “wait” on an all-to-all kernel
+The `dbo_register_recv_hook` method registers a callback that can be returned by the `FusedMoEPrepareAndFinalize` class in the other UBatch thread’s `UBatchContext`. The callback is run when the other thread calls `dbo_maybe_run_recv_hook`. This is typically used to wait on an all-to-all kernel.
The `dbo_maybe_run_recv_hook` method runs a callback that’s set by the `dbo_register_recv_hook` function if that callback exists. -The `dbo_yield` method puts the current thread to sleep and wakes up the other UBatch thread +The `dbo_yield` method puts the current thread to sleep and wakes up the other UBatch thread.
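
---

The "ping-pong" behavior that the doc describes for `dbo_yield` can be illustrated with a toy, CPU-only sketch. This is not vLLM's implementation — the real `UBatchContext` also coordinates CUDA streams and events, and the names below (`run_microbatch`, the `turn` events) are hypothetical — but it shows the core idea: two ubatch threads alternate at fixed yield points so that only one is "computing" at a time.

```python
import threading

trace = []  # records the order in which the two ubatch threads run their steps


def run_microbatch(name: str, steps: list[str], my_turn: threading.Event,
                   peer_turn: threading.Event) -> None:
    """Run one microbatch's steps, yielding to the peer thread after each one.

    The my_turn/peer_turn pair is a toy analogue of the events that a real
    UBatchContext would manage.
    """
    for step in steps:
        my_turn.wait()                 # parked: the other ubatch is computing
        my_turn.clear()
        trace.append(f"{name}:{step}")  # "compute" for this step
        peer_turn.set()                # dbo_yield: wake the other ubatch thread


turn0, turn1 = threading.Event(), threading.Event()
steps = ["dispatch", "mlp", "combine"]
t0 = threading.Thread(target=run_microbatch, args=("ubatch0", steps, turn0, turn1))
t1 = threading.Thread(target=run_microbatch, args=("ubatch1", steps, turn1, turn0))
t0.start()
t1.start()
turn0.set()                            # hand the first turn to ubatch0
t0.join()
t1.join()
print(trace)
# ['ubatch0:dispatch', 'ubatch1:dispatch', 'ubatch0:mlp',
#  'ubatch1:mlp', 'ubatch0:combine', 'ubatch1:combine']
```

Because each thread only runs after the other has yielded, the interleaving is strictly alternating — the same property DBO relies on to keep one microbatch's compute overlapped with the other's communication.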
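
The two-step microbatching decision described in the GPU Model Runner section — all DP ranks must agree, token counts are padded to the max across ranks, and microbatching is aborted if any rank's second microbatch would be empty — can be sketched as follows. This is a simplified, hypothetical reconstruction of that logic, not vLLM's actual code: `decide_microbatch_split` and its arguments are illustrative, and real ranks coordinate via collective communication rather than a local list.

```python
def decide_microbatch_split(tokens_per_rank: list[int], token_threshold: int):
    """Hypothetical sketch of the cross-DP-rank microbatching decision.

    Returns a (first, second) microbatch size per rank, or None if
    microbatching is disabled for this step.
    """
    # Step 1: microbatching must be uniform -- if any rank does not meet
    # the token threshold, no rank microbatches.
    if any(n < token_threshold for n in tokens_per_rank):
        return None

    # Step 2: pad every rank's token count up to the max across ranks,
    # then split the padded batch in half.
    padded = max(tokens_per_rank)
    first_half = padded // 2

    # Abort if any rank's real tokens would all land in the first
    # microbatch, leaving its second microbatch with nothing but padding.
    if any(n <= first_half for n in tokens_per_rank):
        return None

    return [(first_half, padded - first_half) for _ in tokens_per_rank]


# Both ranks qualify: each runs two microbatches of the padded size.
print(decide_microbatch_split([30, 32], token_threshold=16))  # [(16, 16), (16, 16)]

# One rank is under the threshold, so no rank microbatches.
print(decide_microbatch_split([10, 32], token_threshold=16))  # None

# Rank 0's 16 real tokens would all fall in the first microbatch.
print(decide_microbatch_split([16, 32], token_threshold=16))  # None
```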