
Conversation

@H-Huang (Member) commented on Sep 18, 2025

Option 2 of #1682

These changes add a custom overlap_callback function that replaces the OVERLAP_F_B action run during schedule execution. Inside the callback we define run_forward() and run_backward(); run_backward() runs on a separate thread so that forward and backward can execute side by side. It looks like this:

[image: forward and backward microbatches running side by side]
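Below is a minimal Python sketch of the callback shape, not the actual implementation in this PR; the overlap_callback signature, the _backward_worker helper, and the stream handling are illustrative:

```python
import threading

import torch


def overlap_callback(run_forward, run_backward):
    """Hypothetical sketch: run forward on the main thread and backward
    on a helper thread so their CPU-side dispatch can interleave."""
    main_stream = torch.cuda.current_stream()

    def _backward_worker():
        # Keep the backward thread on the same CUDA stream as forward.
        torch.cuda.set_stream(main_stream)
        run_backward()

    t = threading.Thread(target=_backward_worker)
    t.start()
    run_forward()
    t.join()
```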

For these changes to work with Expert Parallelism, we also need custom autograd functions that act as the boundary points at which communication happens. We added hooks before and after the expert-parallel dispatch and combine to signal those boundary points, so the figure above becomes:

[image: schedule with hooks around expert-parallel dispatch and combine]

Within each of these red blocks we use a global coordinator: a threading.Barrier(2).wait() ensures the comm and compute from the forward and backward steps are dispatched in lock-step before either thread continues.
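A rough sketch of the coordinator and boundary-hook idea, assuming hypothetical names (HookCoordinator, SyncHook); the real versions in this PR additionally track cycle counts and can enable/disable coordination:

```python
import threading

import torch


class HookCoordinator:
    def __init__(self):
        # One slot for the forward thread, one for the backward thread.
        self._barrier = threading.Barrier(2)
        self._enabled = True

    def barrier(self):
        if self._enabled:
            # Block until both threads arrive, so their compute and comm
            # kernels are dispatched to the GPU in lock-step.
            self._barrier.wait()


_hook_coordinator = HookCoordinator()


class SyncHook(torch.autograd.Function):
    """Identity op marking an EP dispatch/combine boundary; it
    synchronizes the two threads in both forward and backward."""

    @staticmethod
    def forward(ctx, x):
        _hook_coordinator.barrier()
        return x

    @staticmethod
    def backward(ctx, grad):
        _hook_coordinator.barrier()
        return grad
```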

DSv3 16B run command:

TORCH_NCCL_TRACE_BUFFER_SIZE=2000 TORCH_NCCL_DUMP_ON_TIMEOUT=true TORCH_FR_DUMP_TEMP_FILE=./nccl_trace_rank_ NGPU=8  CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh

Trace examples:

[image: profiler trace showing overlapped forward and backward]

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Sep 18, 2025
@H-Huang force-pushed the deepseek-v3-new-methods branch from 3a61b86 to 0f7a7c9 on September 22, 2025
@H-Huang (Member, Author) commented on Sep 22, 2025

Running with:

TORCH_NCCL_TRACE_BUFFER_SIZE=2000 TORCH_NCCL_DUMP_ON_TIMEOUT=true TORCH_FR_DUMP_TEMP_FILE=./nccl_trace_rank_ NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" ./run_train.sh

With CUDA_LAUNCH_BLOCKING:

TORCH_NCCL_TRACE_BUFFER_SIZE=2000 TORCH_NCCL_DUMP_ON_TIMEOUT=true TORCH_FR_DUMP_TEMP_FILE=./nccl_trace_rank_ NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" CUDA_LAUNCH_BLOCKING=1 ./run_train.sh

@H-Huang force-pushed the deepseek-v3-new-methods branch from 0f7a7c9 to 6584aac on September 24, 2025
@H-Huang force-pushed the deepseek-v3-new-methods branch from a6e46c7 to 5810c54 on October 8, 2025
@H-Huang changed the title from "[Option 2 Example] Dont land" to "Enable PP and EP overlap for MoE" on Oct 8, 2025
@H-Huang marked this pull request as ready for review on October 8, 2025
@H-Huang (Member, Author) commented on Oct 8, 2025

Just landed pytorch/pytorch#162016, so once CI picks up the nightly the errors should be fixed

@tianyu-l (Contributor) left a comment

Looks very cool! Left some comments and questions.

Also looking forward to benchmarking results with overlapping enabled vs. disabled. In particular, for the 16B model, we should be able to test it out on 8 GPUs, assuming SAC is composable.


[activation_checkpoint]
mode = "selective" # ["none", "selective", "full"]
mode = "none" # ["none", "selective", "full"]

does it not support SAC?

mscale=0.70,
- use_flex_attn=True,
- attn_mask_type="block_causal",
+ use_flex_attn=False,

Is FlexAttention not supported? It sounds unrelated.

return stages, models


# TODO: is there a better place to put this?

How about putting them into distributed/dual_pipe_v.py?


def run_backward():
    # Set the backward thread to use the same stream as forward
    torch.cuda.set_stream(main_cuda_stream)

similar -- can we change it to neutral calls

def run_backward():
    # Set the backward thread to use the same stream as forward
    torch.cuda.set_stream(main_cuda_stream)
    with record_function(

always enabling this may hurt perf?

if _hook_coordinator._coordination_enabled and hook_name == "D":
    _hook_coordinator._cycle_count += 1
    # print(f"[FORWARD] cycle count: {_hook_coordinator._cycle_count}", "=" * 40)
    if not _hook_coordinator.check_should_continue_coordination():

This check is only called in SyncHook.forward. Is it safe if, for a particular overlap_f_b call, the backward stage has more layers than the forward stage?

backward_mb_index,
)

def run_forward():

For my education:
The run_forward() and run_backward() functions look general and not tied to DualPipe. Do we not have such functions in pytorch pipelining code?

full_backward=True,
last_backward=last_backward,
)
grad_scale_factor = schedule._n_microbatches if schedule.scale_grads else 1

This may not work well with gradient accumulation. See what we did in #1732

_hook_coordinator.disable_coordination()
return x

_hook_coordinator.barrier()


Strictly speaking, the barrier only affects the CPU threads: it forces the compute and the a2a to be dispatched to the GPU at the same time, but from the GPU's perspective it does not guarantee that the compute kernels and the a2a actually overlap in execution.

It may work in cases where there happen to be GPU-CPU syncs in the right places in the MoE layer (e.g. the token-index H2D copy), but I suspect it would fail to overlap once we remove those syncs (the community is working toward more efficient no-sync MoE implementations).

Theoretically we should use CUDA event waits between the compute/comm streams, not a thread wait.
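For illustration, a minimal sketch of event-based ordering between a compute stream and a comm stream; the tensor and stream names are illustrative, and the device-to-device copy stands in for the all-to-all:

```python
import torch

compute_stream = torch.cuda.current_stream()
comm_stream = torch.cuda.Stream()

x = torch.randn(1024, 1024, device="cuda")

# Compute kernel on the compute stream.
y = x @ x

# Record an event so the comm stream can wait on the GPU side,
# without blocking the CPU thread.
compute_done = torch.cuda.Event()
compute_done.record(compute_stream)

with torch.cuda.stream(comm_stream):
    comm_stream.wait_event(compute_done)
    # Stand-in for the all-to-all: a device-to-device copy.
    y_comm = torch.empty_like(y)
    y_comm.copy_(y, non_blocking=True)

# Before the compute stream consumes y_comm, wait on the comm stream.
compute_stream.wait_stream(comm_stream)
out = y_comm.sum()
```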

@H-Huang force-pushed the deepseek-v3-new-methods branch from 9e43a67 to 7cf98e4 on October 15, 2025
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request on Oct 15, 2025:
Fixed one issue with FSDP last reshard not being called.

Rest is mostly refactoring, changing some variables to be class variables so they can be used in pytorch/torchtitan#1721

Pull Request resolved: #165513
Approved by: https://github.com/fegin
]


import fbvscode

This needs to be removed
