yifuwang (Collaborator) commented on Oct 12, 2024

Stack from ghstack (oldest at bottom):

Parallelization strategy: after each rank copies its shard into its local
p2p buffer, every rank issues independent p2p copy -> shard_consumer
sequences to two streams. In addition to computation/communication
overlapping, the strategy allows for computation/computation overlapping,
greatly reducing quantization inefficiency.

Notation:
- "mv" for the copy to local buffer
- "cp" for p2p copies
- "b" for barriers

Constraints:
- The GPU scheduler may or may not overlap "mv" with the first shard_consumer.
- "cp" from different streams cannot overlap.

Ideal scenario 0 - "mv" overlaps with the first shard_consumer:

stream 0: [ shard_consumer ][ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]

Ideal scenario 1 - "mv" is scheduled before the first shard_consumer:

stream 0:       [ shard_consumer ][ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]

Suboptimal scenario 0 - "mv" is scheduled after the first shard_consumer:

stream 0: [ shard_consumer ]               [ cp ][ shard_consumer ]
stream 1:                   [ mv ][b][ cp ][ shard_consumer ]

Suboptimal scenario 1 - "b" is scheduled after the first shard_consumer:

stream 0:       [ shard_consumer ]         [ cp ][ shard_consumer ]
stream 1: [ mv ]                  [b][ cp ][ shard_consumer ]

We haven't yet figured out a way to ensure "mv" and "b" are either
overlapped with or scheduled before the first shard_consumer. Thus, to
prevent suboptimal scenarios, we are giving up the chance to overlap "mv"
and "b" with the first shard_consumer for now.

This PR improves the scheduling for mm kernels with high SM utilization. The GPU scheduler tends not to overlap local DtoD copies with such kernels, which leads to suboptimal scheduling. The following is an example of pipelining PyTorch's CUTLASS-based, row-wise scaling fp8 kernel:

Before this PR: [trace screenshot]

With this PR: [trace screenshot]

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot bot commented on Oct 12, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137850

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 66fa103 with merge base dae6007:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the "oncall: distributed" label Oct 12, 2024
…consume"

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
…consume"


```
Parallelization strategy: after each rank copies its shard into its local
p2p buffer, every rank issues independent p2p copy -> shard_consumer
sequences to two streams. In addition to computation/communication
overlapping, the strategy allows for computation/computation overlapping,
greatly reducing quantization inefficiency.

Notation:
- "mv" for the copy to local buffer
- "cp" for p2p copies
- "b" for barriers

Constraints:
- The GPU scheduler may or may not overlap "mv" with the first shard_consumer.
- "cp" from different streams cannot overlap.

Ideal scenario 0 - "mv" overlaps with the first shard_consumer:

stream 0: [ shard_consumer ][ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]

Ideal scenario 1 - "mv" is scheduled before the first shard_consumer:

stream 0:       [ shard_consumer ][ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]

Suboptimal scenario - "mv" is scheduled after the first shard_consumer:

stream 0: [ shard_consumer ]               [ cp ][ shard_consumer ]
stream 1:                   [ mv ][b][ cp ][ shard_consumer ]

To prevent the suboptimal scenario, we do the following to maximize the
likelihood that "mv" is either overlapped with or scheduled before the
first shard_consumer:
- Issue "mv" on stream 1 before issuing the first shard_consumer on
stream 0.
- Add a small sleep before the first shard_consumer on stream 0. The
sleep duration is insignificant, but having an extra task in stream 0
will almost guarantee that "mv" on stream 1 gets scheduled first, if it
cannot overlap with the first shard_consumer.
```
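
As a rough illustration of the nudge in the last bullet above (a sketch under assumptions, not the actual implementation): torch.cuda._sleep is an internal PyTorch helper that enqueues a spin kernel for roughly the given number of device cycles; the cycle count and the surrounding names here are illustrative.

```
import torch

def issue_first_step(local_shard, symm_buffer, rank, stream0, stream1,
                     shard_consumer):
    # Issue "mv" on stream 1 before anything is queued on stream 0.
    with torch.cuda.stream(stream1):
        symm_buffer.copy_(local_shard)  # "mv"

    with torch.cuda.stream(stream0):
        # Insignificant extra task: with work already queued on stream 0,
        # the GPU scheduler will almost always start "mv" on stream 1 first
        # if it cannot overlap it with the first shard_consumer.
        torch.cuda._sleep(100)
        shard_consumer(local_shard, rank)
```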

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
yifuwang pushed a commit that referenced this pull request on Oct 12, 2024
yifuwang added the "topic: not user facing" label Oct 12, 2024
yifuwang requested review from Chillee and weifengpy October 12, 2024 22:48
yifuwang (Collaborator, Author) commented:

@pytorchbot merge

pytorch-bot bot added the "ciflow/trunk" label Oct 15, 2024
pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.
