
Conversation

kwen2501
Contributor

@kwen2501 kwen2501 commented Apr 23, 2024


pytorch-bot bot commented Apr 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124776

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 866111e with merge base c82fcb7:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the `ci-td-distributed`, `oncall: distributed` (add this issue/PR to distributed oncall triage queue), and `topic: not user facing` (topic category) labels Apr 23, 2024
kwen2501 added a commit that referenced this pull request Apr 23, 2024
ghstack-source-id: d4977d7
Pull Request resolved: #124776
kwen2501 added the `release notes: distributed (pipeline)` (release notes category) label and removed the `topic: not user facing` (topic category) label Apr 23, 2024
@kwen2501 kwen2501 requested review from H-Huang and wconstab April 23, 2024 20:43
@awgu
Collaborator

awgu commented Apr 23, 2024

Random thought: Any chance that torch.distributed.pipeline_parallel is an option? :)

@kwen2501
Contributor Author

kwen2501 commented Apr 23, 2024

Thanks for the suggestion @awgu !
We did consider torch.distributed.pipeline_parallel but unfortunately it seemed a bit long :)
Also, some people may argue that pipelining is not a kind of parallelism in a strict sense.

@wanchaol
Collaborator

> Thanks for the suggestion @awgu ! We did consider torch.distributed.pipeline_parallel but unfortunately it seemed a bit long :) Also, some people may argue that pipelining is not a kind of parallelism in a strict sense.

I was about to say I had a similar suggestion to @awgu's. I feel `pipeline_parallel` aligns well with other parts of our offerings: `fully_sharded_data_parallel`, `tensor_parallel`.

> pipelining is not a kind of parallelism in a strict sense.

Curious why it's not a kind of parallelism?

@kwen2501
Contributor Author

kwen2501 commented Apr 23, 2024

> fully_sharded_data_parallel, tensor_parallel

Where are these names in our package offering?

> Curious why it's not a kind of parallelism?

It is really a subtle difference; here is an answer from Quora's bot:

> Pipelining and parallelism are both techniques used in computer architecture to improve performance, but they operate in different ways.
> Pipelining involves breaking down the execution of instructions into a series of stages, where each stage performs a different part of the instruction. This allows multiple instructions to be processed simultaneously, with each stage working on a different instruction. As a result, the overall throughput of the processor is increased.
> Parallelism, on the other hand, involves executing multiple instructions simultaneously by using multiple processing units. This can be achieved through techniques such as multi-core processors or multi-processor systems. Parallelism allows for true simultaneous execution of instructions, which can significantly improve overall system performance.
> In summary, pipelining focuses on breaking down the execution of individual instructions into smaller stages to improve throughput, while parallelism involves executing multiple instructions at the same time using multiple processing units to improve overall system performance.

In short, pipelining focuses on breaking down a job, while parallelism focuses on having multiple workers do the same job.
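The distinction above can be sketched in a few lines of plain Python (a toy illustration only; `stage_a`/`stage_b` are made-up names, not PyTorch APIs): pipelining splits one job into stages that each item flows through, while parallelism hands whole copies of the job to multiple workers.

```python
from concurrent.futures import ThreadPoolExecutor

def stage_a(x):
    # first half of the "job"
    return x + 1

def stage_b(x):
    # second half of the "job"
    return x * 2

def pipelined(items):
    # Pipelining: each item passes through both stages in order.
    # In real hardware (or pipeline-parallel training), the stages
    # overlap in time across different items/micro-batches.
    return [stage_b(stage_a(x)) for x in items]

def parallel(items, workers=2):
    # Parallelism: multiple identical workers each run the *full* job
    # on their own share of the inputs.
    full_job = lambda x: stage_b(stage_a(x))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(full_job, items))

print(pipelined([1, 2, 3]))  # [4, 6, 8]
print(parallel([1, 2, 3]))   # [4, 6, 8]
```

Both produce the same results; the difference is purely in how the work is decomposed, which is the subtlety being debated here.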

@wconstab
Contributor

I think the name 'pipeline parallel' is sufficiently established that it doesn't matter what quora says, people know what it means and understand it. This is the obvious safe option to me.

Pipelining is arguably also a correct name, but not one the ML community uses as often. It is also shorter and cleaner. I would have been more convinced by this if I weren't already so used to saying PP, but I am.

@kwen2501
Contributor Author

kwen2501 commented Apr 24, 2024

Thanks @wconstab. I agree that "pipeline parallel" is a well-known concept -- it is also what we use in our README. But, as a package name, I think it is too long. "PP" is short, but not descriptive enough.

Ideally, I prefer a package name that's one word, such as "distributed", "compiler", "profiler", "export". It can also be two shortened meaningful words concatenated, such as "autograd". But I think "pipepara" looks weird.

```python
output = schedule.step()
```

Note that since we split our model into three stages, we must run this script with three workers. For this example, we will use `torchrun` to run multiple processes within a single machine for demonstration purposes. We can collect up all of the code blocks above into a file named [example.py](https://github.com/pytorch/PiPPy/tree/main/examples/basic) and then run it with `torchrun` like so:
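The invocation would look something like the following (the exact flags are an assumption based on standard `torchrun` usage; the script name matches the `example.py` mentioned above):

```shell
# Launch 3 worker processes on one machine, one per pipeline stage
torchrun --nproc_per_node=3 example.py
```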
Copy link
Contributor


Are the links to pippy repo going to migrate too eventually, or do we leave the examples there?

Contributor Author

@kwen2501 kwen2501 Apr 24, 2024


The temporary decision is to leave the examples there (not migrated). Eventually, I hope, they will be hosted in some tutorial repo.

@kwen2501
Contributor Author

@pytorchbot merge

pytorch-bot bot added the `ciflow/trunk` (trigger trunk jobs on your pull request) label Apr 24, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Apr 30, 2024
pytorchmergebot pushed a commit that referenced this pull request May 1, 2024
This is a helper function which:
1. computes the gradients for the stage inputs, and
2. accumulates gradients for the stage module's parameters.

A unit test for this function is also added.

Pull Request resolved: #124958
Approved by: https://github.com/wconstab
ghstack dependencies: #124776, #124875
pytorchmergebot pushed a commit that referenced this pull request May 2, 2024
pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
Pull Request resolved: #124875
Approved by: https://github.com/H-Huang
ghstack dependencies: #124776
pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
@github-actions github-actions bot deleted the gh/kwen2501/14/head branch June 2, 2024 02:04

Labels: `ci-td-distributed`, `ciflow/trunk`, `Merged`, `oncall: distributed`, `release notes: distributed (pipeline)`

5 participants