[pipelining] Add stage backward function #124958

kwen2501 · 2024-04-25T19:01:12Z

This is a helper function which:

1. computes the gradients for the stage inputs, and
2. accumulates gradients for the stage module's parameters.

A unit test for this function is also added.
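A minimal sketch of what such a stage backward helper might do, assuming PyTorch and assuming the stage inputs are leaf tensors with `requires_grad=True` (as when a received activation is detached before the stage's forward). The name `stage_backward_sketch` and its parameters are hypothetical, not this PR's API. It uses `torch.autograd.backward`, which writes the input gradients and accumulates into the parameters' `.grad` in a single call.

```python
import torch
import torch.nn as nn


def stage_backward_sketch(stage_outputs, output_grads, input_values):
    """Hypothetical stage backward helper (not the PR's actual signature).

    stage_outputs: tensors produced by this stage's forward pass.
    output_grads:  gradients w.r.t. those outputs, received from the next
                   stage (pass None for a last stage that outputs a scalar loss).
    input_values:  values this stage received as inputs.
    Returns one entry per input: its gradient, or None.
    """
    # torch.autograd.backward accumulates into `.grad` of every leaf tensor
    # that requires grad: the stage module's parameters and the stage inputs.
    torch.autograd.backward(stage_outputs, grad_tensors=output_grads)

    grad_inputs = []
    for val in input_values:
        if isinstance(val, torch.Tensor) and val.requires_grad:
            grad_inputs.append(val.grad)
        else:
            # Non-tensor inputs and tensors that do not require grad get None.
            grad_inputs.append(None)
    return grad_inputs


# Usage: a single Linear layer stands in for a pipeline stage.
torch.manual_seed(0)
stage = nn.Linear(4, 2)
# Activation received from the previous stage: a leaf tensor that requires grad.
x = torch.randn(3, 4, requires_grad=True)
out = stage(x)
# Pretend the next stage sent back this gradient for `out`.
grad_out = torch.ones_like(out)
# The second input (a plain float) has no gradient, so it maps to None.
grads = stage_backward_sketch([out], [grad_out], [x, 1.0])
```

The returned input gradients would be sent to the previous stage, while the parameter gradients stay in `stage.weight.grad` and `stage.bias.grad` for the optimizer.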

Stack from ghstack (oldest at bottom):

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

[ghstack-poisoned]

pytorch-bot · 2024-04-25T19:01:15Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124958

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 218ac4c with merge base c82fcb7 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


wconstab · 2024-05-01T03:53:52Z

torch/distributed/pipelining/_backward.py

+
+# TODO: handling requires_grad=False dynamically. Can we analyze this during initial
+# IR emission?
+def _null_coalesce_accumulate(lhs, rhs):


Where will this be used?

Yeah, on second thought, I think it is okay to remove it.
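The diff excerpt shows only the signature of `_null_coalesce_accumulate`, and the thread agrees to drop it. For readers curious what a null-coalescing accumulate would do, here is a guess based on the name alone: treat `None` (for example, a gradient that was never produced because a tensor had `requires_grad=False`) as "no contribution", and otherwise add. The function below is hypothetical and is named `null_coalesce_accumulate_sketch` to make clear it is not the PR's code.

```python
from functools import reduce
from typing import Optional, TypeVar

T = TypeVar("T")


def null_coalesce_accumulate_sketch(lhs: Optional[T], rhs: Optional[T]) -> Optional[T]:
    # Hypothetical behavior: None means "no gradient was produced".
    if lhs is None:
        return rhs
    if rhs is None:
        return lhs
    # Both present: accumulate with `+` (works for numbers and torch.Tensor).
    return lhs + rhs


# Usage: fold several possibly-missing gradient contributions into one.
contributions = [None, 1.5, None, 2.0]
total = reduce(null_coalesce_accumulate_sketch, contributions, None)
```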

wconstab · 2024-05-01T03:54:54Z

torch/distributed/pipelining/_backward.py

+            else:
+                grad_inputs.append(None)
+
+        # Alternative impl: `torch.autograd.grad`.


What are the trade-offs? Is there a reason to pick one over the other?

Well, to answer my own question: we will want to use .grad if we implement zero bubble, as we discussed.

Yeah, we can create another util function to hold the .grad impl. Well, maybe two, because we will need two calls for zero bubble.
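A minimal sketch of the `torch.autograd.grad` alternative, split into two calls in the spirit of the zero-bubble discussion: one call for the input gradients the previous stage needs right away, and a deferred call for the parameter gradients. The helper names are hypothetical, and this split is an assumption about how such utilities could look, not the implementation planned in this thread. Unlike `torch.autograd.backward`, `torch.autograd.grad` returns gradients instead of writing `.grad`, so the weight step accumulates them by hand.

```python
import torch
import torch.nn as nn


def backward_input_sketch(stage_outputs, output_grads, inputs):
    # First call: only the gradients w.r.t. the stage inputs, which the
    # previous stage is waiting for. retain_graph=True keeps the graph alive
    # so the weight gradients can still be computed later.
    return torch.autograd.grad(
        stage_outputs, inputs, grad_outputs=output_grads, retain_graph=True
    )


def backward_weight_sketch(stage_outputs, output_grads, params):
    # Second, deferred call: the gradients w.r.t. the parameters. This is the
    # last use of the graph, so it is freed afterwards.
    grads = torch.autograd.grad(stage_outputs, params, grad_outputs=output_grads)
    # torch.autograd.grad does not touch `.grad`, so accumulate by hand.
    for p, g in zip(params, grads):
        p.grad = g if p.grad is None else p.grad + g


# Usage: a single Linear layer stands in for a pipeline stage.
torch.manual_seed(0)
stage = nn.Linear(4, 2)
params = list(stage.parameters())
x = torch.randn(3, 4, requires_grad=True)
out = stage(x)
grad_out = torch.ones_like(out)

(grad_x,) = backward_input_sketch([out], [grad_out], [x])
# grad_x could be sent upstream here, and other work scheduled in between.
backward_weight_sketch([out], [grad_out], params)
```

The design point is that only the first call sits on the critical path between stages; the second can be scheduled later to fill what would otherwise be idle time.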

wconstab · 2024-05-01T03:56:22Z

docs/source/distributed.pipelining.rst

+.. role:: hidden
+    :class: hidden-section
+
+Pipeline Parallelism


After adding this, is there a plan to dedup with the README added in the first PR? It seems like we wouldn't need that anymore, but I'm not sure whether the content is 100% the same.

Good point. Nice to reduce maintenance load.

kwen2501 · 2024-05-01T04:52:26Z

@pytorchbot merge

pytorchmergebot · 2024-05-01T04:55:27Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


This is a helper function which: 1. computes the gradients for the stage inputs, and 2. accumulates gradients for the stage module's parameters. A unit test for this function is also added. Pull Request resolved: pytorch#124958 Approved by: https://github.com/wconstab ghstack dependencies: pytorch#124776, pytorch#124875

[pipelining] Add stage backward function

39d945d

[ghstack-poisoned]

kwen2501 mentioned this pull request Apr 25, 2024

[pipelining] Add util and debug facilities #124875

Closed

pytorch-bot bot added the ci-td-distributed and oncall: distributed labels Apr 25, 2024

kwen2501 requested review from H-Huang and wconstab April 25, 2024 19:01

kwen2501 added the topic: not user facing label Apr 25, 2024

kwen2501 added a commit that referenced this pull request Apr 25, 2024

[pipelining] Add stage backward function

6e4f750

ghstack-source-id: fc32dda Pull Request resolved: #124958

kwen2501 added a commit that referenced this pull request Apr 30, 2024

[pipelining] Add stage backward function

db18728

Add document: distributed.pipelining.rst ghstack-source-id: d79b483 Pull Request resolved: #124958

kwen2501 added a commit that referenced this pull request Apr 30, 2024

[pipelining] Add stage backward function

9e1832b

Add document: distributed.pipelining.rst Move some modules to private ghstack-source-id: fc59a93 Pull Request resolved: #124958

kwen2501 added a commit that referenced this pull request Apr 30, 2024

[pipelining] Add stage backward function

bc7b89c

Add document: distributed.pipelining.rst ghstack-source-id: 7ee3b0e Pull Request resolved: #124958

kwen2501 mentioned this pull request Apr 30, 2024

[pipelining] Add microbatch split and merge utils #125273

Closed

wconstab approved these changes May 1, 2024


pytorch-bot bot added the ciflow/trunk label May 1, 2024

pytorchmergebot added the merging label May 1, 2024

pytorchmergebot added the Merged label May 1, 2024

pytorchmergebot closed this in 5214219 May 1, 2024

pytorchmergebot removed the merging label May 1, 2024

pytorchmergebot pushed a commit that referenced this pull request May 2, 2024

[pipelining] Add microbatch split and merge utils (#125273)

0199ce8

Pull Request resolved: #125273 Approved by: https://github.com/H-Huang ghstack dependencies: #124776, #124875, #124958

github-actions bot deleted the gh/kwen2501/17/head branch June 4, 2024 02:00
