[pipelining] Add grad test for interleaved schedules #126931
Conversation
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126931
Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures
As of commit 763dd43 with merge base c46b38b.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 183286926363630293f1b6b3f6655400f50f0538 Pull Request resolved: #126931
```
Traceback (most recent call last):
  File "/data/users/kw2501/pytorch/torch/testing/_internal/common_utils.py", line 2756, in wrapper
    method(*args, **kwargs)
  File "/data/users/kw2501/pytorch/torch/testing/_internal/common_utils.py", line 443, in instantiated_test
    test(self, **param_kwargs)
  File "/data/users/kw2501/pytorch/test/distributed/pipelining/test_schedule.py", line 316, in test_grad_with_manual_interleaved
    out = schedule.step(target=target, losses=losses)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/kw2501/pytorch/torch/distributed/pipelining/PipelineSchedule.py", line 578, in step
    self._step_microbatches(args_split, kwargs_split, targets_split, losses)
  File "/data/users/kw2501/pytorch/torch/distributed/pipelining/PipelineSchedule.py", line 820, in _step_microbatches
    ops.extend(bwd_stage.get_bwd_send_ops())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/kw2501/pytorch/torch/distributed/pipelining/PipelineStage.py", line 339, in get_bwd_send_ops
    raise RuntimeError(
RuntimeError: [1] for chunk 0 has gradients None and is expecting to send gradients to stage 0

To execute this test, run the following from the base repo dir:
    python test/distributed/pipelining/test_schedule.py -k ScheduleTest.test_grad_with_manual_interleaved_ScheduleClass0
```

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

[ghstack-poisoned]
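For context, the failure above comes from a guard in `get_bwd_send_ops` that refuses to schedule a gradient send when backward never produced the gradient. A simplified sketch of that kind of check (hypothetical names, not the actual `PipelineStage` implementation):

```
# Simplified sketch of the guard that raises the error above; the names
# (stage_index, chunk_id, grads, dst_stage) are hypothetical, not the real API.
def get_bwd_send_ops_sketch(stage_index, chunk_id, grads, dst_stage):
    ops = []
    for grad in grads:
        if grad is None:
            # Backward never populated this gradient, so there is nothing to send.
            raise RuntimeError(
                f"[{stage_index}] for chunk {chunk_id} has gradients None "
                f"and is expecting to send gradients to stage {dst_stage}"
            )
        ops.append(("isend", grad, dst_stage))  # stand-in for a P2P send op
    return ops
```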
ghstack-source-id: 2dbaf5e4a27442563288663de2dd21310d83623f Pull Request resolved: #126931
```
with torch.no_grad():
    y = ref_mod(x)
    # Add a small perturbation
    target = y + torch.randn(batch_size, d_hid, device=self.device)
```
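For readers following along, here is a minimal single-process sketch of the grad-test pattern in this snippet. In the actual test the second backward goes through the pipeline schedule via `schedule.step(target=target, losses=losses)`; the `MSELoss`, `batch_size`, and `d_hid` values here are assumptions for illustration.

```
import copy
import torch

def grad_check(mod, batch_size=4, d_hid=16, device="cpu"):
    # Keep an identical reference copy to compare gradients against.
    ref_mod = copy.deepcopy(mod)
    x = torch.randn(batch_size, d_hid, device=device)
    with torch.no_grad():
        y = ref_mod(x)
        # Add a small perturbation so the loss (and hence the gradients) are nonzero.
        target = y + torch.randn(batch_size, d_hid, device=device)

    loss_fn = torch.nn.MSELoss(reduction="sum")

    # Reference backward pass.
    loss_fn(ref_mod(x), target).backward()
    # Backward pass through the module under test (in the PR, this step is
    # driven by the pipeline schedule instead of a plain forward/backward).
    loss_fn(mod(x), target).backward()

    # Gradients of the pipelined run should match the reference run.
    for p, ref_p in zip(mod.parameters(), ref_mod.parameters()):
        torch.testing.assert_close(p.grad, ref_p.grad)
```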
nice way of doing the grad test.
except, how did you determine the perturbation is 'small'? (is the default of randn small relative to the norm of the model's output?)
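One way to answer this (purely illustrative, not part of the PR; the `Linear` module below stands in for `MultiMLP`) is to compare the perturbation's norm against the reference output's norm:

```
import torch

batch_size, d_hid = 4, 16                 # mirrors the sizes used in the test
ref_mod = torch.nn.Linear(d_hid, d_hid)   # stand-in for MultiMLP
x = torch.randn(batch_size, d_hid)
with torch.no_grad():
    y = ref_mod(x)
    perturbation = torch.randn(batch_size, d_hid)
# If this ratio is well below 1, the perturbation is small relative to the output.
print((perturbation.norm() / y.norm()).item())
```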
nice! thanks for putting this in.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed
Reason: 1 mandatory check(s) failed. The first few are:
Dig deeper by viewing the failures on hud
@pytorchbot merge -f "the pull or windows failure does not seem related"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot revert -m "newly added test fails distributed/pipelining/test_schedule.py::ScheduleTest::test_grad_with_manual_interleaved_ScheduleClass0 https://hud.pytorch.org/pytorch/pytorch/commit/abf6d4e6bc1a9a0e08bfc2204560ca7858fa90cd https://github.com/pytorch/pytorch/actions/runs/9214413308/job/25352507591, pull workflow failed on startup on PR, so no distributed tests ran at all" -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit abf6d4e. Reverted #126931 on behalf of https://github.com/clee2000 due to newly added test fails distributed/pipelining/test_schedule.py::ScheduleTest::test_grad_with_manual_interleaved_ScheduleClass0 https://hud.pytorch.org/pytorch/pytorch/commit/abf6d4e6bc1a9a0e08bfc2204560ca7858fa90cd https://github.com/pytorch/pytorch/actions/runs/9214413308/job/25352507591, pull workflow failed on startup on PR, so no distributed tests ran at all ([comment](#126931 (comment)))
@kwen2501 your PR has been successfully reverted.
Added `test_grad_with_manual_interleaved`:
- Model: `MultiMLP`
- Tested schedules: Interleaved1F1B, LoopedBFS
- Two stages per rank
```
Rank 0 stages: [0, 2]
Rank 1 stages: [1, 3]
```

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

[ghstack-poisoned]
Added `test_grad_with_manual_interleaved`:
- Model: `MultiMLP`
- Tested schedules: Interleaved1F1B, LoopedBFS
- Two stages per rank
```
Rank 0 stages: [0, 2]
Rank 1 stages: [1, 3]
```

Pull Request resolved: #126931
Approved by: https://github.com/wconstab
ghstack dependencies: #126812, #126721, #126735, #126927
ghstack-source-id: effa0de1fe6d4d422fdcab0813d98fc1f02e9186
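The stage-to-rank layout in the description follows a simple round-robin placement; a small sketch (not the pipelining API) that reproduces it:

```
def stages_for_rank(rank, num_ranks, stages_per_rank):
    # Round-robin placement: stage s lives on rank s % num_ranks.
    num_stages = num_ranks * stages_per_rank
    return [s for s in range(num_stages) if s % num_ranks == rank]

assert stages_for_rank(0, num_ranks=2, stages_per_rank=2) == [0, 2]
assert stages_for_rank(1, num_ranks=2, stages_per_rank=2) == [1, 3]
```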
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Added to `multigpu` test config, which is run periodically.

Pull Request resolved: #127066
Approved by: https://github.com/H-Huang, https://github.com/wconstab
ghstack dependencies: #127136, #126931
Added `test_grad_with_manual_interleaved`:
- Model: `MultiMLP`
- Tested schedules: Interleaved1F1B, LoopedBFS
- Two stages per rank
```
Rank 0 stages: [0, 2]
Rank 1 stages: [1, 3]
```

Pull Request resolved: pytorch#126931
Approved by: https://github.com/wconstab
ghstack dependencies: pytorch#126812, pytorch#126721, pytorch#126735, pytorch#126927
…#126931)" This reverts commit abf6d4e. Reverted pytorch#126931 on behalf of https://github.com/clee2000 due to newly added test fails distributed/pipelining/test_schedule.py::ScheduleTest::test_grad_with_manual_interleaved_ScheduleClass0 https://hud.pytorch.org/pytorch/pytorch/commit/abf6d4e6bc1a9a0e08bfc2204560ca7858fa90cd https://github.com/pytorch/pytorch/actions/runs/9214413308/job/25352507591, pull workflow failed on startup on PR, so no distributed tests ran at all ([comment](pytorch#126931 (comment)))
Added `test_grad_with_manual_interleaved`:
- Model: `MultiMLP`
- Tested schedules: Interleaved1F1B, LoopedBFS
- Two stages per rank
```
Rank 0 stages: [0, 2]
Rank 1 stages: [1, 3]
```

Pull Request resolved: pytorch#126931
Approved by: https://github.com/wconstab
ghstack dependencies: pytorch#127136
Added to `multigpu` test config, which is run periodically.

Pull Request resolved: pytorch#127066
Approved by: https://github.com/H-Huang, https://github.com/wconstab
ghstack dependencies: pytorch#127136, pytorch#126931
Added `test_grad_with_manual_interleaved`:
- Model: `MultiMLP`
- Tested schedules: Interleaved1F1B, LoopedBFS
- Two stages per rank
```
Rank 0 stages: [0, 2]
Rank 1 stages: [1, 3]
```

Pull Request resolved: #126931
Approved by: https://github.com/wconstab
ghstack dependencies: #127136

(cherry picked from commit c1d2564)
Added to `multigpu` test config, which is run periodically.

Pull Request resolved: #127066
Approved by: https://github.com/H-Huang, https://github.com/wconstab
ghstack dependencies: #127136, #126931

(cherry picked from commit 8bd26ec)
Stack from ghstack (oldest at bottom):

Added `test_grad_with_manual_interleaved`:
- Model: `MultiMLP`

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k