[FSDP] Relax post-backward assert #89791

Closed · wants to merge 4 commits

Conversation

awgu (Contributor) commented Nov 28, 2022

Stack from ghstack (oldest at bottom):

This assert was accidentally made stricter when transitioning from per-FSDP-instance training state to per-handle training state. This PR relaxes it again, which should restore compatibility for some reentrant AC plus FSDP cases.
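For context, here is a minimal sketch of the relaxed behavior. The names (`HandleTrainingState`, `handle.training_state`, `post_backward_hook`) are illustrative stand-ins, not the actual FSDP internals: the idea is that a repeat invocation of the per-handle post-backward hook within the same backward pass becomes a no-op instead of tripping an assert.

```python
# Hypothetical sketch; enum and attribute names are illustrative, not the
# real FSDP implementation.
from enum import Enum, auto

class HandleTrainingState(Enum):
    BACKWARD_PRE = auto()
    BACKWARD_POST = auto()

def post_backward_hook(handle, *unused):
    # Before this PR (in spirit): assert the handle is still in BACKWARD_PRE.
    # With reentrant activation checkpointing the hook can fire a second time
    # for the same handle, so a repeated call is now treated as a no-op.
    if handle.training_state == HandleTrainingState.BACKWARD_POST:
        return
    handle.training_state = HandleTrainingState.BACKWARD_POST
    # ... reduce-scatter the gradient, free unsharded parameters, etc. ...
```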

pytorch-bot bot commented Nov 28, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/89791

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a510d8d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the "release notes: distributed (fsdp)" label Nov 28, 2022
awgu added a commit that referenced this pull request Nov 28, 2022
ghstack-source-id: ee0fd21a02c0fc671ca900e2b57083a1ef60edd2
Pull Request resolved: #89791
@@ -482,9 +482,13 @@ def _post_backward_hook(
"FullyShardedDataParallel._post_backward_hook"
):
_assert_in_training_states(state, [TrainingState.FORWARD_BACKWARD])
# For reentrant AC, the post-backward hook may run multiple times in
Review comment (Contributor) on this line:
nit: For reentrant AC multiple times
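For background on why a nested GraphTask can invoke this hook at all: FSDP-style code hangs its post-backward hook off each parameter's `AccumulateGrad` node, so anything that flushes pending `AccumulateGrad` executions also runs the hook. A minimal, self-contained sketch of that registration pattern (illustrative, not FSDP's exact code):

```python
import torch

# A plain parameter standing in for FSDP's flat parameter.
param = torch.nn.Parameter(torch.randn(4, 4))

def post_backward_hook(*unused):
    # FSDP would reduce-scatter gradients here; we just log the call.
    print("post-backward hook fired")

# expand_as creates a graph node whose next function is the parameter's
# AccumulateGrad node; registering a hook on that node is the classic way
# to run code right after the parameter's gradient is accumulated.
acc_grad = param.expand_as(param).grad_fn.next_functions[0][0]
acc_grad.register_hook(post_backward_hook)

(param * 2).sum().backward()  # the hook fires once the gradient is accumulated
```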

awgu added a commit that referenced this pull request Nov 28, 2022
ghstack-source-id: 90b4cb6e62e1dad7afccb625996cc797692c1db2
Pull Request resolved: #89791
awgu added a commit that referenced this pull request Nov 28, 2022
ghstack-source-id: 3d8a27dc946209d413b1dd5e6ed0d6815bb71721
Pull Request resolved: #89791
awgu added the "topic: improvements" label Nov 28, 2022
awgu (Contributor, Author) commented Nov 28, 2022

@pytorchbot rebase -s

pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a rebase job. Check the current status here

pytorchmergebot (Collaborator) commented:

Successfully rebased gh/awgu/218/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/89791)

pytorchmergebot pushed a commit that referenced this pull request Nov 28, 2022
ghstack-source-id: 268645d4e371a830a9857b981311dbfd6455d05a
Pull Request resolved: #89791
awgu added the "ciflow/trunk" label (trigger trunk jobs on your pull request) Nov 28, 2022
awgu (Contributor, Author) commented Nov 29, 2022

@pytorchbot merge

pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced debugging: check the merge workflow status here.

mrshenli added a commit that referenced this pull request Nov 29, 2022
When combining FSDP with reentrant checkpointing, the post-backward
hook might run twice and then hit [this
error](https://github.com/pytorch/pytorch/blob/e20ec44544c17d6d3d411f88b870e05043bda731/torch/distributed/fsdp/_runtime_utils.py#L487).
This is because reentrant backward uses nested autograd GraphTasks.
The inner GraphTask is not aware of the outer one and therefore
flushes pending `AccumulateGrad` invocations on exit, which in
turn triggers the post-backward hooks registered by FSDP. Later,
the outer GraphTask triggers them again, leading to the above
error.

PR #89791 relaxes the FSDP training state check, but we still run
into gradient value check failures occasionally. Therefore, this PR
only lands the non-reentrant test; the reentrant test can be enabled
once the accuracy issues are addressed.

[ghstack-poisoned]
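As a hedged illustration of the scenario described in the commit message above (an assumed setup, not the test from #89781): an FSDP-wrapped module whose forward uses reentrant activation checkpointing, so the recomputation's backward runs in a nested GraphTask. It assumes a process group has already been initialized (for example via `torch.distributed.init_process_group("nccl")`) and that a CUDA device is available.

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(8, 8)

    def forward(self, x):
        # use_reentrant=True recomputes the forward inside a nested backward
        # (GraphTask), which is what can flush AccumulateGrad, and with it
        # FSDP's post-backward hook, before the outer GraphTask finishes.
        return checkpoint(self.lin, x, use_reentrant=True)

# Assumes torch.distributed is already initialized.
model = FSDP(CheckpointedBlock(), device_id=torch.cuda.current_device())
x = torch.randn(2, 8, device="cuda", requires_grad=True)
model(x).sum().backward()  # the post-backward hook may run more than once here
```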
mrshenli added a commit that referenced this pull request Nov 29, 2022
ghstack-source-id: 8848c4cbf572c3a5acd8a9c2fd2b22539a65375f
Pull Request resolved: #89781
pytorchmergebot (Collaborator) commented:
Merge failed

Reason: 1 additional job has failed; the first few of them are: trunk

Details for Dev Infra team: raised by workflow job.

pytorchmergebot pushed a commit that referenced this pull request Nov 29, 2022
Pull Request resolved: #89781
Approved by: https://github.com/rohan-varma
awgu (Contributor, Author) commented Nov 29, 2022

@pytorchbot merge

pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced debugging: check the merge workflow status here.

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Pull Request resolved: pytorch#89781
Approved by: https://github.com/rohan-varma
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Pull Request resolved: pytorch#89791
Approved by: https://github.com/zhaojuanmao
facebook-github-bot deleted the gh/awgu/218/head branch June 8, 2023 15:28
Labels: ciflow/trunk · Merged · release notes: distributed (fsdp) · topic: improvements