[FSDP] Fix `use_orig_params=True` + AC #87413

Conversation
Can we have a unittest for this change?
Without this change, the post-backward hooks do not run when using reentrant activation checkpointing. This is what happens when we play with `.data` fire. The reason the hooks do not run is exactly why we normally need plain `Tensor` views in the forward pass without AC: the original parameters must be connected in autograd to the `FlatParameter` via `split()` and `view()` so that gradients propagate through to the `FlatParameter`'s gradient. Reentrant AC runs the 1st forward pass with [`no_grad()`](https://github.com/pytorch/pytorch/blob/7e83f65ad502992a8d75c91eea2cf3de69bb0b7a/torch/utils/checkpoint.py#L106) and relies on the 2nd, recomputed forward pass to be visible to autograd. Therefore, we need to use `Tensor` views in the backward pass as well, not `nn.Parameter` views.
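To make the connectivity concrete, here is a minimal standalone sketch (not FSDP's actual code; `flat_param`, `weight`, and `bias` are illustrative names) showing how gradients computed through plain `Tensor` views accumulate on the flat parameter:

```python
import torch

# Standalone sketch: a "flat parameter" whose plain-Tensor views are used in
# the forward so that autograd routes gradients back to the flat parameter.
flat_param = torch.nn.Parameter(torch.randn(6))

# split() + view() return plain Tensors that stay connected to flat_param
# in the autograd graph -- the connectivity described above.
weight, bias = flat_param.split([4, 2])
weight = weight.view(2, 2)
bias = bias.view(2)

x = torch.randn(3, 2)
loss = (x @ weight + bias).sum()
loss.backward()

# The views are non-leaf tensors and hold no .grad of their own;
# the gradient accumulates on flat_param.
print(flat_param.grad.shape)  # torch.Size([6])
```

If the views were detached from the graph (e.g., by going through `.data`), `flat_param.grad` would stay `None`, which is the failure mode described above.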
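The `no_grad()` first pass is easy to observe directly. A minimal sketch, assuming a PyTorch version where `use_reentrant` is an explicit argument to `checkpoint`:

```python
import torch
from torch.utils.checkpoint import checkpoint

lin = torch.nn.Linear(2, 2)

def fn(x):
    # Under reentrant checkpointing this prints False in the 1st forward
    # (run under no_grad) and True in the recomputed forward during backward.
    print("grad enabled:", torch.is_grad_enabled())
    return lin(x)

x = torch.randn(1, 2, requires_grad=True)
out = checkpoint(fn, x, use_reentrant=True)
out.sum().backward()  # triggers the recomputation; autograd tracks it now
```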
Thanks for the fix and excellent debugging!
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Without this change, the post-backward hooks do not run when using reentrant activation checkpointing.

**Explanation**

FSDP registers the original parameters as plain `Tensor`s in the forward pass so that their ops are tracked by autograd to ensure proper gradient propagation into the `FlatParameter`s. FSDP registers the post-backward hooks in its pre-forward.

For `use_orig_params=True`, FSDP replaces the plain `Tensor`s with the sharded `nn.Parameter`s in the post-forward when resharding. This differs from `use_orig_params=False`, which keeps the plain `Tensor`s registered as attributes, except that their data are freed, meaning that accessing them between forward and backward errors. Before this PR, for `use_orig_params=True`, FSDP simply restored the unsharded original parameter data in the pre-backward to enable correct gradient computation. However, this does not suffice for reentrant activation checkpointing (AC), where the recomputed forward happens after FSDP's pre-backward and the ops in the recomputed forward must be tracked by autograd.

My initial solution was to simply have FSDP restore the original parameters as plain `Tensor`s again in the pre-backward so that they would be tracked by autograd exactly like in the normal forward. However, this does not suffice in general: the `FlatParameter`'s `AccumulateGrad` object may change after the original pre-forward when performing a recomputed forward.

The new approach in this PR follows the `use_orig_params=False` way -- namely, it preserves the plain `Tensor` variables across forward and backward. I achieved this by saving the variables explicitly in the forward and restoring them in the pre-backward. I clear them in the post-backward to avoid dangling references (though I do not think this is strictly necessary).

An alternative approach I considered is using forward hooks. However, this does not change the order of operations across FSDP, checkpoint, and the wrapped module, so it does not work: as long as the order is FSDP(checkpoint(module)), registered hooks still run either before or after the checkpoint recomputation -- we cannot insert logic to run inside the recomputation.

**Test Plan**

I augmented the existing reentrant checkpointing unit tests to also test `use_orig_params=True`. I also verified that the pycls model does not error (even with the new approach).

Pull Request resolved: #87413
Approved by: https://github.com/rohan-varma
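To illustrate the ordering constraint behind the rejected forward-hook alternative, here is a standalone stand-in for the FSDP(checkpoint(module)) structure (the `Outer` class is hypothetical, not FSDP itself):

```python
import torch
from torch.utils.checkpoint import checkpoint

events = []

class Outer(torch.nn.Module):
    """Stand-in for the FSDP wrapper around a checkpointed submodule."""
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(2, 2)

    def forward(self, x):
        events.append("outer pre-forward")   # where FSDP unshards / registers hooks
        out = checkpoint(self.lin, x, use_reentrant=True)
        events.append("outer post-forward")  # where FSDP reshards
        return out

m = Outer()
m(torch.randn(1, 2, requires_grad=True)).sum().backward()

# The recomputed forward ran inside backward(), strictly after both wrapper
# events: the wrapper can act before (pre-backward) or after (post-backward)
# the recomputation, but never within it.
print(events)  # ['outer pre-forward', 'outer post-forward']
```

This is the constraint that motivates saving the plain `Tensor` variables in the forward and restoring them in the pre-backward, rather than trying to hook into the recomputation itself.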
Stack from ghstack:

- #87413 [FSDP] Fix `use_orig_params=True` + AC
- #87308 [FSDP][2/N] Fix grad zero vs. `None` edge case
- #87314 [FSDP][1/N] Update `summon_full_params(with_grads)` `None` gradient