[FSDP][2/N] Fix grad zero vs. `None` edge case #87308
Stack from ghstack:
- #87308 [FSDP][2/N] Fix grad zero vs. `None` edge case (this PR)
- #87314 [FSDP][1/N] Update `summon_full_params(with_grads)` `None` gradient
Some original parameters corresponding to one `FlatParameter` may have a `None` gradient while others do not. In that case, `flat_param.grad` must be non-`None`. However, FSDP should take care to expose the original parameters' gradients regardless. To achieve this, we track a `_is_grad_none` mask over the parameters' gradients (a minimal sketch of this bookkeeping follows this description):

- `_is_grad_none` is initialized to `False` for all parameters.
- `_is_grad_none[i]` is set to `True` when writing zeros in place of `None` while writing back the `i`th gradient.
- `_is_grad_none[i]` is set to `False` via `_reset_is_grad_none()`, which should be called in the post-backward. See the docstring for details.
- `_is_grad_none[i]` must be `False` in order to set `param.grad` to be a view into `flat_param.grad`.

This PR additionally changes `summon_full_params(with_grads=True)`'s behavior so that if all ranks have `flat_param.grad = None`, then the original parameters correctly have `orig_param.grad = None`. This is achieved with a preliminary all-reduce (also sketched below). Note that if a particular original parameter's gradient is `None` on all of its containing ranks, but not all ranks have `flat_param.grad = None`, then that gradient is still set to zeros. This can be handled in follow-up work if desired.
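To make the bookkeeping concrete, here is a minimal sketch of the mask logic. The class name, method names, and flat layout below are illustrative assumptions, not FSDP's actual internals; it assumes a single unsharded `FlatParameter` laid out as the concatenation of the flattened original parameters.

```python
import torch

class FlatParamHandle:
    """Minimal sketch of the `_is_grad_none` bookkeeping described above.
    Names and layout are illustrative, not FSDP's real internal API."""

    def __init__(self, orig_params):
        self.orig_params = list(orig_params)
        # (offset, numel) of each original parameter inside the flat parameter
        self._shard_info = []
        offset = 0
        for p in self.orig_params:
            self._shard_info.append((offset, p.numel()))
            offset += p.numel()
        self.flat_param = torch.nn.Parameter(
            torch.cat([p.detach().reshape(-1) for p in self.orig_params])
        )
        # One mask entry per original parameter, initialized to False
        self._is_grad_none = [False for _ in self.orig_params]

    def _writeback_grad(self, i):
        """Write the i-th original parameter's gradient into
        flat_param.grad, substituting zeros for None."""
        if self.flat_param.grad is None:
            self.flat_param.grad = torch.zeros_like(self.flat_param)
        offset, numel = self._shard_info[i]
        grad = self.orig_params[i].grad
        if grad is None:
            # Zeros stand in for None; remember that fact via the mask
            self.flat_param.grad[offset : offset + numel].zero_()
            self._is_grad_none[i] = True
        else:
            self.flat_param.grad[offset : offset + numel].copy_(grad.reshape(-1))

    def _reset_is_grad_none(self):
        """Called in the post-backward: a fresh gradient was just computed
        for the flat parameter, so zeros no longer stand in for None."""
        self._is_grad_none = [False for _ in self.orig_params]

    def _use_grad_views(self):
        """Expose original parameters' gradients as views into
        flat_param.grad, except where the mask says the true gradient
        is None."""
        for i, param in enumerate(self.orig_params):
            offset, numel = self._shard_info[i]
            if self.flat_param.grad is None or self._is_grad_none[i]:
                # Preserve the original None semantics
                param.grad = None
            else:
                param.grad = self.flat_param.grad[
                    offset : offset + numel
                ].view(param.shape)
```

The mask is what preserves the distinction between "gradient is zero" and "gradient is `None`": the two are not equivalent to downstream consumers, e.g. optimizers with weight decay or momentum still update a parameter whose gradient is an all-zero tensor but skip a parameter whose `.grad` is `None`.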
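Below is a sketch of the preliminary all-reduce that `summon_full_params(with_grads=True)` can use to detect the all-ranks-`None` case; the helper name and signature are hypothetical.

```python
import torch
import torch.distributed as dist

def _all_ranks_grad_is_none(flat_param, process_group):
    """Return True iff every rank's flat_param.grad is None.
    Hypothetical helper sketching the preliminary all-reduce,
    not FSDP's actual implementation."""
    # 1.0 if this rank has a gradient, 0.0 otherwise
    flag = torch.tensor(
        [0.0 if flat_param.grad is None else 1.0],
        device=flat_param.device,
    )
    # Sum across ranks: a zero sum means no rank has a gradient
    dist.all_reduce(flag, op=dist.ReduceOp.SUM, group=process_group)
    return flag.item() == 0.0
```

Reducing a single 0/1 flag with `SUM` keeps the check to one scalar collective: when the sum is zero, the original parameters can keep `grad = None` instead of materializing zeros.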
@pytorchbot merge

Merge started: your change will be merged once all checks pass (ETA: 0-4 hours).