[FSDP] Option to keep grads in lower precision #85062
Closed
Stack from ghstack (oldest at bottom):
Rehash of a similar PR from a month ago that went stale. Adds a config to FSDP mixed precision (MP) so that gradients can be kept in lower precision, to support optimizers such as AnyPrecisionOptimizer, which would like to keep grads in bf16.
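As a rough usage sketch (not necessarily the exact API surface of this PR): the flag name `keep_low_precision_grads` below is an assumption, and the rest mirrors the existing FSDP `MixedPrecision` API.

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision

# Sketch only: `keep_low_precision_grads` is an assumed name for the new flag.
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    keep_low_precision_grads=True,  # do not cast grads back to fp32 after reduction
)

# Assumes torch.distributed is already initialized (e.g. via torchrun).
model = FSDP(nn.Linear(1024, 1024).cuda(), mixed_precision=mp_policy)
# An optimizer such as AnyPrecisionOptimizer can then consume bf16 grads directly.
```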
To do this, for sharded cases, we cannot simply omit the cast back to the full-precision param dtype; otherwise, when setting `p.grad = p._saved_grad_shard` in `finalize_params`, autograd throws an error indicating that the grad dtype must match the param dtype when it is set. As a workaround, we re-cast to the lower precision after this assignment. However, this means that for cases that use gradient accumulation, `p._saved_grad_shard` will be of the reduced dtype, because it is set from `p.grad` in `_prep_grad_for_backward`. As a result, a check and recast are added there as well.

Differential Revision: D39529117
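A minimal sketch of the workaround described above, assuming hypothetical helper names (`_finalize_param_grad`, `_prep_grad_for_backward_shard`) and the `p._saved_grad_shard` attribute; this is an illustration of the cast-up/assign/cast-down idea, not the actual FSDP internals.

```python
import torch

def _finalize_param_grad(p: torch.nn.Parameter,
                         low_prec_dtype: torch.dtype,
                         keep_low_precision_grads: bool) -> None:
    """Hypothetical sketch of the re-cast workaround in finalize_params."""
    # Cast up first so the `p.grad = ...` assignment passes autograd's
    # check that the grad dtype matches the param dtype.
    p.grad = p._saved_grad_shard.to(p.dtype)
    if keep_low_precision_grads:
        # Re-cast in place after the assignment so the optimizer sees
        # e.g. bf16 grads; using .data sidesteps the dtype check on the setter.
        p.grad.data = p.grad.data.to(low_prec_dtype)

def _prep_grad_for_backward_shard(p: torch.nn.Parameter,
                                  full_prec_dtype: torch.dtype) -> None:
    """Sketch of the gradient-accumulation fix: `_saved_grad_shard` was set
    from `p.grad`, so it may already be in the reduced dtype; check and
    recast before accumulating into it."""
    if (hasattr(p, "_saved_grad_shard")
            and p._saved_grad_shard.dtype != full_prec_dtype):
        p._saved_grad_shard = p._saved_grad_shard.to(full_prec_dtype)
```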