[FSDP] Fix `FSDP.clip_grad_norm_()` for `NO_SHARD` #88955
Conversation
@@ -1161,23 +1161,45 @@ def clip_grad_norm_(
    self._streams["unshard"],
    self._streams["pre_unshard"],
)
For `NO_SHARD`, I am wondering whether we can just call the default PyTorch `clip_grad_norm_()` and return early, without the all-reduce and the logic that follows?
Yes, we could, but this version enables us to mix and match `NO_SHARD` and `FULL_SHARD` for different submodules. If the user knows the entire FSDP instance is only `NO_SHARD`, then they can just use `torch.nn.utils.clip_grad_norm_()`.

I think some of this logic in `FSDP.clip_grad_norm_()` will be useful when thinking about how to write `clip_grad_norm_()` for our composable APIs.
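For reference, a minimal usage sketch of the two options, assuming a toy single-layer model and an already-initialized process group (the model and `max_norm` value are illustrative, not from this PR):

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# Assumes torch.distributed is already initialized; the model is a toy example.
model = FSDP(nn.Linear(8, 8), sharding_strategy=ShardingStrategy.NO_SHARD)

# ... forward/backward so that gradients are populated ...

# If every FSDP instance uses NO_SHARD, each rank already holds full
# gradients, so the vanilla utility suffices:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# If FULL_SHARD and NO_SHARD are mixed across submodules, use the FSDP method
# instead, so the norm is computed and reduced correctly across ranks:
# model.clip_grad_norm_(max_norm=1.0)
```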
We can optionally do a check over all FSDP instances in the module tree, and if all are `NO_SHARD`, then we can return `nn.utils.clip_grad_norm_()` directly as you suggested. Maybe I can include that fast path in a follow-up.
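For concreteness, one way that fast path could look; this is only a sketch of the follow-up idea, not what this PR implements. The helper name is hypothetical, and it assumes the existing `FSDP.fsdp_modules()` API and the per-instance `sharding_strategy` attribute:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

def _all_no_shard(root_module: torch.nn.Module) -> bool:
    # Hypothetical helper: walk every FSDP instance in the module tree and
    # report whether all of them use NO_SHARD.
    return all(
        fsdp_module.sharding_strategy == ShardingStrategy.NO_SHARD
        for fsdp_module in FSDP.fsdp_modules(root_module)
    )

# Inside FSDP.clip_grad_norm_(), the fast path could then delegate directly:
#     if _all_no_shard(self):
#         return torch.nn.utils.clip_grad_norm_(self.parameters(), max_norm, norm_type)
```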
I see, that makes sense!
@pytorchbot merge
This PR fixes `FSDP.clip_grad_norm_()` for `NO_SHARD`, which previously "double-counted" each gradient `world_size`-many times. This does not address any discrepancies between `FULL_SHARD` and DDP. (Note that the unit tests do show parity between `FULL_SHARD` and DDP when using `FSDP.clip_grad_norm_()` and `nn.utils.clip_grad_norm_()`, respectively, on one iteration.)

The added unit test code path mixes nested FSDP instances with both `FULL_SHARD` and `NO_SHARD` to ensure that the `local_sharded_norm` and `local_nonsharded_norm` computations interoperate correctly. I want to test a non-FSDP root instance in the future, but that is BC-breaking since we would need to make `clip_grad_norm_()` a static method, which requires a different call syntax (`FSDP.clip_grad_norm_(root_module, ...)` vs. `root_module.clip_grad_norm_(...)`).

Pull Request resolved: pytorch#88955
Approved by: https://github.com/zhaojuanmao
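To illustrate the bug being fixed, here is a minimal numeric sketch (plain Python, no actual distributed run) of how the pre-fix reduction over-counts a `NO_SHARD` gradient norm, based on the description above; the numbers and variable names are illustrative, not the actual FSDP code:

```python
# Minimal numeric sketch of the NO_SHARD over-counting (illustrative only).
# With NO_SHARD, every rank already holds the full, unsharded gradients, so
# the locally computed norm is already the global norm. The pre-fix code
# still treated it like a shard contribution and summed local_norm ** p
# across all ranks before taking the p-th root.

world_size = 4          # number of ranks (illustrative)
p = 2.0                 # norm type
local_norm = 3.0        # full-gradient norm, identical on every rank

# Pre-fix behavior: all-reduce sums the same contribution world_size times.
buggy_total = (world_size * local_norm ** p) ** (1.0 / p)

# Fixed behavior: a NO_SHARD gradient's norm is counted exactly once.
correct_total = local_norm

print(buggy_total, correct_total)  # 6.0 3.0
```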
Stack from ghstack:

- #88955 [FSDP] Fix `FSDP.clip_grad_norm_()` for `NO_SHARD`
- #88450 [FSDP] Introduce `ModuleWrapPolicy` for simplicity