[FSDP] Skip `_use_sharded_views()` for `SHARD_GRAD_OP` #98250
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/98250
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 5e3aacc.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: bfec3b5c298bd1fb4ade5843039205ed28c35b1b Pull Request resolved: #98250
ghstack-source-id: 6c1d4fcf0987271375074c1c70149c5223c23cef Pull Request resolved: #98250
LGTM, thanks!
If we're concerned about blast radius here, feel free to gate with an env var before landing.
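For reference, such a gate could be a one-line environment-variable check; the sketch below is illustrative only (the variable name `FSDP_SKIP_SHARDED_VIEWS` and the helper are hypothetical and not part of this PR):

```python
import os

# Hypothetical opt-out switch (not part of this PR): default to the new
# behavior, but allow disabling it via FSDP_SKIP_SHARDED_VIEWS=0.
_SKIP_SHARDED_VIEWS_ENABLED = os.environ.get("FSDP_SKIP_SHARDED_VIEWS", "1") == "1"


def should_skip_sharded_views(in_forward: bool, keeps_unsharded_flat_param: bool) -> bool:
    """Return True if the post-forward reshard may skip `_use_sharded_views()`."""
    return _SKIP_SHARDED_VIEWS_ENABLED and in_forward and keeps_unsharded_flat_param
```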
torch/distributed/fsdp/flat_param.py (Outdated)
@@ -1055,7 +1063,7 @@ def pre_unshard(self) -> bool:
             matches the dtype of the expected unsharded parameter.
             """
             ret = False
-            if self._use_orig_params:
+            if self._use_orig_params and not self._skipped_use_sharded_views:
Does this mean the writeback feature gets disabled for ZeRO-2? Can we just call `_use_sharded_views()` here?
To clarify, we should not call `_use_sharded_views()` here because that will simply undo the skipping. The post-forward reshard skips `_use_sharded_views()`. The pre-backward unshard first calls `pre_unshard()`; if we call `_use_sharded_views()` in `pre_unshard()`, then we undo the skip.
I am adding back the writeback check but raising an error if we detect a change between forward and backward for `SHARD_GRAD_OP`.
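The shape of that check might look roughly like the following; this is only a hedged sketch (the helper name and its arguments are illustrative, not the actual `_writeback_orig_params()` code):

```python
import torch


def check_no_param_change_between_forward_and_backward(
    orig_param: torch.nn.Parameter, unsharded_view: torch.Tensor
) -> None:
    """Illustrative only: raise if an original parameter was swapped out or
    resized between forward and backward while sharded views were skipped
    (e.g. for SHARD_GRAD_OP), since writeback cannot be applied in that case."""
    if (
        orig_param.shape != unsharded_view.shape
        or orig_param.data_ptr() != unsharded_view.data_ptr()
    ):
        raise RuntimeError(
            "Detected a change to an original parameter between forward and "
            "backward; this is not supported when `_use_sharded_views()` is "
            "skipped after forward."
        )
```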
torch/distributed/fsdp/flat_param.py (Outdated)
if (
    in_forward
    and self._sharding_strategy
    not in NO_RESHARD_AFTER_FORWARD_HANDLE_STRATEGIES
So do we also skip calling `_use_sharded_grad_views()` in this PR?
Yes, I think we have to, unless we want to use the `.data` hack to bypass the shape check, since otherwise the parameters are unsharded while the gradients are sharded. This is a niche use case anyway since a gradient must be accumulated (either actually accumulated or via `zero_grad(set_to_none=False)`).
Upon more thought, I think we should skip it anyway because unsharded parameters with sharded gradients can be confusing. If the user wants to inspect the gradients, they can use `summon_full_params(with_grads=True)`.
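For completeness, a usage sketch of that suggestion (it assumes `model` is an FSDP-wrapped module constructed with `use_orig_params=True` and that a backward pass has already populated gradients):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def print_unsharded_grad_norms(model: FSDP) -> None:
    # Unshard parameters *and* gradients so they can be inspected together;
    # with_grads=True requires use_orig_params=True on the FSDP constructor.
    with FSDP.summon_full_params(model, with_grads=True):
        for name, param in model.named_parameters():
            if param.grad is not None:
                print(f"{name}: grad norm = {param.grad.norm().item():.4f}")
```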
ghstack-source-id: 4b5eb453deb8e55f2051778546a10e69a2f1e7ce Pull Request resolved: #98250
(Before) Pre-backward hook: 4.356 ms
<img width="812" alt="Screenshot 2023-04-03 at 6 32 19 PM" src="https://user-images.githubusercontent.com/31054793/229641309-778cf1f9-4b5b-42ec-b2d8-0a1e6e7ce330.png">
(After) Pre-backward hook: 0.483 ms
<img width="1025" alt="Screenshot 2023-04-03 at 6 32 25 PM" src="https://user-images.githubusercontent.com/31054793/229641301-971d3c60-a4f1-4561-bb33-7aa07c42d0bd.png">
The "after" value might increase slightly if I add back the `_writeback_orig_params()` check.
ghstack-source-id: d33c3fe30e2a44229664dd819e7bc6ecca0c20a5 Pull Request resolved: #98250
ghstack-source-id: c458765fefa941c218242e27f02d0c3307a6d31c Pull Request resolved: #98250
ghstack-source-id: 526f1c42f31b858228c66c8934857c8348ffa281 Pull Request resolved: #98250
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This PR has `SHARD_GRAD_OP` (and `_HYBRID_SHARD_ZERO2`) skip `_use_sharded_views()` in the post-forward reshard since the strategy does not free the unsharded flat parameter and can preserve the unsharded views. This saves nontrivial CPU overhead both in the post-forward reshard (`_use_sharded_views()`) and the pre-backward unshard (`_use_unsharded_views()`).
<details>
<summary>(Before) Pre-backward hook: 4.356 ms</summary>
<img width="812" alt="Screenshot 2023-04-03 at 6 32 19 PM" src="https://user-images.githubusercontent.com/31054793/229641309-778cf1f9-4b5b-42ec-b2d8-0a1e6e7ce330.png">
</details>
<details>
<summary>(After) Pre-backward hook: 1.044 ms</summary>

![Screenshot 2023-04-04 at 9 05 53 AM](https://user-images.githubusercontent.com/31054793/229800917-9580ce6b-3721-469a-9212-f0cbfd8cbb52.png)
</details>

Pull Request resolved: #98250
Approved by: https://github.com/rohan-varma
Stack from ghstack (oldest at bottom):

- `requires_grad_mask` #98299
- Skip `_use_sharded_views()` for `SHARD_GRAD_OP` #98250
- `requires_grad` for `use_orig_params=True` #98221

This PR has `SHARD_GRAD_OP` (and `_HYBRID_SHARD_ZERO2`) skip `_use_sharded_views()` in the post-forward reshard since the strategy does not free the unsharded flat parameter and can preserve the unsharded views. This saves nontrivial CPU overhead both in the post-forward reshard (`_use_sharded_views()`) and the pre-backward unshard (`_use_unsharded_views()`).

(Before) Pre-backward hook: 4.356 ms
(After) Pre-backward hook: 1.044 ms
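To make the description concrete, here is a minimal sketch of the control flow being described. It is illustrative only: the handle methods and the strategy set mirror the names referenced in the diffs above, but this is not the actual `flat_param.py` code.

```python
from torch.distributed.fsdp import ShardingStrategy

# Strategies that keep the unsharded flat parameter alive after forward
# (illustrative stand-in for NO_RESHARD_AFTER_FORWARD_HANDLE_STRATEGIES).
STRATEGIES_KEEPING_UNSHARDED_FLAT_PARAM = {
    ShardingStrategy.SHARD_GRAD_OP,
    ShardingStrategy._HYBRID_SHARD_ZERO2,
}


def post_forward_reshard(handle) -> None:
    """Sketch: skip `_use_sharded_views()` when the unsharded views survive."""
    if handle._sharding_strategy in STRATEGIES_KEEPING_UNSHARDED_FLAT_PARAM:
        # The unsharded flat parameter is not freed, so the original parameters
        # can keep pointing at the unsharded views until the pre-backward
        # unshard, avoiding the per-parameter view re-creation cost.
        handle._skipped_use_sharded_views = True
        return
    handle._free_unsharded_flat_param()
    handle._use_sharded_views()
```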