[FSDP] Fix `use_orig_params=True`, CPU offload, `no_sync()` #100180
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100180. Note: links to docs will display an error until the docs builds have completed. ✅ No failures as of commit b4d402e. This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 8102bea96ec3551b1041a299b005e6e47f3cbbba Pull Request resolved: #100180
@@ -117,12 +117,6 @@ def _test_grad_acc(
      point to prefetch the next layer's full parameters during the
      backward pass, if at all.
      """
-     # Gradient accumulation outside `no_sync()` is not currently compatible
Previously, we were skipping all CPU offloading tests since every tested config included `use_no_sync == False` 😢. That is why I thought `use_orig_params=True` worked with `CPUOffload(True)`, but we were actually skipping the test.
This should fix #98494.
This should fix #98494. We follow a similar approach as in past PRs for mismatched dtype or size from running in `no_sync()`.
ghstack-source-id: 4d6ae641b0016ae04de6b5ee4fc81777eeef0496 Pull Request resolved: #100180
LGTM, thanks for the fix!
  NOTE: Gradient accumulation without using the ``no_sync()`` context
-     manager is not currently compatible with CPU offloading, so those tests
-     just return directly.
+     manager is not currently compatible with CPU offloading.
curious, how do we error here?
We do not error. It is silently incorrect if I understand correctly.
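For context, a minimal sketch of the two gradient-accumulation styles under discussion, assuming an already-initialized process group and an FSDP-wrapped model; the function names and the `batches`/`optim` arguments are illustrative, not from this PR:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def accumulate_without_no_sync(fsdp_model: FSDP, batches, optim):
    # Every backward reduce-scatters into the sharded `.grad`. Combined with
    # CPU offloading, this is the path that was silently incorrect rather
    # than raising an error.
    for batch in batches:
        fsdp_model(batch).sum().backward()
    optim.step()
    optim.zero_grad()


def accumulate_with_no_sync(fsdp_model: FSDP, batches, optim):
    # Gradients accumulate locally (unsharded) inside `no_sync()`; only the
    # final backward outside the context communicates across ranks.
    with fsdp_model.no_sync():
        for batch in batches[:-1]:
            fsdp_model(batch).sum().backward()
    fsdp_model(batches[-1]).sum().backward()
    optim.step()
    optim.zero_grad()
```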
@skip_if_lt_x_gpu(2)
@parametrize("use_orig_params", [False, True])
should we put this as a subtest?
I think we can avoid subtesting for major config differences where, based on the implementation, we actually expect there may be a difference. At least that is what I have been doing so far.
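To illustrate the distinction being discussed, a sketch in plain `unittest` rather than PyTorch's internal test helpers (the class and config names are made up): with decorator-level parametrization each config gets its own test entry, while the subtest style loops over configs inside a single test.

```python
import unittest


class GradAccTest(unittest.TestCase):
    def test_grad_acc(self):
        # Subtest style: one test entry iterates over the configs, so a
        # failure reports which config broke, but all configs share one entry.
        for use_orig_params in (False, True):
            with self.subTest(use_orig_params=use_orig_params):
                # ... run the gradient-accumulation check for this config ...
                pass


if __name__ == "__main__":
    unittest.main()
```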
# NOTE: This is a hack using `.data` to side step the check
# that parameter/gradient sizes/dtypes/devices match. From
# calling `reshard()`, `param` has the sharded size, the full
# precision dtype, and is on CPU. Thus, one or more of the
is on CPU only if CPU offloading?
Yes.
# calling `reshard()`, `param` has the sharded size, the full
# precision dtype, and is on CPU. Thus, one or more of the
# following cases can hold when in `no_sync()`:
# 1. `view` can have the unsharded size.
`view` is the grad here, right? Can we clarify that?
Let me do it in a follow-up to avoid re-triggering CI.
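To make the `.data` hack concrete, here is a standalone illustration (not the PR's code; shapes and dtypes are made up) of the consistency checks a direct `.grad` assignment enforces and how going through `.data` sidesteps them:

```python
import torch

# Stand-in for the parameter: sharded size, full-precision dtype.
param = torch.nn.Parameter(torch.zeros(4, dtype=torch.float32))
param.grad = torch.zeros_like(param)

# Stand-in for the gradient view: unsharded size, low-precision dtype.
mismatched_view = torch.ones(8, dtype=torch.float16)

try:
    # The `.grad` setter checks that the grad's size/dtype/device match the parameter.
    param.grad = mismatched_view
except RuntimeError as err:
    print("direct assignment rejected:", err)

# Assigning through `.data` swaps in the new tensor without those checks.
param.grad.data = mismatched_view
print(param.grad.shape, param.grad.dtype)  # torch.Size([8]) torch.float16
```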
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…100180) This should fix pytorch#98494. We follow a similar approach as in past PRs for mismatched dtype or size from running in `no_sync()`. Pull Request resolved: pytorch#100180 Approved by: https://github.com/rohan-varma
Stack from ghstack (oldest at bottom):
[FSDP] Fix `use_orig_params=True`, CPU offload, `no_sync()` #100180

This should fix #98494. We follow a similar approach as in past PRs for mismatched dtype or size from running in `no_sync()`.