FSDP optimizer overlap #98667
Conversation
Constraints:
1. No support for gradient accumulation.
2. CPU offload not supported.
3. The optimizer step is waited on in the post-backward final callback, when in theory it could wait until the next forward.

Differential Revision: [D44809582](https://our.internmc.facebook.com/intern/diff/D44809582/)
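For context, a hedged usage sketch of what this feature enables: attaching an optimizer that runs during the backward pass of an FSDP-wrapped model via the private `_apply_optimizer_in_backward` utility (which sets `param._in_backward_optimizers`, the attribute the FSDP changes in this PR consume). The model shape, hyperparameters, single-rank process-group setup, and the exact wiring/ordering relative to FSDP wrapping are illustrative assumptions, not taken from the PR itself.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.optim import _apply_optimizer_in_backward

# Minimal single-rank setup purely so the sketch is runnable (illustrative).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 8)).cuda()
fsdp_model = FSDP(model, use_orig_params=True)  # assumption: overlap targets use_orig_params

# Register a per-parameter optimizer that steps as gradients become ready,
# overlapping optimizer compute with the rest of the backward pass.
_apply_optimizer_in_backward(
    torch.optim.SGD,
    fsdp_model.parameters(),
    optimizer_kwargs={"lr": 1e-2},
)

loss = fsdp_model(torch.randn(4, 1024, device="cuda")).sum()
loss.backward()  # reduce-scatter, step(), and gradient freeing all happen here
# Note: per constraint 1 above, there is no gradient accumulation and no
# separate optimizer.step() call for these parameters.
```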
):
    for optim in orig_param._in_backward_optimizers:
        optim.step()
# Not sure if we need to do this. Setting entire
In my understanding, we should still do this. We do not want any references keeping the flat gradient alive, and we want the invariant that "if `flat_param.grad` is `None`, then each original parameter's `.grad` is `None`".
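A tiny illustrative sketch (plain tensors, not FSDP internals) of why both must be cleared together: a surviving per-parameter `.grad` view keeps the flat gradient's storage alive even after the flat reference is dropped.

```python
import torch

# Stand-ins for flat_param.grad and one original parameter's .grad view.
flat_grad = torch.zeros(10)
orig_grad_view = flat_grad[:4]  # per-parameter grad as a view into the flat buffer

storage_ptr = flat_grad.untyped_storage().data_ptr()
del flat_grad  # analogous to setting flat_param.grad = None

# The underlying storage is still reachable through the view, so the memory
# is not actually freed until the per-parameter .grad is cleared as well.
assert orig_grad_view.untyped_storage().data_ptr() == storage_ptr
```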
if root_state._sync_gradients:
    # TODO: also waits for optimizer step to finish. This can be pushed to the next forward
    # by recording an event for each individually wrapped FSDP module, and only waiting
In order for this mentioned solution to work, do we need to run the overlapped optimizer outside the post-backward stream and in a separate stream? Otherwise, `wait_stream()` is too coarse grained to differentiate between the optimizer computation and other existing post-backward ops.
Yeah, I think this solution would involve running the optimizer in its own optimizer stream that's synced appropriately with the post-backward stream, and then syncing on that in the next forward. I think we can iron out the details in a follow-up PR if we want to investigate this.
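A minimal sketch of the pattern described here (assumed names and structure, not the PR's actual implementation): run the overlapped step in a dedicated CUDA stream that waits on the post-backward stream, record an event, and have the next forward wait on that event instead of blocking in the post-backward final callback.

```python
import torch


class _OverlappedStepState:
    """Hypothetical per-module state for deferring the optimizer-step wait."""

    def __init__(self) -> None:
        self.optim_stream = torch.cuda.Stream()    # dedicated stream for optimizer work
        self.step_done_event = torch.cuda.Event()  # recorded after each step()


def _run_step_in_optim_stream(state, optim, post_backward_stream) -> None:
    # The optimizer stream must see the reduced gradients produced in the
    # post-backward stream before it can safely run step().
    state.optim_stream.wait_stream(post_backward_stream)
    with torch.cuda.stream(state.optim_stream):
        optim.step()
        optim.zero_grad(set_to_none=True)
    state.step_done_event.record(state.optim_stream)


def _wait_for_step_in_next_forward(state) -> None:
    # Instead of blocking in the post-backward final callback, the next
    # iteration's forward waits on the recorded event for this module.
    torch.cuda.current_stream().wait_event(state.step_done_event)
```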
# their gradients. They must be re-registered every forward pass in case
# the `grad_fn` is mutated.
_register_post_backward_hooks(state, handles)
# We may have to reallocate the _cpu_grad if optimizer overlap set the grad to None in the backward pass.
Is this true? If we set `.grad` to `None`, the `._cpu_grad` still exists?
Hmm, I needed to have this in my CPU offload + optimizer overlap initial testing. Will check it again and provide the reasoning.
@awgu I'm keeping this since I'm setting `_cpu_grad = None` in the post-backward. This is to maintain the invariant that if all orig param grads are `None`, then `handle.sharded_grad` should be `None`. In this case (when CPU offload is enabled), `handle.sharded_grad == _cpu_grad`.
> This is to maintain the invariant that if all orig param grad = None

Was this true before this PR? `handle.sharded_grad` has a lot of cases due to unifying `NO_SHARD` / CPU offload, so I do not remember off the top of my head.

I am wondering if that invariant is valuable, or if we should change `sharded_grad`. Reallocating the `_cpu_grad` every iteration may be bad/unnecessary from a performance perspective. This will be a CPU allocation plus a zero-fill kernel (not sure if there is any overhead to pin the memory too -- is that a copy to pinned memory?).
I feel like the invariant is valuable because one of the goals of optimizer-in-backward is to cut down on memory usage by not holding on to sharded gradients that have already been applied to the parameter.

Due to this, in the GPU case, it seems essential that we set `handle.sharded_grad` (whether that is `flat_param.grad` or `flat_param._saved_grad_shard`) to `None`. If we don't do this and just set `orig_param.grad = None`, we won't see a memory reduction since FSDP internals will still hold on to a grad.

In the CPU case, it is a little trickier. Ideally we'd like to set `_cpu_grad = None` as well, but this results in the extra reallocation in the next forward. Since we're in CPU memory, saving memory is less crucial (usually GPU memory << CPU memory), so maybe we just zero the `_cpu_grad` instead of setting it to `None`. The counterargument is that having unified CPU offload / no CPU offload behavior is valuable, so optimizer-in-backward should always set the gradient to `None` and reduce the memory, regardless of whether the gradient is on CPU or GPU.

(I don't quite understand why I had to reallocate the CPU grad, but the GPU grad doesn't need to be reallocated when set to `None` in the next iteration. I'm assuming it's automatically handled somewhere in pre-backward.)
Thanks for this analysis! I am fully on board for the GPU case, but I am still skeptical about the performance in the CPU case (and I do not think right now that the CPU memory savings from freeing sharded gradients are meaningful). I am worried that the cost for this invariant is too high. Maybe we can go with this invariant first, but we can look at the performance (e.g., via traces) and reevaluate?
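To make the performance concern concrete, here is an illustrative micro-benchmark (not from the PR; the buffer size is an arbitrary assumption) comparing reallocating a pinned CPU gradient buffer every iteration versus zeroing the existing buffer in place:

```python
import time

import torch

NUMEL = 50_000_000  # ~200 MB fp32 shard, purely illustrative


def realloc_pinned() -> torch.Tensor:
    # New allocation + zero-fill + page-locking each iteration
    # (pin_memory=True requires a CUDA-enabled build).
    return torch.zeros(NUMEL, dtype=torch.float32, pin_memory=True)


def zero_in_place(buf: torch.Tensor) -> torch.Tensor:
    # Reuse the existing pinned buffer and just zero it.
    buf.zero_()
    return buf


persistent = realloc_pinned()
for name, fn in (("realloc", realloc_pinned), ("zero_ in place", lambda: zero_in_place(persistent))):
    start = time.perf_counter()
    for _ in range(5):
        fn()
    print(f"{name}: {(time.perf_counter() - start) / 5 * 1e3:.1f} ms/iter")
```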
Another point for discussion: whether to disable this call. EDIT: made this call essentially a no-op.
Updated constraints:
1. No support for gradient accumulation.
2. CPU offload runs step() on CPU. In future PRs ideally we'd run this on GPU.
3. With CPU offload + optimizer overlap, we have to copy the flat_param grad to CPU with non_blocking=False, otherwise step() might run on invalid data.
4. The optimizer step is waited on in the post-backward final callback, when in theory it could wait until the next forward.

Differential Revision: [D44809582](https://our.internmc.facebook.com/intern/diff/D44809582/)
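For constraint 3, here is an illustrative sketch (plain tensors, not the PR's offload path) of why the device-to-host gradient copy must be blocking when step() runs on the CPU: a non-blocking copy into pinned memory returns before the data has landed, so CPU-side optimizer math could read stale values.

```python
import torch

grad_gpu = torch.randn(1 << 20, device="cuda")
grad_cpu = torch.empty(1 << 20, pin_memory=True)

# Unsafe: the async copy may still be in flight while the CPU reads grad_cpu.
grad_cpu.copy_(grad_gpu, non_blocking=True)
# ... a CPU optimizer step on grad_cpu here would race with the pending copy ...

# Safe option matching the constraint above: a blocking copy.
grad_cpu.copy_(grad_gpu, non_blocking=False)

# Alternative: keep the copy non-blocking but synchronize before the CPU step.
grad_cpu.copy_(grad_gpu, non_blocking=True)
torch.cuda.current_stream().synchronize()
```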
):
    # TODO (rohan-varma): For CPU offload, this unfortunately
    # operates on CPU, because the parameters and gradients
    # have already been offloaded.
Is the plan to move the `step()` to GPU? If so, should we clarify the TODO here?
LGTM
@pytorchbot merge -f "CI done"
Merge started: your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Merge failed. Reason: a command failed during merge. Details for Dev Infra team: raised by workflow job.