FSDP optimizer overlap #98667
Conversation
Constraints:
1. No support for gradient accumulation.
2. CPU offload not supported.
3. The optimizer step is waited on in the post-backward final callback, when in theory it could wait until the next forward.

Differential Revision: [D44809582](https://our.internmc.facebook.com/intern/diff/D44809582/)
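For context, a hedged usage sketch of what this feature enables: attaching an optimizer that runs during the backward pass of an FSDP-wrapped model via the private `_apply_optimizer_in_backward` utility (which sets `param._in_backward_optimizers`, the attribute the FSDP changes in this PR consume). The model shape, hyperparameters, single-rank process-group setup, and the exact wiring/ordering relative to FSDP wrapping are illustrative assumptions, not taken from the PR itself.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.optim import _apply_optimizer_in_backward

# Minimal single-rank setup purely so the sketch is runnable (illustrative).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 8)).cuda()
fsdp_model = FSDP(model, use_orig_params=True)  # assumption: overlap targets use_orig_params

# Register a per-parameter optimizer that steps as gradients become ready,
# overlapping optimizer compute with the rest of the backward pass.
_apply_optimizer_in_backward(
    torch.optim.SGD,
    fsdp_model.parameters(),
    optimizer_kwargs={"lr": 1e-2},
)

loss = fsdp_model(torch.randn(4, 1024, device="cuda")).sum()
loss.backward()  # reduce-scatter, step(), and gradient freeing all happen here
# Note: per constraint 1 above, there is no gradient accumulation and no
# separate optimizer.step() call for these parameters.
```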
):
    for optim in orig_param._in_backward_optimizers:
        optim.step()
# Not sure if we need to do this. Setting entire
In my understanding, we should still do this. We do not want any references keeping the flat gradient alive, and we want the invariant that "if `flat_param.grad` is `None`, then each original parameter's `.grad` is `None`".
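A tiny illustrative sketch (plain tensors, not FSDP internals) of why both must be cleared together: a surviving per-parameter `.grad` view keeps the flat gradient's storage alive even after the flat reference is dropped.

```python
import torch

# Stand-ins for flat_param.grad and one original parameter's .grad view.
flat_grad = torch.zeros(10)
orig_grad_view = flat_grad[:4]  # per-parameter grad as a view into the flat buffer

storage_ptr = flat_grad.untyped_storage().data_ptr()
del flat_grad  # analogous to setting flat_param.grad = None

# The underlying storage is still reachable through the view, so the memory
# is not actually freed until the per-parameter .grad is cleared as well.
assert orig_grad_view.untyped_storage().data_ptr() == storage_ptr
```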
if root_state._sync_gradients:
    # TODO: also waits for optimizer step to finish. This can be pushed to the next forward
    # by recording an event for each individually wrapped FSDP module, and only waiting
In order for this mentioned solution to work, do we need to run the overlapped optimizer outside the post-backward stream and in a separate stream? Otherwise, `wait_stream()` is too coarse grained to differentiate between the optimizer computation and other existing post-backward ops.
Yeah, I think this solution would involve running the optimizer in its own optimizer stream that's synced appropriately with the post-backward stream, and then syncing on that in the next forward. I think we can iron out the details in a follow-up PR if we want to investigate this.
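A minimal sketch of the pattern described here (assumed names and structure, not the PR's actual implementation): run the overlapped step in a dedicated CUDA stream that waits on the post-backward stream, record an event, and have the next forward wait on that event instead of blocking in the post-backward final callback.

```python
import torch


class _OverlappedStepState:
    """Hypothetical per-module state for deferring the optimizer-step wait."""

    def __init__(self) -> None:
        self.optim_stream = torch.cuda.Stream()    # dedicated stream for optimizer work
        self.step_done_event = torch.cuda.Event()  # recorded after each step()


def _run_step_in_optim_stream(state, optim, post_backward_stream) -> None:
    # The optimizer stream must see the reduced gradients produced in the
    # post-backward stream before it can safely run step().
    state.optim_stream.wait_stream(post_backward_stream)
    with torch.cuda.stream(state.optim_stream):
        optim.step()
        optim.zero_grad(set_to_none=True)
    state.step_done_event.record(state.optim_stream)


def _wait_for_step_in_next_forward(state) -> None:
    # Instead of blocking in the post-backward final callback, the next
    # iteration's forward waits on the recorded event for this module.
    torch.cuda.current_stream().wait_event(state.step_done_event)
```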
# their gradients. They must be re-registered every forward pass in case
# the `grad_fn` is mutated.
_register_post_backward_hooks(state, handles)
# We may have to reallocate the _cpu_grad if optimizer overlap set the grad to None in the backward pass.
Is this true? If we set `.grad` to `None`, the `._cpu_grad` still exists?
Hmm, I needed to have this in my CPU offload + optimizer overlap initial testing. Will check it again and provide the reasoning.
@awgu I'm keeping this since I'm setting `_cpu_grad = None` in the post-backward. This is to maintain the invariant that if all orig param grads are `None`, then `handle.sharded_grad` should be `None`. In this case (when CPU offload is enabled), `handle.sharded_grad == _cpu_grad`.
> This is to maintain the invariant that if all orig param grad = None

Was this true before this PR? `handle.sharded_grad` has a lot of cases due to unifying `NO_SHARD` / CPU offload, so I do not remember off the top of my head.

I am wondering if that invariant is valuable, or if we should change `sharded_grad`. Reallocating the `_cpu_grad` every iteration may be bad/unnecessary from a performance perspective. This will be a CPU allocation plus a zero-fill kernel (not sure if there is any overhead to pin the memory too -- is that a copy to pinned memory?).
I feel like the invariant is valuable because one of the goals of optimizer-in-backward is to cut down on memory usage by not holding on to sharded gradients that have already been applied to the parameter.

Due to this, in the GPU case, it seems essential that we set `handle.sharded_grad` (whether that is `flat_param.grad` or `flat_param._saved_grad_shard`) to `None`. If we don't do this and just set `orig_param.grad = None`, we won't see a memory reduction since FSDP internals will still hold on to a grad.

In the CPU case, it is a little trickier. Ideally we'd like to set `_cpu_grad = None` as well, but this results in the extra reallocation in the next forward. Since we're in CPU memory, saving memory is less crucial (usually GPU memory << CPU memory), so maybe we just zero the `_cpu_grad` instead of setting it to `None`. The counterargument is that having unified CPU offload / no CPU offload behavior is valuable, so optimizer-in-backward should always set the gradient to `None` and reduce the memory, regardless of whether the gradient is on CPU or GPU.

(I don't quite understand why I had to reallocate the CPU grad, but the GPU grad doesn't need to be reallocated when set to `None` in the next iteration. I'm assuming it's automatically handled somewhere in pre-backward.)
Thanks for this analysis! I am fully on board for the GPU case, but I am still skeptical about the performance in the CPU case (and I do not think right now that the CPU memory savings from freeing sharded gradients are meaningful). I am worried that the cost for this invariant is too high. Maybe we can go with this invariant first, but we can look at the performance (e.g., via traces) and reevaluate?
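To make the performance concern concrete, here is an illustrative micro-benchmark (not from the PR; the buffer size is an arbitrary assumption) comparing reallocating a pinned CPU gradient buffer every iteration versus zeroing the existing buffer in place:

```python
import time

import torch

NUMEL = 50_000_000  # ~200 MB fp32 shard, purely illustrative


def realloc_pinned() -> torch.Tensor:
    # New allocation + zero-fill + page-locking each iteration
    # (pin_memory=True requires a CUDA-enabled build).
    return torch.zeros(NUMEL, dtype=torch.float32, pin_memory=True)


def zero_in_place(buf: torch.Tensor) -> torch.Tensor:
    # Reuse the existing pinned buffer and just zero it.
    buf.zero_()
    return buf


persistent = realloc_pinned()
for name, fn in (("realloc", realloc_pinned), ("zero_ in place", lambda: zero_in_place(persistent))):
    start = time.perf_counter()
    for _ in range(5):
        fn()
    print(f"{name}: {(time.perf_counter() - start) / 5 * 1e3:.1f} ms/iter")
```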
Another point for discussion: whether to disable this call. EDIT: made this call essentially a no-op.
Updated constraints:
1. No support for gradient accumulation.
2. CPU offload runs step() on CPU. In future PRs ideally we'd run this on GPU.
3. With CPU offload + optimizer overlap, we have to copy the flat_param grad to CPU with non_blocking=False, otherwise step() might run on invalid data.
4. The optimizer step is waited on in the post-backward final callback, when in theory it could wait until the next forward.

Differential Revision: [D44809582](https://our.internmc.facebook.com/intern/diff/D44809582/)
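For constraint 3, here is an illustrative sketch (plain tensors, not the PR's offload path) of why the device-to-host gradient copy must be blocking when step() runs on the CPU: a non-blocking copy into pinned memory returns before the data has landed, so CPU-side optimizer math could read stale values.

```python
import torch

grad_gpu = torch.randn(1 << 20, device="cuda")
grad_cpu = torch.empty(1 << 20, pin_memory=True)

# Unsafe: the async copy may still be in flight while the CPU reads grad_cpu.
grad_cpu.copy_(grad_gpu, non_blocking=True)
# ... a CPU optimizer step on grad_cpu here would race with the pending copy ...

# Safe option matching the constraint above: a blocking copy.
grad_cpu.copy_(grad_gpu, non_blocking=False)

# Alternative: keep the copy non-blocking but synchronize before the CPU step.
grad_cpu.copy_(grad_gpu, non_blocking=True)
torch.cuda.current_stream().synchronize()
```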
):
    # TODO (rohan-varma): For CPU offload, this unfortunately
    # operates on CPU, because the parameters and gradients
    # have already been offloaded.
Is the plan to move the `step()` to GPU? If so, should we clarify the TODO here?
LGTM
@pytorchbot merge -f "CI done"
Merge started: your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Merge failed. Reason: a command failed during merge. Details for Dev Infra team: raised by workflow job.