
Conversation

@rohan-varma (Contributor) commented Apr 8, 2023

Stack from ghstack (oldest at bottom):

Constraints:

  1. No support for gradient accumulation.
  2. CPU offload runs step() on CPU. In future PRs, ideally we'd run this on GPU.
  3. When CPU offload is combined with optimizer overlap, we have to copy the flat_param gradient to CPU with non_blocking=False; otherwise, step() might run on invalid data.
  4. step() is waited on in the post-backward final callback, when in theory the wait could be deferred until the next forward.

Differential Revision: D44809582
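
For context, here is a minimal usage sketch of what FSDP optimizer overlap (optimizer-in-backward) looks like from the user's side. It assumes the private `torch.distributed.optim._apply_optimizer_in_backward` helper, `use_orig_params=True`, and a `torchrun` launch; the ordering and kwargs are assumptions for illustration, not a spec of what this PR requires.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.optim import _apply_optimizer_in_backward

# Assumes a torchrun launch so env:// rendezvous variables are set.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).cuda()

# Register a per-parameter SGD that runs step() inside the backward pass, so
# the optimizer update overlaps with the rest of the backward computation.
_apply_optimizer_in_backward(
    torch.optim.SGD,
    model.parameters(),
    optimizer_kwargs={"lr": 1e-2},
)

fsdp_model = FSDP(model, use_orig_params=True)

# Training loop: no explicit optimizer.step() / zero_grad(); the step happens
# in the backward, and gradients are freed once they have been applied.
for _ in range(3):
    loss = fsdp_model(torch.randn(8, 1024, device="cuda")).sum()
    loss.backward()
```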

@pytorch-bot (bot) commented Apr 8, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/98667

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 6084ced:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added the release notes: distributed (fsdp) label on Apr 8, 2023.
rohan-varma added a commit that referenced this pull request Apr 8, 2023

Constraints:

1. No support for gradient accumulation.
2. CPU offload not supported.
3. step() is waited on in the post-backward final callback, when in theory the wait could be deferred until the next forward.

Differential Revision: [D44809582](https://our.internmc.facebook.com/intern/diff/D44809582/)

ghstack-source-id: 185503498
Pull Request resolved: #98667
rohan-varma added a commit that referenced this pull request Apr 11, 2023

ghstack-source-id: 185660100
Pull Request resolved: #98667
rohan-varma changed the title from "[WIP] Prototype FSDP optimizer overlap" to "FSDP optimizer overlap" on Apr 11, 2023.
rohan-varma added a commit that referenced this pull request Apr 11, 2023

ghstack-source-id: 185740335
Pull Request resolved: #98667
):
    for optim in orig_param._in_backward_optimizers:
        optim.step()
    # Not sure if we need to do this. Setting entire
Collaborator:

In my understanding, we should still do this.

We do not want any references keeping the flat gradient alive, and we want the invariant that "if flat_param.grad is None, then each original parameter's .grad is None".


if root_state._sync_gradients:
    # TODO: also waits for optimizer step to finish. This can be pushed to the next forward
    # by recording an event for each individually wrapped FSDP module, and only waiting
Collaborator:

For the solution mentioned here to work, do we need to run the overlapped optimizer outside the post-backward stream, in a separate stream? Otherwise, wait_stream() is too coarse-grained to differentiate the optimizer computation from the other existing post-backward ops.

rohan-varma (Contributor, Author):

Yeah, I think this solution would involve running the optimizer in its own stream that is synced appropriately with the post-backward stream, and then syncing on that in the next forward (roughly as sketched below). I think we can iron out the details in a follow-up PR if we want to investigate this.
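
For illustration, a rough sketch of that event-per-module deferral, assuming a dedicated optimizer stream; the `_optim_step_event` attribute and hook names are hypothetical and not this PR's implementation.

```python
import torch

# Hypothetical sketch: run step() in its own stream, record an event per FSDP
# module, and make only that module's next forward wait on it.
optim_stream = torch.cuda.Stream()

def run_overlapped_step(post_backward_stream, optimizer, fsdp_state):
    # The optimizer stream must see the reduced (and possibly offloaded) grads.
    optim_stream.wait_stream(post_backward_stream)
    with torch.cuda.stream(optim_stream):
        optimizer.step()
    event = torch.cuda.Event()
    event.record(optim_stream)
    fsdp_state._optim_step_event = event  # hypothetical attribute

def wait_for_overlapped_step_in_pre_forward(fsdp_state):
    event = getattr(fsdp_state, "_optim_step_event", None)
    if event is not None:
        # Only now does the compute stream wait for the deferred step.
        torch.cuda.current_stream().wait_event(event)
        fsdp_state._optim_step_event = None
```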

# their gradients. They must be re-registered every forward pass in case
# the `grad_fn` is mutated.
_register_post_backward_hooks(state, handles)
# We may have to reallocate the _cpu_grad if optimizer overlap set the grad to None in the backward pass.
Collaborator:

Is this true? If we set .grad to None, does ._cpu_grad still exist?

rohan-varma (Contributor, Author):

Hmm, I needed this in my initial CPU offload + optimizer overlap testing. I will check it again and provide the reasoning.

rohan-varma (Contributor, Author):

@awgu I'm keeping this, since I'm setting cpu_grad=None in the post-backward. This is to maintain the invariant that if all original parameters' grads are None, then handle.sharded_grad should be None. In this case (when CPU offload is enabled), handle.sharded_grad == cpu_grad.

Collaborator:

> This is to maintain the invariant that if all orig param grad = None

Was this true before this PR? handle.sharded_grad has a lot of cases due to unifying NO_SHARD / CPU offload, so I do not remember off the top of my head.

I am wondering whether that invariant is valuable, or whether we should change sharded_grad. Reallocating the _cpu_grad every iteration may be bad/unnecessary from a performance perspective: it is a CPU allocation plus a zero-fill kernel (and I am not sure whether there is additional overhead to pin the memory too -- is that a copy to pinned memory?).
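
As a rough way to quantify that cost, a standalone measurement sketch for a pinned CPU allocation plus zero-fill per iteration; the shard size below is made up and the numbers are machine-dependent.

```python
import time
import torch

numel = 50_000_000  # hypothetical flat_param shard size (~200 MB of fp32)
iters = 10

start = time.perf_counter()
for _ in range(iters):
    # Allocate a fresh pinned CPU tensor and zero-fill it, mimicking a
    # per-iteration _cpu_grad reallocation.
    grad = torch.zeros(numel, dtype=torch.float32, pin_memory=True)
elapsed_ms = (time.perf_counter() - start) / iters * 1e3
print(f"avg pinned alloc + zero-fill: {elapsed_ms:.1f} ms")
```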

rohan-varma (Contributor, Author) commented Jun 29, 2023:

I feel like the invariant is valuable because one of the goals of optimizer-in-backward is to cut down on memory usage by not holding on to sharded gradients that have already been applied to the parameter.

Because of this, in the GPU case it seems essential that we set handle.sharded_grad (whether sharded_grad is flat_param.grad or flat_param._saved_grad_shard) to None. If we don't, and only set orig_param.grad = None, we won't see a memory reduction, since FSDP internals will still hold on to a gradient.

The CPU case is a little trickier. Ideally we'd set _cpu_grad = None as well, but that results in the extra reallocation in the next forward. Since we're in CPU memory, saving memory is less crucial (usually GPU memory << CPU memory), so maybe we just zero cpu_grad instead of setting it to None. The counterargument is that unified behavior with and without CPU offload is valuable, so optimizer-in-backward should always set the gradient to None and reclaim the memory, regardless of whether the gradient lives on CPU or GPU.

(I don't quite understand why I had to reallocate the CPU grad while the GPU grad doesn't need to be reallocated when set to None in the next iteration; I'm assuming it's handled automatically somewhere in the pre-backward.)
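
To make the tradeoff concrete, here is a simplified sketch of what freeing the sharded gradient after the in-backward step could look like; the attribute names mirror the FSDP internals discussed here (`_saved_grad_shard`, `_cpu_grad`), but the function itself is illustrative only, not the PR's code.

```python
def free_sharded_grad_after_step(flat_param, orig_params, cpu_offload: bool) -> None:
    # Each original parameter's grad has already been consumed by its
    # in-backward optimizer, so drop the references that would otherwise
    # keep the (sharded) gradient memory alive.
    for p in orig_params:
        p.grad = None
    if cpu_offload:
        # With CPU offload, the sharded grad lives in _cpu_grad; freeing it
        # saves CPU memory but forces a reallocation in the next forward.
        flat_param._cpu_grad = None
    elif hasattr(flat_param, "_saved_grad_shard"):
        flat_param._saved_grad_shard = None
    else:
        flat_param.grad = None
```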

Collaborator:

Thanks for this analysis! I am fully on board for the GPU case, but I am still skeptical about the performance of the CPU case (and I do not think, right now, that the CPU memory savings from freeing sharded gradients are meaningful). I am worried that the cost of this invariant is too high. Maybe we can go with this invariant first, then look at the performance (e.g., via traces) and reevaluate?

@rohan-varma (Contributor, Author) commented Apr 24, 2023

Another point for discussion: whether to disable the prepare_gradient_for_optim call when overlapping the optimizer. cc @awgu

EDIT: made this call essentially a no-op.

):
    # TODO (rohan-varma): For CPU offload, this unfortunately
    # operates on CPU, because the parameters and gradients
    # have already been offloaded.
Contributor:

Is the plan to move the step() to GPU? If so, should we clarify the TODO here?


rohan-varma added a commit that referenced this pull request Jul 7, 2023

ghstack-source-id: 194021796
Pull Request resolved: #98667
@fegin (Contributor) left a comment:

LGTM


rohan-varma added a commit that referenced this pull request Jul 7, 2023

ghstack-source-id: 194085953
Pull Request resolved: #98667

@rohan-varma (Contributor, Author) commented:

@pytorchbot merge -f "CI done"

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@pytorchmergebot (Collaborator):

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x ea557b2e5a91d95ad8c0606eef92db72e44b64fc returned non-zero exit code 1

Auto-merging test/distributed/fsdp/test_fsdp_misc.py
Auto-merging torch/distributed/fsdp/_runtime_utils.py
CONFLICT (content): Merge conflict in torch/distributed/fsdp/_runtime_utils.py
Auto-merging torch/distributed/fsdp/flat_param.py
error: could not apply ea557b2e5a9... FSDP optim overlap
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
Details for Dev Infra team: raised by workflow job.


rohan-varma requested a review from fduwjj as a code owner on July 13, 2023 01:12.

rohan-varma added a commit that referenced this pull request Jul 13, 2023

ghstack-source-id: 194497876
Pull Request resolved: #98667
facebook-github-bot deleted the gh/rohan-varma/661/head branch on July 16, 2023 14:16.