Conversation

davidberard98 (Contributor) commented Sep 23, 2023

Stack from ghstack (oldest at bottom):

**Background**: recordStream calls can result in memory spikes, so we don't want them to appear in FSDP (https://dev-discuss.pytorch.org/t/fsdp-cudacachingallocator-an-outsider-newb-perspective/1486). @awgu is working on fixing this, but it turns out the profiler, when enabled, was causing recordStream to be called.

Why the profiler was causing recordStream to get called: NCCL calls add profiler events manually; they register a callback to be executed when the future for the collective completes, and that callback marks the end of the CPU-side profiler event:

work->future_->addCallback([work](at::ivalue::Future& /* unused */) {
  work->recordFunctionEndCallback_();
});

(https://github.com/pytorch/pytorch/blob/c2c7c4035f41f90116aadf2df3f5d5b4834f838b/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1822-L1824)

In order to guarantee safety, ivalue::Future::invokeCallback calls `recordStream` on the future's storage buffers; this marks the fact that other streams (e.g. the one the callback runs on) may need to use the storage:

synchronizeWithCurrentStreams();
callback(*this);

(https://github.com/pytorch/pytorch/blob/c2c7c4035f41f90116aadf2df3f5d5b4834f838b/aten/src/ATen/core/ivalue_inl.h#L1171-L1173)
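To make the cost concrete, here is a rough sketch of what that pre-callback safety step amounts to. This is not the actual ivalue::Future source; the signature and the `events`/`storages` parameters are placeholders for the future's internal state:

```cpp
#include <ATen/cuda/CUDAEvent.h>
#include <c10/core/Storage.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAStream.h>
#include <vector>

// Rough sketch of the pre-callback safety step; placeholder signature.
void synchronizeWithCurrentStreamsSketch(
    std::vector<at::cuda::CUDAEvent>& events,
    std::vector<c10::Storage>& storages) {
  auto stream = c10::cuda::getCurrentCUDAStream();
  // 1) Stream-order the callback after the collective's kernels.
  for (auto& ev : events) {
    ev.block(stream);
  }
  // 2) Tell the caching allocator that each storage may now be used on the
  //    callback's stream. This is the recordStream call that delays block
  //    reuse until that stream catches up, i.e. the source of the spikes.
  for (auto& storage : storages) {
    c10::cuda::CUDACachingAllocator::recordStream(storage.data_ptr(), stream);
  }
}
```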

**Change**: The end-profiler-event callback doesn't actually use the future, so we don't need to call recordStream for it. This PR introduces an optional parameter, `uses_future`, for adding callbacks; a user can set it to `false` to (unsafely) skip the recordStream when they know the lambda will not use the future.
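A minimal sketch of the shape of that API change, assuming a simplified non-template Future (the real ivalue::Future is templated and does more bookkeeping):

```cpp
// Simplified sketch of the new parameter; not the actual ivalue::Future code.
void Future::addCallback(std::function<void(Future&)> callback,
                         bool uses_future /* = true */) {
  std::unique_lock<std::mutex> lock(mutex_);
  if (completed_) {
    lock.unlock();
    invokeCallback(std::move(callback), uses_future);
    return;
  }
  callbacks_.emplace_back(std::move(callback), uses_future);
}

void Future::invokeCallback(std::function<void(Future&)> callback,
                            bool uses_future) {
  // Only pay for stream synchronization + recordStream when the callback
  // can actually touch the future's CUDA storages.
  if (uses_future) {
    synchronizeWithCurrentStreams();
  }
  callback(*this);
}
```

The ProcessGroupNCCL call site quoted in the review thread below then passes `/*uses_future=*/false`, since the end-profiler-event lambda ignores its argument.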

**Tests**: (a) unit tests; (b) added an assert in recordStream (https://github.com/pytorch/pytorch/blob/c2c7c4035f41f90116aadf2df3f5d5b4834f838b/c10/cuda/CUDACachingAllocator.cpp#L3260):

void recordStream(const DataPtr& ptr, cuda::CUDAStream stream) override {

and verified that it doesn't get triggered when running basic distributed tests with the profiler enabled.
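The assert itself was a throwaway debugging aid, roughly of this shape (hypothetical sketch; it was not landed as part of the PR):

```cpp
// Hypothetical local-only check: fail loudly if anything reaches
// recordStream while the profiled distributed tests run.
void recordStream(const DataPtr& ptr, cuda::CUDAStream stream) override {
  TORCH_INTERNAL_ASSERT(
      false, "recordStream should not be reached in this test run");
  // ... the allocator's normal recordStream bookkeeping follows ...
}
```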

TODO - description of the chain of calls:
* The ProcessGroupNCCL callback that ends the profiler event
* triggers the future's callback handling,
* which calls synchronizeWithCurrentStreams(),
* which in turn calls recordStream.

Why this is WIP: I don't actually know why we need the synchronizeWithCurrentStreams(); I think we might actually be able to remove that sync entirely.

pytorch-bot (bot) commented Sep 23, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109933

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 156d115 with merge base 7e6cf04:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added the release notes: distributed (c10d) label Sep 23, 2023
davidberard98 added a commit that referenced this pull request Sep 23, 2023

ghstack-source-id: 784d2ac
Pull Request resolved: #109933
davidberard98 added a commit that referenced this pull request Sep 24, 2023

ghstack-source-id: 0b9886a
Pull Request resolved: #109933
davidberard98 added a commit that referenced this pull request Sep 26, 2023

ghstack-source-id: 7ff5773
Pull Request resolved: #109933
davidberard98 added a commit that referenced this pull request Sep 29, 2023

ghstack-source-id: 171e138
Pull Request resolved: #109933
davidberard98 added the ciflow/trunk label Sep 29, 2023
davidberard98 marked this pull request as ready for review September 29, 2023 21:10
davidberard98 changed the title from "[WIP] Remove recordStream for callback that ends a profiler event" to "[distributed] Remove recordStream for callback that ends a profiler event" Sep 29, 2023
davidberard98 requested a review from lw September 29, 2023 21:11
davidberard98 (Author) commented:

@pytorchbot rebase -s

pytorchmergebot (Collaborator) commented:

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot (Collaborator) commented:
Successfully rebased gh/davidberard98/229/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/109933)

pytorchmergebot pushed a commit that referenced this pull request Sep 29, 2023

ghstack-source-id: ce83955
Pull Request resolved: #109933
davidberard98 added the topic: not user facing label Sep 29, 2023
awgu (Collaborator) commented Sep 29, 2023:

Amazing work!

Review thread on the updated ProcessGroupNCCL call site:

work->future_->addCallback(
    [work](at::ivalue::Future& /* unused */) {
      work->recordFunctionEndCallback_();
    },
    /*uses_future=*/false);
Contributor commented:

Do we want to set this to false directly, or do we want to make it configurable?

davidberard98 (Author) replied:

My understanding is that, since the callback doesn't use the future, there's no point in synchronizing when the callback is invoked.

But if there's a reason to, LMK and I can make it configurable.

Contributor commented:

I think that is mostly for gating and rollout purposes. This PR in general looks good to me, but have we battle-tested it on all FSDP workloads? If not, we might want to gate it for now so that, instead of reverting the PR, we can iterate on top of this change. WDYT?

Collaborator commented:

I think that since we control the callback (so we know that we do not use the future), this change seems reasonable to me without gating. In my experience with FSDP, gating ends up costing more in maintenance than it helps.

Contributor commented:

I wonder if we want to add some comments saying something like: if the future is non-empty, uses_future must be set to true. Or is it possible to add a check that the lambda takes an unnamed argument? (Maybe not, lol.)

davidberard98 (Author) replied:

> add some comments

will do

> add a check if the lambda func is taking an unnamed argument

I tried to do this in https://github.com/pytorch/pytorch/pull/109933/files/b3ce2ec4fab56f5173fdbcb681759d7887832da2 but couldn't get the Windows builds to pass (the general shape of that dispatch is sketched below).
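For context, the compile-time dispatch referred to above would look roughly like the following. This is an illustrative sketch of the technique only, not the code from the linked commit; `Future` here is a stand-in for at::ivalue::Future:

```cpp
#include <type_traits>
#include <utility>

struct Future {}; // stand-in for at::ivalue::Future in this sketch

// Dispatch on the lambda's signature so that both void(Future&) and
// void() callbacks are accepted.
template <typename T>
void invokeCallbackSketch(Future& fut, T&& callback) {
  if constexpr (std::is_invocable_v<T, Future&>) {
    std::forward<T>(callback)(fut); // callback wants the future
  } else {
    static_assert(std::is_invocable_v<T>,
                  "The callback must have signature void(Future&) or void()");
    std::forward<T>(callback)(); // callback ignores the future entirely
  }
}
```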

  static_assert(
      std::is_invocable_r<void, T, Future&>::value,
-     "The callback must have signature void(Future&)");
+     "The callback must have signature void(Future&) or void()");
Contributor commented:

Is the "or void()" thing stale? I didn't see implementations for a void() callback. (And if it is supported, then why not use it in the PGNCCL callbacks below?)

davidberard98 (Author) replied:

You're right, it's stale; let me remove that. I was trying to get addCallback to take void() lambdas, and I got it working on Linux builds but not on Windows builds.

wconstab (Contributor) left a comment:

This looks good to me. thanks @davidberard98!

xw285cornell (Contributor) left a comment:

Great work!! BTW, I wonder how you would test this PR and make sure it won't regress (you mentioned a unit test + assertion, though I might have missed it). Maybe mock the record function and check the call? (BTW we don't have to do it here; maybe some follow-ups.)
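As a toy illustration of that testing idea (not PyTorch test code; all names here are made up), one could count recordStream-style calls through a mocked future and assert the count stays zero when `uses_future=false`:

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Toy mock: the counter stands in for CUDACachingAllocator::recordStream.
std::atomic<int> record_stream_calls{0};

struct ToyFuture {
  void addCallback(const std::function<void()>& cb, bool uses_future = true) {
    if (uses_future) {
      record_stream_calls++;  // stands in for synchronizeWithCurrentStreams()
    }
    cb();  // run the callback (e.g. "end the profiler event")
  }
};

int main() {
  ToyFuture fut;
  fut.addCallback([] { /* end profiler event */ }, /*uses_future=*/false);
  assert(record_stream_calls.load() == 0);  // recordStream was never hit
  return 0;
}
```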

davidberard98 added a commit that referenced this pull request Oct 3, 2023

ghstack-source-id: 20a3773
Pull Request resolved: #109933
awgu (Collaborator) commented Oct 3, 2023:

Land, land, land!

davidberard98 (Author) commented:

@pytorchbot merge

pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

facebook-github-bot deleted the gh/davidberard98/229/head branch October 7, 2023 14:22