[distributed] Remove recordStream for callback that ends a profiler event #109933
Conversation
TODO - description:
* The ProcessGroupNCCL callback that ends the profiler event
* The future callback handling
* calls synchronizeWithCurrentStreams()
* that calls recordStream

Why this is WIP: I don't actually know why we need the synchronizeWithCurrentStreams(). I think we might actually be able to remove that sync all the time.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109933
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 156d115 with merge base 7e6cf04.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot rebase -s
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
… profiler event"

**Background**: recordStreams can result in memory spikes, so we don't want them to appear in FSDP (https://dev-discuss.pytorch.org/t/fsdp-cudacachingallocator-an-outsider-newb-perspective/1486). @awgu is working on fixing this, but it turns out the profiler was causing recordStream to get called when it is enabled.

**Why the profiler was causing recordStream to get called**: NCCL calls add profiler events manually; they register a callback to be executed when the future for the collective is completed, which indicates the end of the CPU-side profiler event: https://github.com/pytorch/pytorch/blob/c2c7c4035f41f90116aadf2df3f5d5b4834f838b/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1822-L1824. In order to guarantee safety, ivalue::Future::invokeCallback calls `recordStream` on the future's storage buffers; this marks the fact that other streams (e.g. the one that the callback runs on) may need to use the storage: https://github.com/pytorch/pytorch/blob/c2c7c4035f41f90116aadf2df3f5d5b4834f838b/aten/src/ATen/core/ivalue_inl.h#L1171-L1173.

**Change**: The end-profiler-event callback doesn't actually use the future, so we don't need to recordStream on it. This PR introduces an optional parameter `uses_future` for adding callbacks; a user can set this parameter to false to unsafely skip the recordStream, if the user knows that the future will not be used in the lambda.

**Tests**: (a) unit tests; (b) added an assert in recordStream (https://github.com/pytorch/pytorch/blob/c2c7c4035f41f90116aadf2df3f5d5b4834f838b/c10/cuda/CUDACachingAllocator.cpp#L3260) and verified that it doesn't get triggered when running basic distributed tests with the profiler enabled.
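For readers skimming the change: below is a minimal standalone sketch of the `uses_future` idea, with simplified names and structure. This is not the actual `ivalue::Future` implementation (which lives in aten/src/ATen/core/ivalue_inl.h); it only illustrates where the conditional skip sits.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Minimal sketch, not the real ivalue::Future: the stream synchronization
// that ends up calling recordStream is skipped when the caller promises
// that the callback ignores the future's value.
struct Future {
  using Callback = std::function<void(Future&)>;

  // uses_future defaults to true, preserving the old (always-safe) behavior.
  void addCallback(Callback cb, bool uses_future = true) {
    callbacks_.emplace_back(std::move(cb), uses_future);
  }

  void markCompleted() {
    for (auto& [cb, uses_future] : callbacks_) {
      if (uses_future) {
        // In PyTorch this is synchronizeWithCurrentStreams(), which calls
        // recordStream on the future's storage buffers so the caching
        // allocator won't reuse them while another stream may still read them.
        synchronizeWithCurrentStreams();
      }
      cb(*this); // the callback may legally ignore its argument
    }
  }

 private:
  void synchronizeWithCurrentStreams() { /* recordStream(...) on storages */ }
  std::vector<std::pair<Callback, bool>> callbacks_;
};
```

The profiler-end callback in ProcessGroupNCCL (shown in the diff below) is exactly this kind of caller: its lambda takes the `Future&` only to satisfy the signature and never touches it.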
Successfully rebased
Amazing work!
    [work](at::ivalue::Future& /* unused */) {
      work->recordFunctionEndCallback_();
    },
    /*uses_future=*/false);
Do we want to directly set this to false, or do we want to make it configurable?
My understanding is that, since the callback doesn't use the future, there's no point in synchronizing when the callback is invoked.
But if there's a reason to, LMK and I can make it configurable.
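For context on what that synchronization does: `recordStream` tells the CUDA caching allocator that another stream may be using a tensor's storage, so the allocator defers reusing the block until an event recorded on that stream completes - which is exactly what can hold memory longer and cause the spikes mentioned in the description. A rough usage sketch follows; the `recordStream` call is the real c10 API, while the surrounding helper is illustrative only.

```cpp
#include <ATen/ATen.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAStream.h>

// Illustrative helper: mark t's storage as in use by `side`, so that when t
// is freed on its producing stream, the caching allocator waits for the work
// already queued on `side` before handing the block to a new allocation.
void markUsedOnSideStream(const at::Tensor& t, c10::cuda::CUDAStream side) {
  c10::cuda::CUDACachingAllocator::recordStream(t.storage().data_ptr(), side);
}
```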
I think that's mostly for gating and rollout purposes. This PR in general looks good to me, but have we battle-tested it on all FSDP workloads? If not, we might just want to gate it for now so that instead of reverting the PR, we can iterate on top of this change. WDYT?
I think since we control the callback (so we know that we do not use the future), this change seems reasonable to me without gating. In my experience with FSDP, gating ends up costing more in maintenance than it helps.
I wonder if we want to add some comments saying something like: if the future is non-empty, `uses_future` must be set to true. Or is it possible to add a check that the lambda func takes an unnamed argument? (maybe not lol)
> add some comments

will do

> add a check that the lambda func takes an unnamed argument

I tried to do this in https://github.com/pytorch/pytorch/pull/109933/files/b3ce2ec4fab56f5173fdbcb681759d7887832da2, but couldn't get the Windows builds to pass.
aten/src/ATen/core/ivalue_inl.h (Outdated)
    static_assert(
        std::is_invocable_r<void, T, Future&>::value,
-       "The callback must have signature void(Future&)");
+       "The callback must have signature void(Future&) or void()");
is the `or void()` thing stale? I didn't see implementations for a void() callback. (And if it is supported, then why not use it in the PGNCCL callbacks below?)
you're right, it's stale - let me remove that. I was trying to get `addCallback` to take `void()` lambdas, and I got it working on Linux builds but not with Windows builds.
This looks good to me. Thanks @davidberard98!
Great work!! btw, I wonder how you would test this PR and make sure it won't regress (you mentioned a unit test + assertion, though I might have missed it). Maybe mock the record func and check the call? (btw we don't have to do it here - maybe some follow-ups)
Land, land, land!
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):

**Background**: recordStreams can result in memory spikes, so we don't want them to appear in FSDP (https://dev-discuss.pytorch.org/t/fsdp-cudacachingallocator-an-outsider-newb-perspective/1486). @awgu is working on fixing this, but it turns out the profiler was causing recordStream to get called when it is enabled.

**Why the profiler was causing recordStream to get called**: NCCL calls add profiler events manually; they register a callback to be executed when the future for the collective is completed, which indicates the end of the CPU-side profiler event (torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp, lines 1822-1824 at c2c7c40). In order to guarantee safety, ivalue::Future::invokeCallback calls `recordStream` on the future's storage buffers; this marks the fact that other streams (e.g. the one that the callback runs on) may need to use the storage (aten/src/ATen/core/ivalue_inl.h, lines 1171-1173 at c2c7c40).

**Change**: The end-profiler-event callback doesn't actually use the future, so we don't need to recordStream on it. This PR introduces an optional parameter `uses_future` for adding callbacks; a user can set this parameter to false to unsafely skip the recordStream, if the user knows that the future will not be used in the lambda.

**Tests**: (a) unit tests; (b) added an assert in recordStream (c10/cuda/CUDACachingAllocator.cpp, line 3260 at c2c7c40) and verified that it doesn't get triggered when running basic distributed tests with the profiler enabled.