Conversation

davidberard98 (Contributor) commented Sep 23, 2023

Stack from ghstack (oldest at bottom):

**Background**: recordStream calls can result in memory spikes, so we don't want them to appear in FSDP (https://dev-discuss.pytorch.org/t/fsdp-cudacachingallocator-an-outsider-newb-perspective/1486). @awgu is working on fixing this, but it turns out the profiler, when enabled, was causing recordStream to be called.

Why the profiler was causing recordStream to get called: NCCL calls add profiler events manually; they register a callback to be executed when the future for the collective completes, and that callback marks the end of the CPU-side profiler event:

work->future_->addCallback([work](at::ivalue::Future& /* unused */) {
  work->recordFunctionEndCallback_();
});

(https://github.com/pytorch/pytorch/blob/c2c7c4035f41f90116aadf2df3f5d5b4834f838b/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1822-L1824)

In order to guarantee safety, ivalue::Future::invokeCallback calls `recordStream` on the future's storage buffers; this marks the fact that other streams (e.g. the one the callback runs on) may need to use the storage:

synchronizeWithCurrentStreams();
callback(*this);

(https://github.com/pytorch/pytorch/blob/c2c7c4035f41f90116aadf2df3f5d5b4834f838b/aten/src/ATen/core/ivalue_inl.h#L1171-L1173)
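To make the cost concrete, here is a rough sketch of what that pre-callback safety step amounts to. This is not the actual ivalue::Future source; the signature and the `events`/`storages` parameters are placeholders for the future's internal state:

```cpp
#include <ATen/cuda/CUDAEvent.h>
#include <c10/core/Storage.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAStream.h>
#include <vector>

// Rough sketch of the pre-callback safety step; placeholder signature.
void synchronizeWithCurrentStreamsSketch(
    std::vector<at::cuda::CUDAEvent>& events,
    std::vector<c10::Storage>& storages) {
  auto stream = c10::cuda::getCurrentCUDAStream();
  // 1) Stream-order the callback after the collective's kernels.
  for (auto& ev : events) {
    ev.block(stream);
  }
  // 2) Tell the caching allocator that each storage may now be used on the
  //    callback's stream. This is the recordStream call that delays block
  //    reuse until that stream catches up, i.e. the source of the spikes.
  for (auto& storage : storages) {
    c10::cuda::CUDACachingAllocator::recordStream(storage.data_ptr(), stream);
  }
}
```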

**Change**: The end-profiler-event callback doesn't actually use the future, so we don't need to call recordStream for it. This PR introduces an optional parameter, `uses_future`, for adding callbacks; a user can set it to `false` to (unsafely) skip the recordStream when they know the lambda will not use the future.
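A minimal sketch of the shape of that API change, assuming a simplified non-template Future (the real ivalue::Future is templated and does more bookkeeping):

```cpp
// Simplified sketch of the new parameter; not the actual ivalue::Future code.
void Future::addCallback(std::function<void(Future&)> callback,
                         bool uses_future /* = true */) {
  std::unique_lock<std::mutex> lock(mutex_);
  if (completed_) {
    lock.unlock();
    invokeCallback(std::move(callback), uses_future);
    return;
  }
  callbacks_.emplace_back(std::move(callback), uses_future);
}

void Future::invokeCallback(std::function<void(Future&)> callback,
                            bool uses_future) {
  // Only pay for stream synchronization + recordStream when the callback
  // can actually touch the future's CUDA storages.
  if (uses_future) {
    synchronizeWithCurrentStreams();
  }
  callback(*this);
}
```

The ProcessGroupNCCL call site quoted in the review thread below then passes `/*uses_future=*/false`, since the end-profiler-event lambda ignores its argument.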

**Tests**: (a) unit tests; (b) added an assert in recordStream (https://github.com/pytorch/pytorch/blob/c2c7c4035f41f90116aadf2df3f5d5b4834f838b/c10/cuda/CUDACachingAllocator.cpp#L3260):

void recordStream(const DataPtr& ptr, cuda::CUDAStream stream) override {

and verified that it doesn't get triggered when running basic distributed tests with the profiler enabled.
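The assert itself was a throwaway debugging aid, roughly of this shape (hypothetical sketch; it was not landed as part of the PR):

```cpp
// Hypothetical local-only check: fail loudly if anything reaches
// recordStream while the profiled distributed tests run.
void recordStream(const DataPtr& ptr, cuda::CUDAStream stream) override {
  TORCH_INTERNAL_ASSERT(
      false, "recordStream should not be reached in this test run");
  // ... the allocator's normal recordStream bookkeeping follows ...
}
```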

TODO - description of the chain of calls:
* The ProcessGroupNCCL callback that ends the profiler event
* triggers the future's callback handling,
* which calls synchronizeWithCurrentStreams(),
* which in turn calls recordStream.

Why this is WIP: I don't actually know why we need the synchronizeWithCurrentStreams(); I think we might actually be able to remove that sync entirely.

pytorch-bot (bot) commented Sep 23, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109933

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 156d115 with merge base 7e6cf04:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added the release notes: distributed (c10d) label Sep 23, 2023
davidberard98 added a commit that referenced this pull request Sep 23, 2023

ghstack-source-id: 784d2ac
Pull Request resolved: #109933
davidberard98 added a commit that referenced this pull request Sep 24, 2023

ghstack-source-id: 0b9886a
Pull Request resolved: #109933
davidberard98 added a commit that referenced this pull request Sep 26, 2023

ghstack-source-id: 7ff5773
Pull Request resolved: #109933
davidberard98 added a commit that referenced this pull request Sep 29, 2023

ghstack-source-id: 171e138
Pull Request resolved: #109933
davidberard98 added the ciflow/trunk label Sep 29, 2023
davidberard98 marked this pull request as ready for review September 29, 2023 21:10
davidberard98 changed the title from "[WIP] Remove recordStream for callback that ends a profiler event" to "[distributed] Remove recordStream for callback that ends a profiler event" Sep 29, 2023
davidberard98 requested a review from lw September 29, 2023 21:11
davidberard98 (Author) commented:

@pytorchbot rebase -s

pytorchmergebot (Collaborator) commented:

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot (Collaborator) commented:
Successfully rebased gh/davidberard98/229/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/109933)

pytorchmergebot pushed a commit that referenced this pull request Sep 29, 2023

ghstack-source-id: ce83955
Pull Request resolved: #109933
davidberard98 added the topic: not user facing label Sep 29, 2023
awgu (Collaborator) commented Sep 29, 2023:

Amazing work!

Review thread on the updated ProcessGroupNCCL call site:

work->future_->addCallback(
    [work](at::ivalue::Future& /* unused */) {
      work->recordFunctionEndCallback_();
    },
    /*uses_future=*/false);
Contributor commented:

Do we want to set this to false directly, or do we want to make it configurable?

davidberard98 (Author) replied:

My understanding is that, since the callback doesn't use the future, there's no point in synchronizing when the callback is invoked.

But if there's a reason to, LMK and I can make it configurable.

Contributor commented:

I think that is mostly for gating and rollout purposes. This PR in general looks good to me, but have we battle-tested it on all FSDP workloads? If not, we might want to gate it for now so that, instead of reverting the PR, we can iterate on top of this change. WDYT?

Collaborator commented:

I think that since we control the callback (so we know that we do not use the future), this change seems reasonable to me without gating. In my experience with FSDP, gating ends up costing more in maintenance than it helps.

Contributor commented:

I wonder if we want to add some comments saying something like: if the future is non-empty, uses_future must be set to true. Or is it possible to add a check that the lambda takes an unnamed argument? (Maybe not, lol.)

davidberard98 (Author) replied:

> add some comments

will do

> add a check if the lambda func is taking an unnamed argument

I tried to do this in https://github.com/pytorch/pytorch/pull/109933/files/b3ce2ec4fab56f5173fdbcb681759d7887832da2 but couldn't get the Windows builds to pass (the general shape of that dispatch is sketched below).
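For context, the compile-time dispatch referred to above would look roughly like the following. This is an illustrative sketch of the technique only, not the code from the linked commit; `Future` here is a stand-in for at::ivalue::Future:

```cpp
#include <type_traits>
#include <utility>

struct Future {}; // stand-in for at::ivalue::Future in this sketch

// Dispatch on the lambda's signature so that both void(Future&) and
// void() callbacks are accepted.
template <typename T>
void invokeCallbackSketch(Future& fut, T&& callback) {
  if constexpr (std::is_invocable_v<T, Future&>) {
    std::forward<T>(callback)(fut); // callback wants the future
  } else {
    static_assert(std::is_invocable_v<T>,
                  "The callback must have signature void(Future&) or void()");
    std::forward<T>(callback)(); // callback ignores the future entirely
  }
}
```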

  static_assert(
      std::is_invocable_r<void, T, Future&>::value,
-     "The callback must have signature void(Future&)");
+     "The callback must have signature void(Future&) or void()");
Contributor commented:

Is the "or void()" thing stale? I didn't see implementations for a void() callback. (And if it is supported, then why not use it in the PGNCCL callbacks below?)

davidberard98 (Author) replied:

You're right, it's stale; let me remove that. I was trying to get addCallback to take void() lambdas, and I got it working on Linux builds but not on Windows builds.

wconstab (Contributor) left a comment:

This looks good to me. thanks @davidberard98!

xw285cornell (Contributor) left a comment:

Great work!! BTW, I wonder how you would test this PR and make sure it won't regress (you mentioned a unit test + assertion, though I might have missed it). Maybe mock the record function and check the call? (BTW we don't have to do it here; maybe some follow-ups.)
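As a toy illustration of that testing idea (not PyTorch test code; all names here are made up), one could count recordStream-style calls through a mocked future and assert the count stays zero when `uses_future=false`:

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Toy mock: the counter stands in for CUDACachingAllocator::recordStream.
std::atomic<int> record_stream_calls{0};

struct ToyFuture {
  void addCallback(const std::function<void()>& cb, bool uses_future = true) {
    if (uses_future) {
      record_stream_calls++;  // stands in for synchronizeWithCurrentStreams()
    }
    cb();  // run the callback (e.g. "end the profiler event")
  }
};

int main() {
  ToyFuture fut;
  fut.addCallback([] { /* end profiler event */ }, /*uses_future=*/false);
  assert(record_stream_calls.load() == 0);  // recordStream was never hit
  return 0;
}
```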

davidberard98 added a commit that referenced this pull request Oct 3, 2023

ghstack-source-id: 20a3773
Pull Request resolved: #109933
awgu (Collaborator) commented Oct 3, 2023:

Land, land, land!

davidberard98 (Author) commented:

@pytorchbot merge

pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

facebook-github-bot deleted the gh/davidberard98/229/head branch October 7, 2023 14:22