CUDACachingAllocator: Keep one event queue per stream #71745
Conversation
💊 CI failures summary (Dr. CI): As of commit b618ee7, 💚 looks good so far! There are no failures yet.
This looks good to me. I am convinced by the argument in the PR that the change is worth the (small) CPU cost.
This approach looks good to me too; I have one small comment.
c10/cuda/CUDACachingAllocator.cpp (comment on an outdated revision)
Use `ska::flat_hash_map` here? It's faster.
Will change! I deferred to the fact that this code was already using STL maps elsewhere, but I'm always happy to use a properly-optimized hash map instead of the STL.
Should I change the other maps in this file for consistency while I'm here, or leave that for a followup PR?
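For reference, a minimal sketch of the container swap under discussion, assuming the ska flat-hash containers vendored under c10/util (the upstream ska library provides both the map and the set; the variables below are purely illustrative and not names from the allocator):

```cpp
#include <string>

#include <c10/util/flat_hash_map.h>

// Open-addressing hash containers; largely drop-in replacements for the
// subset of the std::unordered_map/set API the allocator uses (insert, find,
// erase-by-key, iteration). flat_hash_set is assumed to come from the same
// vendored header, as it does in the upstream ska library.
ska::flat_hash_map<std::string, int> refcounts;
ska::flat_hash_set<std::string> seen_streams;

int main() {
  refcounts["stream0"] = 2;
  seen_streams.insert("stream0");
  bool ok = refcounts.find("stream0") != refcounts.end() &&
            seen_streams.find("stream0") != seen_streams.end();
  return ok ? 0 : 1;
}
```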
We've since run some tests on our training runs, and this gets back a bit more than half of the [active - allocated] gap, at no measurable performance regression (per our metrics). So this is definitely a substantial win for us. I'm pursuing some other theories for the rest of the gap, including that some of it is the system working-as-intended because we're freeing tensors that are still in-flight; I'll update the other issue or open new issues if I find other things that I think want to be fixed on the PyTorch side.
I chose to update all the maps and sets to the `ska::` flat hash containers.
That's fine; there's #71667 to do that, but we can do it in this PR also.
Does `ska::flat_hash_map::erase()` have the same invalidation guarantees as `std::unordered_map`?
I had this question as well. Frustratingly, looking at the source code, I'm not familiar enough with the `flat_hash_map` internals to say for sure, so I restructured the code to not depend on it.
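To make that restructuring concrete, here is a hedged sketch of draining per-stream event FIFOs without erasing map entries mid-iteration; the names and shapes are illustrative, not the exact code in CUDACachingAllocator.cpp:

```cpp
#include <cuda_runtime_api.h>

#include <deque>
#include <utility>
#include <vector>

#include <c10/cuda/CUDAStream.h>
#include <c10/util/flat_hash_map.h>

struct Block; // stand-in for the allocator's block type

using EventQueue = std::deque<std::pair<cudaEvent_t, Block*>>;
using EventMap = ska::flat_hash_map<c10::cuda::CUDAStream, EventQueue>;

// Pop completed events from the front of each stream's FIFO; since events on
// a given stream complete in recording order, we can stop at the first
// unresolved one. Streams whose queues drained are erased only after the
// loop, so we never rely on the map's erase() iterator-invalidation behavior.
void process_events_sketch(EventMap& cuda_events) {
  std::vector<c10::cuda::CUDAStream> drained;
  for (auto& entry : cuda_events) {
    EventQueue& queue = entry.second;
    while (!queue.empty() &&
           cudaEventQuery(queue.front().first) == cudaSuccess) {
      // ...decrement the block's outstanding-event count and, if it reaches
      // zero, return the block to the free pool...
      cudaEventDestroy(queue.front().first);
      queue.pop_front();
    }
    if (queue.empty()) {
      drained.push_back(entry.first);
    }
  }
  for (const auto& stream : drained) {
    cuda_events.erase(stream);
  }
}
```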
Seems like CI completely lost the plot on that push, and I don't understand why. Can someone kick it (if that's the right course of action)?
github.com is having a bad day, so lots of jobs are failing.
Can you please rebase to trigger a new CI run?
Force-pushed from 00b7a09 to 5b0f495.
Sorry about that @nelhage, but your rebase fell at a time when master was broken by an unrelated PR (hence the test failures on yours). Can you please rebase again? Really sorry about it; we try to keep CI in a good state, but it doesn't always work.
Instead of relying on erase() having the same invalidation guarantees as std::unordered_map.
Force-pushed from 5b0f495 to b618ee7.
Done! I'm in awe of how much CI y'all have, and I know the struggles of keeping it green even for much simpler builds :)
@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary: Fixes #71616

This fixes the leaks in my test case. I have not tested it on our big models yet, but will report back if we can.

This potentially impacts allocator performance in that it slightly increases the amount of CPU memory we allocate for data structures, and it means that `process_events` may look at a larger number of events in the case where there are multiple streams with long-running ops on them. However, I suspect that in general, either:
- An application isn't using very many streams or very many long-running ops, in which case the performance is essentially the same
- Or, they are, which is precisely the case where #71616 bites you, and so freeing memory faster is probably more valuable than the slight CPU overhead here.

I'm not attached to this approach or any of its details, but figured it was worth throwing up for discussion.

Pull Request resolved: #71745
Reviewed By: soulitzer
Differential Revision: D33948288
Pulled By: ngimel
fbshipit-source-id: 73e95f8a9bbe385a77de483d1c58b857b5d84e81
I think it's still possible to hit head-of-line blocking behavior with this PR's diffs (i.e., even within one stream):

```python
a = torch.tensor(...)
b = torch.tensor(...)
with torch.cuda.stream(s):
    a.record_stream(s)
    b.record_stream(s)
    quick_usage(a)
    long_usage(b)
del b
del a
```

b's end-of-life events enter the per-stream deques before a's, but take a long time to resolve, delaying the reusability of a. In any case this PR is better than the status quo.

A wild idea I had was to avoid end-of-life event deques entirely. Instead, blocks with recorded streams and associated end-of-life events at `free()` time would go into a dedicated pool, "small/largeBlocksMayStillHaveEventsPending". These two (small/large) pools would be used as intermediate fallbacks by malloc, i.e., malloc would service allocation requests by first looking for a size-appropriate block in the ordinary pools, then, if that fails, looking for a size-appropriate block in small/largeBlocksMayStillHaveEventsPending. If a suitable block is found, check its events. If the events have resolved, use the block and delete it from its Pending pool. If not, either move to the next bigger block in the Pending pool and check its events, or bail out to the usual final fallbacks (try cudaMalloc, then error). A block returned from the Pending pool to service an allocation would effectively be reset (losing its Pending-specific identity), which is fine because it'll only be returned if its events have resolved.

A downside of this approach is that the malloc Pending-pool fallback would be another std::set lookup.
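That fallback-pool idea might look something like the following hedged sketch. It is purely hypothetical; none of these names or structures exist in the allocator today, and it is not what this PR implements:

```cpp
#include <cuda_runtime_api.h>

#include <cstddef>
#include <functional>
#include <set>
#include <vector>

// Hypothetical block record: a freed block that may still have unresolved
// end-of-life events recorded against it.
struct PendingBlock {
  std::size_t size;
  void* ptr;
  std::vector<cudaEvent_t> pending_events; // recorded at free() time
};

// Order by size (smallest-fit first), with the pointer as a tiebreaker.
struct BySize {
  bool operator()(const PendingBlock* a, const PendingBlock* b) const {
    if (a->size != b->size) {
      return a->size < b->size;
    }
    return std::less<const PendingBlock*>{}(a, b);
  }
};

using PendingPool = std::set<PendingBlock*, BySize>; // one each for small/large

// Returns true iff every end-of-life event on this block has resolved.
bool events_resolved(const PendingBlock* block) {
  for (cudaEvent_t event : block->pending_events) {
    if (cudaEventQuery(event) != cudaSuccess) {
      return false; // still in flight (real code would also check for errors)
    }
  }
  return true;
}

// Fallback used only after the ordinary pools come up empty: take the
// smallest block that fits and whose events have all resolved. Returning
// nullptr means the caller falls through to cudaMalloc / OOM handling.
PendingBlock* try_pending_pool(PendingPool& pool, std::size_t size) {
  for (auto it = pool.begin(); it != pool.end(); ++it) {
    PendingBlock* block = *it;
    if (block->size < size || !events_resolved(block)) {
      continue;
    }
    pool.erase(it);
    block->pending_events.clear(); // reusable now; drop its "pending" identity
    return block;
  }
  return nullptr;
}
```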
Fixes #71616
This fixes the leaks in my test case. I have not tested it on our big models yet, but will report back if we can.
This potentially impacts allocator performance in that it slightly increases the amount of CPU memory we allocate for data structures, and it means that `process_events` may look at a larger number of events in the case where there are multiple streams with long-running ops on them. However, I suspect that in general, either:
- An application isn't using very many streams or very many long-running ops, in which case the performance is essentially the same
- Or, they are, which is precisely the case where #71616 bites you, and so freeing memory faster is probably more valuable than the slight CPU overhead here.

I'm not attached to this approach or any of its details, but figured it was worth throwing up for discussion.
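To recap the data-structure change in code form, here is a hedged sketch; the names are illustrative, not the exact declarations in CUDACachingAllocator.cpp:

```cpp
#include <cuda_runtime_api.h>

#include <deque>
#include <utility>

#include <c10/cuda/CUDAStream.h>
#include <c10/util/flat_hash_map.h>

struct Block; // the allocator's block type (details elided)

// Before: one global FIFO of outstanding end-of-life events. A single
// unresolved event at the front (e.g. recorded behind a long-running kernel
// on one stream) blocks processing of every later event, even events on other
// streams that have already completed, so their blocks can't be reused yet
// (the behavior reported in #71616).
std::deque<std::pair<cudaEvent_t, Block*>> cuda_events_single;

// After (this PR): one FIFO per recording stream. Events on a given stream
// resolve in the order they were recorded, so popping only from each queue's
// front is exact per stream, and a slow stream no longer delays frees whose
// events were recorded on other streams.
ska::flat_hash_map<c10::cuda::CUDAStream, std::deque<std::pair<cudaEvent_t, Block*>>>
    cuda_events_per_stream;
```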