Conversation

nelhage
Contributor

@nelhage nelhage commented Jan 25, 2022

Fixes #71616

This fixes the leaks in my test case. I have not tested it on our big models yet, but will report back if we can.

This potentially impacts allocator performance in that it slightly increases the amount of CPU memory we allocate for data structures, and it means that `process_events` may look at a larger number of events in the case where there are multiple streams with long-running ops on them.

However, I suspect that in general, either:

- An application isn't using very many streams or very many long-running ops, in which case the performance is essentially the same
- Or, they are, which is precisely the case where #71616 bites you, and so freeing memory faster is probably more valuable than the slight CPU overhead here.

I'm not attached to this approach or any of its details, but figured it was worth throwing up for discussion.
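
For readers skimming the thread, here is a minimal sketch of what per-stream event tracking along these lines could look like. It is illustrative only, not the PR's actual diff; the container choice, the `EventTracker`/`free_block` names, and the error handling are all hypothetical.

```cpp
// Illustrative sketch: one FIFO of outstanding (event, block) pairs per
// stream, so a long-running op on one stream no longer delays reclaiming
// blocks whose events were recorded on other streams.
#include <cuda_runtime.h>
#include <deque>
#include <iterator>
#include <unordered_map>
#include <utility>

struct Block;  // allocator block; details omitted

struct EventTracker {
  std::unordered_map<cudaStream_t,
                     std::deque<std::pair<cudaEvent_t, Block*>>>
      events_per_stream;

  // Hypothetical helper: return the block to its pool (details omitted).
  void free_block(Block* /*b*/) {}

  void process_events() {
    for (auto it = events_per_stream.begin(); it != events_per_stream.end();) {
      auto& queue = it->second;
      while (!queue.empty()) {
        // Events recorded on one stream complete in order, so we only ever
        // need to poll the head of each per-stream queue.
        if (cudaEventQuery(queue.front().first) == cudaErrorNotReady) {
          break;  // this stream still has work in flight; move to the next one
        }
        free_block(queue.front().second);
        cudaEventDestroy(queue.front().first);
        queue.pop_front();
      }
      // Drop empty queues; erase() returns the next valid iterator.
      it = queue.empty() ? events_per_stream.erase(it) : std::next(it);
    }
  }
};
```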

@pytorch-bot

pytorch-bot bot commented Jan 25, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/nelhage/pytorch/blob/3032d0ce9fa3b61750a53848bcf8dc9a1e1572c2/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Workflow Labels (bold = enabled) Status
Triggered Workflows
linux-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
linux-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
linux-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
linux-binary-manywheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk, ciflow/xla ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-bionic-rocm4.5-py3.7 ciflow/all, ciflow/linux, ciflow/rocm, ciflow/trunk 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped

@facebook-github-bot
Contributor

facebook-github-bot commented Jan 25, 2022

💊 CI failures summary and remediations

As of commit b618ee7 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@anjali411 anjali411 added the module: cuda label (Related to torch.cuda, and CUDA support in general) Jan 25, 2022
@anjali411 anjali411 requested a review from ngimel January 25, 2022 20:46
@anjali411 anjali411 added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jan 25, 2022
@ngimel ngimel requested a review from mcarilli January 25, 2022 21:06
Member

@colesbury colesbury left a comment

This looks good to me. I am convinced by the argument in the PR that the change is worth the (small) CPU cost.

Collaborator

@ngimel ngimel left a comment

This approach looks good to me too, I have one small comment.

Collaborator

Use ska::flat_hash_map here? It's faster.

Contributor Author

Will change! I went with STL maps because this code was already using them elsewhere, but I'm always happy to use a properly optimized hash map instead of the STL one.

Contributor Author

Should I change the other maps in this file for consistency while I'm here, or leave that for a followup PR?

@nelhage
Contributor Author

nelhage commented Jan 27, 2022

> This fixes the leaks in my test case. I have not tested it on our big models yet, but will report back if we can.

We've since run some tests on our training runs, and this gets back a bit more than half of the [active - allocated] gap, with no measurable performance regression (per our metrics). So this is definitely a substantial win for us.

I'm pursuing some other theories for the rest of the gap, including that some of it is the system working-as-intended because we're freeing tensors that are still in-flight; I'll update the other issue or open new issues if I find other things that I think want to be fixed on the PyTorch side.

@nelhage
Contributor Author

nelhage commented Jan 27, 2022

I chose to update all the maps and sets to ska:: for consistency. Happy to back that out if you'd prefer a second PR.

@ngimel
Collaborator

ngimel commented Jan 27, 2022

That's fine, there's #71667 to do that, but we can do it in this PR also.

@colesbury
Member

Does ska::flat_hash_map have the same guarantees about iterator validity? In particular, does erase() only invalidate the iterator to the deleted entry (and not other iterators)?

@nelhage
Contributor Author

nelhage commented Jan 27, 2022

> Does ska::flat_hash_map have the same guarantees about iterator validity? In particular, does erase() only invalidate the iterator to the deleted entry (and not other iterators)?

I had the same question. Frustratingly, ska::flat_hash_map does not seem to document its iterator-invalidation policy explicitly.

(abseil's flat_hash_map does document this detail and erase promises not to rehash, but that's not the one we're using).

Looking at the source code, I'm not familiar enough with ska::flat_hash_map to say for sure offhand, but it does worry me that it moves elements around. The safest course of action is probably to rely on the return value from erase instead, so I'll put that change up.
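
For illustration, the erase-return-value pattern sidesteps the invalidation question entirely, since no iterator is reused after the container has been mutated. A generic sketch, written against the standard map API rather than ska::'s internals:

```cpp
// Remove entries matching a predicate without relying on any invalidation
// guarantee beyond erase() returning an iterator to the next element.
#include <string>
#include <unordered_map>

template <typename Map, typename Pred>
void erase_if_matching(Map& m, Pred pred) {
  for (auto it = m.begin(); it != m.end();) {
    if (pred(*it)) {
      it = m.erase(it);  // keep only the iterator erase() hands back
    } else {
      ++it;
    }
  }
}

// Usage, e.g.:
//   std::unordered_map<int, std::string> m{{1, "a"}, {2, "b"}};
//   erase_if_matching(m, [](const auto& kv) { return kv.first == 1; });
```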

@nelhage
Contributor Author

nelhage commented Jan 27, 2022

Seems like CI completely lost the plot on that push; I don't understand why. Can someone kick it (if that's the right course of action)?

@ngimel
Collaborator

ngimel commented Jan 28, 2022

github.com is having a bad day, so lots of jobs are failing

@ngimel
Collaborator

ngimel commented Jan 28, 2022

Can you please rebase to trigger new CI run?

@ngimel
Collaborator

ngimel commented Jan 31, 2022

Sorry about that @nelhage, but your rebase landed at a time when master was broken by an unrelated PR (hence the test failures on yours); can you please rebase again? Really sorry about it; we try to keep CI in a good state, but it doesn't always work.

@nelhage
Contributor Author

nelhage commented Jan 31, 2022

Done! I'm in awe of how much CI y'all have, and I know the struggles of keeping it green even for much simpler builds :)

@facebook-github-bot
Contributor

@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Feb 3, 2022
Summary:
Fixes #71616

This fixes the leaks in my test case. I have not tested it on our big models yet, but will report back if we can.

This potentially impacts allocator performance in that it slightly increases the amount of CPU memory we allocate for data structures, and it means that `process_events` may look at a larger number of events in the case where there are multiple streams with long-running ops on them.

However, I suspect that in general, either:
- An application isn't using very many streams or very many long-running ops, in which case the performance is essentially the same
- Or, they are, which is precisely the case where #71616 bites you, and so freeing memory faster is probably more valuable than the slight CPU overhead here.

I'm not attached to this approach or any of its details, but figured it was worth throwing up for discussion.

Pull Request resolved: #71745

Reviewed By: soulitzer

Differential Revision: D33948288

Pulled By: ngimel

fbshipit-source-id: 73e95f8a9bbe385a77de483d1c58b857b5d84e81
pytorchmergebot pushed a commit that referenced this pull request Feb 3, 2022
@github-actions
Contributor

github-actions bot commented Feb 3, 2022

Hey nelhage.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Feb 3, 2022
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Feb 3, 2022
@nelhage nelhage deleted the events-per-stream branch February 4, 2022 01:33
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Feb 9, 2022
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Feb 9, 2022
@mcarilli
Collaborator

mcarilli commented Feb 9, 2022

I think it's still possible to hit head-of-line blocking behavior with this PR's diffs (i.e., even within one stream):

a = torch.tensor(...)
b = torch.tensor(...)
with torch.cuda.stream(s):
    a.record_stream(s)
    b.record_stream(s)
    quick_usage(a)  # short-running op on a
    long_usage(b)   # long-running op on b
    del b           # b's end-of-life event is recorded first but resolves last
    del a           # a's event sits behind b's in the same stream's queue

b's end-of-life events enter the per-stream deques before a's, but take a long time to resolve, delaying the reusability of a.

In any case this PR is better than the status quo.

A wild idea I had was to avoid end-of-life event deques entirely. Instead, blocks with recorded streams and associated end-of-life events would, at free() time, go into a dedicated pool, "small/largeBlocksMayStillHaveEventsPending". These two (small/large) pools would be used as intermediate fallbacks by malloc, i.e., malloc would service allocation requests by first looking for a size-appropriate block in the ordinary pools, then, if that fails, looking for a size-appropriate block in small/largeBlocksMayStillHaveEventsPending. If a suitable block is found, check its events. If the events have resolved, use the block and delete it from its Pending pool. If not, optionally move to the next bigger block in the Pending pool and check its events, or bail out to the usual final fallbacks (try cudaMalloc, then error). A block returned from the Pending pool to service an allocation would effectively be reset (losing its Pending-specific identity), which is fine because it will only be returned if its events have resolved.

A downside of this approach is that the malloc Pending-pool fallback would be another std::set lookup.
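
A very rough sketch of how that allocation path might read, purely to illustrate the idea; every type and helper here (`BlockPool`, `find_free_block`, `candidates_by_size`, `all_events_resolved`, `cuda_malloc_block`) is hypothetical and stands in for real allocator internals.

```cpp
// Hypothetical fallback order for the "pending pool" idea: ordinary pool
// first, then blocks whose end-of-life events may still be outstanding,
// then cudaMalloc, then give up.
#include <cstddef>
#include <new>
#include <set>
#include <vector>

struct Block { std::size_t size = 0; /* device pointer, recorded events, ... */ };
using BlockPool = std::set<Block*>;

// Stand-ins for allocator internals (declarations only; implementations omitted).
Block* find_free_block(BlockPool& pool, std::size_t size);
std::vector<Block*> candidates_by_size(BlockPool& pending, std::size_t size);
bool all_events_resolved(Block* b);  // e.g. cudaEventQuery() succeeded for each event
Block* cuda_malloc_block(std::size_t size);

Block* malloc_block(std::size_t size, BlockPool& pool, BlockPool& pending_pool) {
  // 1. Ordinary pool, as today.
  if (Block* b = find_free_block(pool, size)) {
    return b;
  }
  // 2. Fallback: size-appropriate blocks in the "may still have events
  //    pending" pool, smallest suitable candidate first.
  for (Block* b : candidates_by_size(pending_pool, size)) {
    if (all_events_resolved(b)) {
      pending_pool.erase(b);  // block loses its pending identity; safe to reuse
      return b;
    }
    // Otherwise try the next (bigger) candidate, or fall through below.
  }
  // 3. Usual final fallbacks: try cudaMalloc, then error out.
  if (Block* b = cuda_malloc_block(size)) {
    return b;
  }
  throw std::bad_alloc();
}
```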


Labels

cla signed; module: cuda (Related to torch.cuda, and CUDA support in general); open source; triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Effective memory leak due to head-of-line blocking in CUDACachingAllocator process_events.

7 participants