CUDACachingAllocator: Keep one event queue per stream #71745
Conversation
💊 CI failures summary (Dr. CI): As of commit b618ee7, 💚 looks good so far! There are no failures yet.
This looks good to me. I am convinced by the argument in the PR that the change is worth the (small) CPU cost.
This approach looks good to me too; I have one small comment.
c10/cuda/CUDACachingAllocator.cpp (comment on an outdated revision)
Use `ska::flat_hash_map` here? It's faster.
Will change! I deferred to the fact that this code was already using STL maps elsewhere, but I'm always happy to use a properly-optimized hash map instead of the STL.
Should I change the other maps in this file for consistency while I'm here, or leave that for a followup PR?
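For reference, a minimal sketch of the container swap under discussion, assuming the ska flat-hash containers vendored under c10/util (the upstream ska library provides both the map and the set; the variables below are purely illustrative and not names from the allocator):

```cpp
#include <string>

#include <c10/util/flat_hash_map.h>

// Open-addressing hash containers; largely drop-in replacements for the
// subset of the std::unordered_map/set API the allocator uses (insert, find,
// erase-by-key, iteration). flat_hash_set is assumed to come from the same
// vendored header, as it does in the upstream ska library.
ska::flat_hash_map<std::string, int> refcounts;
ska::flat_hash_set<std::string> seen_streams;

int main() {
  refcounts["stream0"] = 2;
  seen_streams.insert("stream0");
  bool ok = refcounts.find("stream0") != refcounts.end() &&
            seen_streams.find("stream0") != seen_streams.end();
  return ok ? 0 : 1;
}
```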
We've since run some tests on our training runs, and this gets back a bit more than half of the [active - allocated] gap, at no measurable performance regression (per our metrics). So this is definitely a substantial win for us. I'm pursuing some other theories for the rest of the gap, including that some of it is the system working-as-intended because we're freeing tensors that are still in-flight; I'll update the other issue or open new issues if I find other things that I think want to be fixed on the PyTorch side.
I chose to update all the maps and sets to the `ska::` flat hash containers.
That's fine; there's #71667 to do that, but we can do it in this PR also.
Does `ska::flat_hash_map::erase()` have the same invalidation guarantees as `std::unordered_map`?
I had this question as well. Frustratingly, looking at the source code, I'm not familiar enough with the `flat_hash_map` internals to say for sure, so I restructured the code to not depend on it.
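To make that restructuring concrete, here is a hedged sketch of draining per-stream event FIFOs without erasing map entries mid-iteration; the names and shapes are illustrative, not the exact code in CUDACachingAllocator.cpp:

```cpp
#include <cuda_runtime_api.h>

#include <deque>
#include <utility>
#include <vector>

#include <c10/cuda/CUDAStream.h>
#include <c10/util/flat_hash_map.h>

struct Block; // stand-in for the allocator's block type

using EventQueue = std::deque<std::pair<cudaEvent_t, Block*>>;
using EventMap = ska::flat_hash_map<c10::cuda::CUDAStream, EventQueue>;

// Pop completed events from the front of each stream's FIFO; since events on
// a given stream complete in recording order, we can stop at the first
// unresolved one. Streams whose queues drained are erased only after the
// loop, so we never rely on the map's erase() iterator-invalidation behavior.
void process_events_sketch(EventMap& cuda_events) {
  std::vector<c10::cuda::CUDAStream> drained;
  for (auto& entry : cuda_events) {
    EventQueue& queue = entry.second;
    while (!queue.empty() &&
           cudaEventQuery(queue.front().first) == cudaSuccess) {
      // ...decrement the block's outstanding-event count and, if it reaches
      // zero, return the block to the free pool...
      cudaEventDestroy(queue.front().first);
      queue.pop_front();
    }
    if (queue.empty()) {
      drained.push_back(entry.first);
    }
  }
  for (const auto& stream : drained) {
    cuda_events.erase(stream);
  }
}
```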
Seems like CI completely lost the plot on that push, and I don't understand why. Can someone kick it (if that's the right course of action)?
github.com is having a bad day, so lots of jobs are failing.
Can you please rebase to trigger a new CI run?
Force-pushed from 00b7a09 to 5b0f495.
Sorry about that @nelhage, but your rebase fell at a time when master was broken by an unrelated PR (hence the test failures on yours). Can you please rebase again? Really sorry about it; we try to keep CI in a good state, but it doesn't always work.
Instead of relying on erase() having the same invalidation guarantees as std::unordered_map.
Force-pushed from 5b0f495 to b618ee7.
Done! I'm in awe of how much CI y'all have, and I know the struggles of keeping it green even for much simpler builds :)
@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary: Fixes #71616

This fixes the leaks in my test case. I have not tested it on our big models yet, but will report back if we can.

This potentially impacts allocator performance in that it slightly increases the amount of CPU memory we allocate for data structures, and it means that `process_events` may look at a larger number of events in the case where there are multiple streams with long-running ops on them. However, I suspect that in general, either:
- An application isn't using very many streams or very many long-running ops, in which case the performance is essentially the same
- Or, they are, which is precisely the case where #71616 bites you, and so freeing memory faster is probably more valuable than the slight CPU overhead here.

I'm not attached to this approach or any of its details, but figured it was worth throwing up for discussion.

Pull Request resolved: #71745
Reviewed By: soulitzer
Differential Revision: D33948288
Pulled By: ngimel
fbshipit-source-id: 73e95f8a9bbe385a77de483d1c58b857b5d84e81
I think it's still possible to hit head-of-line blocking behavior with this PR's diffs (i.e., even within one stream):

```python
a = torch.tensor(...)
b = torch.tensor(...)
with torch.cuda.stream(s):
    a.record_stream(s)
    b.record_stream(s)
    quick_usage(a)
    long_usage(b)
del b
del a
```

b's end-of-life events enter the per-stream deques before a's, but take a long time to resolve, delaying the reusability of a. In any case this PR is better than the status quo.

A wild idea I had was to avoid end-of-life event deques entirely. Instead, blocks with recorded streams and associated end-of-life events at `free()` time would go into a dedicated pool, "small/largeBlocksMayStillHaveEventsPending". These two (small/large) pools would be used as intermediate fallbacks by malloc, i.e., malloc would service allocation requests by first looking for a size-appropriate block in the ordinary pools, then, if that fails, looking for a size-appropriate block in small/largeBlocksMayStillHaveEventsPending. If a suitable block is found, check its events. If the events have resolved, use the block and delete it from its Pending pool. If not, either move to the next bigger block in the Pending pool and check its events, or bail out to the usual final fallbacks (try cudaMalloc, then error). A block returned from the Pending pool to service an allocation would effectively be reset (losing its Pending-specific identity), which is fine because it'll only be returned if its events have resolved.

A downside of this approach is that the malloc Pending-pool fallback would be another std::set lookup.
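That fallback-pool idea might look something like the following hedged sketch. It is purely hypothetical; none of these names or structures exist in the allocator today, and it is not what this PR implements:

```cpp
#include <cuda_runtime_api.h>

#include <cstddef>
#include <functional>
#include <set>
#include <vector>

// Hypothetical block record: a freed block that may still have unresolved
// end-of-life events recorded against it.
struct PendingBlock {
  std::size_t size;
  void* ptr;
  std::vector<cudaEvent_t> pending_events; // recorded at free() time
};

// Order by size (smallest-fit first), with the pointer as a tiebreaker.
struct BySize {
  bool operator()(const PendingBlock* a, const PendingBlock* b) const {
    if (a->size != b->size) {
      return a->size < b->size;
    }
    return std::less<const PendingBlock*>{}(a, b);
  }
};

using PendingPool = std::set<PendingBlock*, BySize>; // one each for small/large

// Returns true iff every end-of-life event on this block has resolved.
bool events_resolved(const PendingBlock* block) {
  for (cudaEvent_t event : block->pending_events) {
    if (cudaEventQuery(event) != cudaSuccess) {
      return false; // still in flight (real code would also check for errors)
    }
  }
  return true;
}

// Fallback used only after the ordinary pools come up empty: take the
// smallest block that fits and whose events have all resolved. Returning
// nullptr means the caller falls through to cudaMalloc / OOM handling.
PendingBlock* try_pending_pool(PendingPool& pool, std::size_t size) {
  for (auto it = pool.begin(); it != pool.end(); ++it) {
    PendingBlock* block = *it;
    if (block->size < size || !events_resolved(block)) {
      continue;
    }
    pool.erase(it);
    block->pending_events.clear(); // reusable now; drop its "pending" identity
    return block;
  }
  return nullptr;
}
```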
Fixes #71616
This fixes the leaks in my test case. I have not tested it on our big models yet, but will report back if we can.
This potentially impacts allocator performance in that it slightly increases the amount of CPU memory we allocate for data structures, and it means that `process_events` may look at a larger number of events in the case where there are multiple streams with long-running ops on them. However, I suspect that in general, either:
- An application isn't using very many streams or very many long-running ops, in which case the performance is essentially the same
- Or, they are, which is precisely the case where #71616 bites you, and so freeing memory faster is probably more valuable than the slight CPU overhead here.

I'm not attached to this approach or any of its details, but figured it was worth throwing up for discussion.
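To recap the data-structure change in code form, here is a hedged sketch; the names are illustrative, not the exact declarations in CUDACachingAllocator.cpp:

```cpp
#include <cuda_runtime_api.h>

#include <deque>
#include <utility>

#include <c10/cuda/CUDAStream.h>
#include <c10/util/flat_hash_map.h>

struct Block; // the allocator's block type (details elided)

// Before: one global FIFO of outstanding end-of-life events. A single
// unresolved event at the front (e.g. recorded behind a long-running kernel
// on one stream) blocks processing of every later event, even events on other
// streams that have already completed, so their blocks can't be reused yet
// (the behavior reported in #71616).
std::deque<std::pair<cudaEvent_t, Block*>> cuda_events_single;

// After (this PR): one FIFO per recording stream. Events on a given stream
// resolve in the order they were recorded, so popping only from each queue's
// front is exact per stream, and a slow stream no longer delays frees whose
// events were recorded on other streams.
ska::flat_hash_map<c10::cuda::CUDAStream, std::deque<std::pair<cudaEvent_t, Block*>>>
    cuda_events_per_stream;
```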