[CUDA graphs] Private mempools for CUDA graphs #51436
Conversation
c10/cuda/CUDACachingAllocator.cpp (Outdated)

// Maps a capturing stream to its assigned private pool,
// in case we want multiple captures to share the same pool
std::unordered_map<CUDACaptureid_t/*capture id*/, CUDACaptureid_t/*pool id*/> capture_to_pool_map;
You sure you want to use an unordered_map here? They're famously slow. I guess this only slows you down when capturing so it's nbd.
Capture ids are assigned to each capture by CUDA; we can't predict their values. (In CUDA today, I think they happen to be sequential starting from 1, but that's not guaranteed and may change.) So afaict we need unordered_map, but I'm open to better ideas. As you say, it only affects capture, so performance isn't crucial.
If you need a non-crappy hash map, flat_hash_map may be of use.
For this case I think unordered_map is alright, but I don't mind changing to flat_hash_map if you have strong feelings about it.
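(For reference, a minimal sketch of what the flat_hash_map variant might look like, assuming the vendored header at c10/util/flat_hash_map.h; the id alias below is an illustrative stand-in, not the allocator's real capture-id type.)

```cpp
#include <c10/util/flat_hash_map.h>

// Illustrative stand-in for the capture/pool id type used by the allocator.
using IllustrativeCaptureId = unsigned long long;

// Same mapping as above, but backed by ska::flat_hash_map instead of
// std::unordered_map. It is only touched during capture, so either works.
ska::flat_hash_map<IllustrativeCaptureId /*capture id*/, IllustrativeCaptureId /*pool id*/>
    capture_to_pool_map;
```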
c10/cuda/CUDACachingAllocator.cpp (Outdated)

} else if (&pool == &large_blocks) {
  return StatType::LARGE_POOL;
} else {
  for (auto& gp : graph_pools) {
This changes the asymptotics of this function for graph pools; it would be good to make sure people aren't assuming this is free. (It might be easier to just store StatType on BlockPool after this change.)
Hmm, I'm not sure I understand what you're suggesting here.
The potential problem is a call site calling this on private pools while expecting it to be O(1) when it is not. The potential solution: if a pool records, as a field, whether it is small or large, you don't have to do the pointer test; you can just read it out of the field. But this is only a very minor problem.
New BlockPool.is_small field avoids this address-checking loop.
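Roughly the shape of that change, as a simplified sketch (the real StatType enum and BlockPool carry more members than shown here):

```cpp
#include <set>

struct Block;  // the real Block also carries size, stream, pool pointer, etc.

enum class StatType { SMALL_POOL, LARGE_POOL };

struct BlockPool {
  std::set<Block*> blocks;
  bool is_small = false;  // set once when the pool is constructed
};

// O(1) no matter how many private (graph) pools exist: no address comparison
// against small_blocks/large_blocks and no loop over graph_pools.
StatType get_stat_type_for_pool(const BlockPool& pool) {
  return pool.is_small ? StatType::SMALL_POOL : StatType::LARGE_POOL;
}
```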
c10/cuda/CUDACachingAllocator.cpp (Outdated)

} else if (block->pool == &large_blocks) {
  return remaining > kSmallSize;
} else {
  for (auto& gp : graph_pools) {
This definitely looks very perf fishy
It is perf fishy, but for common eager-mode allocations we always return before reaching the loop. The loop should only be reached when allocating from a private pool, in other words, while capturing. The only perf danger I see here is that I've increased the code size in a fairly hot function, which might increase icache pressure. No idea if that effect is tangible.
New BlockPool.is_small field avoids the loop.
// captures_underway tracks if a capture might be underway on any stream.
// Most of the time it's zero, in which case malloc can avoid calling
// cudaStreamGetCaptureInfo in the hot path.
int captures_underway = 0;
There's a stronger invariant here, something like captures underway equals the sum of all use counts of pools. If we had some slow invariant checker for the caching allocator that would be a good one to check.
"captures underway equals the sum of all use counts of pools"

I don't think this is an invariant, because captures_underway is only nonzero during capture. Post capture, each graph object (and the corresponding use_count increment) lasts much longer. Or am I misinterpreting?
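(For context, the fast path that captures_underway enables looks roughly like this; a sketch, not the PR's exact malloc code, around the CUDA runtime query it lets the allocator skip.)

```cpp
#include <cuda_runtime.h>

// Only query CUDA for the stream's capture status when some capture might
// actually be underway; the common eager-mode path skips the call entirely.
bool stream_is_capturing(cudaStream_t stream,
                         int captures_underway,
                         unsigned long long* capture_id_out) {
  if (captures_underway == 0) {
    return false;  // hot eager path: no runtime query
  }
  cudaStreamCaptureStatus status = cudaStreamCaptureStatusNone;
  cudaError_t err = cudaStreamGetCaptureInfo(stream, &status, capture_id_out);
  return err == cudaSuccess && status == cudaStreamCaptureStatusActive;
}
```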
c10/cuda/CUDACachingAllocator.cpp (Outdated)

    p.pool == &gp.second.small_blocks) {
  gp.second.cudaMalloc_count++;
  break;
}
This is going to be O(n) in the number of live CUDA graphs, even if you aren't recording, no?
Yes, but we only got here if we called cudaMalloc, which dwarfs everything else in this function. This function is already the slow (hopefully rare) fallback.
New BlockPool.owner_PrivatePool member avoids the address-checking loop.
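(A simplified sketch of that back-pointer; the real PrivatePool also owns its small/large BlockPools and more bookkeeping.)

```cpp
struct PrivatePool {
  int use_count = 1;         // captures/graphs referencing this pool
  int cudaMalloc_count = 0;  // outstanding cudaMallocs charged to this pool
};

struct BlockPool {
  // nullptr for the ordinary small_blocks/large_blocks pools; otherwise points
  // at the graph pool this BlockPool belongs to.
  PrivatePool* owner_PrivatePool = nullptr;
};

// In the cudaMalloc fallback: constant-time accounting, no loop over graph_pools.
void record_cudaMalloc(BlockPool& pool) {
  if (pool.owner_PrivatePool) {
    pool.owner_PrivatePool->cudaMalloc_count++;
  }
}
```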
c10/cuda/CUDACachingAllocator.cpp (Outdated)

free_blocks(large_blocks);
free_blocks(small_blocks);

for (auto it = graph_pools.begin(); it != graph_pools.end(); ) {
When use_count == 0, is it possible for the use count to go back up again? If not, a simple way to do cleanup but maintain the correct asymptotics when there are many live mempools is to just move mempools to a free list when their use count goes to zero, so you only have to iterate over the free list at this point.
"When use_count == 0, is it possible for the use count to go back up again?"

Hmm, that's definitely not intended. I'll think about this; let's not merge until we're sure it can't happen or isn't a danger.

"just move mempools to a free list when their use count goes to zero, so you only have to iterate over the free list at this point."

That's a better idea, I'll make that change.
Created secondary map (graph_pools_freeable) to avoid iterating through all graph_pools in free_cached_blocks.
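A sketch of that arrangement (the container types and id alias are assumptions; the actual free_blocks calls on each pool's small/large blocks are elided):

```cpp
#include <unordered_map>

struct PrivatePool {
  int use_count = 1;
  int cudaMalloc_count = 0;
};

using IllustrativeMempoolId = unsigned long long;  // stand-in for MempoolId_t

// graph_pools owns every private pool; graph_pools_freeable additionally tracks
// pools whose use_count has dropped to zero, so the sweep only visits those.
std::unordered_map<IllustrativeMempoolId, PrivatePool> graph_pools;
std::unordered_map<IllustrativeMempoolId, PrivatePool*> graph_pools_freeable;

void sweep_freeable_graph_pools() {
  for (auto it = graph_pools_freeable.begin(); it != graph_pools_freeable.end();) {
    // (free_blocks on the pool's small_blocks/large_blocks would run here.)
    if (it->second->cudaMalloc_count == 0) {
      // Everything went back to the system; drop the pool entirely.
      graph_pools.erase(it->first);
      it = graph_pools_freeable.erase(it);
    } else {
      // Some blocks are still held by user tensors; revisit on a later sweep.
      ++it;
    }
  }
}
```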
if (it->second.cudaMalloc_count == 0) {
  it = graph_pools.erase(it);
} else {
  ++it;
Err, when could this case possibly happen?
This branch is the case where the graph's private pool still has blocks referenced by tensors the user holds, such that free_blocks didn't cudaFree all the pool's cudaMallocs yet. We can't erase the PrivatePool from graph_pools until the graph is destroyed and the user drops all their references, such that free_blocks finishes cleaning up the cudaMallocs. So the first time free_blocks gets called on a private mempool, it might only cudaFree most of the pool's memory back to the system. free_blocks will be called again on the private pool's large_blocks and/or small_blocks in future synchronize_and_free_blocks() calls until all user references have been dropped, after which free_blocks will finish the last cudaFrees and we'll take the it = graph_pools.erase(it); branch.
At least, that was my plan. Do you see a problem here?
I don't see a problem but I need to go through and check the invariants again.
This all looks essentially good, just some minor algorithmic suggestions. Also want to know testing strategy.
Thanks very much!! I'll work on the changes tomorrow.
So would I 😛 It's hard to test allocator policy from Python. Added test plan to PR description.
#endif

// internal states for error checking
bool has_graph_ = false;
A comment here about what has_graph_ and has_graph_exec_ are, and how they differ, would be useful.
If an exception is thrown during capture (or capture_end) and stack unwinding destroys the CUDAGraph object, these flags are a half-assed attempt to make reset() destroy the right things depending on where the exception happened. But there's so much statefulness associated with a capture (in the allocator, rng generator, and CUDAGraph itself) that it's hard to be sure reset() makes everything right again, such that the user could catch the exception and continue.
Added a comment at least ¯\_(ツ)_/¯
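(For readers, the distinction the comment documents is roughly the following; a sketch under assumed member names, not the real CUDAGraph class.)

```cpp
#include <cuda_runtime.h>

// has_graph_ means capture produced a cudaGraph_t (cudaStreamEndCapture succeeded);
// has_graph_exec_ means it was also instantiated into a cudaGraphExec_t.
// reset() destroys only what actually exists, so it can run during stack unwinding.
struct CUDAGraphSketch {
  cudaGraph_t graph_ = nullptr;
  cudaGraphExec_t graph_exec_ = nullptr;
  bool has_graph_ = false;       // capture finished and graph_ is valid
  bool has_graph_exec_ = false;  // graph_ was instantiated and graph_exec_ is valid

  void reset() {
    if (has_graph_exec_) {
      cudaGraphExecDestroy(graph_exec_);
      has_graph_exec_ = false;
    }
    if (has_graph_) {
      cudaGraphDestroy(graph_);
      has_graph_ = false;
    }
    // The real reset() also releases the capture's private mempool in the allocator.
  }
};
```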
aten/src/ATen/cuda/CUDAGraph.h (Outdated)

// RAII guard for "cudaStreamCaptureMode", a thread-local value
// that controls the error-checking strictness of a capture.
#if CUDA_VERSION >= 11000
struct TORCH_CUDA_CPP_API CUDAStreamCaptureModeGuard {
there are 2 copies of this guard (second one in CUDAGraphsC10Utils.h)
oops, CUDAGraphsC10Utils.h is the one I want, I forgot to delete the original.
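(For reference, the guard amounts to an exchange/restore of the thread-local mode; a minimal sketch around the CUDA runtime call it wraps.)

```cpp
#include <cuda_runtime.h>

// RAII guard sketch for cudaStreamCaptureMode: swap the desired mode in on
// construction, and swap the previous mode back in on destruction.
struct StreamCaptureModeGuardSketch {
  explicit StreamCaptureModeGuardSketch(cudaStreamCaptureMode desired)
      : strictness_(desired) {
    // cudaThreadExchangeStreamCaptureMode writes the previous mode into strictness_.
    cudaThreadExchangeStreamCaptureMode(&strictness_);
  }
  ~StreamCaptureModeGuardSketch() {
    // Restore whatever mode this thread had before the guard was constructed.
    cudaThreadExchangeStreamCaptureMode(&strictness_);
  }
 private:
  cudaStreamCaptureMode strictness_;
};
```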
c10/cuda/CUDACachingAllocator.cpp (Outdated)

remaining->size -= size;
pool.insert(remaining);
bool inserted = pool.blocks.insert(remaining).second;
TORCH_INTERNAL_ASSERT(inserted);
Does this need to be assert, or debug only assert?
Up to you. It wasn't there before, and is unrelated to the graph pool changes, I added it to be more defensive about the allocator's control flow in general.
I'd rather not have it in the hotpath.
Also changed the similar assert on active_blocks.insert to DEBUG_ONLY.
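(i.e., roughly this contrast, using the existing c10 assert macros; shown only for illustration.)

```cpp
#include <set>
#include <c10/util/Exception.h>

void record_block(std::set<const void*>& blocks, const void* block) {
  bool inserted = blocks.insert(block).second;
  // TORCH_INTERNAL_ASSERT(inserted);          // checked in every build (hot-path cost)
  TORCH_INTERNAL_ASSERT_DEBUG_ONLY(inserted);  // compiled out of release (NDEBUG) builds
}
```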
c10/cuda/CUDACachingAllocator.cpp (Outdated)

// Called by CUDAGraph::reset
void notifyCaptureDestroy(MempoolId_t mempool_id) {
  std::lock_guard<std::recursive_mutex> lock(mutex);
  // The graph's been destroyed. We can't blindly delete and cudaFree its mempool, because
In the interest of accuracy, it's graph executor that's likely been destroyed here? Graph is destroyed at normal capture end, and shouldn't result in a call to this function.
}
return (block->pool->is_small) ?
    (remaining >= kMinBlockSize) :
    (remaining > kSmallSize);
Are you sure an invalid pool can't come here? There was a check for it before.
In an earlier version** I checked block against &small_blocks and &large_blocks, then a loop that checked against all the private pools' &small_blocks and &large_blocks, then finally an error if no match was found. It was kinda ugly and Ed said it was "perf fishy". Using a new pool->is_small field made the code much simpler. If we still want to validate block, I have to restore all the original checks. I can do so if you want, but personally I don't think it's worth the ugliness. I have other checks that pools are present or not present when expected. get_stat_type_for_pool is in the same boat.

** Hopefully that link works; in my browser I have to wait a few seconds for it to jump to the spot.
Importing to see what internal CI says.
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
aten/src/ATen/cuda/CUDAGraph.h (Outdated)

namespace at {

class CUDAGeneratorImpl;
Our linter has a reasonable complaint here: in CUDAGeneratorImpl.h, CUDAGeneratorImpl is defined as a struct, not a class.
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This pull request has been reverted by 76129c7.
Summary: Resubmit of #51436. Apparently some non-public windows builds run cuda tests on the default stream, so I changed a few capture tests to manually ensure all captures happen on non-default streams.

Pull Request resolved: #54038
Reviewed By: mruberry
Differential Revision: D27068649
Pulled By: ngimel
fbshipit-source-id: 4284475fa40ee38c0f8faff05a2faa310cf8a207
Summary: Implements pytorch#51075 (comment) and additions discussed offline with ezyang ngimel. (Calling it "simple" is charitable but it's not too bad).

[High level strategy](https://github.com/pytorch/pytorch/pull/51436/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R57-R82)

The current design aggregates stats from private pools with the ordinary pools, which may or may not be what we want.

Instead of adding PrivatePools as an internal feature of DeviceAllocator, I could inherit from DeviceAllocator (eg `DevicePrivateAllocator : public DeviceAllocator`) and create separate per-graph instances of the inherited class. I'm not sure if that would be better.

Graph bindings in Python are almost unchanged from pytorch#48875:

```python
# Same bindings as 48875, but now implicitly grabs a private mempool
graph1.capture_begin()
graph1.capture_end()

# pool=... is new. It hints that allocations during graph2's capture may share graph1's mempool
graph2.capture_begin(pool=graph1.pool())
graph2.capture_end()

# graph3 also implicitly creates its own mempool
graph3.capture_begin()
graph3.capture_end()
```

Test plan (other suggestions appreciated):

- [x] Stop maintaining manual references for all the tensors in my existing graphs+RNG tests. If private pools somehow give bad allocations, they should start failing intermittently. They run eager ops and eager allocations mixed with graph replays, so they may expose if eager ops and replays corrupt each other.
- [x] `test_graph_two_successive`: Capture successive graphs, with the second graph using the first graph's result. Try with and without sharing a pool. Check results, also check memory stats to confirm sharing a pool saves memory.
- [x] `test_graph_concurrent_replay`: Capture some graphs in separate private pools, replay them concurrently in different streams, check the results to make sure they don't corrupt each other's memory. Capture some graphs with a shared pool, replay them concurrently in different streams, check results, confirm they DO corrupt each other's memory.
- [x] `test_graph_three_successive`: A three-graph case, checking the safe and unsafe replay patterns in [Restrictions of the Strawman API](pytorch#51075).
- [x] `test_graph_memory_stats_and_use_result_after_destroy_graph`: Comprehensively check torch.cuda.memory_stats() changes that result from graph capture and delete. Check that a tensor ref created during capture and held after graph delete stays valid until the tensor itself is deleted.

Pull Request resolved: pytorch#51436
Reviewed By: mruberry
Differential Revision: D26993790
Pulled By: ngimel
fbshipit-source-id: a992eaee1b8c23628e7b388a5a3c26e0f80e54da
Implements #51075 (comment) and additions discussed offline with @ezyang @ngimel .
High level strategy
The current design aggregates stats from private pools with the ordinary pools, which may or may not be what we want.
Instead of adding PrivatePools as an internal feature of DeviceAllocator, I could inherit from DeviceAllocator (eg DevicePrivateAllocator : public DeviceAllocator) and create separate per-graph instances of the inherited class. I'm not sure if that would be better.

Graph bindings in Python are almost unchanged from #48875:
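```python
# Same bindings as 48875, but now implicitly grabs a private mempool
graph1.capture_begin()
graph1.capture_end()

# pool=... is new. It hints that allocations during graph2's capture may share graph1's mempool
graph2.capture_begin(pool=graph1.pool())
graph2.capture_end()

# graph3 also implicitly creates its own mempool
graph3.capture_begin()
graph3.capture_end()
```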
Test plan (other suggestions appreciated):
- test_graph_two_successive: Capture successive graphs, with the second graph using the first graph's result. Try with and without sharing a pool. Check results, also check memory stats to confirm sharing a pool saves memory. (A rough usage sketch follows this list.)
- test_graph_concurrent_replay: Capture some graphs in separate private pools, replay them concurrently in different streams, check the results to make sure they don't corrupt each other's memory. Capture some graphs with a shared pool, replay them concurrently in different streams, check results, confirm they DO corrupt each other's memory.
- test_graph_three_successive: A three-graph case, checking the safe and unsafe replay patterns in Restrictions of the Strawman API.
- test_graph_memory_stats_and_use_result_after_destroy_graph: Comprehensively check torch.cuda.memory_stats() changes that result from graph capture and delete. Check that a tensor ref created during capture and held after graph delete stays valid until the tensor itself is deleted.
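As a rough illustration of the pool-sharing pattern the first two tests exercise, here is a usage sketch. It assumes the graph objects expose the capture_begin/capture_end/pool bindings shown above, that the public class name is torch.cuda.CUDAGraph (it may differ in this PR's revision), and that captures run on a non-default stream:

```python
import torch

s = torch.cuda.Stream()
g1 = torch.cuda.CUDAGraph()  # assumed class name; see note above
g2 = torch.cuda.CUDAGraph()

x = torch.zeros(1024, device="cuda")

with torch.cuda.stream(s):   # captures must happen on a non-default stream
    # Warm up so lazy initialization doesn't happen during capture.
    (x + 1) * 2
    torch.cuda.synchronize()

    g1.capture_begin()                 # implicitly grabs a private mempool
    y = x + 1
    g1.capture_end()

    g2.capture_begin(pool=g1.pool())   # share g1's mempool instead of creating a new one
    z = y * 2
    g2.capture_end()

g1.replay()
g2.replay()
torch.cuda.synchronize()
```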