
Conversation

@mcarilli commented Feb 1, 2021

Implements #51075 (comment) and additions discussed offline with @ezyang and @ngimel.

High level strategy

The current design aggregates stats from private pools with the ordinary pools, which may or may not be what we want.

Instead of adding PrivatePools as an internal feature of DeviceAllocator, I could inherit from DeviceAllocator (e.g. `DevicePrivateAllocator : public DeviceAllocator`) and create separate per-graph instances of the inherited class. I'm not sure if that would be better.

Graph bindings in Python are almost unchanged from #48875:

```python
# Same bindings as 48875, but now implicitly grabs a private mempool
graph1.capture_begin()
graph1.capture_end()

# pool=... is new.  It hints that allocations during graph2's capture may share graph1's mempool
graph2.capture_begin(pool=graph1.pool())
graph2.capture_end()

# graph3 also implicitly creates its own mempool
graph3.capture_begin()
graph3.capture_end()
```

Test plan (other suggestions appreciated):

  • Stop maintaining manual references for all the tensors in my existing graphs+RNG tests. If private pools somehow give bad allocations, they should start failing intermittently. They run eager ops and eager allocations mixed with graph replays, so they may expose if eager ops and replays corrupt each other.
  • test_graph_two_successive: Capture successive graphs, with the second graph using the first graph's result. Try with and without sharing a pool. Check results, also check memory stats to confirm sharing a pool saves memory.
  • test_graph_concurrent_replay: Capture some graphs in separate private pools, replay them concurrently in different streams, check the results to make sure they don't corrupt each other's memory. Capture some graphs with a shared pool, replay them concurrently in different streams, check results, confirm they DO corrupt each other's memory.
  • test_graph_three_successive: A three-graph case, checking the safe and unsafe replay patterns described in "Restrictions of the Strawman API" (#51075).
  • test_graph_memory_stats_and_use_result_after_destroy_graph: Comprehensively check torch.cuda.memory_stats() changes that result from graph capture and delete. Check that a tensor ref created during capture and held after graph delete stays valid until the tensor itself is deleted.


facebook-github-bot commented Feb 1, 2021

💊 CI failures summary and remediations

As of commit 00620eb (more details on the Dr. CI page):


  • 3/3 failures possibly* introduced in this PR
    • 1/3 non-scanned failure(s)

2 failures not recognized by patterns:

| Job | Step | Action |
| --- | --- | --- |
| CircleCI pytorch_windows_vs2019_py36_cuda10.1_test1 | Test | 🔁 rerun |
| CircleCI pytorch_windows_vs2019_py36_cuda10.1_test2 | Test | 🔁 rerun |


@mcarilli changed the title from "Private mempools for CUDA graphs" to "[CUDA graphs] Private mempools for CUDA graphs" on Feb 1, 2021

// Maps a capturing stream to its assigned private pool,
// in case we want multiple captures to share the same pool
std::unordered_map<CUDACaptureid_t/*capture id*/, CUDACaptureid_t/*pool id*/> capture_to_pool_map;
Contributor:

You sure you want to use an unordered_map here? They're famously slow. I guess this only slows you down when capturing so it's nbd.

Collaborator Author:

Capture ids are assigned to each capture by CUDA; we can't predict their values (in CUDA today, I think they happen to be sequential starting from 1, but that's not guaranteed and may change). So afaict we need an unordered_map, but I'm open to better ideas. As you say, it only affects capture, so performance isn't crucial.

Contributor:

If you need a non-crappy hash map, flat_hash_map may be of use.

Collaborator Author:

For this case I think unordered_map is alright, but I don't mind changing to flat_hash_map if you have strong feelings about it.
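For reference, a minimal sketch of what the swap would look like, assuming c10's bundled ska::flat_hash_map (c10/util/flat_hash_map.h); the CaptureId alias here is a simplified stand-in, not the PR's type:

```cpp
#include <c10/util/flat_hash_map.h>

// Simplified stand-in for the capture-id type; CUDA hands these out per capture.
using CaptureId = unsigned long long;

// Drop-in replacement for the std::unordered_map above: same interface, but
// open-addressed and stored flat, so lookups avoid per-node allocations.
ska::flat_hash_map<CaptureId /*capture id*/, CaptureId /*pool id*/> capture_to_pool_map;
```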

} else if (&pool == &large_blocks) {
return StatType::LARGE_POOL;
} else {
for (auto& gp : graph_pools) {
Contributor:

This changes the asymptotics of this function for graph pools; it would be good to make sure people aren't assuming it's free. (It might be easier to just store StatType on BlockPool after this change.)

Collaborator Author:

Hmm, I'm not sure I understand what you're suggesting here.

Contributor:

The potential problem is a call site calling this on private pools expecting it to be O(1) when it is not. The potential solution is if a pool records, as a field, whether or not it is small or large, you don't have to do the pointer test; you can just read it out of the field. But this is only a very minor problem.

Collaborator Author (@mcarilli, Feb 5, 2021):

The new BlockPool.is_small field avoids this address-checking loop.
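A rough sketch of the simplification (field and enum names follow this thread; the surrounding allocator code is elided and the enum is simplified):

```cpp
// Simplified version of the allocator's stat categories.
enum class StatType { AGGREGATE, SMALL_POOL, LARGE_POOL };

// Each BlockPool records at construction whether it holds small or large
// blocks, so the stat type becomes a field read instead of an address scan.
struct BlockPool {
  // ... block set, comparator, etc. elided ...
  bool is_small;
};

static StatType get_stat_type_for_pool(const BlockPool& pool) {
  return pool.is_small ? StatType::SMALL_POOL : StatType::LARGE_POOL;
}
```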

} else if (block->pool == &large_blocks) {
return remaining > kSmallSize;
} else {
for (auto& gp : graph_pools) {
Contributor:

This definitely looks very perf fishy

Collaborator Author:

It is perf fishy, but for common eager-mode allocations we always return before reaching the loop. The loop should only be reached when allocating from a private pool, in other words, while capturing. The only perf danger I see here is that I've increased the code size in a fairly hot function, which might increase icache pressure. No idea if that effect is tangible.

Collaborator Author (@mcarilli, Feb 5, 2021):

The new BlockPool.is_small field avoids the loop.

// captures_underway tracks if a capture might be underway on any stream.
// Most of the time it's zero, in which case malloc can avoid calling
// cudaStreamGetCaptureInfo in the hot path.
int captures_underway = 0;
Contributor:

There's a stronger invariant here, something like: captures_underway equals the sum of all use counts of pools. If we had some slow invariant checker for the caching allocator, that would be a good one to check.

Collaborator Author:

> captures_underway equals the sum of all use counts of pools

I don't think this is an invariant, because captures_underway is only nonzero during capture. Post-capture, each graph object (and its corresponding use_count increment) lasts much longer. Or am I misinterpreting?
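For context, a minimal sketch (assumed helper and variable names, not the PR's exact code) of how the counter keeps the eager-mode hot path cheap:

```cpp
#include <cuda_runtime.h>

// Returns true only if `stream` is actively capturing. The driver query is
// skipped entirely in the common case where no capture_begin has incremented
// captures_underway. Error handling is omitted for brevity.
static bool stream_is_capturing(cudaStream_t stream,
                                int captures_underway,
                                unsigned long long* capture_id_out) {
  if (captures_underway == 0) {
    return false;  // eager mode everywhere: no cudaStreamGetCaptureInfo call
  }
  cudaStreamCaptureStatus status = cudaStreamCaptureStatusNone;
  cudaStreamGetCaptureInfo(stream, &status, capture_id_out);
  return status == cudaStreamCaptureStatusActive;
}
```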

p.pool == &gp.second.small_blocks) {
gp.second.cudaMalloc_count++;
break;
}
Contributor:

This is going to be O(n) cost in the number of live cuda graphs, even if you aren't recording, no?

Collaborator Author:

Yes, but we only got here if we called cudaMalloc, which dwarfs everything else in this function. This function is already the slow (hopefully rare) fallback.

Collaborator Author (@mcarilli, Feb 5, 2021):

The new BlockPool.owner_PrivatePool member avoids the address-checking loop.
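A sketch of that change, using the member names mentioned in this thread; the surrounding structs are abbreviated and the helper function is hypothetical:

```cpp
struct PrivatePool;  // forward declaration

struct BlockPool {
  bool is_small = false;
  PrivatePool* owner_PrivatePool = nullptr;  // nullptr for the ordinary device pools
  // ... block set elided ...
};

struct PrivatePool {
  int use_count = 1;         // one per live graph sharing this pool
  int cudaMalloc_count = 0;  // outstanding cudaMallocs charged to this pool
  BlockPool small_blocks, large_blocks;
};

// Crediting a fresh cudaMalloc becomes a field access instead of a scan over graph_pools.
inline void credit_cudamalloc(BlockPool& pool) {
  if (pool.owner_PrivatePool) {
    pool.owner_PrivatePool->cudaMalloc_count++;
  }
}
```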

free_blocks(large_blocks);
free_blocks(small_blocks);

for (auto it = graph_pools.begin(); it != graph_pools.end(); ) {
Contributor:

When use_count == 0, is it possible for the use count to go back up again? If not, a simple way to do cleanup but maintain the correct asymptotics when there are many live mempools is to just move mempools to a free list when their use count goes to zero, so you only have to iterate over the free list at this point.

Collaborator Author (@mcarilli, Feb 2, 2021):

> When use_count == 0, is it possible for the use count to go back up again?

Hmm, definitely not intended. I'll think about this; let's not merge until we're sure it can't happen or isn't a danger.

> just move mempools to a free list when their use count goes to zero, so you only have to iterate over the free list at this point.

That's a better idea, I'll make that change.

Collaborator Author:

Created a secondary map (graph_pools_freeable) to avoid iterating through all graph_pools in free_cached_blocks.
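A sketch of that bookkeeping with simplified types (the real key is richer than a single integer, and the PR's exact container choice may differ):

```cpp
#include <unordered_map>

struct PrivatePool;  // use_count, cudaMalloc_count, small/large BlockPools (see earlier sketch)
using MempoolId = unsigned long long;  // simplified stand-in for the real id type

// Map of every live private pool, consulted during capture.
// (The real code owns the PrivatePool objects; raw pointers keep this sketch short.)
std::unordered_map<MempoolId, PrivatePool*> graph_pools;

// Side table holding only pools whose graphs have been destroyed (use_count == 0)
// but which may still have blocks pinned by user tensors. free_cached_blocks
// walks this map instead of all of graph_pools.
std::unordered_map<MempoolId, PrivatePool*> graph_pools_freeable;
```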

if (it->second.cudaMalloc_count == 0) {
it = graph_pools.erase(it);
} else {
++it;
Contributor:

Err, when could this case possibly happen?

Collaborator Author (@mcarilli, Feb 2, 2021):

This branch is the case where the graph's private pool still has blocks referenced by tensors the user holds, such that free_blocks couldn't cudaFree all of the pool's cudaMallocs yet. We can't erase the PrivatePool from graph_pools until the graph is destroyed and the user drops all their references, so that free_blocks can finish cleaning up the cudaMallocs. So the first time free_blocks gets called on a private mempool, it might only return part of the pool's memory to the system. free_blocks will be called again on the private pool's large_blocks and/or small_blocks in future calls to synchronize_and_free_blocks() until all user references have been dropped, after which free_blocks will finish the last cudaFrees and we'll take the `it = graph_pools.erase(it);` branch.

Collaborator Author:

At least, that was my plan. Do you see a problem here?

Contributor:

I don't see a problem, but I need to go through and check the invariants again.
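A sketch of the cleanup pass described above, written against the graph_pools_freeable side table; the structure is assumed from the comments in this thread, not copied from the diff:

```cpp
for (auto it = graph_pools_freeable.begin(); it != graph_pools_freeable.end();) {
  PrivatePool* pool = it->second;
  // Return whatever is no longer referenced by user tensors to the system.
  free_blocks(pool->large_blocks);
  free_blocks(pool->small_blocks);
  if (pool->cudaMalloc_count == 0) {
    // Graph destroyed and every block released: the pool can finally be erased.
    graph_pools.erase(it->first);
    it = graph_pools_freeable.erase(it);
  } else {
    // Some blocks are still pinned by live tensors; retry on a later pass.
    ++it;
  }
}
```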


ezyang commented Feb 2, 2021

This all looks essentially good, just some minor algorithmic suggestions. Also want to know testing strategy.


mcarilli commented Feb 2, 2021

> This all looks essentially good, just some minor algorithmic suggestions.

Thanks very much!! I'll work on the changes tomorrow.

> Also want to know testing strategy.

So would I 😛 It's hard to test allocator policy from Python. Added a test plan to the PR description.

#endif

// internal states for error checking
bool has_graph_ = false;
Collaborator:

A comment here about what has_graph_ and has_graph_exec_ are, and how they differ, would be useful.

Collaborator Author (@mcarilli, Mar 10, 2021):

If an exception is thrown during capture (or capture_end) and stack unwinding destroys the CUDAGraph object, these flags are a halfassed attempt to make reset() destroy the right things depending on where the exception happened. But there's so much statefulness associated with a capture (in the allocator, the RNG generator, the CUDAGraph itself) that it's hard to be sure reset() makes everything right again, such that the user could catch the exception and continue.

Added a comment at least ¯\_(ツ)_/¯
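For reference, a sketch of the commented members (the names come from the snippet above; the comment wording is editorial, not the PR's):

```cpp
// internal states for error checking
bool has_graph_ = false;       // capture_end's cudaStreamEndCapture produced a cudaGraph_t
bool has_graph_exec_ = false;  // cudaGraphInstantiate produced a cudaGraphExec_t, so replay() is legal
```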

// RAII guard for "cudaStreamCaptureMode", a thread-local value
// that controls the error-checking strictness of a capture.
#if CUDA_VERSION >= 11000
struct TORCH_CUDA_CPP_API CUDAStreamCaptureModeGuard {
Collaborator:

There are two copies of this guard (the second one is in CUDAGraphsC10Utils.h).

Collaborator Author:

Oops, CUDAGraphsC10Utils.h is the one I want; I forgot to delete the original.
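For readers unfamiliar with the guard, a sketch of the pattern; the retained copy in CUDAGraphsC10Utils.h is equivalent in spirit, but error-checking macros are omitted here:

```cpp
#include <cuda_runtime.h>

// RAII guard for the thread-local cudaStreamCaptureMode.
// cudaThreadExchangeStreamCaptureMode swaps the calling thread's mode with the
// value passed in and writes the previous mode back through the pointer.
struct CUDAStreamCaptureModeGuard {
  explicit CUDAStreamCaptureModeGuard(cudaStreamCaptureMode desired)
      : strictness_(desired) {
    cudaThreadExchangeStreamCaptureMode(&strictness_);  // strictness_ now holds the prior mode
  }
  ~CUDAStreamCaptureModeGuard() {
    cudaThreadExchangeStreamCaptureMode(&strictness_);  // restore the prior mode
  }
 private:
  cudaStreamCaptureMode strictness_;
};
```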

remaining->size -= size;
pool.insert(remaining);
bool inserted = pool.blocks.insert(remaining).second;
TORCH_INTERNAL_ASSERT(inserted);
Collaborator:

Does this need to be an assert, or a debug-only assert?

Collaborator Author:

Up to you. It wasn't there before and is unrelated to the graph pool changes; I added it to be more defensive about the allocator's control flow in general.

Collaborator:

I'd rather not have it in the hot path.

Collaborator Author:

Also changed the similar assert on active_blocks.insert to DEBUG_ONLY.
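A minimal sketch of the distinction (both macros are real c10 asserts; the surrounding allocator lines are abbreviated):

```cpp
bool inserted = pool.blocks.insert(remaining).second;
TORCH_INTERNAL_ASSERT(inserted);             // always evaluated, even in release builds

inserted = active_blocks.insert(block).second;
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(inserted);  // compiled out of release builds, off the hot path
```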

// Called by CUDAGraph::reset
void notifyCaptureDestroy(MempoolId_t mempool_id) {
std::lock_guard<std::recursive_mutex> lock(mutex);
// The graph's been destroyed. We can't blindly delete and cudaFree its mempool, because
Collaborator:

In the interest of accuracy, it's the graph executor that has likely been destroyed here? The graph itself is destroyed at normal capture end, which shouldn't result in a call to this function.

}
return (block->pool->is_small) ?
(remaining >= kMinBlockSize) :
(remaining > kSmallSize);
Collaborator:

Are you sure an invalid pool can't reach this point? There was a check for it before.

Collaborator Author (@mcarilli, Mar 10, 2021):

In an earlier version** I checked block against &small_blocks and &large_blocks, then a loop checked against all the private pools' &small_blocks and &large_blocks, then finally an error if no match was found. It was kinda ugly and Ed said it was "perf fishy". Using the new pool->is_small field made the code much simpler. If we still want to validate block, I have to restore all the original checks. I can do so if you want, but personally I don't think it's worth the ugliness. I have other checks that pools are present or not present when expected.

get_stat_type_for_pool is in the same boat.

** Hopefully that link works; in my browser I have to wait a few seconds for it to jump to the spot.


ngimel commented Mar 11, 2021

Importing to see what internal CI says.

@facebook-github-bot left a comment:

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


namespace at {

class CUDAGeneratorImpl;
Collaborator:

Our linter has a reasonable complaint here: in CUDAGeneratorImpl.h, CUDAGeneratorImpl is defined as a struct, not a class.
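The usual fix for a mismatched-tags warning is just to make the forward declaration's class-key match the definition, e.g. (sketch):

```cpp
namespace at {

// CUDAGeneratorImpl.h defines this type with `struct`, so forward-declare it
// the same way to keep the linter quiet.
struct CUDAGeneratorImpl;

} // namespace at
```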

@facebook-github-bot left a comment:

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@ngimel merged this pull request in 90dfdef.


This pull request has been reverted by 76129c7.

facebook-github-bot pushed a commit that referenced this pull request Mar 16, 2021
Summary:
Resubmit of #51436.

Apparently some non-public windows builds run cuda tests on the default stream, so I changed a few capture tests to manually ensure all captures happen on non-default streams.

Pull Request resolved: #54038

Reviewed By: mruberry

Differential Revision: D27068649

Pulled By: ngimel

fbshipit-source-id: 4284475fa40ee38c0f8faff05a2faa310cf8a207
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Summary:
Implements pytorch#51075 (comment) and additions discussed offline with ezyang and ngimel. (Calling it "simple" is charitable but it's not too bad.) The rest of the commit message repeats the PR description above.

Pull Request resolved: pytorch#51436

Reviewed By: mruberry

Differential Revision: D26993790

Pulled By: ngimel

fbshipit-source-id: a992eaee1b8c23628e7b388a5a3c26e0f80e54da
@mcarilli added the "module: cuda graphs" label (Ability to capture and then replay streams of CUDA kernels) on Apr 28, 2021

Labels

cla signed · Merged · module: cuda graphs · open source · Reverted · triaged


Development

Successfully merging this pull request may close these issues.

RFC: Private CUDA memory pools
