[CUDA graphs] Private mempools for CUDA graphs #51436
Conversation
c10/cuda/CUDACachingAllocator.cpp (Outdated)

// Maps a capturing stream to its assigned private pool,
// in case we want multiple captures to share the same pool
std::unordered_map<CUDACaptureid_t/*capture id*/, CUDACaptureid_t/*pool id*/> capture_to_pool_map;
You sure you want to use an unordered_map here? They're famously slow. I guess this only slows you down when capturing so it's nbd.
Capture ids are assigned to each capture by CUDA; we can't predict their values. (In CUDA today, I think they happen to be sequential starting from 1, but that's not guaranteed and may change.) So afaict we need unordered_map, but I'm open to better ideas. As you say, it only affects capture, so performance isn't crucial.
If you need a non-crappy hash map, flat_hash_map may be of use.
For this case I think unordered_map is alright, but I don't mind changing to flat_hash_map if you have strong feelings about it.
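(For reference, a minimal sketch of what the flat_hash_map variant might look like, assuming the vendored header at c10/util/flat_hash_map.h; the id alias below is an illustrative stand-in, not the allocator's real capture-id type.)

```cpp
#include <c10/util/flat_hash_map.h>

// Illustrative stand-in for the capture/pool id type used by the allocator.
using IllustrativeCaptureId = unsigned long long;

// Same mapping as above, but backed by ska::flat_hash_map instead of
// std::unordered_map. It is only touched during capture, so either works.
ska::flat_hash_map<IllustrativeCaptureId /*capture id*/, IllustrativeCaptureId /*pool id*/>
    capture_to_pool_map;
```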
c10/cuda/CUDACachingAllocator.cpp (Outdated)

} else if (&pool == &large_blocks) {
  return StatType::LARGE_POOL;
} else {
  for (auto& gp : graph_pools) {
This changes the asymptotics of this function for graph pools; it would be good to make sure people aren't assuming this is free. (It might be easier to just store StatType on BlockPool after this change.)
Hmm, I'm not sure I understand what you're suggesting here.
The potential problem is a call site calling this on private pools while expecting it to be O(1) when it is not. The potential solution: if a pool records, as a field, whether it is small or large, you don't have to do the pointer test; you can just read it out of the field. But this is only a very minor problem.
New BlockPool.is_small field avoids this address-checking loop.
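Roughly the shape of that change, as a simplified sketch (the real StatType enum and BlockPool carry more members than shown here):

```cpp
#include <set>

struct Block;  // the real Block also carries size, stream, pool pointer, etc.

enum class StatType { SMALL_POOL, LARGE_POOL };

struct BlockPool {
  std::set<Block*> blocks;
  bool is_small = false;  // set once when the pool is constructed
};

// O(1) no matter how many private (graph) pools exist: no address comparison
// against small_blocks/large_blocks and no loop over graph_pools.
StatType get_stat_type_for_pool(const BlockPool& pool) {
  return pool.is_small ? StatType::SMALL_POOL : StatType::LARGE_POOL;
}
```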
c10/cuda/CUDACachingAllocator.cpp (Outdated)

} else if (block->pool == &large_blocks) {
  return remaining > kSmallSize;
} else {
  for (auto& gp : graph_pools) {
This definitely looks very perf fishy
It is perf fishy, but for common eager-mode allocations we always return before reaching the loop. The loop should only be reached when allocating from a private pool, in other words, while capturing. The only perf danger I see here is that I've increased the code size in a fairly hot function, which might increase icache pressure. No idea if that effect is tangible.
New BlockPool.is_small field avoids the loop.
// captures_underway tracks if a capture might be underway on any stream.
// Most of the time it's zero, in which case malloc can avoid calling
// cudaStreamGetCaptureInfo in the hot path.
int captures_underway = 0;
There's a stronger invariant here, something like captures underway equals the sum of all use counts of pools. If we had some slow invariant checker for the caching allocator that would be a good one to check.
"captures underway equals the sum of all use counts of pools"

I don't think this is an invariant, because captures_underway is only nonzero during capture. Post capture, each graph object (and the corresponding use_count increment) lasts much longer. Or am I misinterpreting?
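(For context, the fast path that captures_underway enables looks roughly like this; a sketch, not the PR's exact malloc code, around the CUDA runtime query it lets the allocator skip.)

```cpp
#include <cuda_runtime.h>

// Only query CUDA for the stream's capture status when some capture might
// actually be underway; the common eager-mode path skips the call entirely.
bool stream_is_capturing(cudaStream_t stream,
                         int captures_underway,
                         unsigned long long* capture_id_out) {
  if (captures_underway == 0) {
    return false;  // hot eager path: no runtime query
  }
  cudaStreamCaptureStatus status = cudaStreamCaptureStatusNone;
  cudaError_t err = cudaStreamGetCaptureInfo(stream, &status, capture_id_out);
  return err == cudaSuccess && status == cudaStreamCaptureStatusActive;
}
```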
c10/cuda/CUDACachingAllocator.cpp (Outdated)

    p.pool == &gp.second.small_blocks) {
  gp.second.cudaMalloc_count++;
  break;
}
This is going to be O(n) in the number of live CUDA graphs, even if you aren't recording, no?
Yes, but we only got here if we called cudaMalloc, which dwarfs everything else in this function. This function is already the slow (hopefully rare) fallback.
New BlockPool.owner_PrivatePool member avoids the address-checking loop.
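(A simplified sketch of that back-pointer; the real PrivatePool also owns its small/large BlockPools and more bookkeeping.)

```cpp
struct PrivatePool {
  int use_count = 1;         // captures/graphs referencing this pool
  int cudaMalloc_count = 0;  // outstanding cudaMallocs charged to this pool
};

struct BlockPool {
  // nullptr for the ordinary small_blocks/large_blocks pools; otherwise points
  // at the graph pool this BlockPool belongs to.
  PrivatePool* owner_PrivatePool = nullptr;
};

// In the cudaMalloc fallback: constant-time accounting, no loop over graph_pools.
void record_cudaMalloc(BlockPool& pool) {
  if (pool.owner_PrivatePool) {
    pool.owner_PrivatePool->cudaMalloc_count++;
  }
}
```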
c10/cuda/CUDACachingAllocator.cpp (Outdated)

free_blocks(large_blocks);
free_blocks(small_blocks);

for (auto it = graph_pools.begin(); it != graph_pools.end(); ) {
When use_count == 0, is it possible for the use count to go back up again? If not, a simple way to do cleanup but maintain the correct asymptotics when there are many live mempools is to just move mempools to a free list when their use count goes to zero, so you only have to iterate over the free list at this point.
"When use_count == 0, is it possible for the use count to go back up again?"

Hmm, that's definitely not intended. I'll think about this; let's not merge until we're sure it can't happen or isn't a danger.

"just move mempools to a free list when their use count goes to zero, so you only have to iterate over the free list at this point."

That's a better idea, I'll make that change.
Created secondary map (graph_pools_freeable) to avoid iterating through all graph_pools in free_cached_blocks.
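A sketch of that arrangement (the container types and id alias are assumptions; the actual free_blocks calls on each pool's small/large blocks are elided):

```cpp
#include <unordered_map>

struct PrivatePool {
  int use_count = 1;
  int cudaMalloc_count = 0;
};

using IllustrativeMempoolId = unsigned long long;  // stand-in for MempoolId_t

// graph_pools owns every private pool; graph_pools_freeable additionally tracks
// pools whose use_count has dropped to zero, so the sweep only visits those.
std::unordered_map<IllustrativeMempoolId, PrivatePool> graph_pools;
std::unordered_map<IllustrativeMempoolId, PrivatePool*> graph_pools_freeable;

void sweep_freeable_graph_pools() {
  for (auto it = graph_pools_freeable.begin(); it != graph_pools_freeable.end();) {
    // (free_blocks on the pool's small_blocks/large_blocks would run here.)
    if (it->second->cudaMalloc_count == 0) {
      // Everything went back to the system; drop the pool entirely.
      graph_pools.erase(it->first);
      it = graph_pools_freeable.erase(it);
    } else {
      // Some blocks are still held by user tensors; revisit on a later sweep.
      ++it;
    }
  }
}
```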
if (it->second.cudaMalloc_count == 0) {
  it = graph_pools.erase(it);
} else {
  ++it;
Err, when could this case possibly happen?
This branch is the case where the graph's private pool still has blocks referenced by tensors the user holds, such that free_blocks didn't cudaFree all the pool's cudaMallocs yet. We can't erase the PrivatePool from graph_pools until the graph is destroyed and the user drops all their references, such that free_blocks finishes cleaning up the cudaMallocs. So the first time free_blocks gets called on a private mempool, it might only cudaFree most of the pool's memory back to the system. free_blocks will be called again on the private pool's large_blocks and/or small_blocks in future synchronize_and_free_blocks() calls until all user references have been dropped, after which free_blocks will finish the last cudaFrees and we'll take the it = graph_pools.erase(it); branch.
At least, that was my plan. Do you see a problem here?
I don't see a problem but I need to go through and check the invariants again.
This all looks essentially good, just some minor algorithmic suggestions. Also want to know testing strategy.
Thanks very much!! I'll work on the changes tomorrow.
So would I 😛 It's hard to test allocator policy from Python. Added test plan to PR description.
#endif

// internal states for error checking
bool has_graph_ = false;
A comment here about what has_graph_ and has_graph_exec_ are, and how they differ, would be useful.
If an exception is thrown during capture (or capture_end) and stack unwinding destroys the CUDAGraph object, these flags are a half-assed attempt to make reset() destroy the right things depending on where the exception happened. But there's so much statefulness associated with a capture (in the allocator, rng generator, and CUDAGraph itself) that it's hard to be sure reset() makes everything right again, such that the user could catch the exception and continue.
Added a comment at least ¯\_(ツ)_/¯
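(For readers, the distinction the comment documents is roughly the following; a sketch under assumed member names, not the real CUDAGraph class.)

```cpp
#include <cuda_runtime.h>

// has_graph_ means capture produced a cudaGraph_t (cudaStreamEndCapture succeeded);
// has_graph_exec_ means it was also instantiated into a cudaGraphExec_t.
// reset() destroys only what actually exists, so it can run during stack unwinding.
struct CUDAGraphSketch {
  cudaGraph_t graph_ = nullptr;
  cudaGraphExec_t graph_exec_ = nullptr;
  bool has_graph_ = false;       // capture finished and graph_ is valid
  bool has_graph_exec_ = false;  // graph_ was instantiated and graph_exec_ is valid

  void reset() {
    if (has_graph_exec_) {
      cudaGraphExecDestroy(graph_exec_);
      has_graph_exec_ = false;
    }
    if (has_graph_) {
      cudaGraphDestroy(graph_);
      has_graph_ = false;
    }
    // The real reset() also releases the capture's private mempool in the allocator.
  }
};
```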
aten/src/ATen/cuda/CUDAGraph.h (Outdated)

// RAII guard for "cudaStreamCaptureMode", a thread-local value
// that controls the error-checking strictness of a capture.
#if CUDA_VERSION >= 11000
struct TORCH_CUDA_CPP_API CUDAStreamCaptureModeGuard {
there are 2 copies of this guard (second one in CUDAGraphsC10Utils.h)
oops, CUDAGraphsC10Utils.h is the one I want, I forgot to delete the original.
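(For reference, the guard amounts to an exchange/restore of the thread-local mode; a minimal sketch around the CUDA runtime call it wraps.)

```cpp
#include <cuda_runtime.h>

// RAII guard sketch for cudaStreamCaptureMode: swap the desired mode in on
// construction, and swap the previous mode back in on destruction.
struct StreamCaptureModeGuardSketch {
  explicit StreamCaptureModeGuardSketch(cudaStreamCaptureMode desired)
      : strictness_(desired) {
    // cudaThreadExchangeStreamCaptureMode writes the previous mode into strictness_.
    cudaThreadExchangeStreamCaptureMode(&strictness_);
  }
  ~StreamCaptureModeGuardSketch() {
    // Restore whatever mode this thread had before the guard was constructed.
    cudaThreadExchangeStreamCaptureMode(&strictness_);
  }
 private:
  cudaStreamCaptureMode strictness_;
};
```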
c10/cuda/CUDACachingAllocator.cpp (Outdated)

remaining->size -= size;
pool.insert(remaining);
bool inserted = pool.blocks.insert(remaining).second;
TORCH_INTERNAL_ASSERT(inserted);
Does this need to be assert, or debug only assert?
Up to you. It wasn't there before, and is unrelated to the graph pool changes, I added it to be more defensive about the allocator's control flow in general.
I'd rather not have it in the hotpath.
Also changed the similar assert on active_blocks.insert to DEBUG_ONLY.
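(i.e., roughly this contrast, using the existing c10 assert macros; shown only for illustration.)

```cpp
#include <set>
#include <c10/util/Exception.h>

void record_block(std::set<const void*>& blocks, const void* block) {
  bool inserted = blocks.insert(block).second;
  // TORCH_INTERNAL_ASSERT(inserted);          // checked in every build (hot-path cost)
  TORCH_INTERNAL_ASSERT_DEBUG_ONLY(inserted);  // compiled out of release (NDEBUG) builds
}
```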
c10/cuda/CUDACachingAllocator.cpp (Outdated)

// Called by CUDAGraph::reset
void notifyCaptureDestroy(MempoolId_t mempool_id) {
  std::lock_guard<std::recursive_mutex> lock(mutex);
  // The graph's been destroyed. We can't blindly delete and cudaFree its mempool, because
In the interest of accuracy, it's graph executor that's likely been destroyed here? Graph is destroyed at normal capture end, and shouldn't result in a call to this function.
}
return (block->pool->is_small) ?
    (remaining >= kMinBlockSize) :
    (remaining > kSmallSize);
Are you sure an invalid pool can't come here? There was a check for it before.
In an earlier version** I checked block against &small_blocks and &large_blocks, then a loop that checked against all the private pools' &small_blocks and &large_blocks, then finally an error if no match was found. It was kinda ugly and Ed said it was "perf fishy". Using a new pool->is_small field made the code much simpler. If we still want to validate block, I have to restore all the original checks. I can do so if you want, but personally I don't think it's worth the ugliness. I have other checks that pools are present or not present when expected. get_stat_type_for_pool is in the same boat.

** Hopefully that link works; in my browser I have to wait a few seconds for it to jump to the spot.
Importing to see what internal CI says.
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
aten/src/ATen/cuda/CUDAGraph.h (Outdated)

namespace at {

class CUDAGeneratorImpl;
Our linter has a reasonable complaint here: in CUDAGeneratorImpl.h, CUDAGeneratorImpl is defined as a struct, not a class.
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This pull request has been reverted by 76129c7.
Summary: Resubmit of #51436. Apparently some non-public windows builds run cuda tests on the default stream, so I changed a few capture tests to manually ensure all captures happen on non-default streams.

Pull Request resolved: #54038
Reviewed By: mruberry
Differential Revision: D27068649
Pulled By: ngimel
fbshipit-source-id: 4284475fa40ee38c0f8faff05a2faa310cf8a207
Summary: Implements pytorch#51075 (comment) and additions discussed offline with ezyang ngimel. (Calling it "simple" is charitable but it's not too bad).

[High level strategy](https://github.com/pytorch/pytorch/pull/51436/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R57-R82)

The current design aggregates stats from private pools with the ordinary pools, which may or may not be what we want.

Instead of adding PrivatePools as an internal feature of DeviceAllocator, I could inherit from DeviceAllocator (eg `DevicePrivateAllocator : public DeviceAllocator`) and create separate per-graph instances of the inherited class. I'm not sure if that would be better.

Graph bindings in Python are almost unchanged from pytorch#48875:

```python
# Same bindings as 48875, but now implicitly grabs a private mempool
graph1.capture_begin()
graph1.capture_end()

# pool=... is new. It hints that allocations during graph2's capture may share graph1's mempool
graph2.capture_begin(pool=graph1.pool())
graph2.capture_end()

# graph3 also implicitly creates its own mempool
graph3.capture_begin()
graph3.capture_end()
```

Test plan (other suggestions appreciated):

- [x] Stop maintaining manual references for all the tensors in my existing graphs+RNG tests. If private pools somehow give bad allocations, they should start failing intermittently. They run eager ops and eager allocations mixed with graph replays, so they may expose if eager ops and replays corrupt each other.
- [x] `test_graph_two_successive`: Capture successive graphs, with the second graph using the first graph's result. Try with and without sharing a pool. Check results, also check memory stats to confirm sharing a pool saves memory.
- [x] `test_graph_concurrent_replay`: Capture some graphs in separate private pools, replay them concurrently in different streams, check the results to make sure they don't corrupt each other's memory. Capture some graphs with a shared pool, replay them concurrently in different streams, check results, confirm they DO corrupt each other's memory.
- [x] `test_graph_three_successive`: A three-graph case, checking the safe and unsafe replay patterns in [Restrictions of the Strawman API](pytorch#51075).
- [x] `test_graph_memory_stats_and_use_result_after_destroy_graph`: Comprehensively check torch.cuda.memory_stats() changes that result from graph capture and delete. Check that a tensor ref created during capture and held after graph delete stays valid until the tensor itself is deleted.

Pull Request resolved: pytorch#51436
Reviewed By: mruberry
Differential Revision: D26993790
Pulled By: ngimel
fbshipit-source-id: a992eaee1b8c23628e7b388a5a3c26e0f80e54da
Implements #51075 (comment) and additions discussed offline with @ezyang @ngimel .
High level strategy
The current design aggregates stats from private pools with the ordinary pools, which may or may not be what we want.
Instead of adding PrivatePools as an internal feature of DeviceAllocator, I could inherit from DeviceAllocator (eg DevicePrivateAllocator : public DeviceAllocator) and create separate per-graph instances of the inherited class. I'm not sure if that would be better.

Graph bindings in Python are almost unchanged from #48875:
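```python
# Same bindings as 48875, but now implicitly grabs a private mempool
graph1.capture_begin()
graph1.capture_end()

# pool=... is new. It hints that allocations during graph2's capture may share graph1's mempool
graph2.capture_begin(pool=graph1.pool())
graph2.capture_end()

# graph3 also implicitly creates its own mempool
graph3.capture_begin()
graph3.capture_end()
```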
Test plan (other suggestions appreciated):
- test_graph_two_successive: Capture successive graphs, with the second graph using the first graph's result. Try with and without sharing a pool. Check results, also check memory stats to confirm sharing a pool saves memory. (A rough usage sketch follows this list.)
- test_graph_concurrent_replay: Capture some graphs in separate private pools, replay them concurrently in different streams, check the results to make sure they don't corrupt each other's memory. Capture some graphs with a shared pool, replay them concurrently in different streams, check results, confirm they DO corrupt each other's memory.
- test_graph_three_successive: A three-graph case, checking the safe and unsafe replay patterns in Restrictions of the Strawman API.
- test_graph_memory_stats_and_use_result_after_destroy_graph: Comprehensively check torch.cuda.memory_stats() changes that result from graph capture and delete. Check that a tensor ref created during capture and held after graph delete stays valid until the tensor itself is deleted.
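As a rough illustration of the pool-sharing pattern the first two tests exercise, here is a usage sketch. It assumes the graph objects expose the capture_begin/capture_end/pool bindings shown above, that the public class name is torch.cuda.CUDAGraph (it may differ in this PR's revision), and that captures run on a non-default stream:

```python
import torch

s = torch.cuda.Stream()
g1 = torch.cuda.CUDAGraph()  # assumed class name; see note above
g2 = torch.cuda.CUDAGraph()

x = torch.zeros(1024, device="cuda")

with torch.cuda.stream(s):   # captures must happen on a non-default stream
    # Warm up so lazy initialization doesn't happen during capture.
    (x + 1) * 2
    torch.cuda.synchronize()

    g1.capture_begin()                 # implicitly grabs a private mempool
    y = x + 1
    g1.capture_end()

    g2.capture_begin(pool=g1.pool())   # share g1's mempool instead of creating a new one
    z = y * 2
    g2.capture_end()

g1.replay()
g2.replay()
torch.cuda.synchronize()
```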