[CUDA graphs] Cuda RNG-safe graph capture and replay bindings #48875
Conversation
💊 CI failures summary and remediations
As of commit 1d69b27 (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns
The following CI failures do not appear to be due to upstream breakages: pytorch_linux_xenial_py3_6_gcc5_4_test (1/1), Step: "Run tests" (full log | diagnosis details | 🔁 rerun)
Lotta test failures lol
namespace at {
namespace cuda {

struct TORCH_CUDA_API CUDAGraph {
Probably should have a // See Note [CUDA Graph... reference somewhere here; probably framed as "why does PyTorch need its own CUDAGraph rep".
c10::nullopt, cuda::detail::getDefaultCUDAGenerator());

auto options = TensorOptions().device(at::kCUDA).dtype(at::kLong);
offset_extragraph_ = at::empty({1}, options);
you sure you want empty here?
Yes, it doesn't matter if we run on garbage; all we're trying to do is bake in the right pointers. If there's data-dependent control flow in the graphed region, we have no business graphing it anyway.
Also, deliberately beginning capture with garbage here will stress any unwanted/unexpected data dependence, and help catch failures.
I mean, half the time the garbage is going to be zeros; if you really want to find unexpected data dependence, you should fill it with some sentinel. (But point duly taken.)
offset_extragraph_ = offset_extragraph;
offset_intragraph_ = 0;
graph_expects_this_gen_ = true;
What if you have multiple captures going on at the same time?
Good point. My first thought was to make `graph_expects_this_gen_` an argument to those consumer-called members, but nvm, that idea is nonsense. There are ways I can try to make this safe which we can discuss in more detail, but I can't think of a simple one. The simplest answer is to disallow that usage: only one capture may be underway at a time.
test/test_cuda.py
Outdated
a = torch.zeros((1000,), device="cuda")
a += 1
g = torch.cuda.Graph()
g.capture_begin()
you sure you don't want a context manager for this?
The manager could also take care of getting onto a non-default stream
I think eventually a context manager is the right exposure, but right now, this won't be documented, and exists to serve experiments. In experiments, I don't want to exit gracefully from graphed regions if capture fails. I want all hell to break loose so we see it.
I'm hacking on a local UX that wraps the exposures here and takes care of non-default stream. The goal of this PR is to introduce a minimal capture exposure to enable flexible UX hacking.
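For what it's worth, here is a minimal sketch of the kind of wrapper being discussed (purely hypothetical, not part of this PR), assuming the `torch.cuda.Graph` binding exercised in the tests above and that capture must run on a non-default stream:

```python
import contextlib
import torch

@contextlib.contextmanager
def graph_capture(graph, stream=None):
    # Capture has to happen on a non-default stream; create one unless the caller supplies it.
    stream = stream if stream is not None else torch.cuda.Stream()
    stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(stream):
        graph.capture_begin()
        try:
            yield graph
        finally:
            graph.capture_end()
    torch.cuda.current_stream().wait_stream(stream)

# Hypothetical usage:
#   g = torch.cuda.Graph()
#   with graph_capture(g):
#       a += 1        # work recorded into the graph
#   g.replay()
```

Whether `capture_end()` should still run when the body raises is exactly the fail-loudly concern above; a real wrapper might skip the `finally` and let everything blow up instead.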
test/test_cuda.py
Outdated
self.assertTrue(a.sum().item() == 3000.)

def test_graph_rng_functional(self):
    # The caching allocator isn't yet graph-safe.
Yeah, so what's the plan for this anyway? Thread local allocator override?
https://www.youtube.com/watch?v=Lf6aMcnR9WQ
Long term (many months), I'm working with the cudaMallocAsync people to ensure cudaMallocAsync plays well with CUDA graphs. Hopefully, once we integrate cudaMallocAsync, capturing allocations will "just work."
Short term, Arslan Zulfiqar, one of our people, hacked the PyTorch allocator to request and reserve a stream silo for the graph(s): graph capture gets its own memory silo/pool in which it can reuse memory, and it won't be affected by other eager allocations. My very next task after this PR is to look into upstreaming his changes or some variation.
So, are you saying, when cudaMallocAsync becomes a thing, we get rid of PyTorch's CUDA caching allocator?
I think that would be a sad waste, not to mention a loss of control for corner cases when you need it. In a perfect world, cudaMallocAsync and the PyTorch allocator would be alternative, user-selectable backends for a common allocator interface (which could also accept external allocators that support the same interface, #43144). I do think cudaMallocAsync should be the default.
The implementation bits seem fine. But the UX may need some smithing. Is there a plan on record here? (@ngimel?)
Codecov Report
@@           Coverage Diff            @@
##           master   #48875    +/-  ##
========================================
  Coverage   80.69%   80.70%
========================================
  Files        1871     1871
  Lines      202062   202064      +2
========================================
+ Hits       163064   163071      +7
+ Misses      38998    38993      -5
test/test_cuda.py
Outdated
with torch.cuda.stream(s1):
    a = torch.zeros((1000,), device="cuda")
    a += 1
    g = torch.cuda.Graph()
This API name doesn't look very private to me! 🤣
🤫
Alright, I can hold off on importing it as Graph for now, so usage (for experimenters) would be:
g = torch._C._CudaGraphBase()
g.capture_begin/capture_end/replay()
Just make it torch.cuda._Graph for now; rename it when we have a better idea what we want to do with it.
I guess I'm OK with merging this on an experimental basis, but right now it's too public-looking for my taste.
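For reference, an end-to-end sketch of how the renamed experimental binding might be exercised. This is an illustration only: the `torch.cuda._Graph` name follows the suggestion above rather than anything merged, and the captured region sticks to in-place ops since the caching allocator isn't graph-safe yet.

```python
import torch

side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())

static_input = torch.zeros(1000, device="cuda")
static_output = torch.zeros(1000, device="cuda")

with torch.cuda.stream(side_stream):
    g = torch.cuda._Graph()             # hypothetical name, per the rename suggested above
    g.capture_begin()
    static_output.copy_(static_input)   # in-place only: no allocations during capture
    static_output.add_(1)
    g.capture_end()
torch.cuda.current_stream().wait_stream(side_stream)

# Replays re-run the captured kernels against the same baked-in buffers,
# so refresh static_input before each replay and read static_output after.
static_input.fill_(2.0)
g.replay()
print(static_output[:3])   # expected: tensor([3., 3., 3.], device='cuda:0')
```

As the thread notes, none of this is documented UX yet; it exists to serve experiments.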
this seems ok. @ngimel do you want to land it?
Let me take a last look and I will.
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Implements #51075 (comment) and additions discussed offline with ezyang, ngimel. (Calling it "simple" is charitable, but it's not too bad.)

[High level strategy](https://github.com/pytorch/pytorch/pull/51436/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R57-R82)

The current design aggregates stats from private pools with the ordinary pools, which may or may not be what we want. Instead of adding PrivatePools as an internal feature of DeviceAllocator, I could inherit from DeviceAllocator (eg `DevicePrivateAllocator : public DeviceAllocator`) and create separate per-graph instances of the inherited class. I'm not sure if that would be better.

Graph bindings in Python are almost unchanged from #48875:

```python
# Same bindings as 48875, but now implicitly grabs a private mempool
graph1.capture_begin()
graph1.capture_end()

# pool=... is new. It hints that allocations during graph2's capture may share graph1's mempool
graph2.capture_begin(pool=graph1.pool())
graph2.capture_end()

# graph3 also implicitly creates its own mempool
graph3.capture_begin()
graph3.capture_end()
```

Test plan (other suggestions appreciated):
- [x] Stop maintaining manual references for all the tensors in my existing graphs+RNG tests. If private pools somehow give bad allocations, they should start failing intermittently. They run eager ops and eager allocations mixed with graph replays, so they may expose if eager ops and replays corrupt each other.
- [x] `test_graph_two_successive`: Capture successive graphs, with the second graph using the first graph's result. Try with and without sharing a pool. Check results, also check memory stats to confirm sharing a pool saves memory.
- [x] `test_graph_concurrent_replay`: Capture some graphs in separate private pools, replay them concurrently in different streams, check the results to make sure they don't corrupt each other's memory. Capture some graphs with a shared pool, replay them concurrently in different streams, check results, confirm they DO corrupt each other's memory.
- [x] `test_graph_three_successive`: A three-graph case, checking the safe and unsafe replay patterns in [Restrictions of the Strawman API](#51075).
- [x] `test_graph_memory_stats_and_use_result_after_destroy_graph`: Comprehensively check torch.cuda.memory_stats() changes that result from graph capture and delete. Check that a tensor ref created during capture and held after graph delete stays valid until the tensor itself is deleted.

Pull Request resolved: #51436
Reviewed By: mruberry
Differential Revision: D26993790
Pulled By: ngimel
fbshipit-source-id: a992eaee1b8c23628e7b388a5a3c26e0f80e54da
Part 2 of #46148 refactor. (part 1 was #48694.)
Contains
Diffs compile and tests pass on my machine (Ubuntu 20.04, CUDA 11.0), but it needs fine-tuning for many CI builds.
See Note [CUDA Graph-safe RNG states] for the strategy, based on #46148 (comment).
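As a rough illustration of what "RNG-safe" buys users here (a hypothetical sketch rather than a test from this PR, assuming the experimental torch.cuda._Graph binding and an allocation-free captured region):

```python
import torch

torch.cuda.manual_seed(5)

side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())

buf = torch.zeros(1000, device="cuda")
with torch.cuda.stream(side_stream):
    g = torch.cuda._Graph()     # hypothetical experimental name
    g.capture_begin()
    buf.uniform_()              # RNG op captured into the graph; in-place, so no allocation
    g.capture_end()
torch.cuda.current_stream().wait_stream(side_stream)

# The captured kernels read their Philox offset from a device tensor rather than a
# baked-in constant, so each replay can draw fresh randomness from the global CUDA
# generator instead of repeating the values produced during capture.
g.replay()
first = buf.clone()
g.replay()
second = buf.clone()
assert not torch.equal(first, second)
```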