
[CUDA graphs] Cuda RNG-safe graph capture and replay bindings #48875

Closed
wants to merge 29 commits

Conversation

@mcarilli (Collaborator) commented Dec 5, 2020

Part 2 of #46148 refactor. (part 1 was #48694.)
Contains

  • a few more CUDAGeneratorImpl diffs to clean up graph capture interaction
  • Capture and replay bindings that interact correctly with CUDAGeneratorImpl
  • Tests.

Diffs compile and tests pass on my machine (Ubuntu 20.04, CUDA 11.0), but it needs fine-tuning for many CI builds.

See Note [CUDA Graph-safe RNG states] for the strategy, based on #46148 (comment).
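For readers without the Note handy, a rough Python-level sketch of the strategy is below. All names, shapes, and increments are illustrative stand-ins, not the actual CUDAGeneratorImpl interface, and a CUDA device is assumed:

```python
import torch

# Illustrative sketch only. During capture, RNG kernels read their Philox
# offset from a one-element device tensor ("offset_extragraph") plus a fixed
# per-kernel increment recorded at capture time. At replay, the generator's
# live offset is copied into that tensor and then advanced past everything
# the replayed graph will consume, so eager RNG and replays stay in sync.
offset_extragraph = torch.zeros(1, dtype=torch.long, device="cuda")
generator_offset = 0        # stand-in for the eager generator's Philox offset
wholegraph_increment = 4    # total offset the captured region consumes (made up)

def rng_replay_prologue():
    global generator_offset
    offset_extragraph.fill_(generator_offset)   # bake the current offset into the graph's input slot
    generator_offset += wholegraph_increment    # reserve the range the replay will use
```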

@mcarilli mcarilli requested a review from ngimel December 5, 2020 03:09
@mcarilli mcarilli requested a review from ezyang December 5, 2020 03:09
@mcarilli mcarilli changed the title [CUDA graphs] CUDAGeneratorImpl-safe graph capture and replay bindings [CUDA graphs] CUDA RNG-safe graph capture and replay bindings Dec 5, 2020
dr-ci bot commented Dec 5, 2020

💊 CI failures summary and remediations

As of commit 1d69b27 (more details on the Dr. CI page):


  • 1/2 failures introduced in this PR
  • 1/2 broken upstream at merge base 7a2abbd on Dec 09 from 2:50pm to 7:46pm

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Dec 11 22:37:01 AssertionError: mypy failed: tools/codegen/gen.py:388:9: error: Missing return statement
Dec 11 22:36:38   test_run_mypy (__main__.TestTypeHints) ... ok (59.788s)
Dec 11 22:36:41   test_run_mypy_strict (__main__.TestTypeHints) ... FAIL (2.806s)
Dec 11 22:37:01   test_type_hint_examples (__main__.TestTypeHints) ... ok (19.741s)
Dec 11 22:37:01 
Dec 11 22:37:01 ======================================================================
Dec 11 22:37:01 FAIL [2.806s]: test_run_mypy_strict (__main__.TestTypeHints)
Dec 11 22:37:01 ----------------------------------------------------------------------
Dec 11 22:37:01 Traceback (most recent call last):
Dec 11 22:37:01   File "test_type_hints.py", line 239, in test_run_mypy_strict
Dec 11 22:37:01     self.fail(f"mypy failed: {stdout} {stderr}")
Dec 11 22:37:01 AssertionError: mypy failed: tools/codegen/gen.py:388:9: error: Missing return statement
Dec 11 22:37:01 Found 1 error in 1 file (checked 11 source files)
Dec 11 22:37:01  
Dec 11 22:37:01 
Dec 11 22:37:01 ----------------------------------------------------------------------
Dec 11 22:37:01 Ran 4 tests in 92.974s
Dec 11 22:37:01 
Dec 11 22:37:01 FAILED (failures=1)
Dec 11 22:37:01 
Dec 11 22:37:01 Generating XML reports...
Dec 11 22:37:01 Generated XML report: test-reports/dist-gloo/TEST-TestTypeHints-20201211223528.xml

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.



@mcarilli mcarilli changed the title [CUDA graphs] CUDA RNG-safe graph capture and replay bindings [CUDA graphs] Cuda RNG-safe graph capture and replay bindings Dec 5, 2020
@ailzhang added the "triaged" label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Dec 7, 2020
@ezyang (Contributor) commented Dec 8, 2020

Lotta test failures lol

namespace at {
namespace cuda {

struct TORCH_CUDA_API CUDAGraph {
Contributor:

Probably should have a // See Note [CUDA Graph... reference somewhere here; probably framed as "why does PyTorch need its own CUDAGraph rep"

c10::nullopt, cuda::detail::getDefaultCUDAGenerator());

auto options = TensorOptions().device(at::kCUDA).dtype(at::kLong);
offset_extragraph_ = at::empty({1}, options);
Contributor:

you sure you want empty here?

Collaborator (Author):

Yes, it doesn't matter if we run on garbage, all we're trying to do is bake in the right pointers. If there's data dependent control flow in the graphed region, we have no business graphing it anyway.

Collaborator (Author):

Also, deliberately beginning capture with garbage here will stress any unwanted/unexpected data dependence, and help catch failures.

@ezyang (Contributor), Dec 10, 2020:

I mean, half the time the garbage is going to be zeros, if you really want to find unexpected data dependence, should fill it with some sentinel. (But duly taken)
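As a concrete illustration of that suggestion, the placeholder could be pre-filled with a sentinel instead of left uninitialized. This is a hypothetical variant, not what this PR does:

```python
import torch

# Hypothetical variant of the capture-time placeholder discussed above:
# pre-fill it with an obviously wrong sentinel so any unexpected data
# dependence during capture produces loudly incorrect results.
offset_extragraph = torch.full((1,), -123456789, dtype=torch.long, device="cuda")
```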

offset_extragraph_ = offset_extragraph;
offset_intragraph_ = 0;
graph_expects_this_gen_ = true;
Contributor:

What if you have multiple captures going on at the same time?

@mcarilli (Collaborator, Author), Dec 8, 2020:

Good point. I first thought the best approach would be to make graph_expects_this_gen_ an argument to those consumer-called members, but never mind, that idea is nonsense. There are ways I could try to make this safe, which we can discuss in more detail, but I can't think of a simple one. The simplest answer is to disallow that usage: only one capture may be underway at a time.
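A hypothetical Python-level guard for that restriction might look like the following. It is not part of this PR, and the real check would live in the C++ bindings; `graph` is assumed to be one of the capture objects discussed in this thread:

```python
# Enforce "only one capture may be underway at a time" at the Python level.
_capture_underway = False

def guarded_capture_begin(graph):
    global _capture_underway
    if _capture_underway:
        raise RuntimeError("only one CUDA graph capture may be underway at a time")
    _capture_underway = True
    graph.capture_begin()

def guarded_capture_end(graph):
    global _capture_underway
    graph.capture_end()
    _capture_underway = False
```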

a = torch.zeros((1000,), device="cuda")
a += 1
g = torch.cuda.Graph()
g.capture_begin()
Contributor:

you sure you don't want a context manager for this?

Contributor:

The manager could also take care of getting onto a non-default stream

Collaborator (Author):

I think eventually a context manager is the right exposure, but right now, this won't be documented, and exists to serve experiments. In experiments, I don't want to exit gracefully from graphed regions if capture fails. I want all hell to break loose so we see it.

@mcarilli (Collaborator, Author), Dec 8, 2020:

I'm hacking on a local UX that wraps the exposures here and takes care of non-default stream. The goal of this PR is to introduce a minimal capture exposure to enable flexible UX hacking.
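For reference, a minimal sketch of such a wrapper is below, assuming only the capture_begin/capture_end bindings discussed in this thread. The wrapper name, its stream handling, and the error behavior are illustrative, not part of this PR:

```python
from contextlib import contextmanager
import torch

@contextmanager
def graph_capture(graph, capture_stream=None):
    # Capture on a side stream after synchronizing with the current stream,
    # and let any capture failure propagate rather than swallowing it.
    capture_stream = capture_stream if capture_stream is not None else torch.cuda.Stream()
    capture_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(capture_stream):
        graph.capture_begin()
        try:
            yield graph
        finally:
            graph.capture_end()
    torch.cuda.current_stream().wait_stream(capture_stream)
```

Usage would be roughly `with graph_capture(g): a += 1`, with `g` constructed as in the test snippet above.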

self.assertTrue(a.sum().item() == 3000.)

def test_graph_rng_functional(self):
# The caching allocator isn't yet graph-safe.
Contributor:

Yeah, so what's the plan for this anyway? Thread local allocator override?

@mcarilli (Collaborator, Author), Dec 8, 2020:

https://www.youtube.com/watch?v=Lf6aMcnR9WQ
Long term (many months) I'm working with cudaMallocAsync people to ensure cudaMallocAsync plays well with cuda graphs. Hopefully after we integrate cudaMallocAsync, capturing allocations will "just work."

Short term, Arslan Zulfiqar, one of our people, hacked the Pytorch allocator to request and reserve a stream-silo for the graph(s): graph capture gets its own memory silo/pool in which it can reuse memory, and it won't be affected by other eager allocations. My very next task after this PR is to look into upstreaming his changes or some variation.

Contributor:

So, are you saying, when cudaMallocAsync becomes a thing, we get rid of PyTorch's CUDA caching allocator?

@mcarilli (Collaborator, Author), Dec 10, 2020:

I think that would be a sad waste, not to mention loss of control for corner cases when you need it. I think the perfect world would be, cudaMallocAsync and the Pytorch allocator are alternative user-selectable backends for some common allocator interface (which may also accept external allocators that support the same interface #43144). I do think cudaMallocAsync should be the default.

@ezyang (Contributor) commented Dec 8, 2020

The implementation bits seem fine. But the UX may need some smithing. Is there a plan on record here? (@ngimel?)

@codecov bot commented Dec 10, 2020

Codecov Report

Merging #48875 (b374e76) into master (7a2abbd) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master   #48875   +/-   ##
=======================================
  Coverage   80.69%   80.70%           
=======================================
  Files        1871     1871           
  Lines      202062   202064    +2     
=======================================
+ Hits       163064   163071    +7     
+ Misses      38998    38993    -5     

with torch.cuda.stream(s1):
a = torch.zeros((1000,), device="cuda")
a += 1
g = torch.cuda.Graph()
Contributor:

This API name doesn't look very private to me! 🤣

Collaborator (Author):

🤫
Alright, I can skip importing it as Graph for now, so usage (for experimenters) would be:

g = torch._C._CudaGraphBase()
g.capture_begin/capture_end/replay()

Collaborator:

Just make it torch.cuda._Graph for now, rename when we have a better idea what we want to do with it.
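A minimal sketch of how the experimental binding might be exercised, assuming it ends up exposed as torch.cuda._Graph per the suggestion above. The capture-on-a-side-stream pattern mirrors the tests in this PR; nothing here is a documented or stable API:

```python
import torch

s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    a = torch.zeros((1000,), device="cuda")
    g = torch.cuda._Graph()   # hypothetical experimental name, per the suggestion above
    g.capture_begin()
    a += 1
    g.capture_end()
torch.cuda.current_stream().wait_stream(s)

g.replay()  # reruns the captured "a += 1" against a's baked-in memory
```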

@ezyang (Contributor) commented Dec 10, 2020

I guess I'm OK with merging this on an experimental basis, but right now it's too public looking for my taste.

@ezyang (Contributor) commented Dec 10, 2020

this seems ok. @ngimel do you want to land it?

@ngimel (Collaborator) commented Dec 11, 2020

Let me take a last look and I will.

@facebook-github-bot (Contributor) left a comment:
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@facebook-github-bot (Contributor):
@ngimel merged this pull request in c068180.

facebook-github-bot pushed a commit that referenced this pull request Mar 12, 2021
Summary:
Implements #51075 (comment) and additions discussed offline with ezyang and ngimel. (Calling it "simple" is charitable, but it's not too bad.)

[High level strategy](https://github.com/pytorch/pytorch/pull/51436/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R57-R82)

The current design aggregates stats from private pools with the ordinary pools, which may or may not be what we want.

Instead of adding PrivatePools as an internal feature of DeviceAllocator, I could inherit from DeviceAllocator (e.g. `DevicePrivateAllocator : public DeviceAllocator`) and create separate per-graph instances of the inherited class. I'm not sure if that would be better.

Graph bindings in Python are almost unchanged from #48875:
```python
# Same bindings as 48875, but now implicitly grabs a private mempool
graph1.capture_begin()
graph1.capture_end()

# pool=... is new.  It hints that allocations during graph2's capture may share graph1's mempool
graph2.capture_begin(pool=graph1.pool())
graph2.capture_end()

# graph3 also implicitly creates its own mempool
graph3.capture_begin()
graph3.capture_end()
```
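As a hedged illustration of the replay side, continuing the snippet above (the ordering rationale follows from the test plan below): when graph2 shares graph1's pool, replaying graph1 before graph2 keeps the memory graph2 reuses holding graph1's results.

```python
# Illustrative replay order for the captures above (not from the PR itself):
graph1.replay()
graph2.replay()   # shares graph1's pool, so replay after graph1
graph3.replay()   # has its own private pool; independent of the other two
```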

Test plan (other suggestions appreciated):

- [x] Stop maintaining manual references for all the tensors in my existing graphs+RNG tests. If private pools somehow give bad allocations, they should start failing intermittently. They run eager ops and eager allocations mixed with graph replays, so they may expose if eager ops and replays corrupt each other.
- [x] `test_graph_two_successive`: Capture successive graphs, with the second graph using the first graph's result. Try with and without sharing a pool. Check results, also check memory stats to confirm sharing a pool saves memory.
- [x] `test_graph_concurrent_replay`: Capture some graphs in separate private pools, replay them concurrently in different streams, check the results to make sure they don't corrupt each other's memory. Capture some graphs with a shared pool, replay them concurrently in different streams, check results, confirm they DO corrupt each other's memory.
- [x] `test_graph_three_successive`: A three-graph case, checking the safe and unsafe replay patterns in [Restrictions of the Strawman API](#51075).
- [x] `test_graph_memory_stats_and_use_result_after_destroy_graph`: Comprehensively check torch.cuda.memory_stats() changes that result from graph capture and delete. Check that a tensor ref created during capture and held after graph delete stays valid until the tensor itself is deleted.

Pull Request resolved: #51436

Reviewed By: mruberry

Differential Revision: D26993790

Pulled By: ngimel

fbshipit-source-id: a992eaee1b8c23628e7b388a5a3c26e0f80e54da
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
@mcarilli added the "module: cuda graphs" label (Ability to capture and then replay streams of CUDA kernels) Apr 28, 2021
Labels
cla signed, Merged, module: cuda graphs (Ability to capture and then replay streams of CUDA kernels), open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
6 participants