
Let child CUDAFuture wait for parent CUDAFuture's CUDAEvents #51506

Closed · wants to merge 3 commits

Conversation

@mrshenli (Contributor) commented Feb 1, 2021

Stack from ghstack:

If the child cannot extract tensors from the returned IValue, the
current child CUDAFuture won't wait for anything. In this case,
if `wait()` wasn't called on the parent Future, streams are
not synchronized, and it is possible that the parent Future's CUDA
ops have not been added to streams yet.

This commit lets the child CUDAFuture hold on to the parent's
CUDAEvents at initialization time, and only replaces those events
with its own if sub-value extraction from the returned IValue
succeeds.

Differential Revision: D26187379
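The behavior described above can be modeled with a toy Python sketch (not the actual ATen C++ implementation; `ToyCudaFuture`, `mark_completed`, and the string "events" here are all illustrative stand-ins for CUDAFuture's real CUDAEvent handling):

```python
# Toy model of the proposed fallback: the child future inherits the
# parent's events up front, and only swaps in its own events when
# tensor extraction from the completed value succeeds.

class ToyCudaFuture:
    def __init__(self, parent_events=None):
        # Hold the parent's events at initialization time, so that even
        # if extraction later fails, waiting on this future still
        # synchronizes with the parent's CUDA work.
        self.events = list(parent_events or [])

    def mark_completed(self, value, extract_tensors):
        tensors = extract_tensors(value)
        if tensors is not None:
            # Extraction succeeded: record this future's own events
            # in place of the inherited ones.
            self.events = [f"event_for({t})" for t in tensors]
        # Otherwise keep the parent's events (the fallback this PR adds).


parent = ToyCudaFuture()
parent.events = ["parent_event"]

# A child whose value yields no tensors keeps the parent's events.
child = ToyCudaFuture(parent_events=parent.events)
child.mark_completed(object(), lambda v: None)
print(child.events)  # → ['parent_event']
```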

@facebook-github-bot (Contributor) commented Feb 1, 2021

💊 CI failures summary and remediations

As of commit 8d56e6d (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

mrshenli added a commit that referenced this pull request Feb 1, 2021
ghstack-source-id: 8a59f6c25c18c05d93c8a7181032f66e758a7cf4
Pull Request resolved: #51506
@mrshenli mrshenli requested a review from lw February 1, 2021 23:58
@codecov (bot) commented Feb 2, 2021

Codecov Report

Merging #51506 (8d56e6d) into gh/mrshenli/291/base (dacfdbf) will decrease coverage by 0.01%.
The diff coverage is 33.33%.

@@                   Coverage Diff                    @@
##           gh/mrshenli/291/base   #51506      +/-   ##
========================================================
- Coverage                 80.86%   80.85%   -0.01%     
========================================================
  Files                      1942     1942              
  Lines                    211376   211392      +16     
========================================================
+ Hits                     170925   170928       +3     
- Misses                    40451    40464      +13     

@lw (Contributor) left a comment


I have some concerns about this change, so I'd like to understand better what is going on. The problem it's trying to solve is supporting arbitrary user types for the value held by CUDAFuture, right? And the specific instance where this limitation manifested itself is when such value is a RemoteException?

The limited type support of CUDAFuture was a known limitation and should indeed be addressed. Failing loudly in such cases was not ideal but it was intended (as per the Python Zen, "Errors should never pass silently"). My understanding is that after this diff we'd never fail loudly, and instead a failure when parsing the value would be automatically "dealt with" by re-using the parent future's events. Is that correct?

I think such a behavior could lead to sneaky and hard-to-debug issues, as it's not obvious to the user and it may even be wrong in some cases. Imagine the following code:

fut = rpc.rpc_async(...)  # fut just contains one tensor

def cb(f):
    # Future.then passes the completed future to the callback
    return MyCustomClass(f.wait() + 1)

child_fut = fut.then(cb)

The original fut just contains a tensor whose data ptr is thus correctly extracted. The callback, then, launches a CUDA kernel (for that + 1 operation), and this is correctly synchronized with fut's streams/events. However, the callback then returns a user-defined custom class, whose tensors the child future cannot parse and extract. Thus, child_fut will share fut's events. This means that synchronizing with child_fut will not synchronize with the + 1 kernel, and thus any further use of child_fut might operate on an incomplete value, or race with that kernel. Also, using the child future will not record the data ptrs of the result of the + 1 kernel with the caching allocator. These are both failures that are hard to detect and debug.
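The hazard in that scenario can be shown with a toy sketch (plain Python, no CUDA; the event sets and `synchronized_work` helper are illustrative, not real APIs):

```python
# If the child future reuses the parent's events, "waiting" on the child
# only guarantees the work the parent's events cover, not the callback's
# "+ 1" kernel.

parent_event = {"tensor_ready"}       # work captured by the parent's event
callback_kernel = {"plus_one_done"}   # work launched by the callback

# With the fallback in this PR, the child holds only the parent's events:
child_events_fallback = parent_event

# A correct child future would also record the callback's kernel:
child_events_correct = parent_event | callback_kernel

def synchronized_work(events):
    # Waiting on the future only synchronizes the work its events cover.
    return set().union(events)

print("plus_one_done" in synchronized_work(child_events_fallback))  # → False
print("plus_one_done" in synchronized_work(child_events_correct))   # → True
```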

Do you think the above scenario could indeed occur, or did I misunderstand something in the code?

My original intention for dealing with types that couldn't be converted to IValues was to invoke the pickler, which is able to return the data ptrs used by the value we're serializing, which is exactly what we need here. One reason I didn't end up doing so was because of a "dependency" issue (such a code would need to live in CUDAFuture, which is part of ATen, and thus couldn't include the pickler, which is part of libtorch). However, for the RPC agent, we can circumvent this issue because we're already defining a subclass of CUDAFuture, outside of ATen, and thus there we could use the pickler without that problem.

Another simpler alternative is to modify our custom subclass of CUDAFuture to handle RemoteException as a special case, just like it handles Message as a special case right now. This means that the internal types used by RPC would be supported (and more efficiently than if we invoked the pickler), but on the other hand the user would still need to only use "standard" values which can be converted to IValues.
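That special-case alternative might look like the following sketch (the `Message` and `RemoteException` classes here are toy stand-ins, and `extract_data_ptrs` is a hypothetical helper, not the real RPC subclass code):

```python
# Sketch of special-casing known RPC-internal types during data-ptr
# extraction, instead of relying on generic IValue traversal, and failing
# loudly on anything else.

class Message:
    def __init__(self, tensors):
        self.tensors = tensors

class RemoteException(Exception):
    # A remote error carries no tensors, so there is nothing to record.
    pass

def extract_data_ptrs(value):
    if isinstance(value, Message):
        return [id(t) for t in value.tensors]  # id() stands in for data ptrs
    if isinstance(value, RemoteException):
        return []                              # supported, but empty
    # User-defined types remain unsupported and fail loudly.
    raise TypeError(f"unsupported value type: {type(value).__name__}")

print(extract_data_ptrs(RemoteException()))  # → []
```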

@mrshenli (Contributor, Author) commented Feb 2, 2021

Do you think the above scenario could indeed occur, or did I misunderstand something in the code?

Yep, this can indeed occur. I was thinking about using a decorator to solve this problem as mentioned in the code comments.

The limited type support of CUDAFuture was a known limitation and should indeed be addressed. Failing loudly in such cases was not ideal but it was intended (as per the Python Zen, "Errors should never pass silently"). My understanding is that after this diff we'd never fail loudly, and instead a failure when parsing the value would be automatically "dealt with" by re-using the parent future's events. Is that correct?

Yep, failing hard is good, but when I debugged the current code, it did not actually give me an error message. It just hung and then timed out. It will also be hard for users to trace the failure to that getSubvalues() call, as there are many threads running in the background, and they might not know why the user thread is blocked. Even if they identified the error, they don't really have a good way to deal with custom types. So, the only conclusion users can draw from this behavior is that CUDA RPC does not work.

Another simpler alternative is to modify our custom subclass of CUDAFuture to handle RemoteException as a special case, just like it handles Message as a special case right now. This means that the internal types used by RPC would be supported (and more efficiently than if we invoked the pickler), but on the other hand the user would still need to only use "standard" values which can be converted to IValues.

Yep, I thought about this as well, and felt that registering RemoteException as a JIT type might be overkill. But it looks like this is the simplest solution without the pickler if we prefer to crash on user types.

@mrshenli (Contributor, Author) commented Feb 8, 2021

superseded by #51820

@mrshenli mrshenli closed this Feb 8, 2021
@facebook-github-bot facebook-github-bot deleted the gh/mrshenli/291/head branch March 11, 2021 15:14
Labels
cla signed, oncall: distributed