
Let child CUDAFuture wait for parent CUDAFuture's CUDAEvents #51506

Closed · wants to merge 3 commits

Conversation

@mrshenli (Contributor) commented Feb 1, 2021

Stack from ghstack:

If the child cannot extract tensors from the returned IValue, the
current child CUDAFuture won't wait for anything. In this case,
if `wait()` wasn't called on the parent Future, streams are
not synchronized, and it is possible that the parent Future's CUDA
ops have not been added to streams yet.

This commit lets the child CUDAFuture hold on to the parent's
CUDAEvents at initialization time, and only replaces those events
with its own if sub-value extraction from the returned IValue
succeeds.

Differential Revision: D26187379
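The behavior described above can be modeled with a toy Python sketch (not the actual ATen C++ implementation; `ToyCudaFuture`, `mark_completed`, and the string "events" here are all illustrative stand-ins for CUDAFuture's real CUDAEvent handling):

```python
# Toy model of the proposed fallback: the child future inherits the
# parent's events up front, and only swaps in its own events when
# tensor extraction from the completed value succeeds.

class ToyCudaFuture:
    def __init__(self, parent_events=None):
        # Hold the parent's events at initialization time, so that even
        # if extraction later fails, waiting on this future still
        # synchronizes with the parent's CUDA work.
        self.events = list(parent_events or [])

    def mark_completed(self, value, extract_tensors):
        tensors = extract_tensors(value)
        if tensors is not None:
            # Extraction succeeded: record this future's own events
            # in place of the inherited ones.
            self.events = [f"event_for({t})" for t in tensors]
        # Otherwise keep the parent's events (the fallback this PR adds).


parent = ToyCudaFuture()
parent.events = ["parent_event"]

# A child whose value yields no tensors keeps the parent's events.
child = ToyCudaFuture(parent_events=parent.events)
child.mark_completed(object(), lambda v: None)
print(child.events)  # → ['parent_event']
```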

@facebook-github-bot (Contributor) commented Feb 1, 2021

💊 CI failures summary and remediations

As of commit 8d56e6d (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

mrshenli added a commit that referenced this pull request Feb 1, 2021
ghstack-source-id: 8a59f6c25c18c05d93c8a7181032f66e758a7cf4
Pull Request resolved: #51506
@mrshenli mrshenli requested a review from lw February 1, 2021 23:58
@codecov (bot) commented Feb 2, 2021

Codecov Report

Merging #51506 (8d56e6d) into gh/mrshenli/291/base (dacfdbf) will decrease coverage by 0.01%.
The diff coverage is 33.33%.

@@                   Coverage Diff                    @@
##           gh/mrshenli/291/base   #51506      +/-   ##
========================================================
- Coverage                 80.86%   80.85%   -0.01%     
========================================================
  Files                      1942     1942              
  Lines                    211376   211392      +16     
========================================================
+ Hits                     170925   170928       +3     
- Misses                    40451    40464      +13     

@lw (Contributor) left a comment


I have some concerns about this change, so I'd like to understand better what is going on. The problem it's trying to solve is supporting arbitrary user types for the value held by CUDAFuture, right? And the specific instance where this limitation manifested itself is when such value is a RemoteException?

The limited type support of CUDAFuture was a known limitation and should indeed be addressed. Failing loudly in such cases was not ideal but it was intended (as per the Python Zen, "Errors should never pass silently"). My understanding is that after this diff we'd never fail loudly, and instead a failure when parsing the value would be automatically "dealt with" by re-using the parent future's events. Is that correct?

I think such a behavior could lead to sneaky and hard-to-debug issues, as it's not obvious to the user and it may even be wrong in some cases. Imagine the following code:

fut = rpc.rpc_async(...)  # fut just contains one tensor

def cb(f):
    # Future.then passes the completed future to the callback
    return MyCustomClass(f.wait() + 1)

child_fut = fut.then(cb)

The original fut just contains a tensor whose data ptr is thus correctly extracted. The callback, then, launches a CUDA kernel (for that + 1 operation), and this is correctly synchronized with fut's streams/events. However, the callback then returns a user-defined custom class, whose tensors the child future cannot parse and extract. Thus, child_fut will share fut's events. This means that synchronizing with child_fut will not synchronize with the + 1 kernel, and thus any further use of child_fut might operate on an incomplete value, or race with that kernel. Also, using the child future will not record the data ptrs of the result of the + 1 kernel with the caching allocator. These are both failures that are hard to detect and debug.
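The hazard in that scenario can be shown with a toy sketch (plain Python, no CUDA; the event sets and `synchronized_work` helper are illustrative, not real APIs):

```python
# If the child future reuses the parent's events, "waiting" on the child
# only guarantees the work the parent's events cover, not the callback's
# "+ 1" kernel.

parent_event = {"tensor_ready"}       # work captured by the parent's event
callback_kernel = {"plus_one_done"}   # work launched by the callback

# With the fallback in this PR, the child holds only the parent's events:
child_events_fallback = parent_event

# A correct child future would also record the callback's kernel:
child_events_correct = parent_event | callback_kernel

def synchronized_work(events):
    # Waiting on the future only synchronizes the work its events cover.
    return set().union(events)

print("plus_one_done" in synchronized_work(child_events_fallback))  # → False
print("plus_one_done" in synchronized_work(child_events_correct))   # → True
```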

Do you think the above scenario could indeed occur, or did I misunderstand something in the code?

My original intention for dealing with types that couldn't be converted to IValues was to invoke the pickler, which is able to return the data ptrs used by the value we're serializing, which is exactly what we need here. One reason I didn't end up doing so was because of a "dependency" issue (such a code would need to live in CUDAFuture, which is part of ATen, and thus couldn't include the pickler, which is part of libtorch). However, for the RPC agent, we can circumvent this issue because we're already defining a subclass of CUDAFuture, outside of ATen, and thus there we could use the pickler without that problem.

Another simpler alternative is to modify our custom subclass of CUDAFuture to handle RemoteException as a special case, just like it handles Message as a special case right now. This means that the internal types used by RPC would be supported (and more efficiently than if we invoked the pickler), but on the other hand the user would still need to only use "standard" values which can be converted to IValues.
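That special-case alternative might look like the following sketch (the `Message` and `RemoteException` classes here are toy stand-ins, and `extract_data_ptrs` is a hypothetical helper, not the real RPC subclass code):

```python
# Sketch of special-casing known RPC-internal types during data-ptr
# extraction, instead of relying on generic IValue traversal, and failing
# loudly on anything else.

class Message:
    def __init__(self, tensors):
        self.tensors = tensors

class RemoteException(Exception):
    # A remote error carries no tensors, so there is nothing to record.
    pass

def extract_data_ptrs(value):
    if isinstance(value, Message):
        return [id(t) for t in value.tensors]  # id() stands in for data ptrs
    if isinstance(value, RemoteException):
        return []                              # supported, but empty
    # User-defined types remain unsupported and fail loudly.
    raise TypeError(f"unsupported value type: {type(value).__name__}")

print(extract_data_ptrs(RemoteException()))  # → []
```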

@mrshenli (Contributor, Author) commented Feb 2, 2021

Do you think the above scenario could indeed occur, or did I misunderstand something in the code?

Yep, this can indeed occur. I was thinking about using a decorator to solve this problem as mentioned in the code comments.

The limited type support of CUDAFuture was a known limitation and should indeed be addressed. Failing loudly in such cases was not ideal but it was intended (as per the Python Zen, "Errors should never pass silently"). My understanding is that after this diff we'd never fail loudly, and instead a failure when parsing the value would be automatically "dealt with" by re-using the parent future's events. Is that correct?

Yep, failing hard is good, but when I debugged the current code, it did not actually give me an error message. It just hung and then timed out. It will also be hard for users to trace the failure to that getSubvalues() call, as there are many threads running in the background, and they might not know why the user thread is blocked. Even if they identified the error, they don't really have a good way to deal with custom types. So, the only conclusion users can draw from this behavior is that CUDA RPC does not work.

Another simpler alternative is to modify our custom subclass of CUDAFuture to handle RemoteException as a special case, just like it handles Message as a special case right now. This means that the internal types used by RPC would be supported (and more efficiently than if we invoked the pickler), but on the other hand the user would still need to only use "standard" values which can be converted to IValues.

Yep, I thought about this as well, and felt that registering RemoteException as a JIT type might be overkill. But it looks like this is the simplest solution without the pickler if we prefer to crash on user types.

@mrshenli (Contributor, Author) commented Feb 8, 2021

superseded by #51820

@mrshenli mrshenli closed this Feb 8, 2021
@facebook-github-bot facebook-github-bot deleted the gh/mrshenli/291/head branch March 11, 2021 15:14
Labels
cla signed, oncall: distributed