Let child CUDAFuture wait for parent CUDAFuture's CUDAEvents #51506
Conversation
If the child cannot extract tensors from the returned IValue, the current child CUDAFuture won't wait for anything. In this case, if `wait()` wasn't called on the parent Future, streams are not synchronized, and it is possible that the parent Future's CUDA ops have not been added to streams yet. This commit lets the child CUDAFuture hold on to the parent's CUDAEvents at initialization time, and only replaces those events with its own once sub-value extraction from the returned IValue succeeds. [ghstack-poisoned]
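The proposed behavior can be sketched in plain Python. Everything here (`Event`, `ToyCUDAFuture`, `extract_tensors`) is an illustrative stand-in, not a PyTorch API: it only models the rule "inherit the parent's events, and swap them out only when tensor extraction succeeds."

```python
class Event:
    """Stand-in for an at::cuda::CUDAEvent (hypothetical)."""
    def __init__(self, label):
        self.label = label


def extract_tensors(value):
    """Toy extraction: pretend lists are extractable 'tensor'
    containers; anything else (e.g. a user-defined class) fails."""
    return value if isinstance(value, list) else None


class ToyCUDAFuture:
    def __init__(self, parent_events=()):
        # Start out holding the parent's events, so wait() synchronizes
        # with at least the parent's CUDA work.
        self.events = list(parent_events)

    def mark_completed(self, value):
        tensors = extract_tensors(value)
        if tensors is not None:
            # Extraction succeeded: replace the inherited events with
            # fresh ones recorded for this future's own tensors.
            self.events = [Event(f"child-{i}") for i in range(len(tensors))]
        # Otherwise keep the parent's events unchanged.
```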
ghstack-source-id: 8a59f6c25c18c05d93c8a7181032f66e758a7cf4
Pull Request resolved: #51506
Codecov Report
```diff
@@                Coverage Diff                @@
##           gh/mrshenli/291/base   #51506      +/-   ##
========================================================
- Coverage                 80.86%   80.85%    -0.01%
========================================================
  Files                      1942     1942
  Lines                    211376   211392       +16
========================================================
+ Hits                     170925   170928        +3
- Misses                    40451    40464       +13
```
I have some concerns about this change, so I'd like to understand better what is going on. The problem it's trying to solve is supporting arbitrary user types for the value held by CUDAFuture, right? And the specific instance where this limitation manifested itself is when such a value is a RemoteException?
The limited type support of CUDAFuture was a known limitation and should indeed be addressed. Failing loudly in such cases was not ideal but it was intended (as per the Python Zen, "Errors should never pass silently"). My understanding is that after this diff we'd never fail loudly, and instead a failure when parsing the value would be automatically "dealt with" by re-using the parent future's events. Is that correct?
I think such a behavior could lead to sneaky and hard-to-debug issues, as it's not obvious to the user and it may even be wrong in some cases. Imagine the following code:

```python
fut = rpc.rpc_async(...)  # fut just contains one tensor

def cb(t):
    return MyCustomClass(t + 1)

child_fut = fut.then(cb)
```
The original `fut` just contains a tensor whose data ptr is thus correctly extracted. The callback, then, launches a CUDA kernel (for that `+ 1` operation), and this is correctly synchronized with `fut`'s streams/events. However, the callback then returns a user-defined custom class, which the child future cannot parse and whose tensors it cannot extract. Thus, `child_fut` will share `fut`'s events. This means that synchronizing with `child_fut` will not synchronize with the `+ 1` kernel, and thus any further use of `child_fut` might operate on an incomplete value, or race with that kernel. Also, using the child future will not record the data ptrs of the result of the `+ 1` kernel with the caching allocator. These are both failures that are hard to detect and debug.
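The hazard described above can be modeled in a few lines of plain Python. All names here are illustrative: real CUDA events and kernels are simulated as sets, where an "event" records which kernels it waits for.

```python
# Toy model of the hazard: waiting on a future synchronizes only with
# the kernels covered by that future's events.
class Event:
    def __init__(self, kernels):
        self.kernels = set(kernels)


def kernels_waited_on(events):
    done = set()
    for ev in events:
        done |= ev.kernels
    return done


# Parent future: its events cover the kernel that produced its tensor.
parent_events = [Event({"produce_t"})]

# The callback launches an "add_one" kernel, but returns MyCustomClass,
# whose tensors cannot be extracted, so the child future falls back to
# sharing the parent's events unchanged.
child_events = parent_events

# Waiting on the child therefore does NOT synchronize with "add_one".
print("add_one" in kernels_waited_on(child_events))  # False
```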
Do you think the above scenario could indeed occur, or did I misunderstand something in the code?
My original intention for dealing with types that couldn't be converted to IValues was to invoke the pickler, which is able to return the data ptrs used by the value we're serializing, which is exactly what we need here. One reason I didn't end up doing so was a "dependency" issue (such code would need to live in CUDAFuture, which is part of ATen, and thus couldn't include the pickler, which is part of libtorch). However, for the RPC agent, we can circumvent this issue because we're already defining a subclass of CUDAFuture outside of ATen, and there we could use the pickler without that problem.
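The pickler-based approach could look roughly like this sketch. `FakeTensor` and `TensorCollectingPickler` are hypothetical stand-ins: the real implementation would use torch's pickler and `torch.Tensor.data_ptr()`, but the idea of recording every tensor seen during serialization is the same.

```python
import io
import pickle


class FakeTensor:
    """Stand-in for torch.Tensor."""
    def __init__(self, ptr):
        self.ptr = ptr

    def data_ptr(self):
        return self.ptr


class TensorCollectingPickler(pickle.Pickler):
    """Records the data ptr of every tensor encountered while
    serializing a value, regardless of the value's outer type."""
    def __init__(self, file):
        super().__init__(file)
        self.data_ptrs = []

    def persistent_id(self, obj):
        # Called for every object being pickled; returning None means
        # "pickle it normally", so we only observe, never alter.
        if isinstance(obj, FakeTensor):
            self.data_ptrs.append(obj.data_ptr())
        return None


class MyCustomClass:  # a user type CUDAFuture cannot parse today
    def __init__(self, t):
        self.t = t


pickler = TensorCollectingPickler(io.BytesIO())
pickler.dump(MyCustomClass(FakeTensor(0x1234)))
print(pickler.data_ptrs)  # [4660]
```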
Another simpler alternative is to modify our custom subclass of CUDAFuture to handle RemoteException as a special case, just like it handles Message as a special case right now. This means that the internal types used by RPC would be supported (and more efficiently than if we invoked the pickler), but on the other hand the user would still need to only use "standard" values which can be converted to IValues.
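The special-casing alternative can be sketched as a type dispatch. The classes here are toy stand-ins for the RPC-internal `Message` and `RemoteException` types; the point is that internal types get handled explicitly while arbitrary user types still fail loudly.

```python
class Message:
    """Toy stand-in for the RPC-internal Message type."""
    def __init__(self, tensors):
        self._tensors = tensors

    def tensors(self):
        return self._tensors


class RemoteException(Exception):
    """Toy stand-in for the RPC-internal RemoteException type."""


def extract_tensors(value):
    # Special-case the internal RPC types...
    if isinstance(value, Message):
        return value.tensors()
    if isinstance(value, RemoteException):
        return []  # an exception carries no tensors to synchronize on
    if isinstance(value, list):  # pretend lists are "standard" values
        return value
    # ...and still fail loudly for arbitrary user types.
    raise TypeError(f"unsupported value type: {type(value).__name__}")
```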
Yep, this can indeed occur. I was thinking about using a decorator to solve this problem as mentioned in the code comments.
Yep, failing hard is good, but when I debugged the current code, it did not actually give me an error message. It just hung and then timed out, and it would be hard for users to trace that back to the cause.
Yep, thought about this as well, and felt registering
superseded by #51820
Stack from ghstack:
If the child cannot extract tensors from the returned IValue, the
current child CUDAFuture won't wait for anything. In this case,
if `wait()` wasn't called on the parent Future, streams are
not synchronized, and it is possible that the parent Future's CUDA
ops have not been added to streams yet.

This commit lets the child CUDAFuture hold on to the parent's
CUDAEvents at initialization time, and only replaces those events
with its own once sub-value extraction from the returned IValue
succeeds.
Differential Revision: D26187379