Support wider range of types in FutureNCCL #48502

lw · 2020-11-26T18:56:55Z

Stack from ghstack:

- #48790 Add support for async callbacks in ivalue::Future
- #49014 Drop FutureNCCL in favor of vanilla CUDAFuture
- #48789 Make CUDAFuture remember and restore current device in callback
- #48840 Remove DataPtr extractor from CUDAFuture
- #48788 Cache the DataPtrs in CUDAFuture
- #48506 Split out reusable CUDAFuture from FutureNCCL
- #48505 Merge common parts of FutureNCCL into at::ivalue::Future
- #48504 Split FutureNCCL's CUDA-specific parts from generic future logic
- #48502 Support wider range of types in FutureNCCL
- #48501 Don't store device indices separately on FutureNCCL
- #48500 Add multi-GPU support to FutureNCCL
- #48563 Fix FutureNCCL not recording dataptrs with caching alloc in wait()
- #48503 Fix FutureNCCL's completed() disagreeing with wait()
- #48499 Record CUDA events for "follow-up" FutureNCCL inside markCompleted
- #48498 Use fresh stream from pool for each FutureNCCL callback
- #48497 Make FutureNCCL record events in current stream
- #48496 Have FutureNCCL record streams w/ allocator in addCallback
- #48562 Add some safeguards to FutureNCCL
- #48495 Remove NCCL dependency from PythonFutureWrapper
- #48561 Avoid using FutureNCCL before it's ready

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

FutureNCCL restricted the values to be tensors, or (singleton) lists of tensors, or Python objects that could be converted to either of those types. We need a CUDA future that can handle more generic types though.

The main challenge is extracting all DataPtrs from an arbitrary object. I think I found some ways of doing so, but I'd like some JIT experts to look into this and tell me if there are better ways. I'll add inline comments for where their input would be appreciated.

Differential Revision: D25177562

lw · 2020-11-26T19:09:02Z

aten/src/ATen/core/ivalue_inl.h

+    // Prefer getSubValues() over visit() as the latter is a silent no-op for
+    // some unsupported types, whereas the former at least fails loudly.
+    value.getSubValues(sub_values);


To clarify this, it seems that getSubValues triggers an assert for these types:

pytorch/aten/src/ATen/core/ivalue.cpp

Lines 202 to 206 in 8b248af

    case Tag::Future:
    case Tag::Device:
    case Tag::PyObject:
    case Tag::Uninitialized:
    case Tag::Capsule:

In contrast, visit just ignores such values and succeeds silently. I must admit that I expected getSubValues and visit to behave consistently (and perhaps the former to be implemented through the latter). Is there any reason why they differ?

Also, are there fundamental reasons for those types to not be supported? Are they planned in the future?

Are there better ways to extract DataPtrs from an IValue? One idea I had was pickling it, but I'm worried about the performance implications.

cc @wanchaol :)
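To make the difference discussed above concrete, here is a minimal sketch of the two traversal styles on a toy value type. Everything in it (`Value`, `Tag`, `visitTensors`, `collectTensors`, `loudTraversalRejects`) is a hypothetical stand-in written for illustration; it is not the real IValue API and does not reproduce the actual implementations of visit() or getSubValues().

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Toy stand-in for an IValue (hypothetical, not the ATen type).
enum class Tag { Tensor, List, PyObject };

struct Value {
  Tag tag;
  std::vector<Value> children;  // only meaningful for Tag::List
};

// Silent style: unsupported tags are skipped without any signal,
// so the caller cannot tell that tensors may have been missed.
void visitTensors(const Value& v, std::vector<const Value*>& out) {
  switch (v.tag) {
    case Tag::Tensor:
      out.push_back(&v);
      break;
    case Tag::List:
      for (const Value& child : v.children) {
        visitTensors(child, out);
      }
      break;
    case Tag::PyObject:
      break;  // silently ignored
  }
}

// Loud style: unsupported tags raise, so the problem surfaces immediately.
void collectTensors(const Value& v, std::vector<const Value*>& out) {
  switch (v.tag) {
    case Tag::Tensor:
      out.push_back(&v);
      break;
    case Tag::List:
      for (const Value& child : v.children) {
        collectTensors(child, out);
      }
      break;
    case Tag::PyObject:
      throw std::invalid_argument("cannot inspect PyObject values");
  }
}

// Helper for demonstrations: true if the loud traversal rejects the value.
bool loudTraversalRejects(const Value& v) {
  std::vector<const Value*> out;
  try {
    collectTensors(v, out);
  } catch (const std::invalid_argument&) {
    return true;
  }
  return false;
}
```

For a list holding one tensor and one opaque Python object, the silent traversal reports a single tensor and gives no hint that something was skipped, while the loud one throws. That is the reason the diff above prefers the variant that fails loudly.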

> Are there better ways to extract DataPtrs from an IValue? One idea I had was pickling it, but I'm worried about the performance implications.

I assume you are concerned about nested data structures with Tensors? It should be fine for ProcessGroupNCCL and futures returned by RPC, but we might need a recursive/pickle-like solution for general CUDAFuture objects returned by then()?

Two cases that I'm not sure are supported in this version of the code are mixed-type lists (e.g., the user having a callback that does return [42, tensor]) and custom user-defined classes that contain tensors as some fields.

Also, I am not sure how this applies to RPC, because in that case the IValue held by the future will be a Message, which will contain the result of the remote user function as a field, and that value could be of any type, including mixed-type lists and user-defined classes. Right?

As a general comment, if getSubValues() or visit() was able to support PyObjects (and properly acquire the GIL and such), then we wouldn't need all that DataPtr extractor craziness. That would by far be the ideal scenario. I don't know how much effort is required to get there...

> I must admit that I expected getSubValues and visit to behave consistently (and perhaps the former to be implemented through the latter). Is there any reason for that?

Semantics-wise, these two functions seem a bit different, but yeah, I agree they could be rewritten together. cc @bzinodev, who implemented those two functions.

> are there fundamental reasons for those types to not be supported? Are they planned in the future?

I think it's a matter of the semantics being undefined for those kinds of types, i.e., what does it mean to get the subvalues of a FutureType IValue? There might not be any subvalues inside if the future is not completed. If we could define those semantics well, we could add support for them.

> Are there better ways to extract DataPtrs from an IValue?

I think what you need here is a "flatten" function that flattens all the different types of objects, so that you can iterate over them and get the tensor data ptrs for the "tensor" IValues. Your approach of using getSubValues looks good, though we still need to define how it should behave for those unsupported types if we want to expand it.
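A rough sketch of what such a "flatten" helper could look like, again on a toy type rather than on the real IValue. The names `ToyValue`, `makeInt`, `makeTensor`, `makeList` and `flattenDataPtrs` are assumptions made up for this example, and a raw `const void*` stands in for a DataPtr.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy stand-in for an IValue (hypothetical; the real type is far richer).
struct ToyValue {
  enum class Kind { Int, Tensor, List };
  Kind kind = Kind::Int;
  long int_value = 0;           // used when kind == Int
  const void* data = nullptr;   // used when kind == Tensor (stands in for a DataPtr)
  std::vector<ToyValue> items;  // used when kind == List
};

ToyValue makeInt(long v) {
  ToyValue r;
  r.kind = ToyValue::Kind::Int;
  r.int_value = v;
  return r;
}

ToyValue makeTensor(const void* data) {
  ToyValue r;
  r.kind = ToyValue::Kind::Tensor;
  r.data = data;
  return r;
}

ToyValue makeList(std::vector<ToyValue> items) {
  ToyValue r;
  r.kind = ToyValue::Kind::List;
  r.items = std::move(items);
  return r;
}

// Depth-first flatten: appends the data pointer of every tensor reachable
// from `value`, skipping leaves that hold no tensor data (such as ints).
void flattenDataPtrs(const ToyValue& value, std::vector<const void*>& out) {
  switch (value.kind) {
    case ToyValue::Kind::Int:
      break;  // nothing to record
    case ToyValue::Kind::Tensor:
      out.push_back(value.data);
      break;
    case ToyValue::Kind::List:
      for (const ToyValue& item : value.items) {
        flattenDataPtrs(item, out);
      }
      break;
  }
}
```

With this shape, a mixed-type value such as `[42, tensor, [tensor]]` yields the two tensor pointers in traversal order. What the helper should do for kinds whose semantics are not defined (for example an incomplete future) is exactly the open question raised in this thread, so the sketch leaves those kinds out.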

> As a general comment, if getSubValues() or visit() was able to support PyObjects (and properly acquire the GIL and such), then we wouldn't need all that DataPtr extractor craziness.

Can you explain a bit more about when/why we need to extract PyObjects to get DataPtrs? I'm not sure I fully understood why supporting PyObjects would remove that craziness. But if you want Python land to iterate the IValues, you can possibly define that inside https://github.com/pytorch/pytorch/blob/2e9d6d99be52cdd2f27512ff357e8aae7f4d3d78/torch%2Fcsrc%2Fjit%2Fpython%2Fpython_ivalue.h#L0-L1

Thanks for your answers!

> I think it's a matter of semantic undefined for those kind of types

Good point, makes sense! I'd say I'm totally fine with only supporting a (reasonable) subset of types in CUDAFuture. The remaining concern is how we can effectively communicate this to the user, both in terms of how we raise a meaningful error and whether we explain it in the docs. Do you have suggestions?

> Can you explain a bit more when/why we need to extract PyObjects to get DataPtrs?

Sorry, it's the other way around: I need to extract DataPtrs from PyObjects. The reason is that the value of a "follow-up" future could be produced by a user-provided Python function, in which case the type of the IValue would be PyObject, right?

Anyways, thanks a lot for pointing me to python_ivalue.h! I hadn't seen it before and it gave me an idea to fix this properly. I'll try to implement and send out a PR, and maybe we can discuss further over there.

Done in #48840.

lw · 2020-11-26T19:11:32Z

torch/csrc/jit/python/pybind_utils.h

+        // FIXME This could fail. As a fallback we could try to pickle the
+        // object, since the pickler might support broader types and it is able
+        // to extract the tensors from the object as a vector.
+        auto new_value = torch::jit::toTypeInferredIValue(obj);


My understanding is that the type can only be inferred when it's "regular", i.e., a mixed-type list won't work. Are there ways to handle such cases? Would pickling work?

More generally, I'd like to understand more about the PyObject type. Are all objects coming from Python represented as PyObjects, or only the ones for which no better type is found? Do we always need to hold the GIL when manipulating such types?

dr-ci · 2020-11-26T23:04:35Z

💊 CI failures summary and remediations

As of commit 07fc860 (more details on the Dr. CI page):

3/3 failures possibly* introduced in this PR
- 3/3 non-CircleCI failure(s)


wanchaol · 2020-12-04T08:31:51Z

aten/src/ATen/core/ivalue_inl.h

@@ -430,20 +430,18 @@ struct C10_EXPORT ivalue::Future : c10::intrusive_ptr_target {
  // Expose the default implementation so that external ones can defer to it.
  static std::vector<std::reference_wrapper<const at::DataPtr>>
  defaultDataPtrExtractor(const at::IValue& value) {


seems to me that defaultDataPtrExtractor is only used by pybind_utils.h, maybe we can move this to python_ivalue.h to make it handle the PyObject case as well?

@wanchaol

The CUDAFuture class needs to inspect the values it contains in order to extract its tensors (in fact, the DataPtrs backing those). These are needed first to determine what CUDA devices back those tensors, so that an event for each such device can be recorded; and later to record these DataPtrs with the CUDA caching allocator if they are used in other streams.

This became complicated when Python was added to the mix, because to inspect a Python object we need to acquire the GIL, but we couldn't do so from code that was supposed to also work in C++-only mode. The solution was for users to provide a custom way to extract DataPtrs, so that the PythonFutureWrapper could install such a custom Python-aware one. This was the DataPtr extractor.

In #48502 a different suggestion was proposed. At its root, it consists of adding support for IValues of type PyObject to the visit() and getSubValues() methods. In order to deal with the GIL, we do this through a virtual method: PyObjectHolder, which is the base class, is available also in C++-only mode, and thus defines this method but leaves it unimplemented; ConcretePyObjectHolder, which is the subclass, is only included in Python mode, and thus it can implement that method, acquire the GIL, and do what it's supposed to.

In my opinion, this approach is just brilliant! Thanks @wanchaol for proposing it! It hides the complexity of dealing with Python inside getSubValues(), where it can be done properly, thus enormously simplifying the CUDAFuture and the PythonFutureWrapper classes.

Differential Revision: [D25334355](https://our.internmc.facebook.com/intern/diff/D25334355/)
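The virtual-method arrangement described in this commit message can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from #48840: `HolderBase`, `ConcreteHolder`, `ToyTensor`, `ToyGilGuard`, `extractTensors` and `devicesOf` are hypothetical names, and a counter stands in for acquiring the real GIL.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Toy tensor: a data pointer plus the index of the device that backs it.
struct ToyTensor {
  const void* data;
  int device_index;
};

// Stand-in for the GIL (hypothetical): counts how often it was "acquired".
int toy_gil_acquisitions = 0;

struct ToyGilGuard {
  ToyGilGuard() { ++toy_gil_acquisitions; }
};

// Base class: visible to C++-only code. It declares the hook but provides
// no implementation, so nothing here depends on Python.
class HolderBase {
 public:
  virtual ~HolderBase() = default;
  virtual std::vector<ToyTensor> extractTensors() const = 0;
};

// Subclass: would only be compiled in Python-enabled builds, so it is free
// to take the (toy) GIL before inspecting the object it wraps.
class ConcreteHolder : public HolderBase {
 public:
  explicit ConcreteHolder(std::vector<ToyTensor> tensors)
      : tensors_(std::move(tensors)) {}

  std::vector<ToyTensor> extractTensors() const override {
    ToyGilGuard guard;  // the real code would acquire the GIL here
    return tensors_;
  }

 private:
  std::vector<ToyTensor> tensors_;
};

// Generic future-side logic: it only sees the base class, yet it can find
// out which devices need an event recorded.
std::set<int> devicesOf(const HolderBase& holder) {
  std::set<int> devices;
  for (const ToyTensor& t : holder.extractTensors()) {
    devices.insert(t.device_index);
  }
  return devices;
}
```

The point of the design is that `devicesOf` never mentions Python: the lock handling lives entirely in the subclass, which is what lets the generic future code stay usable in C++-only mode.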

wanchaol

looks good

facebook-github-bot · 2020-12-10T13:14:06Z

This pull request has been merged in a677898.

@wanchaol

Pull Request resolved: #48840 The CUDAFuture class needs to inspect the values it contains in order to extract its tensors (in fact, the DataPtrs backing those). These are needed first to determine what CUDA devices back those tensors, so that an event for each such device can be recorded; and later to record these DataPtrs with the CUDA caching allocator if they are used in other streams. This became complicated when Python was added to the mix, because to inspect a Python object we need to acquire the GIL, but we couldn't do so from code that was supposed to also work in C++-only mode. The solution was for users to provide a custom way to extract DataPtrs, so that the PythonFutureWrapper could install such a custom Python-aware one. This was the DataPtr extractor. In #48502 a different suggestion was proposed. At its root, it consists in adding support for IValues of type PyObject to the visit() and getSubValues() methods. In order to deal with the GIL, we do this through a virtual method: PyObjectHolder, which is the base class, is available also in C++-only mode, and thus defines this method but leaves it unimplemented; ConcretePyObjectHolder, which is the subclass, is only included in Python mode, and thus it can implement that method, acquire the GIL, and do what it's supposed to. In my opinion, this approach is just brilliant! Thank @wanchaol for proposing it! It hides the complexity of dealing with Python inside getSubValues(), where it can be done properly, thus simplifying enormously the CUDAFuture and the PythonFutureWrapper classes. ghstack-source-id: 118704935 Differential Revision: [D25334355](https://our.internmc.facebook.com/intern/diff/D25334355/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D25334355/)!

Summary: Pull Request resolved: #48840 The CUDAFuture class needs to inspect the values it contains in order to extract its tensors (in fact, the DataPtrs backing those). These are needed first to determine what CUDA devices back those tensors, so that an event for each such device can be recorded; and later to record these DataPtrs with the CUDA caching allocator if they are used in other streams. This became complicated when Python was added to the mix, because to inspect a Python object we need to acquire the GIL, but we couldn't do so from code that was supposed to also work in C++-only mode. The solution was for users to provide a custom way to extract DataPtrs, so that the PythonFutureWrapper could install such a custom Python-aware one. This was the DataPtr extractor. In #48502 a different suggestion was proposed. At its root, it consists in adding support for IValues of type PyObject to the visit() and getSubValues() methods. In order to deal with the GIL, we do this through a virtual method: PyObjectHolder, which is the base class, is available also in C++-only mode, and thus defines this method but leaves it unimplemented; ConcretePyObjectHolder, which is the subclass, is only included in Python mode, and thus it can implement that method, acquire the GIL, and do what it's supposed to. In my opinion, this approach is just brilliant! Thank wanchaol for proposing it! It hides the complexity of dealing with Python inside getSubValues(), where it can be done properly, thus simplifying enormously the CUDAFuture and the PythonFutureWrapper classes. ghstack-source-id: 118704935 Test Plan: Unit tests Reviewed By: wanchaol Differential Revision: D25334355 fbshipit-source-id: 3f1d3bf6e6e8505a114c877fb9a6fcc3f68d91d3


lw requested review from apaszke, mingzhe09088, mrshenli, pietern, pritamdamania87, rohan-varma and zhaojuanmao as code owners November 26, 2020 18:56

facebook-github-bot added the cla signed, oncall: jit, and oncall: distributed labels Nov 26, 2020

lw commented Nov 26, 2020


This was referenced Nov 29, 2020

Avoid using FutureNCCL before it's ready #48561

Closed

Add some safeguards to FutureNCCL #48562

Closed

Fix FutureNCCL not recording dataptrs with caching alloc in wait() #48563

Closed

lw added 2 commits November 29, 2020 14:33

This was referenced Dec 3, 2020

Cache the DataPtrs in CUDAFuture #48788

Closed

Make CUDAFuture remember and restore current device in callback #48789

Closed

Add support for async callbacks in ivalue::Future #48790

Closed

wanchaol reviewed Dec 4, 2020


lw mentioned this pull request Dec 4, 2020

Remove DataPtr extractor from CUDAFuture #48840

Closed

lw mentioned this pull request Dec 8, 2020

Drop FutureNCCL in favor of vanilla CUDAFuture #49014

Closed

wanchaol approved these changes Dec 8, 2020


facebook-github-bot closed this in a677898 Dec 10, 2020

facebook-github-bot added the Merged label Dec 10, 2020

facebook-github-bot deleted the gh/lw/91/head branch December 13, 2020 15:17


lw Nov 26, 2020

mrshenli Nov 26, 2020

mrshenli Nov 26, 2020

lw Nov 27, 2020

lw Nov 27, 2020

wanchaol Dec 4, 2020

lw Dec 4, 2020

lw Dec 4, 2020

lw Nov 26, 2020

dr-ci bot commented Nov 26, 2020 •

edited

wanchaol Dec 4, 2020

wanchaol left a comment

facebook-github-bot commented Dec 10, 2020

	case Tag::Future:
	case Tag::Device:
	case Tag::PyObject:
	case Tag::Uninitialized:
	case Tag::Capsule:


dr-ci bot commented Nov 26, 2020 • edited

💊 CI failures summary and remediations

Extra GitHub checks: 1 failed

ci.pytorch.org: 1 failed

codecov.io: 1 failed
