[DeviceResidentTensors] Implementing device resident tensors in OpenCL #3671
Conversation
(Force-pushed from a2fb9b9 to 1c41c07)
Awesome! Nice job @mortzur.
I do think we'll need some pretty sophisticated tests to hit the ins and outs of this though.
include/glow/Base/Tensor.h
Outdated
bool isDeviceResident() const { return residencyInfoP_->isDeviceResident(); }

// Update device residency info with new device manager and context
void moveToDevice(runtime::DeviceManager *deviceManager, void *context) {
Question for a future diff: To support pinned Tensors, should we have an optional argument here to control whether tensorResidency_ is Device or Host with DeviceManager state?
What is the use case for pinned tensors?
Do we want to be able to copy from pinned memory directly to the device and vice versa? (Meaning: do we need two pointers to opaque memory in the Tensor class, one for the pinned location and one for the device location?)
Yes, we want to copy into pinned memory ahead of time and then move it to the device from that memory. I think we can store the extra pointer in the DeviceResidencyInfo for that backend? Unsure, but we don't need to solve it for this diff.
include/glow/Base/Tensor.h
Outdated
DeviceResidencyInfo()
    : tensorResidency_(TensorResidency::Host), deviceManager(nullptr),
      context(nullptr) {}
I think you should have a destructor here which calls deviceManager->releaseTensor if the tensor is resident.
Yes!
What we actually want is to give the DM an option to release / put back in the pool the memory buffer.
I think we should change the API to releaseDeviceTensor(void *context).
What's needed in order to free the buffer is the context. The Tensor may be used to get the context from the DRI, and if we want the DRI destructor to call "release", it can pass the stored context.
Is there another motivation to have Tensor& as the input argument?
Well, releasing has two parts: freeing the device resource and clearing the DeviceResidencyInfo state in Tensor. If we pass in Tensor we can do both in the releaseDeviceTensor call, if we pass in context we can only do the first.
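The two-step release described here can be sketched as follows. This is a minimal, hypothetical mock (MockTensor, MockDeviceManager, and the fields are illustrative stand-ins, not Glow's real types): passing the Tensor lets releaseDeviceTensor both free the device resource and clear the residency state in one call.

```cpp
#include <cassert>

enum class TensorResidency { Host, Device };

// Simplified stand-in for Glow's DeviceResidencyInfo.
struct DeviceResidencyInfo {
  TensorResidency tensorResidency = TensorResidency::Host;
  void *context = nullptr;
  bool isDeviceResident() const {
    return tensorResidency == TensorResidency::Device;
  }
  void clear() {
    tensorResidency = TensorResidency::Host;
    context = nullptr;
  }
};

struct MockTensor {
  DeviceResidencyInfo residency;
};

struct MockDeviceManager {
  int liveBuffers = 0;
  // Taking Tensor& lets us free the buffer *and* reset the residency info,
  // so callers cannot forget the second step.
  bool releaseDeviceTensor(MockTensor &t) {
    if (!t.residency.isDeviceResident())
      return false; // double-release is a no-op
    --liveBuffers;
    t.residency.clear();
    return true;
  }
};
```

The context-only variant would cover the first step (freeing the buffer) but leave clearing the Tensor's residency state to the caller.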
/// Copies the contents of \p tensor from the host to the \p location address
/// on this device. Updates the tensor residency info.
virtual bool transferToDevice(Tensor &tensor, void *context) {
This updates the tensor residency info to the information in void *context right?
Should context be named residencyInfo or location?
Where does the location information come from? Should there be a getAddress() method as well?
Personally, I think it should be called location and should definitely be optional.
Don't think getAddress() works because some devices may not use addresses.
DeviceResidencyInfo contains pointers to:
- the DeviceManager
- an opaque context

So it is not the residencyInfo itself.
In addition, I think there may be use cases where the context is useful for storing additional information besides the actual memory address / location.
For example, consider a backend with a memory pool for tensors. Its device manager would keep a mapping between keys and buffers; when a tensor becomes device resident, it gets a buffer. In that case, one option is to store the key in the context.
Naming: how about locationContext?
getAddress(...): this goes to how a given device manager creates a new locationContext.
The challenge is figuring out whether there is a single interface that fits most backends, i.e. whether it makes sense to add createLocationContext() (taking no input arguments), or whether a given backend needs the function name, or the placeholder name (as we briefly started discussing :)).
I'd be concerned that if we find an approach that works well for all backends we know about now, we'll encounter a future backend that doesn't fit that scheme.
(Force-pushed from 8e5e8c3 to ceebc7c)
@nickgg has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@mortzur has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
fut.wait();
hostManager_->ensureOutputsAvailable(*contextOut.get());
It would be nice to be able to control this behaviour somehow, so that if we're using the EE but want to leave Tensors on the device, we could.
@@ -480,6 +480,15 @@ void HostManager::updateExecutionStats(
                   1000000 / duration);
}

void HostManager::ensureOutputsAvailable(ExecutionContext &context) {
It's important to make sure we're calling this everywhere we use the HostManager until the user is able to deal with Tensors staying Device Resident. The big one I can think of is Onnxifi, but there may be others.
}

/// Releases the device buffer associated with \p tensor.
bool OpenCLDeviceManager::releaseDeviceTensor(void *locationContext) {
I think we spoke about it in person, but I think it would reduce error potential if this took a Tensor and cleared the DeviceResidencyInfo. That way the caller doesn't have to remember to clear the residency info themselves.
This is doable but not as simple as it sounds, I think.
Writing out my thoughts to weigh the pros and cons:
DRI would then be required to hold a pointer to its owning tensor.
This is a result of DRI's destructor calling releaseDeviceTensor(Tensor &t).
The destructor was put in place to prevent device memory leaks, but it can also be error prone, since it requires the user to be aware of double-free flows. For easiest use, releaseDeviceTensor must be implemented to check for double-frees.
Technical details: DRI member initialization moves from the definition to each Tensor constructor (adding "static" complexity).
In addition, awkward handling would be required for Tensor's move constructor and assignment operator, which swap DRI pointers.
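The destructor-based safety net under discussion can be sketched like this. This is an illustrative mock (MockManager and the field names are assumptions, not Glow's API): the destructor releases through the stored manager and context, so DRI never needs a pointer back to its owning Tensor.

```cpp
#include <cassert>

// Counts releases so the leak-prevention behaviour is observable.
struct MockManager {
  int released = 0;
  void releaseDeviceTensor(void * /*context*/) { ++released; }
};

// Simplified stand-in for DeviceResidencyInfo with a releasing destructor.
struct DeviceResidencyInfo {
  MockManager *deviceManager = nullptr;
  void *context = nullptr;
  bool deviceResident = false;
  ~DeviceResidencyInfo() {
    // Prevents device memory leaks without requiring DRI to know its Tensor:
    // the stored context is enough for the device manager to free the buffer.
    if (deviceResident && deviceManager)
      deviceManager->releaseDeviceTensor(context);
  }
};
```

A host-resident DRI destructs without touching the device manager, so the common path stays free of release calls.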
Ah I see, that's a tricky one. OK, let's leave it as it is and maybe add a comment?
@mortzur has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
virtual ~DeviceTensorTransferManager() {}
/// Copies the contents of \p tensor from the host to the \p location address
/// on this device. Updates the tensor residency info.
virtual bool transferToDevice(Tensor &tensor, void *locationContext) = 0;
I really think locationContext should default to null, which would mean the device should allocate memory itself. This prevents callers from needing to understand how to create the right locationContext.
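The suggested default could look like the sketch below. This is a hypothetical mock (MockTransferManager and its raw hostData/bytes signature are illustrative assumptions, not the real interface): a null locationContext means "the device picks or allocates the buffer itself", so only callers with special placement needs build a context.

```cpp
#include <cassert>
#include <cstddef>

struct MockTransferManager {
  int deviceAllocations = 0;
  // Defaulting locationContext to nullptr keeps the common call site simple.
  bool transferToDevice(const void *hostData, std::size_t bytes,
                        void *locationContext = nullptr) {
    if (hostData == nullptr || bytes == 0)
      return false;
    if (locationContext == nullptr) {
      // No caller-provided location: the device manager allocates one.
      ++deviceAllocations;
    }
    return true;
  }
};
```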
        ->isDeviceResident()) {
  transferFromDevice(
      *(context->getPlaceholderBindings()->get(it->second.first)),
      /* release deivce memory*/ true);
nit: typo in device
  transferFromDevice(
      *(context->getPlaceholderBindings()->get(it->second.first)),
      /* release deivce memory*/ true);
}
auto handle = context->getPlaceholderBindings()
                  ->get(it->second.first)
                  ->getHandle<int64_t>();
This lookup is done twice; could we store the Tensor in a local variable above?
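The suggested refactor is the usual look-up-once pattern; an illustrative sketch (FakeTensor, Bindings, and readValue are stand-ins, not Glow types): do the bindings lookup a single time, keep the result in a local, and reuse it for both the residency check and the handle access.

```cpp
#include <cassert>
#include <map>
#include <string>

struct FakeTensor {
  bool deviceResident = false;
  int value = 0;
};
using Bindings = std::map<std::string, FakeTensor>;

int readValue(Bindings &bindings, const std::string &name) {
  // Single lookup, cached in a local and reused below.
  FakeTensor &t = bindings.at(name);
  if (t.deviceResident) {
    t.deviceResident = false; // stand-in for transferFromDevice(...)
  }
  return t.value; // stand-in for getHandle<int64_t>()
}
```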
@@ -701,3 +731,105 @@ void OpenCLDeviceManager::runFunctionImpl(
  // Fire the resultCB.
  resultCB(id, std::move(executeErr), std::move(context));
}

bool OpenCLDeviceManager::transferToDevice(Tensor &tensor, void *context) {
  runtime::OpenCLDeviceTransferContext *ctx =
DCHECK(context) ?
Are you planning to add any unit tests to this PR?
@mortzur is away for a while and I'll take over this work. Since I'm going to have to put up a new PR to push to it anyway, I'm going to split out the OCL and non-OCL parts.
Summary: Taking over #3671, but spinning out the API and Glow-core level changes associated with the DRT plan in #3629. This does not implement DRT support on any device.
Documentation: See #3629.
Pull Request resolved: #3745
Test Plan: Ran tests; added two simple new sanity checks to DeviceManagerTest. The first, `DeviceResidentTensors`, should run only for backends that support resident tensors (none currently). The second, `CanHandleDeviceResidentTensors`, should run on all devices.
Differential Revision: D18378905
Pulled By: nickgg
fbshipit-source-id: 887c290dae5a6b9b75e9b41a415958d499bc5402
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
This PR has been automatically closed due to being stale for 15 days. Thank you for your contributions, and feel free to reopen it in case of further progress.
Summary:
Device resident tensors API implementation with OpenCL backend.
DeviceResidencyInfo - the class which stores residency info.
Tensor holds a shared_ptr to a default DRI object so that the pointer can be shared on tensor getUnowned calls before any device transfers take place. Calls to clone produce an independent copy with a different DRI object.
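The sharing semantics just described can be sketched with a minimal stand-in (MiniTensor and the simplified DRI are illustrative, not the real classes): unowned views share the same DeviceResidencyInfo through the shared_ptr, so a later device transfer is visible to all of them, while clone() gets an independent copy.

```cpp
#include <memory>

// Simplified residency record; the real DRI also stores manager/context.
struct DeviceResidencyInfo {
  bool deviceResident = false;
};

struct MiniTensor {
  std::shared_ptr<DeviceResidencyInfo> dri =
      std::make_shared<DeviceResidencyInfo>();

  // Unowned views alias the same DRI, so they observe later transfers.
  MiniTensor getUnowned() const {
    MiniTensor view;
    view.dri = dri;
    return view;
  }

  // Clones get their own DRI and are unaffected by the original's transfers.
  MiniTensor clone() const {
    MiniTensor copy;
    copy.dri = std::make_shared<DeviceResidencyInfo>(*dri);
    return copy;
  }
};
```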
DeviceTensorTransferManager - an interface of the transfer operations related to DRT.
Tensor: asserts were added on accessing host tensor data to ensure host manipulations and device residency are mutually exclusive.
Functionality: OpenCL backend now supports DRT and leaves tensors on the device by default. This requires the user to transfer tensors to host to access data.
OpenCL: in order to do device transfers, a base address (cl_mem) and an offset are required. Both were wrapped in a struct, OpenCLDeviceTransferContext.
This should be eliminated if/when we change the OpenCL compiled function to use direct pointers (instead of base+offset pointer arithmetic, see below).
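The rough shape of that transfer context is sketched below. `DeviceBuffer` stands in for OpenCL's `cl_mem` here so the sketch is self-contained; the field names and the helper are illustrative, not the PR's exact code.

```cpp
#include <cstddef>

using DeviceBuffer = void *; // placeholder for cl_mem in this sketch

struct OpenCLDeviceTransferContext {
  DeviceBuffer buffer = nullptr; // base allocation (cl_mem in the backend)
  std::size_t offset = 0;        // byte offset of the tensor within it
};

// With the base+offset scheme, a tensor's effective location is computed by
// pointer arithmetic like this; a direct per-tensor pointer scheme would
// remove the need for the struct entirely.
char *tensorLocation(char *mappedBase, const OpenCLDeviceTransferContext &ctx) {
  return mappedBase + ctx.offset;
}
```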
Previous notes on the OpenCL function:
The OpenCL compiled function and kernels rely on pointer arithmetic within the contiguous memory section (defined by the runtime bundle). This doesn't fit with dynamic memory allocation for device tensor buffers. (Ongoing discussion with @nickgg.) This may be addressed in multiple ways, for example:
- Modify the OpenCL function and kernels to support absolute addresses (instead of the base+offset scheme). This enables dynamic memory allocation and also different memory management schemes (allocating per tensor ahead of time).
- Block the entire contiguous memory section until all tensors are released (while doing the required ref-counting).
Currently solved by "buffering": copying to a separate device buffer when the function completes and freeing the function buffer.
Test Plan:
ninja test