
[DeviceResidentTensors] Implementing device resident tensors in OpenCL #3671

Closed
wants to merge 3 commits

Conversation

Contributor

@mortzur mortzur commented Oct 23, 2019

Summary:
Implementation of the device resident tensors (DRT) API with the OpenCL backend.

  • DeviceResidencyInfo - the class which stores residency info.
    A Tensor holds a shared_ptr to a default DRI object so that the pointer can be shared on "getUnowned" calls before any device transfer takes place. Calls to "clone" produce an independent copy with its own DRI object. (See the sketch after this list.)

  • DeviceTensorTransferManager - an interface for the transfer operations related to DRT.

  • Tensor: asserts were added when accessing host tensor data, to ensure that host manipulations and device residency are mutually exclusive.

  • Functionality: the OpenCL backend now supports DRT and leaves tensors on the device by default. This requires the user to transfer tensors back to the host in order to access their data.

    • HostManager - implements the method "ensureOutputsAvailable()". Calling this after runNetworks ensures all tensors are on the host.
    • Execution Engine - EE::run() uses the hostManager and calls ensureOutputsAvailable().
    • DeviceManager based testing - a "copyDeviceTensorsToHost" field was added to DeviceConfig so that runFunction transfers all tensors back to the host when it completes.
  • OpenCL: in order to do device transfers, a base address (cl_mem) and an offset are required. Both are wrapped in a struct OpenCLDeviceTransferContext (see the sketch after this list).
    This should be eliminated if/when we change the OpenCL compiled function to use direct pointer calls (instead of base+offset pointer arithmetic, see below).
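
For illustration, a minimal sketch of the two pieces above (member names follow the snippets quoted in the review below; the OpenCL struct's field names are assumptions, not the PR's exact code):

#include <CL/cl.h> // cl_mem
#include <cstddef>

namespace glow {
namespace runtime {
class DeviceManager;
}

enum class TensorResidency { Host, Device };

// Stores where a tensor's data currently lives and, if it is on a device,
// which DeviceManager owns it plus an opaque, backend-specific location.
class DeviceResidencyInfo {
public:
  DeviceResidencyInfo()
      : tensorResidency_(TensorResidency::Host), deviceManager(nullptr),
        context(nullptr) {}
  bool isDeviceResident() const {
    return tensorResidency_ == TensorResidency::Device;
  }

  TensorResidency tensorResidency_;
  runtime::DeviceManager *deviceManager;
  void *context;
};

// OpenCL needs a base cl_mem plus an offset to do a transfer; both are
// wrapped together. Field names here are illustrative only.
struct OpenCLDeviceTransferContext {
  cl_mem baseBuffer;
  size_t offset;
};
} // namespace glow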

Previous notes on the OpenCL function:
The OpenCL compiled function and kernels rely on pointer arithmetic within the contiguous memory section (defined by the runtime bundle). This doesn't fit with dynamic memory allocation for device tensor buffers (ongoing discussion with @nickgg). This may be addressed in multiple ways, for example:
- Modify the OpenCL function and kernels to support absolute addresses (instead of the base+offset scheme). This enables dynamic memory allocation and also different memory management schemes (e.g. allocating per tensor ahead of time).
- Block the entire contiguous memory section until all tensors are released (while doing the required ref-counting).

Currently this is solved by "buffering": the tensor is copied to a separate device buffer when the function completes, and the function buffer is freed.
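
A rough sketch of that buffering step using standard OpenCL calls (clContext, commands, functionBuffer, tensorOffset, and tensorSizeInBytes are assumed to be in scope; this is not the PR's exact code):

// Allocate a standalone device buffer for the tensor.
cl_int err = CL_SUCCESS;
cl_mem standalone = clCreateBuffer(clContext, CL_MEM_READ_WRITE,
                                   tensorSizeInBytes, nullptr, &err);
// (error handling elided)

// Copy the tensor's bytes out of the function's contiguous buffer.
err = clEnqueueCopyBuffer(commands, functionBuffer, standalone,
                          /* src_offset */ tensorOffset, /* dst_offset */ 0,
                          tensorSizeInBytes, 0, nullptr, nullptr);
clFinish(commands); // ensure the copy finished before recycling the source

// The runtime bundle's buffer can now be freed/recycled; the tensor's
// DeviceResidencyInfo points at {standalone, 0} from here on.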

Test Plan:
ninja test

Contributor

@nickgg nickgg left a comment


Awesome! Nice job @mortzur.

I do think we'll need some pretty sophisticated tests to hit the ins and outs of this though.

bool isDeviceResident() const { return residencyInfoP_->isDeviceResident(); }

// Update device residency info with new device manager and context
void moveToDevice(runtime::DeviceManager *deviceManager, void *context) {
Contributor


Question for a future diff: To support pinned Tensors, should we have an optional argument here to control whether tensorResidency_ is Device or Host with DeviceManager state?

Contributor Author


What is the use case for pinned tensors?
Do we want to be able to copy from pinned memory directly to the device and vice versa? (Meaning: do we need two pointers to opaque memory in the Tensor class, one for the pinned location and one for the device location?)

Contributor


Yes, we want to copy into pinned memory ahead of time and then move it to the device from that memory. I think we can store the extra pointer in the DeviceResidencyInfo for that backend? Unsure, but we don't need to solve it for this diff.

include/glow/Base/Tensor.h (outdated comment, resolved)
lib/Backends/OpenCL/OpenCLDeviceManager.cpp (outdated comment, resolved)
DeviceResidencyInfo()
: tensorResidency_(TensorResidency::Host), deviceManager(nullptr),
context(nullptr) {}

Contributor


I think you should have a destructor here which calls deviceManager->releaseTensor if the tensor is resident.
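
For example, something along these lines (sketch only; it assumes the releaseDeviceTensor(void *) variant discussed further down):

~DeviceResidencyInfo() {
  // Free the device buffer if the tensor is still resident on a device.
  if (tensorResidency_ == TensorResidency::Device && deviceManager) {
    deviceManager->releaseDeviceTensor(context);
  }
}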

Contributor Author


Yes!
What we actually want is to give the DeviceManager an option to release the memory buffer or put it back in the pool.

I think we should change the API to releaseDeviceTensor(void *context).
What's needed in order to free the buffer is the context; the Tensor may be used to get the context from the DRI. If we want the DRI destructor to call "release", it can pass the stored context.
Is there another motivation to have Tensor& as the input arg?

Contributor


Well, releasing has two parts: freeing the device resource and clearing the DeviceResidencyInfo state in the Tensor. If we pass in the Tensor we can do both in the releaseDeviceTensor call; if we pass in the context we can only do the first.
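
A sketch of the Tensor& variant being argued for here (the accessor and helper names are hypothetical):

bool OpenCLDeviceManager::releaseDeviceTensor(Tensor &tensor) {
  // 1) Free the device resource via the opaque location context.
  auto *ctx = static_cast<runtime::OpenCLDeviceTransferContext *>(
      tensor.getDeviceResidencyInfo().context); // hypothetical accessor
  freeDeviceBuffer(ctx);                        // hypothetical backend helper
  // 2) Reset the residency info back to Host.
  tensor.clearDeviceResidency();                // hypothetical reset
  return true;
}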


/// Copies the contents of \p tensor from the host to the \p location address
/// on this device. Updates the tensor residency info.
virtual bool transferToDevice(Tensor &tensor, void *context) {
Contributor


This updates the tensor residency info to the information in void *context, right?
Should context be named residencyInfo or location?
Where does the location information come from? Should there be a getAddress() method as well?

Contributor


Personally, I think it should be called location and should definitely be optional.

Don't think getAddress() works because some devices may not use addresses.

Contributor Author


Device Residency Info contains pointers to:

  1. DeviceManager
  2. Opaque context

So it is not the residencyInfo itself.
In addition, I think there may be use cases where the context is useful for storing additional information besides the actual memory address / location.
For example, consider a backend with a memory pool for tensors. Its device manager would have a mapping between keys and buffers. If a tensor becomes device resident, it should get a buffer; one option is to store that key in the context.

Naming - how about locationContext?

getAddress(...): this comes down to how a given device manager creates a new locationContext.
The challenge here is to figure out whether there is a single interface that fits most backends, i.e. whether it makes sense to add createLocationContext() (taking no input arguments). Or does a certain backend need the function name? Or the placeholder name? (as we briefly started discussing :))
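
To make the pool example concrete, a hypothetical locationContext for such a backend could simply carry the pool key (names invented for illustration):

// Hypothetical: a backend whose DeviceManager maps keys to pooled buffers
// can use the key itself as the opaque location context.
struct PooledLocationContext {
  uint64_t bufferKey; // key into the device manager's buffer pool
};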

Contributor


I'd be concerned that if we find an approach that works well for all backends we know about now, we'll encounter a future backend that doesn't fit that scheme.


@facebook-github-bot facebook-github-bot left a comment


@nickgg has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@facebook-github-bot facebook-github-bot left a comment


@mortzur has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


fut.wait();
hostManager_->ensureOutputsAvailable(*contextOut.get());
Contributor


It would be nice to be able to control this behaviour somehow, so that if we're using the EE but want to leave Tensors on the device, we could.

@@ -480,6 +480,15 @@ void HostManager::updateExecutionStats(
1000000 / duration);
}

void HostManager::ensureOutputsAvailable(ExecutionContext &context) {
Contributor


It's important to make sure we're calling this everywhere we use the HostManager until the user is able to deal with Tensors staying Device Resident. The big one I can think of is Onnxifi, but there may be others.

}

/// Releases the device buffer associated with \p tensor.
bool OpenCLDeviceManager::releaseDeviceTensor(void *locationContext) {
Contributor


I think we spoke about it in person, but I think it would reduce error potential if this took in a Tensor and cleared the DeviceResidencyInfo. This way the caller has to always clear the residency info themselves.

Contributor Author


This is doable but not as simple as it sounds, I think.
Writing down my thoughts to consider the pros and cons:

DRI is then required to hold a pointer to its owning tensor.
This is a result of DRI's destructor calling releaseDeviceTensor(Tensor &t).
This was put in place to prevent device memory leaks, but it can also be error-prone, as it requires the user to be aware of double-free flows. For the easiest use, "releaseDeviceTensor" must be implemented to check for double-free.

Technical details: the DRI member init has to move from the definition to each Tensor constructor (adds "static" complexity).
In addition, awkward handling will then be required for Tensor's move constructor and assignment operator, which swap DRI pointers.

Contributor


Ah I see, that's a tricky one. OK, let's leave it as it is and maybe add a comment?

@mortzur mortzur marked this pull request as ready for review October 30, 2019 23:53
@mortzur mortzur changed the title [WIP] Implementing device resident tensors in OpenCL [DeviceResidentTensors] Implementing device resident tensors in OpenCL Oct 30, 2019

@facebook-github-bot facebook-github-bot left a comment


@mortzur has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

virtual ~DeviceTensorTransferManager() {}
/// Copies the contents of \p tensor from the host to the \p location address
/// on this device. Updates the tensor residency info.
virtual bool transferToDevice(Tensor &tensor, void *locationContext) = 0;
Contributor


I really think locationContext should default to null, which means the device should alloc memory itself. This prevents callers from needing to understand how to create the right locationContext.
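
i.e. something like (sketch):

/// A null locationContext means the device allocates the memory itself.
virtual bool transferToDevice(Tensor &tensor, void *locationContext = nullptr) = 0;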

->isDeviceResident()) {
transferFromDevice(
*(context->getPlaceholderBindings()->get(it->second.first)),
/* release deivce memory*/ true);
Contributor


nit: typo in device

transferFromDevice(
*(context->getPlaceholderBindings()->get(it->second.first)),
/* release deivce memory*/ true);
}
auto handle = context->getPlaceholderBindings()
->get(it->second.first)
->getHandle<int64_t>();
Contributor


This lookup is done twice; can we store the Tensor in a local var above?
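
For example (sketch of the suggested refactor):

// Look the Tensor up once and reuse it.
Tensor *out = context->getPlaceholderBindings()->get(it->second.first);
if (out->isDeviceResident()) {
  transferFromDevice(*out, /* release device memory */ true);
}
auto handle = out->getHandle<int64_t>();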

@@ -701,3 +731,105 @@ void OpenCLDeviceManager::runFunctionImpl(
// Fire the resultCB.
resultCB(id, std::move(executeErr), std::move(context));
}

bool OpenCLDeviceManager::transferToDevice(Tensor &tensor, void *context) {
runtime::OpenCLDeviceTransferContext *ctx =
Contributor


DCHECK(context) ?
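
e.g. (sketch):

// Guard against a null context at the top of transferToDevice.
DCHECK(context) << "transferToDevice requires a non-null locationContext";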

@nickgg
Contributor

nickgg commented Oct 31, 2019

Are you planning to add any unit tests to this PR?

@nickgg
Contributor

nickgg commented Nov 6, 2019

@mortzur is away for a while and I'll take over this work. Since I'm going to have to put up a new PR to push to it anyway, I'm going to split out the OCL and non-OCL parts.

facebook-github-bot pushed a commit that referenced this pull request Nov 13, 2019
Summary:
Taking over #3671, but spinning out the API and Glow-core level changes associated with the DRT plan in #3629. This does not implement DRT support on any device.

Documentation: See #3629.
Pull Request resolved: #3745

Test Plan: Ran tests, added two simple new sanity checks to DeviceManagerTest. The first `DeviceResidentTensors` should run only for backends that support resident tensors (none currently). The second `CanHandleDeviceResidentTensors` should run on all devices.

Differential Revision: D18378905

Pulled By: nickgg

fbshipit-source-id: 887c290dae5a6b9b75e9b41a415958d499bc5402
@stale

stale bot commented Nov 21, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale

stale bot commented Dec 6, 2019

This PR has been automatically closed due to being stale for 15 days. Thank you for your contributions and feel free to reopen it in case of further progress.

@stale stale bot closed this Dec 6, 2019
vdantu pushed a commit to vdantu/glow that referenced this pull request Jul 12, 2020