[DeviceResidentTensors] Implementing device resident tensors in OpenCL #3671
Conversation
(Force-pushed from a2fb9b9 to 1c41c07)
Awesome! Nice job @mortzur.
I do think we'll need some pretty sophisticated tests to hit the ins and outs of this though.
include/glow/Base/Tensor.h
Outdated
bool isDeviceResident() const { return residencyInfoP_->isDeviceResident(); }

// Update device residency info with new device manager and context
void moveToDevice(runtime::DeviceManager *deviceManager, void *context) {
Question for a future diff: To support pinned Tensors, should we have an optional argument here to control whether tensorResidency_ is Device or Host with DeviceManager state?
What is the use case for pinned tensors?
Do we want to be able to copy from pinned memory directly to the device and vice versa? (Meaning: do we need two pointers to opaque memory in the Tensor class, one for the pinned location and one for the device location?)
Yes, we want to copy into pinned memory ahead of time and then move it to the device from that memory. I think we can store the extra pointer in the DeviceResidencyInfo for that backend? Unsure, but we don't need to solve it for this diff.
include/glow/Base/Tensor.h
Outdated
DeviceResidencyInfo()
    : tensorResidency_(TensorResidency::Host), deviceManager(nullptr),
      context(nullptr) {}
I think you should have a destructor here which calls deviceManager->releaseTensor if the tensor is resident.
Yes!
What we actually want is to give the DM an option to release / put back in the pool the memory buffer.
I think we should change the API to releaseDeviceTensor(void *context).
What's needed in order to free the buffer is the context. The Tensor may be used to get the context from the DRI, and if we want the DRI destructor to call "release", it can pass the stored context.
Is there another motivation to have Tensor& as the input argument?
Well, releasing has two parts: freeing the device resource and clearing the DeviceResidencyInfo state in Tensor. If we pass in Tensor we can do both in the releaseDeviceTensor call, if we pass in context we can only do the first.
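The two-step release described here can be sketched as follows. This is a minimal, hypothetical mock (MockTensor, MockDeviceManager, and the fields are illustrative stand-ins, not Glow's real types): passing the Tensor lets releaseDeviceTensor both free the device resource and clear the residency state in one call.

```cpp
#include <cassert>

enum class TensorResidency { Host, Device };

// Simplified stand-in for Glow's DeviceResidencyInfo.
struct DeviceResidencyInfo {
  TensorResidency tensorResidency = TensorResidency::Host;
  void *context = nullptr;
  bool isDeviceResident() const {
    return tensorResidency == TensorResidency::Device;
  }
  void clear() {
    tensorResidency = TensorResidency::Host;
    context = nullptr;
  }
};

struct MockTensor {
  DeviceResidencyInfo residency;
};

struct MockDeviceManager {
  int liveBuffers = 0;
  // Taking Tensor& lets us free the buffer *and* reset the residency info,
  // so callers cannot forget the second step.
  bool releaseDeviceTensor(MockTensor &t) {
    if (!t.residency.isDeviceResident())
      return false; // double-release is a no-op
    --liveBuffers;
    t.residency.clear();
    return true;
  }
};
```

The context-only variant would cover the first step (freeing the buffer) but leave clearing the Tensor's residency state to the caller.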
/// Copies the contents of \p tensor from the host to the \p location address
/// on this device. Updates the tensor residency info.
virtual bool transferToDevice(Tensor &tensor, void *context) {
This updates the tensor residency info to the information in void *context right?
Should context be named residencyInfo or location?
Where does the location information come from? Should there be a getAddress() method as well?
Personally, I think it should be called location and should definitely be optional.
Don't think getAddress() works because some devices may not use addresses.
DeviceResidencyInfo contains pointers to:
- the DeviceManager
- an opaque context

So it is not the residencyInfo itself.
In addition, I think there may be use cases where the context is useful for storing additional information besides the actual memory address / location.
For example, consider a backend with a memory pool for tensors. Its device manager would keep a mapping between keys and buffers; when a tensor becomes device resident, it gets a buffer. In that case, one option is to store the key in the context.
Naming: how about locationContext?
getAddress(...): this goes to how a given device manager creates a new locationContext.
The challenge is figuring out whether there is a single interface that fits most backends, i.e. whether it makes sense to add createLocationContext() (taking no input arguments), or whether a given backend needs the function name, or the placeholder name (as we briefly started discussing :)).
I'd be concerned that if we find an approach that works well for all backends we know about now, we'll encounter a future backend that doesn't fit that scheme.
(Force-pushed from 8e5e8c3 to ceebc7c)
@nickgg has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@mortzur has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
fut.wait();
hostManager_->ensureOutputsAvailable(*contextOut.get());
It would be nice to be able to control this behaviour somehow, so that if we're using the EE but want to leave Tensors on the device, we could.
@@ -480,6 +480,15 @@ void HostManager::updateExecutionStats(
                   1000000 / duration);
}

void HostManager::ensureOutputsAvailable(ExecutionContext &context) {
It's important to make sure we're calling this everywhere we use the HostManager until the user is able to deal with Tensors staying Device Resident. The big one I can think of is Onnxifi, but there may be others.
}

/// Releases the device buffer associated with \p tensor.
bool OpenCLDeviceManager::releaseDeviceTensor(void *locationContext) {
I think we spoke about it in person, but I think it would reduce error potential if this took a Tensor and cleared the DeviceResidencyInfo. That way the caller doesn't have to remember to clear the residency info themselves.
This is doable but not as simple as it sounds, I think.
Writing out my thoughts to weigh the pros and cons:
DRI would then be required to hold a pointer to its owning tensor.
This is a result of DRI's destructor calling releaseDeviceTensor(Tensor &t).
The destructor was put in place to prevent device memory leaks, but it can also be error prone, since it requires the user to be aware of double-free flows. For easiest use, releaseDeviceTensor must be implemented to check for double-frees.
Technical details: DRI member initialization moves from the definition to each Tensor constructor (adding "static" complexity).
In addition, awkward handling would be required for Tensor's move constructor and assignment operator, which swap DRI pointers.
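The destructor-based safety net under discussion can be sketched like this. This is an illustrative mock (MockManager and the field names are assumptions, not Glow's API): the destructor releases through the stored manager and context, so DRI never needs a pointer back to its owning Tensor.

```cpp
#include <cassert>

// Counts releases so the leak-prevention behaviour is observable.
struct MockManager {
  int released = 0;
  void releaseDeviceTensor(void * /*context*/) { ++released; }
};

// Simplified stand-in for DeviceResidencyInfo with a releasing destructor.
struct DeviceResidencyInfo {
  MockManager *deviceManager = nullptr;
  void *context = nullptr;
  bool deviceResident = false;
  ~DeviceResidencyInfo() {
    // Prevents device memory leaks without requiring DRI to know its Tensor:
    // the stored context is enough for the device manager to free the buffer.
    if (deviceResident && deviceManager)
      deviceManager->releaseDeviceTensor(context);
  }
};
```

A host-resident DRI destructs without touching the device manager, so the common path stays free of release calls.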
Ah I see, that's a tricky one. OK, let's leave it as it is and maybe add a comment?
@mortzur has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
virtual ~DeviceTensorTransferManager() {}
/// Copies the contents of \p tensor from the host to the \p location address
/// on this device. Updates the tensor residency info.
virtual bool transferToDevice(Tensor &tensor, void *locationContext) = 0;
I really think locationContext should default to null, which would mean the device should allocate memory itself. This prevents callers from needing to understand how to create the right locationContext.
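The suggested default could look like the sketch below. This is a hypothetical mock (MockTransferManager and its raw hostData/bytes signature are illustrative assumptions, not the real interface): a null locationContext means "the device picks or allocates the buffer itself", so only callers with special placement needs build a context.

```cpp
#include <cassert>
#include <cstddef>

struct MockTransferManager {
  int deviceAllocations = 0;
  // Defaulting locationContext to nullptr keeps the common call site simple.
  bool transferToDevice(const void *hostData, std::size_t bytes,
                        void *locationContext = nullptr) {
    if (hostData == nullptr || bytes == 0)
      return false;
    if (locationContext == nullptr) {
      // No caller-provided location: the device manager allocates one.
      ++deviceAllocations;
    }
    return true;
  }
};
```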
        ->isDeviceResident()) {
  transferFromDevice(
      *(context->getPlaceholderBindings()->get(it->second.first)),
      /* release deivce memory*/ true);
nit: typo in device
  transferFromDevice(
      *(context->getPlaceholderBindings()->get(it->second.first)),
      /* release deivce memory*/ true);
}
auto handle = context->getPlaceholderBindings()
                  ->get(it->second.first)
                  ->getHandle<int64_t>();
This lookup is done twice; could we store the Tensor in a local variable above?
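The suggested refactor is the usual look-up-once pattern; an illustrative sketch (FakeTensor, Bindings, and readValue are stand-ins, not Glow types): do the bindings lookup a single time, keep the result in a local, and reuse it for both the residency check and the handle access.

```cpp
#include <cassert>
#include <map>
#include <string>

struct FakeTensor {
  bool deviceResident = false;
  int value = 0;
};
using Bindings = std::map<std::string, FakeTensor>;

int readValue(Bindings &bindings, const std::string &name) {
  // Single lookup, cached in a local and reused below.
  FakeTensor &t = bindings.at(name);
  if (t.deviceResident) {
    t.deviceResident = false; // stand-in for transferFromDevice(...)
  }
  return t.value; // stand-in for getHandle<int64_t>()
}
```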
@@ -701,3 +731,105 @@ void OpenCLDeviceManager::runFunctionImpl(
  // Fire the resultCB.
  resultCB(id, std::move(executeErr), std::move(context));
}

bool OpenCLDeviceManager::transferToDevice(Tensor &tensor, void *context) {
  runtime::OpenCLDeviceTransferContext *ctx =
DCHECK(context) ?
Are you planning to add any unit tests to this PR?
@mortzur is away for a while and I'll take over this work. Since I'm going to have to put up a new PR to push to it anyway, I'm going to split out the OCL and non-OCL parts.
Summary: Taking over #3671, but spinning out the API and Glow-core level changes associated with the DRT plan in #3629. This does not implement DRT support on any device.
Documentation: See #3629.
Pull Request resolved: #3745
Test Plan: Ran tests; added two simple new sanity checks to DeviceManagerTest. The first, `DeviceResidentTensors`, should run only for backends that support resident tensors (none currently). The second, `CanHandleDeviceResidentTensors`, should run on all devices.
Differential Revision: D18378905
Pulled By: nickgg
fbshipit-source-id: 887c290dae5a6b9b75e9b41a415958d499bc5402
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
This PR has been automatically closed due to being stale for 15 days. Thank you for your contributions, and feel free to reopen it in case of further progress.
Summary:
Device resident tensors API implementation with OpenCL backend.
DeviceResidencyInfo - the class which stores residency info.
Tensor holds a shared_ptr to a default DRI object so that the pointer can be shared on tensor getUnowned calls before any device transfers take place. Calls to clone produce an independent copy with a different DRI object.
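The sharing semantics just described can be sketched with a minimal stand-in (MiniTensor and the simplified DRI are illustrative, not the real classes): unowned views share the same DeviceResidencyInfo through the shared_ptr, so a later device transfer is visible to all of them, while clone() gets an independent copy.

```cpp
#include <memory>

// Simplified residency record; the real DRI also stores manager/context.
struct DeviceResidencyInfo {
  bool deviceResident = false;
};

struct MiniTensor {
  std::shared_ptr<DeviceResidencyInfo> dri =
      std::make_shared<DeviceResidencyInfo>();

  // Unowned views alias the same DRI, so they observe later transfers.
  MiniTensor getUnowned() const {
    MiniTensor view;
    view.dri = dri;
    return view;
  }

  // Clones get their own DRI and are unaffected by the original's transfers.
  MiniTensor clone() const {
    MiniTensor copy;
    copy.dri = std::make_shared<DeviceResidencyInfo>(*dri);
    return copy;
  }
};
```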
DeviceTensorTransferManager - an interface of the transfer operations related to DRT.
Tensor: asserts were added on accessing host tensor data to ensure host manipulations and device residency are mutually exclusive.
Functionality: OpenCL backend now supports DRT and leaves tensors on the device by default. This requires the user to transfer tensors to host to access data.
OpenCL: in order to do device transfers, a base address (cl_mem) and an offset are required. Both were wrapped in a struct, OpenCLDeviceTransferContext.
This should be eliminated if/when we change the OpenCL compiled function to use direct pointers (instead of base+offset pointer arithmetic, see below).
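The rough shape of that transfer context is sketched below. `DeviceBuffer` stands in for OpenCL's `cl_mem` here so the sketch is self-contained; the field names and the helper are illustrative, not the PR's exact code.

```cpp
#include <cstddef>

using DeviceBuffer = void *; // placeholder for cl_mem in this sketch

struct OpenCLDeviceTransferContext {
  DeviceBuffer buffer = nullptr; // base allocation (cl_mem in the backend)
  std::size_t offset = 0;        // byte offset of the tensor within it
};

// With the base+offset scheme, a tensor's effective location is computed by
// pointer arithmetic like this; a direct per-tensor pointer scheme would
// remove the need for the struct entirely.
char *tensorLocation(char *mappedBase, const OpenCLDeviceTransferContext &ctx) {
  return mappedBase + ctx.offset;
}
```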
Previous notes on the OpenCL function:
The OpenCL compiled function and kernels rely on pointer arithmetic within the contiguous memory section (defined by the runtime bundle). This doesn't fit with dynamic memory allocation for device tensor buffers. (Ongoing discussion with @nickgg.) This may be addressed in multiple ways, for example:
- Modify the OpenCL function and kernels to support absolute addresses (instead of the base+offset scheme). This enables dynamic memory allocation and also different memory management schemes (allocating per tensor ahead of time).
- Block the entire contiguous memory section until all tensors are released (while doing the required ref-counting).
Currently solved by "buffering": copying to a separate device buffer when the function completes and freeing the function buffer.
Test Plan:
ninja test