Add dynamic buffer support to OCL Backend #3765
Summary: The OpenCL Backend uses a static memory allocation strategy of allocating a single large buffer and then using offsets into it, which is good for the general case, but doesn't allow us to get the most benefit out of Device Resident Tensors (when we'd like to leave an output on the device to be used as the input to another network). This PR adds a more dynamic mapping of device buffers to the OCL backend via OpenCL SubBuffers, which are similar to Glow TensorViews in that they provide access to a region without additional allocations.
There is no behavioural change in this PR, but it provides infrastructure to reference buffers outside of the range of the DeviceBuffer in the future, which we need to get DRT perf wins.
The immediate benefit is that I was able to simplify the OCL kernel code, deleting about 25% of kernels.cl.
Test Plan: tests in release and ASAN
opti-mix left a comment
My two major comments are:
Kernel kernel("kernel_name");
kernel.pushArg(arg1);
kernel.pushArgs(arg2, arg3, arg4);
enqueueKernel(kernel, ...);
Of course, this second comment is not directly related to the scope of this PR and probably should be handled in a separate PR/issue.
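For illustration, a wrapper along the lines of the snippet in this comment could be sketched as follows. This is a minimal host-side sketch, not Glow's actual API: the `Kernel` class, its `pushArg`/`pushArgs` members, and the accessors are hypothetical names taken from or invented around the snippet. It only records arguments as raw bytes; a real version would presumably forward each recorded entry to `clSetKernelArg` with its index, size, and pointer before enqueueing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical wrapper that collects kernel arguments on the host.
// It does not talk to OpenCL; it only stores what would be passed along.
class Kernel {
public:
  explicit Kernel(std::string name) : name_(std::move(name)) {}

  // Record one argument as a copy of its raw bytes.
  template <typename T> void pushArg(const T &value) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "kernel arguments must be trivially copyable");
    const auto *bytes = reinterpret_cast<const unsigned char *>(&value);
    args_.emplace_back(bytes, bytes + sizeof(T));
  }

  // Record several arguments, preserving their order.
  template <typename... Ts> void pushArgs(const Ts &...values) {
    // C++11-compatible pack expansion; evaluates left to right.
    int unused[] = {0, (pushArg(values), 0)...};
    (void)unused;
  }

  const std::string &name() const { return name_; }
  std::size_t numArgs() const { return args_.size(); }

  // Size in bytes of the i-th recorded argument.
  std::size_t argSize(std::size_t i) const {
    assert(i < args_.size());
    return args_[i].size();
  }

  // Pointer to the bytes of the i-th recorded argument.
  const void *argData(std::size_t i) const {
    assert(i < args_.size());
    return args_[i].data();
  }

private:
  std::string name_;
  std::vector<std::vector<unsigned char>> args_;
};
```

With something like this, the argument index is implied by push order, so call sites no longer need to track it by hand.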
Summary: This reverts commit bd69664. I thought I had fixed the last POCL issue in #3765, but I had not. Reverting to fix the OCL build.

Honestly, this last issue (AMD/POCL requires sub-buffers to be aligned) seems to torpedo the whole idea. I can't think of any way to handle Glow TensorViews on the host, which means passing buffer + offset everywhere we pass a buffer below. Essentially this would mean rewriting the whole thing. Very frustrating, since that alignment restriction on sub-buffers makes no sense, and no other OCL implementation has it.

Pull Request resolved: #3784
Differential Revision: D18480248
Pulled By: nickgg
fbshipit-source-id: 9b05009ea901a0f477805e6c946faac34d9bc303
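For context on the alignment restriction: the OpenCL specification allows `clCreateSubBuffer` to fail with `CL_MISALIGNED_SUB_BUFFER_OFFSET` when the region origin is not aligned to the device's `CL_DEVICE_MEM_BASE_ADDR_ALIGN`, which is reported in bits. The "buffer + offset" fallback mentioned above could look roughly like the sketch below: round the origin down to an aligned boundary and carry the remainder as an extra offset that the kernel adds itself. `SubBufferPlan` and `planSubBuffer` are hypothetical names, not part of Glow or OpenCL, and the alignment value is supplied by the caller rather than queried from a device.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result of splitting a byte offset for sub-buffer creation.
struct SubBufferPlan {
  std::size_t origin;   // aligned offset, usable as cl_buffer_region::origin
  std::size_t residual; // leftover bytes the kernel still has to add
};

// alignBits mirrors CL_DEVICE_MEM_BASE_ADDR_ALIGN, which is given in bits.
inline SubBufferPlan planSubBuffer(std::size_t byteOffset,
                                   std::size_t alignBits) {
  assert(alignBits >= 8 && alignBits % 8 == 0);
  const std::size_t alignBytes = alignBits / 8;
  // Round the requested offset down to the nearest aligned boundary.
  const std::size_t origin = (byteOffset / alignBytes) * alignBytes;
  return {origin, byteOffset - origin};
}
```

For example, with a 1024-bit (128-byte) alignment, a tensor at byte offset 200 would map to an origin of 128 and a residual of 72. The sub-buffer size would also have to grow by the residual so that the region still covers the whole tensor.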
Just curious: does https://github.com/pocl/pocl/issues know about "the normal pocl issues" you refer to here?
For me, Glow now works quite well with pocl, but I have a one-line patch I still need to upstream to pocl, because Glow checks for platform existence with a 0-device query, which currently fails.