
Add dynamic buffer support to OCL Backend #3765

Closed
nickgg wants to merge 1 commit from nickgg:oclBuffers

Conversation


nickgg commented Nov 9, 2019

Summary: The OpenCL Backend uses a static memory allocation strategy of allocating a single large buffer and then using offsets into it, which is good for the general case, but doesn't allow us to get the most benefit out of Device Resident Tensors (when we'd like to leave an output on the device to be used as the input to another network). This PR adds a more dynamic mapping of device buffers to the OCL backend via OpenCL SubBuffers, which are similar to Glow TensorViews in that they provide access to a region without additional allocations.
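
For readers less familiar with the OpenCL API, here is a minimal sketch of creating a sub-buffer over a region of an existing allocation (names such as offsetInBytes are illustrative, not the PR's actual helpers; error handling omitted):

#include <CL/cl.h>

// Assume deviceBuffer is the backend's single large cl_mem allocation.
cl_buffer_region region;
region.origin = offsetInBytes; // byte offset of the tensor within the big buffer
region.size = sizeInBytes;     // size of the tensor's payload in bytes

cl_int err = CL_SUCCESS;
cl_mem subBuf = clCreateSubBuffer(deviceBuffer, CL_MEM_READ_WRITE,
                                  CL_BUFFER_CREATE_TYPE_REGION, &region, &err);
// subBuf can be passed to clSetKernelArg like any other cl_mem; no extra
// device memory is allocated and no data is copied.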

There is no behavioural change in this PR, but it provides infrastructure to reference buffers outside of the range of the DeviceBuffer in the future, which we need to get DRT perf wins.

The immediate benefit is that I was able to simplify the OCL kernel code, deleting about 25% of kernels.cl.

Documentation: NFC

Test Plan: tests in release and ASAN


nickgg commented Nov 9, 2019

Perf should be neutral, but I'll run some tests with image-classifier and attach the results.

nickgg requested review from gcatron and opti-mix Nov 9, 2019

opti-mix left a comment

@nickgg Overall, I like this change a lot! It really provides a uniform way of working with OpenCL buffers. BTW, we discussed this approach with @mortzur a couple of weeks ago.

My two major comments are:

  1. I'd really like to see whether it has any performance implications or whether creating sub-buffers is essentially free. In particular, it should not slow down the copying of constants/weights at the beginning/end of each run.
  2. Glow's OpenCL backend currently uses a very explicit way of passing arguments to a kernel, using argument indices, e.g. setKernelArg(kernel, 1, ...). This is very fragile: if we change the scheme (e.g. we no longer pass mem as the first argument), we need to touch every place where arguments are passed and change their indices. It seems like it would be more robust to introduce something that does not use explicit indices, e.g. something like this:
Kernel kernel("kernel_name");
kernel.pushArg(arg1);
kernel.pushArgs(arg2, arg3, arg4);
enqueueKernel(kernel, ...);

Of course, this second comment is not directly related to the scope of this PR and should probably be handled in a separate PR/issue (a slightly fuller sketch of the idea follows below).
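
As a rough sketch only (hypothetical names, not code from this PR or an actual Glow API), such a wrapper could track the next argument index internally and forward to the standard clSetKernelArg:

// Hypothetical helper; illustrates the suggestion above.
class KernelArgPusher {
  cl_kernel kernel_;
  cl_uint nextIndex_ = 0;

public:
  explicit KernelArgPusher(cl_kernel kernel) : kernel_(kernel) {}

  // Set the next argument; the index is tracked here instead of being
  // hard-coded at every call site.
  template <typename T> void pushArg(const T &arg) {
    clSetKernelArg(kernel_, nextIndex_++, sizeof(T), &arg);
  }

  // Push several arguments in order (C++17 fold expression).
  template <typename... Ts> void pushArgs(const Ts &... args) {
    (pushArg(args), ...);
  }
};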

Two resolved review threads on lib/Backends/OpenCL/OpenCL.cpp (outdated).
setKernelArg(kernel, 0, deviceBuffer);
auto numArgs = setKernelArgsForBuffers(kernel, I, 1, runtimeBundle_);
unsigned numArgs = 0;
setKernelArg(kernel, numArgs++, deviceBuffer);

opti-mix commented Nov 10, 2019

Why do you need to expand/inline setKernelArgsForBuffers here, but not in certain cases below?

nickgg commented Nov 11, 2019

In this case it looks like it's because I made the deviceBuffer the first argument again. I actually had a lot of trouble making this kernel work correctly and vaguely remember this being the only thing that worked. Will look into it.

nickgg commented Nov 11, 2019

I've added a comment about this, but basically if you remove the first void* arg from this kernel it doesn't compile. Why? No idea; it should be fine, and all the other kernels were. This is a compromise.

Resolved review thread on lib/Backends/OpenCL/OpenCL.cpp (outdated).

nickgg left a comment

Thanks @opti-mix. I'm very curious about #1 as well, will verify today.

For #2 I agree; I was thinking the same thing while doing the work. I figured this diff was big enough as it is. Follow up?


nickgg commented Nov 11, 2019

Perf comparison:

before: [screenshot: image-classifier perf results]

after: [screenshot: image-classifier perf results]

Looks neutral to me.


opti-mix commented Nov 11, 2019

For #2 I agree; I was thinking the same thing while doing the work. I figured this diff was big enough as it is. Follow up?

Yes, at least file an issue about it, so that we do not forget.


opti-mix commented Nov 11, 2019

@nickgg Thanks for checking the performance. Looks like the change is neutral, which is very good.

@nickgg nickgg force-pushed the nickgg:oclBuffers branch from f4da2b4 to 23f9fa1 Nov 11, 2019

nickgg commented Nov 12, 2019

Both test-suite failures here look spurious, but it seems I can't rerun them.

@nickgg nickgg force-pushed the nickgg:oclBuffers branch 3 times, most recently from 6878ff8 to d72e0a8 Nov 12, 2019

opti-mix left a comment

LGTM

cl_int err =
clEnqueueCopyBuffer(commands, srcBuf, destBuf, 0, 0, sizeInBytes, 0,
nullptr, kernelProfiling_ ? &event : nullptr);
llvm::outs() << "COPY\n";

opti-mix commented Nov 12, 2019

Please remove debug prints.

nickgg commented Nov 12, 2019

Just temporary: I'm trying to printf-debug the POCL build. Will fix before landing.

pjaaskel commented Nov 16, 2019

Did you notice POCL_DEBUG=1, which might be useful here?

@@ -1376,6 +1376,7 @@ TEST_P(MLTest, testFindPixelRegression) {
auto dx = LH.at({i, 0}) - RH.at({i, 0});
auto dy = LH.at({i, 1}) - RH.at({i, 1});
auto distance = std::sqrt(std::pow(dx, 2) + std::pow(dy, 2));
llvm::outs() << distance << "\n";

gcatron commented Nov 12, 2019

Was this a debugging print statement?

nickgg commented Nov 12, 2019

Yup, will remove this one as well.


gcatron left a comment

Looks good!

@nickgg nickgg force-pushed the nickgg:oclBuffers branch 2 times, most recently from 22ec1c5 to a0a407f Nov 12, 2019

nickgg left a comment

I think I got the POCL issue: it's due to alignment, which isn't enforced on the NVIDIA/CPU implementations but is in POCL (and potentially on AMD devices as well).
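
For context, standard OpenCL requires a sub-buffer's origin to be a multiple of the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN value (reported in bits); otherwise clCreateSubBuffer fails with CL_MISALIGNED_SUB_BUFFER_OFFSET. A minimal sketch of querying that limit (variable names illustrative):

// Query the device's base address alignment, which OpenCL reports in bits.
cl_uint alignBits = 0;
clGetDeviceInfo(deviceId, CL_DEVICE_MEM_BASE_ADDR_ALIGN,
                sizeof(alignBits), &alignBits, nullptr);
size_t alignBytes = alignBits / 8;
// Sub-buffer origins that are not multiples of alignBytes fail with
// CL_MISALIGNED_SUB_BUFFER_OFFSET on strict implementations such as POCL.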


nickgg commented Nov 13, 2019

OK! The lint problems are from fc64547, not this diff; the OpenCL build is just the normal POCL issues, and PyTorch is broken in master. I'm going to land this if it kills me.


facebook-github-bot left a comment

@nickgg has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@nickgg nickgg force-pushed the nickgg:oclBuffers branch from a0a407f to 203e6a8 Nov 13, 2019

facebook-github-bot left a comment

@nickgg has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@nickgg nickgg force-pushed the nickgg:oclBuffers branch from 203e6a8 to 50cb744 Nov 13, 2019

facebook-github-bot left a comment

@nickgg has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


facebook-github-bot commented Nov 13, 2019

@nickgg merged this pull request in bd69664.

nickgg added a commit to nickgg/glow that referenced this pull request Nov 13, 2019
facebook-github-bot added a commit that referenced this pull request Nov 13, 2019
Summary:
This reverts commit bd69664.

I had thought that I had gotten the last POCL issue in #3765, but I had not. Reverting to fix the OCL build.

Honestly, this last issue (AMD/POCL requires sub-buffers to be aligned) seems to torpedo the whole idea. I can't think of any way to handle Glow TensorViews on the host; that would mean passing buffer + offset everywhere we currently pass a buffer, which essentially means rewriting the whole thing.

Very frustrating, since that alignment restriction on sub-buffers makes no sense and no other OCL implementation has it.
Pull Request resolved: #3784

Differential Revision: D18480248

Pulled By: nickgg

fbshipit-source-id: 9b05009ea901a0f477805e6c946faac34d9bc303

pjaaskel commented Nov 16, 2019

... the OpenCL build is just the normal POCL issues, and PyTorch is broken in master. I'm going to land this if it kills me.

Just curious: does https://github.com/pocl/pocl/issues know about "the normal pocl issues" you refer to here?

For me, Glow now works quite well with pocl, but I have a one-liner patch I still need to upstream to pocl, because of the way Glow checks for platform existence with a 0-device query, which currently fails.
Are there other remaining issues?
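
(For reference, the probe I mean is the usual pattern of counting devices before using a platform; a rough sketch of that pattern, not Glow's exact code:

// Count devices on a platform without retrieving them: num_entries = 0.
cl_uint numDevices = 0;
cl_int err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL,
                            /*num_entries=*/0, /*devices=*/nullptr,
                            &numDevices);
// err == CL_SUCCESS with numDevices > 0 indicates a usable platform;
// an implementation may instead return CL_DEVICE_NOT_FOUND here.
)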
