OpenCL support #22

Open
outlace opened this Issue Nov 9, 2015 · 239 comments

Projects

None yet
@outlace
outlace commented Nov 9, 2015

I understand TensorFlow only supports CUDA. What would need to be done to add in OpenCL support?

@nmabhinandan

It's strange that Google ditched open OpenCL for proprietary CUDA.
im-just-saying

@ebrevdo
Contributor
ebrevdo commented Nov 9, 2015

At the very least, the Eigen library would have to support OpenCL.

@vrv vrv referenced this issue Nov 9, 2015
Closed

Could port to OpenCL? #28

@bhack
bhack commented Nov 9, 2015

👍

@keveman keveman added the cuda label Nov 9, 2015
@jamesliu96

👍

@alexatknit

👍

@dhess
dhess commented Nov 11, 2015

thumbs up and all that.

@gujunli
gujunli commented Nov 11, 2015

I will be interested in expanding Tensor Flow with OpenCL. As we have already released OpenCL caffe. https://github.com/amd/OpenCL-caffe. Hopefully it can get integrated in light way? Is anyone interested in working together on this?

@bhack
bhack commented Nov 11, 2015

@gujunli Nice to see AMD here. /cc @naibaf7 @lunochod

@nmabhinandan

would be great.

@sasadep
sasadep commented Nov 11, 2015

👍

@bhack
bhack commented Nov 15, 2015

/cc @lukeiwanski for Eigen/OpenCL/SYCL

@ankdesh
ankdesh commented Nov 16, 2015

@gujunli Certainly would be interested in contributing. Please let me know when you plan to start.

@lukeiwanski
Contributor

Hi all,

Here at Codeplay we are looking into Eigen's tensor running on GPU using SYCL (a modern C++ layer on top of OpenCL). From what we have gathered so far, GPU tensor design is very closely coupled with CUDA and it will require interface changes for another programming model and particularly a SYCL and OpenCL 1.2 version.

If anyone is interested in digging deeper / helping out, we are most certainly interested in contributing.

Thanks,
Luke

@bhack
bhack commented Nov 25, 2015

@lukeiwanski Thank you for the feedback. I think that @benoitsteiner worked at the tensor extension part of eigen.

@haahh
haahh commented Dec 6, 2015

👍 I can help code some OpenCL/SYCL if someone makes a plan, divides work into tasks etc. I recommend using Boost.Compute as a wrapper for OpenCL (it makes running kernels, testing, templating easier).

@ieee8023
ieee8023 commented Dec 7, 2015

+1

@armish
armish commented Dec 7, 2015

👍

@lukeiwanski
Contributor

Hi all,

Just to keep you posted, we are still investigating how we can change the Eigen interface to better fit the SYCL/OpenCL 1.2 programming model.
Once we come up with a reasonable approach that targets heterogeneous programming models ( not only OpenCL / SYCL ) we will create a proposal.

Thanks,
Luke

@gujunli
gujunli commented Dec 8, 2015

Pls keep me update. I developed opencl-caffe for AMD. I am also looking at
tensor flow.

Thanks.
Junlu
On Dec 8, 2015 10:19 AM, "Luke Iwanski" notifications@github.com wrote:

Hi all,

Just to keep you posted, we are still investigating how we can change the
Eigen interface to better fit the SYCL/OpenCL 1.2 programming model.
Once we come up with a reasonable approach we will create a proposal.

Thanks,
Luke


Reply to this email directly or view it on GitHub
#22 (comment)
.

@bhack
bhack commented Dec 9, 2015

/cc @ptillet @gongzg Is there any interest in this by Intel? I really hope that we don't fragment OPENCL here like in Caffe where we have an AMD fork, Intel unmerged PRs, another semi-unofficial AMD PR, and a long staging user PR (plus two old abandoned Opencl efforts). If somebody is interested in the history can take a look at BVLC/caffe#2610 comments.

@gongzg
gongzg commented Dec 17, 2015

@bhack We do have interest in this. Thanks for letting me know. If there is a proposal for Eigen's OpenCL/SYCL implementation, we will see what we can do from Intel side.

@benoitsteiner benoitsteiner self-assigned this Dec 23, 2015
@ZirconCode

👍

@bhack
bhack commented Jan 1, 2016

An interesting initiative at https://github.com/ptillet/isaac also if here we rely on Eigen tensor extension.

@DanMcLaughlin

I also would like to contribute. @benoitsteiner can you organize it?

@bhack
bhack commented Jan 19, 2016

This was included in the Roadmap but also tagged as contribution so a direction/bootstrap could be really useful.

@gujunli
gujunli commented Jan 19, 2016

I can contribute to organize it. who is responsible for OpenCL support in
Tensor flow now?

Thanks a lot.
Junli

On Tue, Jan 19, 2016 at 7:50 AM, bhack notifications@github.com wrote:

This was included in the Roadmap but also tagged as contribution so a
direction/bootstrap could be really useful.


Reply to this email directly or view it on GitHub
#22 (comment)
.


Junli Gu--谷俊丽
Coordinated Science Lab
University of Illinois at Urbana-Champaign


@DanMcLaughlin

I just assumed Benoit because he self assigned the feature, but I think you've got it Junli! Maybe start with an email or forum thread of interested parties?

@martinwicke
Member

@benoitsteiner knows more about interested parties that may not have shown
up in this thread (or this issue). I'd wait for him to coordinate to make
sure we avoid duplicating work.

On Tue, Jan 19, 2016 at 11:42 AM Dan McLaughlin notifications@github.com
wrote:

I just assumed Benoit because he self assigned the feature, but I think
you've got it Junli! Maybe start with an email or forum thread of
interested parties?


Reply to this email directly or view it on GitHub
#22 (comment)
.

@MikalaiDrabovich
Contributor

I'm interested. Is there any roadmap?

On Jan 19, 2016, at 11:46 AM, Martin Wicke notifications@github.com wrote:

@benoitsteiner knows more about interested parties that may not have shown
up in this thread (or this issue). I'd wait for him to coordinate to make
sure we avoid duplicating work.

On Tue, Jan 19, 2016 at 11:42 AM Dan McLaughlin notifications@github.com
wrote:

I just assumed Benoit because he self assigned the feature, but I think
you've got it Junli! Maybe start with an email or forum thread of
interested parties?


Reply to this email directly or view it on GitHub
#22 (comment)
.


Reply to this email directly or view it on GitHub.

@hsaputra
Contributor

Is there a list of CUDA dependency libraries that Tensorflow relying on?

This would help to see if we could have immediate OpenCL alternatives.

@naibaf7
naibaf7 commented Jan 19, 2016

@hsaputra
There is clFFT, clBLAS (alternatively ViennaCL). Random number generator is a bit more tricky (no curand), either use a CPU generator and transfer to GPU or use another existing kernel for RNG.

The biggest pitfall will again be efficient convolution implementations (something like cuDNN).

There is experience about such issues here:
BVLC/caffe#2610
BVLC/caffe#2195
https://github.com/amd/OpenCL-caffe

@bhack
bhack commented Jan 19, 2016

Tensorflow use tensor extension upstreamed to Eigen. So I think that an Opencl/Sycl support to Eigen is needed. See this thread

@hsaputra
Contributor

Thanks @naibaf7. Yeah, I don't think there is a viable alternative for cuDNN for OpenCL right now.

@VincentSC

The website http://opencl.org is created to support open source porting projects just like these! We're currently installing all necessary tools at the website and have space for repositories at https://github.com/OpenCL/ - later on we're adding build-servers to test for several types of hardware and can provide our expertise in how to write code that runs at full speed on numerous hardware.

We're launching a porting initiative for GEGL next week, but we're happy to also support you.

@DanMcLaughlin

@bhack from that thread and here it seems like @lukeiwanski is looking into it. I think we have enough willing people to work on it, we just need @benoitsteiner, @lukeiwanski or @gujunli to coordinate. Benoit has been quiet, maybe he's on holiday.

@hsaputra
Contributor

I would love to help contribute with this initiative.

@lukeiwanski
Contributor

hi all,

we will coordinate the effort of porting Eigen’s tensor module to SYCL for OpenCL as we already have something mostly working, but it’s not ready for review yet.

We are in favour of this approach as it will introduce less invasion to the code base. SYCL supports the single-source C++ templated model that eigen already uses.

Road map design is in progress so it shouldn’t be too long now.

Thanks,
Luke

@bhack
bhack commented Jan 22, 2016

@lukeiwanski Are you working or in contact with upstream? Do you think will be accepted upstream in Eigen?

@Konard
Konard commented Jan 22, 2016

+1

@DanMcLaughlin

Great news @lukeiwanski, let us know of any help you need.

I'll guess you are using your own implementation of SYCL - will that be available for developers/researchers? On what platforms?

@ville-k
Contributor
ville-k commented Jan 23, 2016

@lukeiwanski SYCL seems like the right way to go given the amount of template metaprogramming involved with Eigen. I'm an experienced c++ developer with OpenCL experience gained from developing my own neural nets and linear algebra library. I'd love to help with this effort and get started developing with SYCL.

@lukeiwanski
Contributor

@bhack We are in contact with @benoitsteiner, but we will discuss our proposal with the upstream maintainers before we invest too much effort.

@DanMcLaughlin , @ville-k We are developing our implementation of SYCL, ComputeCpp (https://www.codeplay.com/products/computecpp). For more information, can you please contact me off-list via the email address on my profile?

@MikalaiDrabovich
Contributor

@lukeiwanski is there any update/estimate regarding plans?

@Sherif89

+1.
I've an AMD GPU and an Intel GPU in the laptop. I think both have OpenCL drivers and AMD's support seems to be much better. I'd have higher performance, because I've 2 OpenCL devices. I hope you make it scale with OpenCL devices.

@lukeiwanski
Contributor

Hi all,

Thanks for the interest!
At this point we are getting our testing infrastructure set up to make sure that nothing that we do introduces regression.
We are in touch with @benoitsteiner to make sure we are in sync with what he's done so far.

We are still in compiling a road map for the integration process - it should be done in couple weeks time, as there is a couple of business details to clarify.

Our goal is to bring the OpenCL to TensorFlow via Eigen by end of this year.

Thanks,

@strin
strin commented Jan 27, 2016

interested. would love to contribute.

@bhack
bhack commented Jan 28, 2016

Ok so actually seems that it is an effort of Codeplay with some kind of sync to Google internal. What are the role of AMD and Intel subscribers here?

@bhack
bhack commented Jan 28, 2016

/cc @keryell if you have any interest on this from SYCL/FPGA universe

@benoitsteiner
Member

My apologies for not contributing more to this discussion recently, my plate has been more than full these past 2 weeks.

I'll be coordinating the OpenCL effort on the TensorFlow side. Our current thinking is:

  • TensorFlow relies on c++11 and has taken a "single source" approach, so SYCL seems like a great fit.
  • We don't have a lot of OpenCL experience in house, so we're collaborating closely with Codeplay to bridge this gap. In particular, Codeplay is currently leading the effort to add support for SYCL to the Eigen tensor library.
  • TensorFlow relies on the cuDNN library to compute convolutions on NVidia GPUs. If somebody is interested in contributing an OpenCL equivalent, we'd be happy to help.

In order to help structure the effort, I created a mailing list: tensorflow-opencl@googlegroups.com.

@keryell
keryell commented Jan 29, 2016

@bhack sure I have some interest for high-end C++ on FPGA :-)
TensorFlow sounds like a good validation use-case for triSYCL too.
By the way, if some people here are looking for some internships on this subject, I have some positions. It looks like Codeplay is looking for some people too, if I trust their web site.

@bhack
bhack commented Jan 29, 2016

I'm really interested in @karlrupp and @hughperkins opinions. I hope they want to join in the discussion on the new google group.

@bhack
bhack commented Jan 29, 2016

@benoitsteiner Thank you for the update. It would be wonderful if all involved partners in @KhronosGroup (Google, Nvidia, Amd, Intel, Codeplay, Xilinx etc.) will promote a cudnn like API in a standardized way. A sort of Khronos openvx computer vision standardization effort but for deep learning.

@karlrupp

@bhack Which new Google group?

Other than that, OpenCL and CUDA are too different programming approaches. CUDA works the way it is because one company has full control over everything, so it can embed binary blobs and who knows what in the final executable. This cannot be done with OpenCL, unless one goes down the SyCL path (I have my concerns...) and the SyCL compiler vendor has full control over all possible target architectures (unlikely or impossible in practice). Overall, my opinion is that a good OpenCL-enabled library needs more than just a few tweaks here and there. Probably not what you wanted to hear, but you asked for my opinion :-)

@bhack
bhack commented Jan 29, 2016

@karlrupp See #22 (comment) at the end for the google group.
I asked your opinion cause you have a great experience with ViennaCL interfacing an algebra library with multiple backends (CPU, GPU, MIC). Tensorflow rely on Eigein library and its new tensor extension contributed by Google upstream (but only with CUDA backend). I think that they don't experienced much all the pitfall you have already encountered with ViennaCL in this years of development.

@keryell
keryell commented Jan 29, 2016

@bhack We are currently at the face-to-face meeting in Seattle this week but of course I cannot say whether we are talking about DNN libraries or not... :-)

@bhack
bhack commented Jan 29, 2016

@keryell Try to push the cause in Seattle ;)

@keryell
keryell commented Jan 29, 2016

@karlrupp You are right, OpenCL and CUDA are too different programming approaches. The single-source aspect found for example in CUDA and OpenMP 4.5 is extremely powerful from a software engineering perspective. This is why there is this SYCL standard for the real C++ programmers. SYCL can be seen as CUDA on steroids without any language extension and with some OpenMP aspects (the tasks). A typical SYCL device compiler is expected to generate SPIR-V kernels.

Your concerns about portability are less an issue with the SPIR-V standard (kind of portable equivalent of nVidia PTX/AMDIL/... in the Vulkan & OpenCL world) which is mandatory to accept in OpenCL 2.1 and Vulkan. So the beauty is that if you have a front-end that generates SPIR-V, you do not need special knowledge of the very details of the hardware to run on. There is a Khronos open-source bidirectional translator between LLVM IR and SPIR-V, so it opens quite new territories.

@karlrupp

@keryell I agree that SPIR-V is a step forward. However, it does not address all issues of exhaustive jitting.

you do not need special knowledge of the very details of the hardware to run on

Is this a copy&paste from OpenCL 1.0 marketing, which claimed exactly the same? You will always need to go down to the details of the underlying hardware if you aim for maximum performance. This is especially the case in the context of fast tensor contractions.

@bhack
bhack commented Jan 29, 2016

...as @scott-gray demonstrated with neon

@keryell
keryell commented Jan 29, 2016

@karlrupp

Is this a copy&paste from OpenCL 1.0 marketing, which claimed exactly the same?

Haha. :-)

You will always need to go down to the details of the underlying hardware if you aim for maximum performance. This is especially the case in the context of fast tensor contractions.

Of course, but before playing with the second-order optimization, it is useful to have the huge part of the whole templated C++ code running in some accelerated way.

For the optimization, either you stitch your optimized binary kernels à la NervanaSys or, since SYCL is pure C++, you can use asm("...") in it with a lot of #ifdef to test the target architecture. :-) That said, SPIR-V is itself extensible and I cannot see why we could not put inline VHDL or Verilog in it at some point. :-)

But more concretely, the recent introduction of sub-group operations should help to achieve good performance in a portable way and using simple built-in ad-hoc functions may help.

C++ adds interesting metaprogramming features that allows to replace most of the code generators used such as in clBLAS or other frameworks to generate code more adapted to X or Y hardware.

@bhack
bhack commented Jan 29, 2016

Also N4355 in c++17 could enter in the game soon or later

@benoitsteiner
Member

@karlrupp, @bhack The tensorflow approach is to rely on a hardware abstraction (the tensor module) for the majority of the operations needed in by a typical neural network, while relying on specialized libraries (such as cudnn) for the few operations that are really critical performance wise. The hardware abstraction enables us to implement most TensorFlow operations once and have them run on an accelerator with more than good enough performance.

@keryell
keryell commented Jan 29, 2016

@bhack Yes I love multidimensional arrays. Also in our domain of interest, there is the SG14 in the C++ committee that tries to have all the people interested in these issues to converge into the standard.
https://groups.google.com/a/isocpp.org/forum/#!forum/sg14
Of course SYCL is in the discussions. :-)

@bhack
bhack commented Jan 29, 2016

@benoitsteiner Mainly on cudnn for pooling and convolution. I think that if every vendor will produce an API with its own hardware for this operations with its own binary assembly will not be a so scalable approach. That is why I think some performance crucial API calls would be better to be standardized in some way.

@bhack
bhack commented Jan 31, 2016

@keryell There are really interesting topics for Matrix/Tensor in the new SG14 c++ specially in vector/SIMD calls agenda. But seems that nobody talked of convolution, pooling, and others useful "stabilized" deep learning interfaces. Also seems to me that in this specific standardization subgroups there are people from Nvidia, Intel, Amd, CodePlay etc.. but not from Google also if it is in others groups.

@Andyccs
Contributor
Andyccs commented Jan 31, 2016

👍

@keryell
keryell commented Feb 5, 2016

@bhack Yes there is no machine-learning style proposal in SG14 yet. But participation is open, so you can send some proposals. :-) But perhaps SG6 (numerics topics) is more relevant. I do not think they have their own mailing-list/forum yet.

@bhack bhack referenced this issue in arrayfire/arrayfire Feb 5, 2016
Open

Generic Graph based Automatic Differentiation #1098

@krikru
krikru commented Feb 6, 2016

@gujunli Does OpenCL Caffe run on Android? Sorry for asking this here but I didn't find anywhere else to ask it :) Would be great with a deep learning library that ran on Android devices and could use the GPU but it seems like there are no at the moment. (Correct me if I'm wrong!)

@naibaf7
naibaf7 commented Feb 6, 2016

@krikru
The official (but experimental) OpenCL Caffe branch can be made to run on Android GPUs, however the performance at the moment is far from optimal. See sh1r0/caffe-android-lib#23 and https://github.com/BVLC/caffe/tree/opencl.

@bhack
bhack commented Feb 7, 2016

A real alternative to cudnn could be the extension of OpenVx standard objects with support to Tensor, NdConvolution, NdPooling operators and (probably) some other operator that could be considered standardizable.
Also cudnn team need to make some choice on what new API and operators they will introduce in every release. Of course a standard can not move as fast as cudnn releases but I think some operations and objects has enough "citations history" to be standardized.

@krikru
krikru commented Feb 7, 2016

@hughperkins At the moment, I haven't tried any deep learning library; I'm just doing some scouting to see which library I could potentially use. Have you tried cltorch and DeepCL on Android? I just assumed cltorch did work on Android, since there is an implementation of Torch that is dedicated specifically for Android. And why would you have such an implementation if there already was one that both worked on Android and used OpenCL, right? But maybe I should have known better.

@krikru
krikru commented Feb 7, 2016

@hughperkins For some reason I imagined that torch-android was an official Torch implementation for Android, meaning that no other Torch implementation (at least not official) was likely to run smoothly on Android, including cltorch. I don't know why I thought that, it of course doesn't make any sense.

@hughperkins

Well... Soumith kind of coordinates torch development. He works at Facebook AI Research. So, since torch-android repo belongs to Soumith, I would say it's fairly close to official. But it maybe is not part of core for some reason. I guess you can ask the question as an issue in that repo, or in https://groups.google.com/forum/#!forum/torch7 Actually, since Soumith is kind of the main person that handles the requests in https://groups.google.com/forum/#!forum/torch7 , I reckon you probably want to post your question there.

@hughperkins

meaning that no other Torch implementation (at least not official) was likely to run smoothly on Android, including cltorch

Note that cltorch is not an implementatino of torch. It's a plugin, thta provides OpenCL. You need both.

@krikru
krikru commented Feb 7, 2016

Note that cltorch is not an implementatino of torch. It's a plugin, thta provides OpenCL. You need both.

Ah, thanks for the clarification.

@krikru
krikru commented Feb 7, 2016

@naibaf7 Do the OpenCL Caffe branch and the OpenCL Caffe implementation by AMD have anything more in common besides the name? Have you compared the two or do you know if there is any difference in performance? You write that the OpenCL branch is far from optimal performance. What does that mean and what would be necessary in order to improve it? It would be interesting to try it on Android.

@bhack
bhack commented Feb 7, 2016

We are going off topic

@krikru
krikru commented Feb 7, 2016

@bhack Yeah, sorry for hijacking this thread. I just didn't know where to ask the question.

@naibaf7
naibaf7 commented Feb 8, 2016

@krikru
please raise an issue about it on the Caffe branch, flag it with Android and OpenCL. Then we can discuss this further. Thanks.

@bhack
bhack commented Feb 13, 2016

@keryell Seems that the next f2f SG14 meeting in March will be hosted by Google. Will be any tensorflow internal there?

@bhack
bhack commented Feb 13, 2016
@keryell
keryell commented Feb 13, 2016

Perhaps @benoitsteiner could drop by, since he is local.
But before this event there is the full C++ F2F at the end of the month in Jacksonville, Florida.
https://isocpp.org/files/papers/N4568.pdf
Unfortunately I will not be able to attend any of them.

@bhack
bhack commented Feb 13, 2016

I don't know if CppCon 2015 talk C++ Multi-dimensional Arrays for Computational Physics and Applied Mathematics generated some paper follow-up.

@dimchansky

+1

@keryell
keryell commented Feb 16, 2016

@bhack Thank you for pointing the talk on multi-dimensional arrays. It is interesting and address the real issues but looks too ad-hoc to be ratified in C++ as is. Personally I use Boost.MultiArray and I am more confident in a polished version of Boost.MultiArray.

@bhack
bhack commented Feb 17, 2016

There are also some papers at WG21. As you can see @jfbastien at Google has some activity at WG21 and also helped to host the SG14 f2f meeting at Google in March.

@jfbastien

@bhack @keryell I think it would be worth taking this discussion to the SG14 mailing list as the details aren't related to OpenCL / tensorflow.

@bhack
bhack commented Feb 17, 2016

Yes probably it is no more so strictly confined here with all the details. Other than Eigen/sycl support Is there a plan for the cudnn calls?

@andyyehoo

+1 very interesting topic. Hope it coming soon.

@strin
strin commented Feb 24, 2016

This thread is very interesting. I've been trying to get caffe to work on android. The results seem to be surprising: caffe running with Mali gpu seems to be 2-3 slower than cpu, but about 4-5x more energy efficient. The test was run on Galaxy S6 (Mali T760, Peak Performance 200 GFlops).

Since GEMM is the core of convolution in caffe, I decided to profile its performance on Android. It seems that ViennaCL is not as efficient as some simple kernels. Now I am able to get GPU run as fast as CPU for large matrices (2k x 2k). This is still counter-intuitive, since normally we expect GPUs to be much faster.

See:
https://github.com/strin/mocha-profile

The kernel implementations can be found here:

OpenCL kernels for GEMM: https://github.com/strin/gemm-android

Any thoughts?

@bhack
bhack commented Feb 27, 2016

@strin Have you already followed this thread https://community.arm.com/thread/4935?

@strin
strin commented Feb 29, 2016

@bhack thanks for sharing. this thread looks very interesting. i tried to turn of the DVFS as suggested, but no significant performance was seen for sgemm in ViennaCL.

@naibaf7
naibaf7 commented Sep 25, 2016 edited

@hughperkins
Yes can be an issue, but I think parts such as im2col/col2im and other convolution implementations could also be plugged in as external APIs if it's really an issue with the GCLA. This may also be better for the original authors of such work.

@lukeiwanski
Contributor

@hughperkins We are working on bringing the OpenCL to the TensorFlow via the SYCL for OpenCL 1.2.
Please have a look at https://docs.google.com/spreadsheets/d/1YbHn7dAFPPG_PgTtgCJlWhMGorUPYsF681TsZ4Y4LP0/edit#gid=1625897530 for "todos" and progress.
Recently we released a compiler for SYCL https://www.codeplay.com/products/computesuite/computecpp called ComputeCpp Comunity Edition. People can try it out!
As well, we are focusing on the Eigen library https://bitbucket.org/benoitsteiner/opencl/branch/ComputeCpp - getting it to the stage required by TensorFlow's MNIST - there are a couple things remaining.
As for constraints, the current ComputeCpp CE release has been tested for Intel (CPU, GPU) and AMD (CPU, GPU) as for the platforms we support Ubuntu 14.04 64bit and CentOS 64bit.
ComptueCpp is downloadable for free and can be used in commercial and open source projects.
Because we <3 open communities :)

@Daniel451
Daniel451 commented Sep 28, 2016 edited

@lukeiwanski Sorry for discussing/asking this here in the thread, but I think it may be of interest to others as well: I understand that Codeplay is highly interested in the SYCL for OpenCL implementation and I already heard others being interested in this work of you, too. I read some post by a Movidius official for example. However, I would like to ask what Google's contribution to this really is? Since Movidius, besides AMD and others, are listed as Codeplay's partners I can understand that they encourage or even support SYCL for OpenCL, but as far as I am aware of it, Google is not your partner and has not contributed so far?!

Do not get me wrong, I really like your work, but wouldn't it be a good idea to consolidate your efforts, pool the resources and try to work together with Google? To me it looks like many different parties are interested in OpenCL for TensorFlow, but a huge potential is not used, because these parties do not develop together?!

I may be wrong and please apologize myself if this has been discussed sufficiently, but I am still unaware of any major attempts by Google (or other parties) to work together on this and, as a result, I am still unaware of how the community could help or support (like single individuals), either via direct contributions, testing or other things.

@patrikhuber patrikhuber referenced this issue in tiny-dnn/tiny-dnn Sep 28, 2016
Closed

add Tensor class #319

@benoitsteiner
Member

@ascenator We at Google have been working closely with Luke and his Codeplay colleagues on this project for almost 12 months now. Codeplay's contribution to this effort has been tremendous, so we felt that we should let them take the lead when it comes down to communicating updates related to OpenCL. This is why you haven't heard much from us on the topic :)

Now that the ComputeCpp compiler is broadly available, we are planning to merge the work that has been done so far. But first we want to put together a comprehensive test infrastructure to make sure that we don't destabilize the exiting codebase.

We welcome all contributions to this effort, so feel free to contact me if you want to help. We're especially interested in high performance OpenCL kernels for matrix multiplication and convolutions. Several candidates have been suggested, but we haven't started looking into the pros and cons of each one or how to integrate them.

@Daniel451

@benoitsteiner thank you very much for the clarification & sorry for my misinformation! This sounds very good & promising! I will definitely have a look at ComputeCpp then. I am really looking forward to OpenCL support for TensorFlow, because this offers a lot of new possiblities for robotics (which is the field where I am researching and using TensorFlow for deep learning applications). I will at least have a look at early releases and try to test / debug. We have some Intel Chips plus a number of ARM CPUs that are waiting for tests ;)

@naibaf7
naibaf7 commented Sep 29, 2016

@hughperkins... sorry but isn't this completely off topic here? I don't see how this is relevant in OpenCL TF?

@bhack
bhack commented Sep 29, 2016 edited

I'm more interested here to know if will be taken a tuning approach to matrix multiplication and convolution kernels and if will be a valid open source alternative to CompiteCpp that will produce SPIR-V.

@lissyx lissyx referenced this issue in lissyx/tensorflow Oct 6, 2016
Alexandre Lissy Small doc explaining setup and build
Closes #22
f6ca8ad
@ptillet
ptillet commented Oct 8, 2016

If it helps, a better version of isaac is out: https://github.com/ptillet/isaac, and provides significant speed-ups over clBLAS and cuBLAS on Maxwell, Pascal and Fiji. Also provides faster (input-aware) kernels than Tensorflow for 1D and 2D reductions.

@inferrna
inferrna commented Oct 8, 2016

@hughperkins seems you have more chances to write CUDA compiler for any OpenCL device, rather than CUDA-OpenCL translator.

@marty1885
marty1885 commented Oct 8, 2016 edited

@hughperkins Maybe OpenCL 2.0's SVM feature could solve the pointer issue? Since everyone besides Nvidia(AMD, Intel, ARM, Qualcomm) is starting to support OpenCL 2.0. Maybe it's a good solution?

@ptillet
ptillet commented Oct 8, 2016

@hughperkins it's a blas implementation itself. It implements some of the symbols in clblas and cublas headers so no recompilation and code modification. is necessary. I could also implement some of the symbols for clblast.h, since it uses a different header. Some advantages of Isaac are:

  • Entirely dynamic, so that it can use either/both CUDA or OpenCL without recompilation.
  • Input-aware , it doesn't tune kernels for large square matrices. It should perform well on all shapes you can think of without retuning.
  • C++ API similar to numpy/arrayfire. Some fusion for combining elementwise operation with reductions
@naibaf7
naibaf7 commented Oct 8, 2016

@marty1885
Not really. AMD went back to 1.2 support on the AMDGPU-PRO drivers. Might be a while until full 2.0 support is widespread. Definitely not a short-term solution there.

@ptillet
ptillet commented Oct 8, 2016
  • Yes
  • I could hack compatibility for a bunch of operations if needed (e.g., forward **MV to GEMV). Complex support will be tricky. Double support is already here but no architecture is tuned for it.
@inferrna

@hughperkins

Seems like my code doesnt violate any obvious OpenCL rules

Yes, plainly passing any __global structure (like array or struct) containing pointers is incorrect just because those pointers can point to memory of another device (OpenCL supports multi-device paradigm where one device can't access memory of another). But it seems to be possible to overcome on IR level, w/o intermediate translation to OpenCL code - that's what I assumed :)

@steniu01

@benoitsteiner, @henline , from the https://github.com/henline/streamexecutordoc, it suggests the streamexecutor has supported the CL version canned operation(like DNN, BLAS) out-of-box. Does it suggest google has already has the clDNN, clBLAS implementation ready for Tensorflow, but just not open source it yet?

@jyegerlehner jyegerlehner referenced this issue in GPUOpen-ProfessionalCompute-Tools/HIP Oct 14, 2016
Closed

High-impact application of HIP #45

@keryell
keryell commented Oct 15, 2016

Otherwise OpenCL 2.0+ and SYCL 2.2 support SVM, if you want to keep the same software architecture.
OpenCL 2.0+ is supported by AMD and Intel GPU for example. In the embedded world, it is often supported by side effect even with OpenCL 1.x, since the host and device memories are often the same for cost reasons.

@naibaf7
naibaf7 commented Oct 15, 2016 edited

@keryell
But the most notable platforms, Linux + the new AMD GPUs (RX 480, upcoming Vega) do only support OpenCL 1.2 for now... and who knows when that's gonna change (my bet is on in a year). Beignet (opensource Linux Intel) for OpenCL 2.0 is also still a buggy mess; the stable version has 1.2.
Also considering all the smaller companies that make OpenCL compatible chips are barely pulling 1.2 support. So I guess anything relying on OpenCL 2.0 will see very bad adaption rates in practice.

@bhack
bhack commented Oct 15, 2016

I think.. any hardware vedor has the urgency of consuming SPIR-V? I think that Graphic/Shaders pressure on Vulkan could help Opencl side..

@keryell
keryell commented Oct 22, 2016

@naibaf7 to go back to the discussion on OpenCL 2 or not, at some point real things have to be delivered... Otherwise there is already nVidia GPU and CUDA with TensorFlow running... :-)
But of course, a version of TensorFlow without SVM has some interest.

@bhack
bhack commented Oct 22, 2016

@keryell How much of the Vulkan SPIR-V work on drivers (that has already a good devices coverage) do you think will push modern Opencl versions?

@keryell
keryell commented Oct 22, 2016

@naibaf7 Khronos meeting is next week in Seoul with both OpenCL and Vulkan people, but discussions are not public. But that sounds like a good idea to have each world to improve the other, and at some point benefits to TensorFlow. :-)

@naibaf7
naibaf7 commented Oct 22, 2016

@keryell
Yes, I hope they discuss some DeepLearning beneficial stuff :)

@VincentSC

Congrats! Be sure to check the HIP project, as they tried to solve the same problem. They chose to create a new language called HIP, which defines what manually needs to be converted (like checking double precision support by checking compute level). While the project advances, the amound of manual translations would go down. See: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP

My suggestion for you is to use HIP and fix some bugs that are blocking for advancing Tensorflow or your own goals, as you now have the understanding of LLVM to do it. This way you don't have to solve the problems they already fixed.

@inferrna

@hughperkins
can't build python module with your fork following this https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/get_started/os_setup.md#create-the-pip-package-and-install

INFO: From Compiling tensorflow/core/kernels/gather_functor_gpu.cu.cc:
gpus/crosstool: -x cuda
gpus/crosstool: using cocl
gpus/crosstool: PATH=/usr/bin:/usr/local/bin /usr/local/bin/cocl -D_FORCE_INLINES -gencode=arch=compute_30,\"code=sm_30,compute_30\"   -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=1 -DNDEBUG -DEIGEN_MPL2_ONLY -std=c++11  -I. -Ibazel-out/local_linux-py3-opt/genfiles -Iexternal/bazel_tools -Ibazel-out/local_linux-py3-opt/genfiles/external/bazel_tools -Iexternal/eigen_archive -Ibazel-out/local_linux-py3-opt/genfiles/external/eigen_archive  --compiler-bindir=/usr/bin/gcc -I . -fPIC  -x cu  -O2 -c  -o bazel-out/local_linux-py3-opt/bin/tensorflow/core/kernels/_objs/gather_functor_gpu/tensorflow/core/kernels/gather_functor_gpu.cu.pic.o tensorflow/core/kernels/gather_functor_gpu.cu.cc
dirname: invalid option -- 'O'
Try 'dirname --help' for more information.

I'm on ubuntu 16.04, dirname is from coreutils-8.25-2ubuntu2

@bhack
bhack commented Oct 23, 2016

@hughperkins I think that tweaking the TF dockerfile on your repository with this istructions could easy the setup for others.

@bhack
bhack commented Oct 23, 2016

Yes, when there will be something more functional. Basically it is quite a copy and past of this istructions you have posted.

@rybskej
rybskej commented Oct 24, 2016 edited

I'm experimenting building this on MacOS 10.10.5 on a MacBook late 2015 with ATI 6770M (OpenCL 1.2).

I've installed Xcode 8, Anaconda (Python 3.5), and MacPorts equivalents of clang+llvm:
#instead of apt-get lines, do:
sudo port install clang-3.8 llvm-3.8
#Instead of using /proc/cpuinfo, do:
NUM_PROCS=$(system_profiler SPHardwareDataType | grep "Total Number of Cores" | cut -d ":" -f 2)
#Then modify Makefile to use macports and run make
perl -pi.bak -e 's|(CLANG)=.+|$1=/opt/local/libexec/llvm-3.8/bin/clag++|' Makefile
perl -pi -e 's|(LLVM_CONFIG)=.+|$1=/opt/local/bin/llvm-config-mp-3.8|' Makefile
perl -pi -e 's|(LLVM_INCLUDE)=.+|$1=/opt/local/libexec/llvm-3.8/include|' Makefile
#update to Macos OpenCL dirs; future: use /System/Library/Frameworks/OpenCL.framework/Versions/Current/Headers/cl.h '#ifdef APPLE' conditional
grep -Rl 'include "CL/' * | xargs perl -pi.bak -e 's|include "CL/|include "OpenCL/|g'
make -j ${NUM_PROCS}

This is as far as I get:

$ make -j ${NUM_PROCS}
mkdir -p build
mkdir -p build
mkdir -p build
/opt/local/libexec/llvm-3.8/bin/clang++ -c -o build/hostside_opencl_funcs.o -std=c++11 -fPIC -g -O2 -Ipwd/include -Ipwd/src/EasyCL src/hostside_opencl_funcs.cpp
/opt/local/libexec/llvm-3.8/bin/clang++ -I/usr/lib/llvm-3.8/include -fPIC -fvisibility-inlines-hidden -ffunction-sections -fdata-sections -g -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -std=c++11 -fcxx-exceptions -c -o build/mutations.o -g -I/opt/local/libexec/llvm-3.8/include src/mutations.cpp
/opt/local/libexec/llvm-3.8/bin/clang++ -I/usr/lib/llvm-3.8/include -fPIC -fvisibility-inlines-hidden -ffunction-sections -fdata-sections -g -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -std=c++11 -fcxx-exceptions -c -o build/struct_clone.o -g -I/opt/local/libexec/llvm-3.8/include src/struct_clone.cpp
/opt/local/libexec/llvm-3.8/bin/clang++ -I/usr/lib/llvm-3.8/include -fPIC -fvisibility-inlines-hidden -ffunction-sections -fdata-sections -g -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -std=c++11 -fcxx-exceptions -c -o build/readIR.o -g -I/opt/local/libexec/llvm-3.8/include src/readIR.cpp
In file included from src/hostside_opencl_funcs.cpp:17:
/Users/erybski/git/tensorflow-cl/third_party/cuda-on-cl/include/cocl/cocl.h:91:16: warning: 'host' attribute ignored [-Wignored-attributes]
attribute((host)) inline unsigned long long atomicExch(volatile unsigned long long p, unsigned long long val) {
^
src/hostside_opencl_funcs.cpp:194:33: error: call to member function 'in' is ambiguous
launchConfiguration.kernel->in(offset);
^
/Users/erybski/git/tensorflow-cl/third_party/cuda-on-cl/src/EasyCL/CLKernel.h:101:15: note: candidate function
CLKernel *in(float value);
^
/Users/erybski/git/tensorflow-cl/third_party/cuda-on-cl/src/EasyCL/CLKernel.h:104:15: note: candidate function
CLKernel *in(int32_t value);
^
/Users/erybski/git/tensorflow-cl/third_party/cuda-on-cl/src/EasyCL/CLKernel.h:106:15: note: candidate function
CLKernel *in(int64_t value);
^
/Users/erybski/git/tensorflow-cl/third_party/cuda-on-cl/src/EasyCL/CLKernel.h:108:15: note: candidate function
CLKernel *in(uint64_t value);
^
/Users/erybski/git/tensorflow-cl/third_party/cuda-on-cl/src/EasyCL/CLKernel.h:110:15: note: candidate function
CLKernel *in(uint32_t value);
^
/Users/erybski/git/tensorflow-cl/third_party/cuda-on-cl/src/EasyCL/CLKernel.h:73:15: note: candidate function not viable: no known conversion from 'size_t' (aka 'unsigned long') to 'easycl::CLArray *'
for 1st argument
CLKernel *in(CLArray *clarray1d) { return input(clarray1d); }
^
/Users/erybski/git/tensorflow-cl/third_party/cuda-on-cl/src/EasyCL/CLKernel.h:83:15: note: candidate function not viable: no known conversion from 'size_t' (aka 'unsigned long') to
'easycl::CLWrapper *' for 1st argument
CLKernel *in(CLWrapper *wrapper) { return input(wrapper); }
^
/Users/erybski/git/tensorflow-cl/third_party/cuda-on-cl/src/EasyCL/CLKernel.h:91:36: note: candidate function template not viable: requires 2 arguments, but 1 was provided
template CLKernel *in(int N, const T *data);
^
1 warning and 1 error generated.
make: *
* [build/hostside_opencl_funcs.o] Error 1
make: *** Waiting for unfinished jobs....
src/struct_clone.cpp:245:12: warning: 11 enumeration values not handled in switch: 'HalfTyID', 'X86_FP80TyID', 'FP128TyID'... [-Wswitch]
switch(typeID) {
^
1 warning generated.

@rybskej
rybskej commented Oct 25, 2016 edited

launchConfiguration.kernel->in((int64_t)offset);

This patch worked. Thank you.

After applying this, continuing the build resulted in size_t namespace errors:

$ make -j ${NUM_PROCS}
mkdir -p build
mkdir -p build
/opt/local/libexec/llvm-3.8/bin/clang++ -c -o build/hostside_opencl_funcs.o -std=c++11 -fPIC -g -O2 -Ipwd/include -Ipwd/src/EasyCL src/hostside_opencl_funcs.cpp
/opt/local/libexec/llvm-3.8/bin/clang++ -I/usr/lib/llvm-3.8/include -fPIC -fvisibility-inlines-hidden -ffunction-sections -fdata-sections -g -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -std=c++11 -fcxx-exceptions -o build/ir-to-opencl -g -I/opt/local/libexec/llvm-3.8/include src/ir-to-opencl.cpp build/struct_clone.o build/readIR.o src/ir-to-opencl-common.cpp build/mutations.o /opt/local/bin/llvm-config-mp-3.8 --ldflags --system-libs --libs all
/opt/local/libexec/llvm-3.8/bin/clang++ -c -o build/cocl_events.o -std=c++11 -fPIC -g -O2 -Ipwd/src/CLBlast/include -Ipwd/include -Ipwd/src/EasyCL src/cocl_events.cpp
/opt/local/libexec/llvm-3.8/bin/clang++ -I/usr/lib/llvm-3.8/include -fPIC -fvisibility-inlines-hidden -ffunction-sections -fdata-sections -g -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -std=c++11 -fcxx-exceptions -o build/patch-hostside -g -I/opt/local/libexec/llvm-3.8/include src/patch-hostside.cpp build/readIR.o build/mutations.o build/struct_clone.o src/ir-to-opencl-common.cpp /opt/local/bin/llvm-config-mp-3.8 --ldflags --system-libs --libs all
In file included from src/hostside_opencl_funcs.cpp:17:
/Users/erybski/git/tensorflow-cl/third_party/cuda-on-cl/include/cocl/cocl.h:91:16: warning: 'host' attribute ignored [-Wignored-attributes]
attribute((host)) inline unsigned long long atomicExch(volatile unsigned long long p, unsigned long long val) {
^
/opt/local/libexec/llvm-3.8/bin/clang++ -c -o build/cocl_blas.o -std=c++11 -fPIC -g -O2 -Ipwd/src/CLBlast/include -Ipwd/include -Ipwd/src/EasyCL src/cocl_blas.cpp
1 warning generated.
/opt/local/libexec/llvm-3.8/bin/clang++ -c -o build/cocl_error.o -std=c++11 -fPIC -g -O2 -Ipwd/src/CLBlast/include -Ipwd/include -Ipwd/src/EasyCL src/cocl_error.cpp
In file included from src/cocl_blas.cpp:15:
/Users/erybski/git/tensorflow-cl/third_party/cuda-on-cl/include/cocl/cocl_blas.h:8:9: error: no type named 'size_t' in namespace 'std'; did you mean simply 'size_t'?
typedef std::size_t cublasStatus_t;
^~
size_t
/opt/local/libexec/llvm-3.8/bin/../lib/clang/3.8.1/include/stddef.h:62:23: note: 'size_t' declared here
typedef SIZE_TYPE size_t;
^
In file included from src/cocl_blas.cpp:15:
/Users/erybski/git/tensorflow-cl/third_party/cuda-on-cl/include/cocl/cocl_blas.h:17:5: error: no type named 'size_t' in namespace 'std'; did you mean simply 'size_t'?
std::size_t cublasCreate(cublasHandle_t *phandle);
^~
size_t
/opt/local/libexec/llvm-3.8/bin/../lib/clang/3.8.1/include/stddef.h:62:23: note: 'size_t' declared here
typedef SIZE_TYPE size_t;
^
In file included from src/cocl_blas.cpp:15:
/Users/erybski/git/tensorflow-cl/third_party/cuda-on-cl/include/cocl/cocl_blas.h:18:5: error: no type named 'size_t' in namespace 'std'; did you mean simply 'size_t'?
std::size_t cublasDestroy(cublasHandle_t handle);
^~
size_t
/opt/local/libexec/llvm-3.8/bin/../lib/clang/3.8.1/include/stddef.h:62:23: note: 'size_t' declared here
typedef SIZE_TYPE size_t;
^
In file included from src/cocl_blas.cpp:15:
/Users/erybski/git/tensorflow-cl/third_party/cuda-on-cl/include/cocl/cocl_blas.h:19:5: error: no type named 'size_t' in namespace 'std'; did you mean simply 'size_t'?
std::size_t cublasSgemm(cublasHandle_t blas, int transA, int transB, int M, int N, int K,
^~
size_t
/opt/local/libexec/llvm-3.8/bin/../lib/clang/3.8.1/include/stddef.h:62:23: note: 'size_t' declared here
typedef SIZE_TYPE size_t;
^
In file included from src/cocl_blas.cpp:15:
/Users/erybski/git/tensorflow-cl/third_party/cuda-on-cl/include/cocl/cocl_blas.h:21:5: error: no type named 'size_t' in namespace 'std'; did you mean simply 'size_t'?
std::size_t cublasSetPointerMode(cublasHandle_t handle, cublasPointerMode_t mode);
^~
size_t
/opt/local/libexec/llvm-3.8/bin/../lib/clang/3.8.1/include/stddef.h:62:23: note: 'size_t' declared here
typedef SIZE_TYPE size_t;
^
In file included from src/cocl_blas.cpp:15:
/Users/erybski/git/tensorflow-cl/third_party/cuda-on-cl/include/cocl/cocl_blas.h:22:5: error: no type named 'size_t' in namespace 'std'; did you mean simply 'size_t'?
std::size_t cublasGetPointerMode(cublasHandle_t handle, cublasPointerMode_t *mode);
^~
size_t
/opt/local/libexec/llvm-3.8/bin/../lib/clang/3.8.1/include/stddef.h:62:23: note: 'size_t' declared here
typedef SIZE_TYPE size_t;
^
In file included from src/cocl_blas.cpp:15:
/Users/erybski/git/tensorflow-cl/third_party/cuda-on-cl/include/cocl/cocl_blas.h:23:5: error: no type named 'size_t' in namespace 'std'; did you mean simply 'size_t'?
std::size_t cublasSetStream(cublasHandle_t handle, cudaStream_t streamId);
^~
size_t
/opt/local/libexec/llvm-3.8/bin/../lib/clang/3.8.1/include/stddef.h:62:23: note: 'size_t' declared here
typedef SIZE_TYPE size_t;
^
/opt/local/libexec/llvm-3.8/bin/clang++ -c -o build/cocl_memory.o -std=c++11 -fPIC -g -O2 -Ipwd/src/CLBlast/include -Ipwd/include -Ipwd/src/EasyCL src/cocl_memory.cpp
/opt/local/libexec/llvm-3.8/bin/clang++ -c -o build/cocl_device.o -std=c++11 -fPIC -g -O2 -Ipwd/src/CLBlast/include -Ipwd/include -Ipwd/src/EasyCL src/cocl_device.cpp
7 errors generated.
make: *
* [build/cocl_blas.o] Error 1
make: *** Waiting for unfinished jobs....

@bhack
bhack commented Oct 25, 2016

Can we push long log on gist to let the thread to be still readable?

@lukeiwanski
Contributor

question: how are you guys solving the issue of address spaces?

@hughperkins The SYCL specs describe in section 5.8 ("Address-space deduction")
how an implementation needs to deal with different memory types. This
is similar to previous work done for PlayStation 3 and described in
this paper: Offload – Automating Code Migration to Heterogeneous
Multicore Systems
or C++ on Accelerators: Supporting Single-Source SYCL and HSA Programming Models Using Clang

hope that helps.

@alephman

@hughperkins Can I compile your tensorflow-opencl repo code to apply my ARM board? My ARM board has Imagination GPU which support opencl 1.2 .

@dcolley
dcolley commented Oct 25, 2016

I stumbled on this thread while searching for tf/intel support.

I have an intel MacBook Pro, how can I help? I don't know c/c++, but I can follow build/compile/test instructions and pass back (pastebin) results...

derek$ system_profiler SPDisplaysDataType
Graphics/Displays:

Intel Iris:

  Chipset Model: Intel Iris
  Type: GPU
  Bus: Built-In
  VRAM (Dynamic, Max): 1536 MB
  Vendor: Intel (0x8086)
  Device ID: 0x0a2e
  Revision ID: 0x0009
  Metal: Supported
  Displays:
    Color LCD:
      Display Type: Retina LCD
      Resolution: 2560 x 1600 Retina
      Retina: Yes
      Pixel Depth: 32-Bit Color (ARGB8888)
      Main Display: Yes
      Mirror: Off
      Online: Yes
      Automatically Adjust Brightness: Yes
      Built-In: Yes
    PL2202W:
      Resolution: 1680 x 1050 @ 60 Hz
      Pixel Depth: 32-Bit Color (ARGB8888)
      Display Serial Number: 05884C7A57014
      Mirror: Off
      Online: Yes
      Rotation: Supported
      Adapter Type: Apple Mini DisplayPort To VGA Adapter
      Automatically Adjust Brightness: No
      Adapter Firmware Version: 1.03
@alephman

@hughperkins Thanks for your instructions!
I try to compile your cuda-on-cl on arm platform. Following your cuda-on-cl's guide:
My ARM board info:
arm64, gcc 4.9 , clang and llvm 3.5, openCL 1.2

* Do I have to use clang++-3.8 version?*
git clone --recursive https://github.com/hughperkins/cuda-on-cl
make
error:
clang++-3.8: Command not found
I edit the Makefile like this: CLANG=clang++ LLVM_CONFIG=llvm-config LLVM_INCLUDE=/usr/include/llvm
then make again:
error:
src/mutations.h:3:10: fatal error: 'llvm/IR/Module.h' file not found

try run make run-test-cocl-cuda_sample:
make: cocl: Command not found

@dcolley
dcolley commented Oct 27, 2016

@hughperkins let me give it a try.

@inferrna

Got error while testing keras with tensorflow

keras$ KERAS_BACKEND=tensorflow pytest3

Output errors:

Invalid kernel name, code -46, kernel _ZN5Eigen8internal15EigenMetaKernelINS_15TensorEvaluatorIKNS_14TensorAssignOpINS_9TensorMapINS_6TensorIfLi1ELi1EiEELi16ENS_11MakePointerEEEKNS_18TensorCwiseUnaryOpINS0_12scalar_rightIffNS0_17scalar_product_opIffEEEEKNS4_INS5_IKfLi1ELi1EiEELi16ES7_EEEEEENS_9GpuDeviceEEEiEEvT_T0_
__internal__ build log: 
"/tmp/OCL11307T1.cl", line 3: error: variable with automatic storage duration
          cannot be stored in the named address space
      local float mem[1024];

Code:

inline float __shfl_down_3(float v0, int v1, int v2) {
    local float mem[1024];
    int tid = get_local_id(0);
    int warpid = tid % 32;
    int warpstart = tid - warpid;
    mem[tid] = v0;
    //barrier(CLK_LOCAL_MEM_FENCE);
    int warpsrc = warpid + v1;
    warpsrc = warpsrc >= 32 ? warpid : warpsrc;
    return mem[warpstart + warpsrc];
}
@RicardoRodriguezArg

hi everyone, my name is ricardo , i am a C++ programmer with many years in C++ experience, and little on Cuda, i will be glade in contribute to this effort. How can i contribute to this job?

@RicardoRodriguezArg

Ok, i have an Odroid Xu3 with a Mali-T628 MP6(OpenGL ES 3.0/2.0/1.1 and OpenCL 1.1 Full profile)
running on OS: LUbuntu 1404 64 bits
I will make a complete installation and post the result on this platform.
About bugs, there is a list of bugs (something like Bugzilla?) or an spreadsheet with a list of bugs?
Cheers!

@dcolley dcolley referenced this issue in hughperkins/tensorflow-cl Oct 31, 2016
Open

Mac build instructions #2

@LifeIsStrange
LifeIsStrange commented Nov 3, 2016 edited

What about using HIP ?
https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/docs/markdown/hip_faq.md#how-does-hip-compare-with-opencl
https://github.com/RadeonOpenCompute/hcc
GPUOpen-ProfessionalCompute-Tools/HIP#45
"Your wish is being granted, Eigen is being ported over AMD GPU via HIP. The second part of your request is can we bring standardized tool supporting FLOAT16 that ships with all our GFX8 GPU's, wish granted."
Our development branch of AMDGPU compiler now support's both Float16 and Int16 native instruction, instead of emulating FP16/Int16 with up convert & down convert instructions to convert from FP16/Int16 to Float and back.

This is f16 tests on Fiji hardware successfully executing a matrix multiplication with half types with conversion and with Native instructions."

Also, not related but you should use syCL/openCL 2.0 instead of 1.2, because nvidia is already supported via CUDA. And openCL 2.0 is supported on both AMD and Intel Windows drivers. Also AMD has said that they will soon opensource un openCL 2.0 driver for Linux (which could be used by Intel, opensource magic) (and Intel already has a Linux openCL 2.0 implementation which Just need maturation.) if you ask Intel and AMD, maybe they could speed up the work, because tensorflow is important for their economic interests. And they already have said in this comment section that they wanted to help. Also all the major ARM makers support openCL 2.0. This could open a lot of opportunitys for Android (which is in the economic interest of Google) , raspberry like, smart TVs, etc

And in mid term we could eventually develop an opencl 1.2 fallback layer for non supported hardware.
And the implementation should use also openVX (which is now supoorted by all major hardware makers, and AMD has an opensource implementation) and with https://www.khronos.org/news/press/khronos-launches-dual-neural-network-standard-initiatives
And the all with Spir-V (which can be use simultaneously by Vulkan and openGL).
You could say that I'm making a duplicate of what was already said, but synthetizing is important.
And finally, could tensorflow use HSA ?

http://www.hsafoundation.com
HSA would be awesome on Android.

@olesalscheider
Contributor

I don't know if HIP would be useful or not. It is only supported on some AMD cards so that we need an OpenCL implementation anyway if we want to support all devices. It might still be worth it if the HIP implementation is notably faster. This might be the case but I haven't seen many benchmarks (HIP vs. OpenCL) yet. Another reason might be MLOpen (which is written in HC) as an replacement for cudnn but again I have no idea how fast that is or which features it supports.

TensorFlow would not use HSA directly because it is quite low-level. But HC (and HIP) is implemented on top of it and you can also implement OpenCL on top of if (pocl does that).

@jrmuizel
jrmuizel commented Nov 6, 2016

Would the relooper algorithm be helpful here? http://mozakai.blogspot.ca/2012/05/reloop-all-blocks.html

@keryell
keryell commented Nov 6, 2016

@hughperkins Nice to see you have some progress with your compiler, but I think it becomes off-topic for TensorFlow. You should start many smaller discussion threads on the GitHub page of your compiler project instead. It would be more focused and productive I guess.

@bhack
bhack commented Nov 6, 2016
@bhack
bhack commented Nov 7, 2016

Initial OpenCL/SyCL support was merged in master with #5267

@olesalscheider
Contributor
olesalscheider commented Nov 7, 2016 edited

Congratulations!

@keryell Btw, what happened to the triSYCL repository? It seems to be gone and I can only find a reference to Khronos' Gitlab which is not publicly accessible.

EDIT: I found your private clone, only the one from amd is gone.

@alephman
alephman commented Nov 7, 2016

@bhack, does the opencl-docker support in mac platform?

@bhack
bhack commented Nov 7, 2016 edited

@alephman I don't have an OSX platform but I think that adapting a little bit the launching command could works.

@dcolley
dcolley commented Nov 7, 2016 edited

@bhack @alephman: see my comment about mac above, if you point me to the build instructions I'll have a go

@keryell
keryell commented Nov 7, 2016

@olesalscheider: yes, triSYCL moved from AMD to Xilinx https://github.com/Xilinx/triSYCL but you are right, the version on my GitHub workspace works too at https://github.com/keryell/triSYCL

We have not tried triSYCL on TensorFlow yet. There is already a big build config work to do just to try...

@bhack
bhack commented Nov 7, 2016

@keryell What is the triSYCL status?

@LifeIsStrange

Intel beignet opencl 2.0 support is almost done !
http://phoronix.com/scan.php?page=news_item&px=Beignet-Birthday-CL2

@keryell
keryell commented Nov 11, 2016

@bhack triSYCL is mainly developed at Xilinx now. Still adding more and more features. The Clang/LLVM-based outlining compiler is still in development to have a full single-source experience on a device. But the OpenCL compatibility mode, already implemented, has some value too, by simplifying the communications between host and kernels with the SYCL runtime doing the lazy transfers according to the dependencies expressed by the accessors.

@SunAriesCN

My mac is OpenCL compatible, so how can I run my tensorflow with openCL? I just found that opencl had been supported in tensorflow, when I configure the new codes.

@SunAriesCN

@hughperkins there is no clinfo instruction in my mac, what can I do for it? But I can compile the test code here for opencl with clang and result the following info:
clang -framework OpenCL dumpcl.c -o dumpcl && ./dumpcl Device Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz supports OpenCL 1.2 Device Intel(R) Iris(TM) Graphics 6100 supports OpenCL 1.2

@SunAriesCN
SunAriesCN commented Nov 14, 2016 edited

Thank you @hughperkins, but I think I tried ComputeCpp yesterday, and it seems that macOS is still not supported by ComputeCpp. So maybe waiting for new updates is the only thing I can do (T.T). BTW, my Iris 6100 is eighth generation, which is fine for OpenCL 1.2.

@keryell
keryell commented Nov 15, 2016

@hughperkins yes, SYCL 1.2 is a priori for OpenCL 1.2 and SYCL 2.2 is a priori for OpenCL 2.2.
I say "a priori" because, if you do not use anything requiring the OpenCL-compatibility mode of SYCL, SYCL does not really require OpenCL at all. Actually, SYCL is a very generic model for heterogeneous computing and could run on top of anything, but of course a real implementation may require OpenCL too.

@igerard
igerard commented Nov 17, 2016

Hello,

I am learning/working with TensorFlow and Keras for the time being, and I would be interested in getting the OpenCL support working under macOS... Is there any news on the work being done around macOS?

I managed to compile TensorFlow, but if I try to configure it for OpenCL it asks me for the ComputeCpp 1.2 location, and there seems to be no ComputeCpp for macOS.

@vade
vade commented Nov 17, 2016

Hello. By no means an expert in ML / TensorFlow / or even OpenCL, but I'm an experienced Mac graphics dev who desperately wants faster TensorFlow performance on systems with integrated and AMD GPUs, using built-in libraries and simple dependencies :)

How can I help?

@vade
vade commented Nov 17, 2016

Looking at the last compile failure on OS X in the Travis log @hughperkins - it looks like running 'xcode-select --install' might be a fix? It should re-link the /usr/include directory. I had this issue myself when updating the Xcode beta to the release version and had trouble compiling some C++ code.

@ptillet
ptillet commented Nov 17, 2016

It seems like the XLA compiler (https://www.tensorflow.org/versions/master/resources/xla_prerelease.html) will provide LLVM code generation from dataflow graphs. This means very easy access to SPIR-V, and therefore to Vulkan's compute API. With code generation sorted out, I can't imagine Google not providing Vulkan compatibility, given the high number of unused integrated GPUs running on Android.

@vade
vade commented Nov 17, 2016

@hughperkins

Quickly: right now I am running Inception v3 in a custom C++ / Objective-C codebase and passing decoded video frames into the network. I don't know enough about TF to know its low-level needs, but at a high level: load models, run a session, expect stuff to work. I think that means 100% compatibility, to be really honest. I know that's of no help in prioritizing. Basically, the C++ image recognition example using TF / Inception v3 was my starting point.

cuda-on-cl running on Mac: I've checked out the repo and can help debug, run builds on my systems, and verify results on a variety of hardware: I have access to AMD Mac Pros with dual D700s, Nvidia Mac laptops, and desktop systems.

Thanks for your detailed feedback. I'll monitor the repo, try to follow along, and help as best I can.

@VincentSC

Hugh, you might want to look at http://chrec.cs.vt.edu/cu2cl/ to learn how some functions are mapped.

@VincentSC

At my company StreamComputing we have various GPUs for build-testing and benchmarking, which we use for our customer-projects. I could hook your Github into our Jenkins to do a weekly run.

@igerard
igerard commented Nov 20, 2016

Thank you for the answer; I will get back to the subject at work this week, with specific scripts.

My use cases are around text/syntactic matching analysis, using Gensim and Keras/TensorFlow in my experiments.

@igerard
igerard commented Nov 21, 2016

I am willing to help you with testing.

I have: a Windows PC with an AMD card,
an MBP with an AMD card,
and an MB with an Intel integrated GPU.

@cathalgarvey

Hey @hughperkins - I am going through the test set above, this evening, on an AMD R9 390 8GB. So far I've already got one different result: logistic_regression.py trains and doesn't return nan. So, good! It segfaults at the end, though, so I'll investigate whether the script or the CL code is at fault.

Where should I push my results so they can be most useful to you?
Perhaps we could get a standard "test script" that generates a standard set of results, which volunteers can push to you (or set up on local CIs or whatever)?

@cathalgarvey

py.test is as good a solution as any; it's just a pip away and that's part of the process for installing tensorflow anyway.

I've discovered a few interesting things since starting my tests, and they may not be debuggable using Python output alone, however:

  • Different calls to the same script may crash early, may "hang" (no output, no progress, no response to Ctrl-C; the process needs to be pkill -9'd), or may crash late, either at the validation step or after the script completes successfully. Crashes (segfaults) may take down Xorg.
  • The results vary for seemingly no reason: I may call a script and have it segfault, then call it again and it will work.
  • Hangs can occur in portions of code that were working literally moments ago; I've had one hang occur within or after a training batch, after several hundred batches had just completed successfully.

So, it might be that there's unresolved state on the GPU side, and that a good segfault is needed to clear it out? I don't know much about the GPU model or OpenCL yet, so I can't contribute much here. But it might be that GPU-side debugging output is needed to properly explore what's happening.
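
One generic OpenCL debugging pattern that might help narrow this down (nothing specific to cuda-on-cl, just standard practice) is to fail fast on every error code, instead of letting a bad enqueue poison later calls:

    #include <CL/cl.h>
    #include <cstdio>
    #include <cstdlib>

    // Abort immediately on any OpenCL error so a failure surfaces at the
    // call that caused it, rather than as a later hang or segfault.
    static void check(cl_int err, const char *what) {
      if (err != CL_SUCCESS) {
        std::fprintf(stderr, "%s failed: OpenCL error %d\n", what, err);
        std::exit(1);
      }
    }

    // Usage, e.g. around a kernel launch (queue/kernel/gws defined elsewhere):
    //   check(clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &gws,
    //                                nullptr, 0, nullptr, nullptr),
    //         "clEnqueueNDRangeKernel");
    //   check(clFinish(queue), "clFinish");  // async failures show up here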

@cathalgarvey

Also, I thought you were with AMD from your GitHub, but it seems you're a "rogue agent" doing this whole CUDA-on-CL thing on your own time. Thanks sincerely for spearheading this! Is there some way that I and others can contribute to your efforts, perhaps by crowdfunding you a GPU? Or you could set up a Patreon; I'd be happy to sign up for a monthly contribution to the project.

@VincentSC

Concerning AMD GPUs, we're a partner of AMD. See my message from 8 days ago, which you might have missed:

At my company StreamComputing we have various GPUs for build-testing and benchmarking, which we use for our customer-projects. I could hook your Github into our Jenkins to do a weekly run.

@VincentSC

I wonder if you might have the possibility of setting up a CI server, that runs on each commit?

No problem. I'll probably need write access to the project, so Jenkins can write the log file into a build-log directory. I just spammed you, so we can discuss.

@lukeiwanski
Contributor

Hi all,

As you have probably seen already, a bunch of SYCL stuff has been pushed to TensorFlow. We are not done yet and there is plenty to do, but we are making progress.

If you are interested in contributing, or just curious about the current state, check the breakdown below.

Infrastructure
Google kindly donated two machines that are set up to periodically test @benoitsteiner's fork of TensorFlow (https://github.com/benoitsteiner/tensorflow-opencl)

Both have AMD GPUs:

CL_DEVICE_NAME : Hawaii
CL_DRIVER_VERSION : 1912.5 (VM)

and

CL_DEVICE_NAME : Fiji
CL_DRIVER_VERSION : 1912.5 (VM)

We at Codeplay are looking to dedicate machine(s) next year too, to improve coverage across a more diverse set of OpenCL devices.

We are looking for contributors on that front, if anyone is interested in providing a test build server for the relevant platforms we support.
Currently, the requirements are:
- Ubuntu 14.04
- OpenCL drivers that support SPIR ( Intel CPU / GPU or AMD GPU )

@VincentSC perhaps you could help out with that?

Tests
On the Fiji machine ( https://ci.tensorflow.org/job/tensorflow-opencl/127/consoleFull ) we are facing 164 failures.

On the Hawaii machine ( https://ci.tensorflow.org/job/tensorflow-opencl/129/consoleFull ) we are down to 56 failures.

We are looking into fixing the failing gradient tests and investigating the origin of the additional failures on the Fiji machine.

Eigen
For the past few months we have been actively implementing features needed by TensorFlow including: Reshaping, Slicing, Basic Reduction etc. Currently we are implementing Contraction. A detailed breakdown can be found in the Eigen Tensor tab of https://docs.google.com/spreadsheets/d/1YbHn7dAFPPG_PgTtgCJlWhMGorUPYsF681TsZ4Y4LP0/edit#gid=0.
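
For those unfamiliar with the Eigen Tensor module, these are the kinds of expressions the SYCL backend has to cover. Roughly (illustrative only, shown on the default device; the SYCL path evaluates the same expressions through an Eigen::SyclDevice):

    #include <unsupported/Eigen/CXX11/Tensor>

    int main() {
      Eigen::Tensor<float, 2> a(64, 32), b(32, 16);
      a.setRandom();
      b.setRandom();

      // Reshaping
      Eigen::array<Eigen::Index, 2> new_dims{{32, 64}};
      Eigen::Tensor<float, 2> r = a.reshape(new_dims);

      // Slicing
      Eigen::array<Eigen::Index, 2> offsets{{0, 0}}, extents{{16, 16}};
      Eigen::Tensor<float, 2> s = a.slice(offsets, extents);

      // Basic reduction
      Eigen::Tensor<float, 0> total = a.sum();

      // Contraction (a matrix multiply along the shared dimension)
      Eigen::array<Eigen::IndexPair<int>, 1> dims{{Eigen::IndexPair<int>(1, 0)}};
      Eigen::Tensor<float, 2> c = a.contract(b, dims);

      return 0;
    }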

TensorFlow
A lot of coefficient-wise operations have been implemented, including Abs, Floor, IsFinite, Log, Pow, Mul, etc., as well as tensor manipulations like Reshape, Shape, Identity, Fill, etc.
A detailed breakdown can be found in the TensorFlow Kernels tab of https://docs.google.com/spreadsheets/d/1YbHn7dAFPPG_PgTtgCJlWhMGorUPYsF681TsZ4Y4LP0/edit#gid=1719702219
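
To give a flavour of what adding one of those kernels involves, SYCL kernels are registered much like the CPU/CUDA ones. Roughly (a paraphrase of the cwise kernel pattern; exact headers and macros in the tree may differ):

    // Sketch of SYCL kernel registration, paraphrased from the cwise
    // kernels; the exact headers/macros in the tree may differ.
    #include "tensorflow/core/kernels/cwise_ops_common.h"

    namespace tensorflow {
    #ifdef TENSORFLOW_USE_SYCL
    typedef Eigen::SyclDevice SYCLDevice;

    // Reuse the same Eigen functor as the CPU/CUDA paths; only the device
    // template parameter changes, and Eigen's SYCL backend does the rest.
    REGISTER_KERNEL_BUILDER(
        Name("Abs").Device(DEVICE_SYCL).TypeConstraint<float>("T"),
        UnaryOp<SYCLDevice, functor::abs<float>>);
    #endif  // TENSORFLOW_USE_SYCL
    }  // namespace tensorflow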

Organisation
The above spreadsheet has several tabs that categorise the efforts of the project: Overall Plan, Eigen Tensor, TensorFlow Kernels, and Models.

If you would like to get involved, please put your name next to the item you are working on or add anything important that is missing.
Thanks,
Luke

@sushisushix

Is this roadmap active?

@VincentSC

@lukeiwanski Yes, no problem. Contact us via info@streamcomputing.eu

@benoitsteiner benoitsteiner added a commit to benoitsteiner/tensorflow that referenced this issue Dec 19, 2016
lukeiwanski + benoitsteiner: Added missing BUILD dummy file to third_party/sycl/crosstool/BUILD (#22)
* Added missing BUILD dummy file to third_party/sycl/crosstool/BUILD

* Use floor_div_real for SYCL device.

* Cleaned up SYCL crosstool.
2f59c83
@benoitsteiner benoitsteiner added a commit to benoitsteiner/tensorflow that referenced this issue Dec 21, 2016
lukeiwanski + benoitsteiner: Added missing BUILD dummy file to third_party/sycl/crosstool/BUILD (#22)
* Added missing BUILD dummy file to third_party/sycl/crosstool/BUILD

* Use floor_div_real for SYCL device.

* Cleaned up SYCL crosstool.
7aec15c
@dylib
dylib commented Dec 25, 2016 edited

After reading through all this, I'm guessing there's no solid solution yet for using OpenCL on macOS/OS X? I tried to compile TensorFlow C++ with OpenCL support (which I assume requires ComputeCpp for SYCL 1.2, as someone pointed out).

I looked around and couldn't locate where to download, compile, or build the SYCL library. Is it here: https://www.codeplay.com/ ? I'm really unsure how to proceed, thanks...

@inflation
Contributor

@dylib As far as I know, there is still no ComputeCpp for macOS, so that means OpenCL support for macOS is not ready.

@inferrna

Still can't get it working on Ubuntu 16.04 with an AMD card and the Catalyst driver (#6497). Is there any howto?

@inferrna

I had to look at the /usr/local/computecpp/bin/computecpp_info output before trying to use TF compiled with OpenCL support. In my case it shows

  Device is supported                     : NO - Unsupported vendor
  CL_DEVICE_NAME                          : Pitcairn
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.

Now there are two choices for running TF on GPU:

  • CUDA: works well, but on a set of devices limited by the vendor, and is proprietary
  • ComputeCpp: works badly, on a set of devices limited by the ComputeCpp developers, and is also proprietary

Still no real OpenCL support.

@benoitsteiner
Member

@inferrna There is an OpenCL-specific section in the overall TensorFlow documentation. It will be published on the tensorflow.org site soon.

@atlury
atlury commented Jan 8, 2017 edited

@benoitsteiner What is the current state of OpenCL convolution support? Are you planning on leveraging the existing kernels directly? What about matrix multiplications?

Any ETA?

@olesalscheider
Contributor

It seems AMD is working on that: GPUOpen-ProfessionalCompute-Tools/HIP#45 (comment)

@bhack
bhack commented Jan 13, 2017

Can the XLA backends' LLVM IR be converted to SPIR-V with https://github.com/KhronosGroup/SPIRV-LLVM?

@k-hashimoto

How about this? I think this package can work on Radeon GPUs.

https://github.com/RadeonOpenCompute/ROCm

@lukeiwanski
Contributor

@bhack From #6449 (comment)

@lukeiwanski Will XLA also impact your effort?

XLA and SYCL are complementary solutions for different situations: SYCL is designed to provide full programmability and customizability; XLA is for optimizing well-defined patterns in graphs.

My understanding of XLA is that it optimizes some existing TensorFlow graphs at runtime using the LLVM compiler. It requires optimization passes to be implemented in the compiler for each algorithm used in the graph.
The SYCL approach is the only one that will deliver a CUDA-like level of programmability - which is what developers need.

With SYCL we are aiming to provide support for all the TensorFlow Ops and ease development of new operations.

This means SYCL lets you write new high-performance operations very easily, while XLA can optimize whole graphs, provided it supports all the ops in the graph.

Can the XLA backends' LLVM IR be converted to SPIR-V with https://github.com/KhronosGroup/SPIRV-LLVM?

I don't see any reason why that wouldn't be possible.

@keryell
keryell commented Jan 17, 2017

@k-hashimoto: we are discussing here porting TensorFlow to OpenCL, a standard from the Khronos Group, and actually more to OpenCL SYCL, the post-modern C++ single-source standard from the Khronos Group.
ROCm looks like yet another non-standard solution from some vendor.
If you are interested in proprietary solutions, there is already a CUDA version of TensorFlow, which seems to work well. :-)

@vlerenc
vlerenc commented Jan 18, 2017

👍
