OpenCL Support #488

Closed
cathalgarvey opened this Issue Jan 18, 2017 · 40 comments


cathalgarvey commented Jan 18, 2017

Torch7 has OpenCL support via @hughperkins' work - if PyTorch is based on the same backends as Lua Torch, how hard would it be to port OpenCL over and get this working on virtually all modern GPUs and integrated graphics?

Deep learning needs a more accessible beginners' experience, so integrated graphics support would help win early mindshare. DL also needs cheaper hardware; NVIDIA's monopoly and crazy prices are a hard and unnecessary tax.

Further, there are higher-level abstractions like Keras that currently only support CUDA because the lower-level libraries only support CUDA. If PyTorch ported Torch7's OpenCL backend, then building a Keras backend on top of it would be a step towards OpenCL-ising the large body of code written for Keras, too.

The first Python OpenCL framework for DL will win a lot of credibility at beginners' workshops and in the creative-AI space where ubiquitous hardware is a must. OpenCL support should be a priority, in my opinion, for any new framework in this space. Tensorflow is already working on OpenCL, so perhaps PyTorch will miss this window of opportunity. I'd love to see both hit the milestone at once, to encourage healthy competition.

soumith (Member) commented Jan 18, 2017

I understand where you are coming from.

We officially are not planning any OpenCL work because:

  • AMD itself seems to be moving towards HIP / GPUOpen which has a CUDA transpiler (and they've done some work on transpiling Torch's backend)
  • Intel is moving its speed and optimization value into MKLDNN
  • Generic OpenCL support has strictly worse performance than using CUDA/HIP/MKLDNN where appropriate.

Hugh's OpenCL port is limited to the most popular layers, and does not cover the 250+ C functions that would need to be ported to get a fully functional OpenCL backend going. It makes sense for an extension to port part of the core functionality, since there is no expectation of full parity, but the official / core distribution cannot get away with this -- users expect full API parity.

So, considering that full OpenCL support is a humongous effort, and weighing the upside, we won't be working on it.

However, if anyone in the community wants to give it a go, feel free to :)


cathalgarvey commented Jan 18, 2017

Hm, I understand your position, even if I consider it misguided and unfortunate.

As Tensorflow already tentatively supports OpenCL, albeit through a proprietary compiler framework, it looks like Tensorflow is the way to go for anyone with an AMD card or, er, any CPU. But as things mature I hope that this changes; I'll keep an eager eye on PyTorch in the meantime.


soumith (Member) commented Jan 18, 2017

When HIP is ready, we will definitely look into supporting AMD cards via that path. We spent some time trying to build and benchmark AMD GPUs using OpenCL, and it did not work for us from various angles. For the CPU, we will continue optimizing our CPU backend as much as we can.
Thanks for opening the discussion up, it was a good question to ask.


hughperkins (Contributor) commented Jan 19, 2017

Porting code to OpenCL by hand is not very maintainable. I think a more automated approach could be good.

Here is a table of how I see things:

| What | Who | Input | Backend | Comments |
| --- | --- | --- | --- | --- |
| coriander | Me :-) | NVIDIA® CUDA™ | OpenCL 1.2 | Works on Mac :-) Opensource |
| HIP | AMD | HIP | AMD | |
| ComputeCpp | Codeplay® | SYCL | SPIR 1.2 | Official Tensorflow approach to OpenCL. |
| triSYCL | Keryell | SYCL | "SPIR 2.0" | Opensource |
| OpenCL™ by hand | | OpenCL | OpenCL | High maintenance, unportable, means forking the code ... |
| NVIDIA® CUDA™ | NVIDIA | NVIDIA® CUDA™ | CUDA/PTX/SASS | Reference implementation for most/all projects |

Note: for a quick introduction to 'SPIR', I will just quote https://www.khronos.org/spir:

"SPIR (Standard Portable Intermediate Representation) was initially developed for use by OpenCL and SPIR versions 1.2 and 2.0 were based on LLVM. SPIR has now evolved into a true cross-API standard that is fully defined by Khronos with native support for shader and kernel features – called SPIR-V. [...]

"For developers, using SPIR-V means that kernel source code no longer has to be directly exposed, kernel load times can be accelerated and developers can choose the use of a common language front-end, improving kernel reliability and portability across multiple hardware implementations."


lukeiwanski commented Jan 20, 2017

Hi all,
Just to clarify the ComputeCpp entry in @hughperkins' table.
ComputeCpp implements the SYCL 1.2 open standard by Khronos ( https://www.khronos.org/registry/SYCL/specs/sycl-1.2.pdf ). This runs on top of OpenCL 1.2.
Codeplay's ComputeCpp CE requires the SPIR 1.2 extension.

Current state of:

  • Devices:

    • AMD GPUs (tested on Fiji and Hawaii) with specific driver versions
    • Intel CPU / GPU
    • Working towards mobile GPU support
  • Platforms:

    • Ubuntu 14.04; Ubuntu 16.04 has also been tested
    • Work on Windows support is continuing

Thanks,


hughperkins (Contributor) commented Jan 20, 2017

> ComputeCpp implements the SYCL 1.2 open standard by Khronos ( https://www.khronos.org/registry/SYCL/specs/sycl-1.2.pdf ). This runs on top of OpenCL 1.2.

This is a little misleading. The 'OpenCL 1.2' bit in the SYCL standard relates to OpenCL-compatibility mode, e.g. see:

tensorflow/tensorflow#22 (comment)

[screenshot: sycl_opencl]


hughperkins (Contributor) commented Jan 20, 2017

(I have updated the table to state OpenCL 1.2 + SPIR 1.2 extension. @lukeiwanski Can you confirm this approximately seems correct-ish?)


hughperkins (Contributor) commented Jan 20, 2017

(since I've quoted Ronan, and since I've cited his project too, I suppose it is only courteous for me to cc @keryell )


lukeiwanski commented Jan 20, 2017

@hughperkins yes the table looks OK. 👍


hughperkins (Contributor) commented Apr 15, 2017

Update on cuda-on-cl: I've added some cuDNN API implementations, notably pooling, convolutions, sigmoid/tanh/relu activations, and softmax forward, which I think is now sufficient to build and run a cuDNN implementation of LeNet, with some minor CMakeLists.txt tweaks. I don't imagine this is sufficient for PyTorch to simply build and run on OpenCL yet :-) , but it's a few more steps in that general direction.

(I think if PyTorch is anything like Lua Torch, one key not-hard-but-needs-doing thing is linking. Currently each CUDA file is handled in a standalone way, whereas in Lua Torch it's important to be able to run CUDA functions from other modules. This is already partially implemented, by providing a facility for each module to register its source code, https://github.com/hughperkins/cuda-on-cl/blob/c15328d49dc8c5700230cd8ec4fd3efe2ada7a79/src/cocl_clsources.cpp , but I never quite got round to hooking this into each module's startup, I think. This would happen by patching the host-side LLVM code to call the registration method in its global initializers. patchhostside.cpp is the code that handles host-side LLVM patching, and would need to be extended to handle this.)

Edit: realized that we probably don't need cross-module linking after all, since reduce etc. are in templated header files, which should already be handled OK.
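
To illustrate the registration idea described above (even if, per the edit, it may not be needed), here is a rough sketch. This is not Coriander's actual API; SourceRegistry and registerModule are hypothetical names, and the generated OpenCL string is elided:

// Illustrative sketch only: each translated module registers its generated
// OpenCL source with a central registry from a global initializer, so kernels
// defined in one module can be located when launched from another module.
#include <map>
#include <string>

struct SourceRegistry {
    static std::map<std::string, std::string>& sources() {
        static std::map<std::string, std::string> s;
        return s;
    }
    static void registerModule(const std::string& name, const std::string& clSource) {
        sources()[name] = clSource;
    }
};

// The host-side LLVM patching step would effectively append a call like this
// to each module's global initializers:
namespace {
struct RegisterThisModule {
    RegisterThisModule() {
        SourceRegistry::registerModule("Sigmoid", "/* generated OpenCL source */");
    }
} registerThisModuleInstance;
}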


hughperkins (Contributor) commented Apr 15, 2017

(Well, looking at the PyTorch code, it looks like activations etc. are not via cuDNN anyway, but the convolution implementation is probably useful. I would think a first step for any attempt to use cuda-on-cl against PyTorch would be to pick one of the files from https://github.com/pytorch/pytorch/tree/master/torch/lib/THCUNN , and simply try to compile it, without linking, something like:

cocl -c Sigmoid.cu

... or similar (presumably with a bunch of -I [some include directory] thrown in)


hughperkins (Contributor) commented Apr 15, 2017

Oh, we will need to handle https://github.com/pytorch/pytorch/blob/master/torch/lib/TH/THGeneral.h.in somehow (i.e., it gets macro-expanded by things like https://github.com/pytorch/pytorch/blob/master/torch/lib/TH/THGenerateFloatTypes.h )

[screenshot: will_need_the_thcgeneral_h_in_handling_somehow]

We need THGeneral.h, which we can get like this:

# assume cloned pytorch as ~/git/pytorch, so:
PTR=$HOME/git/pytorch

cd $PTR/torch/lib/TH
mkdir build
cd build
ccmake .. -D
# change CMAKE_INSTALL_PREFIX to be the ~/pytorch directory for you (change ~ to whatever your home directory is)
# c then c then g
make -j 4
make install

... and it'll be in ~/pytorch/include/TH. But we also need THCGeneral.h, and trying the same technique gets stuck on configure:

[screenshot: thc_cmake_fails]

TorchConfig.cmake et al are not in $HOME/pytorch:

[screenshot: pytorch_no_torchconfig_files]

We also need THCGeneral.h. We can get this by doing:

cp $PTR/torch/lib/THC/THCGeneral.h.in ~/pytorch/include/THCGeneral.h
# by hand, comment out line 13 of THCGeneral.h, i.e. this line:
# // #cmakedefine USE_MAGMA

Then retry as follows:

cd $PTR/torch/lib/THCUNN
cocl -I ~/pytorch/include/TH/ -I ~/pytorch/include -I .. -I ../THC/ -c Sigmoid.cu 

This gives a bunch of undefined errors, for stuff that needs adding to cuda-on-cl, e.g. cudaTextureObject_t. These should be added to one of the include files in the include/cocl directory of the cuda-on-cl project. There are a bunch of similar defines there that can be used as a basis/model/template.
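
As an illustration only (this is not code that exists in cuda-on-cl; it is a guess modeled on the style of the existing fake defines), a stub along these lines in one of the include/cocl headers may be enough to get past the cudaTextureObject_t errors:

// Hypothetical stub: CUDA texture objects are opaque 64-bit handles, so an
// integer typedef is usually enough to make host-side code compile. Any kernel
// that actually samples textures would still need a real implementation.
typedef unsigned long long cudaTextureObject_t;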

Current output:

[screenshots: output_1, output_2]

Edit 2: we can fix that by simply commenting out line 130 of torch/lib/THC/generic/THCTensor.h:

[screenshot: comment_out_line_130]

New output:

[screenshot: thcunn_h_not_found]

Edit 3: Add -I for current directory:

cocl -I ~/pytorch/include/TH/ -I ~/pytorch/include -I .. -I ../THC/ -I . -c Sigmoid.cu

New output:

[screenshots: output_3_1, output_3_2]

... reached some hacky cuda-on-cl stuff. The issue is that it's tricky to come up with a definition of min that works in all the possible use-cases. There are a bunch of corner cases to deal with. Some of the relevant code/hacks are here:

https://github.com/hughperkins/cuda-on-cl/blob/master/include/cocl/fake_funcs.h#L106-L109

double max(double in1, double in2);
double min(double in1, double in2);
float max(float in1, float in2);
float min(float in1, float in2);

Various other previous attempts: https://github.com/hughperkins/cuda-on-cl/blob/master/include/cocl/fake_funcs.h#L87-L99

namespace cocl {
   // double max(double in1, double in2);
   // double min(double in, double in2);
}
// using cocl::max;
// using cocl::min;

extern "C" {
// double our_pretend_tanh(double in);
// double our_pretend_log(double in);
// double our_pretend_exp(double in);
// double our_pretend_max(double in1, double in2);
// double our_pretend_min(double in1, double in2);

https://github.com/hughperkins/cuda-on-cl/blob/master/include/cocl/fake_funcs.h#L114-L115

// #define max cocl::max
// #define min cocl::min

So, the first challenge is to figure out ~~a horrible hack~~ an elegant plan to get the mins to compile across all use-cases.

Edit 4: it looks like this use-case is using min against long longs. So it might be sufficient to create a declaration of min that uses long longs. I think I might leave this to someone else to try though :-)
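
For reference, a sketch of what such declarations might look like, mirroring the existing float/double declarations in fake_funcs.h (these exact signatures are an assumption, not something already in the repository):

// Hypothetical additions alongside the existing float/double declarations:
long long max(long long in1, long long in2);
long long min(long long in1, long long in2);
unsigned long long max(unsigned long long in1, unsigned long long in2);
unsigned long long min(unsigned long long in1, unsigned long long in2);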


hughperkins (Contributor) commented May 3, 2017

Well, I ~~created a pytorch branch of cuda-on-cl~~ tweaked the cuda-on-cl master branch a bit, to add the min/max functions above, and also exp10 and exp10f.

That gets as far as:

7 warnings generated.
+ /usr/local/bin/patch-hostside --hostrawfile ./Sigmoid-hostraw.ll --devicellfile ./Sigmoid-device.ll --hostpatchedfile ./Sigmoid-hostpatched.ll
Assertion failed: (isa<X>(Val) && "cast<Ty>() argument of incompatible type!"), function cast, file /usr/local/opt/llvm-3.8/include/llvm/Support/Casting.h, line 237.
/usr/local/bin/cocl_wrapped: line 389: 17959 Abort trap: 6           ${COCL_BIN}/patch-hostside --hostrawfile ${OUTPUTBASEPATH}-hostraw.ll --devicellfile ${OUTPUTBASEPATH}-device.ll --hostpatchedfile ${OUTPUTBASEPATH}-hostpatched.ll

Edit: note, the command to try, after doing the earlier steps, like copying THCGeneral.h.in etc, is:

cd ~/git/pytorch/torch/lib/THCUNN
cocl -I ~/pytorch/include/TH/ --devicell-opt inline --devicell-opt mem2reg \
--devicell-opt instcombine --devicell-opt O2 \
-I ~/pytorch/include -I .. -I ../THC/ -I . -c Sigmoid.cu

(assuming pytorch is cloned as ~/git/pytorch)

(by the way, I'm testing on a Mac. It's totally possible this will get mildly further on Ubuntu 16.04. Or not. Hard to say :-) )


hughperkins (Contributor) commented May 21, 2017

(Note: I've updated the earlier table. Removed the 'hardware' column, and simplified the 'input' and 'backend' column. Also corrected trisycl: it uses SPIR-1.2 backend, not SPIR-V, as I incorrectly stated earlier).


keryell commented May 22, 2017

Actually triSYCL generates SPIR 2.0 "de facto", which is what is output by current Clang/LLVM. It looks like SPIR but it is not exactly SPIR since the bitcode is not encoded with the right LLVM version for SPIR 2.0, which is LLVM 3.4.
So you need some SPIR consumer which is tolerant enough...


hughperkins (Contributor) commented May 22, 2017

"tolerant" is not normally a word I'd associate with GPU drivers. But I have updated trisycl to say '"SPIR 2.0"', in inverted commas. Ok-ish?


keryell commented May 22, 2017

Yes that sounds good-ish. :-) Thanks.


beatthem commented Aug 4, 2017

@hughperkins could you say whether PyTorch with OpenCL works now? Could you help me with porting deepspeech.pytorch to OpenCL?


hughperkins (Contributor) commented Aug 6, 2017

@beatthem It does not. And I don't have time to look at this right now. Happy to provide assistance/guidance/youtube-videos/etc to anyone who does have some time to dabble a bit though.

I'll make my current branch public.

Ploof! There you go: https://github.com/hughperkins/pytorch-coriander/compare/orig-master...master?expand=1

Note that most of the work is going to go into getting Thrust working, which is in Coriander itself, not in pytorch-cl. This is kind of close-ish, but will need some address-space love.

A basic Thrust program will compile now, but won't quite run, because of address-space issues. On branch https://github.com/hughperkins/coriander/tree/multiple-walks , build Coriander etc., and then do:

cd third_party/thrust/examples
cocl -g -D__CUDACC__ -D__thrust_hd_warning_disable__ -DCUDA_VERSION=3000 -I . -I .. fill_copy_sequence.cu
./fill_copy_sequence

You can actually make this run, if you use the dump/load kernel functionality to dump out the kernel, which will look like https://gist.github.com/hughperkins/c55bb2161a30291fd438b4c4feaf1ed2 , and modify https://gist.github.com/hughperkins/c55bb2161a30291fd438b4c4feaf1ed2#file-gistfile1-txt-L254 ... I don't remember how, but one of the address spaces is slightly wrong, and if you add/remove a global or local in one of the casts, I think the one on the left, it's actually possible to get it to run OK.


hughperkins (Contributor) commented Aug 6, 2017

(Note: If anyone is interested in having a ~one-hour weekly meeting, in Google Hangouts or similar, to discuss Coriander-based OpenCL PyTorch, or any OpenCL/SPIR-xxx PyTorch, then please add a comment below, and I can probably set something up. It would probably be on a Saturday (somewhat likely) or a Sunday (more likely), but I'm fairly flexible on this point.

Edit: just to be clear, you would need to be someone who is actually going to carry out development on said OpenCL/SPIR PyTorch. I don't have time to work on development of this myself right now)


kawing-chiu commented Dec 27, 2017

Arrived at this issue after reading the news that NVIDIA kindly forbids the usage of GeForce GPUs in datacenters :)


hughperkins (Contributor) commented Dec 28, 2017

> Arrived at this issue after reading the news that NVIDIA kindly forbids the usage of GeForce GPUs in datacenters :)

Haha :)

So, there are two approaches that I know of. Well, a few more if you include e.g. AMD's HIP. Actually, that should probably be your first port of call, I reckon. Have you tried it? How far does it get you?


mirh commented Dec 28, 2017

HIP just works over ROCm, which only supports the newest AMD GPUs (1-2 years old at most) and has some other requirements.
So I wouldn't be so sure about that.

Also, for the life of me I cannot see why anyone brought up MKL-DNN. It simply isn't meant to target GPUs (Intel uses clDNN for that, if anything).


choongng commented Dec 28, 2017

Internally (at Vertex.AI) we have discussed what it would take to do PyTorch on PlaidML. The short version is that it should work well; if someone wants to do the work, we can coordinate with the PyTorch team on the right technical approach. The payoff is GPU-accelerated PyTorch on Linux/Mac/Win for all cards with at least OpenCL 1.2 (and other APIs we might add in the future).


fmassa (Member) commented Dec 28, 2017

I believe there was some work being done to make PyTorch HIP-compatible. @soumith is that still being worked on?


soumith (Member) commented Dec 28, 2017

@choongng can you reach out to me at [redacted], we should detail an approach and a plan, and I want to make sure it all looks clean. Ideally we'd want this to be activated as a plugin to PyTorch or something, but there are real maintenance costs then, so we have to figure out Continuous Integration.
On our side, we are likely going to finish HIP support.


hughperkins (Contributor) commented Dec 29, 2017

@mirh wrote:

> HIP just works over ROCm, which only supports the newest AMD GPUs (1-2 years old at most) and has some other requirements.

To be fair, if you're going to set up a new data centre full of AMD GPUs, you'd probably use the very latest ones?

Ditto if you're replacing NVIDIA GPUs with AMD ones.

So, hip seems viable in both cases?

I guess there could be people who have a gaming box with a 1-2 year old AMD card in, but they're not really the target market of NVIDIA's data center segmentation: they can perfectly legally and happily use a 1080Ti or Titan V etc, per my understanding (though I could be wrong?). This is in my opinion exactly how NVIDIA's market segmentation is designed: Titan V, and 1080Tis are in my opinion something akin to 'Community Edition', and Tesla cards are in my opinion something more like 'Enterprise Edition' (disclaimer: IANAL, I could be wrong, I may have misread, misunderstood. You need to do your own due diligence to ensure you are compliant with licensing agreements, eg consult a lawyer etc).


hughperkins (Contributor) commented Dec 29, 2017

(As far as Mac support: I mean, you can run PyTorch on a Mac already, just not on the GPU. But if you're going to run training, I don't really see running on a Mac GPU as terribly viable... Spin up a few V100s on AWS, and away you go...)

(edit: I actually have one of those fancy-schmancy Radeon MBPs, the touchbar ones. How often do I enable the Radeon? When I play League of Legends. That's the only time :P)


hughperkins (Contributor) commented Dec 29, 2017

(In my opinion, for AMD to make their GPUs more viable for ML, they might consider throwing up a bunch of cloud GPUs, even at a loss: I don't really see anyone buying AMD GPUs for ML until the libraries exist. And no one is going to write libraries for cards they don't have.)

(edit: actually, Facebook or Google could do the same thing, in the spirit of diversity. I'm not sure if AMD really 'deserves' such help but ... maybe. A large part of me thinks what's needed is some aggressive young startup, Nirvana or similar. After all, I don't think you need to own your own fab, just have enough to design a tape-out and run it, I think? (which is a fair amount of money, but likely a lot less than a fab ... :P ))


mirh commented Dec 29, 2017

> To be fair, if you're going to set up a new data centre full of AMD GPUs, you'd probably use the very latest ones?

If you are bringing up a datacenter and all, surely. (Trivia: as far as I understand, if you port something to HIP, it should nonetheless retain compatibility with NVIDIA cards)

But I thought this issue was about something a tad more generic?
Maybe, say, I want to experiment with my Intel iGPU, or with some ARM board perhaps.


hughperkins (Contributor) commented Dec 29, 2017

Oh, ARM. I see, you (@mirh) want to run PyTorch models on ARM, e.g. on cell phones?

That makes sense. As to how to achieve that ... I reckon that:

    1. port PyTorch to work on ARM GPUs, and
    2. get PyTorch to work on cell phones (maybe it does? but seems like something that at least iOS would (strongly) discourage?)

... sounds hard. Training a model on a cell phone sounds quite battery-draining. I wonder whether it might be worth focusing on running prediction on the phone, using a pre-trained model, and handling training in the cloud? Running networks on phones seems fairly standard. For example, Huawei is pretty into doing so, https://www.androidauthority.com/huawei-announces-kirin-970-797788/ .

There are probably ways and means to export trained models into various formats that can be run on a phone, maybe using something that the manufacturer provides. For example, I think that this is how one would use a Movidius usb key https://developer.movidius.com/ . And then there's also nnef https://www.khronos.org/nnef (edit: and onnx :) )


mirh commented Dec 29, 2017

Those were just ideas (and I was talking about ARM servers if anything, though the newly introduced Android NNAPI certainly looks interesting).
Anyway, sorry for the digression. Better if I don't steal any more of your time.
Best wishes to you and facebook/vertexai guys.


dborgesr commented Feb 12, 2018

So I'm pretty new to systems-level coding, but I was wondering if I could be of help. I have a 1-2 year old workstation that I built to try and take advantage of larger overall throughput across the GPU and CPU, so a Ryzen CPU and a new GPU (Vega 56). In the interest of trying to have everything train and run in a VM, I would be super willing to help get it working (and test it) with Linux.


hughperkins (Contributor) commented Feb 12, 2018

> In the interest of trying to have everything train and run in a VM, I would be super willing to help get it working (and test it) with Linux.

Note that I'm not intending to work on this for the foreseeable future. In a sense, you could consider that bad news, since it means, well, it means I'm not going to work on it in the foreseeable future :) In another sense, it's an opportunity. The bar is pretty low to having impact in the OpenCL and similar world. Just start small, and see where you get to. I'm going to say that starting with Coriander is probably not the way forward; at least, it might be, but it's got a learning curve more cliff-shaped than curvy. I think a workable way forward is just to pick something simple in PyTorch, and OpenCL-ize it (or SYCL-ize it, or SPIR-V-ize it, or whatever seems fun/cool to you :) ). Just do things that seem fun/cool/useful to you, and keep going till it stops seeming fun/cool, or you are suddenly known as the guy who got PyTorch working on OpenCL etc :)
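
For a sense of scale, here is a minimal sketch of what OpenCL-izing one simple op might look like: an element-wise sigmoid forward kernel in plain OpenCL C, roughly what THCUNN's Sigmoid.cu computes. This is illustrative only, not tied to any existing PyTorch or Coriander code:

// Naive element-wise sigmoid forward in OpenCL C (illustrative sketch only).
__kernel void sigmoid_forward(__global const float* input,
                              __global float* output,
                              const int n) {
    int i = get_global_id(0);
    if (i < n) {
        output[i] = 1.0f / (1.0f + exp(-input[i]));
    }
}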


bionick87 commented Mar 10, 2018

Hi guys,

Could anyone help me convert PyTorch's nn.Conv2d to OpenCL for FPGA / non-CUDA applications?

We need to build a working group; is anyone interested in participating?

PS: a more ambitious idea is to take the output of a PyTorch model and pass it to a compiler for automatic translation to OpenCL.

Best,

Nico


rajatpundir commented Jun 16, 2018

Everyone is like: we can't port to OpenCL, it is hard, we are incompetent, we are not going to do it, someone else should do it, and so the story goes on. We need brave people. I would be willing to take a leave from my current work, if my job weren't keeping me alive, to learn OpenCL and port the damn thing; the real fault lies with AMD and the likes.


fmassa (Member) commented Jun 16, 2018

@rajatpundir work on adding AMD support started in #6625.
There were several follow-up PRs extending or fixing the support. AMD support is coming :-)


mirh commented Jun 16, 2018

Mhh... I thought the newest OpenCL had nothing to envy CUDA for?
Then again, sure, AMD has some lag there... (though one could argue that's due to ROCm, rather than in spite of it)

