
Not able to run native tests of neanderthal sucessfull #127

Closed
behrica opened this issue Jun 22, 2022 · 60 comments

behrica commented Jun 22, 2022

While we discussed
uncomplicate/deep-diamond#15
the issue that Neanderthal no longer finds libmkl_rt.so (even when globally installed) came up as a separate problem.

I prepared a Dockerfile which exposes the issue; maybe it is useful.

# failing with
# Execution error (UnsatisfiedLinkError) at java.lang.ClassLoader$NativeLibrary/load0 (ClassLoader.java:-2).
#/tmp/libneanderthal-mkl-0.33.07653633467081296505.so: libmkl_rt.so: cannot open shared object file: No such file or directory

FROM clojure:lein-2.9.8-focal
RUN apt-get update && apt-get -y install git wget python3
RUN wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18721/l_onemkl_p_2022.1.0.223.sh
RUN sh ./l_onemkl_p_2022.1.0.223.sh -a --silent  --eula accept

RUN git clone https://github.com/uncomplicate/neanderthal.git

WORKDIR /tmp/neanderthal
RUN git checkout e01511ff47605f2e4031d58899b303e4435d58e3
RUN lein test uncomplicate.neanderthal.mkl-test
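The UnsatisfiedLinkError in the comment above appears when the dynamic linker cannot resolve libmkl_rt.so at JNI load time. A hedged diagnostic sketch (the oneAPI path in the hint is an assumption based on the installer used in this Dockerfile):

```shell
# Check whether the runtime linker can see libmkl_rt; if not, print the
# usual fix of extending LD_LIBRARY_PATH (the path shown is an assumption
# for the oneAPI installer used above).
if ldconfig -p 2>/dev/null | grep libmkl_rt; then
  echo "libmkl_rt is visible to the linker"
else
  echo "libmkl_rt not found; try e.g.:"
  echo '  export LD_LIBRARY_PATH=/opt/intel/oneapi/mkl/latest/lib/intel64:$LD_LIBRARY_PATH'
fi
```

Running this inside the failing container narrows down whether the problem is library placement or something else.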

behrica commented Jun 22, 2022

Even adding the bytedeco dependency does not make it work:

lein update-in :dependencies conj "[org.bytedeco/mkl-platform-redist \"2022.0-1.5.7\"]" -- test uncomplicate.neanderthal.mkl-test


behrica commented Jun 22, 2022

So it seems to me that Neanderthal's native backend works with neither the latest global MKL nor the latest [org.bytedeco/mkl-platform-redist] version.


behrica commented Jun 22, 2022

Using an older bytedeco version (as documented here: https://neanderthal.uncomplicate.org/articles/getting_started.html), but keeping MKL globally installed, fails with another error:

# failing with
# actual result:
# clojure.lang.ExceptionInfo: LAPACK error. {:bad-argument 5, :error-code -5}
#  uncomplicate.neanderthal.internal.host.mkl.FloatSYEngine.copy(mkl.clj:2065)

FROM clojure:lein-2.9.8-focal
RUN apt-get update && apt-get -y install git wget python3
RUN wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18721/l_onemkl_p_2022.1.0.223.sh
RUN sh ./l_onemkl_p_2022.1.0.223.sh -a --silent  --eula accept

RUN git clone https://github.com/uncomplicate/neanderthal.git

WORKDIR /tmp/neanderthal
RUN git checkout e01511ff47605f2e4031d58899b303e4435d58e3

CMD  ["lein", "update-in" ,":dependencies", "conj" ,"[org.bytedeco/mkl-platform-redist \"2020.3-1.5.4\"]" ,"--", "test" ,"uncomplicate.neanderthal.mkl-test" ]


actual result did not agree with the checking function.
Actual result:
clojure.lang.ExceptionInfo: LAPACK error. {:bad-argument 5, :error-code -5}
  uncomplicate.neanderthal.internal.host.mkl.FloatSYEngine.copy(mkl.clj:2065)
  uncomplicate.neanderthal.internal.host.buffer_block.RealUploMatrix.host(buffer_block.clj:1243)


behrica commented Jun 22, 2022

Same without MKL installed:

# failing with
# actual result:
# clojure.lang.ExceptionInfo: LAPACK error. {:bad-argument 5, :error-code -5}
#  uncomplicate.neanderthal.internal.host.mkl.FloatSYEngine.copy(mkl.clj:2065)

FROM clojure:lein-2.9.8-focal
RUN apt-get update && apt-get -y install git wget python3

RUN git clone https://github.com/uncomplicate/neanderthal.git

WORKDIR /tmp/neanderthal
RUN git checkout e01511ff47605f2e4031d58899b303e4435d58e3

CMD  ["lein", "update-in" ,":dependencies", "conj" ,"[org.bytedeco/mkl-platform-redist \"2020.3-1.5.4\"]" ,"--", "test" ,"uncomplicate.neanderthal.mkl-test" ]

behrica changed the title from "neanderthal is not able to find libmkl_rt.so with latest MKL" to "neanderthal is not able to find libmkl_rt.so" on Jun 22, 2022
blueberry (Member) commented:

What I don't really get is the whole .1 and .2 suffix to libmkl_rt.so in newer versions. I understand that it's a versioning thing, but the official documentation in these newer versions (https://www.intel.com/content/dam/develop/external/us/en/documents/onemkl-developerguide-linux.pdf) explicitly states `libmkl_rt` as the build dependency, exactly what I was always using to build neanderthal-mkl...

I guess that I'll have to see how to re-build Neanderthal against the latest MKL, and distribute that version as the "official" one. This will probably require users to upgrade their MKL to a recent one, too.
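One way to see which soname a shared object actually requests at load time (and what it resolves to) is `ldd`. Demonstrated here on /bin/ls since it is always present; the same command can be pointed at the extracted /tmp/libneanderthal-mkl-*.so from the error above to check whether it asks for a plain libmkl_rt.so or a versioned libmkl_rt.so.2:

```shell
# Print the shared-library dependencies and where each one resolves.
# Against the Neanderthal native lib, grep the output for "mkl".
ldd /bin/ls
```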


behrica commented Jun 22, 2022

I have seen that the ".1" and ".2" variants of libmkl_rt.so symlink to each other in the latest version.
So I got past that point and now it finds the library.
But now I am getting other errors; each configuration I try fails with a different error.


behrica commented Jun 22, 2022

Doing this finds libmkl_rt.so, but it fails on something else; see the comment in the Dockerfile.

# failing with
#Actual result did not agree with the checking function.
#Actual result:
#clojure.lang.ExceptionInfo: LAPACK error. {:bad-argument 5, :error-code -5}
#  uncomplicate.neanderthal.internal.host.mkl.FloatSYEngine.copy(mkl.clj:2065)


FROM clojure:lein-2.9.8-focal
RUN apt-get update && apt-get -y install git wget python3
#RUN wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18721/l_onemkl_p_2022.1.0.223.sh
RUN wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18483/l_onemkl_p_2022.0.2.136.sh
RUN sh ./l_onemkl_p_2022.0.2.136.sh -a --silent  --eula accept

RUN git clone https://github.com/uncomplicate/neanderthal.git

WORKDIR /tmp/neanderthal
RUN git checkout e01511ff47605f2e4031d58899b303e4435d58e3

ENV LD_LIBRARY_PATH="/opt/intel/oneapi/mkl/2022.0.2/lib/intel64"
CMD [ "lein", "update-in", ":dependencies" ,"conj", "[org.bytedeco/mkl-platform-redist \"2022.0-1.5.7\"]", "--", "test", "uncomplicate.neanderthal.mkl-test" ]


behrica commented Jun 22, 2022

If I do not use bytedeco, I get another error.
Here I found a slightly older version of MKL which matches the bytedeco version number:

# failing with
#java: symbol lookup error: /opt/intel/oneapi/mkl/2022.0.2/lib/intel64/libmkl_intel_thread.so.2: undefined symbol: omp_get_num_procs
#Tests failed.


FROM clojure:lein-2.9.8-focal
RUN apt-get update && apt-get -y install git wget python3
#RUN wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18721/l_onemkl_p_2022.1.0.223.sh
RUN wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18483/l_onemkl_p_2022.0.2.136.sh
RUN sh ./l_onemkl_p_2022.0.2.136.sh -a --silent  --eula accept

RUN git clone https://github.com/uncomplicate/neanderthal.git

WORKDIR /tmp/neanderthal
RUN git checkout e01511ff47605f2e4031d58899b303e4435d58e3

ENV LD_LIBRARY_PATH="/opt/intel/oneapi/mkl/2022.0.2/lib/intel64"
CMD [ "lein", "test", "uncomplicate.neanderthal.mkl-test" ]


behrica commented Jun 22, 2022

This version of MKL (l_onemkl_p_2022.1.0.223.sh) contains, in the directory:

/opt/intel/oneapi/mkl/2022.1.0/lib/intel64

lrwxrwxrwx 1 root root        14 Mar 29 15:07 libmkl_rt.so -> libmkl_rt.so.2
-rwxr-xr-x 1 root root  11300224 Mar 11 08:07 libmkl_rt.so.2
-rw-r--r-- 1 root root  12244638 Mar 11 08:07 libmkl_scalapack_ilp64.a


behrica commented Jun 22, 2022

And indeed, by setting LD_LIBRARY_PATH to "/opt/intel/oneapi/mkl/2022.1.0/lib/intel64", it seems to find it.
But that gives another error:
java: symbol lookup error: /opt/intel/oneapi/mkl/2022.1.0/lib/intel64/libmkl_intel_thread.so.2: undefined symbol: omp_get_num_procs
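A hedged sketch of the usual workarounds for this symbol-lookup failure: `omp_get_num_procs` comes from the Intel OpenMP runtime (libiomp5.so), so the error typically means that runtime is not loaded alongside libmkl_intel_thread. The libiomp5 path below is an assumption; adjust it to the actual oneAPI install.

```shell
# Workaround 1: preload the Intel OpenMP runtime if it exists
# (path is an assumption for a oneAPI layout).
IOMP5=/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/libiomp5.so
{ [ -f "$IOMP5" ] && export LD_PRELOAD="$IOMP5"; } || true

# Workaround 2: avoid OpenMP entirely by forcing MKL's sequential
# threading layer via its documented environment variable.
export MKL_THREADING_LAYER=SEQUENTIAL
```

The sequential layer trades multi-threaded BLAS performance for not needing any OpenMP runtime at all, which can be a reasonable way to isolate the problem.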


behrica commented Jun 22, 2022

I give up at this point. Following the instructions, I cannot find any combination of MKL, "org.bytedeco/mkl-platform-redist" and configuration that makes the native Neanderthal tests pass.
So I am wondering what setup the people here are using.

Probably some "old" setup which cannot be re-created today (as those library versions are gone).

behrica changed the title from "neanderthal is not able to find libmkl_rt.so" to "Not able to run native tests of neanderthal sucessfull" on Jun 22, 2022

blueberry commented Jun 22, 2022

It seems to me that your MKL distribution is missing libiomp5.so, or that you ended up with a wrong iomp5 installation when you set it up manually. Please see the mention of that lib in https://neanderthal.uncomplicate.org/articles/getting_started.html.

This usually ends up in the right place automatically, but it might be broken in some installations, as people create countless variations.


behrica commented Jun 22, 2022

I am just looking at that. It is true that my Docker image has fewer things than a "normal OS".


behrica commented Jun 22, 2022

I have now switched to installing MKL as a Debian package; that does not make it better:

Actual result did not agree with the checking function.
Actual result:
clojure.lang.ExceptionInfo: LAPACK error. {:bad-argument 5, :error-code -5}
  uncomplicate.neanderthal.internal.host.mkl.FloatSYEngine.copy(mkl.clj:2065)
  uncomplicate.neanderthal.internal.host.buffer_block.RealUploMatrix.host(buffer

FROM clojure:lein-2.9.8-focal
RUN apt-get update && apt-get -y install git wget python3 intel-mkl
RUN git clone https://github.com/uncomplicate/neanderthal.git

WORKDIR /tmp/neanderthal
RUN git checkout e01511ff47605f2e4031d58899b303e4435d58e3

#ENV LD_LIBRARY_PATH="/opt/intel/oneapi/mkl/2022.0.2/lib/intel64"
CMD [ "lein", "update-in", ":dependencies" ,"conj", "[org.bytedeco/mkl-platform-redist \"2022.0-1.5.7\"]", "--", "test", "uncomplicate.neanderthal.mkl-test" ]


behrica commented Jun 22, 2022

Ok, I finally found a working solution, in the form of a Dockerfile:

FROM clojure:lein-2.9.8-focal
RUN apt-get update
RUN DEBIAN_FRONTEND=noninteractive apt-get -y install git wget python3 intel-mkl

RUN git clone https://github.com/uncomplicate/neanderthal.git

WORKDIR /tmp/neanderthal
RUN git checkout e01511ff47605f2e4031d58899b303e4435d58e3

CMD [ "lein", "test", "uncomplicate.neanderthal.mkl-test" ]

Very simple, even.
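For completeness, the Dockerfile above can be exercised like this (the image tag is arbitrary; this obviously needs a running Docker daemon):

```shell
# Build the image from the Dockerfile in the current directory,
# then run the MKL test suite defined by its CMD.
docker build -t neanderthal-mkl-test .
docker run --rm neanderthal-mkl-test
```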


behrica commented Jun 22, 2022

But in my view it is not true, as the instructions here suggest, that no MKL installation is needed when adding the bytedeco dependency:

Add a MKL distribution jar [org.bytedeco/mkl-platform-redist "2020.3-1.5.4"] as your project’s dependency.

Neanderthal will use the native CPU MKL binaries from that jar automatically, so you don’t need to do anything else.


This does fail:

FROM clojure:lein-2.9.8-focal
RUN apt-get update
RUN DEBIAN_FRONTEND=noninteractive apt-get -y install git

RUN git clone https://github.com/uncomplicate/neanderthal.git

WORKDIR /tmp/neanderthal
RUN git checkout e01511ff47605f2e4031d58899b303e4435d58e3

CMD [ "lein", "update-in", ":dependencies" ,"conj", "[org.bytedeco/mkl-platform-redist \"2020.3-1.5.4\"]", "--", "test", "uncomplicate.neanderthal.mkl-test" ]

with

Actual result:
clojure.lang.ExceptionInfo: LAPACK error. {:bad-argument 5, :error-code -5}
  uncomplicate.neanderthal.internal.host.mkl.FloatSYEngine.copy(mkl.clj:2065)
  uncomplicate.neanderthal.internal.host.buffer_block.RealUploMatrix.host(buffer_block.clj:124

So I would say it requires "a lot of luck" for adding [org.bytedeco/mkl-platform-redist "2020.3-1.5.4"] and "doing nothing else" to actually work.


behrica commented Jun 22, 2022

It seems to me that installing "intel-mkl" via apt does more than just putting the required .so files somewhere (which is all the bytedeco jar can do).
I saw a lot of things happening during the apt installation of "intel-mkl" about replacing LAPACK-related libraries with alternatives; the interactive installation asks 3 or 4 questions about this.

blueberry (Member) commented:

I always recommend installing Intel MKL globally, as this is what I use. Everything else is something that people ask me to support, and I am trying to satisfy these demands as much as I can. Any help in that regard is always welcome, but the ground moves from time to time.


blueberry commented Jun 22, 2022

It seems to me that installing "intel-mkl" via apt does more than just putting the required .so files somewhere (which is all the bytedeco jar can do). I saw a lot of things happening during the apt installation of "intel-mkl" about replacing LAPACK-related libraries with alternatives; the interactive installation asks 3 or 4 questions about this.

The stuff that you see is needed only for building native dependencies, which is what I need. For using neanderthal, only the visibility of the appropriate .so files should be enough (I've tested this multiple times on multiple OSes, but who knows ;)


behrica commented Jun 22, 2022

One way to address this is to try to maintain a single Docker image for the Clojure Data Science community.
I do this in some form here:
https://github.com/behrica/clj-py-r-template/blob/master/docker-base/Dockerfile

It is set up to allow the R and Python bindings for Clojure to work out of the box.

I know that the Clojure community is not a very big fan of Docker-based development, but maybe it is worth extending the above Docker image to explicitly support Neanderthal, and therefore Deep Diamond, out of the box.

What do you think?

I could give it a go and try to setup all needed stuff for deep-diamond in there as well.

blueberry (Member) commented:

Of course it would be good to have it as an option. I don't use docker, but some people certainly prefer it, so I don't see how sharing 3rd party setups could hurt. It would be best if you could set it up as a github repo, and link it here.

jsa-aerial commented:

@behrica Hi Carsten, I have the needed MKL libs for Linux, Mac, and Win that I created for installation for Saite. They are all in compressed archives. These have always worked for me across various machines, and OS versions (only Intel Mac - no new Arm stuff) and Win10 for Windows. For Linux and (Intel) Mac, aerosaite, the self installing uberjar variant, comes with scripts for running it that setup the paths for the MKL. This too, has always worked for various users. Aerosaite automatically downloads and installs the MKL libs to a local directory relative to the .saite home directory. BUT, you could manually grab these if you wish and install them in some similar location that makes sense for you.

Linux
Mac
Win

I am unsure about how to automatically set the path for Win (someone recently gave me an idea of what it should be, so maybe in the next release the Win scripts will have that as well).

I'm not sure if your setup is 'special' in some way that would keep this from working, but it may be worth a try. As I say, this has always worked. The scripts are in the resources folder at the aerosaite github (link above).


blueberry commented Jun 23, 2022

Hi @jsa-aerial that is really helpful. Maybe we can make this or some more focused standalone version of this an official recommendation for people that for some reason or another can't make the official vendor binaries work on their system?


blueberry commented Jun 23, 2022

Just a quick note: Neanderthal's MKL dependency does not need any installation other than the lib files being in any location where the appropriate OS looks for shared libraries. Even copy/paste works.
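A minimal sketch of what "the lib files being in any location where the appropriate OS looks for shared libraries" means on Linux. The directory name here is an arbitrary assumption, and the copy step is left commented because the source location of the MKL .so files varies by install method:

```shell
# Create a private library directory and put it on the loader search
# path; a program started from this shell will then find any .so files
# placed there.
mkdir -p "$HOME/opt/mkl-libs"
# cp /path/to/mkl/libmkl_*.so* "$HOME/opt/mkl-libs/"   # source path varies
export LD_LIBRARY_PATH="$HOME/opt/mkl-libs:${LD_LIBRARY_PATH:-}"
```

An `/etc/ld.so.conf.d/` entry plus `ldconfig` achieves the same thing system-wide, which is roughly what the distro packages do.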


behrica commented Jun 23, 2022

@jsa-aerial I do agree that we should have more instructions/variants for getting MKL installed (and Deep Diamond working).
I "nearly" managed to extend the polyglot Dockerfile to have everything working:
with "all" I mean Deep Diamond running in a Docker container supporting native (MKL), CUDA GPU and OpenCL GPU.

"Working" I measure by having all the Deep Diamond tests pass.

Even for non-Docker users, reading the Dockerfile can be useful; see the last sections here:
https://github.com/behrica/clj-py-r-template/blob/master/docker-base/Dockerfile

It is nearly working...
I dived very deep into the issues, and tried a lot of different things.

It would be very helpful if somebody with more knowledge of CUDA / OpenCL / Linux would have a look.

The Dockerfile can be built as usual with docker build . and run on (hopefully) any machine with a GPU via
docker run --gpus all -w /tmp/deep-diamond <image-id> lein test

Currently I get this error, and I am not sure what to try next.

Execution error (UnsatisfiedLinkError) at java.lang.ClassLoader$NativeLibrary/load0 (ClassLoader.java:-2).
/root/.javacpp/cache/opencl-3.0-1.5.7-linux-x86_64.jar/org/bytedeco/opencl/linux-x86_64/libjniOpenCL.so: /usr/lib/x86_64-linux-gnu/libOpenCL.so: version `OPENCL_2.2' not found (required by /root/.javacpp/cache/opencl-3.0-1.5.7-linux-x86_64.jar/org/bytedeco/opencl/linux-x86_64/libjniOpenCL.so)

Could somebody help out with this ?
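A hedged way to check what the container's libOpenCL.so actually is. The `OPENCL_2.2' version error above typically means the libOpenCL.so found first lacks the versioned symbols the JNI stub was linked against (for example a vendor stub rather than the ocl-icd loader); the package name below is an Ubuntu assumption:

```shell
# Record which OpenCL loader libraries exist; an empty result means
# no loader is installed at the conventional Ubuntu path.
LOADERS=$(ls /usr/lib/x86_64-linux-gnu/libOpenCL.so* 2>/dev/null || true)
if [ -n "$LOADERS" ]; then
  echo "OpenCL loader(s) present:"; echo "$LOADERS"
else
  echo "no libOpenCL.so found; on Ubuntu the versioned ICD loader comes from:"
  echo "  apt-get install -y ocl-icd-libopencl1"
fi
```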


behrica commented Jun 23, 2022

As you can see in the Dockerfile, I settled on CUDA 11.4; with 11.6 I had even weirder issues and did not get "this far".

That won't work, as ClojureCUDA is tied to a specific CUDA version that should be installed on your machine in addition to the Nvidia drivers. This is currently 11.6.1.

Additionally, Deep Diamond requires Nvidia's cuDNN too. On Arch Linux, both are available as packages (cuda and cudnn) through pacman. On other systems they are fairly widely available, and Nvidia offers click-through installers on their main website.

Please see the details at the ClojureCUDA web page.


behrica commented Jun 23, 2022

Just a quick note: Neanderthal's MKL dependency does not need any installation other than the lib files being in any location where the appropriate OS looks for shared libraries. Even copy/paste works.

The question is which precise shared libraries it needs. It seems to me that, depending on how MKL is installed, different libraries get installed in the appropriate places. And I had issues with wrong GLIBC versions and so on.


blueberry commented Jun 23, 2022

This means your setup should be OK. Your only implementation is Nvidia's, which supports OpenCL 1.2. OpenCL 3 is basically 1.2 repackaged, and OpenCL 2 has the most features but is left as a vestige, as Nvidia and Apple sabotaged it. Complicated, I know...


behrica commented Jun 23, 2022

Thanks, that helps.
So what can I do about:

Execution error (UnsatisfiedLinkError) at java.lang.ClassLoader$NativeLibrary/load0 (ClassLoader.java:-2).
/root/.javacpp/cache/opencl-3.0-1.5.7-linux-x86_64.jar/org/bytedeco/opencl/linux-x86_64/libjniOpenCL.so: /usr/lib/x86_64-linux-gnu/libOpenCL.so: version `OPENCL_2.2' not found (required by /root/.javacpp/cache/opencl-3.0-1.5.7-linux-x86_64.jar/org/bytedeco/opencl/linux-x86_64/libjniOpenCL.so)

It happens during lein test, I suppose during the tests which use OpenCL.
So "something" is still wrong in my setup, I guess.

Do I assume correctly that "opencl-3.0-1.5.7" requires OpenCL 3.0 (or at least more than 2.1)?
The dependency on it comes from deep-diamond.

blueberry (Member) commented:

It means that javacpp has some problems finding OpenCL on your system. However, note that Neanderthal/ClojureCL does not use javacpp for that, but another, unrelated library. The javacpp dependency on OpenCL is probably coincidental, as I don't use it directly; javacpp's dnnl library tries to load it on its own (there is an old, solved issue at the javacpp GitHub that might give more info). If the Neanderthal OpenCL tests pass, everything should be OK with your system's OpenCL. Why does javacpp have problems? My hunch is that your Docker setup misses something, but I can't be sure since I don't use Docker.


behrica commented Jun 23, 2022

My Docker image is an Ubuntu 20.04 image, so it is Ubuntu in most regards.

I would like to promote usage of Neanderthal, but its installation (or, better said, that of its dependencies MKL and CUDA / OpenCL) is a gigantic hurdle.
I consider myself a very experienced Java / Clojure / Linux / Docker user, but I have no idea what I am doing here to get it to work.

I am also thinking that Docker is the only way out, but that view is unfortunately not shared by lots of people.

I think that maintaining and publishing a Dockerfile and image with a working Deep Diamond, where the user only needs to type "docker run --gpus all xxxx", is important here.

I thought I could do this on my own, but I think that is not the case. I know too little about CUDA, OpenCL and extending Java with native code to bring this forward myself.

The installation instructions are too general to allow me to keep working on the Dockerfile efficiently.


behrica commented Jun 23, 2022

I propose to take a step back: I will work on a minimal, Ubuntu-based Dockerfile whose only goal is to set up MKL, CUDA and OpenCL so that the Neanderthal test suite passes inside it.

Maybe I could contribute that Dockerfile to the Neanderthal GitHub. I am not sure I can get it working by myself, but maybe we could collaborate on it in some form.

At least by "reviewing" it and trying to see if I do something that cannot work. What do you think?
Are you interested in supporting this, even though you are not a Docker user yourself?

blueberry (Member) commented:

Yes, sure.

Fortunately, Neanderthal is a Java library, so it does not care whether it runs in Docker or anywhere else. As for GitHub, that's why I think the best home for the Docker setup is a separate GitHub repository.

I understand that it looks overwhelming, but I believe it is mostly because you're trying to fit together 10 moving parts of which you don't have experience with half of them. In reality, it is MUCH simpler:

For Neanderthal MKL to work, you ONLY need the MKL .so files somewhere on your LD_LIBRARY_PATH. That's it. If other software using MKL works (PyTorch or whatever), Neanderthal should work.

For the Neanderthal CUDA backend, you ONLY need properly installed CUDA from Nvidia. If other CUDA-based software works, Neanderthal should too (assuming you're not using some 3rd-party package system such as Anaconda that sets up its own local CUDA, etc.).

For OpenCL it's similar...

Basically, there should not be any specific requirement by Neanderthal et al. other than having vanilla installations of these technologies as prescribed by their vendors, or simpler.

I would definitely recommend either following the setup recommended in Getting Started until you understand these moving parts, or at least following @jsa-aerial's Saite setup, which seems to help in this regard.

jsa-aerial commented:

I would advocate a Docker solution, as "out-of-the-box" and the quickest route to "try deep-diamond". (at least for the Linux users with Docker ...)

Frankly, if you want something that "just works automatically out of the box", aerosaite is the quickest and easiest route. Certainly for Linux users this is pretty much guaranteed to work. For CPU.

I think you are being naive about putting something together for automatic GPU use. There you are up against all the issues of getting the GPU usable, completely aside from Neanderthal/Deep Diamond. There are just way too many variations, requirements and dependencies.

blueberry (Member) commented:

... and, of course, for GPU computing to work, you'd have to have recent vendor drivers installed properly. That, usually, is not automatic anyway.

jsa-aerial commented:

Hi @jsa-aerial that is really helpful. Maybe we can make this or some more focused standalone version of this an official recommendation for people that for some reason or another can't make the official vendor binaries work on their system?

That sounds like a reasonable/good idea. Suggestions on how to proceed?


blueberry commented Jun 23, 2022

I'm not familiar with how Saite works, so I don't know precisely, but is there a way to provide the basic MKL and/or CUDA distribution without the other parts of Saite, and even without Neanderthal?

Anyway, it might be a good option for people who can't or don't want to follow my official guides to have the scripts you provide as an alternative, and if it works sufficiently predictably, we can link to your repository as an option from the getting started guide.

The only drawback I see is that it would make users read the guide even less, and it would appear more complicated.
I'm specifically referring to this:

I would like to promote usage of 'neanderthal', but the installation of it (or better said it's dependencies MKL and CUDA / >OpenCL) are a gigantic hurdle.
I consider myself a very experienced Java / Clojure / Linux / Docker user but I have no idea what I am doing here to get it work.

Perhaps if I had written "the user has to copy these 7 .so files to folder X, add this folder to LD_LIBRARY_PATH, and restart the shell", it would have been simpler. Instead, I opted to write a more versatile guide with all the popular options, and impatient users get lost in the sea of choices...


behrica commented Jun 23, 2022

I would advocate a Docker solution, as "out-of-the-box" and the quickest route to "try deep-diamond". (at least for the Linux users with Docker ...)

Frankly, if you want just works automatically "out of the box", aerosaite is the quickest and easiest route. Certainly for Linux users this is pretty much guaranteed to work. For CPU.

I think you are being naive about putting something together for automatic GPU use. There you are up against all the issues about getting the GPU usable completely aside from Neanderthal/DeepDiamond. There are just way too many variations, requirements and dependencies.

This could be. But is this even true when using Docker? Have you tried it? Or does Docker at least help?
I am not convinced that it is impossible to make at least one single Docker image which just works most of the time.
But yes, it would be similar to aerosaite.

Or at least the Dockerfile could be "parametrized" (so not one fixed file for every situation, but a template), so that it is at least a base which a user can then modify, which is hopefully easier than installing from scratch.

jsa-aerial commented:

I'm not familiar with how saite works, so I don't know precisely, but is there a way to provide the basic MKL and/or CUDA distribution without other parts of saite and even without Neanderthal?

For MKL, the links I quoted above satisfy this - they are just (g)zipped archives of the necessary sharable libs for each platform. That's it. So, no Saite and no Neanderthal and no DeepDiamond.

For the reasons I mentioned above, I decided not to support GPU, because it depends on way more than just the base platform to get the GPU itself working for computation. Basically, in that case you are on your own for getting and installing the correct drivers and any other requirements.

Anyway, it might be a good option for people who can't or don't want to follow my official guides to have the scripts you provide as an alternative, and if it works sufficiently predictably, we can link to your repository as an option from the getting started guide.

That sounds fine - the scripts for Linux and (Intel) Mac have worked fine for several users - out of the box. If you are not using Saite, you would just need to grab the bits for running your stuff. These things are very small as there is in fact, very little that needs to be done.

The only drawback I see is that it would make users read these guide even less, and it would appear more complicated. I'm specifically referring to this:

Yes, that would be a drawback - any black box route will keep people from understanding what is really going on.

Perhaps if I have written: the user has to copy these 7 .so files at folder X, and must add this folder to LD_LIBRARY_PATH, and must restart shell, it would have been simpler. Instead, I opted to write a more versatile guide with all popular options, and users being impatient get lost in the sea of choices...

Maybe you can have a "TL;DR" section where you state this and then refer others to the details?


behrica commented Jun 23, 2022

I would advocate a Docker solution, as "out-of-the-box" and the quickest route to "try deep-diamond". (at least for the Linux users with Docker ...)

Frankly, if you want just works automatically "out of the box", aerosaite is the quickest and easiest route. Certainly for Linux users this is pretty much guaranteed to work. For CPU.

I think you are being naive about putting something together for automatic GPU use. There you are up against all the issues about getting the GPU usable completely aside from Neanderthal/DeepDiamond. There are just way too many variations, requirements and dependencies.

I have seen this in Python land. "pip install tensorflow-gpu" was working for me out-of-the-box.


jsa-aerial commented Jun 23, 2022

This could be. But is this even true when using Docker ?

Of course it is true using Docker - Docker is not some magic thing that somehow automatically knows what type of GPU (vendor, model, version) you have, how many, what the drivers are, and whether they are properly installed.

You'd have to have a Docker image for all the combinations.

Myself, I don't much like Docker, but I understand those who do...


behrica commented Jun 23, 2022

This could be. But is this even true when using Docker ?

Of course it is true using Docker - Docker is not some magic thing that somehow automatically knows what type (vendor, model, version) GPU, how many, and what the drivers are and if they are properly installed.

Agreed, but I would hope that over time each vendor will produce "one driver" which works for all their GPUs.

Then we could have a parameterized Dockerfile which just takes the vendor. And from inside Docker I can even read what GPU I have, and make decisions accordingly about what to install.

So I still think that only a few people maintaining a Dockerfile would need to know all the nifty details, while the majority of users could just use the Dockerfile or image.

Similar to the JVM abstraction.

blueberry (Member) commented:

"pip install tensorflow-gpu"

As far as I know, pip install tensorflow-gpu does not install CUDA; it expects CUDA to be available on your system. Exactly like Neanderthal. But the difference is that Neanderthal will throw an exception if you call the absent CUDA backend, while TensorFlow might automatically fall back to the default engine, whatever it is.

OTOH, conda does (AFAIK) install CUDA, but an internal one. I could do that too, if you committed to a (hypothetical) proprietary environment of mine, such as conda is. You still have to make sure that the right GPU drivers are present.


behrica commented Jun 23, 2022

@blueberry One more confusing point in the instructions is the required (or workable) CUDA version.

From my experience it does "for example", not work to use CUDA 11.4 and "[org.jcuda/jcuda "11.6.1"]" (which we get by default).

I just had this case and got an uggly.

/tmp/libJCudaDriver-11.6.1-linux-x86_64.so: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /tmp/libJCudaDr
iver-11.6.1-linux-x86_64.so)    

Explicite downgrading to "[org.jcuda/jcuda "11.4.1"]" solved it.

So it seems that "versions of native libraries" and "Clojure/Java dependencies" need to match more precisely then the instructions suggest. (at least from my understanding)

Again, the only more user-friendly way I can see to help users with this is Docker, which can be set up so that it "freezes" both the native libraries and deps.edn in a known state (at least for documentation purposes).

@blueberry
Copy link
Member

Each version of Neanderthal's CUDA backend is tied to the CUDA version specified in its dependency on JCuda. So, for the latest version, it is 11.6. If the docs say 11.4, that's because I forgot to update them.

@behrica
Copy link
Author

behrica commented Jun 23, 2022

Thanks, the ClojureCUDA docs currently say this, which seems to mean "any CUDA 11.x":

Minimum requirements
Java 8
CUDA Toolkit 11.0 (prefer 11.4)
Linux or Windows. macOS doesn’t allow CUDA from version 11 and up. You can only use an old release of ClojureCUDA on macOS.

I hope these comments help; if not, let me know...

@blueberry
Copy link
Member

I updated the docs of ClojureCUDA to clarify this.

You can use any CUDA version with ClojureCUDA. However, if the CUDA version on your system does not match the one that ClojureCUDA depends on in its project.clj, you have to specify an explicit dependency on the matching JCuda version in YOUR project.clj.

It's similar for Neanderthal, but a very outdated CUDA might not support all the features that I use, and may break at will.

Ditto for DD, but I don't expect old CUDA versions to work successfully.
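Pinning JCuda explicitly, as described above, might look roughly like this in a user's own project.clj. This is a sketch with assumed version numbers: the JCuda version 11.4.1 is taken from the example earlier in this thread, while the Neanderthal version and the project name are placeholders to be replaced with whatever your setup actually uses.

```clojure
;; Sketch of a project.clj that pins JCuda to the locally installed
;; CUDA version (11.4 here, matching the example above). Listing the
;; dependency explicitly in YOUR project overrides the default JCuda
;; version that Neanderthal/ClojureCUDA would otherwise pull in.
(defproject my-app "0.1.0-SNAPSHOT"            ; hypothetical project
  :dependencies [[org.clojure/clojure "1.11.1"]
                 [uncomplicate/neanderthal "0.45.0"] ; example version
                 [org.jcuda/jcuda "11.4.1"]])  ; must match system CUDA
```

The key point is that the `org.jcuda/jcuda` entry in the user's project takes precedence over the transitive default, so it must track the CUDA toolkit installed on the machine.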

@jsa-aerial
Copy link

jsa-aerial commented Jun 23, 2022

Agree, but I would hope that over time all vendors will produce "one driver", which work for all their GPUs.

I would say there is zero chance of this happening. Not a small chance, but no chance. There are too many legitimate reasons for them to not do this.

@blueberry
Copy link
Member

Thanks, the ClojureCUDA docs currently say this, which seems to mean "any CUDA 11.x":

Minimum requirements
Java 8
CUDA Toolkit 11.0 (prefer 11.4)
Linux or Windows. macOS doesn’t allow CUDA from version 11 and up. You can only use an old release of ClojureCUDA on macOS.

I hope these comments help; if not, let me know...

And it DOES (give or take a detail or two), but you have to state that explicit version, and the version in your project.clj has to match the version installed on your machine. If you specify 11.4 in project.clj while you install whatever CUDA is shipped with Arch (11.7 currently, I believe), it will not work.

Which brings us to one detail: if you do this today, the default JCuda version that Neanderthal/ClojureCUDA uses is 11.6. You have to have that CUDA version on your OS. CUDA 11.7 is not supported yet (although it might now be what Arch installs by default).
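A quick way to see which CUDA toolkit a system actually provides is to parse the banner of `nvcc --version`. A minimal sketch; the sample banner line is illustrative, and the function name `cuda_release` is made up here:

```shell
#!/bin/sh
# Extract the "release X.Y" part from nvcc's version banner.
cuda_release() {
  sed -n 's/.*release \([0-9.][0-9.]*\),.*/\1/p'
}

# On a real system you would run:  nvcc --version | cuda_release
# Here we feed a sample banner line so the parsing itself is visible:
echo "Cuda compilation tools, release 11.6, V11.6.124" | cuda_release
# -> 11.6  (compare this against the JCuda version in project.clj)
```

Comparing this output against the JCuda version your project resolves to catches the mismatch before the UnsatisfiedLinkError does.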

@blueberry
Copy link
Member

As far as I can see, your system has CUDA 11.6.1, which is exactly what is expected, so you should not change any default. 11.4.1 generally shouldn't work on your machine (or if it does, it's more luck than anything else).

@behrica
Copy link
Author

behrica commented Jun 23, 2022

Yeah, one reason for me to insist on Docker is "multiple computers".
I use GPUs on Azure cloud VMs, and I would like to avoid configuring each of them individually.
They are kind of "temporary resources".
But now I understand the precise-matching requirement better; thanks for the clarification.

@blueberry
Copy link
Member

/tmp/libJCudaDriver-11.6.1-linux-x86_64.so: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /tmp/libJCudaDriver-11.6.1-linux-x86_64.so)


Explicitly downgrading to "[org.jcuda/jcuda "11.4.1"]" solved it.

This is a known gotcha in JCuda. Your system has an old GLIBC. Not your Arch Linux, which is up-to-date, but your Docker-provided system, which is, if I remember correctly, an Ubuntu one, version 20-or-so. That one ships with a somewhat older GLIBC. The trouble with GLIBC is that its version is so fundamentally hard-coded into your environment that it's very difficult to use another one; you have to use the one provided by your system. And your system provides an old one, which breaks JCuda, which was compiled with a recent one.

Your Arch Linux should work, am I correct?

Native dependencies are tricky ;)
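To diagnose this kind of mismatch, one can compare the GLIBC symbol versions a binary requires with what the system's libc provides. A minimal sketch, assuming `objdump` and `ldd` are available; the `.so` path is the one from the error message above, and the `required` helper is a made-up name:

```shell
#!/bin/sh
# Highest GLIBC symbol version required by a shared library, e.g.
#   required /tmp/libJCudaDriver-11.6.1-linux-x86_64.so
required() {
  objdump -T "$1" 2>/dev/null | grep -o 'GLIBC_[0-9.]*' | sort -Vu | tail -n1
}

# The parsing itself, demonstrated on a sample objdump -T line:
echo "0000 DF *UND* GLIBC_2.34 memcpy" | grep -o 'GLIBC_[0-9.]*'
# -> GLIBC_2.34

# What the system's libc provides (the first line names the glibc version):
ldd --version | head -n1
```

If the library's highest required `GLIBC_x.y` exceeds what `ldd --version` reports, you will see exactly the `version GLIBC_x.y not found` error from this thread.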

@behrica
Copy link
Author

behrica commented Jun 23, 2022

Yes, it is exactly that. Took me a while to figure it out.

@blueberry
Copy link
Member

Fortunately, you can help solve it. It would require you to build JCuda on your (older) system, and those binaries would then work on newer systems too!

Please check out this issue:
jcuda/jcuda-main#51

@behrica
Copy link
Author

behrica commented Jun 23, 2022

I created PR #128
with a minimal example Docker setup, so as far as I'm concerned we can close this issue.
