
The Java Tensorflow library does not seem to be using GPU #140

Open
tmichniewski opened this issue Nov 3, 2020 · 61 comments

@tmichniewski

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): YES
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: NO
  • TensorFlow installed from (source or binary): from https://oss.sonatype.org/
  • TensorFlow version (use command below): 2.3.1
  • Python version: 3.7.7
  • Bazel version (if compiling from source): NO
  • GCC/Compiler version (if compiling from source): NO
  • CUDA/cuDNN version: 10.1
  • GPU model and memory: Tesla K80, compute capability 3.7 (but we also tested this on a Tesla V100 with compute capability 7.0)

You can collect some of this information using our environment capture script
You can also obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Here is the result of the capture script:
tf_env.txt

Describe the current behavior
We tested the new TensorFlow Java API (not the legacy one), the brand new version released in October 2020. We tested it on several machines, including Azure Databricks NC6_v3 and Azure Virtual Machines (the capture log is from the virtual machine). I noticed that when no GPU is available the library falls back to CPU, and this is fine. However, we also measured the time for some example processing (a few vector operations), and we see no significant difference between processing time on GPU and on CPU. It looks as if it is not using the GPU, even when one is present (we tried two graphics cards: a Tesla K80 with compute capability 3.7 and a Tesla V100 with compute capability 7.0). In both cases we do not see any difference in processing time.

Describe the expected behavior
The expected behaviour is that execution times are much better when the program is executed on a machine with a GPU present.

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

We used the following java program:

HelloTensorFlow_java.txt
pom_xml.txt

The source was compiled to a class file and run via the following command:
java -classpath protobuf-java-3.8.0.jar:ndarray-0.2.0.jar:javacpp-1.5.4.jar:javacpp-1.5.4-linux-x86_64.jar:tensorflow-core-api-0.2.0.jar:tensorflow-core-api-0.2.0-linux-x86_64-gpu.jar:tensorflow-core-platform-gpu-0.2.0.jar:. HelloTensorFlow

The listed libraries were downloaded from https://oss.sonatype.org/.

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

The enclosed program issues the following log:
log.txt

From the log you can see that the GPU was present and recognized.
However, the execution time did not differ between runs with and without the GPU.

@karllessard
Collaborator

karllessard commented Nov 3, 2020

Hi @tmichniewski ,

Can you assign explicitly an operation to a GPU device and see if that works?

Now, there is an ongoing effort to make device selection easier in TF Java; right now you unfortunately need to use the lower-level API for building an operation.

For example, instead of simply calling tf.math.add(x, y), you need to do something like:

DeviceSpec gpuSpec = DeviceSpec.newBuilder().deviceType(DeviceType.GPU).deviceIndex(0).build();
...
graph.opBuilder(Add.OP_NAME, "Add1")
    .setDevice(gpuSpec.toString())
    .addInput(x.asOutput())
    .addInput(y.asOutput())
    .build();

@karllessard
Collaborator

You can also log device placement by passing a ConfigProto message explicitly when instantiating your Session, using this constructor

@tmichniewski
Author

Hello @karllessard,
Referring to the first recommendation: are you sure that we can call deviceType on DeviceSpec.newBuilder()? My IntelliJ complains about this.

@tmichniewski
Author

Basically I am looking for the simplest solution, just for testing the performance.

@tmichniewski
Author

BTW - I am using version 0.2.0.

@karllessard
Collaborator

@tmichniewski , to be honest I haven't tested the code snippet I posted, but something along those lines should work. You can also simply pass the device spec directly as a string for now, as described here. I'll be curious to see what device placement logging tells you as well, if you get a chance to activate it.
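For reference, here is a minimal sketch of what pinning a single op to the GPU with a plain device string could look like with the 0.2.0 low-level graph builder. This is untested: the op type string "Add" mirrors Add.OP_NAME from the snippet above, and the class name and constant values are made up for illustration.

import org.tensorflow.Graph;
import org.tensorflow.Operation;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Constant;
import org.tensorflow.types.TFloat64;

public class GpuDeviceStringExample {
  public static void main(String[] args) {
    try (Graph graph = new Graph()) {
      Ops tf = Ops.create(graph);
      Constant<TFloat64> x = tf.constant(3.14);
      Constant<TFloat64> y = tf.constant(2.72);
      // Build the op through the low-level builder so the device can be set explicitly
      Operation add = graph.opBuilder("Add", "Add1") // "Add" is the op type, same as Add.OP_NAME
          .setDevice("/GPU:0")                       // plain device string instead of a DeviceSpec
          .addInput(x.asOutput())
          .addInput(y.asOutput())
          .build();
      System.out.println("Built " + add.name() + " pinned to /GPU:0");
    }
  }
}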

@tmichniewski
Author

I already used the hardcoded string "/GPU:0".

@tmichniewski
Author

Hello @karllessard,
Well, I ran tests on an Azure Databricks NC6s_v3 machine with a Tesla V100 graphics card (compute capability 7.0).
With the device set to "/CPU:0" the execution time is 0.23s.
With the device set to "/GPU:0" the execution time is 0.62s.
So not only does it not reduce the execution time, it actually increases it.
The current code is as follows:
HelloTensorFlow_java.txt

@tmichniewski
Author

BTW - on this cluster in Python we got the following times:
CPU 1.00s.
GPU 0.25s.
PythonHelloTensorFlow_py.txt
So in Python the GPU is 4 times faster in this exercise, but in Java the GPU version is slower than the CPU version.

@tmichniewski
Author

Maybe I should also set the device on the session or the graph?

@tmichniewski
Author

But then it would make no sense to set it at the operation level.
Do you have any working example of how to perform, let's say, vector addition on GPU?

@Craigacp
Collaborator

Craigacp commented Nov 3, 2020

Why not try running the Java version in eager mode, like the Python version?

@tmichniewski
Author

Firstly, because eager mode is only for development, not production.
Secondly - well - for tests of course I might use it, but do you have some working example with a suggestion of how to pass the device on which TF is executed? Originally I used it, but Karl (in his first comment) suggested using the low-level API.

@Craigacp
Collaborator

Craigacp commented Nov 3, 2020

But the Python example is in eager mode, so let's rule out differences in the TF runtime engine first, before moving on to issues inside the Java API. I'm sure there are speed issues in the Java API, but it's best to start with equivalent metrics. Either run the Java one in eager mode, or the Python one as a tf.function.

@tmichniewski
Author

tmichniewski commented Nov 3, 2020

@Craigacp Hello Adam,

I am not comparing Python times to Java times. I am trying to compare GPU time to CPU time. The Python code was only to show that the GPU version was roughly 4x faster than the CPU one. Or in other words - I am trying to run the TF Java API on GPU. So far without success.

Right - in Python it was in eager mode. The same in the original Java version, I guess. But this issue is about how to make the Java API execute on GPU. So far I was pointed to the low-level API, but this does not work. That is why I ask for:

  • either some hint of how to correctly pass the device to the API to make it work on GPU (in eager mode or graph mode),
  • or a link to some working example (the API is quite fresh and there are not that many examples/tutorials on the net).

@tmichniewski
Author

Moreover, Karl also pointed me in his third comment to the following description, which clearly states that if there is a choice, then GPU takes precedence:

"If a TensorFlow operation has both CPU and GPU implementations, by default the GPU devices will be given priority when the operation is assigned to a device. For example, tf.matmul has both CPU and GPU kernels. On a system with devices CPU:0 and GPU:0, the GPU:0 device will be selected to run tf.matmul unless you explicitly request running it on another device."

@karllessard
Collaborator

I cannot think of any reason why running graph ops on a GPU in Java would be slower than in Python, since this is all handled by the C++ core library shared by both languages. So if that is really what we are observing, then it has to be a misconfiguration somewhere.

Another idea: can you check, @tmichniewski, if the performance goes back to normal when you remove the line inside the loop where you fetch the result from the tensor? Maybe you are fetching data from GPU memory into the JVM, and that is doing extra copies that are not required when using Python.

@tmichniewski
Author

I also do not believe that the Java version could be slower than the Python one. But this is not the issue here. The problem is that it seems the GPU is not used at all.

The previous version of this library, 2.3.0, the so-called legacy one (since last weekend), was saying that the kernel image was incorrect and there was a suggestion to build TF from sources. But here the library silently falls back to CPU, and you never know whether the GPU is used except by measuring time.

Referring to the second paragraph: well, I thought that it is always necessary to get the data back from the tensor. BTW, we read only one double value, because the last operation in the graph is reduce_sum, so we take 8 bytes instead of a big vector. Therefore it cannot be the reason for the slow performance.

PS.
Do you test it on GPU? I guess so, therefore I think you should have some observations and know-how about how to make it work on GPU.

@tmichniewski
Author

tmichniewski commented Nov 3, 2020

Maybe there is some issue with the graphics card compute capability. In our case it is 7.0 and was also 3.7. Does this library work with cards with such parameters? If not, then this would explain why it silently falls back to CPU.

Or maybe the issue is that the choice of GPU device is not specified correctly. So I ask for some example of how to do this. For example, how to perform vector addition on GPU. A very simple example. So far I provided two versions, eager and graph, but neither is using the GPU.

@karllessard
Collaborator

Sorry @tmichniewski but I'm not personally using GPUs nor do I have access to one right now, so I might not be the best person to provide you with a concrete working example here; maybe @zaleslaw or @saudet can chime in?

Again, I think logging device placement might reveal some explanations as well, see the Session constructor link I've provided earlier.

@saudet
Contributor

saudet commented Nov 4, 2020

I haven't been testing it really closely, but I know at least @roywei and @lanking520 have been using the GPU builds without too much trouble.

@tmichniewski
Author

Hello @karllessard, @saudet, @roywei, @zaleslaw and @lanking520,

I can also confirm that it is possible to use the GPU builds without trouble. But the key question here is whether the GPU builds really use the GPU when it is present on a machine. My tests seem to be telling me that NO, they are not using the GPU. At least so far I don't know how to make them use the GPU.

I provided two sample Java programs:
The first has a simple vector computation, where I expected that providing the GPU versions of the libraries would make TF use the GPU, but this was not the case.
The second example follows Karl's recommendation, where I additionally specified the device in the graph operations. Also without success.

Now, since this library silently falls back to CPU when the GPU is not present or the library is for some reason not willing to use it, how do you perform the unit tests?

One possibility is by comparing the execution times, but this is really delicate. What are the other possibilities? Because I think all the tests may pass after the library falls back to CPU, while all of them should fail when a GPU is present on the machine and requested.

In my opinion there should be a possibility in the API to request computation on GPU, and if this cannot be satisfied, then the unit tests and the whole API should raise an exception saying that it cannot execute on the requested GPU for some reason.

Concluding, I repeat my original issue - how do I convince the TensorFlow Java API to use the GPU?

Please point me to any working example or provide any working hint. Could you?
PS. By working I understand not only working, but also using the GPU if present.

@lanking520

@tmichniewski hi, we were able to use the GPU to run inference. For the GPU use case, you need to make sure your operators and model are loaded on the GPU device. You can also try DJL, which wraps the TF Java package and provides some convenience for automatically switching to GPU at runtime.

You can play around with our Jupyter notebook in Java: http://docs.djl.ai/jupyter/tensorflow/pneumonia_detection.html

If you have a GPU device and CUDA 10.1 installed, it will use the GPU out of the box without a single line of code change.

@tmichniewski
Author

Hello @lanking520,

Referring to:

"we were able to use the GPU to run inference. For the GPU use case, you need to make sure your operators and model are loaded on the GPU device."
which seems to contradict your next statement:
"If you have a GPU device and CUDA 10.1 installed, it will use the GPU out of the box without a single line of code change."

So far I have added setDevice("/GPU:0") to every graph operation, as in the code below:
HelloTensorFlow_java_2.txt
just to be 100% sure where the computation is being executed.

But the results I get on my test computation on vectors of 100 million entries (so big enough), executed 1000 times, do not show any time difference between execution on GPU and on CPU. In fact the results are almost the same (6.0s vs. 6.1s).

On the contrary, when I executed the same test in Python in eager mode I got 4.5s on GPU and 259s on CPU, so the difference is huge and visible. Therefore I wonder why, in the case of the Java API, I do not see the same difference. Could you explain it?
python_test_2.txt

BTW - I am not comparing Python to Java or eager to graph execution. I am comparing GPU to CPU execution, where in Python I see a huge difference while in Java I see no difference.

PS.
I do not want to run these DJL samples, which bring a lot of new elements and new changing parameters into the picture, so in general this would only complicate my tests instead of simplifying them. Moreover, if DJL wraps the TF Java package, then it changes nothing as long as I do not see that the TF Java package itself works well on GPU.

Maybe the TF Java API is well optimized for deep learning computations but not for very simple vector operations.

@tmichniewski
Author

In general I think that there are three possibilities:

  1. The first possibility is that I might be doing something wrong in my test program; however, I think you have had a chance to look at it. Anyway, if this is the case, then I simply ask for some support or just a reference to some simple example program, but with vector operations, not with sophisticated neural network computations.
  2. The second option is that the TF Java API might not be using the GPU. In this case I would ask for some hints on how to force it to use the GPU.
  3. Finally, there might also be some bug in the API which sees the GPU and tries to use it but silently falls back to CPU. In this case you will know best how to handle it.

So my question is: which situation are we dealing with? And how do we diagnose it?

@tmichniewski
Author

Hello @karllessard,
Referring to DeviceSpec.newBuilder().deviceType(...) - I cannot use this construction because the deviceType method is called on the Builder object, which is package-private and cannot be used outside of the org.tensorflow package. This is why I could not use it.
PS. This is just a comment, because the code snippet you provided was perfectly valid, but only for you. :-)

@zaleslaw
Contributor

zaleslaw commented Nov 4, 2020 via email

@tmichniewski
Author

Hello Alex @zaleslaw,
1.15.0 - the TensorFlow 1 API, whose most recent commit in the Maven repository was in October 2019 (we skipped this from the very beginning as obsolete),
2.3.0 - the so-called legacy API since last weekend (in Python and Java API+Scala, but this version did not work in Java/Scala on our machines with GPU compute capability 3.7 and 7.0 - it kept saying that the image is incorrect and we should build TF from sources),
0.2.0 - the newest API, released in October 2020 (Java API).
I experimented on Azure Databricks clusters, but also on Azure Virtual Machines instantiated on request.

So far I have not had a single result with the Java API where I could see - OK, computing on GPU is X times faster. Not a single result.

We managed to see such results only in Python, never with the Java API. That is why I started wondering what is wrong, because basically we should be able to observe that GPU processing is sometimes faster, just as we see in Python.

At the moment we focus only on this new API, as last week we spent a lot of time on the legacy version 2.3.0. We even started to analyse how to build it from sources.

Well - maybe there is also the same issue as in 2.3.0 - I mean the Java API recognizes that the graphics card compute capability is too low and silently falls back to CPU.

In my opinion I managed to force the API to use the specified GPU. But apparently it is not being used. Maybe this fall-back-to-CPU logic should be guarded somehow, and if the requirement is to process on GPU and the GPU cannot be used (it is not present or has the wrong compute capability), then maybe an error should be thrown, not a warning or nothing at all. You know, the developer/user requests to process on GPU; if this is not possible, then "blue screen". Otherwise we end up in situations like this, where we discuss what might be going on.

@lanking520

lanking520 commented Nov 4, 2020

On our test platform, we benchmarked a ResNet50 image classification model; there is a small gap between Python GPU and Java GPU (you can find my issue filed here). However, if you compare CPU (AWS EC2 C5.2xlarge), it is around 40ms, while GPU (AWS EC2 p3.2xlarge) is around 5.7ms. The GPU is in fact much faster than the CPU.

We use the TF 2.3.1 build with 0.2.0.

But it seems TF2 is doing something weird that causes everything to be slow. We also did some benchmarking of the CPU performance and saw that TF1 CPU latency is 2/3 of TF2's on the ResNet50 model with the Java API. But the same problem shows up in Python too... so nothing to blame. :)

In fact, based on your comments:

"GPU or CPU. In fact the results are almost the same (6.0s. vs. 6.1s.)."

"when I executed the same test on Python in eager mode I got 4.5s. on GPU and 259s. on CPU"

It seems you ran everything on GPU with TF Java :).

@tmichniewski
Author

tmichniewski commented Nov 4, 2020

Referring to your last statement: well, I set up another cluster without a GPU, so CPU only with a similar core class, and got 7.1s.

Then I ran this on a laptop with CPU only and also had similar results, around 6.6s. So these results in Java on CPU-only machines are much closer to Python on GPU.

But of course these are just comments on the results. Let us focus on facts. To get the facts I need some method to check where my processing is executed. Is it possible to turn on some logging, where the API would tell with 100% certainty on which device it is executing?

You know what, I believe that in your test you got better results on GPU than CPU. But you have a different machine than I do. Could you confirm that in my case it is impossible that the TF Java API falls back to CPU for some reason, for example because of compute capability? The former Java API version 2.3.0 had problems with such compute capabilities. But the new version 0.2.0 might just fall back to CPU instead of issuing an error. This is how I interpret my test results, because the GPU and CPU results are almost the same.

Now, the question is how to diagnose it. So far it is far from intuitive.

@Craigacp
Collaborator

Craigacp commented Nov 4, 2020

TF Java uses the same placement engine and fallback logic as Python, because it all happens in the C layer. I'm trying to replicate your issue at the moment, but it triggered a full TF GPU recompile and it's taking a little while on the test VM I've got.

@tmichniewski
Author

Hello Adam @Craigacp,
Have you got any results?

@karllessard
Collaborator

karllessard commented Nov 5, 2020

Hi @tmichniewski , sorry to keep asking, but can you please enable log device placement in your session and share the output logs with us? I think it is more reliable than comparing speeds for knowing what has been placed on a CPU or GPU.

For example, I took your original code and changed try (Graph g = new Graph(); Session s = new Session(g)) {... to this:

ConfigProto config = ConfigProto.newBuilder(ConfigProto.getDefaultInstance()).setLogDevicePlacement(true).build();
try (Graph g = new Graph(); Session s = new Session(g, config)) { ...

and now it gives me the following output:

2020-11-05 09:08:44.176590: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-11-05 09:08:44.176822: I external/org_tensorflow/tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Hello TensorFlow 2.3.1
2020-11-05 09:08:44.406991: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-05 09:08:44.412857: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-05 09:08:44.418527: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-11-05 09:08:44.418553: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (xps): /proc/driver/nvidia/version does not exist
2020-11-05 09:08:44.419212: I external/org_tensorflow/tensorflow/core/common_runtime/direct_session.cc:360] Device mapping: no known devices.
Mul: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539650: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Mul: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
Mul_1: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539674: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Mul_1: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
Div: (Div): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539685: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Div: (Div): /job:localhost/replica:0/task:0/device:CPU:0
Add: (Add): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539694: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Add: (Add): /job:localhost/replica:0/task:0/device:CPU:0
ReduceSum: (Sum): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539704: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] ReduceSum: (Sum): /job:localhost/replica:0/task:0/device:CPU:0
Const: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539715: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Const: (Const): /job:localhost/replica:0/task:0/device:CPU:0
Const_1: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539724: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Const_1: (Const): /job:localhost/replica:0/task:0/device:CPU:0
Const_2: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539734: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Const_2: (Const): /job:localhost/replica:0/task:0/device:CPU:0
Const_3: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539744: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Const_3: (Const): /job:localhost/replica:0/task:0/device:CPU:0
Const_4: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539753: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Const_4: (Const): /job:localhost/replica:0/task:0/device:CPU:0
9900320.000000034
Time [s.] 0.16198432
Bye bye

Note that I'm running this on a CPU-only machine.

@tmichniewski
Author

Hello Karl @karllessard,
Sure, I will test it tomorrow.
PS. Thanks for this snippet, this is the kind of support I was hoping for.
Tomek

@tmichniewski
Author

tmichniewski commented Nov 6, 2020

Hello Karl @karllessard,

I made two tests, both with vectors of 100 million elements, and the computation executed 1000 times. Here are the results:
GPU.txt
CPU.txt

Now - does this prove that in the first case the computation was executed on GPU?
If so, then looking at the results (GPU 6.05s, CPU 5.74s) I wonder why the CPU is even faster than the GPU, or at least why the GPU is not faster.
What is wrong with this test?
Maybe the computation is too simple? Maybe the vectors are too small? What makes the GPU version take as long as the CPU one?

OK, I know that TF was originally designed for bigger workloads, for example ML models. But ML computation consists of exactly such numerical computations, only at a much bigger scale. So is the scale the problem here? Maybe using the GPU generates additional overhead to pass the computation to the GPU? Maybe this consumes the advantages of using the GPU?

What is wrong with the TF Java API used as in this example?

BTW - we also noticed that TF is a few times faster than the popular NumPy in Python. But in Python we also observed that GPU computation was 4-5 times faster than on CPU. So why, in the case of the Java API, are the times roughly the same (the time on GPU is more or less the same as on CPU)?

@tmichniewski
Author

Well, in fact in Python this test (the given expression executed 1000 times in eager mode on vectors of 1 million elements) took:
0.23s. on GPU
1.13s. on CPU
But when we increased the vector size to 100 million (as in the Java example), we got:
4.50s. on GPU
259.4s. on CPU
Isn't that proof that the GPU version is much faster?
So why, on the same machine, do we get such strange results with the Java API?

Firstly, even if there is a bug which reports using GPU or CPU but under the covers always uses the GPU (this is also possible), then why is the Java API slower than Python (6s vs. 4.5s)?

Moreover, when I ran these tests with the Java API 10000 times I got:
8.0s. on GPU
6.5s. on CPU

How should I interpret these results?

@tmichniewski
Author

Hello Adam @Craigacp,
Did you manage to finish your test on this sample program?

@saudet
Contributor

saudet commented Nov 6, 2020

Firstly, even if there is a bug which reports using GPU or CPU but under the covers always uses the GPU (this is also possible), then why is the Java API slower than Python (6s vs. 4.5s)?

Make sure XLA is enabled, see issue #134 (comment)

@tmichniewski
Author

How do I use XLA in the Java API?
In Python this is simple: @tf.function(experimental_compile=True), according to:
https://www.tensorflow.org/xla/tutorials/compile?hl=en
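For reference, a rough sketch of one way XLA JIT might be requested at the session level in TF Java, by setting the global JIT level in the ConfigProto passed to the Session. This is an untested assumption: it relies on the generated proto classes in tensorflow-core-api 0.2.0 exposing the usual protobuf builder setters, and it only has an effect if the native library was actually built with XLA support, which is what issue #134 is about.

import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.proto.framework.ConfigProto;
import org.tensorflow.proto.framework.GraphOptions;
import org.tensorflow.proto.framework.OptimizerOptions;

ConfigProto config = ConfigProto.newBuilder()
    .setGraphOptions(GraphOptions.newBuilder()
        .setOptimizerOptions(OptimizerOptions.newBuilder()
            // ON_1 asks the runtime to auto-cluster ops for XLA JIT compilation
            .setGlobalJitLevel(OptimizerOptions.GlobalJitLevel.ON_1)))
    .build();

try (Graph g = new Graph(); Session s = new Session(g, config)) {
  // build and run the graph as in the earlier examples
}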

@karllessard
Collaborator

karllessard commented Nov 6, 2020

Hi @tmichniewski , we can drill down into the problem a little deeper to continue our investigation, thanks for helping. What I would suggest now is to compare the performances when you stop fetching the outputs, to give us a better idea of the impact of memory transfers between CPU and GPU. Maybe Python has better optimizations for doing this, we'll see.

Can you please change your loop to something like this and try again on both GPU and CPU?

 for (int i = 0; i < 1000; i++) {
        s.runner().addTarget(z).run();
 }

@karllessard
Collaborator

Also, like @Craigacp suggested previously, if your Python comparable is this script, then maybe you want to run in eager mode in Java as well. That would mean your code would look something like this:

import org.tensorflow.*;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Constant;
import org.tensorflow.types.TFloat64;

import java.util.Arrays;

public class HelloTensorFlow {
  public static void main(String[] args) {
    System.out.println("Hello TensorFlow " + TensorFlow.version());

    double[] pi = new double[1_000_000];
    Arrays.fill(pi, 3.14); // Pi
    double[] e = new double[1_000_000];
    Arrays.fill(e, 2.72); // e
    Ops tf = Ops.create();
      
    // Tensor<TFloat64> aTensor = TFloat64.scalarOf(3.456);
    Constant<TFloat64> a = tf.constant(3.456);
    Constant<TFloat64> b = tf.constant(0.5);
    Constant<TFloat64> x = tf.constant(pi);
    Constant<TFloat64> y = tf.constant(e);

    long startTime = System.nanoTime();
    
    for (int i = 0; i < 1000; i++) {
      // x * y * a / x + b where a = 3.456 and b = 0.5
      Operand<TFloat64> z = tf.reduceSum(tf.math.add(tf.math.div(tf.math.mul(tf.math.mul(x, y), a), x), b), tf.constant(0));

      // Uncommenting this line will force additional I/O between GPU and CPU, testing both cases could be interesting
      // System.out.println(z.data().getDouble());
    }
    long stopTime = System.nanoTime();
    System.out.println("Time [s.] " + (double)(stopTime - startTime) / 1000_000_000);
  }
}

@Craigacp
Collaborator

Craigacp commented Nov 6, 2020

Running @karllessard's snippet in eager mode is definitely using the GPU as it takes up all the GPU memory and errors out (because there hasn't been anything to free the tensors all those operations create).

The Python version you posted uses fp32 and the Java one uses fp64, so that's not a fair comparison (CPUs don't slow down as much as GPUs when doing fp64 math).

Comparing the CPU and GPU in Java in graph mode I get roughly the same time for both of them, as with a simple matmul of two 10000x10000 matrices. The device placement fails if you give it a device that doesn't exist, and that's our view into the TF C library, so it might be doing something below the Java API level. The fact that the placement fails suggests it is building the nodes in the right place. Using CUDA_VISIBLE_DEVICES= on the CPU run doesn't make that slower, so it's not ignoring the CPU placement and putting things on the GPU anyway.

Turning off the fetching of the result out of the loop makes both CPU and GPU faster by roughly the same amount.

My recommendation would be to try running a real workload like ResNet inference with and without CUDA_VISIBLE_DEVICES= (note the space after the equals), as that turns the GPU off and on. I don't have time to do such an experiment at the moment, but I might get some time next week.
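For example, a hypothetical pair of runs using the same classpath as the java command near the top of this issue (abbreviated here as ...) would be:

CUDA_VISIBLE_DEVICES= java -classpath ... HelloTensorFlow   # empty value hides the GPU, so TF runs on CPU
java -classpath ... HelloTensorFlow                         # GPU left visible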

@tmichniewski
Author

tmichniewski commented Nov 7, 2020

Hello Karl @karllessard,
Referring to the first test with "s.runner().addTarget(z).run();". I made it twice (in both cases I increased the size of the vectors to 100_000_000):

  1. First test, where I referenced the GraphOperation sum - the operations with direct setDevice. I got the following results:
    FirstTestJava.txt
    FirstTestGPU.txt
    FirstTestCPU.txt
    As you can see, the GPU version (6.23s) is even slightly slower than the CPU one (6.06s). Moreover, when I ran this test on my laptop I got 7.02s (CPU only).
  2. Second test, where I referenced the z Operand from eager mode. Then I got the following results:
    SecondTestJava.txt
    SecondTestGPU.txt
    SecondTestCPU.txt
    In this test the GPU time was 6.26s, the CPU time 6.26s, and the laptop time 6.80s.
    Generally the GPU is more or less the same as the CPU.

@tmichniewski
Author

tmichniewski commented Nov 7, 2020

Hello Karl @karllessard,
Referring to your next, longer test, where "Operand z" is instantiated inside the for loop - well, in this case on both my laptop and the Databricks machine with GPU I got an OutOfMemory error, even though the vectors were of size 1_000_000, while in my previous tests I used 100_000_000-element vectors, so 100x bigger.
OOM.txt
I suspect it might be due to the fact that "z" is instantiated 1000 times inside the loop.

Referring to:

// Uncommenting this line will force additional I/O between GPU and CPU, testing both cases could be interesting
// System.out.println(z.data().getDouble());

I think this test would measure the time of copying one double value (8 bytes) plus the cost of println. When copying data between GPU and CPU memory, 8 bytes is nothing serious. That is why the test expression has a final reduce, to avoid copying vector data between processors.

@tmichniewski
Author

Hello Adam @Craigacp,
I don't want to do bigger tests, because bigger and more complicated tests mean more changing parameters which influence the results. I want to verify whether it makes sense to use the TensorFlow Java API to perform vector operations. What I expect is to see some improvement when running on GPU (as in the Python case), but in the case of the Java API my expectations have so far not been confirmed.
All these tests seem to prove only that my computation is indeed executed on the GPU when one is present, but the execution time is more or less the same as on CPU. And I still think that this is strange.

@Craigacp
Collaborator

Craigacp commented Nov 10, 2020

Larger tests will allow us to rule out misconfiguration of our compiled version of libtensorflow vs inefficiencies in the way the Java API passes in results to the graphs or the way the graphs/eager functions are called. Running a few batches of inference of a reasonably sized model should amortize all the TF Java calling overheads, and hopefully show a speedup, which would indicate that it's a problem with how we call the libtensorflow C library (or how we pass in arguments). If it doesn't show a speedup, then the way we build libtensorflow might be causing some problems.

At the moment it's hard to say what the root cause of the inefficiency is, so more data will help us narrow it down.

@karllessard
Collaborator

karllessard commented Nov 12, 2020

Ok, I did another test, results confirm what we observed previously, where running on a GPU takes as long as on the CPU (even longer in this case, in fact). I've run the previous graph but within a loop of 5M iterations. See the following code and the results below:

package mypackage;

import org.tensorflow.*;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Constant;
import org.tensorflow.op.core.ReduceSum;
import org.tensorflow.proto.framework.ConfigProto;
import org.tensorflow.proto.framework.GPUOptions;
import org.tensorflow.types.TFloat64;

import java.util.Arrays;
import org.tensorflow.types.TInt32;

public class GPUTest {
  public static void main(String[] args) {
    System.out.println("Hello TensorFlow " + TensorFlow.version());

    double[] pi = new double[1_000_000];
    Arrays.fill(pi, 3.14); // Pi
    double[] e = new double[1_000_000];
    Arrays.fill(e, 2.72); // e

    ConfigProto config = ConfigProto.newBuilder(ConfigProto.getDefaultInstance())
        .setLogDevicePlacement(true)
        .build();

    try (Graph g = new Graph(); Session s = new Session(g, config)) {
      Ops tf = Ops.create(g);

      Output<?> loopInput = tf.placeholder(TInt32.DTYPE).output();

      Output<?> loopOutputs[] = g.whileLoop(
           new Output<?>[] { loopInput },
          (cgraph, inputs, outputs) -> {
            Ops ctf = Ops.create(cgraph);
            outputs[0] = ctf.math.less((Output<TInt32>)inputs[0], ctf.constant(1_000_000)).z();
          },
          (bgraph, inputs, outputs) -> {
            Ops btf = Ops.create(bgraph);

            Constant<TFloat64> a = btf.constant(3.456);
            Constant<TFloat64> b = btf.constant(0.5);
            Constant<TFloat64> x = btf.constant(pi);
            Constant<TFloat64> y = btf.constant(e);

            ReduceSum<TFloat64> r = btf.reduceSum(
                btf.math.add(
                    btf.math.div(
                        btf.math.mul(
                            btf.math.mul(x, y), a),
                        x),
                    b),
                btf.constant(0)
            );
            outputs[0] = btf.withControlDependencies(Arrays.asList(r)).math.add((Operand)inputs[0], btf.constant(1)).asOutput();
          },
          "test_gpu_loop"
      );
      long startTime = System.nanoTime();
      try (Tensor<TInt32> c = TInt32.scalarOf(0);
        Tensor<?> tensor = s.runner()
            .feed(loopInput, c)
            .fetch(loopOutputs[0])
            .run()
            .get(0)) {}
      long stopTime = System.nanoTime();
      System.out.println("Time [s.] " + (double)(stopTime - startTime) / 1000_000_000);
    }

    System.out.println("Bye bye");
  }
}
CPU: 11.65383011 secs
GPU: 18.522389901 secs

Now, we can confirm that most of that time is spent entirely in the native TF runtime library, with no interaction with the JVM. Device placement logs also showed me that the ops were placed on the expected device in both cases. So it might be a misconfiguration of the native library build that prevents the GPU kernels from operating at full speed. @saudet already mentioned that XLA support is not enabled by default, but there must be something else missing.

@tmichniewski
Author

tmichniewski commented Nov 13, 2020

Hello Karl @karllessard,

Is it possible that even though the operations were placed on the GPU, they are not executed there due to, for example, a wrong (too low) compute capability of the GPU?

My understanding of this whole situation is that the Java program seems to be doing its best, and we confirmed that it should use the GPU, but at the moment the computation times show that something weird is happening: apparently the computation is done on CPU, or some other problem extends the computation time on GPU. This calls into question the point of using the TF Java API until the problem is fixed.

@zaleslaw
Contributor

I'll share some results from my side:
I ran the following example on my local machine with an NVIDIA GeForce GTX 1650 card.
I made minor changes:

  • 10 epochs instead of 1 in the example
  • batch_size = 100 instead of 1000 (for the best GPU memory utilization)

Also I set up the session with the following setting:

session = new Session(graph, ConfigProto.newBuilder().setGpuOptions(GPUOptions.newBuilder().setAllowGrowth(true).build()).build());

For 1 epoch:

GPU: 13367 +- 140 ms
CPU: 56156 +- 420 ms
Speed up ~ 4.2

For 10 epochs:

GPU: 106852 +- 1000 ms
CPU: 560519 +- 6900 ms
Speed up ~ 5.24

For 20 epochs:

GPU: 214016 +- 1800 ms
CPU: 1127356 +- 12500 ms
Speed up ~ 5.26

P.S. The speedup for this GPU on 1.15 was closer to 7.2 times on LeNet-like models

@karllessard
Collaborator

Thanks a lot for the investigation @zaleslaw , that is very appreciated, and these results are closer to what we can expect from a GPU with TF Java. So I guess the reason we were not observing this delta previously was just the example we've been using? My concern though is why in Python the same example shows a performance improvement while it does not in Java; there is still something that we are missing here.

@tmichniewski
Author

Hello Alexey @zaleslaw,
Could you perform an additional test on your machine, but this time with our example, the one Karl mentioned? We have a plain vector computation instead of sophisticated machine learning algorithms.

@zaleslaw
Contributor

@tmichniewski will try to reproduce and share results

@zaleslaw
Contributor

zaleslaw commented Nov 26, 2020

@tmichniewski @karllessard There are a lot of different code snippets; I used the one from here

My results:

GPU:

Time [s.] 2.402464902
Bye bye
test_gpu_loop/Enter: (Enter): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/Merge: (Merge): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/cond/Less: (Less): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/LoopCond: (LoopCond): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/Switch: (Switch): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Mul_1: (Mul): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Div: (Div): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Add: (Add): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/ReduceSum: (Sum): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Add_1: (Add): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/NextIteration: (NextIteration): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/Exit: (Exit): /job:localhost/replica:0/task:0/device:GPU:0
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/cond/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const_2: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const_3: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const_4: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const_5: (Const): /job:localhost/replica:0/task:0/device:GPU:0

Process finished with exit code 0

CPU:

Time [s.] 1.63166006
Bye bye
test_gpu_loop/Enter: (Enter): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/Merge: (Merge): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/cond/Less: (Less): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/LoopCond: (LoopCond): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/Switch: (Switch): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Mul: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Mul_1: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Div: (Div): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Add: (Add): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/ReduceSum: (Sum): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Add_1: (Add): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/NextIteration: (NextIteration): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/Exit: (Exit): /job:localhost/replica:0/task:0/device:CPU:0
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/cond/Const: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const_1: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const_2: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const_3: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const_4: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const_5: (Const): /job:localhost/replica:0/task:0/device:CPU:0

Looks like the GPU is slower than the CPU, but the task is quite small and the overhead is large.

I could test something else on my NVIDIA card, if you need.

@tmichniewski
Author

Hello @zaleslaw
If the task seems to be too simple, it can be scaled up to vectors of 100 million elements. In Python we observed that the GPU version was much faster than the CPU one; this was like seconds versus minutes. Why has no one observed this with the Java API? What is wrong in the Java API?

@Craigacp
Collaborator

Expanding the vector size doesn't stress the right places. If (as it appears given the information we have) there's something going wrong with the initial setup and copy, then increasing the vector size bumps that as well as the computational workload.

We're not sure what's happening yet; clearly something is wrong, but we don't have a root cause.

@tmichniewski
Author

Is there anyone who would be able to debug it, to check what the cause of the problem is? Why is it so difficult?

@karllessard
Collaborator

Hi @tmichniewski , I think what makes it a bit difficult for us to debug is that:

  1. GPUs are handled entirely by the TensorFlow C++ library and normally Java should have nothing to do with it, so we need to dig deeper into the TensorFlow runtime and Python client to understand what is going wrong with your test application, while our expertise is more on the Java bindings themselves.

  2. Not all of us, including me, have access to a local GPU machine.

  3. This is a community-driven project and all collaborators are working on it of their own free will and in their own time.

So yes, I think we are all interested in understanding the problem better, and meanwhile, if you can please continue your investigations, the project would surely benefit from it.

Thanks,
Karl

@tmichniewski
Author

Hello Karl @karllessard,
Ad.1. Have you looked at the Python TF client to see how it uses the C++ library and what the differences are? In fact the Java TF API should mimic what the Python client is doing, so it is worth analysing even without my issue report.
Ad.2. Sure, if you provide me with any source code (recommended improvements, configuration or logging) I will give it a go. But as I understand it, we tested all the improvements recommended by you and they only confirmed that something is wrong, without giving a precise explanation. Still, we can repeat these tests; I would just need some source code input from you.
Ad.3. Finally, maybe it is worth comparing with version 1.15.0 or 2.3.1 of the Java TF API, now the legacy versions (but only since last month). Maybe these versions were working somewhat differently? At least I got messages that to make it work on my GPU with compute capability 7.0 I would have to build the API from sources, as there was some issue with PTX files or something like that, and only the most recent compute capabilities were supported by default. So maybe this is a clue?

@karllessard
Collaborator

Have you looked at the Python TF client to see how it uses the C++ library and what the differences are

One major difference is that the TensorFlow runtime has a thin ABI layer called the "C API" which should be used by language binding libraries like TF Java to guarantee stability between TF releases, but we all know that Python bypasses this layer in many cases and accesses the internal C++ classes directly. So maybe the C API is creating the overhead we observe, or maybe it is not exposing everything that we need to enable GPU computation properly.

Finally, maybe it is worth comparing with version 1.15.0 or 2.3.1 of the Java TF API

I completely agree, that would be an interesting test to do. The legacy version of TF Java also calls the C API exclusively, but the linkage is done differently (the legacy version has custom JNI code to bind to the runtime library, while the new version uses JNI bindings auto-generated by JavaCPP).

I don't expect the performance to be that different between the two versions, but if it is, it will be much simpler to drill down to the actual cause of the problem.

On another note, the example from @zaleslaw showed that the GPU is effectively supported by the new TF Java, and once the model has been built and run in the C++ runtime, there is no interaction with the JVM that could impact the performance. So for now my suspicions are that:

  1. Most of the time in your example is spent before or after running the graph (in the C API maybe?)
  2. The graph session is not initialized properly (e.g. Python could use an internal tweak to enable more GPU kernels)

Karl

@unik00

unik00 commented Sep 21, 2022

In my case, it was because while libtensorflow.jar was the latest version, I used an outdated version of libtensorflow_jni.so. Even though it still ran without errors, the GPU utilization percentage was very low. Not sure if this is your issue.
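As a quick sanity check for this kind of mismatch, the version reported by the bindings should come from the native library that actually got loaded, so printing it (as the HelloTensorFlow examples above already do) shows which runtime is in use:

System.out.println("Native TensorFlow runtime: " + org.tensorflow.TensorFlow.version());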
