The Java Tensorflow library does not seem to be using GPU #140
Hi @tmichniewski, can you explicitly assign an operation to a GPU device and see if that works? There is an ongoing effort to make device selection easier in TF Java; right now you unfortunately need to use the lower-level API when building an operation. For example, instead of a simple call, you need something like:
DeviceSpec gpuSpec = DeviceSpec.newBuilder().deviceType(DeviceType.GPU).deviceIndex(0).build();
...
graph.opBuilder(Add.OP_NAME, "Add1").setDevice(gpuSpec.toString()).addInput(x.asOutput()).addInput(y.asOutput()).build();
You can also log device placement by passing explicitly a ConfigProto with setLogDevicePlacement(true) when creating the session.
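For illustration, here is a minimal self-contained sketch of the pattern above, using a plain device string (which sidesteps the DeviceSpec.Builder visibility problem raised later in this thread). Class and method names follow the TF Java 0.2.0 snippets used elsewhere in this thread, so treat it as a sketch rather than verified working code.
import org.tensorflow.Graph;
import org.tensorflow.Operand;
import org.tensorflow.Session;
import org.tensorflow.Tensor;
import org.tensorflow.op.Ops;
import org.tensorflow.types.TFloat64;
public class PinAddToGpuSketch {
  public static void main(String[] args) {
    try (Graph graph = new Graph(); Session session = new Session(graph)) {
      Ops tf = Ops.create(graph);
      Operand<TFloat64> x = tf.constant(3.14);
      Operand<TFloat64> y = tf.constant(2.72);
      // Build the Add node through the low-level op builder so it can be pinned to GPU:0.
      graph.opBuilder("Add", "Add1")
          .setDevice("/GPU:0")
          .addInput(x.asOutput())
          .addInput(y.asOutput())
          .build();
      // Fetch the pinned op by name and read back the scalar result.
      try (Tensor<TFloat64> result = session.runner().fetch("Add1").run().get(0).expect(TFloat64.DTYPE)) {
        System.out.println(result.data().getDouble());
      }
    }
  }
}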
Hello @karllessard,
Basically I am looking for the simplest solution, just for testing the performance.
BTW - I am using version 0.2.0.
@tmichniewski , to be honest I haven't tested the code snippet I posted, but something along those lines should work. You can also simply pass the device spec directly as a string for now, as described here. I'll be curious to see what device placement logging tells you as well, if you get a chance to activate it.
I already used the hardcoded string "/GPU:0".
Hello @karllessard,
BTW - on this cluster in Python we got the following times:
Maybe I should also set the device on the session or the graph?
But then it would make no sense to set it at the operation level.
Why not try running the Java version in eager mode, like the Python version?
Firstly, because eager mode is only for development, not production.
But the Python example is in eager mode, so let's rule out the difference in the TF runtime engine first, before moving on to issues inside the Java API. I'm sure there are speed issues in the Java API, but it's best to start with equivalent metrics first. Either run the Java one in eager mode, or the Python one as a tf.function.
@Craigacp Hello Adam, I am not comparing Python times to Java times. I am trying to compare GPU time to CPU time. The Python code was only to show that the GPU version was roughly 4x faster than the CPU one. Or in other words - I am trying to run the TF Java API on the GPU, so far without success. Right - in Python it was in eager mode, and the same in the original Java version I guess. But this issue is about how to make the Java API execute on the GPU. So far I was pointed to the low-level API, but this does not work. That is why I ask for:
Moreover, Karl also pointed me in his third comment to the following description, which clearly states that if there is a choice, then GPU takes precedence: "If a TensorFlow operation has both CPU and GPU implementations, by default the GPU devices will be given priority when the operation is assigned to a device. For example, tf.matmul has both CPU and GPU kernels. On a system with devices CPU:0 and GPU:0, the GPU:0 device will be selected to run tf.matmul unless you explicitly request running it on another device."
I cannot think of any reason why running graph ops on a GPU in Java would be slower than in Python, since this is all handled by the C++ core library shared by both languages. So if that is really what we are observing, then it has to be a misconfiguration somewhere. Another idea: can you check, @tmichniewski, whether the performance goes back to normal if you remove the line inside the loop where you fetch the result from the tensor? Maybe you are fetching data from the GPU memory to the JVM and that is doing extra copies that are not required when using Python?
I also do not believe that the Java version could be slower than the Python one. But this is not the issue here. The problem is that it seems the GPU is not used at all. The previous version of this library, 2.3.0, the so-called legacy (since last weekend), was saying that the kernel image was incorrect and there was a suggestion to build TF from sources. But here the library silently falls back to CPU and you never know whether the GPU is used, except by measuring time. Referring to the second paragraph, well, I thought that it is always necessary to get the data back from the tensor. BTW, we read only one double value, because the last operation in the graph is reduce_sum, so we take 8 bytes instead of a big vector. Therefore it cannot be the reason for the slow performance. PS.
Maybe there is some issue with the graphics card compute capability. In our case it is 7.0, and was also 3.7. Does this library work with cards with such parameters? If not, then this would explain why it silently falls back to CPU. Or maybe the issue is that the choice of GPU device is not defined correctly. So I ask for some example of how to do this. For example, how to perform vector addition on the GPU. A very simple example. So far I provided two versions, eager and graph, but neither is using the GPU.
Sorry @tmichniewski but I'm not personally using GPUs nor do I have access to one right now, so I might not be the best person to provide you a concrete working example here; maybe @zaleslaw or @saudet can chime in? Again, I think logging device placement might reveal some explanations as well, see the
I haven't been testing it really closely, but I know at least @roywei and @lanking520 have been using the GPU builds without too much trouble.
Hello @karllessard, @saudet, @roywei, @zaleslaw and @lanking520, I can also confirm that it is possible to use the GPU builds without trouble. But the key question here is whether the GPU builds really use the GPU if it is present on a machine. My tests seem to indicate that NO, they are not using the GPU. At least so far I don't know how to make them use the GPU. I provided two sample Java programs: Now, since this library silently falls back to CPU if a GPU is not present, or if for some reason the library is not willing to use it, then how do you perform the unit tests? One possibility is by comparing the execution times, but this is really delicate. What are the other possibilities? Because I think all the tests may be successful after the library falls back to CPU, while all of them should fail when a GPU is present on the machine and requested. In my opinion there should be a possibility in the API to request computation on the GPU, and if this is not satisfied, then the unit tests and the whole API should raise an exception saying that it cannot execute on the requested GPU for some reason. Concluding, I repeat my original issue - how do I convince the TensorFlow Java API to use the GPU? Please point me to any working example or provide any working hint. Could you?
@tmichniewski hi, we were able to use the GPU to run inference. For the GPU use case, you need to make sure your operators and model are loaded on the GPU device. You can also try DJL, which wraps the TF Java package and makes it easy to switch to the GPU automatically at runtime. You can play around with our Jupyter notebook in Java: http://docs.djl.ai/jupyter/tensorflow/pneumonia_detection.html If you have a GPU device and CUDA 10.1 installed, it will use the GPU out of the box without a single line of code change.
Hello @lanking520, Referring to: "we were able to use GPU to run inference. For GPU use case, you need to make sure your operator and model are loaded on GPU device." So far I added setDevice("/GPU:0") to every graph operation, like in the code below: But the results I get on my test computation on 100 million entry vectors (so big enough), executed 1000 times, do not show any time difference between GPU and CPU. In fact the results are almost the same (6.0s. vs. 6.1s.). On the contrary, when I executed the same test in Python in eager mode I got 4.5s. on GPU and 259s. on CPU, so the difference is huge and visible. Therefore I wonder why, when using the Java API, I do not see the same difference. Could you explain it? BTW - I am not comparing Python to Java or eager to graph execution. I am comparing GPU to CPU execution, where in Python I see a huge difference while in Java I see no difference. PS. Maybe the TF Java API is well optimized for deep learning computations but not for very simple vector operations.
In general I think that there might be three possibilities:
So my question is: which situation are we dealing with? And how do we diagnose it?
Hello @karllessard,
Mr. Tomasz, I'm also facing a similar issue with the API (the method is missing at the operand level and only available on the builder).
Do you have a chance to experiment with the previous version of the Java API (1.15, for example)? I found a very big difference for LeNet model training with and without a GPU. It could be important to understand what was lost when we switched from 1.x to 2.x.
I agree that DeviceSpec cannot be used directly in the API, but it could be used as a builder to build a correct device string (if required).
Alex
On Wed, 4 Nov 2020 at 17:02, Tomasz Michniewski <notifications@github.com> wrote:
… Hello @karllessard <https://github.com/karllessard>,
Referring to DeviceSpec.newBuilder().deviceType(...) - I cannot use this construction because the deviceType method is called on the Builder object, which is package-private and cannot be used outside of the org.tensorflow package. This is why I could not use it.
PS. This is just a comment, because the code snippet you provided was perfectly valid, but only for you. :-)
Hello Alex @zaleslaw, So far I have not had a single result with the Java API where I could see - OK, computing on the GPU is X times faster. Not a single result. We managed to see such results only in Python, but never with the Java API. That is why I started wondering what is wrong. Because basically we should be able to observe that GPU processing is sometimes faster, the same as we see in Python. At the moment we focus only on this new API, as last week we spent a lot of time on the legacy version 2.3.0. We even started to analyse how to build it from sources. Well - maybe there is also the same issue as in 2.3.0 - I mean the Java API recognizes that the graphics card compute capability is too low and silently falls back to CPU. In my opinion I managed to force the API to use the specified GPU. But apparently it is not being used. Maybe this logic to fall back to CPU should be guarded somehow, and if the requirement is to process on the GPU and the GPU cannot be used (it is not present or has the wrong compute capability), then maybe an error should be thrown, not a warning or nothing. You know, the developer/user requests to process on the GPU, and then if this is not possible, then "blue screen". Otherwise we end up in situations like this one, where we discuss what might be going on.
On our test platform, we benchmarked the ResNet50 image classification model; there is a small gap between Python GPU and Java GPU (you can find my issue filed here). However, if you compare CPU (AWS EC2 c5.2xlarge), it's around 40ms, while GPU (AWS EC2 p3.2xlarge) is around 5.7ms. GPU is in fact much faster than CPU. We use a TF 2.3.1 build with 0.2.0. But it seems TF2 is doing something weird that is causing everything to be slow. We also did some benchmarks on CPU performance and see that TF1 CPU latency is 2/3 of TF2 on the ResNet50 model with the Java API. But the same problems exist in Python too... so nothing to blame. :) In fact, based on your comments: "GPU or CPU. In fact the results are almost the same (6.0s. vs. 6.1s.)." and "when I executed the same test in Python in eager mode I got 4.5s. on GPU and 259s. on CPU", it seems you ran everything on GPU with TF Java :).
Referring to your last statement, well, I set up another cluster without a GPU, so CPU only with a similar core class, and got 7.1s. Then I ran this on a laptop with CPU only and also had similar results, around 6.6s. So these results in Java on CPU-only machines are much closer to Python on GPU. But of course these are just comments on these results. Let us focus on facts. To get the facts I need some method to check where my processing is executed. Is it possible to turn on some logging where the API would tell with 100% certainty on which device it is executing? You know what, I believe that in your test you have better results on GPU than CPU. But you have a different machine than I do. Could you confirm that in my case it is impossible that the TF Java API falls back to CPU for some reason, for example because of compute capability? The former Java API version 2.3.0 had problems with such compute capabilities. But the new version 0.2.0 might just fall back to CPU instead of issuing an error. This is how I interpret my test results, because the GPU and CPU results are almost the same. Now, the question is how to diagnose it. So far it is far from intuitive.
TF Java uses the same placement engine and fallback logic as Python, because it all happens in the C layer. I'm trying to replicate your issue at the moment, but it triggered a full TF GPU recompile and it's taking a little while on the test VM I've got.
Hello Adam @Craigacp,
Hi @tmichniewski , sorry to keep asking, but can you please enable log device placement in your session and share the output logs with us? I think it is more reliable than comparing speeds to know what has been placed on a CPU or GPU. For example, I took your original code and changed the session creation to:
ConfigProto config = ConfigProto.newBuilder(ConfigProto.getDefaultInstance()).setLogDevicePlacement(true).build();
try (Graph g = new Graph(); Session s = new Session(g, config)) { ...
and now it gives me the following output:
Note that I'm running this on a CPU-only machine.
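For reference, a complete version of the snippet above could look like the sketch below, assembled from the fragments in this thread and assuming the TF Java 0.2.0 classes already used here; the interesting part is the device placement log the runtime prints when the session runs.
import org.tensorflow.Graph;
import org.tensorflow.Operand;
import org.tensorflow.Session;
import org.tensorflow.Tensor;
import org.tensorflow.op.Ops;
import org.tensorflow.proto.framework.ConfigProto;
import org.tensorflow.types.TFloat64;
public class LogDevicePlacementSketch {
  public static void main(String[] args) {
    // Ask the runtime to report which device every op is placed on.
    ConfigProto config = ConfigProto.newBuilder(ConfigProto.getDefaultInstance())
        .setLogDevicePlacement(true)
        .build();
    try (Graph g = new Graph(); Session s = new Session(g, config)) {
      Ops tf = Ops.create(g);
      Operand<TFloat64> z = tf.math.add(tf.constant(3.14), tf.constant(2.72));
      // Running the session triggers the placement log.
      try (Tensor<?> result = s.runner().fetch(z).run().get(0)) {
        System.out.println("Result tensor: " + result);
      }
    }
  }
}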
Hello Karl @karllessard,
Hello Karl @karllessard, I made two tests, both with vectors of 100 million elements and the computation executed 1000 times. Here are the results: Now - does it prove that in the first case the computation was executed on the GPU? OK, I know that TF was originally designed for bigger workloads, for example ML models. But ML computation consists of such numerical computations, just at a much bigger scale. So is scale the problem here? Maybe using the GPU generates additional overhead to pass the computation to the GPU? Maybe this consumes the advantages of using the GPU? What is wrong with the TF Java API used as in this example? BTW - we also noticed that TF is a few times faster than the popular NumPy in Python. But in Python we also observed that GPU computation was 4-5 times faster than on CPU. So why, in the case of the Java API, are the times roughly the same (the time on GPU is more or less the same as on CPU)?
Well, in fact in Python this test (the given expression executed 1000 times in eager mode on vectors of 1 million elements) took: Firstly, even if there is a bug which reports using GPU or CPU but under the cover always uses the GPU (this is also possible), then why is the Java API slower than Python (6s. vs. 4.5s.)? Moreover, when I ran these tests with the Java API 10000 times I got: How should these results be interpreted?
Hello Adam @Craigacp,
Make sure XLA is enabled, see issue #134 (comment)
How do I use XLA with the Java API?
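No working recipe was posted in this thread, but XLA JIT compilation can typically be requested through the session ConfigProto. The sketch below is an assumption based on the org.tensorflow.proto.framework proto classes shipped with tensorflow-core-api 0.2.0 (GraphOptions, OptimizerOptions), not a confirmed answer from the maintainers.
import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.proto.framework.ConfigProto;
import org.tensorflow.proto.framework.GraphOptions;
import org.tensorflow.proto.framework.OptimizerOptions;
public class XlaJitConfigSketch {
  public static void main(String[] args) {
    ConfigProto config = ConfigProto.newBuilder()
        .setGraphOptions(GraphOptions.newBuilder()
            .setOptimizerOptions(OptimizerOptions.newBuilder()
                // Request global XLA JIT compilation for the graph.
                .setGlobalJitLevel(OptimizerOptions.GlobalJitLevel.ON_1)))
        .build();
    try (Graph g = new Graph(); Session s = new Session(g, config)) {
      // Build and run the graph as in the other examples in this thread.
    }
  }
}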
Hi @tmichniewski , we can drill down into the problem a little bit deeper to continue our investigation, thanks for helping. What I would suggest now is to compare the performance when you stop fetching the outputs, to give us a better idea of the impact of doing memory transfers between CPU and GPU. Maybe Python has better optimizations for doing this, we'll see. Can you please change your loop to something like this and try again on both GPU and CPU?
for (int i = 0; i < 1000; i++) {
  s.runner().addTarget(z).run();
}
Also, like @Craigacp suggested previously, if your Python comparison point is this script, then maybe you want to run in eager mode in Java as well. That would mean that your code would look something like this:
import org.tensorflow.*;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Constant;
import org.tensorflow.types.TFloat64;
import java.util.Arrays;
public class HelloTensorFlow {
public static void main(String[] args) {
System.out.println("Hello TensorFlow " + TensorFlow.version());
double[] pi = new double[1_000_000];
Arrays.fill(pi, 3.14); // Pi
double[] e = new double[1_000_000];
Arrays.fill(e, 2.72); // e
Ops tf = Ops.create();
// Tensor<TFloat64> aTensor = TFloat64.scalarOf(3.456);
Constant<TFloat64> a = tf.constant(3.456);
Constant<TFloat64> b = tf.constant(0.5);
Constant<TFloat64> x = tf.constant(pi);
Constant<TFloat64> y = tf.constant(e);
long startTime = System.nanoTime();
for (int i = 0; i < 1000; i++) {
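// Note: in eager mode each iteration below allocates new tensors owned by the eager session,
// and nothing frees them inside the loop, so long runs can exhaust GPU memory (as observed later in this thread).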
// x * y * a / x + b where a = 3.456 and b = 0.5
Operand<TFloat64> z = tf.reduceSum(tf.math.add(tf.math.div(tf.math.mul(tf.math.mul(x, y), a), x), b), tf.constant(0));
// Uncommenting this line will force additional I/O between GPU and CPU, testing both cases could be interesting
// System.out.println(z.data().getDouble());
}
long stopTime = System.nanoTime();
System.out.println("Time [s.] " + (double)(stopTime - startTime) / 1000_000_000);
}
}
Running @karllessard's snippet in eager mode is definitely using the GPU, as it takes up all the GPU memory and errors out (because there hasn't been anything to free the tensors all those operations create). The Python version you posted is using fp32 and the Java one is using fp64, so that's not a fair comparison (CPUs don't slow down as much as GPUs when doing fp64 math).
Comparing the CPU and GPU in Java in graph mode I get roughly the same time for both of them, as with a simple matmul of two 10000x10000 matrices. The device placement fails if you give it a device that doesn't exist, and that's our view into the TF C library, so it might be doing something below the Java API level. The fact that the placement fails suggests it is building the nodes in the right place.
Turning off the fetching of the result out of the loop makes both CPU and GPU faster by roughly the same amount. My recommendation would be to try running a real workload like ResNet inference with and without
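The 10000x10000 matmul comparison mentioned above was not posted as code; the sketch below is a hedged reconstruction in the TF Java 0.2.0 style used elsewhere in this thread (tf.fill, tf.linalg.matMul and tf.reduceSum are assumed op names), reducing the product to a scalar so almost no data is copied back from the device.
import org.tensorflow.Graph;
import org.tensorflow.Operand;
import org.tensorflow.Session;
import org.tensorflow.Tensor;
import org.tensorflow.op.Ops;
import org.tensorflow.types.TFloat32;
public class MatMulBenchmarkSketch {
  public static void main(String[] args) {
    final int n = 10_000;
    try (Graph g = new Graph(); Session s = new Session(g)) {
      Ops tf = Ops.create(g);
      // Two constant n x n fp32 matrices filled with fixed values.
      Operand<TFloat32> a = tf.fill(tf.constant(new int[] {n, n}), tf.constant(0.5f));
      Operand<TFloat32> b = tf.fill(tf.constant(new int[] {n, n}), tf.constant(2.0f));
      // Reduce the product to a scalar so only a few bytes come back from the device.
      Operand<TFloat32> result = tf.reduceSum(tf.linalg.matMul(a, b), tf.constant(new int[] {0, 1}));
      long start = System.nanoTime();
      try (Tensor<TFloat32> t = s.runner().fetch(result).run().get(0).expect(TFloat32.DTYPE)) {
        System.out.println("sum = " + t.data().getFloat());
      }
      System.out.println("Time [s.] " + (System.nanoTime() - start) / 1e9);
    }
  }
}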
Hello Karl @karllessard,
Hello Karl @karllessard, Referring to:
I think this test would measure the time of copying one double value (8 bytes) plus the cost of println. When copying data between GPU and CPU memory, copying 8 bytes is nothing really serious. That is why the test expression had a final reduce, to avoid copying vector data between processors.
Hello Adam @Craigacp,
Larger tests will allow us to rule out misconfiguration of our compiled version of libtensorflow vs inefficiencies in the way the Java API passes in results to the graphs or the way the graphs/eager functions are called. Running a few batches of inference of a reasonably sized model should amortize all the TF Java calling overheads, and hopefully show a speedup, which would indicate that it's a problem with how we call the libtensorflow C library (or how we pass in arguments). If it doesn't show a speedup, then the way we build libtensorflow might be causing some problems. At the moment it's hard to say what the root cause of the inefficiency is, so more data will help us narrow it down.
Ok, I did another test; the results confirm what we observed previously, where running on a GPU takes as long as on the CPU (even longer in this case, in fact). I've run the previous graph but within a loop of 5M iterations. See the following code and the results below:
package mypackage;
import org.tensorflow.*;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Constant;
import org.tensorflow.op.core.ReduceSum;
import org.tensorflow.proto.framework.ConfigProto;
import org.tensorflow.proto.framework.GPUOptions;
import org.tensorflow.types.TFloat64;
import java.util.Arrays;
import org.tensorflow.types.TInt32;
public class GPUTest {
public static void main(String[] args) {
System.out.println("Hello TensorFlow " + TensorFlow.version());
double[] pi = new double[1_000_000];
Arrays.fill(pi, 3.14); // Pi
double[] e = new double[1_000_000];
Arrays.fill(e, 2.72); // e
ConfigProto config = ConfigProto.newBuilder(ConfigProto.getDefaultInstance())
.setLogDevicePlacement(true)
.build();
try (Graph g = new Graph(); Session s = new Session(g, config)) {
Ops tf = Ops.create(g);
Output<?> loopInput = tf.placeholder(TInt32.DTYPE).output();
Output<?> loopOutputs[] = g.whileLoop(
new Output<?>[] { loopInput },
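// Condition graph: keep looping while the counter input is below the iteration limit.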
(cgraph, inputs, outputs) -> {
Ops ctf = Ops.create(cgraph);
outputs[0] = ctf.math.less((Output<TInt32>)inputs[0], ctf.constant(1_000_000)).z();
},
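// Loop body: rebuild the arithmetic expression each iteration, then increment the counter once the reduce_sum has run (enforced by the control dependency).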
(bgraph, inputs, outputs) -> {
Ops btf = Ops.create(bgraph);
Constant<TFloat64> a = btf.constant(3.456);
Constant<TFloat64> b = btf.constant(0.5);
Constant<TFloat64> x = btf.constant(pi);
Constant<TFloat64> y = btf.constant(e);
ReduceSum<TFloat64> r = btf.reduceSum(
btf.math.add(
btf.math.div(
btf.math.mul(
btf.math.mul(x, y), a),
x),
b),
btf.constant(0)
);
outputs[0] = btf.withControlDependencies(Arrays.asList(r)).math.add((Operand)inputs[0], btf.constant(1)).asOutput();
},
"test_gpu_loop"
);
long startTime = System.nanoTime();
try (Tensor<TInt32> c = TInt32.scalarOf(0);
Tensor<?> tensor = s.runner()
.feed(loopInput, c)
.fetch(loopOutputs[0])
.run()
.get(0)) {}
long stopTime = System.nanoTime();
System.out.println("Time [s.] " + (double)(stopTime - startTime) / 1000_000_000);
}
System.out.println("Bye bye");
}
}
Now, we can confirm that most of that time is spent entirely in the native TF runtime library, which has no interaction with the JVM. Device placement logs showed me as well that the ops were placed on the expected device in both cases. So it might be a misconfiguration of the native library build that prevents the GPU kernels from operating at full speed. @saudet already mentioned that XLA support is not enabled by default, but there must be something else missing.
Hello Karl @karllessard, Is it possible that even though the operations were placed on the GPU, they are not executed there, due for example to a wrong (too small) compute capability of the GPU? My understanding of this whole situation is that the Java program is doing its best and we confirmed that it should use the GPU, but at the moment the computation times show that something weird is happening: apparently the computation is done on the CPU, or some other problem extends the computation time on the GPU. This questions the point of using the TF Java API until the problem is fixed.
I'll share some results from my side:
Also I set up the session with the following setting:
For 1 epoch: GPU: 13367 +- 140 ms
For 10 epochs: GPU: 106852 +- 1000 ms
For 20 epochs: GPU: 214016 +- 1800 ms
P.S. The speedup for this GPU on 1.15 was closer to 7.2 times on LeNet-like models
Thanks a lot for the investigation @zaleslaw , that is very appreciated, and these results are closer to what we can expect from a GPU with TF Java. So I guess the reason we were not observing this delta previously was just the example that we've been using? My concern though is why in Python the same example shows a performance improvement while it does not in Java; there is still something that we are missing here.
Hello Alexey @zaleslaw,
@tmichniewski will try to reproduce and share results
@tmichniewski @karllessard There are a lot of different code snippets; I used the one from here. My results: GPU:
CPU:
Looks like the GPU is slower than the CPU, but the task is quite small and the overhead is large. I could test something else on my NVIDIA card, if you need.
Hello @zaleslaw
Expanding the vector size doesn't stress the right places. If (as it appears given the information we have) there's something going wrong with the initial setup and copy, then increasing the vector size bumps that as well as the computational workload. We're not sure what's happening yet; clearly something is wrong, but we've not got a root cause.
Is there anyone who would be able to debug it, to check what the cause of the problem is? Why is it so difficult?
Hi @tmichniewski , I think what makes it a bit difficult for us to debug is that:
So yes, I think we are all interested in understanding the problem better, and meanwhile, if you can please continue your investigations, the project would surely benefit from it. Thanks,
Hello Karl @karllessard,
One major difference is that the TensorFlow runtime has a thin ABI layer called the "C API", which should be used by language binding libraries like TF Java to guarantee stability between TF releases, but we all know that Python bypasses this layer in many cases and accesses the internal C++ classes directly. So maybe the C API is creating the overhead we observe, or maybe it is not exposing everything that we need to enable GPU computation properly.
I completely agree, that would be an interesting test to do. The legacy version of TF Java also calls the C API exclusively, but the linkage is done differently (the legacy version has custom JNI code to bind to the runtime library, while the new version uses auto-generated JNI bindings via JavaCPP). I don't expect the performance to be that different between the two versions, but if it is, it will be much simpler to drill down to the actual cause of the problem. On another note, the example from @zaleslaw showed that GPU is effectively supported by the new TF Java, and once the model has been built and run in the C++ runtime, there is no interaction with the JVM that could impact the performance. So for now my suspicions are that:
Karl
In my case, it was because, while libtensorflow.jar was the latest version, I used an outdated version of libtensorflow_jni.so. Even though it still ran without error, the GPU utilization percentage was very low. Not sure if this is your issue.
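A quick sanity check in that situation (using only the TensorFlow.version() call already present in the snippets above) is to print the version reported by the native runtime that actually got loaded and compare it with the artifacts on the classpath:
import org.tensorflow.TensorFlow;
public class NativeVersionCheck {
  public static void main(String[] args) {
    // Prints the version of the native TensorFlow runtime loaded at run time;
    // if it does not match the jars you deployed, an outdated native library is being picked up.
    System.out.println("TensorFlow native runtime: " + TensorFlow.version());
  }
}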
Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.
System information
You can collect some of this information using our environment capture script
You can also obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
Here is the result of the capture script:
tf_env.txt
Describe the current behavior
We tested the new TensorFlow Java API (not the legacy one) - the brand new version released in October 2020. We tested it on several machines, including Azure Databricks NC6_v3 and Azure Virtual Machines (the capture log is from the virtual machine). I noticed that when no GPU is available the library falls back to CPU, and this is fine. However, we also measured the time for some example processing (a few vector operations), and we see that there is no significant difference between processing time on the GPU and on the CPU. It looks as if it is not using the GPU, even when one is present (we tried two graphics cards: a Tesla K80 with compute capability 3.7 and a Tesla V100 with compute capability 7.0). In both cases we do not see any difference in processing time.
Describe the expected behavior
The expected behaviour is to get much better execution times when the program is executed on a machine with a GPU present.
Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
We used the following java program:
HelloTensorFlow_java.txt
pom_xml.txt
The source was compiled to a class file, which was run via the following command:
java -classpath protobuf-java-3.8.0.jar:ndarray-0.2.0.jar:javacpp-1.5.4.jar:javacpp-1.5.4-linux-x86_64.jar:tensorflow-core-api-0.2.0.jar:tensorflow-core-api-0.2.0-linux-x86_64-gpu.jar:tensorflow-core-platform-gpu-0.2.0.jar:. HelloTensorFlow
The listed libraries were downloaded from https://oss.sonatype.org/.
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
The enclosed program issues the following log:
log.txt
From the log you can see that the GPU was present and recognized.
However, the execution time did not differ when we started it with the GPU and without.