
The Java Tensorflow library does not seem to be using GPU #140

Open
tmichniewski opened this issue Nov 3, 2020 · 61 comments

@tmichniewski

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): YES
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: NO
  • TensorFlow installed from (source or binary): from https://oss.sonatype.org/
  • TensorFlow version (use command below): 2.3.1
  • Python version: 3.7.7
  • Bazel version (if compiling from source): NO
  • GCC/Compiler version (if compiling from source): NO
  • CUDA/cuDNN version: 10.1
  • GPU model and memory: Tesla K80, compute capability 3.7 (but we also tested this on a Tesla V100 with compute capability 7.0)

You can collect some of this information using our environment capture script
You can also obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Here is the result of the capture script:
tf_env.txt

Describe the current behavior
We tested the new TensorFlow Java API (not the legacy one), the brand new version released in October 2020. We tested it on several machines, including Azure Databricks NC6_v3 and Azure Virtual Machines (the capture log is from the virtual machine). I noticed that when no GPU is available the library falls back to CPU, and this is fine. However, we also measured the time for some example processing (a few vector operations), and we see no significant difference between processing time on GPU and on CPU. It looks as if it is not using the GPU, even when one is present (we tried two graphics cards: a Tesla K80 with compute capability 3.7 and a Tesla V100 with compute capability 7.0). In both cases we do not see any difference in processing time.

Describe the expected behavior
The expected behaviour is that execution times are much better when the program is executed on a machine with a GPU present.

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

We used the following java program:

HelloTensorFlow_java.txt
pom_xml.txt

The source was compiled to a class file and run via the following command:
java -classpath protobuf-java-3.8.0.jar:ndarray-0.2.0.jar:javacpp-1.5.4.jar:javacpp-1.5.4-linux-x86_64.jar:tensorflow-core-api-0.2.0.jar:tensorflow-core-api-0.2.0-linux-x86_64-gpu.jar:tensorflow-core-platform-gpu-0.2.0.jar:. HelloTensorFlow

The listed libraries were downloaded from https://oss.sonatype.org/.

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

The enclosed program issues the following log:
log.txt

From the log you can see that the GPU was present and recognized.
However, the execution time did not differ between runs with and without the GPU.

@karllessard
Collaborator

karllessard commented Nov 3, 2020

Hi @tmichniewski ,

Can you assign explicitly an operation to a GPU device and see if that works?

Now, there is an ongoing effort to make device selection easier in TF Java; right now you unfortunately need to use the lower-level API for building an operation.

For example, instead of simply calling tf.math.add(x, y), you need to do something like:

DeviceSpec gpuSpec = DeviceSpec.newBuilder().deviceType(DeviceType.GPU).deviceIndex(0).build();
...
graph.opBuilder(Add.OP_NAME, "Add1")
    .setDevice(gpuSpec.toString())
    .addInput(x.asOutput())
    .addInput(y.asOutput())
    .build();

@karllessard
Collaborator

You can also log device placement by passing a ConfigProto message explicitly when instantiating your Session, using this constructor

@tmichniewski
Author

Hello @karllessard,
Referring to the first recommendation: are you sure that we can call deviceType on DeviceSpec.newBuilder()? My IntelliJ complains about this.

@tmichniewski
Author

Basically I am looking for the simplest solution, just for testing the performance.

@tmichniewski
Author

BTW - I am using version 0.2.0.

@karllessard
Collaborator

@tmichniewski , to be honest I haven't tested the code snippet I posted, but something along those lines should work. You can also simply pass the device spec directly as a string for now, as described here. I'll be curious to see what device placement logging tells you as well, if you get a chance to activate it.
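For reference, here is a minimal sketch of what pinning a single op to the GPU with a plain device string could look like with the 0.2.0 low-level graph builder. This is untested: the op type string "Add" mirrors Add.OP_NAME from the snippet above, and the class name and constant values are made up for illustration.

import org.tensorflow.Graph;
import org.tensorflow.Operation;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Constant;
import org.tensorflow.types.TFloat64;

public class GpuDeviceStringExample {
  public static void main(String[] args) {
    try (Graph graph = new Graph()) {
      Ops tf = Ops.create(graph);
      Constant<TFloat64> x = tf.constant(3.14);
      Constant<TFloat64> y = tf.constant(2.72);
      // Build the op through the low-level builder so the device can be set explicitly
      Operation add = graph.opBuilder("Add", "Add1") // "Add" is the op type, same as Add.OP_NAME
          .setDevice("/GPU:0")                       // plain device string instead of a DeviceSpec
          .addInput(x.asOutput())
          .addInput(y.asOutput())
          .build();
      System.out.println("Built " + add.name() + " pinned to /GPU:0");
    }
  }
}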

@tmichniewski
Author

I already used the hardcoded string "/GPU:0".

@tmichniewski
Author

Hello @karllessard,
Well, I ran tests on an Azure Databricks NC6s_v3 machine with a Tesla V100 graphics card (compute capability 7.0).
With the device set to "/CPU:0" the execution time is 0.23s.
With the device set to "/GPU:0" the execution time is 0.62s.
So not only does it not reduce the execution time, it actually increases it.
The current code is as follows:
HelloTensorFlow_java.txt

@tmichniewski
Author

BTW - on this cluster in Python we got the following times:
CPU 1.00s.
GPU 0.25s.
PythonHelloTensorFlow_py.txt
So in Python the GPU is 4 times faster in this exercise, but in Java the GPU version is slower than the CPU version.

@tmichniewski
Author

Maybe I should also set the device on the session or the graph?

@tmichniewski
Author

But then it would make no sense to set it at the operation level.
Do you have any working example of how to perform, let's say, vector addition on GPU?

@Craigacp
Collaborator

Craigacp commented Nov 3, 2020

Why not try running the Java version in eager mode, like the Python version?

@tmichniewski
Author

Firstly, because eager mode is only for development, not production.
Secondly - well - for tests of course I might use it, but do you have some working example with a suggestion of how to pass the device on which TF is executed? Originally I used it, but Karl (in his first comment) suggested using the low-level API.

@Craigacp
Collaborator

Craigacp commented Nov 3, 2020

But the Python example is in eager mode, so let's rule out differences in the TF runtime engine first, before moving on to issues inside the Java API. I'm sure there are speed issues in the Java API, but it's best to start with equivalent metrics. Either run the Java one in eager mode, or the Python one as a tf.function.

@tmichniewski
Author

tmichniewski commented Nov 3, 2020

@Craigacp Hello Adam,

I am not comparing Python times to Java times. I am trying to compare GPU time to CPU time. The Python code was only to show that the GPU version was roughly 4x faster than the CPU one. Or in other words - I am trying to run the TF Java API on GPU. So far without success.

Right - in Python it was in eager mode. The same in the original Java version, I guess. But this issue is about how to make the Java API execute on GPU. So far I was pointed to the low-level API, but this does not work. That is why I ask for:

  • either some hint of how to correctly pass the device to the API to make it work on GPU (in eager mode or graph mode),
  • or a link to some working example (the API is quite fresh and there are not that many examples/tutorials on the net).

@tmichniewski
Author

Moreover, Karl also pointed me in his third comment to the following description, which clearly states that if there is a choice, then GPU takes precedence:

"If a TensorFlow operation has both CPU and GPU implementations, by default the GPU devices will be given priority when the operation is assigned to a device. For example, tf.matmul has both CPU and GPU kernels. On a system with devices CPU:0 and GPU:0, the GPU:0 device will be selected to run tf.matmul unless you explicitly request running it on another device."

@karllessard
Collaborator

I cannot think of any reason why running graph ops on a GPU in Java would be slower than in Python, since this is all handled by the C++ core library shared by both languages. So if that is really what we are observing, then it has to be a misconfiguration somewhere.

Another idea: can you check, @tmichniewski, if the performance goes back to normal when you remove the line inside the loop where you fetch the result from the tensor? Maybe you are fetching data from GPU memory into the JVM, and that is doing extra copies that are not required when using Python.

@tmichniewski
Author

I also do not believe that the Java version could be slower than the Python one. But this is not the issue here. The problem is that it seems the GPU is not used at all.

The previous version of this library, 2.3.0, the so-called legacy one (since last weekend), was saying that the kernel image was incorrect and there was a suggestion to build TF from sources. But here the library silently falls back to CPU, and you never know whether the GPU is used except by measuring time.

Referring to the second paragraph: well, I thought that it is always necessary to get the data back from the tensor. BTW, we read only one double value, because the last operation in the graph is reduce_sum, so we take 8 bytes instead of a big vector. Therefore it cannot be the reason for the slow performance.

PS.
Do you test it on GPU? I guess so, therefore I think you should have some observations and know-how about how to make it work on GPU.

@tmichniewski
Author

tmichniewski commented Nov 3, 2020

Maybe there is some issue with the graphics card compute capability. In our case it is 7.0 and was also 3.7. Does this library work with cards with such parameters? If not, then this would explain why it silently falls back to CPU.

Or maybe the issue is that the choice of GPU device is not specified correctly. So I ask for some example of how to do this. For example, how to perform vector addition on GPU. A very simple example. So far I provided two versions, eager and graph, but neither is using the GPU.

@karllessard
Collaborator

Sorry @tmichniewski but I'm not personally using GPUs nor do I have access to one right now, so I might not be the best person to provide you with a concrete working example here; maybe @zaleslaw or @saudet can chime in?

Again, I think logging device placement might reveal some explanations as well, see the Session constructor link I've provided earlier.

@saudet
Contributor

saudet commented Nov 4, 2020

I haven't been testing it really closely, but I know at least @roywei and @lanking520 have been using the GPU builds without too much trouble.

@tmichniewski
Author

Hello @karllessard, @saudet, @roywei, @zaleslaw and @lanking520,

I can also confirm that it is possible to use the GPU builds without trouble. But the key question here is whether the GPU builds really use the GPU when it is present on a machine. My tests seem to be telling me that NO, they are not using the GPU. At least so far I don't know how to make them use the GPU.

I provided two sample Java programs:
The first has a simple vector computation, where I expected that providing the GPU versions of the libraries would make TF use the GPU, but this was not the case.
The second example follows Karl's recommendation, where I additionally specified the device in the graph operations. Also without success.

Now, since this library silently falls back to CPU when the GPU is not present or the library is for some reason not willing to use it, how do you perform the unit tests?

One possibility is by comparing the execution times, but this is really delicate. What are the other possibilities? Because I think all the tests may pass after the library falls back to CPU, while all of them should fail when a GPU is present on the machine and requested.

In my opinion there should be a possibility in the API to request computation on GPU, and if this cannot be satisfied, then the unit tests and the whole API should raise an exception saying that it cannot execute on the requested GPU for some reason.

Concluding, I repeat my original issue - how do I convince the TensorFlow Java API to use the GPU?

Please point me to any working example or provide any working hint. Could you?
PS. By working I understand not only working, but also using the GPU if present.

@lanking520

@tmichniewski hi, we were able to use the GPU to run inference. For the GPU use case, you need to make sure your operators and model are loaded on the GPU device. You can also try DJL, which wraps the TF Java package and provides some convenience for automatically switching to GPU at runtime.

You can play around with our Jupyter notebook in Java: http://docs.djl.ai/jupyter/tensorflow/pneumonia_detection.html

If you have a GPU device and CUDA 10.1 installed, it will use the GPU out of the box without a single line of code change.

@tmichniewski
Author

Hello @lanking520,

Referring to:

"we were able to use the GPU to run inference. For the GPU use case, you need to make sure your operators and model are loaded on the GPU device."
which seems to contradict your next statement:
"If you have a GPU device and CUDA 10.1 installed, it will use the GPU out of the box without a single line of code change."

So far I have added setDevice("/GPU:0") to every graph operation, as in the code below:
HelloTensorFlow_java_2.txt
just to be 100% sure where the computation is being executed.

But the results I get on my test computation on vectors of 100 million entries (so big enough), executed 1000 times, do not show any time difference between execution on GPU and on CPU. In fact the results are almost the same (6.0s vs. 6.1s).

On the contrary, when I executed the same test in Python in eager mode I got 4.5s on GPU and 259s on CPU, so the difference is huge and visible. Therefore I wonder why, in the case of the Java API, I do not see the same difference. Could you explain it?
python_test_2.txt

BTW - I am not comparing Python to Java or eager to graph execution. I am comparing GPU to CPU execution, where in Python I see a huge difference while in Java I see no difference.

PS.
I do not want to run these DJL samples, which bring a lot of new elements and new changing parameters into the picture, so in general this would only complicate my tests instead of simplifying them. Moreover, if DJL wraps the TF Java package, then it changes nothing as long as I do not see that the TF Java package itself works well on GPU.

Maybe the TF Java API is well optimized for deep learning computations but not for very simple vector operations.

@tmichniewski
Author

In general I think that there are three possibilities:

  1. The first possibility is that I might be doing something wrong in my test program; however, I think you have had a chance to look at it. Anyway, if this is the case, then I simply ask for some support or just a reference to some simple example program, but with vector operations, not with sophisticated neural network computations.
  2. The second option is that the TF Java API might not be using the GPU. In this case I would ask for some hints on how to force it to use the GPU.
  3. Finally, there might also be some bug in the API which sees the GPU and tries to use it but silently falls back to CPU. In this case you will know best how to handle it.

So my question is: which situation are we dealing with? And how do we diagnose it?

@tmichniewski
Author

Hello @karllessard,
Referring to DeviceSpec.newBuilder().deviceType(...) - I cannot use this construction because the deviceType method is called on the Builder object, which is package-private and cannot be used outside of the org.tensorflow package. This is why I could not use it.
PS. This is just a comment, because the code snippet you provided was perfectly valid, but only for you. :-)

@zaleslaw
Contributor

zaleslaw commented Nov 4, 2020 via email

@tmichniewski
Author

Hello Alex @zaleslaw,
1.15.0 - the TensorFlow 1 API, whose most recent commit in the Maven repository was in October 2019 (we skipped this from the very beginning as obsolete),
2.3.0 - the so-called legacy API since last weekend (in Python and Java API+Scala, but this version did not work in Java/Scala on our machines with GPU compute capability 3.7 and 7.0 - it kept saying that the image is incorrect and we should build TF from sources),
0.2.0 - the newest API, released in October 2020 (Java API).
I experimented on Azure Databricks clusters, but also on Azure Virtual Machines instantiated on request.

So far I have not had a single result with the Java API where I could see - OK, computing on GPU is X times faster. Not a single result.

We managed to see such results only in Python, never with the Java API. That is why I started wondering what is wrong, because basically we should be able to observe that GPU processing is sometimes faster, just as we see in Python.

At the moment we focus only on this new API, as last week we spent a lot of time on the legacy version 2.3.0. We even started to analyse how to build it from sources.

Well - maybe there is also the same issue as in 2.3.0 - I mean the Java API recognizes that the graphics card compute capability is too low and silently falls back to CPU.

In my opinion I managed to force the API to use the specified GPU. But apparently it is not being used. Maybe this fall-back-to-CPU logic should be guarded somehow, and if the requirement is to process on GPU and the GPU cannot be used (it is not present or has the wrong compute capability), then maybe an error should be thrown, not a warning or nothing at all. You know, the developer/user requests to process on GPU; if this is not possible, then "blue screen". Otherwise we end up in situations like this, where we discuss what might be going on.

@lanking520

lanking520 commented Nov 4, 2020

On our test platform, we benchmarked a ResNet50 image classification model; there is a small gap between Python GPU and Java GPU (you can find my issue filed here). However, if you compare CPU (AWS EC2 C5.2xlarge), it is around 40ms, while GPU (AWS EC2 p3.2xlarge) is around 5.7ms. The GPU is in fact much faster than the CPU.

We use the TF 2.3.1 build with 0.2.0.

But it seems TF2 is doing something weird that causes everything to be slow. We also did some benchmarking of the CPU performance and saw that TF1 CPU latency is 2/3 of TF2's on the ResNet50 model with the Java API. But the same problem shows up in Python too... so nothing to blame. :)

In fact, based on your comments:

"GPU or CPU. In fact the results are almost the same (6.0s. vs. 6.1s.)."

"when I executed the same test on Python in eager mode I got 4.5s. on GPU and 259s. on CPU"

It seems you ran everything on GPU with TF Java :).

@tmichniewski
Author

tmichniewski commented Nov 4, 2020

Referring to your last statement: well, I set up another cluster without a GPU, so CPU only with a similar core class, and got 7.1s.

Then I ran this on a laptop with CPU only and also had similar results, around 6.6s. So these results in Java on CPU-only machines are much closer to Python on GPU.

But of course these are just comments on the results. Let us focus on facts. To get the facts I need some method to check where my processing is executed. Is it possible to turn on some logging, where the API would tell with 100% certainty on which device it is executing?

You know what, I believe that in your test you got better results on GPU than CPU. But you have a different machine than I do. Could you confirm that in my case it is impossible that the TF Java API falls back to CPU for some reason, for example because of compute capability? The former Java API version 2.3.0 had problems with such compute capabilities. But the new version 0.2.0 might just fall back to CPU instead of issuing an error. This is how I interpret my test results, because the GPU and CPU results are almost the same.

Now, the question is how to diagnose it. So far it is far from intuitive.

@Craigacp
Collaborator

Craigacp commented Nov 4, 2020

TF Java uses the same placement engine and fallback logic as Python, because it all happens in the C layer. I'm trying to replicate your issue at the moment, but it triggered a full TF GPU recompile and it's taking a little while on the test VM I've got.

@tmichniewski
Author

Hello Adam @Craigacp,
Have you got any results?

@karllessard
Collaborator

karllessard commented Nov 5, 2020

Hi @tmichniewski , sorry to keep asking, but can you please enable log device placement in your session and share the output logs with us? I think it is more reliable than comparing speeds for knowing what has been placed on a CPU or GPU.

For example, I took your original code and changed try (Graph g = new Graph(); Session s = new Session(g)) {... to this:

ConfigProto config = ConfigProto.newBuilder(ConfigProto.getDefaultInstance()).setLogDevicePlacement(true).build();
try (Graph g = new Graph(); Session s = new Session(g, config)) { ...

and now it gives me the following output:

2020-11-05 09:08:44.176590: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-11-05 09:08:44.176822: I external/org_tensorflow/tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Hello TensorFlow 2.3.1
2020-11-05 09:08:44.406991: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-05 09:08:44.412857: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-05 09:08:44.418527: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-11-05 09:08:44.418553: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (xps): /proc/driver/nvidia/version does not exist
2020-11-05 09:08:44.419212: I external/org_tensorflow/tensorflow/core/common_runtime/direct_session.cc:360] Device mapping: no known devices.
Mul: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539650: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Mul: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
Mul_1: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539674: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Mul_1: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
Div: (Div): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539685: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Div: (Div): /job:localhost/replica:0/task:0/device:CPU:0
Add: (Add): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539694: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Add: (Add): /job:localhost/replica:0/task:0/device:CPU:0
ReduceSum: (Sum): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539704: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] ReduceSum: (Sum): /job:localhost/replica:0/task:0/device:CPU:0
Const: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539715: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Const: (Const): /job:localhost/replica:0/task:0/device:CPU:0
Const_1: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539724: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Const_1: (Const): /job:localhost/replica:0/task:0/device:CPU:0
Const_2: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539734: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Const_2: (Const): /job:localhost/replica:0/task:0/device:CPU:0
Const_3: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539744: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Const_3: (Const): /job:localhost/replica:0/task:0/device:CPU:0
Const_4: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2020-11-05 09:08:44.539753: I external/org_tensorflow/tensorflow/core/common_runtime/placer.cc:114] Const_4: (Const): /job:localhost/replica:0/task:0/device:CPU:0
9900320.000000034
Time [s.] 0.16198432
Bye bye

Note that I'm running this on a CPU-only machine.

@tmichniewski
Author

Hello Karl @karllessard,
Sure, I will test it tomorrow.
PS. Thanks for this snippet, this is the kind of support I was hoping for.
Tomek

@tmichniewski
Author

tmichniewski commented Nov 6, 2020

Hello Karl @karllessard,

I made two tests, both with vectors of 100 million elements, and the computation executed 1000 times. Here are the results:
GPU.txt
CPU.txt

Now - does this prove that in the first case the computation was executed on GPU?
If so, then looking at the results (GPU 6.05s, CPU 5.74s) I wonder why the CPU is even faster than the GPU, or at least why the GPU is not faster.
What is wrong with this test?
Maybe the computation is too simple? Maybe the vectors are too small? What makes the GPU version take as long as the CPU one?

OK, I know that TF was originally designed for bigger workloads, for example ML models. But ML computation consists of exactly such numerical computations, only at a much bigger scale. So is the scale the problem here? Maybe using the GPU generates additional overhead to pass the computation to the GPU? Maybe this consumes the advantages of using the GPU?

What is wrong with the TF Java API used as in this example?

BTW - we also noticed that TF is a few times faster than the popular NumPy in Python. But in Python we also observed that GPU computation was 4-5 times faster than on CPU. So why, in the case of the Java API, are the times roughly the same (the time on GPU is more or less the same as on CPU)?

@tmichniewski
Author

Well, in fact in Python this test (the given expression executed 1000 times in eager mode on vectors of 1 million elements) took:
0.23s. on GPU
1.13s. on CPU
But when we increased the vector size to 100 million (as in the Java example), we got:
4.50s. on GPU
259.4s. on CPU
Isn't that proof that the GPU version is much faster?
So why, on the same machine, do we get such strange results with the Java API?

Firstly, even if there is a bug which reports using GPU or CPU but under the covers always uses the GPU (this is also possible), then why is the Java API slower than Python (6s vs. 4.5s)?

Moreover, when I ran these tests with the Java API 10000 times I got:
8.0s. on GPU
6.5s. on CPU

How should I interpret these results?

@tmichniewski
Author

Hello Adam @Craigacp,
Did you manage to finish your test on this sample program?

@saudet
Contributor

saudet commented Nov 6, 2020

Firstly, even if there is a bug which reports using GPU or CPU but under the covers always uses the GPU (this is also possible), then why is the Java API slower than Python (6s vs. 4.5s)?

Make sure XLA is enabled, see issue #134 (comment)

@tmichniewski
Author

How do I use XLA in the Java API?
In Python this is simple: @tf.function(experimental_compile=True), according to:
https://www.tensorflow.org/xla/tutorials/compile?hl=en
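For reference, a rough sketch of one way XLA JIT might be requested at the session level in TF Java, by setting the global JIT level in the ConfigProto passed to the Session. This is an untested assumption: it relies on the generated proto classes in tensorflow-core-api 0.2.0 exposing the usual protobuf builder setters, and it only has an effect if the native library was actually built with XLA support, which is what issue #134 is about.

import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.proto.framework.ConfigProto;
import org.tensorflow.proto.framework.GraphOptions;
import org.tensorflow.proto.framework.OptimizerOptions;

ConfigProto config = ConfigProto.newBuilder()
    .setGraphOptions(GraphOptions.newBuilder()
        .setOptimizerOptions(OptimizerOptions.newBuilder()
            // ON_1 asks the runtime to auto-cluster ops for XLA JIT compilation
            .setGlobalJitLevel(OptimizerOptions.GlobalJitLevel.ON_1)))
    .build();

try (Graph g = new Graph(); Session s = new Session(g, config)) {
  // build and run the graph as in the earlier examples
}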

@karllessard
Collaborator

karllessard commented Nov 6, 2020

Hi @tmichniewski , we can drill down into the problem a little deeper to continue our investigation, thanks for helping. What I would suggest now is to compare the performances when you stop fetching the outputs, to give us a better idea of the impact of memory transfers between CPU and GPU. Maybe Python has better optimizations for doing this, we'll see.

Can you please change your loop to something like this and try again on both GPU and CPU?

 for (int i = 0; i < 1000; i++) {
        s.runner().addTarget(z).run();
 }

@karllessard
Collaborator

Also, like @Craigacp suggested previously, if your Python comparable is this script, then maybe you want to run in eager mode in Java as well. That would mean your code would look something like this:

import org.tensorflow.*;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Constant;
import org.tensorflow.types.TFloat64;

import java.util.Arrays;

public class HelloTensorFlow {
  public static void main(String[] args) {
    System.out.println("Hello TensorFlow " + TensorFlow.version());

    double[] pi = new double[1_000_000];
    Arrays.fill(pi, 3.14); // Pi
    double[] e = new double[1_000_000];
    Arrays.fill(e, 2.72); // e
    Ops tf = Ops.create();
      
    // Tensor<TFloat64> aTensor = TFloat64.scalarOf(3.456);
    Constant<TFloat64> a = tf.constant(3.456);
    Constant<TFloat64> b = tf.constant(0.5);
    Constant<TFloat64> x = tf.constant(pi);
    Constant<TFloat64> y = tf.constant(e);

    long startTime = System.nanoTime();
    
    for (int i = 0; i < 1000; i++) {
      // x * y * a / x + b where a = 3.456 and b = 0.5
      Operand<TFloat64> z = tf.reduceSum(tf.math.add(tf.math.div(tf.math.mul(tf.math.mul(x, y), a), x), b), tf.constant(0));

      // Uncommenting this line will force additional I/O between GPU and CPU, testing both cases could be interesting
      // System.out.println(z.data().getDouble());
    }
    long stopTime = System.nanoTime();
    System.out.println("Time [s.] " + (double)(stopTime - startTime) / 1000_000_000);
  }
}

@Craigacp
Collaborator

Craigacp commented Nov 6, 2020

Running @karllessard's snippet in eager mode is definitely using the GPU as it takes up all the GPU memory and errors out (because there hasn't been anything to free the tensors all those operations create).

The Python version you posted uses fp32 and the Java one uses fp64, so that's not a fair comparison (CPUs don't slow down as much as GPUs when doing fp64 math).

Comparing the CPU and GPU in Java in graph mode I get roughly the same time for both of them, as with a simple matmul of two 10000x10000 matrices. The device placement fails if you give it a device that doesn't exist, and that's our view into the TF C library, so it might be doing something below the Java API level. The fact that the placement fails suggests it is building the nodes in the right place. Using CUDA_VISIBLE_DEVICES= on the CPU run doesn't make that slower, so it's not ignoring the CPU placement and putting things on the GPU anyway.

Turning off the fetching of the result out of the loop makes both CPU and GPU faster by roughly the same amount.

My recommendation would be to try running a real workload like ResNet inference with and without CUDA_VISIBLE_DEVICES= (note the space after the equals), as that turns the GPU off and on. I don't have time to do such an experiment at the moment, but I might get some time next week.
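For example, a hypothetical pair of runs using the same classpath as the java command near the top of this issue (abbreviated here as ...) would be:

CUDA_VISIBLE_DEVICES= java -classpath ... HelloTensorFlow   # empty value hides the GPU, so TF runs on CPU
java -classpath ... HelloTensorFlow                         # GPU left visible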

@tmichniewski
Author

tmichniewski commented Nov 7, 2020

Hello Karl @karllessard,
Referring to the first test with "s.runner().addTarget(z).run();". I made it twice (in both cases I increased the size of the vectors to 100_000_000):

  1. First test, where I referenced the GraphOperation sum - the operations with direct setDevice. I got the following results:
    FirstTestJava.txt
    FirstTestGPU.txt
    FirstTestCPU.txt
    As you can see, the GPU version (6.23s) is even slightly slower than the CPU one (6.06s). Moreover, when I ran this test on my laptop I got 7.02s (CPU only).
  2. Second test, where I referenced the z Operand from eager mode. Then I got the following results:
    SecondTestJava.txt
    SecondTestGPU.txt
    SecondTestCPU.txt
    In this test the GPU time was 6.26s, the CPU time 6.26s, and the laptop time 6.80s.
    Generally the GPU is more or less the same as the CPU.

@tmichniewski
Author

tmichniewski commented Nov 7, 2020

Hello Karl @karllessard,
Referring to your next, longer test, where "Operand z" is instantiated inside the for loop - well, in this case on both my laptop and the Databricks machine with GPU I got an OutOfMemory error, even though the vectors were of size 1_000_000, while in my previous tests I used 100_000_000-element vectors, so 100x bigger.
OOM.txt
I suspect it might be due to the fact that "z" is instantiated 1000 times inside the loop.

Referring to:

// Uncommenting this line will force additional I/O between GPU and CPU, testing both cases could be interesting
// System.out.println(z.data().getDouble());

I think this test would measure the time of copying one double value (8 bytes) plus the cost of println. When copying data between GPU and CPU memory, 8 bytes is nothing serious. That is why the test expression has a final reduce, to avoid copying vector data between processors.

@tmichniewski
Author

Hello Adam @Craigacp,
I don't want to do bigger tests, because bigger and more complicated tests mean more changing parameters which influence the results. I want to verify whether it makes sense to use the TensorFlow Java API to perform vector operations. What I expect is to see some improvement when running on GPU (as in the Python case), but in the case of the Java API my expectations have so far not been confirmed.
All these tests seem to prove only that my computation is indeed executed on the GPU when one is present, but the execution time is more or less the same as on CPU. And I still think that this is strange.

@Craigacp
Collaborator

Craigacp commented Nov 10, 2020

Larger tests will allow us to rule out misconfiguration of our compiled version of libtensorflow vs inefficiencies in the way the Java API passes in results to the graphs or the way the graphs/eager functions are called. Running a few batches of inference of a reasonably sized model should amortize all the TF Java calling overheads, and hopefully show a speedup, which would indicate that it's a problem with how we call the libtensorflow C library (or how we pass in arguments). If it doesn't show a speedup, then the way we build libtensorflow might be causing some problems.

At the moment it's hard to say what the root cause of the inefficiency is, so more data will help us narrow it down.

@karllessard
Collaborator

karllessard commented Nov 12, 2020

Ok, I did another test, results confirm what we observed previously, where running on a GPU takes as long as on the CPU (even longer in this case, in fact). I've run the previous graph but within a loop of 5M iterations. See the following code and the results below:

package mypackage;

import org.tensorflow.*;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Constant;
import org.tensorflow.op.core.ReduceSum;
import org.tensorflow.proto.framework.ConfigProto;
import org.tensorflow.proto.framework.GPUOptions;
import org.tensorflow.types.TFloat64;

import java.util.Arrays;
import org.tensorflow.types.TInt32;

public class GPUTest {
  public static void main(String[] args) {
    System.out.println("Hello TensorFlow " + TensorFlow.version());

    double[] pi = new double[1_000_000];
    Arrays.fill(pi, 3.14); // Pi
    double[] e = new double[1_000_000];
    Arrays.fill(e, 2.72); // e

    ConfigProto config = ConfigProto.newBuilder(ConfigProto.getDefaultInstance())
        .setLogDevicePlacement(true)
        .build();

    try (Graph g = new Graph(); Session s = new Session(g, config)) {
      Ops tf = Ops.create(g);

      Output<?> loopInput = tf.placeholder(TInt32.DTYPE).output();

      Output<?> loopOutputs[] = g.whileLoop(
           new Output<?>[] { loopInput },
          (cgraph, inputs, outputs) -> {
            Ops ctf = Ops.create(cgraph);
            outputs[0] = ctf.math.less((Output<TInt32>)inputs[0], ctf.constant(1_000_000)).z();
          },
          (bgraph, inputs, outputs) -> {
            Ops btf = Ops.create(bgraph);

            Constant<TFloat64> a = btf.constant(3.456);
            Constant<TFloat64> b = btf.constant(0.5);
            Constant<TFloat64> x = btf.constant(pi);
            Constant<TFloat64> y = btf.constant(e);

            ReduceSum<TFloat64> r = btf.reduceSum(
                btf.math.add(
                    btf.math.div(
                        btf.math.mul(
                            btf.math.mul(x, y), a),
                        x),
                    b),
                btf.constant(0)
            );
            outputs[0] = btf.withControlDependencies(Arrays.asList(r)).math.add((Operand)inputs[0], btf.constant(1)).asOutput();
          },
          "test_gpu_loop"
      );
      long startTime = System.nanoTime();
      try (Tensor<TInt32> c = TInt32.scalarOf(0);
        Tensor<?> tensor = s.runner()
            .feed(loopInput, c)
            .fetch(loopOutputs[0])
            .run()
            .get(0)) {}
      long stopTime = System.nanoTime();
      System.out.println("Time [s.] " + (double)(stopTime - startTime) / 1000_000_000);
    }

    System.out.println("Bye bye");
  }
}
CPU: 11.65383011 secs
GPU: 18.522389901 secs

Now, we can confirm that most of that time is spent entirely in the native TF runtime library, with no interaction with the JVM. Device placement logs also showed me that the ops were placed on the expected device in both cases. So it might be a misconfiguration of the native library build that prevents the GPU kernels from operating at full speed. @saudet already mentioned that XLA support is not enabled by default, but there must be something else missing.

@tmichniewski
Author

tmichniewski commented Nov 13, 2020

Hello Karl @karllessard,

Is it possible that even though the operations were placed on the GPU, they are not executed there due to, for example, a wrong (too low) compute capability of the GPU?

My understanding of this whole situation is that the Java program seems to be doing its best, and we confirmed that it should use the GPU, but at the moment the computation times show that something weird is happening: apparently the computation is done on CPU, or some other problem extends the computation time on GPU. This calls into question the point of using the TF Java API until the problem is fixed.

@zaleslaw
Contributor

I'll share some results from my side:
I ran the following example on my local machine with an NVIDIA GeForce GTX 1650 card.
I made minor changes:

  • 10 epochs instead of 1 in the example
  • batch_size = 100 instead of 1000 (for the best GPU memory utilization)

Also I set up the session with the following setting:

session = new Session(graph, ConfigProto.newBuilder().setGpuOptions(GPUOptions.newBuilder().setAllowGrowth(true).build()).build());

For 1 epoch:

GPU: 13367 +- 140 ms
CPU: 56156 +- 420 ms
Speed up ~ 4.2

For 10 epochs:

GPU: 106852 +- 1000 ms
CPU: 560519 +- 6900 ms
Speed up ~ 5.24

For 20 epochs:

GPU: 214016 +- 1800 ms
CPU: 1127356 +- 12500 ms
Speed up ~ 5.26

P.S. The speedup for this GPU on 1.15 was closer to 7.2 times on LeNet-like models

@karllessard
Collaborator

Thanks a lot for the investigation @zaleslaw , that is very appreciated, and these results are closer to what we can expect from a GPU with TF Java. So I guess the reason we were not observing this delta previously was just the example we've been using? My concern though is why in Python the same example shows a performance improvement while it does not in Java; there is still something that we are missing here.

@tmichniewski
Author

Hello Alexey @zaleslaw,
Could you perform an additional test on your machine, but this time with our example, the one Karl mentioned? We have a plain vector computation instead of sophisticated machine learning algorithms.

@zaleslaw
Contributor

@tmichniewski will try to reproduce and share results

@zaleslaw
Contributor

zaleslaw commented Nov 26, 2020

@tmichniewski @karllessard There are a lot of different code snippets; I used the one from here

My results:

GPU:

Time [s.] 2.402464902
Bye bye
test_gpu_loop/Enter: (Enter): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/Merge: (Merge): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/cond/Less: (Less): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/LoopCond: (LoopCond): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/Switch: (Switch): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Mul_1: (Mul): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Div: (Div): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Add: (Add): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/ReduceSum: (Sum): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Add_1: (Add): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/NextIteration: (NextIteration): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/Exit: (Exit): /job:localhost/replica:0/task:0/device:GPU:0
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/cond/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const_2: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const_3: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const_4: (Const): /job:localhost/replica:0/task:0/device:GPU:0
test_gpu_loop/body/Const_5: (Const): /job:localhost/replica:0/task:0/device:GPU:0

Process finished with exit code 0

CPU:

Time [s.] 1.63166006
Bye bye
test_gpu_loop/Enter: (Enter): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/Merge: (Merge): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/cond/Less: (Less): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/LoopCond: (LoopCond): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/Switch: (Switch): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Mul: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Mul_1: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Div: (Div): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Add: (Add): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/ReduceSum: (Sum): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Add_1: (Add): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/NextIteration: (NextIteration): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/Exit: (Exit): /job:localhost/replica:0/task:0/device:CPU:0
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/cond/Const: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const_1: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const_2: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const_3: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const_4: (Const): /job:localhost/replica:0/task:0/device:CPU:0
test_gpu_loop/body/Const_5: (Const): /job:localhost/replica:0/task:0/device:CPU:0

Looks like the GPU is slower than the CPU, but the task is quite small and the overhead is large.

I could test something else on my NVIDIA card, if you need.

@tmichniewski
Author

Hello @zaleslaw
If the task seems to be too simple, it can be scaled up to vectors of 100 million elements. In Python we observed that the GPU version was much faster than the CPU one; this was like seconds versus minutes. Why has no one observed this with the Java API? What is wrong in the Java API?

@Craigacp
Collaborator

Expanding the vector size doesn't stress the right places. If (as it appears given the information we have) there's something going wrong with the initial setup and copy, then increasing the vector size bumps that as well as the computational workload.

We're not sure what's happening yet; clearly something is wrong, but we don't have a root cause.

@tmichniewski
Author

Is there anyone who would be able to debug it, to check what the cause of the problem is? Why is it so difficult?

@karllessard
Collaborator

Hi @tmichniewski , I think what makes it a bit difficult for us to debug is that:

  1. GPUs are handled entirely by the TensorFlow C++ library and normally Java should have nothing to do with it, so we need to dig deeper into the TensorFlow runtime and Python client to understand what is going wrong with your test application, while our expertise is more on the Java bindings themselves.

  2. Not all of us, including me, have access to a local GPU machine.

  3. This is a community-driven project and all collaborators are working on it of their own free will and in their own time.

So yes, I think we are all interested in understanding the problem better, and meanwhile, if you can please continue your investigations, the project would surely benefit from it.

Thanks,
Karl

@tmichniewski
Author

Hello Karl @karllessard,
Ad.1. Have you looked at the Python TF client to see how it uses the C++ library and what the differences are? In fact the Java TF API should mimic what the Python client is doing, so it is worth analysing even without my issue report.
Ad.2. Sure, if you provide me with any source code (recommended improvements, configuration or logging) I will give it a go. But as I understand it, we tested all the improvements recommended by you and they only confirmed that something is wrong, without giving a precise explanation. Still, we can repeat these tests; I would just need some source code input from you.
Ad.3. Finally, maybe it is worth comparing with version 1.15.0 or 2.3.1 of the Java TF API, now the legacy versions (but only since last month). Maybe these versions were working somewhat differently? At least I got messages that to make it work on my GPU with compute capability 7.0 I would have to build the API from sources, as there was some issue with PTX files or something like that, and only the most recent compute capabilities were supported by default. So maybe this is a clue?

@karllessard
Collaborator

Have you looked at the Python TF client to see how it uses the C++ library and what the differences are

One major difference is that the TensorFlow runtime has a thin ABI layer called the "C API" which should be used by language binding libraries like TF Java to guarantee stability between TF releases, but we all know that Python bypasses this layer in many cases and accesses the internal C++ classes directly. So maybe the C API is creating the overhead we observe, or maybe it is not exposing everything that we need to enable GPU computation properly.

Finally, maybe it is worth comparing with version 1.15.0 or 2.3.1 of the Java TF API

I completely agree, that would be an interesting test to do. The legacy version of TF Java also calls the C API exclusively, but the linkage is done differently (the legacy version has custom JNI code to bind to the runtime library, while the new version uses JNI bindings auto-generated by JavaCPP).

I don't expect the performance to be that different between the two versions, but if it is, it will be much simpler to drill down to the actual cause of the problem.

On another note, the example from @zaleslaw showed that the GPU is effectively supported by the new TF Java, and once the model has been built and run in the C++ runtime, there is no interaction with the JVM that could impact the performance. So for now my suspicions are that:

  1. Most of the time in your example is spent before or after running the graph (in the C API maybe?)
  2. The graph session is not initialized properly (e.g. Python could use an internal tweak to enable more GPU kernels)

Karl

@unik00

unik00 commented Sep 21, 2022

In my case, it was because while libtensorflow.jar was the latest version, I used an outdated version of libtensorflow_jni.so. Even though it still ran without errors, the GPU utilization percentage was very low. Not sure if this is your issue.
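As a quick sanity check for this kind of mismatch, the version reported by the bindings should come from the native library that actually got loaded, so printing it (as the HelloTensorFlow examples above already do) shows which runtime is in use:

System.out.println("Native TensorFlow runtime: " + org.tensorflow.TensorFlow.version());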
