
Large latency on Tensor allocation #313

Closed
nebulorum opened this issue May 3, 2021 · 43 comments · Fixed by #361

Comments

@nebulorum

Describe the documentation issue

Not sure this is the correct forum, but I would like some guidance on how to set up sessions; guidance on resource management would also be interesting.

After two weeks of trying to understand why latencies in 0.3.1 were completely uncontrollable (compared to the official 1.3.1), I ran into #208. It matches my observations.

We are trying to run prediction on models with thousands of data points spread across different tensors per prediction. Memory allocation on the threads is around 20 MB/s, and there seems to be synchronization between the JavaCPP allocation thread and our worker threads. In addition, allocating even a size-1 tensor seems to be very slow (in the 7 ms range).

After reading #208, it seems we are doing everything wrong. But I don't really have a clear picture of how it should be done: Would EagerSession help? Could I use a Session per HTTP request? Should I allocate larger multi-dimensional tensors instead of single ones? How should I configure thread pools? I understand that the API is a work in progress, but the current documentation is very light on this kind of guidance.

I don't think this is a bug, but I can convert it into some other sort of issue.

@Craigacp
Collaborator

Craigacp commented May 3, 2021

Session should be long lived as recreating it involves a bunch of allocations and other things. If you can batch up the incoming requests then that will allow the TF native runtime to perform more efficient batched matrix multiplies/convolutions (in many cases). I think the Session object should be thread safe so you can use it concurrently from multiple threads.
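For example, something along these lines (a rough sketch, with a placeholder model path and tag, not a prescription):

import org.tensorflow.SavedModelBundle;
import org.tensorflow.Session;

// Sketch: load the bundle once at startup and share the Session across
// request threads, instead of recreating either of them per request.
public final class ModelHolder {
  // The path and tag below are placeholders for your own saved model.
  private static final SavedModelBundle BUNDLE =
      SavedModelBundle.load("/models/my-model", "serve");
  private static final Session SESSION = BUNDLE.session();

  private ModelHolder() {}

  public static Session session() {
    return SESSION;
  }
}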

We should probably turn off the JavaCPP memory allocation size check and let the GC clear it up; if you're creating lots of Java-side objects, GC should run frequently enough to clear things out anyway. Are you hitting the thread.sleep inside JavaCPP's allocator?

Note if you're making lots of sessions out of saved model bundles you might be hitting some memory leaks inside TF's C API (which the TF Java API uses to interact with the native library). I opened this issue (tensorflow/tensorflow#48802) upstream but we've not had a response yet.

@nebulorum
Author

Running the JVM under a profiler, I never see the JavaCPP deallocation thread sleep. It is either in the wait, monitor, or running state. Under lower load it spends most of its time waiting, and then the monitor/running phases sync up with the worker threads.

We have a simple model that can run in 3 ms under low load. Our overall call should complete in under 40 ms, but when load increases we clearly see waits in our telemetry. Model evaluation is still quick, but something is causing contention in the JVM.

After reviewing the tensor construction, it seems that for each request we create 80 tensors of length 200, some String, most Long. Our feature mapping may be generating too much garbage on the side, but after running for a while about ~2% of the heap is org.bytedeco.javacpp.Pointer$DeallocatorReference and another ~2% is org.tensorflow.internal.c_api.AbstractTF_Tensor$DeleteDeallocator.

We also see contention on the JavaCPP deallocation thread. It is the yellow thread in the picture below. The other threads are some of our worker threads:

[screenshot: profiler thread timeline]

You can clearly see several threads block on a monitor. The image below shows that this sync pattern happens repeatedly:
[screenshot: profiler thread timeline]

After reviewing some of our thread dumps we also see threads blocking like this:

java.lang.Thread.State: BLOCKED (on object monitor)
	at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:666)
	- waiting to lock <0x00000005330184f8> (a java.lang.Class for org.bytedeco.javacpp.Pointer$DeallocatorThread)
	at org.bytedeco.javacpp.Pointer.init(Pointer.java:127)
	at org.bytedeco.javacpp.BytePointer.allocateArray(Native Method)
	at org.bytedeco.javacpp.BytePointer.<init>(BytePointer.java:124)
	at org.bytedeco.javacpp.BytePointer.<init>(BytePointer.java:97)
	at org.tensorflow.internal.buffer.ByteSequenceTensorBuffer$InitDataWriter.writeNext(ByteSequenceTensorBuffer.java:129)

Or:

at org.bytedeco.javacpp.Pointer.physicalBytes(Native Method)
	at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:665)
	at org.tensorflow.internal.c_api.AbstractTF_Tensor.allocateTensor(AbstractTF_Tensor.java:91)
	at org.tensorflow.RawTensor.allocate(RawTensor.java:197)


Looking at org.bytedeco.javacpp.Pointer.java:666 we can see it's a synchronized block, and Pointer has several synchronized blocks when adding and removing deallocators. I suspect that as 200 requests per second come in, each allocating 80 tensors, we run into contention on the JavaCPP thread, which causes the monitor synchronization and blows our latency through the roof. This does not happen in the 1.3.1 library, but we wanted to move to 2.4.1.

I now question whether this is a documentation issue, a usage issue, or a bug/feature issue, and how to work around this behaviour.

@saudet
Contributor

saudet commented May 4, 2021

@nebulorum If you see some output when running your app with something like shown at #251 (comment), then it means you're forgetting to call close() on those objects. You'll need to make sure to call close() on them to avoid the costs associated with the GC fallback. In other words, if you call close() everywhere you can, you will not see any overhead. That responsibility is on you! This has nothing to do with JavaCPP; it is only trying to cope with code that isn't calling close() at all the appropriate places. The close() method is a standard Java SE method from the AutoCloseable interface. I do not see the point of writing additional documentation about that.
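For example (a minimal sketch, assuming a TInt64 input built from a long[]):

import org.tensorflow.types.TInt64;

// Close the tensor deterministically instead of leaving it to the GC fallback.
long[] features = new long[200];
try (TInt64 input = TInt64.vectorOf(features)) {
  // ... feed `input` to a Session.Runner and copy out the results here ...
} // input.close() runs here, releasing the native memory immediately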

@Craigacp
Collaborator

Craigacp commented May 4, 2021

The synchronized block in Pointer is going to cause contention on the thread, especially if you're allocating from different threads on different CPUs (or worse still sockets) as it'll cause the object to bounce around. I'm not sure that 200 requests per second is enough to trigger real issues there, but the fact that the Pointer is using a linked list (which is also not particularly multicore friendly) could also be causing trouble. What's the execution environment like (e.g. num CPUs, type of CPUs, num NUMA nodes, is the JVM restricted to a subset thereof)?

@nebulorum
Author

> The synchronized block in Pointer is going to cause contention on the thread, especially if you're allocating from different threads on different CPUs (or worse still sockets) as it'll cause the object to bounce around. I'm not sure that 200 requests per second is enough to trigger real issues there, but the fact that the Pointer is using a linked list (which is also not particularly multicore friendly) could also be causing trouble. What's the execution environment like (e.g. num CPUs, num NUMA nodes, is the JVM restricted to a subset thereof)?

We are running on AWS, CPU only, no co-processors. What has baffled us is that 1.3.1 has very stable latency and that code is 3 years old. This new endpoint can be faster, but it degrades really quickly as load increases.

I have considered having a dedicated tensor-allocating thread so that we have fewer threads fighting for the lock. But this will obviously create a bottleneck.

@karllessard
Collaborator

@saudet , if I understand correctly, this synchronization will happen every time a resource is deallocated, no matter whether the amount of allocated memory is getting high or not. It would be interesting to completely disable the feature where JavaCPP keeps track of the amount of memory allocated and avoid this thread locking (it looks like it is not avoidable here, even if you set maxPhysicalBytes, maxRetries or maxBytes to 0).

Since that's a protected method, I can override it and run a test to see if there is any gain. @nebulorum , do you have any code available to share to run this benchmark?

@Craigacp
Collaborator

Craigacp commented May 4, 2021

> The synchronized block in Pointer is going to cause contention on the thread, especially if you're allocating from different threads on different CPUs (or worse still sockets) as it'll cause the object to bounce around. I'm not sure that 200 requests per second is enough to trigger real issues there, but the fact that the Pointer is using a linked list (which is also not particularly multicore friendly) could also be causing trouble. What's the execution environment like (e.g. num CPUs, num NUMA nodes, is the JVM restricted to a subset thereof)?
>
> We are running on AWS, CPU only, no co-processors. What has baffled us is that 1.3.1 has very stable latency and that code is 3 years old. This new endpoint can be faster, but it degrades really quickly as load increases.
>
> I have considered having a dedicated tensor-allocating thread so that we have fewer threads fighting for the lock. But this will obviously create a bottleneck.

Ok, how big is your AWS instance? If there are NUMA effects from multiple CPU sockets (or within a socket for some of the AMD EPYC variants) then this could slow things down, and having a single thread doing the input tensor creation would help (though given there are lots of other things that create JavaCPP Pointers during a session.run call it won't prevent everything from bouncing the locking object across sockets).

As Karl says, we should look at this locking on our side to see if we can improve things, but in the meantime pinning the threads if you're on a large machine might help.

@nebulorum
Author

> @nebulorum If you see some output when running your app with something like shown at #251 (comment), then it means you're forgetting to call close() on those objects. You'll need to make sure to call close() on them to avoid the costs associated with the GC fallback. In other words, if you call close() everywhere you can, you will not see any overhead. That responsibility is on you! This has nothing to do with JavaCPP; it is only trying to cope with code that isn't calling close() at all the appropriate places. The close() method is a standard Java SE method from the AutoCloseable interface. I do not see the point of writing additional documentation about that.

We checked in integration: if we call gc() in the test, we do see the messages referred to in #251. We will try to close all tensors before leaving the model execution part and see if this improves things.

While we did have a facepalm moment when you mentioned AutoCloseable, I would suggest that the example include a dbx.close() call.

Making sure people are aware they need to do proper resource management to ensure performance would be helpful, especially since this is not the first issue around the topic.

@saudet
Contributor

saudet commented May 4, 2021

> @saudet , if I understand correctly, this synchronization will happen every time a resource is deallocated, no matter whether the amount of allocated memory is getting high or not. It would be interesting to completely disable the feature where JavaCPP keeps track of the amount of memory allocated and avoid this thread locking (it looks like it is not avoidable here, even if you set maxPhysicalBytes, maxRetries or maxBytes to 0).

Sigh... sure, I can probably do that when the "org.bytedeco.javacpp.nopointergc" system property is set to "true" so that you can all stop blaming me for bugs in your code! :) No one has ever encountered any issues like that with JavaCPP in the past. This is a problem in TF Java.

@Craigacp
Collaborator

Craigacp commented May 4, 2021

> @saudet , if I understand correctly, this synchronization will happen every time a resource is deallocated, no matter whether the amount of allocated memory is getting high or not. It would be interesting to completely disable the feature where JavaCPP keeps track of the amount of memory allocated and avoid this thread locking (it looks like it is not avoidable here, even if you set maxPhysicalBytes, maxRetries or maxBytes to 0).

> Sigh... sure, I can probably do that when the "org.bytedeco.javacpp.nopointergc" system property is set to "true" so that you can all stop blaming me for bugs in your code! :) No one has ever encountered any issues like that with JavaCPP in the past. This is a problem in TF Java.

Lock contention is a different issue from the freeing of resources. If threads are blocked waiting for the lock on DeallocatorThread or DeallocatorReference (as indicated by the screenshots) then that's a different problem from the leaking of memory from unclosed tensors. How large a machine have you run this on? Using a single lock to guard a resource which is accessed from multiple threads or sockets is going to slow down under contention; the question is just how much contention it takes to cause problems.

@nebulorum
Author

So we added close() calls on all tensors and the number of Pointer$NativeDeallocator instances went down. But the thread synchronization is still going on and there is no noticeable impact on latencies.

We are doing this load testing on an AWS c5.4xlarge (well, that is the host for our K8S pods).

One other thing I was considering is preallocating and reusing the tensors. We could put an upper bound on tensor length and preallocate them. I'm not sure this would work with the runtime, though, and on reflection it is a bit optimistic because the bookkeeping would be hard. Also, the current implementation has no mutator for the tensors.

@Craigacp
Collaborator

Craigacp commented May 4, 2021

> So we added close() calls on all tensors and the number of Pointer$NativeDeallocator instances went down. But the thread synchronization is still going on and there is no noticeable impact on latencies.
>
> We are doing this load testing on an AWS c5.4xlarge (well, that is the host for our K8S pods).
>
> One other thing I was considering is preallocating and reusing the tensors. We could put an upper bound on tensor length and preallocate them. I'm not sure this would work with the runtime, though, and on reflection it is a bit optimistic because the bookkeeping would be hard. Also, the current implementation has no mutator for the tensors.

You can write values directly into the tensors when they are cast to their appropriate type (e.g. TFloat32 or TUint8). They have set methods inherited from the NdArray subinterface.

A c5.4xlarge shouldn't be big enough to exhibit NUMA effects, as I'd expect it to be on a single socket, so at least it's not that.
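For example, something like this (a rough sketch; the shape and the values written are placeholders):

import org.tensorflow.ndarray.Shape;
import org.tensorflow.types.TInt64;

// Allocate a tensor once for a fixed shape and overwrite its contents
// through the NdArray setters before each run.
try (TInt64 features = TInt64.tensorOf(Shape.of(200))) {
  for (long i = 0; i < 200; i++) {
    features.setLong(i * 7L, i); // setLong(value, coordinates...) from LongNdArray
  }
  // ... feed `features` to the session runner here ...
}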

@nebulorum
Author

> You can write values directly into the tensors when they are cast to their appropriate type (e.g. TFloat32 or TUint8). They have set methods inherited from the NdArray subinterface.

If I go down this route, is it possible to also update the shape of the tensor? My thinking is that, given a model, we know how many tensors we need and their maximum size. Allocate them once for the maximum size, set the data, and update the dimensions. No allocation needed.

@nebulorum
Author

I tried varying the capacity to find the breaking point.

VU    RPS   p50      p95
300   277   70 ms    180 ms
200   190   40 ms    100 ms
100    97   24 ms     86 ms
 50    48   18 ms     37 ms

The composite below shows the JavaCPP thread and some of the 32 worker threads at different load levels (300, 200, 100 VU).

[screenshot: thread states at 300, 200 and 100 VU]

The smaller the load, the less common the full column of synchronized activity is. At higher load the contention looks a lot worse. Note that the window ticks on the graph are 2 seconds apart and light blue is the monitor state. Even at low loads the synchronization appears. At 300 VU, K8S reported 6.27 CPUs of usage.

We will try smaller thread pools, but there seems to be some interesting interplay with the JavaCPP thread.

saudet added a commit to bytedeco/javacpp that referenced this issue May 5, 2021
@saudet
Contributor

saudet commented May 5, 2021

I've added a way to avoid any synchronized code at runtime in commit bytedeco/javacpp@d788390. Please give it a try with JavaCPP 1.5.6-SNAPSHOT and the "org.bytedeco.javacpp.nopointergc" system property set to "true". If you still see a "JavaCPP Deallocator" in your threads, that means the system property was not set early enough, so please try again, possibly with something like export _JAVA_OPTIONS=-Dorg.bytedeco.javacpp.nopointergc=true in a Bash profile somewhere.

@nebulorum
Author

I tried these parameters and the artifact, and the "JavaCPP Deallocator" thread is nowhere to be found. But the behaviour did not improve. I also tested 8 threads and then it looks like a synchronised swimming team :)

On the positive side, the machine does not seem unstable, and at 8 threads I have the same latency as with 32 threads at 1000 virtual users. The difference is the RPS: 8 -> 400 RPS, 32 -> 650 RPS.

I'm considering testing one layer at a time to confirm where the problem is, but this will take some time. Maybe we can construct an example that just loops over allocation and check whether the contention appears.

@nebulorum
Author

nebulorum commented May 5, 2021

I created a small example that allocates and closes tensors with several thread counts. I also added a simple latency check: https://gist.github.com/nebulorum/f7978aa5519cab8bece65d4dac689d4f

The main loop allocates 80 tensors based on a single array, sleeps for 5 ms, then closes all the tensors (sketched below).
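The inner loop is roughly like this (a sketch, not the exact gist code):

import org.tensorflow.types.TInt64;

// Allocate 80 tensors from the same backing array, pause 5 ms, then close them all.
static void allocateOnce(long[] data) throws InterruptedException {
  TInt64[] tensors = new TInt64[80];
  for (int i = 0; i < tensors.length; i++) {
    tensors[i] = TInt64.vectorOf(data); // one 200-element tensor per slot
  }
  Thread.sleep(5); // the 5 ms pause mentioned above
  for (TInt64 t : tensors) {
    t.close(); // deterministic release of the native memory
  }
}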

On my dual-core i5 Mac (4 hyper-threads, if I remember correctly) with 32 threads, we get >2500 allocations per second, and occasionally an allocation takes more than 200 ms.

nAlloc 2076 (total):  1206 (<10ms);  675 (<20ms);  106 (<30ms);  27 (<40ms);  6 (<50ms);  20 (<100ms);  4 (<200ms);  32 ( >200ms)
nAlloc 2854 (total):  1761 (<10ms);  875 (<20ms);  116 (<30ms);  49 (<40ms);  13 (<50ms);  40 (<100ms);  1 (<200ms);  - ( >200ms)
nAlloc 2960 (total):  2382 (<10ms);  444 (<20ms);  77 (<30ms);  16 (<40ms);  2 (<50ms);  5 (<100ms);  1 (<200ms);  32 ( >200ms)
nAlloc 3960 (total):  3731 (<10ms);  157 (<20ms);  30 (<30ms);  11 (<40ms);  - (<50ms);  - (<100ms);  32 (<200ms);  - ( >200ms)

With 4 worker threads you see allocation taking 50-100 ms:

nAlloc 546 (total):  538 (<10ms);  10 (<20ms);  1 (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 540 (total):  525 (<10ms);  9 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  4 (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 621 (total):  620 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 517 (total):  496 (<10ms);  16 (<20ms);  1 (<30ms);  - (<40ms);  - (<50ms);  4 (<100ms);  - (<200ms);  - ( >200ms)

On a single thread you can still see 50-100ms allocation but much less often:

nAlloc 147 (total):  148 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 144 (total):  144 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 136 (total):  130 (<10ms);  6 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 122 (total):  106 (<10ms);  14 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  1 (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 130 (total):  121 (<10ms);  9 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)

If you remove the pause, a single thread goes to 2500 allocations per second, but you start seeing some slower allocations, though only as single occurrences.

@saudet
Contributor

saudet commented May 6, 2021

JavaCPP can allocate and deallocate 2 MB over 2000 * 24 times from 24 threads running concurrently on a typical laptop in about 50 ms, see deepjavalibrary/djl#690 (comment). That means ~1 microsecond for each allocation/deallocation cycle, so I think we can safely assume that the issue isn't with lock contention in JavaCPP. Finally, we can put that to rest. Thank you for working on this!

@nebulorum
Author

nebulorum commented May 6, 2021

I think there is still contention around JavaCPP. Even with a single thread you will see latency spikes, but with a lot of threads, if you are unlucky, 3% of your allocations take more than 50 ms. There is even a drop in allocations per second.

Under normal latency tolerances of 50 ms or more this is OK, but in our use case we aim for less than 40 ms. On TensorFlow Java 1.3.1 our P999 is under 40 ms. With the new 0.3.1 we can't even reach this at the P50 under load.

The problem is not the average allocation time but the outliers. The more threads you have, the worse it becomes. If you look at the first example, you can see 32 threads taking more than 100 ms to allocate, and this happened 2 seconds apart. This could be interplay with GC too.

Here is a single thread with no sleep:

nAlloc 2478 (total):  2478 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 1693 (total):  1692 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  1 ( >200ms)
nAlloc 2797 (total):  2798 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 2884 (total):  2883 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 2378 (total):  2377 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  1 (<200ms);  - ( >200ms)

You can see in the second row a >200 ms wait, and the allocations per second are 33% lower. As a reminder, each allocation is 80 tensors.

@saudet
Contributor

saudet commented May 6, 2021

Right, ok, but the contention cannot happen in JavaCPP since I've removed all of that in commit bytedeco/javacpp@d788390. Something else in TF Java is happening that is unrelated to JavaCPP.

@nebulorum
Author

> Right, ok, but the contention cannot happen in JavaCPP since I've removed all of that in commit bytedeco/javacpp@d788390. Something else in TF Java is happening that is unrelated to JavaCPP.

I think there is still synchronization on the add and remove, plus the concurrent data structure where JavaCPP does the adding and removing.

I did a run with the 1.5.6-SNAPSHOT and -Dorg.bytedeco.javacpp.nopointergc=true; in addition I added more heap memory and enabled GC output (-Xlog:gc -XX:MaxRAMPercentage=40 -XX:InitialRAMPercentage=40). No pause between allocations.

The results are interesting:

  • The variance in the number of allocations is really large. On a single thread it goes as high as 5K, sometimes around 2K, and sometimes down to the low hundreds.
  • GC does kick in, but it does not seem to take too long.
  • With more threads the variance is large and you can see groups of high latency (e.g. 12 allocations with 100-200 ms latency).
  • Since this code just allocates and closes tensors, model evaluation, sessions and other TF components should have no impact.
  • With 16 threads the number of outliers is higher (about 2-3%).

Single Thread:

[125.007s][info][gc] GC(6) Pause Young (Normal) (G1 Evacuation Pause) 3632M->8M(6554M) 28.763ms
nAlloc 2449 (total):  2445 (<10ms);  - (<20ms);  3 (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  1 (<200ms);  - ( >200ms)
nAlloc 2959 (total):  2956 (<10ms);  - (<20ms);  1 (<30ms);  1 (<40ms);  - (<50ms);  1 (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 3111 (total):  3106 (<10ms);  4 (<20ms);  1 (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 2757 (total):  2753 (<10ms);  4 (<20ms);  1 (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 2860 (total):  2858 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  1 (<100ms);  - (<200ms);  - ( >200ms)
...
[156.620s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 3930M->8M(6554M) 14.136ms
nAlloc 1015 (total):  1003 (<10ms);  3 (<20ms);  5 (<30ms);  2 (<40ms);  - (<50ms);  1 (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 406 (total):  384 (<10ms);  8 (<20ms);  5 (<30ms);  3 (<40ms);  2 (<50ms);  4 (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 352 (total):  337 (<10ms);  4 (<20ms);  2 (<30ms);  2 (<40ms);  1 (<50ms);  4 (<100ms);  2 (<200ms);  - ( >200ms)
nAlloc 292 (total):  274 (<10ms);  5 (<20ms);  4 (<30ms);  2 (<40ms);  1 (<50ms);  6 (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 222 (total):  208 (<10ms);  5 (<20ms);  4 (<30ms);  - (<40ms);  1 (<50ms);  3 (<100ms);  - (<200ms);  1 ( >200ms)
nAlloc 282 (total):  264 (<10ms);  11 (<20ms);  1 (<30ms);  - (<40ms);  1 (<50ms);  2 (<100ms);  3 (<200ms);  - ( >200ms)
nAlloc 119 (total):  111 (<10ms);  1 (<20ms);  1 (<30ms);  1 (<40ms);  1 (<50ms);  1 (<100ms);  1 (<200ms);  2 ( >200ms)

With 4 threads (which also showed a strange behaviour of taking 49 seconds before I saw any allocations):

[267.597s][info][gc] GC(10) Pause Young (Normal) (G1 Evacuation Pause) 3930M->8M(6554M) 7.855ms
nAlloc 1054 (total):  999 (<10ms);  16 (<20ms);  9 (<30ms);  3 (<40ms);  2 (<50ms);  10 (<100ms);  12 (<200ms);  3 ( >200ms)
nAlloc 1454 (total):  1399 (<10ms);  20 (<20ms);  20 (<30ms);  6 (<40ms);  2 (<50ms);  6 (<100ms);  1 (<200ms);  - ( >200ms)
nAlloc 1552 (total):  1479 (<10ms);  31 (<20ms);  12 (<30ms);  7 (<40ms);  2 (<50ms);  11 (<100ms);  6 (<200ms);  4 ( >200ms)
nAlloc 2221 (total):  2149 (<10ms);  21 (<20ms);  19 (<30ms);  7 (<40ms);  9 (<50ms);  11 (<100ms);  5 (<200ms);  - ( >200ms)
nAlloc 1165 (total):  1122 (<10ms);  17 (<20ms);  8 (<30ms);  4 (<40ms);  1 (<50ms);  5 (<100ms);  3 (<200ms);  5 ( >200ms)
nAlloc 1501 (total):  1437 (<10ms);  24 (<20ms);  14 (<30ms);  4 (<40ms);  2 (<50ms);  10 (<100ms);  9 (<200ms);  1 ( >200ms)
nAlloc 2853 (total):  2786 (<10ms);  28 (<20ms);  10 (<30ms);  6 (<40ms);  9 (<50ms);  7 (<100ms);  7 (<200ms);  - ( >200ms)
nAlloc 6840 (total):  6809 (<10ms);  17 (<20ms);  7 (<30ms);  2 (<40ms);  1 (<50ms);  4 (<100ms);  1 (<200ms);  - ( >200ms)

With 16 threads:

[536.188s][info][gc] GC(35) Pause Young (Normal) (G1 Evacuation Pause) 3938M->8M(6554M) 3.820ms
nAlloc 1663 (total):  1591 (<10ms);  16 (<20ms);  8 (<30ms);  1 (<40ms);  - (<50ms);  4 (<100ms);  13 (<200ms);  30 ( >200ms)
nAlloc 2997 (total):  2896 (<10ms);  17 (<20ms);  7 (<30ms);  3 (<40ms);  2 (<50ms);  8 (<100ms);  28 (<200ms);  36 ( >200ms)
nAlloc 2279 (total):  2186 (<10ms);  22 (<20ms);  6 (<30ms);  2 (<40ms);  3 (<50ms);  6 (<100ms);  26 (<200ms);  28 ( >200ms)
nAlloc 2963 (total):  2844 (<10ms);  34 (<20ms);  15 (<30ms);  4 (<40ms);  - (<50ms);  9 (<100ms);  34 (<200ms);  23 ( >200ms)
nAlloc 3228 (total):  3097 (<10ms);  30 (<20ms);  8 (<30ms);  10 (<40ms);  4 (<50ms);  16 (<100ms);  36 (<200ms);  27 ( >200ms)

@nebulorum
Author

Just as a baseline, I did only JVM allocations. GC goes crazy and we do see some contention, but at 10x the allocation rate.

 private void allocLoop() {
    long[] nd = new long[200];
    for (var i = 0; i < 80; i++) {
      long[] nd2 = new long[200]; // heap-only allocation standing in for one tensor
    }
  }

  • Once the heap is at its full size you see variance on a single thread, but not as bad.
  • With multiple threads the number of outliers grows, but again the baseline is much higher.

Benchmarking is hard :)

Either way, just adding the tensor allocation layer costs 10x.

1 thread:

[64.609s][info][gc] GC(47) Pause Young (Normal) (G1 Evacuation Pause) 3937M->7M(6554M) 6.470ms
nAlloc 17489 (total):  17479 (<10ms);  5 (<20ms);  1 (<30ms);  1 (<40ms);  - (<50ms);  2 (<100ms);  1 (<200ms);  - ( >200ms)
nAlloc 9884 (total):  9864 (<10ms);  12 (<20ms);  2 (<30ms);  2 (<40ms);  2 (<50ms);  2 (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 16051 (total):  16044 (<10ms);  3 (<20ms);  1 (<30ms);  1 (<40ms);  - (<50ms);  2 (<100ms);  - (<200ms);  - ( >200ms)
[67.497s][info][gc] GC(48) Pause Young (Normal) (G1 Evacuation Pause) 3937M->7M(6554M) 3.663ms
nAlloc 7471 (total):  7454 (<10ms);  10 (<20ms);  3 (<30ms);  - (<40ms);  2 (<50ms);  1 (<100ms);  1 (<200ms);  - ( >200ms)
nAlloc 12057 (total):  12040 (<10ms);  10 (<20ms);  2 (<30ms);  1 (<40ms);  2 (<50ms);  2 (<100ms);  - (<200ms);  - ( >200ms)
[69.543s][info][gc] GC(49) Pause Young (Normal) (G1 Evacuation Pause) 3937M->7M(6554M) 2.588ms
nAlloc 31432 (total):  31452 (<10ms);  3 (<20ms);  1 (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)

16 threads:

[71.344s][info][gc] GC(51) Pause Young (Normal) (G1 Evacuation Pause) 3936M->6M(6554M) 2.796ms
nAlloc 25410 (total):  25337 (<10ms);  16 (<20ms);  3 (<30ms);  2 (<40ms);  1 (<50ms);  2 (<100ms);  27 (<200ms);  24 ( >200ms)
[72.515s][info][gc] GC(52) Pause Young (Normal) (G1 Evacuation Pause) 3936M->6M(6554M) 3.517ms
nAlloc 24546 (total):  24444 (<10ms);  18 (<20ms);  16 (<30ms);  4 (<40ms);  9 (<50ms);  5 (<100ms);  15 (<200ms);  34 ( >200ms)
[74.215s][info][gc] GC(53) Pause Young (Normal) (G1 Evacuation Pause) 3936M->6M(6554M) 3.105ms
nAlloc 18327 (total):  18268 (<10ms);  11 (<20ms);  5 (<30ms);  2 (<40ms);  1 (<50ms);  3 (<100ms);  7 (<200ms);  36 ( >200ms)
[75.341s][info][gc] GC(54) Pause Young (Normal) (G1 Evacuation Pause) 3936M->6M(6554M) 19.630ms
nAlloc 31495 (total):  31387 (<10ms);  15 (<20ms);  6 (<30ms);  1 (<40ms);  3 (<50ms);  12 (<100ms);  41 (<200ms);  24 ( >200ms)

@nebulorum
Author

We've been testing some more. We tried running our server without TensorFlow but including data access to Cassandra, and the same config can handle 380 RPS with a 34 ms P99. So our time budget is tight, but if TF added only 4-5 ms we would be OK. For full transparency, adding a 5 ms sleep makes the P99 go up to 70 ms, but this probably means we went over the RPS capacity. We will clean up the allocation logic to make it generate less garbage and take less time.

@saudet
Contributor

saudet commented May 6, 2021

> I think there is still synchronization on the add and remove, plus the concurrent data structure where JavaCPP does the adding and removing.

No, add() never gets called, remove() returns before entering the synchronized {} block, and nothing else is synchronized.

@saudet
Contributor

saudet commented May 7, 2021

I've replaced TInt64 with LongPointer in your AllocateStress as per this gist:
https://gist.github.com/saudet/6f020ecd6a5660762ff7535d65e5f69e
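The substitution is roughly this (a sketch, not the exact gist contents):

import org.bytedeco.javacpp.LongPointer;

// Allocate raw native buffers through JavaCPP instead of TF tensors,
// to isolate whether the latency comes from JavaCPP or from TF_AllocateTensor.
LongPointer[] buffers = new LongPointer[80];
for (int i = 0; i < buffers.length; i++) {
  buffers[i] = new LongPointer(200); // 200 longs of native memory
}
for (LongPointer p : buffers) {
  p.close(); // frees the native memory deterministically
}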

And I get the following results. I think that looks fine. Do you see anything wrong with those numbers?

nAlloc 6015 (total):  6014 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5939 (total):  5907 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5993 (total):  5999 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5940 (total):  5902 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5992 (total):  5993 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5927 (total):  5894 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5900 (total):  5891 (<10ms);  15 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5971 (total):  5948 (<10ms);  17 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5904 (total):  5872 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5962 (total):  5967 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5925 (total):  5888 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5996 (total):  6004 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5925 (total):  5885 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5961 (total):  5961 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5943 (total):  5913 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 6001 (total):  6002 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5969 (total):  5934 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)

@nebulorum
Author

That is an interesting find, and it is kind of what I was expecting: low variance. Are the data types equivalent? Or is this showing something else going on inside the tensor?

@nebulorum
Author

nebulorum commented May 7, 2021

Your test made me try something different: I used async-profiler in IntelliJ, and maybe the problem is on the other side of JNI. Here is an example flame graph; a lot of frames are in Java_org_tensorflow_internal_c_1api_global_tensorflow_TF_1AllocateTensor__I_3JIJ. There is even some __unlock_wait.
[screenshot: flame graph]

This would be in line with your test. Allocating tensors may be generating contention on the C++ side.

@saudet
Contributor

saudet commented May 7, 2021

A mutex does get used here when collecting stats:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/cpu_allocator_impl.cc
Are you sure that hasn't been enabled somehow?

@nebulorum
Author

nebulorum commented May 7, 2021

We did not do any configuration of TensorFlow; our flow is the following (a sketch appears after the list):

  1. Load the model bundle.
  2. Get the session from the model (once, each time the model gets updated).
  3. Get a runner from the session per request.
  4. Allocate tensors and run the model.
  5. Copy the answer.
  6. Close all tensors.
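Roughly, in code (a sketch assuming the 0.3.x API; the model path and the "input"/"output" names are placeholders, not our real signature):

import java.util.List;
import org.tensorflow.SavedModelBundle;
import org.tensorflow.Session;
import org.tensorflow.Tensor;
import org.tensorflow.types.TInt64;

SavedModelBundle bundle = SavedModelBundle.load("/models/my-model", "serve"); // 1
Session session = bundle.session();                                           // 2

try (TInt64 input = TInt64.vectorOf(new long[200])) {                         // 4: allocate
  List<Tensor> outputs = session.runner()                                     // 3: runner per request
      .feed("input", input)
      .fetch("output")
      .run();                                                                 // 4: run the model
  // 5: copy the answer out of outputs.get(0), then
  // 6: close every fetched tensor as well
  for (Tensor out : outputs) {
    out.close();
  }
}                                                                             // 6: input closed here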

We followed the 1.3.1 implementation for this, so maybe we got something wrong.

We have no TensorFlow-specific config or flags, and in the Dockerfile we just give some params to the JVM:

java -server -XX:MaxRAMPercentage=70 -XX:InitialRAMPercentage=70 \
 -Xlog:gc -XX:MaxGCPauseMillis=20 -XX:+AlwaysPreTouch -XX:+DisableExplicitGC\
 -XX:+ExitOnOutOfMemoryError \
 -jar /service.jar

PS: Maybe the title of this ticket no longer makes sense.

@dennisTS

dennisTS commented May 7, 2021

And this is a specific example of a blocked condition; one of the worker threads was waiting (for ~20 ms) for the other one to release the monitor inside the deallocator:
[screenshot: thread dump detail]

@saudet
Contributor

saudet commented May 8, 2021

@dennisTS I've just shown above that this doesn't happen with 1.5.6-SNAPSHOT.

saudet added a commit to bytedeco/javacpp that referenced this issue May 9, 2021
@saudet
Contributor

saudet commented May 9, 2021

> We have no TensorFlow-specific config or flags, and in the Dockerfile we just give some params to the JVM:
>
> java -server -XX:MaxRAMPercentage=70 -XX:InitialRAMPercentage=70 \
>  -Xlog:gc -XX:MaxGCPauseMillis=20 -XX:+AlwaysPreTouch -XX:+DisableExplicitGC \
>  -XX:+ExitOnOutOfMemoryError \
>  -jar /service.jar

Where are you setting the "org.bytedeco.javacpp.nopointergc" system property to "true"?

There's one more place where synchronization could potentially take place when classes are unloaded, for the WeakHashMap used inside the call to sizeof() in Pointer.DeallocatorReference, but I have a hard time imagining in what kind of application that would actually matter. In any case, I've "fixed" that in commit bytedeco/javacpp@0ce97fc .

@nebulorum
Author

> Where are you setting the "org.bytedeco.javacpp.nopointergc" system property to "true"?

That was collected from our production config. If we can run the SNAPSHOT version in production we will do it. We need to release a version of this for some testing.

But this was more in response to this question:

> A mutex does get used here when collecting stats: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/cpu_allocator_impl.cc
> Are you sure that hasn't been enabled somehow?

No, I don't know how I would have turned on the mutex. We are not using any additional configuration. Maybe we should make sure the mutex is not on.

As your example showed, if tensors are not allocated, performance is pretty stable. So maybe the issue is not in JavaCPP.

nebulorum changed the title from "Documentation on Session and Resource management" to "Large latency on Tensor allocation" on May 10, 2021
@saudet
Contributor

saudet commented May 11, 2021

Using TF Java 0.3.1 and JavaCPP 1.5.6-SNAPSHOT with "noPointerGC", the original AllocateStress also looks OK to me:

nAlloc 5621 (total):  5586 (<10ms);  34 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5678 (total):  5668 (<10ms);  10 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5663 (total):  5651 (<10ms);  15 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5720 (total):  5699 (<10ms);  20 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5613 (total):  5601 (<10ms);  11 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5776 (total):  5766 (<10ms);  9 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5640 (total):  5600 (<10ms);  41 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5660 (total):  5650 (<10ms);  13 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5667 (total):  5655 (<10ms);  9 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5627 (total):  5593 (<10ms);  33 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5718 (total):  5706 (<10ms);  12 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5658 (total):  5650 (<10ms);  10 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5638 (total):  5615 (<10ms);  22 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5648 (total):  5631 (<10ms);  27 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5637 (total):  5603 (<10ms);  23 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5640 (total):  5608 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5731 (total):  5707 (<10ms);  24 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)

> That was collected from our production config. If we can run the SNAPSHOT version in production we will do it. We need to release a version of this for some testing.

If you're satisfied with JavaCPP 1.5.6-SNAPSHOT, but need a release, I can do a 1.5.5-1 or something with the fix, but before doing that please make sure that there isn't something else that you may want updated as well. Thanks for testing!

@dennisTS

Hi @saudet , thank you! We'll check it and get back to you

@nebulorum
Author

nebulorum commented May 11, 2021

We ran this (well, @dennisTS did) in our test rig and things improved a lot:

At 600 RPS against 8 CPUs we are getting p99.9 = 48 ms and p95 = 24 ms.

So there is a real improvement using this version and option.

As a follow-up what do we lose by using nopointergc?

@karllessard
Collaborator

The JavaCPP GC thread is mostly useful for collecting unclosed resources in eager mode, which is used more for debugging than for high-performance production work. If you are running sessions from a saved model graph, you normally don't need an eager session and can live with nopointergc without issue.

@karllessard
Collaborator

@saudet , is there a way to completely turn off GC support in JavaCPP programmatically instead of passing a parameter on the command line? I guess we could set the value of this system property directly in TF Java before loading the TensorFlow runtime library, but how do we guarantee we can do it before the JavaCPP libraries load statically?

@nebulorum
Author

Now that we've been running for some hours in test, we would really like to release to production. We will check with security whether we can use the snapshot, but how long would it take to release a version of JavaCPP?

@saudet
Contributor

saudet commented May 12, 2021

> @saudet , is there a way to completely turn off GC support in JavaCPP programmatically instead of passing a parameter on the command line? I guess we could set the value of this system property directly in TF Java before loading the TensorFlow runtime library, but how do we guarantee we can do it before the JavaCPP libraries load statically?

Not really possible, unless TF Java is the only library using JavaCPP in the app, and if the user has other libraries using JavaCPP, it's probably not a good idea to tamper with global settings like that anyway. In any case, this is the kind of information that is useful for optimization, and should be part of the documentation on a page, for example, like those here:
https://deeplearning4j.konduit.ai/config/config-memory
https://netty.io/wiki/reference-counted-objects.html

That said, I could add a parameter to Pointer.deallocator() that would let users specify whether they want to keep it out of the reference queue or not. The deallocator thread would still be there by default, but wouldn't do anything.

> Now that we've been running for some hours in test, we would really like to release to production. We will check with security whether we can use the snapshot, but how long would it take to release a version of JavaCPP?

A couple of days, but like I said, let's make sure there isn't anything else we want to put in there...

@nebulorum
Author

We've been running the SNAPSHOT for some days and it seems OK. Did the final version get released?

@saudet
Contributor

saudet commented May 31, 2021

> We've been running the SNAPSHOT for some days and it seems OK. Did the final version get released?

You mean JavaCPP? No, do you need one?

BTW, latency may potentially be even lower with TF Lite, so if your models are compatible with it, please give it a try:

Those are currently wrappers for the C++ API, but if you would like to use an idiomatic high-level API instead, please let us know!

@dennisTS

dennisTS commented Jul 5, 2021

@saudet, sorry for getting back to you with such a delay - somehow forgot about this thread

> You mean JavaCPP? No, do you need one?

Well, it would be nice - using SNAPSHOT in production just doesn't feel good :)

> BTW, latency may potentially be even lower with TF Lite, so if your models are compatible with it, please give it a try:

We are using the higher-level APIs now, but generally even with the SNAPSHOT version of JavaCPP the latencies are good for us.
