
Large latency on Tensor allocation #313

Closed
nebulorum opened this issue May 3, 2021 · 43 comments · Fixed by #361

Comments

@nebulorum

Describe the documentation issue

Not sure this is the correct forum, but I would like some guidance on how to set up sessions; guidance on resource management would also be interesting.

After two weeks of trying to understand why latencies in 0.3.1 were completely uncontrollable (compared to the official 1.3.1), I ran into #208. It matches my observations.

We are trying to run prediction on models with thousands of data points spread across different tensors per prediction. Memory allocation on the threads is around 20 MB/s, and there seems to be synchronization between the JavaCPP allocation thread and our worker threads. In addition, allocating even a size-1 tensor seems to be very slow (in the 7 ms range).

After reading #208, it seems we are doing everything wrong. But I don't really have a clear picture of how it should be done: Would EagerSession help? Could I use a Session per HTTP request? Should I allocate larger multi-dimensional tensors instead of single ones? How should I configure thread pools? I understand that the API is a work in progress, but the current documentation is very light on this kind of guidance.

I don't think this is a bug, but I can convert it into some other sort of issue.

@Craigacp
Collaborator

Craigacp commented May 3, 2021

Session should be long lived as recreating it involves a bunch of allocations and other things. If you can batch up the incoming requests then that will allow the TF native runtime to perform more efficient batched matrix multiplies/convolutions (in many cases). I think the Session object should be thread safe so you can use it concurrently from multiple threads.
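For example, something along these lines (a rough sketch, with a placeholder model path and tag, not a prescription):

import org.tensorflow.SavedModelBundle;
import org.tensorflow.Session;

// Sketch: load the bundle once at startup and share the Session across
// request threads, instead of recreating either of them per request.
public final class ModelHolder {
  // The path and tag below are placeholders for your own saved model.
  private static final SavedModelBundle BUNDLE =
      SavedModelBundle.load("/models/my-model", "serve");
  private static final Session SESSION = BUNDLE.session();

  private ModelHolder() {}

  public static Session session() {
    return SESSION;
  }
}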

We should probably turn off the JavaCPP memory allocation size check and let the GC clear it up; if you're creating lots of Java-side objects, GC should run frequently enough to clear things out anyway. Are you hitting the thread.sleep inside JavaCPP's allocator?

Note if you're making lots of sessions out of saved model bundles you might be hitting some memory leaks inside TF's C API (which the TF Java API uses to interact with the native library). I opened this issue (tensorflow/tensorflow#48802) upstream but we've not had a response yet.

@nebulorum
Author

Running the JVM under a profiler, I never see the JavaCPP deallocation thread sleep. It is either in the wait, monitor, or running state. Under lower load it spends most of its time waiting, and then the monitor/running phases sync up with the worker threads.

We have a simple model that can run in 3 ms under low load. Our overall call should complete in under 40 ms, but when load increases we clearly see waits in our telemetry. Model evaluation is still quick, but something is causing contention in the JVM.

After reviewing the tensor construction, it seems that for each request we create 80 tensors of length 200, some String, most Long. Our feature mapping may be generating too much garbage on the side, but after running for a while about ~2% of the heap is org.bytedeco.javacpp.Pointer$DeallocatorReference and another ~2% is org.tensorflow.internal.c_api.AbstractTF_Tensor$DeleteDeallocator.

We also see contention on the JavaCPP deallocation thread. It is the yellow thread in the picture below. The other threads are some of our worker threads:

[screenshot: profiler thread timeline]

You can clearly see several threads block on a monitor. The image below shows that this sync pattern happens repeatedly:
[screenshot: profiler thread timeline]

After reviewing some of our thread dumps we also see threads blocking like this:

java.lang.Thread.State: BLOCKED (on object monitor)
	at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:666)
	- waiting to lock <0x00000005330184f8> (a java.lang.Class for org.bytedeco.javacpp.Pointer$DeallocatorThread)
	at org.bytedeco.javacpp.Pointer.init(Pointer.java:127)
	at org.bytedeco.javacpp.BytePointer.allocateArray(Native Method)
	at org.bytedeco.javacpp.BytePointer.<init>(BytePointer.java:124)
	at org.bytedeco.javacpp.BytePointer.<init>(BytePointer.java:97)
	at org.tensorflow.internal.buffer.ByteSequenceTensorBuffer$InitDataWriter.writeNext(ByteSequenceTensorBuffer.java:129)

Or:

at org.bytedeco.javacpp.Pointer.physicalBytes(Native Method)
	at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:665)
	at org.tensorflow.internal.c_api.AbstractTF_Tensor.allocateTensor(AbstractTF_Tensor.java:91)
	at org.tensorflow.RawTensor.allocate(RawTensor.java:197)


Looking at org.bytedeco.javacpp.Pointer.java:666 we can see it's a synchronized block, and Pointer has several synchronized blocks when adding and removing deallocators. I suspect that as 200 requests per second come in, each allocating 80 tensors, we run into contention on the JavaCPP thread, which causes the monitor synchronization and blows our latency through the roof. This does not happen in the 1.3.1 library, but we wanted to move to 2.4.1.

I now question whether this is a documentation issue, a usage issue, or a bug/feature issue, and how to work around this behaviour.

@saudet
Contributor

saudet commented May 4, 2021

@nebulorum If you see some output when running your app with something like shown at #251 (comment), then it means you're forgetting to call close() on those objects. You'll need to make sure to call close() on them to avoid the costs associated with the GC fallback. In other words, if you call close() everywhere you can, you will not see any overhead. That responsibility is on you! This has nothing to do with JavaCPP; it is only trying to cope with code that isn't calling close() at all the appropriate places. The close() method is a standard Java SE method from the AutoCloseable interface. I do not see the point of writing additional documentation about that.
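For example (a minimal sketch, assuming a TInt64 input built from a long[]):

import org.tensorflow.types.TInt64;

// Close the tensor deterministically instead of leaving it to the GC fallback.
long[] features = new long[200];
try (TInt64 input = TInt64.vectorOf(features)) {
  // ... feed `input` to a Session.Runner and copy out the results here ...
} // input.close() runs here, releasing the native memory immediately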

@Craigacp
Collaborator

Craigacp commented May 4, 2021

The synchronized block in Pointer is going to cause contention on the thread, especially if you're allocating from different threads on different CPUs (or worse still sockets) as it'll cause the object to bounce around. I'm not sure that 200 requests per second is enough to trigger real issues there, but the fact that the Pointer is using a linked list (which is also not particularly multicore friendly) could also be causing trouble. What's the execution environment like (e.g. num CPUs, type of CPUs, num NUMA nodes, is the JVM restricted to a subset thereof)?

@nebulorum
Author

> The synchronized block in Pointer is going to cause contention on the thread, especially if you're allocating from different threads on different CPUs (or worse still sockets) as it'll cause the object to bounce around. I'm not sure that 200 requests per second is enough to trigger real issues there, but the fact that the Pointer is using a linked list (which is also not particularly multicore friendly) could also be causing trouble. What's the execution environment like (e.g. num CPUs, num NUMA nodes, is the JVM restricted to a subset thereof)?

We are running on AWS, CPU only, no co-processors. What has baffled us is that 1.3.1 has very stable latency and that code is 3 years old. This new endpoint can be faster, but it degrades really quickly as load increases.

I have considered having a dedicated tensor-allocating thread so that we have fewer threads fighting for the lock. But this will obviously create a bottleneck.

@karllessard
Collaborator

@saudet , if I understand correctly, this synchronization will happen every time a resource is deallocated, no matter whether the amount of allocated memory is getting high or not. It would be interesting to completely disable the feature where JavaCPP keeps track of the amount of memory allocated and avoid this thread locking (it looks like it is not avoidable here, even if you set maxPhysicalBytes, maxRetries or maxBytes to 0).

Since that's a protected method, I can override it and run a test to see if there is any gain. @nebulorum , do you have any code available to share to run this benchmark?

@Craigacp
Collaborator

Craigacp commented May 4, 2021

> The synchronized block in Pointer is going to cause contention on the thread, especially if you're allocating from different threads on different CPUs (or worse still sockets) as it'll cause the object to bounce around. I'm not sure that 200 requests per second is enough to trigger real issues there, but the fact that the Pointer is using a linked list (which is also not particularly multicore friendly) could also be causing trouble. What's the execution environment like (e.g. num CPUs, num NUMA nodes, is the JVM restricted to a subset thereof)?
>
> We are running on AWS, CPU only, no co-processors. What has baffled us is that 1.3.1 has very stable latency and that code is 3 years old. This new endpoint can be faster, but it degrades really quickly as load increases.
>
> I have considered having a dedicated tensor-allocating thread so that we have fewer threads fighting for the lock. But this will obviously create a bottleneck.

Ok, how big is your AWS instance? If there are NUMA effects from multiple CPU sockets (or within a socket for some of the AMD EPYC variants) then this could slow things down, and having a single thread doing the input tensor creation would help (though given there are lots of other things that create JavaCPP Pointers during a session.run call it won't prevent everything from bouncing the locking object across sockets).

As Karl says, we should look at this locking on our side to see if we can improve things, but in the meantime pinning the threads if you're on a large machine might help.

@nebulorum
Author

> @nebulorum If you see some output when running your app with something like shown at #251 (comment), then it means you're forgetting to call close() on those objects. You'll need to make sure to call close() on them to avoid the costs associated with the GC fallback. In other words, if you call close() everywhere you can, you will not see any overhead. That responsibility is on you! This has nothing to do with JavaCPP; it is only trying to cope with code that isn't calling close() at all the appropriate places. The close() method is a standard Java SE method from the AutoCloseable interface. I do not see the point of writing additional documentation about that.

We checked in integration: if we call gc() in the test, we do see the messages referred to in #251. We will try to close all tensors before leaving the model execution part and see if this improves things.

While we did have a facepalm moment when you mentioned AutoCloseable, I would suggest that the example include a dbx.close() call.

Making sure people are aware they need to do proper resource management to ensure performance would be helpful, especially since this is not the first issue around the topic.

@saudet
Contributor

saudet commented May 4, 2021

> @saudet , if I understand correctly, this synchronization will happen every time a resource is deallocated, no matter whether the amount of allocated memory is getting high or not. It would be interesting to completely disable the feature where JavaCPP keeps track of the amount of memory allocated and avoid this thread locking (it looks like it is not avoidable here, even if you set maxPhysicalBytes, maxRetries or maxBytes to 0).

Sigh... sure, I can probably do that when the "org.bytedeco.javacpp.nopointergc" system property is set to "true" so that you can all stop blaming me for bugs in your code! :) No one has ever encountered any issues like that with JavaCPP in the past. This is a problem in TF Java.

@Craigacp
Collaborator

Craigacp commented May 4, 2021

> @saudet , if I understand correctly, this synchronization will happen every time a resource is deallocated, no matter whether the amount of allocated memory is getting high or not. It would be interesting to completely disable the feature where JavaCPP keeps track of the amount of memory allocated and avoid this thread locking (it looks like it is not avoidable here, even if you set maxPhysicalBytes, maxRetries or maxBytes to 0).

> Sigh... sure, I can probably do that when the "org.bytedeco.javacpp.nopointergc" system property is set to "true" so that you can all stop blaming me for bugs in your code! :) No one has ever encountered any issues like that with JavaCPP in the past. This is a problem in TF Java.

Lock contention is a different issue from the freeing of resources. If threads are blocked waiting for the lock on DeallocatorThread or DeallocatorReference (as indicated by the screenshots) then that's a different problem from the leaking of memory from unclosed tensors. How large a machine have you run this on? Using a single lock to guard a resource which is accessed from multiple threads or sockets is going to slow down under contention; the question is just how much contention it takes to cause problems.

@nebulorum
Author

So we added close() calls on all tensors and the number of Pointer$NativeDeallocator instances went down. But the thread synchronization is still going on and there is no noticeable impact on latencies.

We are doing this load testing on an AWS c5.4xlarge (well, that is the host for our K8S pods).

One other thing I was considering is preallocating and reusing the tensors. We could put an upper bound on tensor length and preallocate them. I'm not sure this would work with the runtime, though, and on reflection it is a bit optimistic because the bookkeeping would be hard. Also, the current implementation has no mutator for the tensors.

@Craigacp
Collaborator

Craigacp commented May 4, 2021

> So we added close() calls on all tensors and the number of Pointer$NativeDeallocator instances went down. But the thread synchronization is still going on and there is no noticeable impact on latencies.
>
> We are doing this load testing on an AWS c5.4xlarge (well, that is the host for our K8S pods).
>
> One other thing I was considering is preallocating and reusing the tensors. We could put an upper bound on tensor length and preallocate them. I'm not sure this would work with the runtime, though, and on reflection it is a bit optimistic because the bookkeeping would be hard. Also, the current implementation has no mutator for the tensors.

You can write values directly into the tensors when they are cast to their appropriate type (e.g. TFloat32 or TUint8). They have set methods inherited from the NdArray subinterface.

A c5.4xlarge shouldn't be big enough to exhibit NUMA effects, as I'd expect it to be on a single socket, so at least it's not that.
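For example, something like this (a rough sketch; the shape and the values written are placeholders):

import org.tensorflow.ndarray.Shape;
import org.tensorflow.types.TInt64;

// Allocate a tensor once for a fixed shape and overwrite its contents
// through the NdArray setters before each run.
try (TInt64 features = TInt64.tensorOf(Shape.of(200))) {
  for (long i = 0; i < 200; i++) {
    features.setLong(i * 7L, i); // setLong(value, coordinates...) from LongNdArray
  }
  // ... feed `features` to the session runner here ...
}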

@nebulorum
Author

> You can write values directly into the tensors when they are cast to their appropriate type (e.g. TFloat32 or TUint8). They have set methods inherited from the NdArray subinterface.

If I go down this route, is it possible to also update the shape of the tensor? My thinking is that, given a model, we know how many tensors we need and their maximum size. Allocate them once for the maximum size, set the data, and update the dimensions. No allocation needed.

@nebulorum
Author

I tried varying the capacity to find the breaking point.

VU    RPS   p50      p95
300   277   70 ms    180 ms
200   190   40 ms    100 ms
100    97   24 ms     86 ms
 50    48   18 ms     37 ms

The composite below shows the JavaCPP thread and some of the 32 worker threads at different load levels (300, 200, 100 VU).

[screenshot: thread states at 300, 200 and 100 VU]

The smaller the load, the less common the full column of synchronized activity is. At higher load the contention looks a lot worse. Note that the window ticks on the graph are 2 seconds apart and light blue is the monitor state. Even at low loads the synchronization appears. At 300 VU, K8S reported 6.27 CPUs of usage.

We will try smaller thread pools, but there seems to be some interesting interplay with the JavaCPP thread.

saudet added a commit to bytedeco/javacpp that referenced this issue May 5, 2021
@saudet
Contributor

saudet commented May 5, 2021

I've added a way to avoid any synchronized code at runtime in commit bytedeco/javacpp@d788390. Please give it a try with JavaCPP 1.5.6-SNAPSHOT and the "org.bytedeco.javacpp.nopointergc" system property set to "true". If you still see a "JavaCPP Deallocator" in your threads, that means the system property was not set early enough, so please try again, possibly with something like export _JAVA_OPTIONS=-Dorg.bytedeco.javacpp.nopointergc=true in a Bash profile somewhere.

@nebulorum
Author

I tried these parameters and the artifact, and the "JavaCPP Deallocator" thread is nowhere to be found. But the behaviour did not improve. I also tested 8 threads and then it looks like a synchronised swimming team :)

On the positive side, the machine does not seem unstable, and at 8 threads I have the same latency as with 32 threads at 1000 virtual users. The difference is the RPS: 8 -> 400 RPS, 32 -> 650 RPS.

I'm considering testing one layer at a time to confirm where the problem is, but this will take some time. Maybe we can construct an example that just loops over allocation and check whether the contention appears.

@nebulorum
Author

nebulorum commented May 5, 2021

I created a small example that allocates and closes tensors with several thread counts. I also added a simple latency check: https://gist.github.com/nebulorum/f7978aa5519cab8bece65d4dac689d4f

The main loop allocates 80 tensors based on a single array, sleeps for 5 ms, then closes all the tensors (sketched below).
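The inner loop is roughly like this (a sketch, not the exact gist code):

import org.tensorflow.types.TInt64;

// Allocate 80 tensors from the same backing array, pause 5 ms, then close them all.
static void allocateOnce(long[] data) throws InterruptedException {
  TInt64[] tensors = new TInt64[80];
  for (int i = 0; i < tensors.length; i++) {
    tensors[i] = TInt64.vectorOf(data); // one 200-element tensor per slot
  }
  Thread.sleep(5); // the 5 ms pause mentioned above
  for (TInt64 t : tensors) {
    t.close(); // deterministic release of the native memory
  }
}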

On my dual-core i5 Mac (4 hyper-threads, if I remember correctly) with 32 threads, we get >2500 allocations per second, and occasionally an allocation takes more than 200 ms.

nAlloc 2076 (total):  1206 (<10ms);  675 (<20ms);  106 (<30ms);  27 (<40ms);  6 (<50ms);  20 (<100ms);  4 (<200ms);  32 ( >200ms)
nAlloc 2854 (total):  1761 (<10ms);  875 (<20ms);  116 (<30ms);  49 (<40ms);  13 (<50ms);  40 (<100ms);  1 (<200ms);  - ( >200ms)
nAlloc 2960 (total):  2382 (<10ms);  444 (<20ms);  77 (<30ms);  16 (<40ms);  2 (<50ms);  5 (<100ms);  1 (<200ms);  32 ( >200ms)
nAlloc 3960 (total):  3731 (<10ms);  157 (<20ms);  30 (<30ms);  11 (<40ms);  - (<50ms);  - (<100ms);  32 (<200ms);  - ( >200ms)

With 4 worker threads you see allocation taking 50-100 ms:

nAlloc 546 (total):  538 (<10ms);  10 (<20ms);  1 (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 540 (total):  525 (<10ms);  9 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  4 (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 621 (total):  620 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 517 (total):  496 (<10ms);  16 (<20ms);  1 (<30ms);  - (<40ms);  - (<50ms);  4 (<100ms);  - (<200ms);  - ( >200ms)

On a single thread you can still see 50-100ms allocation but much less often:

nAlloc 147 (total):  148 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 144 (total):  144 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 136 (total):  130 (<10ms);  6 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 122 (total):  106 (<10ms);  14 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  1 (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 130 (total):  121 (<10ms);  9 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)

If you remove the pause, a single thread goes to 2500 allocations per second, but you start seeing some slower allocations, though only as single occurrences.

@saudet
Contributor

saudet commented May 6, 2021

JavaCPP can allocate and deallocate 2 MB over 2000 * 24 times from 24 threads running concurrently on a typical laptop in about 50 ms, see deepjavalibrary/djl#690 (comment). That means ~1 microsecond for each allocation/deallocation cycle, so I think we can safely assume that the issue isn't with lock contention in JavaCPP. Finally, we can put that to rest. Thank you for working on this!

@nebulorum
Author

nebulorum commented May 6, 2021

I think there is still contention around JavaCPP. Even with a single thread you will see latency spikes, but with a lot of threads, if you are unlucky, 3% of your allocations take more than 50 ms. There is even a drop in allocations per second.

Under normal latency tolerances of 50 ms or more this is OK, but in our use case we aim for less than 40 ms. On TensorFlow Java 1.3.1 our P999 is under 40 ms. With the new 0.3.1 we can't even reach this at the P50 under load.

The problem is not the average allocation time but the outliers. The more threads you have, the worse it becomes. If you look at the first example, you can see 32 threads taking more than 100 ms to allocate, and this happened 2 seconds apart. This could be interplay with GC too.

Here is a single thread with no sleep:

nAlloc 2478 (total):  2478 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 1693 (total):  1692 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  1 ( >200ms)
nAlloc 2797 (total):  2798 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 2884 (total):  2883 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 2378 (total):  2377 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  1 (<200ms);  - ( >200ms)

You can see in the second row a >200 ms wait, and the allocations per second are 33% lower. As a reminder, each allocation is 80 tensors.

@saudet
Contributor

saudet commented May 6, 2021

Right, ok, but the contention cannot happen in JavaCPP since I've removed all of that in commit bytedeco/javacpp@d788390. Something else in TF Java is happening that is unrelated to JavaCPP.

@nebulorum
Author

> Right, ok, but the contention cannot happen in JavaCPP since I've removed all of that in commit bytedeco/javacpp@d788390. Something else in TF Java is happening that is unrelated to JavaCPP.

I think there is still synchronization on the add and remove, plus the concurrent data structure where JavaCPP does the adding and removing.

I did a run with the 1.5.6-SNAPSHOT and -Dorg.bytedeco.javacpp.nopointergc=true; in addition I added more heap memory and enabled GC output (-Xlog:gc -XX:MaxRAMPercentage=40 -XX:InitialRAMPercentage=40). No pause between allocations.

The results are interesting:

  • The variance in the number of allocations is really large. On a single thread it goes as high as 5K, sometimes around 2K, and sometimes down to the low hundreds.
  • GC does kick in, but it does not seem to take too long.
  • With more threads the variance is large and you can see groups of high latency (e.g. 12 allocations with 100-200 ms latency).
  • Since this code just allocates and closes tensors, model evaluation, sessions and other TF components should have no impact.
  • With 16 threads the number of outliers is higher (about 2-3%).

Single Thread:

[125.007s][info][gc] GC(6) Pause Young (Normal) (G1 Evacuation Pause) 3632M->8M(6554M) 28.763ms
nAlloc 2449 (total):  2445 (<10ms);  - (<20ms);  3 (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  1 (<200ms);  - ( >200ms)
nAlloc 2959 (total):  2956 (<10ms);  - (<20ms);  1 (<30ms);  1 (<40ms);  - (<50ms);  1 (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 3111 (total):  3106 (<10ms);  4 (<20ms);  1 (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 2757 (total):  2753 (<10ms);  4 (<20ms);  1 (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 2860 (total):  2858 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  1 (<100ms);  - (<200ms);  - ( >200ms)
...
[156.620s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 3930M->8M(6554M) 14.136ms
nAlloc 1015 (total):  1003 (<10ms);  3 (<20ms);  5 (<30ms);  2 (<40ms);  - (<50ms);  1 (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 406 (total):  384 (<10ms);  8 (<20ms);  5 (<30ms);  3 (<40ms);  2 (<50ms);  4 (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 352 (total):  337 (<10ms);  4 (<20ms);  2 (<30ms);  2 (<40ms);  1 (<50ms);  4 (<100ms);  2 (<200ms);  - ( >200ms)
nAlloc 292 (total):  274 (<10ms);  5 (<20ms);  4 (<30ms);  2 (<40ms);  1 (<50ms);  6 (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 222 (total):  208 (<10ms);  5 (<20ms);  4 (<30ms);  - (<40ms);  1 (<50ms);  3 (<100ms);  - (<200ms);  1 ( >200ms)
nAlloc 282 (total):  264 (<10ms);  11 (<20ms);  1 (<30ms);  - (<40ms);  1 (<50ms);  2 (<100ms);  3 (<200ms);  - ( >200ms)
nAlloc 119 (total):  111 (<10ms);  1 (<20ms);  1 (<30ms);  1 (<40ms);  1 (<50ms);  1 (<100ms);  1 (<200ms);  2 ( >200ms)

With 4 threads (which also showed a strange behaviour of taking 49 seconds before I saw any allocations):

[267.597s][info][gc] GC(10) Pause Young (Normal) (G1 Evacuation Pause) 3930M->8M(6554M) 7.855ms
nAlloc 1054 (total):  999 (<10ms);  16 (<20ms);  9 (<30ms);  3 (<40ms);  2 (<50ms);  10 (<100ms);  12 (<200ms);  3 ( >200ms)
nAlloc 1454 (total):  1399 (<10ms);  20 (<20ms);  20 (<30ms);  6 (<40ms);  2 (<50ms);  6 (<100ms);  1 (<200ms);  - ( >200ms)
nAlloc 1552 (total):  1479 (<10ms);  31 (<20ms);  12 (<30ms);  7 (<40ms);  2 (<50ms);  11 (<100ms);  6 (<200ms);  4 ( >200ms)
nAlloc 2221 (total):  2149 (<10ms);  21 (<20ms);  19 (<30ms);  7 (<40ms);  9 (<50ms);  11 (<100ms);  5 (<200ms);  - ( >200ms)
nAlloc 1165 (total):  1122 (<10ms);  17 (<20ms);  8 (<30ms);  4 (<40ms);  1 (<50ms);  5 (<100ms);  3 (<200ms);  5 ( >200ms)
nAlloc 1501 (total):  1437 (<10ms);  24 (<20ms);  14 (<30ms);  4 (<40ms);  2 (<50ms);  10 (<100ms);  9 (<200ms);  1 ( >200ms)
nAlloc 2853 (total):  2786 (<10ms);  28 (<20ms);  10 (<30ms);  6 (<40ms);  9 (<50ms);  7 (<100ms);  7 (<200ms);  - ( >200ms)
nAlloc 6840 (total):  6809 (<10ms);  17 (<20ms);  7 (<30ms);  2 (<40ms);  1 (<50ms);  4 (<100ms);  1 (<200ms);  - ( >200ms)

With 16 threads:

[536.188s][info][gc] GC(35) Pause Young (Normal) (G1 Evacuation Pause) 3938M->8M(6554M) 3.820ms
nAlloc 1663 (total):  1591 (<10ms);  16 (<20ms);  8 (<30ms);  1 (<40ms);  - (<50ms);  4 (<100ms);  13 (<200ms);  30 ( >200ms)
nAlloc 2997 (total):  2896 (<10ms);  17 (<20ms);  7 (<30ms);  3 (<40ms);  2 (<50ms);  8 (<100ms);  28 (<200ms);  36 ( >200ms)
nAlloc 2279 (total):  2186 (<10ms);  22 (<20ms);  6 (<30ms);  2 (<40ms);  3 (<50ms);  6 (<100ms);  26 (<200ms);  28 ( >200ms)
nAlloc 2963 (total):  2844 (<10ms);  34 (<20ms);  15 (<30ms);  4 (<40ms);  - (<50ms);  9 (<100ms);  34 (<200ms);  23 ( >200ms)
nAlloc 3228 (total):  3097 (<10ms);  30 (<20ms);  8 (<30ms);  10 (<40ms);  4 (<50ms);  16 (<100ms);  36 (<200ms);  27 ( >200ms)

@nebulorum
Author

Just as a baseline, I did only JVM allocations. GC goes crazy and we do see some contention, but at 10x the allocation rate.

 private void allocLoop() {
    long[] nd = new long[200];
    for (var i = 0; i < 80; i++) {
      long[] nd2 = new long[200]; // heap-only allocation standing in for one tensor
    }
  }

  • Once the heap is at its full size you see variance on a single thread, but not as bad.
  • With multiple threads the number of outliers grows, but again the baseline is much higher.

Benchmarking is hard :)

Either way, just adding the tensor allocation layer costs 10x.

1 thread:

[64.609s][info][gc] GC(47) Pause Young (Normal) (G1 Evacuation Pause) 3937M->7M(6554M) 6.470ms
nAlloc 17489 (total):  17479 (<10ms);  5 (<20ms);  1 (<30ms);  1 (<40ms);  - (<50ms);  2 (<100ms);  1 (<200ms);  - ( >200ms)
nAlloc 9884 (total):  9864 (<10ms);  12 (<20ms);  2 (<30ms);  2 (<40ms);  2 (<50ms);  2 (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 16051 (total):  16044 (<10ms);  3 (<20ms);  1 (<30ms);  1 (<40ms);  - (<50ms);  2 (<100ms);  - (<200ms);  - ( >200ms)
[67.497s][info][gc] GC(48) Pause Young (Normal) (G1 Evacuation Pause) 3937M->7M(6554M) 3.663ms
nAlloc 7471 (total):  7454 (<10ms);  10 (<20ms);  3 (<30ms);  - (<40ms);  2 (<50ms);  1 (<100ms);  1 (<200ms);  - ( >200ms)
nAlloc 12057 (total):  12040 (<10ms);  10 (<20ms);  2 (<30ms);  1 (<40ms);  2 (<50ms);  2 (<100ms);  - (<200ms);  - ( >200ms)
[69.543s][info][gc] GC(49) Pause Young (Normal) (G1 Evacuation Pause) 3937M->7M(6554M) 2.588ms
nAlloc 31432 (total):  31452 (<10ms);  3 (<20ms);  1 (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)

16 threads:

[71.344s][info][gc] GC(51) Pause Young (Normal) (G1 Evacuation Pause) 3936M->6M(6554M) 2.796ms
nAlloc 25410 (total):  25337 (<10ms);  16 (<20ms);  3 (<30ms);  2 (<40ms);  1 (<50ms);  2 (<100ms);  27 (<200ms);  24 ( >200ms)
[72.515s][info][gc] GC(52) Pause Young (Normal) (G1 Evacuation Pause) 3936M->6M(6554M) 3.517ms
nAlloc 24546 (total):  24444 (<10ms);  18 (<20ms);  16 (<30ms);  4 (<40ms);  9 (<50ms);  5 (<100ms);  15 (<200ms);  34 ( >200ms)
[74.215s][info][gc] GC(53) Pause Young (Normal) (G1 Evacuation Pause) 3936M->6M(6554M) 3.105ms
nAlloc 18327 (total):  18268 (<10ms);  11 (<20ms);  5 (<30ms);  2 (<40ms);  1 (<50ms);  3 (<100ms);  7 (<200ms);  36 ( >200ms)
[75.341s][info][gc] GC(54) Pause Young (Normal) (G1 Evacuation Pause) 3936M->6M(6554M) 19.630ms
nAlloc 31495 (total):  31387 (<10ms);  15 (<20ms);  6 (<30ms);  1 (<40ms);  3 (<50ms);  12 (<100ms);  41 (<200ms);  24 ( >200ms)

@nebulorum
Author

We've been testing some more. We tried running our server without TensorFlow but including data access to Cassandra, and the same config can handle 380 RPS with a 34 ms P99. So our time budget is tight, but if TF added only 4-5 ms we would be OK. For full transparency, adding a 5 ms sleep makes the P99 go up to 70 ms, but this probably means we went over the RPS capacity. We will clean up the allocation logic to make it generate less garbage and take less time.

@saudet
Contributor

saudet commented May 6, 2021

> I think there is still synchronization on the add and remove, plus the concurrent data structure where JavaCPP does the adding and removing.

No, add() never gets called, remove() returns before entering the synchronized {} block, and nothing else is synchronized.

@saudet
Contributor

saudet commented May 7, 2021

I've replaced TInt64 with LongPointer in your AllocateStress as per this gist:
https://gist.github.com/saudet/6f020ecd6a5660762ff7535d65e5f69e
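The substitution is roughly this (a sketch, not the exact gist contents):

import org.bytedeco.javacpp.LongPointer;

// Allocate raw native buffers through JavaCPP instead of TF tensors,
// to isolate whether the latency comes from JavaCPP or from TF_AllocateTensor.
LongPointer[] buffers = new LongPointer[80];
for (int i = 0; i < buffers.length; i++) {
  buffers[i] = new LongPointer(200); // 200 longs of native memory
}
for (LongPointer p : buffers) {
  p.close(); // frees the native memory deterministically
}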

And I get the following results. I think that looks fine. Do you see anything wrong with those numbers?

nAlloc 6015 (total):  6014 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5939 (total):  5907 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5993 (total):  5999 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5940 (total):  5902 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5992 (total):  5993 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5927 (total):  5894 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5900 (total):  5891 (<10ms);  15 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5971 (total):  5948 (<10ms);  17 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5904 (total):  5872 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5962 (total):  5967 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5925 (total):  5888 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5996 (total):  6004 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5925 (total):  5885 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5961 (total):  5961 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5943 (total):  5913 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 6001 (total):  6002 (<10ms);  - (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5969 (total):  5934 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)

@nebulorum
Author

That is an interesting find, and it is kind of what I was expecting: low variance. Are the data types equivalent? Or is this showing something else going on inside the tensor?

@nebulorum
Author

nebulorum commented May 7, 2021

Your test made me try something different: I used async-profiler in IntelliJ, and maybe the problem is on the other side of JNI. Here is an example flame graph; a lot of frames are in Java_org_tensorflow_internal_c_1api_global_tensorflow_TF_1AllocateTensor__I_3JIJ. There is even some __unlock_wait.
[screenshot: flame graph]

This would be in line with your test. Allocating tensors may be generating contention on the C++ side.

@saudet
Contributor

saudet commented May 7, 2021

A mutex does get used here when collecting stats:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/cpu_allocator_impl.cc
Are you sure that hasn't been enabled somehow?

@nebulorum
Author

nebulorum commented May 7, 2021

We did not do any configuration of TensorFlow; our flow is the following (a sketch appears after the list):

  1. Load the model bundle.
  2. Get the session from the model (once, each time the model gets updated).
  3. Get a runner from the session per request.
  4. Allocate tensors and run the model.
  5. Copy the answer.
  6. Close all tensors.
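Roughly, in code (a sketch assuming the 0.3.x API; the model path and the "input"/"output" names are placeholders, not our real signature):

import java.util.List;
import org.tensorflow.SavedModelBundle;
import org.tensorflow.Session;
import org.tensorflow.Tensor;
import org.tensorflow.types.TInt64;

SavedModelBundle bundle = SavedModelBundle.load("/models/my-model", "serve"); // 1
Session session = bundle.session();                                           // 2

try (TInt64 input = TInt64.vectorOf(new long[200])) {                         // 4: allocate
  List<Tensor> outputs = session.runner()                                     // 3: runner per request
      .feed("input", input)
      .fetch("output")
      .run();                                                                 // 4: run the model
  // 5: copy the answer out of outputs.get(0), then
  // 6: close every fetched tensor as well
  for (Tensor out : outputs) {
    out.close();
  }
}                                                                             // 6: input closed here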

We followed the 1.3.1 implementation for this, so maybe we got something wrong.

We have no TensorFlow-specific config or flags, and in the Dockerfile we just give some params to the JVM:

java -server -XX:MaxRAMPercentage=70 -XX:InitialRAMPercentage=70 \
 -Xlog:gc -XX:MaxGCPauseMillis=20 -XX:+AlwaysPreTouch -XX:+DisableExplicitGC\
 -XX:+ExitOnOutOfMemoryError \
 -jar /service.jar

PS: Maybe the title of this ticket no longer makes sense.

@dennisTS

dennisTS commented May 7, 2021

And this is a specific example of a blocked condition; one of the worker threads was waiting (for ~20 ms) for the other one to release the monitor inside the deallocator:
[screenshot: thread dump detail]

@saudet
Contributor

saudet commented May 8, 2021

@dennisTS I've just shown above that this doesn't happen with 1.5.6-SNAPSHOT.

saudet added a commit to bytedeco/javacpp that referenced this issue May 9, 2021
@saudet
Contributor

saudet commented May 9, 2021

> We have no TensorFlow-specific config or flags, and in the Dockerfile we just give some params to the JVM:
>
> java -server -XX:MaxRAMPercentage=70 -XX:InitialRAMPercentage=70 \
>  -Xlog:gc -XX:MaxGCPauseMillis=20 -XX:+AlwaysPreTouch -XX:+DisableExplicitGC \
>  -XX:+ExitOnOutOfMemoryError \
>  -jar /service.jar

Where are you setting the "org.bytedeco.javacpp.nopointergc" system property to "true"?

There's one more place where synchronization could potentially take place when classes are unloaded, for the WeakHashMap used inside the call to sizeof() in Pointer.DeallocatorReference, but I have a hard time imagining in what kind of application that would actually matter. In any case, I've "fixed" that in commit bytedeco/javacpp@0ce97fc .

@nebulorum
Author

> Where are you setting the "org.bytedeco.javacpp.nopointergc" system property to "true"?

That was collected from our production config. If we can run the SNAPSHOT version in production we will do it. We need to release a version of this for some testing.

But this was more in response to this question:

> A mutex does get used here when collecting stats: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/cpu_allocator_impl.cc
> Are you sure that hasn't been enabled somehow?

No, I don't know how I would have turned on the mutex. We are not using any additional configuration. Maybe we should make sure the mutex is not on.

As your example showed, if tensors are not allocated, performance is pretty stable. So maybe the issue is not in JavaCPP.

nebulorum changed the title from "Documentation on Session and Resource management" to "Large latency on Tensor allocation" on May 10, 2021
@saudet
Contributor

saudet commented May 11, 2021

Using TF Java 0.3.1 and JavaCPP 1.5.6-SNAPSHOT with "noPointerGC", the original AllocateStress also looks OK to me:

nAlloc 5621 (total):  5586 (<10ms);  34 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5678 (total):  5668 (<10ms);  10 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5663 (total):  5651 (<10ms);  15 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5720 (total):  5699 (<10ms);  20 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5613 (total):  5601 (<10ms);  11 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5776 (total):  5766 (<10ms);  9 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5640 (total):  5600 (<10ms);  41 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5660 (total):  5650 (<10ms);  13 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5667 (total):  5655 (<10ms);  9 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5627 (total):  5593 (<10ms);  33 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5718 (total):  5706 (<10ms);  12 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5658 (total):  5650 (<10ms);  10 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5638 (total):  5615 (<10ms);  22 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5648 (total):  5631 (<10ms);  27 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5637 (total):  5603 (<10ms);  23 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5640 (total):  5608 (<10ms);  32 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)
nAlloc 5731 (total):  5707 (<10ms);  24 (<20ms);  - (<30ms);  - (<40ms);  - (<50ms);  - (<100ms);  - (<200ms);  - ( >200ms)

> That was collected from our production config. If we can run the SNAPSHOT version in production we will do it. We need to release a version of this for some testing.

If you're satisfied with JavaCPP 1.5.6-SNAPSHOT, but need a release, I can do a 1.5.5-1 or something with the fix, but before doing that please make sure that there isn't something else that you may want updated as well. Thanks for testing!

@dennisTS

Hi @saudet , thank you! We'll check it and get back to you

@nebulorum
Author

nebulorum commented May 11, 2021

We ran this (well, @dennisTS did) in our test rig and things improved a lot:

At 600 RPS against 8 CPUs we are getting p99.9 = 48 ms and p95 = 24 ms.

So there is a real improvement using this version and option.

As a follow-up what do we lose by using nopointergc?

@karllessard
Collaborator

The JavaCPP GC thread is mostly useful for collecting unclosed resources in eager mode, which is used more for debugging than for high-performance production work. If you are running sessions from a saved model graph, you normally don't need an eager session and can live with nopointergc without issue.

@karllessard
Collaborator

@saudet , is there a way to completely turn off GC support in JavaCPP programmatically instead of passing a parameter on the command line? I guess we could set the value of this system property directly in TF Java before loading the TensorFlow runtime library, but how do we guarantee we can do it before the JavaCPP libraries load statically?

@nebulorum
Author

Now that we've been running for some hours in test, we would really like to release to production. We will check with security whether we can use the snapshot, but how long would it take to release a version of JavaCPP?

@saudet
Contributor

saudet commented May 12, 2021

> @saudet , is there a way to completely turn off GC support in JavaCPP programmatically instead of passing a parameter on the command line? I guess we could set the value of this system property directly in TF Java before loading the TensorFlow runtime library, but how do we guarantee we can do it before the JavaCPP libraries load statically?

Not really possible, unless TF Java is the only library using JavaCPP in the app, and if the user has other libraries using JavaCPP, it's probably not a good idea to tamper with global settings like that anyway. In any case, this is the kind of information that is useful for optimization, and should be part of the documentation on a page, for example, like those here:
https://deeplearning4j.konduit.ai/config/config-memory
https://netty.io/wiki/reference-counted-objects.html

That said, I could add a parameter to Pointer.deallocator() that would let users specify whether they want to keep it out of the reference queue or not. The deallocator thread would still be there by default, but wouldn't do anything.

> Now that we've been running for some hours in test, we would really like to release to production. We will check with security whether we can use the snapshot, but how long would it take to release a version of JavaCPP?

A couple of days, but like I said, let's make sure there isn't anything else we want to put in there...

@nebulorum
Author

We've been running the SNAPSHOT for some days and it seems OK. Did the final version get released?

@saudet
Contributor

saudet commented May 31, 2021

> We've been running the SNAPSHOT for some days and it seems OK. Did the final version get released?

You mean JavaCPP? No, do you need one?

BTW, latency may potentially be even lower with TF Lite, so if your models are compatible with it, please give it a try:

Those are currently wrappers for the C++ API, but if you would like to use an idiomatic high-level API instead, please let us know!

@dennisTS

dennisTS commented Jul 5, 2021

@saudet, sorry for getting back to you with such a delay - somehow forgot about this thread

> You mean JavaCPP? No, do you need one?

Well, it would be nice - using SNAPSHOT in production just doesn't feel good :)

> BTW, latency may potentially be even lower with TF Lite, so if your models are compatible with it, please give it a try:

We are using the higher-level APIs now, but generally even with the SNAPSHOT version of JavaCPP the latencies are good for us.
