Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interaction between Tensorflow Java and JavaCPP Pointer deallocation #208

Open
skirdey opened this issue Feb 5, 2021 · 40 comments
Open

Comments

@skirdey
Copy link

skirdey commented Feb 5, 2021

I am trying to understand why when I use DJL + Tensorflow engine, there is a good amount of time spent in GC, while both DJL and Tensorflow Java seem to use pointerscope and do not relay on GC for object cleanup.

For example, I see JavaCPP Pointer.deallocator gets invoked https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/Pointer.java#L666-L667

which is a heavy synchronized call.

Any help is appreciated.

Screen Shot 2021-02-04 at 4 34 24 PM

@saudet
Copy link
Contributor

saudet commented Feb 5, 2021

If you're confident you don't need GC, we can set the "org.bytedeco.javacpp.nopointergc" system property to "true" to reduce overhead.

@skirdey
Copy link
Author

skirdey commented Feb 5, 2021

I am not sure wether I need GC or not. Looking at the EagerSession class seems like it is managing memory through the pointer scope. The NDArrays I allocate through DJL/TF also use the same.

@skirdey
Copy link
Author

skirdey commented Feb 5, 2021

I will try the system property you have mentioned and run my benchmarks, thanks for such a quick response!

@saudet
Copy link
Contributor

saudet commented Feb 5, 2021

You'll need to keep in mind though that neither TF nor DJL are particularly designed to reduce GC. JavaCPP is only a small part of the overall design, and there is a lot garbage that gets generated elsewhere.

@skirdey
Copy link
Author

skirdey commented Feb 17, 2021

Still investigating how GC optimizations work in the DJL + Tensorflow Environment.
There seems to be also a blocking call when using several eager sessions in parallel for multi-threading, and when JavaCPP needs to run GC, it blocks all of the threads with sessions.

@saudet
Copy link
Contributor

saudet commented Feb 18, 2021

Like I said, make sure to set the "org.bytedeco.javacpp.nopointergc" system property to "true" to prevent JavaCPP from calling System.gc(). It should also avoid blocking calls, but let me know if you still see anything of concern there.

@skirdey
Copy link
Author

skirdey commented Feb 19, 2021

I've tried the org.bytedeco.javacpp.nopointergc and very quickly ran out of memory :-)

@skirdey
Copy link
Author

skirdey commented Feb 19, 2021

If I keep it "false", then in JVM with G1GC, I spend 25% of the time in GC on a high throughput inference use-case, with occasional blocking.

@skirdey
Copy link
Author

skirdey commented Feb 19, 2021

stacktrace related to allocation

Exception in thread "ForkJoinPool-1-worker-7" java.lang.OutOfMemoryError: Cannot allocate new PointerPointer(1000): totalBytes = 3640M, physicalBytes = 1226M
at org.bytedeco.javacpp.PointerPointer.(PointerPointer.java:149)
at org.tensorflow.EagerOperationBuilder.execute(EagerOperationBuilder.java:270)
at org.tensorflow.EagerOperationBuilder.build(EagerOperationBuilder.java:66)
at org.tensorflow.EagerOperationBuilder.build(EagerOperationBuilder.java:55)
at ai.djl.tensorflow.engine.TfNDArray.(TfNDArray.java:81)
at ai.djl.tensorflow.engine.TfNDManager.create(TfNDManager.java:154)

@skirdey
Copy link
Author

skirdey commented Feb 19, 2021

So I've tried it, the blocking calls are removed, the thread with JavaCPP Deallocator is also disappears and no-more blocking GCs. Only issue is I quickly run out of memory :-)

@skirdey
Copy link
Author

skirdey commented Feb 19, 2021

There is a team discussion on DJL around Tensorflow Java and JavaCPP performance, if you are interested - deepjavalibrary/djl#625

@skirdey
Copy link
Author

skirdey commented Feb 19, 2021

Screen Shot 2021-02-19 at 3 44 36 PM

The GC behaviour I see when

org.bytedeco.javacpp.nopointergc=false
Xmx2G
Xms2G

maxPhysicalBytes=1G

when running multi-threaded inference locally on my laptop. It uses Tensorflow Eager Session.

@saudet
Copy link
Contributor

saudet commented Feb 20, 2021

So I've tried it, the blocking calls are removed, the thread with JavaCPP Deallocator is also disappears and no-more blocking GCs. Only issue is I quickly run out of memory :-)

Obviously, when not using the GC, you'll need to make sure to deallocate native memory some other way!

@skirdey
Copy link
Author

skirdey commented Feb 20, 2021 via email

@karllessard
Copy link
Collaborator

Eager execution does allocate a lot of native resources, as each operation and each of their outputs will need to be freed.

I don't know much about details the GC feature of JavaCPP we are using now but I can tell that in 1.14, the way it was working is that the GC listener was running in a separate thread and was trying to free resources referenced by deleted objects, but just as a "best effort" since GC is not entirely reliable when it comes to native memory. @saudet, this thread-based implementation was just listening to a phantom reference queue and was therefore non-blocking, is JavaCPP doing something similar?

Ultimately, resources should be cleaned up by closing the eager session enclosing some piece of code and that part is still true:

try (EagerSession s = EagerSession.create()) {
    Ops tf = Ops.create(s);
    // your eager computations
}

This ensure to release all resources independently from the GC. So while it is not enforced by the API, it is recommended to scope your eager operations into multiple eager session instead of only relying on the default one (i.e. the one used when simply invoking Ops.create() without any parameter).

@saudet
Copy link
Contributor

saudet commented Feb 22, 2021

I don't know much about details the GC feature of JavaCPP we are using now but I can tell that in 1.14, the way it was working is that the GC listener was running in a separate thread and was trying to free resources referenced by deleted objects, but just as a "best effort" since GC is not entirely reliable when it comes to native memory. @saudet, this thread-based implementation was just listening to a phantom reference queue and was therefore non-blocking, is JavaCPP doing something similar?

It's doing that, but it also blocks by default when maxBytes or maxPhysicalBytes is reached, unless that is set to 0 as done in DJL here: https://github.com/awslabs/djl/blob/master/tensorflow/tensorflow-engine/src/main/java/ai/djl/tensorflow/engine/LibUtils.java#L52

Java users are typically used to having per-process memory thresholds like that, but it looks like users of DJL consider this a "bug", and prefer to have it behave more like C/C++/Python, where we can always allocate all memory available on the system with no restrictions.

@skirdey
Copy link
Author

skirdey commented Feb 22, 2021 via email

@saudet
Copy link
Contributor

saudet commented Feb 22, 2021

If I leave default settings for JavaCPP, use single interop and intraop thread in Tensorflow and use G1GC for GC - I get a weird behavior. Everything works fine until certain load threshold, after which heap gets to about 100% and full GC cycles can not clean anything in it. Seems like something is holding pointers forever. Need to test custom maxPhysicalBytes params.

That sounds like an issue somewhere in TF Java, not JavaCPP. If you have a simple way to reproduce this, please let us know!

@skirdey
Copy link
Author

skirdey commented Feb 22, 2021 via email

@karllessard
Copy link
Collaborator

@saudet , I just to verify with you one more thing concerning the GC listening thread in JavaCPP.

Before in TF Java (1.x), each resource that has native memory allocated was being referred by a PhantomReference retrieved from the GC queue once discarded by it: https://github.com/tensorflow/tensorflow/blob/3f113a31d250a5c772d95ffa8e8562cf174d222a/tensorflow/java/src/main/java/org/tensorflow/EagerSession.java#L387

So basically, that phantom reference contains all the information required to release any native resources associated to a particular object that was garbage collected.

In JavaCPP, how do you keep track of which object refers to which native resources? Does it exclusively rely on PointerScope, hence only on the lifetime of the EagerSession in the case?

@saudet
Copy link
Contributor

saudet commented Feb 24, 2021

No, there are PhantomReference getting queued, unless "org.bytedeco.javacpp.nopointergc" is set to "true", that is:
https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/Pointer.java#L459-L467
PhantomReference is referenced here, alongside the machinery for reference counting:
https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/Pointer.java#L274

@karllessard
Copy link
Collaborator

karllessard commented Feb 24, 2021

Ok so the PhantomReference is on the Pointer object itself, so that when this one is garbage collected, it's backing native resource is as well? If that's the case, sounds good.

Then another question, if the Pointer member of a given object have also been attached to an outer PointerScope, like it is done here for eager operations, will the garbage collection of the owning object prevail on the scope to release the resources? i.e. is the scope only keeping a weak reference to the Pointer or it will hold a strong reference to it, preventing the garbage collector to collect it when the EagerOperation is collected?

@saudet
Copy link
Contributor

saudet commented Feb 25, 2021

Ok so the PhantomReference is on the Pointer object itself, so that when this one is garbage collected, it's backing native resource is as well? If that's the case, sounds good.

Yes, that's idea. Everything is centered around Pointer to make it easier to reason about, and I've found that we can also use it to map anything of interest from the C/C++ world, well anything that has to do with memory living on the native heap anyway.

Then another question, if the Pointer member of a given object have also been attached to an outer PointerScope, like it is done here for eager operations, will the garbage collection of the owning object prevail on the scope to release the resources? i.e. is the scope only keeping a weak reference to the Pointer or it will hold a strong reference to it, preventing the garbage collector to collect it when the EagerOperation is collected?

It stores a strong reference, otherwise this kind of pattern may result in a crash:

try (PointerScope scope = new PointerScope()) {
    IntPointer something = new IntPointer(10);
    someNativeObject.keepReferenceTo(something);
    // The Java object "something" is no longer referenced from here, but someNativeObject expects the native memory to be
    // still allocated, so this call may crash, unless an object like PointerScope keeps a strong reference, which it does.
    someNativeObject.doSomething()
}

In the event that a scope is not closed, but becomes itself unreachable though, everything it contains becomes eligible for GC.

@karllessard
Copy link
Collaborator

It stores a strong reference, otherwise this kind of pattern may result in a crash:

Ok, this new paradigm introduced once we have switched to JavaCPP is probably the cause of the issue here. Before, the EagerSession scope was acting weakly and would not prevent the native memory of its garbage-collected resources to be released.

Some sessions (like the default one, which I think @skirdey uses) are never or rarely closed, therefore entirely rely on the GC or on explicit closing of the resources themselves (the tensors, for example). Since a strong reference is kept on the native pointer, the memory will never be released. That certainly happens with the EagerOperation resources.

It could be great if a user can decide to attach weakly or strongly a Pointer to a PointerScope. If that's not feasible, then we might want to prevent attaching these resources to the eager session if we know that session is the default one. But there is probably a cleaner way to fix this as well.

@skirdey , do you build your own version of TF Java when you are doing your tests or you use the prebuilt one coming with DJL? It would be interesting to see the behaviour of your code by adding this patch.

@skirdey
Copy link
Author

skirdey commented Feb 25, 2021

@karllessard I would love to try a patch. I am using stock version of tensor-flow that comes with DJL. I am still not sure what would be the patch, so if you have one send it my way :-)

@skirdey
Copy link
Author

skirdey commented Feb 25, 2021 via email

@stu1130
Copy link

stu1130 commented Feb 25, 2021

@karllessard @saudet @skirdey Here is my thought.
There are two strategies of releasing native memory.

  1. Java style: spawn another dedicated thread to clean up native resource object in the ReferenceQueue.
    pros: I guess the performance would be slighter better than approach 2 as delete native memory takes time but needs experiments
    cons: If the client API invoke the function in asynchronous way (which is usually the case in Java), the dedicated thread is not fast enough to clean up the native resource and cause OOM. Additionally, if engines like pytorch and apache mxnet don't support thread-safe delete, it will cause the JVM crash. As there might be a small chance that both PointerScope & dedicated thread release the native resource at the same cause and ran into double delete crash. Both PyTorch & Apache MXNet are of this type. So we have no choice but to go with approach 2. Do you know if tensorflow is thread-safe in terms of releasing the native resource like tensor?
  2. C++ style: release the memory as soon as we mark the memory is not used.
    pros: The memory consumption is small comparing to approach 1.
    cons: The performance could be slightly slower than approach 1 but again it requires experiments.

Either way is trying to get rid of GC (System.gc()) which slows down the entire system. The right pattern of coding Deep Learning system is to make sure all native resources are tracked by a scope and release the memory without a leak and help of GC.

@saudet Does JavaCpp support approach 2 if users are working on memory-intensive application?

@skirdey
Copy link
Author

skirdey commented Feb 26, 2021

It stores a strong reference, otherwise this kind of pattern may result in a crash:

Ok, this new paradigm introduced once we have switched to JavaCPP is probably the cause of the issue here. Before, the EagerSession scope was acting weakly and would not prevent the native memory of its garbage-collected resources to be released.

Some sessions (like the default one, which I think @skirdey uses) are never or rarely closed, therefore entirely rely on the GC or on explicit closing of the resources themselves (the tensors, for example). Since a strong reference is kept on the native pointer, the memory will never be released. That certainly happens with the EagerOperation resources.

It could be great if a user can decide to attach weakly or strongly a Pointer to a PointerScope. If that's not feasible, then we might want to prevent attaching these resources to the eager session if we know that session is the default one. But there is probably a cleaner way to fix this as well.

@skirdey , do you build your own version of TF Java when you are doing your tests or you use the prebuilt one coming with DJL? It would be interesting to see the behaviour of your code by adding this patch.

Looking at DJL code and TfNDManager it does create a single session in with async option https://github.com/awslabs/djl/blob/master/tensorflow/tensorflow-engine/src/main/java/ai/djl/tensorflow/engine/TfNDManager.java#L69

I can not see if it is a "default" session or not.

@saudet
Copy link
Contributor

saudet commented Feb 26, 2021

It stores a strong reference, otherwise this kind of pattern may result in a crash:

Ok, this new paradigm introduced once we have switched to JavaCPP is probably the cause of the issue here. Before, the EagerSession scope was acting weakly and would not prevent the native memory of its garbage-collected resources to be released.

Some sessions (like the default one, which I think @skirdey uses) are never or rarely closed, therefore entirely rely on the GC or on explicit closing of the resources themselves (the tensors, for example). Since a strong reference is kept on the native pointer, the memory will never be released. That certainly happens with the EagerOperation resources.

I see, that's what @rnett was referring in pull #188 (comment).

It could be great if a user can decide to attach weakly or strongly a Pointer to a PointerScope. If that's not feasible, then we might want to prevent attaching these resources to the eager session if we know that session is the default one. But there is probably a cleaner way to fix this as well.

It's never a good idea to rely on the GC. We cannot avoid situations like I mention above where the native side keeps a reference to an existing Tensor, but where the Java side doesn't. The only sane way to deal with this is by not relying on the GC at all. We can still have it as a sort of option for users that don't want to use something like TensorScope though, but in that case, since we cannot offer any guarantees anyway, it makes no difference whether eager session holds on to weak references, or no reference at all!

@skirdey , do you build your own version of TF Java when you are doing your tests or you use the prebuilt one coming with DJL? It would be interesting to see the behaviour of your code by adding this patch.

Which patch?

@stu1130
Copy link

stu1130 commented Feb 26, 2021

It stores a strong reference, otherwise this kind of pattern may result in a crash:

Ok, this new paradigm introduced once we have switched to JavaCPP is probably the cause of the issue here. Before, the EagerSession scope was acting weakly and would not prevent the native memory of its garbage-collected resources to be released.
Some sessions (like the default one, which I think @skirdey uses) are never or rarely closed, therefore entirely rely on the GC or on explicit closing of the resources themselves (the tensors, for example). Since a strong reference is kept on the native pointer, the memory will never be released. That certainly happens with the EagerOperation resources.
It could be great if a user can decide to attach weakly or strongly a Pointer to a PointerScope. If that's not feasible, then we might want to prevent attaching these resources to the eager session if we know that session is the default one. But there is probably a cleaner way to fix this as well.
@skirdey , do you build your own version of TF Java when you are doing your tests or you use the prebuilt one coming with DJL? It would be interesting to see the behaviour of your code by adding this patch.

Looking at DJL code and TfNDManager it does create a single session in with async option https://github.com/awslabs/djl/blob/master/tensorflow/tensorflow-engine/src/main/java/ai/djl/tensorflow/engine/TfNDManager.java#L69

I can not see if it is a "default" session or not.

I checked the source code. It should not be a default EagerSession. The defaultEagerSession is created only when you call initDefault(options) or getDefault()

@saudet
Copy link
Contributor

saudet commented Feb 26, 2021

@karllessard @saudet @skirdey Here is my thought.
There are two strategies of releasing native memory.

  1. Java style: spawn another dedicated thread to clean up native resource object in the ReferenceQueue.
    pros: I guess the performance would be slighter better than approach 2 as delete native memory takes time but needs experiments
    cons: If the client API invoke the function in asynchronous way (which is usually the case in Java), the dedicated thread is not fast enough to clean up the native resource and cause OOM. Additionally, if engines like pytorch and apache mxnet don't support thread-safe delete, it will cause the JVM crash. As there might be a small chance that both PointerScope & dedicated thread release the native resource at the same cause and ran into double delete crash. Both PyTorch & Apache MXNet are of this type. So we have no choice but to go with approach 2. Do you know if tensorflow is thread-safe in terms of releasing the native resource like tensor?

  2. C++ style: release the memory as soon as we mark the memory is not used.
    pros: The memory consumption is small comparing to approach 1.
    cons: The performance could be slightly slower than approach 1 but again it requires experiments.

Either way is trying to get rid of GC (System.gc()) which slows down the entire system. The right pattern of coding Deep Learning system is to make sure all native resources are tracked by a scope and release the memory without a leak and help of GC.

@saudet Does JavaCpp support approach 2 if users are working on memory-intensive application?

JavaCPP supports both the "Java style", that is using GC via PhantomReference and a ReferenceQueue, and "C++ style" via PointerScope. Currently, JavaCPP only uses 1 "deallocator thread", but it is possible to extend that to support multiple threads and increase throughput. However, there are too many limitations to that approach using GC that I don't think it's worth spending too much time on that. Like you mention, it's problematic when the native allocator isn't thread-safe, but even when it is thread-safe, allocating more memory than required leads to memory fragmentation and poor performance, typically poorer than managing it "C++ style" in my experience. So it is almost always a better idea to manage native memory like one would do it in C++ or Python. That's exactly the paradigm that PointerScope attempts to bring to Java, that we can use across libraries, like I showed here for the C++ APIs of TensorFlow and OpenCV: http://bytedeco.org/news/2018/07/17/bytedeco-as-distribution/

@saudet saudet mentioned this issue Feb 26, 2021
@saudet
Copy link
Contributor

saudet commented Feb 26, 2021

@karllessard I'm trying to find where in the old code it was using weak references, and I can't find. I don't remember removing anything like that myself either. From what I can tell, EagerSession has always been keeping strong references with this map: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/java/src/main/java/org/tensorflow/EagerSession.java#L499

@Craigacp
Copy link
Collaborator

NativeReference is a subclass of java.lang.ref.PhantomReference.

@karllessard
Copy link
Collaborator

I checked the source code. It should not be a default EagerSession. The defaultEagerSession is created only when you call initDefault(options) or getDefault()

@stu1130 , even if it is not the default session, if the created session remains open for a long time while creating a lot of ops and tensors, the issue can happen

@karllessard I'm trying to find where in the old code it was using weak references, and I can't find. I don't remember removing anything like that myself either. From what I can tell, EagerSession has always been keeping strong references with this map

@saudet: like @Craigacp pointed out, NativeReference extends PhantomReference and are extended by each type of native resource, like here. So basically how it works is that if a NativeReference was queued by the GC, its delete() method was invoked to release the native resources and it was removed from session the session scope. If it was never queued but the session was closed, all remaining NativeReference in the session scope were deleted explicitly. So what I was saying is that the eager session was holding references acting "weakly" and not necessarily stacking WeakReference objects.

Which patch?

There is none yet :) Ideally, like we discussed, it would be possible to refer weakly to a Pointer in a PointerScope from JavaCPP directly. Otherwise we have to find another way to restore the previous/correct behavior. I was curious to know if @skirdey was already setup to try a patch if we have one but I guess I can easily reproduce the problem on my side and validate this hypothesis.

@saudet
Copy link
Contributor

saudet commented Feb 27, 2021

NativeReference is a subclass of java.lang.ref.PhantomReference.

Ah, that's where I screwed up. I remember getting confused about what NativeReference was exactly and associating it with Pointer instead of more like PhantomReference<Pointer>. Well, let's see, that's a pattern currently supported by neither the Cleaner in JDK 9+ nor the ResourceScope as proposed by Panama, so it's probably not something I want to encourage users to start doing as part of JavaCPP either... See the point I make in pull #188 (comment).

If we still want to have something that behaves like the original implementation, we should probably just replace the PointerScope field in EagerSession with some sort of collection of WeakReference<Pointer>, as currently proposed by @rnett for TensorScope as per pull #188 (review).

@karllessard
Copy link
Collaborator

karllessard commented Feb 27, 2021

Ok, I understand that you want JavaCPP to follow closely the (future) behaviour of the JDK.

I'm personally totally comfortable with what @rnett proposed about weak tensor scopes, which is quite identical to the original behaviour of the eager sessions. So TF could have its own "weak" pointer scope. @rnett , were you planning to change your proposed TensorScope to make it more generic for any kind of native resources, like the eager ops?

If that's reshuffling too much of your work, we can simply keep our own weak references directly in the EagerSession like before, and since JavaCPP will still take care of the cleaner portion, it should really just be a small change to do.

@rnett
Copy link
Contributor

rnett commented Feb 27, 2021

I wasn't planning on doing it as part of TensorScope, but rather as part of Ops or Scope since we're passing it around anyways. But that's all up for discussion, I don't have anything particularly firm yet.

@karllessard
Copy link
Collaborator

Ok let's play it simple for now and I'll try to see how it works by simply replacing the PointerScope in EagerSession by a list of weak references, I'll let you all know of the results.

@karllessard
Copy link
Collaborator

karllessard commented Mar 1, 2021

Ok, so the issue was very easy to reproduce. Starting a JVM with only 256M of memory, this simple loop was hitting a OOM between 30K and 40K iterations:

  public static void main(String[] args) {
    try (EagerSession s = EagerSession.create()) {
      Ops tf = Ops.create(s);
      while (true) {
        tf.math.add(tf.constant(2), tf.constant(2));
      }
    }
  }

I pushed this PR #229 to keep only weak references on eager resources in the session, as proposed earlier, and now the garbage collection allows this loop to run forever. I'm pretty confident this should fix the issue observed earlier by @skirdey when using DJL.

@skirdey
Copy link
Author

skirdey commented Mar 1, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants