Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep weak references to eager resources in session #229

Merged
merged 2 commits into from
Mar 2, 2021

Conversation

karllessard
Copy link
Collaborator

@karllessard karllessard commented Mar 1, 2021

This is a fix for #208

As discussed in this thread, when migrating to JavaCPP, a bug/change of behavior was introduced that was preventing the allocated eager resources to be garbage collected while the session was alive. On long-live sessions (especially when using the default session), the help of the GC is mandatory to prevent OOM.

This PR restores the original behavior where resources will be automatically freed once the session is closed, without preventing them to be garbage-collected upfront when they are unreachable and that memory runs low.

@karllessard
Copy link
Collaborator Author

@rnett , this PR might conflict with your actual work with TensorScope but I suggest we merge it first since it fixes a critical issue present in our library. We can review together afterward how to integrate both concepts together.

@rnett
Copy link
Contributor

rnett commented Mar 1, 2021

Seems fine, I'll check when I rebase. I don't think it will conflict too much from a quick look, I don't really touch the eager handles.

@saudet
Copy link
Contributor

saudet commented Mar 1, 2021

This PR restores the original behavior where resources will be automatically freed once the session is closed, without preventing them to be garbage-collected upfront when they are unreachable and that memory runs low.

There is absolutely no guarantee that GC will run even and especially when memory is low. Your application can and will crash. The GC will not help you. Please stop thinking of GC as something that you should be using. You should not be using it at all.

@karllessard
Copy link
Collaborator Author

@saudet I know there is no guarantee and we can review this whole paradigm later on but what really matters now is to restore the original behavior which, afaik, never caused any issues.

Copy link
Collaborator

@Craigacp Craigacp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The EagerSession.attach and EagerSession.detach methods now behave quite differently to how they used to, and differently from PointerScope.attach/detach, so I think it's worth documenting their current behaviour before we merge this in. In general it might be worth documenting with comments that this is the current behaviour so we don't lose it in another refactor. Otherwise in a year we'll need to go through the git history to figure out why this happened.

}

@Override
public synchronized void close() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This close method doesn't prevent the object from being reused. Do we want it to?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is mandatory but it does not make much sense reusing it also. I can reset the pointers to null on close if you prefer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking a boolean closed which is checked, but setting the pointers to null is probably fine. The latter will crash if closed multiple times though.

@karllessard
Copy link
Collaborator Author

Ok @Craigacp , I've made the requested changes plus added some basic unit testing on WeakPointerScope itself.

Copy link
Collaborator

@Craigacp Craigacp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@karllessard
Copy link
Collaborator Author

Thanks for the quick review @Craigacp !

@karllessard karllessard merged commit f6024dd into tensorflow:master Mar 2, 2021
@karllessard karllessard deleted the eager-gc-fix branch March 2, 2021 04:18
@saudet
Copy link
Contributor

saudet commented Mar 2, 2021

Actually, this "feature" could cause problems when the user sets "org.bytedeco.javacpp.nopointergc". In that case, we must make sure that deallocation is performed manually or via something like PointerScope. If a reference is lost by a WeakReference, memory leaks will occur. I'm sorry to have proposed this, but this was actually a really bad idea...

@saudet
Copy link
Contributor

saudet commented Mar 2, 2021

Still, I don't think we need to modify JavaCPP to get this working. Instead of keeping a WeakReference to the allocated Pointer, it should suffice to keep strong references to the Deallocator objects that are registered, for example, in the case of Tensor, here: https://github.com/tensorflow/java/blob/master/tensorflow-core/tensorflow-core-api/src/main/java/org/tensorflow/internal/c_api/AbstractTF_Tensor.java
When memory has already been deallocated, those will be null, but that may (read "will") end up in a race condition with the deallocator thread... so you'll probably need to provide some additional locking mechanism here.

@Craigacp
Copy link
Collaborator

Craigacp commented Mar 2, 2021

Actually, this "feature" could cause problems when the user sets "org.bytedeco.javacpp.nopointergc". In that case, we must make sure that deallocation is performed manually or via something like PointerScope. If a reference is lost by a WeakReference, memory leaks will occur. I'm sorry to have proposed this, but this was actually a really bad idea...

Turning off the pointer gc in JavaCPP seems to be a bad idea, given we can't enforce people use try with resources to make them clean up after themselves. The TF 1.x code had a background thread as a safety net, which mirrors what JavaCPP's pointer GC does, so we can just tell people it's not supported to turn that off (or even check on startup and throw an exception if it's disabled). When we move to Java 11 we can migrate this into a cleaner and have the JVM manage that thread for us.

What would be an alternative that allows EagerOperations to be cleaned up when they go out of scope? The initial JavaCPP implementation doesn't work as it unnecessarily holds on to too much memory, and both this and the TF 1.x version look like they should work (assuming nopointergc is false in this version).

@Craigacp
Copy link
Collaborator

Craigacp commented Mar 2, 2021

Still, I don't think we need to modify JavaCPP to get this working. Instead of keeping a WeakReference to the allocated Pointer, it should suffice to keep strong references to the Deallocator objects that are registered, for example, in the case of Tensor, here: https://github.com/tensorflow/java/blob/master/tensorflow-core/tensorflow-core-api/src/main/java/org/tensorflow/internal/c_api/AbstractTF_Tensor.java
When memory has already been deallocated, those will be null, but that may (read "will") end up in a race condition with the deallocator thread... so you'll probably need to provide some additional locking mechanism here.

The documentation for the Pointer constructor that bottoms out in says it keeps a reference to the thing that's passed in (even though the code doesn't look like it does). Given the deallocator keeping a strong reference to the object is what we don't want, what does it do? The documented behaviour or what it looks like the Java code does? I ask because I'm not sure if the generate JNI is playing tricks, because IntelliJ isn't clever enough to figure out where pointers get attached to scopes so it's very hard to trace through JavaCPP code.

Nevermind, I misread the if statement. If the pointergc is turned off, what is cleaning the deallocator reference queue?

@karllessard
Copy link
Collaborator Author

If a reference is lost by a WeakReference

@saudet can you explain better what you mean by losing a weak reference? Ok so I wanted to keep that conversation as a separate issue but just for the sake of stopping adding quotes around "fix" and "feature", let's clarify a few things :)

I might have expressed myself incorrectly in the description of this PR but all eager resources are actually protected by a scope, independently from the GC, and this scope if hold by EagerSession itself. The problem is that there is no obvious way for the library to control the number of operations a user might do inside a single session, it really depends on the design of their software.

In DJL, it seems that their eager session live for relatively a long time, probably enough to accumulate many thousands of operations. Each operation created within a session remain alive until the session is closed since the library don't know upfront if and when a user will need to access them (to retrieve a result tensor, for example).

So the GC here is our friend, giving us a hand to detect when it is safe to deallocate an operation while the session is still alive and does a pretty good job doing it, especially that native objects referenced by this operations are just small value objects and not large tensors. But if the user manages its eager sessions so that they are closed as soon as any of their operations has been consumed by the software, they won't need the help of the GC. And this is already supported by the actual code.

Something we should probably do is to either get rid of the dangerous default eager session (since that one is never closed) or document it better to warn the users that working with long-live sessions could potentially ends up in OOM (even if chances are high that they will be prevented with the assistance of the GC, we cannot guarantee it).

Also something I'm not too clear about is the usage of eager mode at all in production. Often I see in TF documentation that this mode is useful for debugging but for good performances, distribute training, etc. you should rely on graphs.

@saudet
Copy link
Contributor

saudet commented Mar 3, 2021

Turning off the pointer gc in JavaCPP seems to be a bad idea, given we can't enforce people use try with resources to make them clean up after themselves. The TF 1.x code had a background thread as a safety net, which mirrors what JavaCPP's pointer GC does, so we can just tell people it's not supported to turn that off (or even check on startup and throw an exception if it's disabled). When we move to Java 11 we can migrate this into a cleaner and have the JVM manage that thread for us.

Disabling the GC features of JavaCPP increases performance. You're the one that says that we should work with the community, so if DJL asks for this, think about what is going to be your reply. If you're basically going to tell them that TF Java doesn't care about performance, what do you think they are going to do? And they're already thinking about it. You're just confirming here their suspicions that you don't care about their needs. That said, it's possible that disabling all this stuff gives us nothing compared to the overhead of TF Core, sure, but be prepared to back up your claims.

What would be an alternative that allows EagerOperations to be cleaned up when they go out of scope? The initial JavaCPP implementation doesn't work as it unnecessarily holds on to too much memory, and both this and the TF 1.x version look like they should work (assuming nopointergc is false in this version).

Users that need it can use TensorScope wherever appropriate, but for users that don't do that, they don't need to do anything, and the GC will do its best to clean up the mess. It's not any different from how TF 1.x was working.

Nevermind, I misread the if statement. If the pointergc is turned off, what is cleaning the deallocator reference queue?

There is no reference queue, it's not used. It's unnecessary baggage that only reduces performance:
https://github.com/bytedeco/javacpp/blob/1.5.4/src/main/java/org/bytedeco/javacpp/Pointer.java#L460-L468

@Craigacp
Copy link
Collaborator

Craigacp commented Mar 3, 2021

Turning off the pointer gc in JavaCPP seems to be a bad idea, given we can't enforce people use try with resources to make them clean up after themselves. The TF 1.x code had a background thread as a safety net, which mirrors what JavaCPP's pointer GC does, so we can just tell people it's not supported to turn that off (or even check on startup and throw an exception if it's disabled). When we move to Java 11 we can migrate this into a cleaner and have the JVM manage that thread for us.

Disabling the GC features of JavaCPP increases performance. You're the one that says that we should work with the community, so if DJL asks for this, think about what is going to be your reply. If you're basically going to tell them that TF Java doesn't care about performance, what do you think they are going to do? And they're already thinking about it. You're just confirming here their suspicions that you don't care about their needs. That said, it's possible that disabling all this stuff gives us nothing compared to the overhead of TF Core, sure, but be prepared to back up your claims.

In a choice between correctness and performance but random memory crashes, I will pick correctness. Also how much does this affect performance, and is this performance actually an issue on different JVMs (given that with the introduction of Java 9 cleaners it's likely to have received some optimisations)?

DJL's main issue seemed to be the memory leak, and this PR fixes that. Their use of constant operations is a worry, but they should revert that and move back to using Tensor.

What would be an alternative that allows EagerOperations to be cleaned up when they go out of scope? The initial JavaCPP implementation doesn't work as it unnecessarily holds on to too much memory, and both this and the TF 1.x version look like they should work (assuming nopointergc is false in this version).

Users that need it can use TensorScope wherever appropriate, but for users that don't do that, they don't need to do anything, and the GC will do its best to clean up the mess. It's not any different from how TF 1.x was working.

TensorScope wouldn't work for cleaning up EagerOperation. It's a different handle. We could introduce an EagerOperationScope too, but how would we recommend people use that? Tensors have well defined lifetimes wrt a program, but defining the lifetime of an operation sounds much harder.

Nevermind, I misread the if statement. If the pointergc is turned off, what is cleaning the deallocator reference queue?

There is no reference queue, it's not used. It's unnecessary baggage that only reduces performance:
https://github.com/bytedeco/javacpp/blob/1.5.4/src/main/java/org/bytedeco/javacpp/Pointer.java#L460-L468

Ok, so if the pointergc is turned off then it will leak memory if things aren't always enclosed by a pointer scope which is closed?

@karllessard
Copy link
Collaborator Author

I’m curious to know more about the performance drop when GC is enabled, @saudet do you have some metrics to share?

Also just wanted to put emphasis one more time that TF Java works totally fine with GC disabled, it’s just a matter of closing the eager sessions in time.

@karllessard
Copy link
Collaborator Author

On the other hand, what @rnett proposes in #188 is to at least enforce the scoping of the tensors that are resulting of an eager operation (i.e. operand.tensor(scope)).

While it adds some additional complexity on the user, it helps him to stay on a safe track without relying too much on the GC, which now seems to be a good trade off.

@rnett
Copy link
Contributor

rnett commented Mar 3, 2021

On the other hand, what @rnett proposes in #188 is to at least enforce the scoping of the tensors that are resulting of an eager operation (i.e. operand.tensor(scope)).

While it adds some additional complexity on the user, it helps him to stay on a safe track without relying too much on the GC, which now seems to be a good trade off.

I'm still in favor of adding a scope for eager operands (well, really all operands to keep it environment agnostic, but it wouldn't do much for Graph), and we could eventually re-add a asTensor() method that uses the operand scope. As @Craigacp said though it's going to be a bit more complicated than tensors (although I don't think it's too bad) and there will be more cases where it's not applicable. Plus for tensors detach() is a thing, so you can still get them scope free, it's just a bit harder (which imo is a feature).

@skirdey
Copy link

skirdey commented Mar 3, 2021

I can speak for the performance issues while enabling GC and using DJL on TF Java 0.0.2 -
The Deallocator thread makes GC spend 20 percent of the time in the cleanup, it is also a blocking thread.
And after a while there is a OOM

When using nopointergc=true - GC gets a lot of breathing room, no more continous churning in cleanup, but OOM happens a lot and a lot faster.

I move somewhere from 1 gigabyte a second to 15 gigabytes a second of data through inference.

I haven't tried the fix merged here as I've moved to use DJL and PyTorch which has no GC or memory leak issues so far.

Once the fix is released and DJl updates its TF dependencies Ill give it a try again as I have lots of use cases for DJL and Tensorflow.

@saudet
Copy link
Contributor

saudet commented Mar 3, 2021

In a choice between correctness and performance but random memory crashes, I will pick correctness. Also how much does this affect performance, and is this performance actually an issue on different JVMs (given that with the introduction of Java 9 cleaners it's likely to have received some optimisations)?

The JDK's implementation of the Cleaner doesn't do anything special. It just uses PhantomReference with a ReferenceQueue, and spins that in a thread. It's just for convenience:
https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java

DJL's main issue seemed to be the memory leak, and this PR fixes that. Their use of constant operations is a worry, but they should revert that and move back to using Tensor.

As @skirdey points out, I don't believe that's the case. We need to keep the conversation going with them instead of assuming that we know what's best for everyone!

TensorScope wouldn't work for cleaning up EagerOperation. It's a different handle. We could introduce an EagerOperationScope too, but how would we recommend people use that? Tensors have well defined lifetimes wrt a program, but defining the lifetime of an operation sounds much harder.

That paradigm works fine in C++ and Python where "scopes" like that come built-in with the language. Do you have an example of something that sounds hard to do?

There is no reference queue, it's not used. It's unnecessary baggage that only reduces performance:
https://github.com/bytedeco/javacpp/blob/1.5.4/src/main/java/org/bytedeco/javacpp/Pointer.java#L460-L468

Ok, so if the pointergc is turned off then it will leak memory if things aren't always enclosed by a pointer scope which is closed?

Yes, that's correct. We can also close pointers individually too.

I’m curious to know more about the performance drop when GC is enabled, @saudet do you have some metrics to share?

Also just wanted to put emphasis one more time that TF Java works totally fine with GC disabled, it’s just a matter of closing the eager sessions in time.

No, it won't work correctly the way things are right now. Please reread what I wrote above. I've taken a look back at how it was done for TF 1.x as well, and its implementation is also incorrect. With that one, even with GC, references can (and will) leak.

Once the fix is released and DJl updates its TF dependencies Ill give it a try again as I have lots of use cases for DJL and Tensorflow.

Would you have a small piece of code to share with us that faithfully demonstrates the issue? It would help make sure that we come up with a solution that everyone likes and that actually meets your needs.

@skirdey
Copy link

skirdey commented Mar 3, 2021 via email

@saudet
Copy link
Contributor

saudet commented Mar 3, 2021

I've described the issue and the way to reproduce it by setting nopointergc=true while using stock DJL benchmark and gradle. You can also run same benchmark using PyTorch engine instead awslabs/djl#690

Thanks! Missed the bit about that being the standard benchmark from DJL. Would you have something that uses TF only though?

@skirdey
Copy link

skirdey commented Mar 3, 2021 via email

@karllessard
Copy link
Collaborator Author

karllessard commented Mar 3, 2021

The JDK's implementation of the Cleaner doesn't do anything special. It just uses PhantomReference with a ReferenceQueue, and spins that in a thread.

The thread does not seem to be blocking synchronized though.

So when the synchronization happens in JavaCPP, when it frees the memory? Could that be prevented or the trick is to deallocate less often, in chunks (scopes), rather than multiple small pieces (which is what the GC would do)?

@Craigacp
Copy link
Collaborator

Craigacp commented Mar 3, 2021

In a choice between correctness and performance but random memory crashes, I will pick correctness. Also how much does this affect performance, and is this performance actually an issue on different JVMs (given that with the introduction of Java 9 cleaners it's likely to have received some optimisations)?

The JDK's implementation of the Cleaner doesn't do anything special. It just uses PhantomReference with a ReferenceQueue, and spins that in a thread. It's just for convenience:
https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java

Yes, but idioms inside the JDK tend to receive performance optimisations from the JVM which might not be visible at the class file level. For example that CleanerImpl class references InnocuousThread which is a JDK internal class that might be treated differently by the runtime. I don't know if this is the case, but I do know that the GC has received a lot of attention in versions after Java 8 and so it's behaviour and associated overheads might be lower.

TensorScope wouldn't work for cleaning up EagerOperation. It's a different handle. We could introduce an EagerOperationScope too, but how would we recommend people use that? Tensors have well defined lifetimes wrt a program, but defining the lifetime of an operation sounds much harder.

That paradigm works fine in C++ and Python where "scopes" like that come built-in with the language. Do you have an example of something that sounds hard to do?

Well in Python TF they don't scope the eager operations, so we should figure out what the lifetime of them is and see if there is a natural mapping to Java. But more generally it's easy to say when a tensor is being referenced and thus what it's lifetime is. With an eager operation should we scope them at a method level (e.g. if I have a method that implements a single layer of a ResNet, should we scope the ops to that and have them be closed in a try-with-resources on method exit), or does that cause other performance issues when run inside a loop (e.g. to build up the rest of the ResNet). I don't know the answers, and we should try to figure that out before introducing scoping on something.

I’m curious to know more about the performance drop when GC is enabled, @saudet do you have some metrics to share?
Also just wanted to put emphasis one more time that TF Java works totally fine with GC disabled, it’s just a matter of closing the eager sessions in time.

No, it won't work correctly the way things are right now. Please reread what I wrote above. I've taken a look back at how it was done for TF 1.x as well, and its implementation is also incorrect. With that one, even with GC, references can (and will) leak.

Where's the leak? I went through and couldn't see how a reference could escape the set, and with the pointer gc on it will catch and deallocate everything. I agree with it turned off then this leaks, but again we could just tell people that's not supported.

@saudet
Copy link
Contributor

saudet commented Mar 4, 2021

The JDK's implementation of the Cleaner doesn't do anything special. It just uses PhantomReference with a ReferenceQueue, and spins that in a thread.

The thread does not seem to be blocking synchronized though.

Like I said at #208 (comment) and a bunch of other places, it will not block if you set maxBytes and maxPhysicalBytes to 0.
Or said another way: If you don't like blocking, do like DJL and set them to 0 by default! Sounds good?

Access to the linked list used by the Cleaner is synchronized, yes:
https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/jdk/internal/ref/PhantomCleanable.java
That can block too, but it's not usually a problem.

So when the synchronization happens in JavaCPP, when it frees the memory? Could that be prevented or the trick is to deallocate less often, in chunks (scopes), rather than multiple small pieces (which is what the GC would do)?

Like I keep telling you guys, thinking about GC is a dead end. Please stop thinking about anything related to GC!

Yes, but idioms inside the JDK tend to receive performance optimisations from the JVM which might not be visible at the class file level. For example that CleanerImpl class references InnocuousThread which is a JDK internal class that might be treated differently by the runtime. I don't know if this is the case, but I do know that the GC has received a lot of attention in versions after Java 8 and so it's behaviour and associated overheads might be lower.

Yeah, ok, but it still doesn't make GC work any better for native resources. Panama has been having problems with this for over 5 years now, and the JDK in general for over 20 years (!!), and they've just started to get a grip on the situation, whereas PointerScope from JavaCPP has been working fine since 2018: http://bytedeco.org/news/2018/07/17/bytedeco-as-distribution/ It's by no means perfect, but it gets the job done. It may cause the JVM to crash and what not, but so can TF Core. We could rewrite TF Core in Java, but we're not going to do this, right? ... What are we arguing about exactly?

Well in Python TF they don't scope the eager operations, so we should figure out what the lifetime of them is and see if there is a natural mapping to Java. But more generally it's easy to say when a tensor is being referenced and thus what it's lifetime is. With an eager operation should we scope them at a method level (e.g. if I have a method that implements a single layer of a ResNet, should we scope the ops to that and have them be closed in a try-with-resources on method exit), or does that cause other performance issues when run inside a loop (e.g. to build up the rest of the ResNet). I don't know the answers, and we should try to figure that out before introducing scoping on something.

Well, look, I keep telling you, I'm pretty sure DJL needs something like this, and for the kind of applications they are working on, as @skirdey found out and wrote at #208 (comment), something like PointerScope works just fine at least for some things. So, are you going to ignore this? If so, we can just stop discussing here, and I'll simply go help them get their stuff running with JavaCPP directly.

No, it won't work correctly the way things are right now. Please reread what I wrote above. I've taken a look back at how it was done for TF 1.x as well, and its implementation is also incorrect. With that one, even with GC, references can (and will) leak.

Where's the leak? I went through and couldn't see how a reference could escape the set, and with the pointer gc on it will catch and deallocate everything. I agree with it turned off then this leaks, but again we could just tell people that's not supported.

Right, delete() and deleteAll() appear synchronized enough that calling them concurrently should be fine. Looking more closely, I think the only issue is when close() doesn't get called, the context itself never gets deleted. Anyway, if we wanted to do something similar to that with JavaCPP, I'd synchronize the deallocate() method here:
https://github.com/tensorflow/java/blob/master/tensorflow-core/tensorflow-core-api/src/main/java/org/tensorflow/internal/c_api/AbstractTFE_Op.java#L30
That gets assigned to Pointer.deallocator, which we can retrieve with the protected deallocator() method from those subclasses, keep strong references to them, and call deallocate() manually on them when desired. That should work well regardless of whether noPointerGC is enabled or not.

@saudet saudet mentioned this pull request Mar 5, 2021
@karllessard karllessard restored the eager-gc-fix branch April 6, 2021 22:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants