Memory leak in eager mode when creating keras model in loop #30324
Comments
Does `tf.keras.backend.clear_session()` help?

It doesn't work! @bionicles
A similar situation: I train the model in the main thread and load the model in another thread AT THE SAME TIME. All things are in a loop. However, if I use the clear_session() method in one thread, the code in the other thread won't work!!! I tested pytorch and mxnet, and there is no memory leak in a loop. Why??? Amazing tensorflow!!! I think that clear_session shouldn't be necessary. @bionicles @tjume
That is purely logical. Threads share memory, thus if you call `clear_session()` in one thread, you clear the backend state that every other thread relies on as well. Now, I do agree (after reproducing the issue, which is all the more present in TF 2.0 with Eager execution enabled by default) that the absence of implicit garbage collection inside the loop is a bit annoying. Note that if you use `clear_session()` together with `gc.collect()` inside the loop, the memory is freed properly.
My humble opinion, however, is that it really is not a huge effort on your end as a programmer to add a couple of memory-freeing instructions now and then in your code... In my experience, this is actually pretty common when you are creating stuff faster than the garbage collector will take care of. But that will be up to the developers/maintainers to decide!
Maybe I didn't make it clear. I mean: I train model A in the main thread and load model B and predict in another thread AT THE SAME TIME. All things are in a loop. However, if I use the clear_session() method in one thread (to clear model A, e.g.), then model B in the other thread doesn't work (DOESN'T predict). As you can see, there is no relationship between model A and model B, yet it seems that clearing model A can influence model B. Why??? That isn't logical. @pandrey-fr
Oh, I see! Thanks for clarifying the issue. From my understanding (but I might be wrong - I am just a tensorflow user, not an expert, let alone a developer), `clear_session()` resets the entire backend state, not just the objects created in the calling thread, so any model living in any other thread is affected as well. Examples:
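A minimal sketch of the behaviour described, assuming graph-mode semantics (TF 1.x, or TF 2.0 back-end graphs), where `clear_session()` destroys the graph that every existing model belongs to:

```python
import numpy as np
import tensorflow as tf

# Two independent models, but both live in the same global backend state.
model_a = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model_b = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

x = np.random.rand(2, 4).astype("float32")
model_b.predict(x)  # works fine

# Clearing the session to dispose of model_a also resets the state
# that model_b was built against...
tf.keras.backend.clear_session()

# ...so in graph mode this call can now fail, even though model_b
# was never touched directly.
model_b.predict(x)
```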
That being clarified, I now agree with you that it could be useful to have at our disposal a generic function to clear out a specific keras model without having to worry about whether it has been used or not. Let's wait for somebody from the maintenance / development team to actually pick up this issue!
OK. Thanks in advance @pandrey-fr
Why is this not fixed, given all the GitHub threads I've come across? It seems pretty useless to be unable to call model.predict() in a loop without eventually maxing out your memory and dumping the whole thing. Looks like I may have to switch to PyTorch.
There has been some (great) improvement on those issues, notably in 2.0-rc0, and we can expect more with the upcoming actual 2.0 (and 1.15) release(s). The issue arises from Eager execution triggering the creation of (sometimes usefully) redundant back-end graphs, which sometimes end up not being properly discarded. It seems to no longer happen when using tf.data.Dataset objects (which feels logical to me: Datasets ensure the homogeneity of samples' specification, hence making it safe to re-use the same back-end graph), but there are still some issues when feeding individual EagerTensors.

I would hope it will be fixed at some point, but you have to understand that Eager execution is a big turn compared to how TensorFlow's back-end works, which used to be the normal way of writing TF code until not so long ago. It is therefore bound to take a little time to fix everything, and to be honest I am personally amazed by how fast it is going: when I moved to using Eager a few months ago, I felt like it was a terrible choice leading to huge performance drops and memory-leak issues, while today the former have mostly vanished and the latter are progressively being solved. So, my point is, we, as users, have to show a little patience.
The problem is honestly not that big, but yes, in some cases such a problematic behaviour arises. Note, however, that you can work around it, notably by using a Dataset object (and I can hear that this is an effort you would rather not have to make). You could also stick with 1.14 and Eager disabled.
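A minimal sketch of the Dataset workaround mentioned above, with made-up shapes:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
x = np.random.rand(1024, 4).astype("float32")

# A Dataset gives every batch the same element spec, so the backend can
# keep re-using one traced graph instead of accumulating new ones.
dataset = tf.data.Dataset.from_tensor_slices(x).batch(32)
predictions = model.predict(dataset)
```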
Honestly, I do not believe anyone "has to" switch to PyTorch, nor to TensorFlow. You should pick the framework that suits you best at a given point, and be open to change when relevant (which is less and less hard as their high-level APIs look more and more similar). If you feel like PyTorch works better for you, switch to it, but please do not look at it as a forced thing nor as part of a "choose your holy side and be verbose about it" decision. Both frameworks' devs are doing their best, they disagree on some points, and there is something of a competition for users between them, but in my humble opinion we, as users, actually tend to benefit from it. Eager is clearly a response to PyTorch having a similar behaviour, but TF devs have also shown their ability to make it great while preserving the back-end specifics of TF, and not just make it a façade filled with bugs (which it kind of felt like in the beginning). So, what I am saying is: if you want to move to PyTorch, do it, but please do not make it sound like TF devs are not doing their job. That would be rather disrespectful, and pointless, since it is easy to see that they are actually working on solving issues (and rather succeeding at it).
I don't follow. Creating models in a loop will increase memory; this is a known issue with the way the keras backend manages state. However, I don't see any leak when calling predict in a loop: https://colab.sandbox.google.com/gist/robieta/cc5e2ccb179d97441e08fab3220ca5bf/predict_leak_test.ipynb
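A sketch of this kind of measurement (the linked gist is the authoritative version), assuming `psutil` for the RSS readout:

```python
import os

import numpy as np
import psutil
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
x = np.random.rand(32, 4).astype("float32")
process = psutil.Process(os.getpid())

for i in range(1000):
    model.predict(x)
    if i % 100 == 0:
        # RSS in MiB; a steady climb here would indicate a leak.
        print(i, process.memory_info().rss / 2**20)
```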
@robieta When I run a similar test (on tf2.0-rc0), I can actually see (using psutil, as in the code initially provided by the person who opened the issue) a small RAM usage increase unless calling tf.keras.backend.clear_session().
@robieta In short, I am having this issue when implementing the Monte Carlo rollout policy from SeqGAN (https://arxiv.org/pdf/1609.05473.pdf), which requires me to complete a sentence N times, i.e. N*sentence_length calls to model.predict() to get the next word, as well as model2.predict() calls to get the score for that sentence. If I have a sentence of length 30 tokens, and I want 20 examples for each word, I end up having to make 9280 total calls to model.predict() to get the rewards I need. Using TF 2.0, I cannot even make it through one complete iteration without everything dumping and getting an OOM error. Is the only fix for this at the moment just rolling back to TF 1.15?
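To make the call count concrete, here is a hypothetical sketch of the rollout bookkeeping being described (the actual SeqGAN code is not shown in this thread); with 30 tokens and 20 rollouts per position it reproduces the 9280 figure:

```python
# Hypothetical shape of the Monte Carlo rollout described above.
sentence_length = 30  # tokens per sentence
n_rollouts = 20       # samples per position

calls = 0
for t in range(1, sentence_length):          # each prefix position
    for _ in range(n_rollouts):              # N rollouts per position
        for _ in range(t, sentence_length):  # complete the sentence token by token
            calls += 1                       # one model.predict() per next word
        calls += 1                           # one model2.predict() to score the rollout
print(calls)  # 9280 tiny predict() calls per training step
```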
I'm using TensorFlow 2.0.0 and Keras 2.3.1 and still having the same issue.
I'm going to close this, since the original issue (growth from model creation) has been addressed as a known issue and the thread is now drifting somewhat. We have also recently made a fix to the internals of TF that eliminates a leak when invoking model.predict many, many times. If you are still seeing issues with predict, feel free to open a new issue with a repro. Thanks for all of the feedback.
System information
Describe the current behavior

In eager execution, when creating a `tf.keras.Sequential` model inside a loop and discarding it immediately, the memory increases over time. The following code shows this by printing the used memory at each iteration.
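A minimal sketch along the lines described, assuming `psutil` for the memory readout:

```python
import os

import psutil
import tensorflow as tf

# Run with eager execution enabled (the default in TF 2.0).
process = psutil.Process(os.getpid())

for i in range(100):
    # Create a throwaway model and discard it immediately.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(10,))])
    del model
    # RSS in MiB; this keeps growing from one iteration to the next.
    print(i, process.memory_info().rss / 2**20)
```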
The same result happens when using the Functional API or the Model-subclassing API. Adding `tf.keras.backend.clear_session()` in the loop solves the leak in all cases, like in graph mode. To see this effect better, one should additionally use `gc.collect()` in the loop, as in the sketch below.
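A sketch of the workaround variant, under the same assumptions:

```python
import gc
import os

import psutil
import tensorflow as tf

process = psutil.Process(os.getpid())

for i in range(100):
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(10,))])
    del model
    tf.keras.backend.clear_session()  # frees the backend state...
    gc.collect()                      # ...and collects it right away
    print(i, process.memory_info().rss / 2**20)  # stays flat with the two calls above
```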
Describe the expected behavior

While adding `tf.keras.backend.clear_session()` to the loop helps, this should not be necessary, because in eager execution there is no graph to clear, which according to the documentation seems to be the only thing this function does:

> Destroys the current TF graph and creates a new one. Useful to avoid clutter from old models / layers.

It is therefore also surprising that this function helps at all during eager execution. The expected behavior is that there is no memory leak even without `tf.keras.backend.clear_session()`.
Code is in description above.
Other info / logs
Nothing here.