
[40171] Limit size of kernel cache #42337

Closed
wants to merge 2 commits

Conversation

@frreiss (Contributor) commented on Aug 13, 2020

The EagerContext class caches instances of kernels that have been used by the current session. Because kernel state is opaque to EagerContext, the caching mechanism is very conservative about reusing kernels. Also, there is no limit on the number of kernels cached. As a result of these two factors, the size of the kernel cache grows in an unbounded fashion if the application calls tf.saved_model.load() repeatedly. This growth is the root cause of one of the memory leaks observed in issue #40171 .
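To illustrate the growth mechanism, here is a simplified, standalone sketch of how a kernel cache key can be derived (illustrative only, not the actual EagerContext code): because kernel state is opaque, every attribute value is folded into the key, so an attribute that is unique to each loaded model maps every load to a brand-new cache entry.

```cpp
// Simplified illustration of a kernel cache key. Not the real EagerContext
// code; it only shows why per-load-unique attributes defeat cache reuse.
#include <cstddef>
#include <functional>
#include <map>
#include <string>

std::size_t KernelCacheKey(const std::string& op_name,
                           const std::string& device,
                           const std::map<std::string, std::string>& attrs) {
  std::hash<std::string> h;
  std::size_t key = h(op_name) ^ (h(device) << 1);
  for (const auto& [name, value] : attrs) {
    // Every attribute participates in the key, so an attribute that changes
    // on every tf.saved_model.load() yields a fresh key and a new kernel.
    key ^= h(name) + 0x9e3779b9 + (key << 6) + (key >> 2);
    key ^= h(value) + 0x9e3779b9 + (key << 6) + (key >> 2);
  }
  return key;
}
```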

This PR puts an upper limit on the size of the kernel cache in EagerContext. If the number of kernels exceeds this limit, the cache discards the least recently used kernel. I have hard-coded the capacity to a conservative 10,000 kernels, which should be large enough to prevent thrashing in normal usage while still providing protection against memory leaks.

I also added a RemoveKernelFromCache() method to go with the AddKernelToCache() method.
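For reference, here is a minimal standalone sketch of the LRU policy described above. The capacity and method names mirror this description, but this is not the actual EagerContext implementation.

```cpp
// Minimal LRU-bounded kernel cache sketch. KernelAndDevice is a stand-in for
// the real cached kernel type.
#include <cstddef>
#include <cstdint>
#include <list>
#include <memory>
#include <unordered_map>
#include <utility>

struct KernelAndDevice {};

class LruKernelCache {
 public:
  explicit LruKernelCache(std::size_t capacity = 10000) : capacity_(capacity) {}

  void AddKernelToCache(uint64_t cache_key,
                        std::shared_ptr<KernelAndDevice> kernel) {
    auto it = entries_.find(cache_key);
    if (it != entries_.end()) lru_.erase(it->second.second);
    lru_.push_front(cache_key);
    entries_[cache_key] = {std::move(kernel), lru_.begin()};
    if (entries_.size() > capacity_) {
      // Evict the least recently used kernel.
      entries_.erase(lru_.back());
      lru_.pop_back();
    }
  }

  std::shared_ptr<KernelAndDevice> GetCachedKernel(uint64_t cache_key) {
    auto it = entries_.find(cache_key);
    if (it == entries_.end()) return nullptr;
    // Move the hit to the front so it is evicted last.
    lru_.splice(lru_.begin(), lru_, it->second.second);
    return it->second.first;
  }

  void RemoveKernelFromCache(uint64_t cache_key) {
    auto it = entries_.find(cache_key);
    if (it == entries_.end()) return;
    lru_.erase(it->second.second);
    entries_.erase(it);
  }

 private:
  std::size_t capacity_;
  std::list<uint64_t> lru_;  // Most recently used key at the front.
  std::unordered_map<uint64_t,
                     std::pair<std::shared_ptr<KernelAndDevice>,
                               std::list<uint64_t>::iterator>>
      entries_;
};
```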

Here is a graph showing memory usage for a Python script that repeatedly loads and unloads a toy Keras model:
[Figure: memory footprint over repeated load/unload iterations, before and after this change]
The blue points show the memory footprint of the script before applying these changes; the orange points show the footprint after applying them.

After the changes in this PR, memory usage of my test script increases until the kernel cache reaches its maximum size. After that, the memory usage increases more slowly, because there are additional memory leaks in saved_model.load() that are not addressed in this PR.

The google-ml-butler bot added the size:M (CL Change Size: Medium) label on Aug 13, 2020
@jaingaurav (Contributor) commented on Aug 13, 2020

Thanks for this, but I do not believe this is the correct fix. The kernel cache constantly growing is a sign of a different bug: kernels that should not be cached are being cached. By limiting the kernel cache, we risk hiding actual bugs behind poor performance.

Looking at your thorough investigation, it seems the problem is that VarHandleOp kernels are being cached because each instance has a unique shared_name. If so, we should simply exclude VarHandleOp from kernel-cache candidacy, as we already do for the tf.data ops. The change can be made here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/eager/execute.cc#L92
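For illustration, the shape of that change might look something like the sketch below; the function and helper names here are placeholders, not the exact ones in execute.cc.

```cpp
// Sketch of a predicate that filters ops out of kernel-cache candidacy.
// Names are illustrative; the real check lives in execute.cc.
#include <string>

// Stand-in for the tf.data check the real code performs.
bool IsDatasetOp(const std::string& op_name) {
  return op_name.find("Dataset") != std::string::npos;
}

bool KernelCacheEnabled(const std::string& op_name) {
  if (IsDatasetOp(op_name)) return false;      // tf.data kernels hold iterator state.
  if (op_name == "VarHandleOp") return false;  // Unique shared_name per load.
  return true;  // Everything else is safe to cache and reuse.
}
```

The advantage of this approach over a size cap is that the uncacheable op never enters the cache in the first place, so the cache size stays proportional to the set of genuinely reusable kernels.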

@gbaned self-assigned this on Aug 14, 2020
@gbaned added the comp:core (issues related to core part of tensorflow) label on Aug 14, 2020
@gbaned added this to Assigned Reviewer in PR Queue via automation on Aug 14, 2020
@frreiss (Contributor, Author) commented on Aug 14, 2020

Thanks for the advice. I'll try blacklisting VarHandleOp and see if that fixes the problem. It will take a day or two to recompile.

@frreiss (Contributor, Author) commented on Aug 14, 2020

Opened a new PR #42377 that excludes VarHandleOp from kernel caching. I'll close the current PR in favor of the new one.

@frreiss closed this on Aug 14, 2020
PR Queue automation moved this from Assigned Reviewer to Closed/Rejected on Aug 14, 2020
Labels: cla: yes, comp:core (issues related to core part of tensorflow), size:M (CL Change Size: Medium)
Projects: PR Queue (Closed/Rejected)

4 participants