
[40171] Limit size of kernel cache #42337

Closed
wants to merge 2 commits

Conversation

@frreiss (Contributor) commented on Aug 13, 2020

The EagerContext class caches instances of kernels that have been used by the current session. Because kernel state is opaque to EagerContext, the caching mechanism is very conservative about reusing kernels. Also, there is no limit on the number of kernels cached. As a result of these two factors, the size of the kernel cache grows in an unbounded fashion if the application calls tf.saved_model.load() repeatedly. This growth is the root cause of one of the memory leaks observed in issue #40171 .
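To illustrate the growth mechanism, here is a simplified, standalone sketch of how a kernel cache key can be derived (illustrative only, not the actual EagerContext code): because kernel state is opaque, every attribute value is folded into the key, so an attribute that is unique to each loaded model maps every load to a brand-new cache entry.

```cpp
// Simplified illustration of a kernel cache key. Not the real EagerContext
// code; it only shows why per-load-unique attributes defeat cache reuse.
#include <cstddef>
#include <functional>
#include <map>
#include <string>

std::size_t KernelCacheKey(const std::string& op_name,
                           const std::string& device,
                           const std::map<std::string, std::string>& attrs) {
  std::hash<std::string> h;
  std::size_t key = h(op_name) ^ (h(device) << 1);
  for (const auto& [name, value] : attrs) {
    // Every attribute participates in the key, so an attribute that changes
    // on every tf.saved_model.load() yields a fresh key and a new kernel.
    key ^= h(name) + 0x9e3779b9 + (key << 6) + (key >> 2);
    key ^= h(value) + 0x9e3779b9 + (key << 6) + (key >> 2);
  }
  return key;
}
```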

This PR puts an upper limit on the size of the kernel cache in EagerContext. If the number of kernels exceeds this limit, the cache discards the least recently used kernel. I have hard-coded the capacity to a conservative 10,000 kernels, which should be large enough to prevent thrashing in normal usage while still providing protection against memory leaks.

I also added a RemoveKernelFromCache() method to go with the AddKernelToCache() method.
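For reference, here is a minimal standalone sketch of the LRU policy described above. The capacity and method names mirror this description, but this is not the actual EagerContext implementation.

```cpp
// Minimal LRU-bounded kernel cache sketch. KernelAndDevice is a stand-in for
// the real cached kernel type.
#include <cstddef>
#include <cstdint>
#include <list>
#include <memory>
#include <unordered_map>
#include <utility>

struct KernelAndDevice {};

class LruKernelCache {
 public:
  explicit LruKernelCache(std::size_t capacity = 10000) : capacity_(capacity) {}

  void AddKernelToCache(uint64_t cache_key,
                        std::shared_ptr<KernelAndDevice> kernel) {
    auto it = entries_.find(cache_key);
    if (it != entries_.end()) lru_.erase(it->second.second);
    lru_.push_front(cache_key);
    entries_[cache_key] = {std::move(kernel), lru_.begin()};
    if (entries_.size() > capacity_) {
      // Evict the least recently used kernel.
      entries_.erase(lru_.back());
      lru_.pop_back();
    }
  }

  std::shared_ptr<KernelAndDevice> GetCachedKernel(uint64_t cache_key) {
    auto it = entries_.find(cache_key);
    if (it == entries_.end()) return nullptr;
    // Move the hit to the front so it is evicted last.
    lru_.splice(lru_.begin(), lru_, it->second.second);
    return it->second.first;
  }

  void RemoveKernelFromCache(uint64_t cache_key) {
    auto it = entries_.find(cache_key);
    if (it == entries_.end()) return;
    lru_.erase(it->second.second);
    entries_.erase(it);
  }

 private:
  std::size_t capacity_;
  std::list<uint64_t> lru_;  // Most recently used key at the front.
  std::unordered_map<uint64_t,
                     std::pair<std::shared_ptr<KernelAndDevice>,
                               std::list<uint64_t>::iterator>>
      entries_;
};
```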

Here is a graph showing memory usage for a Python script that repeatedly loads and unloads a toy Keras model:
[Figure: memory footprint over repeated load/unload iterations, before and after this change]
The blue points show the memory footprint of the script before applying these changes; the orange points show the footprint after applying them.

After the changes in this PR, memory usage of my test script increases until the kernel cache reaches its maximum size. After that, the memory usage increases more slowly, because there are additional memory leaks in saved_model.load() that are not addressed in this PR.

The google-ml-butler bot added the size:M (CL Change Size: Medium) label on Aug 13, 2020
@jaingaurav (Contributor) commented on Aug 13, 2020

Thanks for this, but I do not believe this is the correct fix. The kernel cache constantly growing is a sign of a different bug: kernels that should not be cached are being cached. By limiting the kernel cache, we risk hiding actual bugs behind poor performance.

Looking at your thorough investigation, it seems the problem is that VarHandleOp kernels are being cached because each instance has a unique shared_name. If so, we should simply exclude VarHandleOp from kernel-cache candidacy, as we already do for the tf.data ops. The change can be made here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/eager/execute.cc#L92
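For illustration, the shape of that change might look something like the sketch below; the function and helper names here are placeholders, not the exact ones in execute.cc.

```cpp
// Sketch of a predicate that filters ops out of kernel-cache candidacy.
// Names are illustrative; the real check lives in execute.cc.
#include <string>

// Stand-in for the tf.data check the real code performs.
bool IsDatasetOp(const std::string& op_name) {
  return op_name.find("Dataset") != std::string::npos;
}

bool KernelCacheEnabled(const std::string& op_name) {
  if (IsDatasetOp(op_name)) return false;      // tf.data kernels hold iterator state.
  if (op_name == "VarHandleOp") return false;  // Unique shared_name per load.
  return true;  // Everything else is safe to cache and reuse.
}
```

The advantage of this approach over a size cap is that the uncacheable op never enters the cache in the first place, so the cache size stays proportional to the set of genuinely reusable kernels.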

@gbaned self-assigned this on Aug 14, 2020
@gbaned added the comp:core (issues related to core part of tensorflow) label on Aug 14, 2020
@gbaned added this to Assigned Reviewer in PR Queue via automation on Aug 14, 2020
@frreiss (Contributor, Author) commented on Aug 14, 2020

Thanks for the advice. I'll try blacklisting VarHandleOp and see if that fixes the problem. It will take a day or two to recompile.

@frreiss (Contributor, Author) commented on Aug 14, 2020

Opened a new PR #42377 that excludes VarHandleOp from kernel caching. I'll close the current PR in favor of the new one.

@frreiss closed this on Aug 14, 2020
PR Queue automation moved this from Assigned Reviewer to Closed/Rejected on Aug 14, 2020
Labels: cla: yes, comp:core (issues related to core part of tensorflow), size:M (CL Change Size: Medium)
Projects: PR Queue (Closed/Rejected)

4 participants