
Memory leak in Java API when using GPU #11948

Closed
riklopfer opened this issue Aug 1, 2017 · 13 comments

@riklopfer
Contributor

riklopfer commented Aug 1, 2017

System information

Describe the problem

Main memory on the machine is continuously consumed when running on the GPU. Memory consumption hovers around 600M when running on the CPU.

Source code / logs

see: https://github.com/riklopfer/TensorflowJavaGpuMemoryTest
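A minimal sketch of the kind of harness that shows this behavior is below; the frozen-graph path and the "input"/"output" operation names are illustrative, not taken from the linked repo. The idea is to load a graph once, run it in a tight loop, and watch the process's resident memory from outside (e.g. with top).

import java.nio.file.Files;
import java.nio.file.Paths;

import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.Tensor;

// Minimal repro sketch: load a frozen GraphDef once, then run it repeatedly.
// All native handles are closed via try-with-resources, so any remaining
// host-memory growth should come from inside the native library.
public class LeakRepro {
  public static void main(String[] args) throws Exception {
    byte[] graphDef = Files.readAllBytes(Paths.get(args[0])); // path to a frozen GraphDef
    try (Graph graph = new Graph()) {
      graph.importGraphDef(graphDef);
      try (Session session = new Session(graph)) {
        for (int i = 0; i < 1_000_000; i++) {
          try (Tensor input = Tensor.create(new float[] {1f, 2f, 3f});
               Tensor output = session.runner()
                   .feed("input", input)   // operation names depend on the graph
                   .fetch("output")
                   .run()
                   .get(0)) {
            // Closing both tensors each iteration releases their native buffers,
            // which the JVM garbage collector never accounts for.
          }
        }
      }
    }
  }
}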

@shivaniag shivaniag added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Aug 2, 2017
@shivaniag
Contributor

@asimshankar could you please take a look at this?

@shivaniag shivaniag added the type:bug Bug label Aug 2, 2017
@tensorflow tensorflow deleted a comment from Mazecreator Aug 15, 2017
@riklopfer
Contributor Author

Sample log output, FWIW:

2017-08-29 14:30:27.963729: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-29 14:30:27.963779: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-29 14:30:27.963788: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-29 14:30:27.963795: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-29 14:30:27.963802: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-08-29 14:30:29.569904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
2017-08-29 14:30:29.569957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-08-29 14:30:29.569965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-08-29 14:30:29.569981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)

@riklopfer
Contributor Author

riklopfer commented Aug 30, 2017

I've added valgrind output to the test repository: https://github.com/riklopfer/TensorflowJavaGpuMemoryTest/blob/master/valgrind.out

I'm not really familiar with this tool, but it seems like it would be useful. The summary makes me think there is definitely a leak somewhere:

==28997== LEAK SUMMARY:
==28997==    definitely lost: 257,022 bytes in 1,028 blocks
==28997==    indirectly lost: 6,840 bytes in 15 blocks
==28997==      possibly lost: 61,716,234 bytes in 14,975 blocks
==28997==    still reachable: 397,427,506 bytes in 261,680 blocks
==28997==                       of which reachable via heuristic:
==28997==                         stdstring          : 2,034,837 bytes in 43,856 blocks
==28997==                         newarray           : 22,536 bytes in 1 blocks
==28997==         suppressed: 0 bytes in 0 blocks
==28997== Reachable blocks (those to which a pointer was found) are not shown.
==28997== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==28997== 
==28997== For counts of detected and suppressed errors, rerun with: -v
==28997== ERROR SUMMARY: 1011246 errors from 460 contexts (suppressed: 0 from 0)

@asimshankar
Contributor

@riklopfer : Thanks very much for getting that information across. Unfortunately, not much stood out to me.

I did see 32 bytes leaked during graph construction, which I will fix, but that happens once, not in a loop, so it won't explain the increasing usage over time.

@riklopfer
Contributor Author

@asimshankar thanks for the fixes. Were you able to reproduce the issue of ever-increasing memory consumption? Any idea what the next steps might be?

@riklopfer
Contributor Author

Updating CUDA and Nvidia drivers seems to have greatly mitigated the problem for me. I added updated valgrind output to the test repo.

@asimshankar
Contributor

Thanks for the update, @riklopfer.
Sampling the latest output, I'm not sure whether these are false positives or actual leaks (e.g., many leaks are reported in CreateJavaVM, which IIUC has nothing to do with TensorFlow; it's just JVM initialization).

When you say "greatly mitigated", are you still seeing a monotonic increase in memory usage over time, or does it stabilize?

@asimshankar asimshankar added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Sep 1, 2017
kobejean pushed a commit to kobejean/tensorflow that referenced this issue Sep 3, 2017
Thanks to @riklopfer for reporting in tensorflow#11948

PiperOrigin-RevId: 167032430
@riklopfer
Contributor Author

@asimshankar I no longer see a monotonic increase in memory consumption when running the small test in the linked repo. However, when I run a longer, more complicated graph on the GPU, the process is killed by the OOM killer. I wasn't able to get a valgrind dump for that process. When I have time, I will try increasing the complexity of the test graph until it shows the problem again (or not).
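One lightweight way to narrow down where the growth comes from, without valgrind, is to log the process's resident set next to the JVM heap every N iterations: if VmRSS climbs while the heap stays flat, the growth is on the native side. A Linux-only sketch (the class name is illustrative):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Linux-only helper: VmRSS from /proc/self/status covers native allocations,
// while the Runtime numbers only cover the JVM heap.
public class MemWatch {
  public static void log(int iteration) throws IOException {
    String rss = Files.readAllLines(Paths.get("/proc/self/status")).stream()
        .filter(line -> line.startsWith("VmRSS:"))
        .findFirst()
        .orElse("VmRSS: unknown");
    Runtime rt = Runtime.getRuntime();
    long heapUsedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    System.out.printf("iter=%d %s heapUsed=%dMB%n", iteration, rss.trim(), heapUsedMb);
  }
}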

@aselle aselle removed the stat:awaiting response Status - Awaiting response from author label Sep 6, 2017
@aselle aselle added stat:contribution welcome Status - Contributions welcome stat:awaiting response Status - Awaiting response from author and removed stat:contribution welcome Status - Contributions welcome labels Sep 20, 2017
@tensorflowbutler
Member

It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue? Please update the label and/or status accordingly.

@riklopfer
Contributor Author

riklopfer commented Dec 20, 2017

Running with 1.4.0, I still see a slow, monotonic increase in memory consumption. I haven't had a chance to attempt to minimally reproduce the issue.

@tensorflowbutler
Member

It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue? Please update the label and/or status accordingly.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Jan 23, 2018
@tensorflowbutler
Member

The original poster has replied to this issue after the stat:awaiting response label was applied.

@drpngx
Contributor

drpngx commented Feb 3, 2018

Closing since the original issue has been fixed. Please file another ticket with a repro if you can. Thanks!

@drpngx drpngx closed this as completed Feb 3, 2018