
Memory leak in Java API when using GPU #11948

Closed
riklopfer opened this issue Aug 1, 2017 · 13 comments

@riklopfer
Contributor

riklopfer commented Aug 1, 2017

System information

Describe the problem

Main memory on the machine is continuously consumed when running on the GPU. Memory consumption hovers around 600M when running on the CPU.

Source code / logs

see: https://github.com/riklopfer/TensorflowJavaGpuMemoryTest
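A minimal sketch of the kind of harness that shows this behavior is below; the frozen-graph path and the "input"/"output" operation names are illustrative, not taken from the linked repo. The idea is to load a graph once, run it in a tight loop, and watch the process's resident memory from outside (e.g. with top).

import java.nio.file.Files;
import java.nio.file.Paths;

import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.Tensor;

// Minimal repro sketch: load a frozen GraphDef once, then run it repeatedly.
// All native handles are closed via try-with-resources, so any remaining
// host-memory growth should come from inside the native library.
public class LeakRepro {
  public static void main(String[] args) throws Exception {
    byte[] graphDef = Files.readAllBytes(Paths.get(args[0])); // path to a frozen GraphDef
    try (Graph graph = new Graph()) {
      graph.importGraphDef(graphDef);
      try (Session session = new Session(graph)) {
        for (int i = 0; i < 1_000_000; i++) {
          try (Tensor input = Tensor.create(new float[] {1f, 2f, 3f});
               Tensor output = session.runner()
                   .feed("input", input)   // operation names depend on the graph
                   .fetch("output")
                   .run()
                   .get(0)) {
            // Closing both tensors each iteration releases their native buffers,
            // which the JVM garbage collector never accounts for.
          }
        }
      }
    }
  }
}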

@shivaniag shivaniag added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Aug 2, 2017
@shivaniag
Contributor

@asimshankar could you please take a look at this?

@shivaniag shivaniag added the type:bug Bug label Aug 2, 2017
@tensorflow tensorflow deleted a comment from Mazecreator Aug 15, 2017
@riklopfer
Contributor Author

Sample log output, FWIW:

2017-08-29 14:30:27.963729: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-29 14:30:27.963779: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-29 14:30:27.963788: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-29 14:30:27.963795: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-29 14:30:27.963802: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-08-29 14:30:29.569904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
2017-08-29 14:30:29.569957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-08-29 14:30:29.569965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-08-29 14:30:29.569981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)

@riklopfer
Contributor Author

riklopfer commented Aug 30, 2017

I've added valgrind output to the test repository: https://github.com/riklopfer/TensorflowJavaGpuMemoryTest/blob/master/valgrind.out

I'm not really familiar with this tool, but it seems like it would be useful. The summary makes me think there is definitely a leak somewhere:

==28997== LEAK SUMMARY:
==28997==    definitely lost: 257,022 bytes in 1,028 blocks
==28997==    indirectly lost: 6,840 bytes in 15 blocks
==28997==      possibly lost: 61,716,234 bytes in 14,975 blocks
==28997==    still reachable: 397,427,506 bytes in 261,680 blocks
==28997==                       of which reachable via heuristic:
==28997==                         stdstring          : 2,034,837 bytes in 43,856 blocks
==28997==                         newarray           : 22,536 bytes in 1 blocks
==28997==         suppressed: 0 bytes in 0 blocks
==28997== Reachable blocks (those to which a pointer was found) are not shown.
==28997== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==28997== 
==28997== For counts of detected and suppressed errors, rerun with: -v
==28997== ERROR SUMMARY: 1011246 errors from 460 contexts (suppressed: 0 from 0)

@asimshankar
Contributor

@riklopfer : Thanks very much for getting that information across. Unfortunately, not much stood out to me.

I did see 32 bytes leaked during graph construction, which I will fix, but that happens once, not in a loop, so it won't explain the increasing usage over time.

@riklopfer
Contributor Author

@asimshankar thanks for the fixes. Were you able to reproduce the issue of ever-increasing memory consumption? Any idea what the next steps might be?

@riklopfer
Contributor Author

Updating CUDA and Nvidia drivers seems to have greatly mitigated the problem for me. I added updated valgrind output to the test repo.

@asimshankar
Contributor

Thanks for the update, @riklopfer.
Sampling the latest output, I'm not sure whether these are false positives or actual leaks (e.g., many leaks are reported in CreateJavaVM, which IIUC has nothing to do with TensorFlow; it's just JVM initialization).

When you say "greatly mitigated", are you still seeing a monotonic increase in memory usage over time, or does it stabilize?

@asimshankar asimshankar added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Sep 1, 2017
kobejean pushed a commit to kobejean/tensorflow that referenced this issue Sep 3, 2017
Thanks to @riklopfer for reporting in tensorflow#11948

PiperOrigin-RevId: 167032430
@riklopfer
Contributor Author

@asimshankar I no longer see a monotonic increase in memory consumption when running the small test in the linked repo. However, when I run a longer, more complicated graph on the GPU, the process is killed by the OOM killer. I wasn't able to get a valgrind dump for that process. When I have time, I will try increasing the complexity of the test graph until it shows the problem again (or not).
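One lightweight way to narrow down where the growth comes from, without valgrind, is to log the process's resident set next to the JVM heap every N iterations: if VmRSS climbs while the heap stays flat, the growth is on the native side. A Linux-only sketch (the class name is illustrative):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Linux-only helper: VmRSS from /proc/self/status covers native allocations,
// while the Runtime numbers only cover the JVM heap.
public class MemWatch {
  public static void log(int iteration) throws IOException {
    String rss = Files.readAllLines(Paths.get("/proc/self/status")).stream()
        .filter(line -> line.startsWith("VmRSS:"))
        .findFirst()
        .orElse("VmRSS: unknown");
    Runtime rt = Runtime.getRuntime();
    long heapUsedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    System.out.printf("iter=%d %s heapUsed=%dMB%n", iteration, rss.trim(), heapUsedMb);
  }
}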

@aselle aselle removed the stat:awaiting response Status - Awaiting response from author label Sep 6, 2017
@aselle aselle added stat:contribution welcome Status - Contributions welcome stat:awaiting response Status - Awaiting response from author and removed stat:contribution welcome Status - Contributions welcome labels Sep 20, 2017
@tensorflowbutler
Member

It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue? Please update the label and/or status accordingly.

@riklopfer
Contributor Author

riklopfer commented Dec 20, 2017

Running with 1.4.0, I still see a slow, monotonic increase in memory consumption. I haven't had a chance to attempt to minimally reproduce the issue.

@tensorflowbutler
Member

It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue? Please update the label and/or status accordingly.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Jan 23, 2018
@tensorflowbutler
Member

The original poster has replied to this issue after the stat:awaiting response label was applied.

@drpngx
Contributor

drpngx commented Feb 3, 2018

Closing since the original issue has been fixed. Please file another ticket with a repro if you can. Thanks!

@drpngx drpngx closed this as completed Feb 3, 2018