
GPU memory not released until Java process terminates #36627

Closed

TheSentry opened this issue Feb 10, 2020 · 7 comments
Assignees
sanjoy
Labels
comp:apis Highlevel API related issues · comp:gpu GPU related issues · TF 1.15 for issues seen on TF 1.15 · type:feature Feature requests

Comments

@TheSentry


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    No

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Ubuntu 18.04.4 LTS, Kernel 4.15.0-76-generic

  • TensorFlow installed from (source or binary):
    Binary

  • TensorFlow version (use command below):
    1.15.0

  • Python version:
    Python 3.6.9

  • CUDA/cuDNN version:
    10.0.130

  • GPU model and memory:
    GeForce GTX 1080 Ti, 11177MiB

You can collect some of this information using our environment capture script.
You can also obtain the TensorFlow version with:

  • TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

  • TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior
After closing all Tensors, Graphs and Sessions in our Java program, the Java process still holds the previously used GPU memory until the process terminates.

Describe the expected behavior
After closing all Tensors, Graphs and Sessions, the Java process should release all allocated GPU memory.

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

package de.tensorflowtest;

import static org.apache.commons.io.IOUtils.toByteArray;
import java.io.Console;
import java.io.IOException;
import java.io.InputStream;
import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.framework.ConfigProto;
import de.tensorflowtest.Constants;

public class GpuLeakDebug {
    public static void main(String[] args) throws IOException, InterruptedException {

        waitWithMessage("Create Session");

        Session session = null;
        Graph graph = null;
        ConfigProto sessionConfig;
        try {
            byte[] graphdef = null;
            try (InputStream graphStream = Constants.class.getResourceAsStream("/tensorflow/inception_v3.pb")) {
                graphdef = toByteArray(graphStream);
            } catch (IOException e) {
                System.exit(1);
            }
            graph = new Graph();
            graph.importGraphDef(graphdef);
            sessionConfig = ConfigProto.newBuilder().build();
            session = new Session(graph, sessionConfig.toByteArray());

        } catch (UnsatisfiedLinkError e) {

            throw e;
        } finally {
            waitWithMessage("Close Session");
            // Guard against NPE if graph import or session creation failed.
            if (session != null)
                session.close();
            if (graph != null)
                graph.close();
        }
        session = null;
        graph = null;
        sessionConfig = null;

        waitWithMessage("Terminate");
    }

    private static void waitWithMessage(String message, Object... args) {
        Console c = System.console();
        if (c != null) {
            // printf-like arguments
            if (message != null)
                c.format(message, args);
            c.format(" Press ENTER to proceed.\n");
            c.readLine();
        }
    }
}

Other info / logs
This is the output of nvidia-smi before the session is created ("Create Session") and after the JVM terminates ("Terminate").

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 25%   57C    P5    30W / 250W |      0MiB / 11177MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This is the output of nvidia-smi after the session has been created and after the session has been closed.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 19%   53C    P5    24W / 250W |    145MiB / 11177MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     20221      C   /usr/bin/java                                135MiB |
+-----------------------------------------------------------------------------+
@gadagashwini-zz gadagashwini-zz added TF 1.15 for issues seen on TF 1.15 comp:apis Highlevel API related issues type:bug Bug labels Feb 11, 2020
@gowthamkpr

@TheSentry By default TensorFlow allocates GPU memory for the lifetime of the process, not the lifetime of the session object. More details at: https://www.tensorflow.org/programmers_guide/using_gpu#allowing_gpu_memory_growth

Thus, if you want memory to be freed, you'll have to exit the Java interpreter, not just close the session.
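
For reference, the growth setting that page describes can be set from Java through the same ConfigProto that is already passed to the Session constructor. A minimal sketch, assuming the org.tensorflow.framework proto classes that ship with the TF 1.15 Java bindings; note this only changes how memory is claimed, it is still not released before the process exits:

import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.framework.ConfigProto;
import org.tensorflow.framework.GPUOptions;

public class GpuConfigSketch {
    // Builds a Session whose allocator claims GPU memory lazily (allow_growth)
    // and never takes more than ~40% of the card (per_process_gpu_memory_fraction).
    static Session createSession(Graph graph) {
        GPUOptions gpuOptions = GPUOptions.newBuilder()
                .setAllowGrowth(true)
                .setPerProcessGpuMemoryFraction(0.4)
                .build();
        ConfigProto config = ConfigProto.newBuilder()
                .setGpuOptions(gpuOptions)
                .build();
        return new Session(graph, config.toByteArray());
    }
}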

For more info, you can refer to the following issue.

@gowthamkpr gowthamkpr added the stat:awaiting response Status - Awaiting response from author label Feb 11, 2020
@TheSentry
Author

TheSentry commented Feb 12, 2020

@gowthamkpr Thank you for your response.

@TheSentry By default TensorFlow allocates GPU memory for the lifetime of the process, not the lifetime of the session object. More details at: https://www.tensorflow.org/programmers_guide/using_gpu#allowing_gpu_memory_growth

Unfortunately, the text in this link doesn't state this as clearly as you did just now. I've interpreted "Note we do not release memory, since it can lead to memory fragmentation." as session-bound, not process-bound.

Thus, if you want memory to be freed, you'll have to exit the Java interpreter, not just close the session.

This is unfortunate as our process is a Tomcat webserver, which is long-running.

For more info, you can refer to the following issue.

Thank you for the link to that issue. I had hoped that this was an old issue and that the behavior had changed by now, but that other issue also points to the conclusion that there is still no way to release GPU memory except terminating the process.

Is there a way to turn this into a feature request, if it doesn't already exist?

@gowthamkpr

gowthamkpr commented Feb 12, 2020

Yes, I can leave it open. As mentioned here, PyTorch has a way to clear it, and this is being requested by many users.

@sanjoy Can you PTAL?

@gowthamkpr gowthamkpr assigned sanjoy and unassigned gowthamkpr Feb 12, 2020
@gowthamkpr gowthamkpr added comp:gpu GPU related issues type:feature Feature requests stat:awaiting tensorflower Status - Awaiting response from tensorflower and removed type:bug Bug stat:awaiting response Status - Awaiting response from author labels Feb 12, 2020
@sanjoy
Contributor

sanjoy commented Feb 13, 2020

Is there a way to turn this into a feature request, if it doesn't already exist?

This GH issue can serve as the feature request. We don't have anyone working on this in Q1, though.

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Feb 21, 2020
@akshayrana30


@TheSentry How about creating a separate sub-process to run TensorFlow instead of using the same process where Tomcat is running? The TF-held GPU memory will be released back to CUDA as soon as the sub-process exits.
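
A rough sketch of that approach, assuming a hypothetical de.tensorflowtest.InferenceWorker main class that loads the model, handles the request, and then exits; the CUDA context dies with the child JVM, so the driver reclaims the memory while the parent JVM keeps running:

import java.io.IOException;

public class SubprocessInferenceSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Launch a child JVM that owns the TensorFlow session and the CUDA context.
        Process worker = new ProcessBuilder(
                "java", "-cp", System.getProperty("java.class.path"),
                "de.tensorflowtest.InferenceWorker",   // hypothetical worker class
                "/tensorflow/inception_v3.pb")
                .inheritIO()
                .start();

        // Once the child exits, its GPU memory is returned to the driver;
        // the long-running parent process (e.g. Tomcat) is unaffected.
        int exitCode = worker.waitFor();
        System.out.println("Worker exited with code " + exitCode);
    }
}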

@TheSentry
Author

@TheSentry How about creating a separate sub-process to run TensorFlow instead of using the same process where Tomcat is running? The TF-held GPU memory will be released back to CUDA as soon as the sub-process exits.

@akshayrana30 This is not feasible for us. It would require too much work to extract the TF part of our code and establish the proper inter-process communication, not to mention the impact on processing speed, which is very important in our system.

Right now we have mitigated this problem by having no other process on this machine that needs CUDA.

@tensorflowbutler
Member

Hi there,

We are checking to see if you still need help on this, as you are using an older version of TensorFlow which is officially considered end of life. We recommend that you upgrade to the latest 2.x version and let us know if the issue still persists in newer versions. Please open a new issue for any help you need against 2.x, and we will get you the right help.

This issue will be closed automatically 7 days from now. If you still need help with this issue, please provide us with more information.
