
GPU memory not released until Java process terminates #36627

Closed

TheSentry opened this issue Feb 10, 2020 · 7 comments
Assignees
sanjoy
Labels
comp:apis Highlevel API related issues · comp:gpu GPU related issues · TF 1.15 for issues seen on TF 1.15 · type:feature Feature requests

Comments

@TheSentry


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    No

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Ubuntu 18.04.4 LTS, Kernel 4.15.0-76-generic

  • TensorFlow installed from (source or binary):
    Binary

  • TensorFlow version (use command below):
    1.15.0

  • Python version:
    Python 3.6.9

  • CUDA/cuDNN version:
    10.0.130

  • GPU model and memory:
    GeForce GTX 1080 Ti, 11177MiB

You can collect some of this information using our environment capture script.
You can also obtain the TensorFlow version with:

  • TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

  • TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior
After closing all Tensors, Graphs and Sessions in our Java program, the Java process still holds the previously used GPU memory until the process terminates.

Describe the expected behavior
After closing all Tensors, Graphs and Sessions, the Java process should release all allocated GPU memory.

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

package de.tensorflowtest;

import static org.apache.commons.io.IOUtils.toByteArray;
import java.io.Console;
import java.io.IOException;
import java.io.InputStream;
import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.framework.ConfigProto;
import de.tensorflowtest.Constants;

public class GpuLeakDebug {
    public static void main(String[] args) throws IOException, InterruptedException {

        waitWithMessage("Create Session");

        Session session = null;
        Graph graph = null;
        ConfigProto sessionConfig;
        try {
            byte[] graphdef = null;
            try (InputStream graphStream = Constants.class.getResourceAsStream("/tensorflow/inception_v3.pb")) {
                graphdef = toByteArray(graphStream);
            } catch (IOException e) {
                System.exit(1);
            }
            graph = new Graph();
            graph.importGraphDef(graphdef);
            sessionConfig = ConfigProto.newBuilder().build();
            session = new Session(graph, sessionConfig.toByteArray());

        } catch (UnsatisfiedLinkError e) {

            throw e;
        } finally {
            waitWithMessage("Close Session");
            // Guard against NPE if graph import or session creation failed.
            if (session != null)
                session.close();
            if (graph != null)
                graph.close();
        }
        session = null;
        graph = null;
        sessionConfig = null;

        waitWithMessage("Terminate");
    }

    private static void waitWithMessage(String message, Object... args) {
        Console c = System.console();
        if (c != null) {
            // printf-like arguments
            if (message != null)
                c.format(message, args);
            c.format(" Press ENTER to proceed.\n");
            c.readLine();
        }
    }
}

Other info / logs
This is the output of nvidia-smi before the session is created ("Create Session") and after the JVM terminates ("Terminate").

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 25%   57C    P5    30W / 250W |      0MiB / 11177MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This is the output of nvidia-smi after the session has been created and after the session has been closed.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 19%   53C    P5    24W / 250W |    145MiB / 11177MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     20221      C   /usr/bin/java                                135MiB |
+-----------------------------------------------------------------------------+
@gadagashwini-zz gadagashwini-zz added TF 1.15 for issues seen on TF 1.15 comp:apis Highlevel API related issues type:bug Bug labels Feb 11, 2020
@gowthamkpr

@TheSentry By default TensorFlow allocates GPU memory for the lifetime of the process, not the lifetime of the session object. More details at: https://www.tensorflow.org/programmers_guide/using_gpu#allowing_gpu_memory_growth

Thus, if you want memory to be freed, you'll have to exit the Java interpreter, not just close the session.
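
For reference, the growth setting that page describes can be set from Java through the same ConfigProto that is already passed to the Session constructor. A minimal sketch, assuming the org.tensorflow.framework proto classes that ship with the TF 1.15 Java bindings; note this only changes how memory is claimed, it is still not released before the process exits:

import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.framework.ConfigProto;
import org.tensorflow.framework.GPUOptions;

public class GpuConfigSketch {
    // Builds a Session whose allocator claims GPU memory lazily (allow_growth)
    // and never takes more than ~40% of the card (per_process_gpu_memory_fraction).
    static Session createSession(Graph graph) {
        GPUOptions gpuOptions = GPUOptions.newBuilder()
                .setAllowGrowth(true)
                .setPerProcessGpuMemoryFraction(0.4)
                .build();
        ConfigProto config = ConfigProto.newBuilder()
                .setGpuOptions(gpuOptions)
                .build();
        return new Session(graph, config.toByteArray());
    }
}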

For more info, you can refer to the following issue.

@gowthamkpr gowthamkpr added the stat:awaiting response Status - Awaiting response from author label Feb 11, 2020
@TheSentry
Author

TheSentry commented Feb 12, 2020

@gowthamkpr Thank you for your response.

@TheSentry By default TensorFlow allocates GPU memory for the lifetime of the process, not the lifetime of the session object. More details at: https://www.tensorflow.org/programmers_guide/using_gpu#allowing_gpu_memory_growth

Unfortunately, the text in this link doesn't state this as clearly as you did just now. I've interpreted "Note we do not release memory, since it can lead to memory fragmentation." as session-bound, not process-bound.

Thus, if you want memory to be freed, you'll have to exit the Java interpreter, not just close the session.

This is unfortunate as our process is a Tomcat webserver, which is long-running.

For more info, you can refer to the following issue.

Thank you for the link to that issue. I had hoped that this was an old issue and that the behavior had changed by now, but that other issue also points to the conclusion that there is still no way to release GPU memory except terminating the process.

Is there a way to turn this into a feature request, if it doesn't already exist?

@gowthamkpr

gowthamkpr commented Feb 12, 2020

Yes, I can leave it open. As mentioned here, PyTorch has a way to clear it, and this is being requested by many users.

@sanjoy Can you PTAL?

@gowthamkpr gowthamkpr assigned sanjoy and unassigned gowthamkpr Feb 12, 2020
@gowthamkpr gowthamkpr added comp:gpu GPU related issues type:feature Feature requests stat:awaiting tensorflower Status - Awaiting response from tensorflower and removed type:bug Bug stat:awaiting response Status - Awaiting response from author labels Feb 12, 2020
@sanjoy
Contributor

sanjoy commented Feb 13, 2020

Is there a way to turn this into a feature request, if it doesn't already exist?

This GH issue can serve as the feature request. We don't have anyone working on this in Q1, though.

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Feb 21, 2020
@akshayrana30


@TheSentry How about creating a separate sub-process to run TensorFlow instead of using the same process where Tomcat is running? The TF-held GPU memory will be released back to CUDA as soon as the sub-process exits.
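
A rough sketch of that approach, assuming a hypothetical de.tensorflowtest.InferenceWorker main class that loads the model, handles the request, and then exits; the CUDA context dies with the child JVM, so the driver reclaims the memory while the parent JVM keeps running:

import java.io.IOException;

public class SubprocessInferenceSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Launch a child JVM that owns the TensorFlow session and the CUDA context.
        Process worker = new ProcessBuilder(
                "java", "-cp", System.getProperty("java.class.path"),
                "de.tensorflowtest.InferenceWorker",   // hypothetical worker class
                "/tensorflow/inception_v3.pb")
                .inheritIO()
                .start();

        // Once the child exits, its GPU memory is returned to the driver;
        // the long-running parent process (e.g. Tomcat) is unaffected.
        int exitCode = worker.waitFor();
        System.out.println("Worker exited with code " + exitCode);
    }
}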

@TheSentry
Author

@TheSentry How about creating a separate sub-process to run TensorFlow instead of using the same process where Tomcat is running? The TF-held GPU memory will be released back to CUDA as soon as the sub-process exits.

@akshayrana30 This is not feasible for us. It would require too much work to extract the TF part of our code and establish the proper inter-process communication, not to mention the impact on processing speed, which is very important in our system.

Right now we have mitigated this problem by having no other process on this machine that needs CUDA.

@tensorflowbutler
Member

Hi there,

We are checking to see if you still need help on this, as you are using an older version of TensorFlow which is officially considered end of life. We recommend that you upgrade to the latest 2.x version and let us know if the issue still persists in newer versions. Please open a new issue for any help you need against 2.x, and we will get you the right help.

This issue will be closed automatically 7 days from now. If you still need help with this issue, please provide us with more information.
