Memory leak: tf1 trained saved_model in tf2 for prediction #10759

Open

purvang3 opened this issue Sep 1, 2022 · 6 comments
@purvang3

purvang3 commented Sep 1, 2022

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • I checked to make sure that this issue has not been filed already.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/official/...

2. Describe the bug

A clear and concise description of what the bug is.

I previously trained an ssd_inception_v2 model in TensorFlow 1.14. The export contains a frozen inference graph and a saved_model directory with protobuf files and variables. I am now running TensorFlow 2.6.0. Loading the TF 1.14 saved_model into TF 2.6 works without problems and inference runs smoothly, but over time CPU memory keeps increasing and the prediction script eventually crashes because memory is exhausted. I have also tried loading the frozen graph (.pb) instead of saved_model.pb and the problem still exists. Any help would be appreciated. With "htop", the MEM% column keeps increasing while the reproduction script in section 3 below is running.
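
For reference, a minimal sketch of the frozen-graph path mentioned above; the file name frozen_inference_graph.pb is an assumption based on the standard object detection export and should be adjusted to the actual export:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Read the serialized GraphDef from the frozen graph file.
graph_def = tf.GraphDef()
with tf.io.gfile.GFile("frozen_inference_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

# Import it into a fresh graph; tensors can then be looked up by name,
# e.g. 'image_tensor:0' and 'detection_boxes:0', as in the script below.
with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")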

3. Steps to reproduce

Steps to reproduce the behavior.
Any TensorFlow 1 trained model that has a saved_model directory after training.
Sample: wget http://download.tensorflow.org/models/object_detection/ssd_inception_v2_coco_2018_01_28.tar.gz

Use the saved_model directory.

import numpy as np

import tensorflow.compat.v1 as tf

# Run the TF1-style graph/session API inside TF2.
tf.disable_v2_behavior()
tf.config.set_soft_device_placement(True)

# Cache for the output-tensor lookup so it is only done once.
output_tensor_dict_global = None

model_path = "saved_model_path"

gpus = tf.config.experimental.list_physical_devices('GPU')


def get_output_tensor_dict():
    """Look up the detection output tensors in the default graph (cached)."""
    global output_tensor_dict_global
    if output_tensor_dict_global:
        return output_tensor_dict_global

    ops = tf.get_default_graph().get_operations()
    all_tensor_names = {output.name for op in ops for output in op.outputs}
    tensor_dict = {}
    for key in ['num_detections', 'detection_boxes', 'detection_scores',
                'detection_classes', 'detection_masks']:
        tensor_name = key + ':0'
        if tensor_name in all_tensor_names:
            tensor_dict[key] = tf.get_default_graph().get_tensor_by_name(tensor_name)
    output_tensor_dict_global = tensor_dict
    return output_tensor_dict_global


# Allocate GPU memory on demand instead of reserving it all up front.
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)

with tf.Graph().as_default() as g:
    with tf.Session() as sess:
        # Load the TF 1.x-exported SavedModel into this session's graph.
        detection_graph = tf.saved_model.load(sess, ["serve"], model_path)

        while True:
            # Random dummy input frame (720x1280x3).
            img_np = np.random.randn(720, 1280, 3)
            tensor_dict = get_output_tensor_dict()
            image_tensor = tf.get_default_graph().get_tensor_by_name('image_tensor:0')
            output_dict = sess.run(
                tensor_dict,
                feed_dict={image_tensor: np.expand_dims(img_np, axis=0)})
            print(output_dict.keys())

I have tested the same code with TensorFlow 2.9.0 and the problem still exists.
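
To quantify the growth, one could log the process's resident set size once per loop iteration; a minimal sketch using psutil (an assumed extra dependency, not part of the original script):

import os

import psutil

_process = psutil.Process(os.getpid())

def log_rss(step):
    # Resident set size (host memory) of this process, in MiB.
    rss_mib = _process.memory_info().rss / (1024 * 1024)
    print(f"step {step}: RSS = {rss_mib:.1f} MiB")

Calling log_rss(step) at the end of each iteration of the while loop above should show whether host memory climbs with every sess.run call.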

4. Expected behavior

A clear and concise description of what you expected to happen.
Memory consumption should be constant.

5. Additional context

Include any logs that would be helpful to diagnose the problem.

6. System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • Mobile device name if the issue happens on a mobile device:
  • TensorFlow installed from (source or binary): TensorFlow Docker image 2.6.0
  • TensorFlow version (use command below): 2.6.0
  • Python version: 3.6.9
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory: RTX 3070, 8GB
@purvang3 purvang3 added models:official models that come under official repository type:bug Bug in the code labels Sep 1, 2022
@saberkun saberkun added models:research:odapi ODAPI and removed models:official models that come under official repository labels Sep 4, 2022
@sushreebarsa sushreebarsa self-assigned this Sep 7, 2022
@sushreebarsa sushreebarsa assigned tombstone, jch1 and pkulzc and unassigned sushreebarsa Sep 7, 2022
@muxamilian

I encountered the same memory leak and tried the same steps as you: it leaks with the saved model and with the frozen inference graph, and it doesn't matter whether it runs in eager mode or graph mode; the memory leak is always there.
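
For reference, a minimal sketch of the eager-mode (TF2-native) variant referred to above; the signature key 'serving_default' and the input name 'inputs' are assumptions about this particular export and should be checked against the loaded model:

import numpy as np
import tensorflow as tf  # TF2, eager mode

model = tf.saved_model.load("saved_model_path")   # same directory as above
infer = model.signatures["serving_default"]       # signature key assumed
print(infer.structured_input_signature)           # shows the real input name(s)

img = np.random.randint(0, 255, size=(1, 720, 1280, 3), dtype=np.uint8)
outputs = infer(inputs=tf.constant(img))          # input name assumed
print(list(outputs.keys()))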

@muxamilian

muxamilian commented Sep 7, 2022

When disabling the GPU, the memory leak disappears.

try:
    # Disable all GPUS
    tf.config.set_visible_devices([], 'GPU')
    visible_devices = tf.config.get_visible_devices()
    for device in visible_devices:
        assert device.device_type != 'GPU'
except:
    # Invalid device or cannot modify virtual devices once initialized.
    pass

The CUDA version is 11.2, cuDNN is 8100, and TensorFlow is 2.7.1, but the leak also occurs with the newest TensorFlow.

@sushreebarsa
Contributor

sushreebarsa commented Sep 7, 2022

@purvang3 Could you refer to the comment above and let us know if it helps? I was able to reproduce the issue on Colab using TF v2.9. Please find the gist here for reference.
Thank you!

@sushreebarsa sushreebarsa added the stat:awaiting response Waiting on input from the contributor label Sep 7, 2022
@muxamilian

Well, it "helps", but then the model doesn't run on the GPU anymore, so it's certainly not a fix or workaround.

@sushreebarsa sushreebarsa removed the stat:awaiting response Waiting on input from the contributor label Sep 7, 2022
@seel-channel

"Use the tf.config.experimental.set_memory_growth function to allow memory to be allocated as needed instead of allocating all GPU memory at the start."

    # Avoid VRAM leak: allocate GPU memory on demand instead of all at once.
    physical_devices = tf.config.list_physical_devices('GPU')
    for device in physical_devices:
        tf.config.experimental.set_memory_growth(device, True)

    # (Snippet is taken from a class method, hence self._model.)
    self._model = tf.compat.v2.saved_model.load(model_path)
[Before / after screenshots of memory usage]

@muxamilian

This is a memory leak of CPU memory, not GPU memory. It also occurs when disabling the GPU altogether.
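
For anyone reproducing the "GPU disabled altogether" case, an alternative to the tf.config.set_visible_devices snippet above is to hide the GPUs via an environment variable; a minimal sketch, with the usual caveat that the variable must be set before TensorFlow initializes the GPU (ideally before the import):

import os

# Hide all GPUs from TensorFlow / CUDA for this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))  # expected: []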
