
Allocating all memory, CUDA OOM #64284

Closed · dennisushi opened this issue Mar 22, 2024 · 4 comments
Labels: comp:gpu (GPU related issues), stale (to be closed automatically if no activity), stat:awaiting response (awaiting response from author), TF 1.13, type:bug

Comments

@dennisushi

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

1.13, 1.10

Custom code

Yes

OS platform and distribution

Linux, Ubuntu 20

Mobile device

No response

Python version

3.8

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

11.7

GPU model and memory

V100, 34GB

Current behavior?

TF tries to allocate all of the GPU's memory, even though no function that should place any data on the GPU has been called.

Standalone code to reproduce the issue

import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
import tensorflow as tf
from keras import backend as K

tf.config.experimental_run_functions_eagerly(not True)
message = "No GPU found. To actually train on CPU remove this assert."
assert tf.config.experimental.list_physical_devices("GPU"), message

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    print("Found GPUs: ", gpus)
    # Restrict TensorFlow to only use the first GPU
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            print("Setting memory growth for ", gpu)
            tf.config.experimental.set_memory_growth(gpu, True)
        print("Setting visible devices to ", gpus[0])
        tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)

Relevant log output

WARNING:tensorflow:From mvmwm/_tf_error_test.py:6: experimental_run_functions_eagerly (from tensorflow.python.eager.polymorphic_function.quarantine) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.run_functions_eagerly` instead of the experimental version.
Found GPUs:  [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Setting memory growth for  PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
Setting visible devices to  PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
2024-03-22 14:06:37.455972: F tensorflow/tsl/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 34087305216
Aborted (core dumped)
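
Two mitigations commonly suggested for this kind of allocation failure, sketched below assuming a TF 2.x runtime and not verified against this exact setup, are forcing on-demand growth through the environment, or capping the allocation with a virtual-device memory limit (the two are mutually exclusive on the same GPU):

import os

# Mitigation A: force on-demand growth. This must be set before TensorFlow
# initializes the CUDA context (i.e. before the first GPU-touching call),
# otherwise it has no effect.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

import tensorflow as tf

# Mitigation B, to use *instead of* A: cap the allocation with a virtual
# device. The 4096 MB limit is an arbitrary example value.
# gpus = tf.config.list_physical_devices("GPU")
# tf.config.set_logical_device_configuration(
#     gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])

print(tf.config.list_physical_devices("GPU"))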
@google-ml-butler google-ml-butler bot added the type:bug Bug label Mar 22, 2024
@sushreebarsa sushreebarsa added comp:gpu GPU related issues TF 1.13 Issues related to TF 1.13 labels Mar 27, 2024
@sushreebarsa (Contributor)

@dennisushi I wasn't able to replicate the issue on Colab using TF v2.15; please find the gist here.
Kindly use the latest TF version, as TF v1.x is no longer actively supported. Thank you!
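
For reference, the TF 2.x pattern being suggested looks roughly like the following minimal sketch (the device index and the tensor shape are placeholder values, not taken from the gist):

import tensorflow as tf  # assumes a TF 2.x install

gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    # Memory growth still lives under tf.config.experimental in TF 2.x.
    tf.config.experimental.set_memory_growth(gpu, True)

if gpus:
    # Restrict the process to the first GPU, as in the original repro.
    tf.config.set_visible_devices(gpus[0], "GPU")

# A small allocation should now grow GPU memory on demand rather than
# reserving the whole card up front.
x = tf.random.normal((1024, 1024))
print(float(tf.reduce_sum(x)))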

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Mar 27, 2024
github-actions bot commented Apr 4, 2024

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Apr 4, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

