Memory leak in Conv2D/Activation on GPU #46475

@jan-x-marek

Description

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): Binary, the standard docker distribution
  • TensorFlow version (use command below): v2.4.0-rc4-71-g582c8d236cb 2.4.0
  • Python version: 3.6.9
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 11.0
  • GPU model and memory: GeForce RTX 2070, 8GB

Describe the current behavior
I upgraded from TF 2.1.2 to TF 2.4.0, and training a very simple convolutional network, which had worked fine in 2.1.2, started running out of memory during training. I distilled a simple reproducible example that demonstrates the issue. Each training epoch consumes about 50 MB of additional memory and, given enough epochs, usage grows without bound (to 32 GB in my case). It only occurs on GPU; the same script runs fine on CPU.

Describe the expected behavior
Memory usage should not grow, or should grow only very little.

Standalone code to reproduce the issue

import gc
import os
import psutil
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Conv2D, Flatten, BatchNormalization, Activation

physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)


input_tensor = tf.keras.layers.Input(shape=(512,64,1))

x = Conv2D(filters=32, kernel_size=(5,5), strides=(2,2), padding='same')(input_tensor)
# Commented out on purpose - see Note 1 below
# x = BatchNormalization()(x)
x = Activation('relu')(x)

x = Conv2D(filters=64, kernel_size=(4,4), strides=(2,2), padding='same')(x)
# Commented out on purpose - see Note 1 below
# x = BatchNormalization()(x)
x = Activation('relu')(x)

x = Conv2D(filters=128, kernel_size=(4,4), strides=(2,1), padding='same')(x)
# Commented out on purpose - see Note 1 below
# x = BatchNormalization()(x)
x = Activation('relu')(x)

x = Conv2D(filters=128, kernel_size=(4,4), strides=(2,1), padding='same')(x)
# Commented out on purpose - see Note 1 below
# x = BatchNormalization()(x)
x = Activation('relu')(x)

x = Flatten()(x)

x = Dense(5, activation='sigmoid')(x)

model = tf.keras.Model(inputs=input_tensor, outputs=x)


train_x = np.random.random((2048, 512, 64, 1))
train_y = np.random.random((2048, 5))

model.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam())

process = psutil.Process(os.getpid())

for i in range(50):
    model.fit(train_x, train_y, epochs=1, batch_size=32, verbose=0)
    gc.collect()
    print(i, process.memory_info().rss // 1000000)

Note 1
Now, if you uncomment the BatchNormalization() layer creation, the memory problem disappears. So it is somehow caused by the Activation layer immediately following the Conv2D.
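For reference, this is what the non-leaking variant of a block looks like (a minimal sketch of the first block only; the full model above repeats the pattern for each Conv2D):

```python
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, Flatten, Dense

# Same first block as in the repro script, but with BatchNormalization
# inserted between Conv2D and Activation: with this in place the
# memory growth does not occur.
input_tensor = tf.keras.layers.Input(shape=(512, 64, 1))
x = Conv2D(filters=32, kernel_size=(5, 5), strides=(2, 2), padding='same')(input_tensor)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Flatten()(x)
x = Dense(5, activation='sigmoid')(x)
model = tf.keras.Model(inputs=input_tensor, outputs=x)
```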

Note 2
The memory problem also occurs if I train multiple epochs in a single fit() call, such as

model.fit(train_x, train_y, epochs=50, batch_size=32)

I used the for loop only to be able to call garbage collection and print the memory.
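The per-epoch measurement can also be factored into a small helper (`rss_mb` is a hypothetical name, not part of the script above), so it can be dropped into any loop:

```python
import os
import psutil

def rss_mb():
    # Resident set size of the current process in MB, matching the
    # `rss // 1000000` expression used in the training loop above.
    return psutil.Process(os.getpid()).memory_info().rss // 1_000_000
```

Inside the loop this would be called as `print(i, rss_mb())`.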

Note 3
A Conv2D layer with activation embedded in it, such as

Conv2D(filters=128, kernel_size=(4,4), strides=(2,1), padding='same', activation='relu')

also causes the memory issue.

Labels

  • TF 2.9: Issues found in the TF 2.9 release (or RCs)
  • comp:gpu: GPU related issues
  • stat:awaiting tensorflower: Status - Awaiting response from tensorflower
  • type:bug: Bug
