Potential memory leak when using LSTM + TimeDistributed #33178

@algrmur

Description

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution: Windows 10
  • TensorFlow installed from: binary
  • TensorFlow version: v2.0.0-rc2-26-g64c3d382ca 2.0.0
  • Python version: 3.6.9
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: 10.0 (CUDA) / 7.5 (cuDNN)
  • GPU model and memory: TITAN RTX (24GB)
  • Exact command to reproduce: N/A

Describe the problem

Potential memory leak when using LSTM + TimeDistributed

I have a standard time series model that consists of 3 convolutional layers feeding into 2 LSTM layers. Up until now, I have had no problems mapping a Dense layer to the last output of the top LSTM and making a prediction etc. However, I now want a model where I put a TimeDistributed(Dense(..)) layer on top of the top LSTM and feed back the error signal at each time point. I have implemented this, but after training only a few epochs I get a resource exhausted error.
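
For reference, here is a rough sketch of the model; the Conv1D layers, the layer sizes, and the two-class softmax output are placeholders for my actual configuration, not the exact layers I use:

import tensorflow as tf
from tensorflow.keras import layers

def build_model(time_steps, n_features, n_classes=2):
    # Illustrative only: three convolutional layers feeding two stacked LSTMs.
    inputs = tf.keras.Input(shape=(time_steps, n_features))
    x = layers.Conv1D(64, 3, padding='same', activation='relu')(inputs)
    x = layers.Conv1D(128, 3, padding='same', activation='relu')(x)
    x = layers.Conv1D(128, 3, padding='same', activation='relu')(x)
    x = layers.LSTM(128, return_sequences=True)(x)
    # return_sequences=True on the top LSTM so that TimeDistributed(Dense)
    # can produce a prediction (and an error signal) at every time step.
    x = layers.LSTM(128, return_sequences=True)(x)
    outputs = layers.TimeDistributed(
        layers.Dense(n_classes, activation='softmax'))(x)
    return tf.keras.Model(inputs, outputs)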

The error doesn't seem to be affected by how small I make the model; it still appears after training for a few epochs. The error I get is: "ResourceExhaustedError: OOM when allocating tensor with shape[25600,9,11,128]". This comes after a call to tape.gradient (full error reported in the section below).
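
The training step where it fails looks roughly like this (the loss function and exact shapes are simplified here; tape.gradient is the call that raises the OOM):

import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

def train_one_step_timedistributed(model, optimizer, x_true, y_true, training=True):
    # y_true has already been repeated along the time axis so that it matches
    # the per-time-step outputs of TimeDistributed(Dense).
    with tf.GradientTape() as tape:
        y_pred = model(x_true, training=training)
        loss_ = loss_fn(y_true, y_pred)
    # The ResourceExhaustedError is raised from this gradient call.
    gradients = tape.gradient(loss_, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss_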

In my non-TimeDistributed model I monitor the number of objects via len(gc.get_objects()), and during training the object count remains constant (as expected). But when I change only what is needed for the TimeDistributed version (i.e. repeating the labels along the time axis and setting return_sequences=True for the top-level LSTM), thousands of new objects are suddenly added during every training epoch.
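
The object counts below were gathered with a per-epoch loop of roughly this shape (run_epochs is just an illustrative name for my run_model_once helper, and train is my own function, visible in the traceback further down, not a TensorFlow API):

import gc

def run_epochs(model, optimizer, train_ds, test_ds, cm, epochs):
    prev_count = len(gc.get_objects())
    print(f'gc objects: {prev_count}')
    for _ in range(epochs):
        train_loss, val_loss, acc_metric, val_acc_metric = train(
            model, optimizer, train_ds, test_ds, cm)
        # Count live Python objects after every epoch to track the growth.
        count = len(gc.get_objects())
        print(f'gc objects: {count} (+ {count - prev_count} objects)')
        prev_count = count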

gc objects: 249861
[TRAIN]: End (epoch 0): loss 0.693269372 ; train accuracy 0.5
[TEST]: End (epoch 0): loss 0.691318274 ; test accuracy 0.500683606
gc objects: 251746 (+ 1885 objects)
[TRAIN]: End (epoch 1): loss 0.691800237 ; train accuracy 0.500202894
[TEST]: End (epoch 1): loss 0.690349817 ; test accuracy 0.502343774
gc objects: 254144 (+ 2398 objects)
[TRAIN]: End (epoch 2): loss 0.690762699 ; train accuracy 0.500456572
[TEST]: End (epoch 2): loss 0.689480364 ; test accuracy 0.504296899
gc objects: 254996 (+852 objects)
[TRAIN]: End (epoch 3): loss 0.692312837 ; train accuracy 0.501090705
[TEST]: End (epoch 3): loss 0.689140499 ; test accuracy 0.505468726
gc objects: 269643 (+ 14647 objects)
[TRAIN]: End (epoch 4): loss 0.688487 ; train accuracy 0.501116097
[TEST]: End (epoch 4): loss 0.686942577 ; test accuracy 0.508886695
gc objects: 270444 (+ 801 objects)

So over five epochs of training (epochs 0-4), while no other process was running, 20,583 new objects were created, which I presume resulted in this resource exhausted error.

I've tried forcing garbage collection with gc.collect(), but the object count increases whether or not this call is included. I also ran a snapshot comparison with the tracemalloc library, which I will include below in case it is helpful (it wasn't to me).
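
For completeness, the forced collection and the snapshot comparison were done roughly like this, called once per epoch (report_memory_diff is just an illustrative name for my helper):

import gc
import tracemalloc

tracemalloc.start()
previous_snapshot = None

def report_memory_diff():
    global previous_snapshot
    # Forcing a collection makes no difference to the growing object count.
    gc.collect()
    snapshot = tracemalloc.take_snapshot()
    if previous_snapshot is not None:
        # Top 10 allocation differences since the previous epoch's snapshot.
        for stat in snapshot.compare_to(previous_snapshot, 'lineno')[:10]:
            print(stat)
    previous_snapshot = snapshot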

Something is creating objects during every epoch, steadily using up memory without releasing it, and eventually triggering the resource exhausted error. This doesn't occur if I don't use TimeDistributed, and I don't see why this layer would legitimately require creating additional memory-hungry variables every epoch, so it looks more like a leak.

Do you have any idea what I could do to alleviate this problem? It looks to me like a bug that needs fixing, but perhaps there is a workaround in the meantime. Please let me know if any further information from my end would be useful in looking into this issue.

Source code / logs

tracemalloc top 10 differences between snapshot calls at adjacent epochs

C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\eager\execute.py:61: size=111 KiB (+69.9 KiB), count=677 (+426), average=168 B
:14: size=7464 B (-46.9 KiB), count=107 (-749), average=70 B
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\tokenize.py:609: size=2944 B (-43.6 KiB), count=46 (-698), average=64 B
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\framework\tensor_shape.py:193: size=59.9 KiB (+33.8 KiB), count=1305 (+732), average=47 B
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\training\tracking\data_structures.py:768: size=54.0 KiB (+31.3 KiB), count=386 (+219), average=143 B
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\framework\tensor_shape.py:718: size=55.7 KiB (+30.8 KiB), count=1018 (+564), average=56 B
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\framework\tensor_shape.py:776: size=51.0 KiB (+28.7 KiB), count=1235 (+690), average=42 B
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\keras\utils\generic_utils.py:564: size=40.9 KiB (+25.8 KiB), count=675 (+426), average=62 B
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\framework\ops.py:1035: size=39.3 KiB (+23.3 KiB), count=950 (+566), average=42 B
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\training\tracking\data_structures.py:809: size=27.1 KiB (+15.9 KiB), count=3 (+0), average=9248 B

full error

ResourceExhaustedError Traceback (most recent call last)
in
----> 1 best_val, best_epoch, tmp_history = run_model_once(0, 25, epochs=50)

in run_model_once(start, end, epochs)
36 printed_cm = False
37
---> 38 train_loss, val_loss, acc_metric, val_acc_metric = train(RCNN_model, optimizer, train_ds, test_ds, cm)
39 tf.print(f'[TRAIN]: End (epoch {i}): loss', train_loss, '; train accuracy', acc_metric.result())
40 tf.print(f'[TEST]: End (epoch {i}): loss', val_loss, '; test accuracy', val_acc_metric.result())

in train(model, optimizer, train_ds, test_ds, cm)
60 for x_true, y_true in train_ds:
61 if TIME_DISTRIBUTED:
---> 62 train_loss = train_one_step_timedistributed(model, optimizer, x_true, y_true, training=True)
63 else:
64 train_loss = train_one_step(model, optimizer, x_true, y_true, training=True)

in train_one_step_timedistributed(model, optimizer, x_true, y_true, training)
22 print(f'model trainable variables: {len(model.trainable_variables)}')
23
---> 24 gradients = tape.gradient(loss_, model.trainable_variables)
25 optimizer.apply_gradients(zip(gradients, model.trainable_variables))
26

~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\eager\backprop.py in gradient(self, target, sources, output_gradients, unconnected_gradients)
1012 output_gradients=output_gradients,
1013 sources_raw=flat_sources_raw,
-> 1014 unconnected_gradients=unconnected_gradients)
1015
1016 if not self._persistent:

~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\eager\imperative_grad.py in imperative_grad(tape, target, sources, output_gradients, sources_raw, unconnected_gradients)
74 output_gradients,
75 sources_raw,
---> 76 compat.as_str(unconnected_gradients.value))

~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\eager\backprop.py in _gradient_function(op_name, attr_tuple, num_inputs, inputs, outputs, out_grads, skip_input_indices)
136 return [None] * num_inputs
137
--> 138 return grad_fn(mock_op, *out_grads)
139
140

~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\ops\math_grad.py in _TanhGrad(op, grad)
712 with ops.control_dependencies([grad]):
713 y = math_ops.conj(y)
--> 714 return gen_math_ops.tanh_grad(y, grad)
715
716

~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\ops\gen_math_ops.py in tanh_grad(y, dy, name)
11410 else:
11411 message = e.message
---> 11412 _six.raise_from(_core._status_to_exception(e.code, message), None)
11413 # Add nodes to the TensorFlow graph.
11414 _, _, _op = _op_def_lib._apply_op_helper(

~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\six.py in raise_from(value, from_value)

ResourceExhaustedError: OOM when allocating tensor with shape[25600,9,11,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:TanhGrad]

Metadata

Labels

TF 2.0 (Issues relating to TensorFlow 2.0), comp:keras (Keras related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), type:bug (Bug)
