Potential memory leak when using LSTM + TimeDistributed #33178
Comments
@algrmur , |
Hi @algrmur |
Hello @oanush and @akanyaani, thanks for your replies. I've stripped out all of the non-essential code in my program and set it up to use randomly generated data to recreate the problem. Please see below for (1) the code and (2) the output I get. Imports + Data Generation
Model definition
Loss/Optimiser/tf.Dataset/Model instantiation
Training functions
Running the model
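(The original code blocks are not reproduced in this extract. The following is a condensed, illustrative sketch of the setup described here and in the issue body below: random fixed-length data, a small convolutional/LSTM stack with return_sequences=True, a TimeDistributed(Dense) head, and a GradientTape training loop that prints the gc object count. All layer sizes, names, and data shapes are assumptions, not the original code.)
```python
import gc
import numpy as np
import tensorflow as tf

# Illustrative sketch only; sizes, names and the random data are assumptions.

# Imports + data generation: 100 fixed time points per example, binary labels
# repeated across time so they match the TimeDistributed output.
x = np.random.randn(512, 100, 16).astype(np.float32)
y = np.random.randint(0, 2, size=(512, 1)).astype(np.float32)
y_seq = np.repeat(y[:, None, :], 100, axis=1)          # (512, 100, 1)

# Model definition: convolutional front end -> LSTMs -> TimeDistributed(Dense).
model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(32, 3, padding='same', activation='relu',
                           input_shape=(100, 16)),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1, activation='sigmoid')),
])

# Loss / optimiser / tf.data.Dataset / model instantiation.
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam()
train_ds = tf.data.Dataset.from_tensor_slices((x, y_seq)).batch(64)

# Training function: a plain GradientTape step.
@tf.function
def train_one_step(x_true, y_true):
    with tf.GradientTape() as tape:
        y_pred = model(x_true, training=True)
        loss = loss_fn(y_true, y_pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Running the model while watching the Python object count.
for epoch in range(5):
    for x_true, y_true in train_ds:
        loss = train_one_step(x_true, y_true)
    print(f'epoch {epoch}: loss {float(loss):.4f}, gc objects: {len(gc.get_objects())}')
```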
The output of this shows that many objects are being created as the epochs run. Output
@akanyaani - Thanks for your comment about it potentially being related to variable length input, but in my case (and also in the case of the dummy code above) there is no variable length input as there are always 100 time points in the data I am using. If you:
then the same code runs while keeping the number of objects relatively stable, as seen below. This is what made me sure it was related to TimeDistributed. Output without using TimeDistributed
I hope you can also recreate the issue and thereby potentially see where the problem might lie. Kind regards |
Hi @algrmur, it works fine on my system. Could you please try with a smaller batch size? |
Hi @akanyaani, Interesting! I tried via notebooks, command line etc. and it always gave the same error. I will try it on my Linux laptop to see if it also breaks there. Do you mind if I ask about your specs so I can see what else I might be able to try (mainly just CUDA / cuDNN version and what version of Python you used)? Then I can try on Windows with the same things running as you, as that might fix my problem. Once I am back at my main workstation tomorrow, I will try with a lower batch size to answer your question as to whether that might be the problem. I didn't think it would be because TimeDistributed(Dense(..)) uses the same weights for each time step so I thought the computation would be equivalent (in terms of the gradient call) to the non-TimeDistributed case (that does work). I could be wrong though. Furthermore, I don't know why it would be fine with the first 5-6 epochs and then fail afterwards. If it can handle the first few, nothing new should be done during the training so I still have no explanation as to how the OOM could occur. More information tomorrow! |
I am also seeing a memory leak. No LSTM though, just TimeDistributed. This model fails after printing 11 on a 2080Ti (a rough stand-in for the model is sketched after this comment). Batch size is always 1, so that's not the problem.
|
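(The model itself is not reproduced in this extract. As a rough, hypothetical stand-in for "just TimeDistributed, no LSTM, batch size 1, print a counter each step", something along these lines matches the description; all sizes and layer choices are assumptions.)
```python
import tensorflow as tf

# Hypothetical stand-in for the kind of model described above (TimeDistributed
# without any LSTM); input shape, layer sizes and step count are assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        input_shape=(20, 64, 64, 3)),                    # (time, height, width, channels)
    tf.keras.layers.TimeDistributed(tf.keras.layers.GlobalAveragePooling2D()),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),
])
model.compile(optimizer='adam', loss='mse')

# Batch size is always 1; a counter is printed each step, as in the comment above.
for step in range(20):
    x = tf.random.normal([1, 20, 64, 64, 3])
    y = tf.random.normal([1, 20, 1])
    model.train_on_batch(x, y)
    print(step)
```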
So, does this line do what I think it does?
Because that looks like it's permanently storing the inputs in a map that never gets cleared, using a global UUID as the key. |
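(For context, here is a self-contained toy illustrating the pattern being described, not the actual Keras source: every call stores its inputs under a fresh key in an instance-level map that is never cleared, so the inputs stay referenced for the lifetime of the layer.)
```python
import itertools

# Toy illustration only, not the actual wrappers.py code.
class LeakyWrapper:
    _uids = itertools.count()          # stand-in for generic_utils.object_list_uid

    def __init__(self):
        self._input_map = {}           # grows by one entry per call, never cleared

    def __call__(self, inputs):
        input_uid = next(self._uids)          # a new key for every call
        self._input_map[input_uid] = inputs   # inputs are referenced forever
        return inputs
```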
@algrmur , |
Hi oanush, I tried running a reduced model on a very small batch size (16) and it ran longer than it did last time but there was still a considerable increase of objects in memory on each training loop (a few hundred at each iteration). It just made a bit more space for more epochs to run, but then ran into OOM errors at a later point. I think Tetragramm (above) is having the same issue and further confirms my belief it's with the TimeDistributed layer (as without it - my model runs fine). I wanted to run the same model using the same setup as akanyaani but so far I've not seen what the exact details were in his case / whether he has much more memory available. |
I had this issue in version 2.0.0. The Beta1 version works and runs faster per epoch. |
@arthurflor23 Are you sure? The Beta1 code is nearly identical to the release 2.0.0 code. The only change is to regularization, which has nothing that would cause a memory leak. Upon further review (actually reading through the code), I'm pretty sure the _input_map variable is both the cause and useless. I think lines 56, 246, 247, 249, 308, and 309 can be removed, and line 310 replaced with
Unfortunately, I'm having trouble building tensorflow from source to test. |
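(A rough sketch of the kind of change being proposed; the before/after lines below are reconstructed from this thread and are assumptions, not the actual diff.)
```python
# Sketch only; reconstructed from the discussion, not the actual diff.
#
# Before: inputs are stored in, and later fetched back from, an ever-growing map.
#     input_uid = generic_utils.object_list_uid(inputs)
#     self._input_map[input_uid] = inputs
#     ...
#     y = self.layer(self._input_map.get(input_uid, inputs), **kwargs)
#
# After: the map and its bookkeeping lines are removed, and the wrapped layer
# is simply called on the inputs it was given.
#     y = self.layer(inputs, **kwargs)
```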
Hi! I don't know if it's related to this or to another module's version, but on Beta1 the leak doesn't happen and each epoch takes ~140s. I will do more tests with the two models I'm studying, because I already have another problem in the recurrent layers. Just to add: the behavior I mentioned appears since rc0 (I installed version by version to check, via Google Colab). |
Two weeks have passed; any success? |
Not quite. Some complications. Awaiting someone with better understanding of the system than me. |
I also have a memory leak from 1.14 up to 2.0. On 1.13 the leak disappears. |
I used TF 1.14 and did not have a memory leak. |
Hi, I'm using TF-GPU 2.0.0 and having the same issue when using the TimeDistributed wrapper...
before this:
does not result in an increasing GPU memory allocation... but I don't know if it is still correct. I only saw self._input_map being accessed during the preparation of training, and only for the last input_uid, so I thought that clearing the map before adding the latest element would not break the logic behind it but would still fix the memory leak. |
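(The patched snippet is not reproduced in this extract. Reconstructed from the description above, the workaround amounts to clearing the wrapper's input map just before the new entry is stored, roughly as follows; the exact placement inside TimeDistributed in keras/layers/wrappers.py is an assumption.)
```python
# Reconstructed sketch of the described workaround; exact location in
# wrappers.py is an assumption.
self._input_map.clear()               # drop stale entries so only the latest input is kept
self._input_map[input_uid] = inputs   # pre-existing line that stores the current input
```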
I can verify that @arnemoos 's workaround prevents the OOM for me. |
I'm running today's tf-nightly-gpu and no longer get the error; can anyone confirm? |
I can confirm that using TimeDistributed also runs my model into resource allocation errors on TF 2.0.0. I'm using the fit_generator() training function with a model that has 3 Conv2D layers, each wrapped in TimeDistributed, on batches with a total memory footprint of 39.32 MB (batch size = 32). @arthurflor23 I will try tf-nightly-gpu now and confirm or not. |
@arthurflor23 I can confirm that the issue has been gone for me as well :) |
@arthurflor23, yes, it's true. |
Again, I can confirm that TimeDistributed is the culprit. In my case, tf-nightly breaks my model, so I solved the problem by writing a custom for-loop in a subclassed model rather than using TimeDistributed (see the sketch below). But this bug has to be fixed for those using non-subclassed models. |
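(A minimal sketch of that kind of workaround, assuming a Dense head applied per time step inside a subclassed model; names and sizes are illustrative.)
```python
import tensorflow as tf

# Minimal sketch: apply the same Dense layer to each time step with an explicit
# loop inside a subclassed model, instead of wrapping it in TimeDistributed.
class PerStepDense(tf.keras.Model):
    def __init__(self, units=1):
        super().__init__()
        self.lstm = tf.keras.layers.LSTM(64, return_sequences=True)
        self.dense = tf.keras.layers.Dense(units)

    def call(self, x):
        x = self.lstm(x)                                    # (batch, time, features)
        outputs = [self.dense(x[:, t]) for t in range(x.shape[1])]
        return tf.stack(outputs, axis=1)                    # (batch, time, units)

model = PerStepDense()
out = model(tf.random.normal([4, 100, 16]))                 # -> (4, 100, 1)
```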
I was able to work around this issue by reshaping my tensor to combine the first two dimensions, applying the convolution / dense layer, and reshaping back to the expected output shape (see the sketch below). |
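(A minimal sketch of that reshape trick, assuming the time and feature dimensions are statically known; the function name and shapes are illustrative.)
```python
import tensorflow as tf

# Fold the time axis into the batch axis, apply the ordinary layer once,
# then unfold back to (batch, time, units).
def apply_per_timestep(layer, x):
    batch = tf.shape(x)[0]                      # dynamic batch size
    time, features = x.shape[1], x.shape[2]     # static time / feature dims
    flat = tf.reshape(x, [-1, features])        # (batch * time, features)
    y = layer(flat)
    return tf.reshape(y, [batch, time, -1])

dense = tf.keras.layers.Dense(8)
out = apply_per_timestep(dense, tf.random.normal([4, 100, 16]))   # -> (4, 100, 8)
```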
The fix has been merged. |
Thanks @Tetragramm . @algrmur please let us know if your issue has been fixed and we can close this issue. |
Imported from GitHub PR #33441 Documented in [THIS](#33178) thread. Based on the documentation of get() and generic_utils.object_list_uid, this has no functional effect, except to remove an unnecessary map that was growing with every input. Tested using the example program in [THIS](#331... PiperOrigin-RevId: 291277510 Change-Id: I97df3c26850ae460d41e5032bb71edd11c948670
The PR has been rolled back due to a test failure. I will try to update the internal code/test to fix the memory leak. |
@qlzh727 So is the problem solved? Can I move to the 2.x version now? |
Yes, the issue is resolved by d064c6f. |
Fix tensorflow#33178. PiperOrigin-RevId: 292043221 Change-Id: Ife2fa9a2adf50424bb2c932044fe3db5f4bb42d5
Hello, I still have this problem in Google Colab Pro. An OOM occurred when I was running the code, and once it does, the GPU shows 15.08 GB / 16.00 GB used and cannot be cleared; I have to stop my Jupyter notebook and recreate a session to run again. Can someone tell me how to solve this problem? |
Hello, I can confirm that this issue is not fixed, even in recent builds. A simple model like:
needs ~25 GB, even with recent nightly builds of TensorFlow. |
System information
Describe the problem
Potential memory leak when using LSTM + TimeDistributed
I have a standard time series model that consists of 3 convolutional layers feeding into 2 LSTM layers. Up until now, I have had no problems mapping a Dense layer to the last output of the top LSTM and making a prediction, etc. However, I want to implement a model where I use a TimeDistributed(Dense(..)) layer on top of the top LSTM and feed back the error signal at each time point. I have implemented this, but after training only a few epochs I get a resource exhausted error.
The error doesn't seem to be affected by how small I make the model; it still occurs after training for a few epochs. The error I get is: "ResourceExhaustedError: OOM when allocating tensor with shape[25600,9,11,128]". This comes after a call to tape.gradient (full error reported in the section below).
In my non-TimeDistributed model I monitor the number of objects via len(gc.get_objects()), and during training the object count remains the same (as expected). But when I change the model to handle TimeDistributed (i.e. making sure the labels are correctly repeated and setting return_sequences=True for the top-level LSTM), then all of a sudden thousands of new objects are added during each epoch of training.
gc objects: 249861
[TRAIN]: End (epoch 0): loss 0.693269372 ; train accuracy 0.5
[TEST]: End (epoch 0): loss 0.691318274 ; test accuracy 0.500683606
gc objects: 251746 (+ 1885 objects)
[TRAIN]: End (epoch 1): loss 0.691800237 ; train accuracy 0.500202894
[TEST]: End (epoch 1): loss 0.690349817 ; test accuracy 0.502343774
gc objects: 254144 (+ 2398 objects)
[TRAIN]: End (epoch 2): loss 0.690762699 ; train accuracy 0.500456572
[TEST]: End (epoch 2): loss 0.689480364 ; test accuracy 0.504296899
gc objects: 254996 (+852 objects)
[TRAIN]: End (epoch 3): loss 0.692312837 ; train accuracy 0.501090705
[TEST]: End (epoch 3): loss 0.689140499 ; test accuracy 0.505468726
gc objects: 269643 (+ 14647 objects)
[TRAIN]: End (epoch 4): loss 0.688487 ; train accuracy 0.501116097
[TEST]: End (epoch 4): loss 0.686942577 ; test accuracy 0.508886695
gc objects: 270444 (+ 801 objects)
So over 4 epochs of training, with no other process running, 20,583 new objects were created, which I presume resulted in the resource exhausted error.
I've tried to force the garbage collector to collect any unused variables but the object count increases whether this is included or not. I ran a snapshot comparison from the tracemalloc library, which I will include below as it might be helpful (it wasn't to me).
Something is creating variables during every epoch, vastly using up all the memory and not releasing them, leading to this resource exhausted error. This doesn't occur if I don't use TimeDistributed, so I don't think anything about this layer requires the creation of additional memory-hungry variables. It looks more like a leak.
Do you have any idea what I could do to alleviate this problem? It seems like it needs a bug fix at a technical level, but maybe there is a workaround. Please let me know if any further info from my end would be useful in looking into this issue.
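(For reference, a minimal sketch of this kind of per-epoch monitoring with gc and tracemalloc; train_one_epoch is a placeholder, not the original training code.)
```python
import gc
import tracemalloc

def train_one_epoch():
    pass  # placeholder for the actual training loop

tracemalloc.start()
prev = tracemalloc.take_snapshot()

for epoch in range(5):
    train_one_epoch()
    gc.collect()                                        # force collection of unused objects
    print(f'epoch {epoch}: gc objects: {len(gc.get_objects())}')
    snap = tracemalloc.take_snapshot()
    for stat in snap.compare_to(prev, 'lineno')[:10]:   # top 10 allocation diffs
        print(stat)
    prev = snap
```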
Source code / logs
tracemalloc top 10 differences between snapshot calls at adjacent epochs
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\eager\execute.py:61: size=111 KiB (+69.9 KiB), count=677 (+426), average=168 B
:14: size=7464 B (-46.9 KiB), count=107 (-749), average=70 B
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\tokenize.py:609: size=2944 B (-43.6 KiB), count=46 (-698), average=64 B
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\framework\tensor_shape.py:193: size=59.9 KiB (+33.8 KiB), count=1305 (+732), average=47 B
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\training\tracking\data_structures.py:768: size=54.0 KiB (+31.3 KiB), count=386 (+219), average=143 B
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\framework\tensor_shape.py:718: size=55.7 KiB (+30.8 KiB), count=1018 (+564), average=56 B
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\framework\tensor_shape.py:776: size=51.0 KiB (+28.7 KiB), count=1235 (+690), average=42 B
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\keras\utils\generic_utils.py:564: size=40.9 KiB (+25.8 KiB), count=675 (+426), average=62 B
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\framework\ops.py:1035: size=39.3 KiB (+23.3 KiB), count=950 (+566), average=42 B
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\training\tracking\data_structures.py:809: size=27.1 KiB (+15.9 KiB), count=3 (+0), average=9248 B
full error
ResourceExhaustedError Traceback (most recent call last)
in
----> 1 best_val, best_epoch, tmp_history = run_model_once(0, 25, epochs=50)
in run_model_once(start, end, epochs)
36 printed_cm = False
37
---> 38 train_loss, val_loss, acc_metric, val_acc_metric = train(RCNN_model, optimizer, train_ds, test_ds, cm)
39 tf.print(f'[TRAIN]: End (epoch {i}): loss', train_loss, '; train accuracy', acc_metric.result())
40 tf.print(f'[TEST]: End (epoch {i}): loss', val_loss, '; test accuracy', val_acc_metric.result())
in train(model, optimizer, train_ds, test_ds, cm)
60 for x_true, y_true in train_ds:
61 if TIME_DISTRIBUTED:
---> 62 train_loss = train_one_step_timedistributed(model, optimizer, x_true, y_true, training=True)
63 else:
64 train_loss = train_one_step(model, optimizer, x_true, y_true, training=True)
in train_one_step_timedistributed(model, optimizer, x_true, y_true, training)
22 print(f'model trainable variables: {len(model.trainable_variables)}')
23
---> 24 gradients = tape.gradient(loss_, model.trainable_variables)
25 optimizer.apply_gradients(zip(gradients, model.trainable_variables))
26
~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\eager\backprop.py in gradient(self, target, sources, output_gradients, unconnected_gradients)
1012 output_gradients=output_gradients,
1013 sources_raw=flat_sources_raw,
-> 1014 unconnected_gradients=unconnected_gradients)
1015
1016 if not self._persistent:
~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\eager\imperative_grad.py in imperative_grad(tape, target, sources, output_gradients, sources_raw, unconnected_gradients)
74 output_gradients,
75 sources_raw,
---> 76 compat.as_str(unconnected_gradients.value))
~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\eager\backprop.py in _gradient_function(op_name, attr_tuple, num_inputs, inputs, outputs, out_grads, skip_input_indices)
136 return [None] * num_inputs
137
--> 138 return grad_fn(mock_op, *out_grads)
139
140
~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\ops\math_grad.py in _TanhGrad(op, grad)
712 with ops.control_dependencies([grad]):
713 y = math_ops.conj(y)
--> 714 return gen_math_ops.tanh_grad(y, grad)
715
716
~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\ops\gen_math_ops.py in tanh_grad(y, dy, name)
11410 else:
11411 message = e.message
~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\six.py in raise_from(value, from_value)
ResourceExhaustedError: OOM when allocating tensor with shape[25600,9,11,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:TanhGrad]