Describe the problem
Potential memory leak when using LSTM + TimeDistributed
I have a standard time-series model consisting of 3 convolutional layers feeding into 2 LSTM layers. Up until now, I have had no problems mapping a Dense layer to the last output of the top LSTM and making a prediction. However, I now want a model that uses a TimeDistributed(Dense(..)) layer on top of the top LSTM and feeds back the error signal at each time point. I have implemented this, but after training for only a few epochs I get a resource-exhausted error.
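A minimal sketch of the architecture described above (layer sizes, kernel widths, and feature dimensions are my assumptions, not the reporter's actual values):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical shapes: 100 time steps, 64 features per step.
TIME_STEPS, FEATURES = 100, 64

model = tf.keras.Sequential([
    tf.keras.Input(shape=(TIME_STEPS, FEATURES)),
    # 3 convolutional layers over the time axis
    layers.Conv1D(32, 3, padding="same", activation="relu"),
    layers.Conv1D(64, 3, padding="same", activation="relu"),
    layers.Conv1D(128, 3, padding="same", activation="relu"),
    # 2 LSTM layers; the top one returns the full sequence so that
    # TimeDistributed(Dense) can make a prediction at every time step
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(64, return_sequences=True),
    layers.TimeDistributed(layers.Dense(1)),
])
```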
The error appears after a few epochs regardless of how small I make the model. The error I get is: "ResourceExhaustedError: OOM when allocating tensor with shape[25600,9,11,128]". It comes after a call to tape.gradient (full traceback in the section below).
In my non-TimeDistributed model I monitor the number of live objects via len(gc.get_objects()), and during training the object count stays constant (as expected). But when I change only what the TimeDistributed variant needs (i.e. repeating the labels across time steps and setting return_sequences=True on the top-level LSTM), thousands of new objects suddenly get created during every training epoch.
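Repeating the per-sequence labels across time steps, as mentioned above, can be sketched like this (shapes are illustrative):

```python
import numpy as np

# Hypothetical shapes: batch of 32 sequences, 100 time steps each.
BATCH, TIME_STEPS = 32, 100

y = np.random.randint(0, 2, size=(BATCH, 1))           # one label per sequence
y_repeated = np.repeat(y[:, np.newaxis, :], TIME_STEPS, axis=1)

print(y_repeated.shape)  # (32, 100, 1): the same label at every time step
```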
gc objects: 249861
So in 4 epochs of training, with no other process running, 20,583 new objects were created, which I presume led to the resource-exhausted error.
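The per-epoch object counting described above can be reproduced with a stdlib-only sketch (train_one_epoch is a hypothetical placeholder for the actual training step):

```python
import gc

def count_objects():
    """Number of objects currently tracked by the garbage collector."""
    return len(gc.get_objects())

baseline = count_objects()
for epoch in range(4):
    # train_one_epoch(model, optimizer, train_ds)   # placeholder
    gc.collect()  # collect first, so only genuinely live objects are counted
    print(f"epoch {epoch}: {count_objects() - baseline:+d} objects vs. baseline")
```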
I've tried forcing the garbage collector to collect any unused variables, but the object count increases whether or not gc.collect() is called. I also ran a snapshot comparison with the tracemalloc library, which I include below in case it is helpful (it wasn't to me).
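A stdlib sketch of that snapshot comparison (the list allocation stands in for one training epoch):

```python
import tracemalloc

tracemalloc.start()
snap_before = tracemalloc.take_snapshot()

# Stand-in for one training epoch: allocate some objects between snapshots.
leaked = [bytearray(1024) for _ in range(100)]

snap_after = tracemalloc.take_snapshot()

# Top 10 differences grouped by source line, as in the dump quoted below.
for stat in snap_after.compare_to(snap_before, "lineno")[:10]:
    print(stat)
```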
Something is creating variables during every epoch and never releasing them, steadily consuming memory until the resource-exhausted error occurs. This doesn't happen without TimeDistributed, and I don't see why this layer would require additional memory-hungry variables. It looks more like a leak.
Do you have any idea what I could do to alleviate this problem? It looks like a bug rather than a usage error, so perhaps there is a fix at the framework level. Please let me know if any further information from my end would be useful in looking at this issue.
Source code / logs
tracemalloc top 10 differences between snapshot calls at adjacent epochs
C:\Users\AXM1390\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\eager\execute.py:61: size=111 KiB (+69.9 KiB), count=677 (+426), average=168 B
ResourceExhaustedError Traceback (most recent call last)
in run_model_once(start, end, epochs)
in train(model, optimizer, train_ds, test_ds, cm)
in train_one_step_timedistributed(model, optimizer, x_true, y_true, training)
~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\eager\backprop.py in gradient(self, target, sources, output_gradients, unconnected_gradients)
~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\eager\imperative_grad.py in imperative_grad(tape, target, sources, output_gradients, sources_raw, unconnected_gradients)
~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\eager\backprop.py in _gradient_function(op_name, attr_tuple, num_inputs, inputs, outputs, out_grads, skip_input_indices)
~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\ops\math_grad.py in _TanhGrad(op, grad)
~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\ops\gen_math_ops.py in tanh_grad(y, dy, name)
~\AppData\Local\Continuum\anaconda3\envs\tf2\lib\site-packages\six.py in raise_from(value, from_value)
ResourceExhaustedError: OOM when allocating tensor with shape[25600,9,11,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:TanhGrad]
Thanks for your replies. I've stripped down all of the non-essential code in my program and set it up to use randomly generated data to recreate the problem. Please see below for (1) code and (2) output given for me.
Imports + Data Generation
Running the model
The output of this shows that many objects are being created during the runs of the epochs.
@akanyaani - Thanks for your comment about it potentially being related to variable-length input, but in my case (and in the dummy code above) there is no variable-length input: the data I am using always has 100 time points.
Without TimeDistributed, the same code runs while keeping the number of objects relatively stable, as seen below. This is what convinced me the issue is related to TimeDistributed.
Output without using TimeDistributed
I hope you can also recreate the issue and thereby potentially see where the problem might lie.
Interesting! I tried via notebooks, command line etc. and it always gave the same error. I will try it on my Linux laptop to see if it also breaks there. Do you mind if I ask about your specs so I can see what else I might be able to try (mainly just CUDA / cuDNN version and what version of Python you used)? Then I can try on Windows with the same things running as you, as that might fix my problem.
Once I am back at my main workstation tomorrow, I will try a lower batch size to answer your question about whether that might be the problem. I didn't expect it to be, because TimeDistributed(Dense(..)) uses the same weights for each time step, so I assumed the computation would be equivalent (in terms of the gradient call) to the non-TimeDistributed case, which does work. I could be wrong though. Furthermore, I don't know why it runs fine for the first 5-6 epochs and only then fails. If it can handle the first few epochs, nothing new should be allocated later in training, so I still have no explanation for how the OOM could occur.
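The weight-sharing argument above can be checked directly: wrapping Dense in TimeDistributed adds no parameters. A sketch (shapes are assumed):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical shapes: 100 time steps, 64 features.
inp = tf.keras.Input(shape=(100, 64))

td = tf.keras.Model(inp, layers.TimeDistributed(layers.Dense(8))(inp))
plain = tf.keras.Model(inp, layers.Dense(8)(inp))

# Both apply one shared 64x8 weight matrix (plus bias) at every time step,
# so the trainable parameter counts are identical.
print(td.count_params(), plain.count_params())  # both 64*8 + 8 = 520
```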
More information tomorrow!
I am also seeing a memory leak. No LSTM though, just TimeDistributed. This model fails after printing 11 on a 2080Ti. Batch Size is always 1, so that's not the problem.
I tried running a reduced model with a very small batch size (16). It ran longer than last time, but there was still a considerable increase of objects in memory on each training loop (a few hundred per iteration). The smaller batch just left room for more epochs before running into OOM errors at a later point.
I think @Tetragramm (above) is having the same issue, which further confirms my belief that it's the TimeDistributed layer (without it, my model runs fine). I wanted to run the same model with the same setup as @akanyaani, but I haven't yet seen the exact details of his case or whether he has much more memory available.
@arthurflor23 Are you sure? The Beta1 code is nearly identical to the release 2.0.0 code. The only change is to regularization, which contains nothing that would cause a memory leak.
Upon further review (actually reading through the code), I'm pretty sure the _input_map variable is both the cause of the leak and unnecessary. I think lines 56, 246, 247, 249, 308, and 309 can be removed, and line 310 replaced with
Unfortunately, I'm having trouble building tensorflow from source to test.
I don't know if it's related to this or to another module's version, but with Beta1 the leak doesn't happen and each epoch takes ~140 s.
I will do more tests with the two models that I'm studying, because I already have another problem in the recurrent layers.
Just to add: the behavior I mentioned has appeared since rc0. (I installed version by version via Google Colab to check.)
I'm using TF-GPU 2.0.0 and having the same issue when using the TimeDistributed wrapper...
Clearing self._input_map before adding the latest element does not result in increasing GPU memory allocation, but I don't know whether it is still correct. I only saw self._input_map being used when preparing training, and only the last input_uid was referenced, so I thought that clearing it before adding the latest element would not break the logic while still fixing the memory leak.