Calling variable.assign() too many times crashes on memory allocation. #2311
Here's a quick script that should break when you run it (if it helps...). Mine dies on iteration 31.
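(A minimal sketch of that kind of script, not the exact original; the variable shape, loop bound, and names are taken from the discussion below.)

import numpy as np
import tensorflow as tf

# A 1000x1000 float variable, matching the shapes discussed further down.
w1 = tf.Variable(tf.zeros([1000, 1000]))
new_value_array = np.random.rand(1000, 1000).astype(np.float32)

sess = tf.Session()
sess.run(tf.initialize_all_variables())
for i in range(3000):
    print "Assigning i:{}".format(i)
    # Each iteration builds a brand-new assign node in the graph.
    sess.run(w1.assign(new_value_array))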
The assign op is not consuming memory; the problem is caused by the fact that each instance of w1.assign(new_value_array) adds a new node to the graph, and those nodes are never freed. The fix is to rewrite your program somewhat. Instead of doing:

for i in range(3000):
    print "Assigning i:{}".format(i)
    sess.run(w1.assign(new_value_array))

...you should declare the assign op and a placeholder before the loop, and feed different values to the placeholder in each iteration:

assign_placeholder = tf.placeholder(tf.float32, shape=[1000, 1000])
assign_op = w1.assign(assign_placeholder)
for i in range(3000):
    print "Assigning i:{}".format(i)
    sess.run(assign_op, feed_dict={assign_placeholder: new_value_array})

This way the graph contains a single assign node, and only the fed value changes from one iteration to the next.
That totally makes sense now. I never would have guessed to do that though. Thanks so much.
Indeed - it's a difficult error to disallow, because there are many totally valid patterns that involve adding nodes to the graph. One tip is to try calling tf.get_default_graph().finalize() once you've built everything you need: it locks the graph, so any later attempt to add a node (such as an accidental assign() inside a loop) raises an exception instead of silently growing memory use.
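For example (a minimal sketch, reusing the placeholder pattern above):

import numpy as np
import tensorflow as tf

w1 = tf.Variable(tf.zeros([1000, 1000]))
assign_placeholder = tf.placeholder(tf.float32, shape=[1000, 1000])
assign_op = w1.assign(assign_placeholder)
init_op = tf.initialize_all_variables()

# Freeze the graph: from here on, creating any new node raises a RuntimeError.
tf.get_default_graph().finalize()

sess = tf.Session()
sess.run(init_op)
new_value_array = np.random.rand(1000, 1000).astype(np.float32)
for i in range(3000):
    sess.run(assign_op, feed_dict={assign_placeholder: new_value_array})
    # sess.run(w1.assign(new_value_array))  # would now fail fast instead of leaking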
Thanks @mrry. Everything is running great again. |
Glad to hear it! |
Background: I'm working on a set of networks that only share some layers, so I have a parameter server that sends new weights for the different clients to use. These clients accept the new weights and biases for the layers they are using and assign the values to the tf.Variables via sess.run(self.w1.assign(new_weights)). However, when I start it up and let it run, it crashes with a memory-allocation failure (sometimes it's allocating 16B, other times it's 3.9KiB).
To give you an idea of the size of the weights, I have three layers of:
Layer 1 (W, b): (2, 1000), (1000,)
Layer 2 (W, b): (1000, 1000), (1000,)
Layer 3 (W, b): (1000, 4), (4,)
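Roughly, the variables look like this (simplified sketch; names and initializers are placeholders):

w1 = tf.Variable(tf.truncated_normal([2, 1000]))
b1 = tf.Variable(tf.zeros([1000]))
w2 = tf.Variable(tf.truncated_normal([1000, 1000]))
b2 = tf.Variable(tf.zeros([1000]))
w3 = tf.Variable(tf.truncated_normal([1000, 4]))
b3 = tf.Variable(tf.zeros([4]))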
I'm running on a Titan X with 12G memory.
With per_process_gpu_memory_fraction = 0.01, the program dies at ~190 assign commands.
With per_process_gpu_memory_fraction = 0.02, the program dies at ~384 assign commands.
With per_process_gpu_memory_fraction = 0.03, the program dies at ~780 assign commands.
With per_process_gpu_memory_fraction = 0.04, the program dies at ~784 assign commands.
With per_process_gpu_memory_fraction = 0.05, the program dies at ~1582 assign commands.
With per_process_gpu_memory_fraction = 0.06, the program dies at ~1586 assign commands.
I've tried setting allow_growth=True and deferred_deletion_bytes=1 in the session's GPUOptions after reading issue #1578, but that didn't get me much further. (I have no idea what deferred_deletion_bytes does...) Looking at the numbers just above (GPU memory fraction vs. assign commands), the relationship seems fairly linear, so it looks to me like each assign operation takes some GPU RAM that is never freed. Is there any sense of GC on the GPU memory allocated during the var.assign() op?
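For reference, the options above were passed in roughly like this (a sketch; the exact fraction varied between the runs listed above):

config = tf.ConfigProto(gpu_options=tf.GPUOptions(
    per_process_gpu_memory_fraction=0.05,
    allow_growth=True,
    deferred_deletion_bytes=1))
sess = tf.Session(config=config)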
It seems that I could delete the session and create a new one, but that sounds expensive to me, and I'd have to maintain the weights outside of the session to be able to restore them correctly. The second idea I had would be to use placeholders and ship the weights in every time with the feed_dict, but again, that seems less than ideal, and I think the optimizer would struggle to know what to optimize if the weights are just placeholders.
Let me know if you would like any other logs or reports from me. I figure this is the first time someone has tried to use assign operations like this, so I want to be helpful in fixing it if it's a bug.
Thanks
Environment info
Operating System: Ubuntu 16.04
Installed version of CUDA and cuDNN:
/usr/local/cuda/lib/libcudart.so -> libcudart.so.7.0
/usr/local/cuda/lib/libcudart.so.7.0 -> libcudart.so.7.0.28
/usr/local/cuda/lib/libcudart.so.7.0.28
/usr/local/cuda/lib/libcudart_static.a
Built from source. Commit hash: 35cd6a3