Finetuning Errors #32

Closed
nickwalton opened this issue Sep 27, 2019 · 5 comments

@nickwalton

Hey, I'm getting the following fine-tuning errors on a multi-GPU machine. I made sure to re-patch Keras, but haven't had any luck. Any idea what the issue is?

W0927 22:27:35.617535 140220124120896 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/clip_ops.py:286: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0927 22:27:36.428683 140220124120896 deprecation.py:506] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/adagrad.py:76: calling __init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
global_step: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
2019-09-27 22:27:44.207266: I tensorflow/core/common_runtime/placer.cc:54] global_step: (VariableV2)/job:localhost/replica:0/task:0/device:GPU:0
global_step/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:0
2019-09-27 22:27:44.207316: I tensorflow/core/common_runtime/placer.cc:54] global_step/Assign: (Assign)/job:localhost/replica:0/task:0/device:GPU:0
global_step/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
2019-09-27 22:27:44.207337: I tensorflow/core/common_runtime/placer.cc:54] global_step/read: (Identity)/job:localhost/replica:0/task:0/device:GPU:0
w/Initializer/random_normal/RandomStandardNormal: (RandomStandardNormal): /job:localhost/replica:0/task:0/device:GPU:0
2019-09-27 22:27:44.207349: I tensorflow/core/common_runtime/placer.cc:54] w/Initializer/random_normal/RandomStandardNormal: (RandomStandardNormal)/job:localhost/replica:0/task:0/device:GPU:0
Traceback (most recent call last):
File "training.py", line 162, in
estimator_model = tf.keras.estimator.model_to_estimator(keras_model=model, config=run_config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/estimator/init.py", line 73, in model_to_estimator
config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/keras.py", line 450, in model_to_estimator
config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/keras.py", line 331, in save_first_checkpoint
saver.save(sess, latest_path)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1173, in save
{self.saver_def.filename_tensor_name: checkpoint_file})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1173, in run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1350, in do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1370, in do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation w/Initializer/random_normal/mul: Could not satisfy explicit device specification '' because the node node w/Initializer/random_normal/mul (defined at training.py:90) placed on device Device assignments active during op 'w/Initializer/random_normal/mul' creation:
with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py:602> was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1, /job:localhost/replica:0/task:0/device:XLA_GPU:2, /job:localhost/replica:0/task:0/device:XLA_GPU:3, /job:localhost/replica:0/task:0/device:XLA_GPU:4, /job:localhost/replica:0/task:0/device:XLA_GPU:5, /job:localhost/replica:0/task:0/device:XLA_GPU:6, /job:localhost/replica:0/task:0/device:XLA_GPU:7, /job:localhost/replica:0/task:0/device:XLA_GPU:8, /job:localhost/replica:0/task:0/device:XLA_GPU:9, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:GPU:2, /job:localhost/replica:0/task:0/device:GPU:3, /job:localhost/replica:0/task:0/device:GPU:4, /job:localhost/replica:0/task:0/device:GPU:5, /job:localhost/replica:0/task:0/device:GPU:6, /job:localhost/replica:0/task:0/device:GPU:7, /job:localhost/replica:0/task:0/device:GPU:8, /job:localhost/replica:0/task:0/device:GPU:9].
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index
=1 requested_device_name
='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name
='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name
='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
UnsortedSegmentSum: GPU CPU XLA_CPU XLA_GPU
ResourceGather: GPU CPU XLA_CPU XLA_GPU
Shape: GPU CPU XLA_CPU XLA_GPU
Unique: GPU CPU
ReadVariableOp: GPU CPU XLA_CPU XLA_GPU
ResourceSparseApplyAdagrad: CPU
StridedSlice: GPU CPU XLA_CPU XLA_GPU
AssignVariableOp: GPU CPU XLA_CPU XLA_GPU
Identity: GPU CPU XLA_CPU XLA_GPU
RandomStandardNormal: GPU CPU XLA_CPU XLA_GPU
Mul: GPU CPU XLA_CPU XLA_GPU
Add: GPU CPU XLA_CPU XLA_GPU
VarHandleOp: GPU CPU XLA_CPU XLA_GPU
Const: GPU CPU XLA_CPU XLA_GPU
VarIsInitializedOp: GPU CPU XLA_CPU XLA_GPU

Colocation members, user-requested devices, and framework assigned devices, if any:
w/Initializer/random_normal/shape (Const)
w/Initializer/random_normal/mean (Const)
w/Initializer/random_normal/stddev (Const)
w/Initializer/random_normal/RandomStandardNormal (RandomStandardNormal) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
w/Initializer/random_normal/mul (Mul)
w/Initializer/random_normal (Add)
w (VarHandleOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
w/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
w/Assign (AssignVariableOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
w/Read/ReadVariableOp (ReadVariableOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
tied_embedding_softmax/embedding_lookup (ResourceGather) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
tied_embedding_softmax/embedding_lookup/Identity (Identity)
tied_embedding_softmax_1/transpose/ReadVariableOp (ReadVariableOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
VarIsInitializedOp_322 (VarIsInitializedOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
AssignVariableOp (AssignVariableOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
ReadVariableOp (ReadVariableOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
w/Adagrad/Initializer/Const (Const)
w/Adagrad (VarHandleOp)
w/Adagrad/IsInitialized/VarIsInitializedOp (VarIsInitializedOp)
w/Adagrad/Assign (AssignVariableOp)
w/Adagrad/Read/ReadVariableOp (ReadVariableOp)
training/Adagrad/update_w/Unique (Unique)
training/Adagrad/update_w/Shape (Shape)
training/Adagrad/update_w/strided_slice/stack (Const)
training/Adagrad/update_w/strided_slice/stack_1 (Const)
training/Adagrad/update_w/strided_slice/stack_2 (Const)
training/Adagrad/update_w/strided_slice (StridedSlice)
training/Adagrad/update_w/UnsortedSegmentSum (UnsortedSegmentSum)
training/Adagrad/update_w/ResourceSparseApplyAdagrad (ResourceSparseApplyAdagrad)
save/AssignVariableOp_1542 (AssignVariableOp)
save/AssignVariableOp_1543 (AssignVariableOp)

 [[node w/Initializer/random_normal/mul (defined at training.py:90) ]]Additional information about colocations:No node-device colocations were active during op 'w/Initializer/random_normal/mul' creation.

Device assignments active during op 'w/Initializer/random_normal/mul' creation:
with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py:602>

Original stack trace for u'w/Initializer/random_normal/mul':
File "training.py", line 162, in
estimator_model = tf.keras.estimator.model_to_estimator(keras_model=model, config=run_config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/estimator/init.py", line 73, in model_to_estimator
config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/keras.py", line 450, in model_to_estimator
config)
File "/usr/local/lib/python2.7/dist-packages/

@keskarnitish
Contributor

The logs are a bit hard to parse (& seem incomplete? I can't find the final error). Can you post the entire log file (perhaps a pastebin link)?

@nickwalton
Author

Yep! I'll post a link with the full log when I have a moment. From what I can tell, the key parts were:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation w/Initializer/random_normal/mul: Could not satisfy explicit device specification '' because the node node w/Initializer/random_normal/mul (defined at training.py:90) placed on device Device assignments active during op 'w/Initializer/random_normal/mul' creation:
and
resource_device_name='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
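Reading those two lines together, the colocation group around the embedding variable w includes ResourceSparseApplyAdagrad, which only has a CPU kernel, while the framework pinned w to GPU:0, so no single device can satisfy the group. One thing worth trying (a minimal sketch using the stock TF 1.x Estimator APIs; not necessarily the eventual fix) is to allow soft placement in the RunConfig passed to model_to_estimator, so TensorFlow can fall back to the CPU kernel instead of erroring out:

import tensorflow as tf

# Let TensorFlow fall back to a supported device (here CPU, for the
# sparse Adagrad update) instead of failing on the GPU colocation.
session_config = tf.ConfigProto(allow_soft_placement=True)
run_config = tf.estimator.RunConfig(session_config=session_config)

# `model` is the Keras model built earlier in training.py (assumed here).
estimator_model = tf.keras.estimator.model_to_estimator(
    keras_model=model, config=run_config)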

@keskarnitish
Contributor

I've never seen that before. Are you able to run inference on the same GPU?

@nickwalton
Author

nickwalton commented Oct 2, 2019 via email

@nickwalton
Author

I made a pull request that fixes the issue. Not sure if you want to merge it in, but I figured I'd put it up in case anyone else is looking for a solution: #51
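For anyone else who hits this: the underlying conflict is that the sparse Adagrad update for the tied embedding (ResourceSparseApplyAdagrad) only has a CPU kernel, so the variable it is colocated with cannot be forced onto the GPU. A workaround in that spirit (a rough sketch only; the variable name and sizes below are illustrative, and the actual change is in the PR) is to create the embedding weight under an explicit CPU device scope so the optimizer update can be colocated with it:

import tensorflow as tf

# Illustrative sizes; the real values come from the model config.
vocab_size, embedding_dim = 50000, 1280

# Pin the tied embedding weight to the CPU so the CPU-only
# ResourceSparseApplyAdagrad op can be colocated with it.
with tf.device('/cpu:0'):
    w = tf.get_variable(
        'w',
        shape=[vocab_size, embedding_dim],
        initializer=tf.random_normal_initializer(stddev=0.02))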
