Finetuning Errors #32

Closed
nickwalton opened this issue Sep 27, 2019 · 5 comments

@nickwalton

Hey, I'm getting the following fine-tuning errors on a multi-GPU machine. I made sure to re-patch Keras, but haven't had any luck. Any idea what the issue is?

W0927 22:27:35.617535 140220124120896 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/clip_ops.py:286: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0927 22:27:36.428683 140220124120896 deprecation.py:506] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/adagrad.py:76: calling __init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
global_step: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
2019-09-27 22:27:44.207266: I tensorflow/core/common_runtime/placer.cc:54] global_step: (VariableV2)/job:localhost/replica:0/task:0/device:GPU:0
global_step/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:0
2019-09-27 22:27:44.207316: I tensorflow/core/common_runtime/placer.cc:54] global_step/Assign: (Assign)/job:localhost/replica:0/task:0/device:GPU:0
global_step/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
2019-09-27 22:27:44.207337: I tensorflow/core/common_runtime/placer.cc:54] global_step/read: (Identity)/job:localhost/replica:0/task:0/device:GPU:0
w/Initializer/random_normal/RandomStandardNormal: (RandomStandardNormal): /job:localhost/replica:0/task:0/device:GPU:0
2019-09-27 22:27:44.207349: I tensorflow/core/common_runtime/placer.cc:54] w/Initializer/random_normal/RandomStandardNormal: (RandomStandardNormal)/job:localhost/replica:0/task:0/device:GPU:0
Traceback (most recent call last):
File "training.py", line 162, in
estimator_model = tf.keras.estimator.model_to_estimator(keras_model=model, config=run_config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/estimator/init.py", line 73, in model_to_estimator
config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/keras.py", line 450, in model_to_estimator
config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/keras.py", line 331, in save_first_checkpoint
saver.save(sess, latest_path)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1173, in save
{self.saver_def.filename_tensor_name: checkpoint_file})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1173, in run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1350, in do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1370, in do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation w/Initializer/random_normal/mul: Could not satisfy explicit device specification '' because the node node w/Initializer/random_normal/mul (defined at training.py:90) placed on device Device assignments active during op 'w/Initializer/random_normal/mul' creation:
with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py:602> was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1, /job:localhost/replica:0/task:0/device:XLA_GPU:2, /job:localhost/replica:0/task:0/device:XLA_GPU:3, /job:localhost/replica:0/task:0/device:XLA_GPU:4, /job:localhost/replica:0/task:0/device:XLA_GPU:5, /job:localhost/replica:0/task:0/device:XLA_GPU:6, /job:localhost/replica:0/task:0/device:XLA_GPU:7, /job:localhost/replica:0/task:0/device:XLA_GPU:8, /job:localhost/replica:0/task:0/device:XLA_GPU:9, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:GPU:2, /job:localhost/replica:0/task:0/device:GPU:3, /job:localhost/replica:0/task:0/device:GPU:4, /job:localhost/replica:0/task:0/device:GPU:5, /job:localhost/replica:0/task:0/device:GPU:6, /job:localhost/replica:0/task:0/device:GPU:7, /job:localhost/replica:0/task:0/device:GPU:8, /job:localhost/replica:0/task:0/device:GPU:9].
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index
=1 requested_device_name
='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name
='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name
='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
UnsortedSegmentSum: GPU CPU XLA_CPU XLA_GPU
ResourceGather: GPU CPU XLA_CPU XLA_GPU
Shape: GPU CPU XLA_CPU XLA_GPU
Unique: GPU CPU
ReadVariableOp: GPU CPU XLA_CPU XLA_GPU
ResourceSparseApplyAdagrad: CPU
StridedSlice: GPU CPU XLA_CPU XLA_GPU
AssignVariableOp: GPU CPU XLA_CPU XLA_GPU
Identity: GPU CPU XLA_CPU XLA_GPU
RandomStandardNormal: GPU CPU XLA_CPU XLA_GPU
Mul: GPU CPU XLA_CPU XLA_GPU
Add: GPU CPU XLA_CPU XLA_GPU
VarHandleOp: GPU CPU XLA_CPU XLA_GPU
Const: GPU CPU XLA_CPU XLA_GPU
VarIsInitializedOp: GPU CPU XLA_CPU XLA_GPU

Colocation members, user-requested devices, and framework assigned devices, if any:
w/Initializer/random_normal/shape (Const)
w/Initializer/random_normal/mean (Const)
w/Initializer/random_normal/stddev (Const)
w/Initializer/random_normal/RandomStandardNormal (RandomStandardNormal) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
w/Initializer/random_normal/mul (Mul)
w/Initializer/random_normal (Add)
w (VarHandleOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
w/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
w/Assign (AssignVariableOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
w/Read/ReadVariableOp (ReadVariableOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
tied_embedding_softmax/embedding_lookup (ResourceGather) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
tied_embedding_softmax/embedding_lookup/Identity (Identity)
tied_embedding_softmax_1/transpose/ReadVariableOp (ReadVariableOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
VarIsInitializedOp_322 (VarIsInitializedOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
AssignVariableOp (AssignVariableOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
ReadVariableOp (ReadVariableOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
w/Adagrad/Initializer/Const (Const)
w/Adagrad (VarHandleOp)
w/Adagrad/IsInitialized/VarIsInitializedOp (VarIsInitializedOp)
w/Adagrad/Assign (AssignVariableOp)
w/Adagrad/Read/ReadVariableOp (ReadVariableOp)
training/Adagrad/update_w/Unique (Unique)
training/Adagrad/update_w/Shape (Shape)
training/Adagrad/update_w/strided_slice/stack (Const)
training/Adagrad/update_w/strided_slice/stack_1 (Const)
training/Adagrad/update_w/strided_slice/stack_2 (Const)
training/Adagrad/update_w/strided_slice (StridedSlice)
training/Adagrad/update_w/UnsortedSegmentSum (UnsortedSegmentSum)
training/Adagrad/update_w/ResourceSparseApplyAdagrad (ResourceSparseApplyAdagrad)
save/AssignVariableOp_1542 (AssignVariableOp)
save/AssignVariableOp_1543 (AssignVariableOp)

 [[node w/Initializer/random_normal/mul (defined at training.py:90) ]]Additional information about colocations:No node-device colocations were active during op 'w/Initializer/random_normal/mul' creation.

Device assignments active during op 'w/Initializer/random_normal/mul' creation:
with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py:602>

Original stack trace for u'w/Initializer/random_normal/mul':
File "training.py", line 162, in
estimator_model = tf.keras.estimator.model_to_estimator(keras_model=model, config=run_config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/estimator/init.py", line 73, in model_to_estimator
config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/keras.py", line 450, in model_to_estimator
config)
File "/usr/local/lib/python2.7/dist-packages/

@keskarnitish
Contributor

The logs are a bit hard to parse (& seem incomplete? I can't find the final error). Can you post the entire log file (perhaps a pastebin link)?

@nickwalton
Author

Yep! I'll post a link with the full log when I have a moment. From what I can tell, the key parts were:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation w/Initializer/random_normal/mul: Could not satisfy explicit device specification '' because the node node w/Initializer/random_normal/mul (defined at training.py:90) placed on device Device assignments active during op 'w/Initializer/random_normal/mul' creation:
and
resource_device_name='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
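Reading those two lines together, the colocation group around the embedding variable w includes ResourceSparseApplyAdagrad, which only has a CPU kernel, while the framework pinned w to GPU:0, so no single device can satisfy the group. One thing worth trying (a minimal sketch using the stock TF 1.x Estimator APIs; not necessarily the eventual fix) is to allow soft placement in the RunConfig passed to model_to_estimator, so TensorFlow can fall back to the CPU kernel instead of erroring out:

import tensorflow as tf

# Let TensorFlow fall back to a supported device (here CPU, for the
# sparse Adagrad update) instead of failing on the GPU colocation.
session_config = tf.ConfigProto(allow_soft_placement=True)
run_config = tf.estimator.RunConfig(session_config=session_config)

# `model` is the Keras model built earlier in training.py (assumed here).
estimator_model = tf.keras.estimator.model_to_estimator(
    keras_model=model, config=run_config)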

@keskarnitish
Contributor

I've never seen that before. Are you able to run inference on the same GPU?

@nickwalton
Author

nickwalton commented Oct 2, 2019 via email

@nickwalton
Author

I made a pull request that fixes the issue. Not sure if you want to merge it in, but I figured I'd put it up in case anyone else is looking for a solution: #51
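For anyone else who hits this: the underlying conflict is that the sparse Adagrad update for the tied embedding (ResourceSparseApplyAdagrad) only has a CPU kernel, so the variable it is colocated with cannot be forced onto the GPU. A workaround in that spirit (a rough sketch only; the variable name and sizes below are illustrative, and the actual change is in the PR) is to create the embedding weight under an explicit CPU device scope so the optimizer update can be colocated with it:

import tensorflow as tf

# Illustrative sizes; the real values come from the model config.
vocab_size, embedding_dim = 50000, 1280

# Pin the tied embedding weight to the CPU so the CPU-only
# ResourceSparseApplyAdagrad op can be colocated with it.
with tf.device('/cpu:0'):
    w = tf.get_variable(
        'w',
        shape=[vocab_size, embedding_dim],
        initializer=tf.random_normal_initializer(stddev=0.02))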
