Finetuning Errors #32
Comments
The logs are a bit hard to parse (and they seem incomplete; I can't find the final error). Can you post the entire log file (perhaps a pastebin link)?
Yep! I'll post a link with the full log when I have a moment. From what I can tell, the key parts were:
I've never seen that before. Are you able to run inference on the same GPU? |
Yep, I can run inference just fine. It's just during training that I run into this issue.
I made a pull request that fixes the issue (#51). Not sure if you want to merge it in, but I figured I'd put it up in case anyone else is looking for a solution.
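For reference, a common workaround for this class of placement error (this is a hedged sketch, not necessarily what the linked PR does) is to enable soft placement, so TensorFlow can fall back to the CPU for ops that have no GPU kernel instead of failing the colocation check. The `run_config` name mirrors the one passed to `model_to_estimator` in the traceback below:

```python
import tensorflow as tf  # TF 1.x API, matching the versions in this thread

# Allow TensorFlow to move ops that lack a GPU kernel
# (e.g. ResourceSparseApplyAdagrad) onto the CPU instead of
# raising an InvalidArgumentError during device placement.
session_config = tf.ConfigProto(allow_soft_placement=True)
run_config = tf.estimator.RunConfig(session_config=session_config)

# Then pass it through as in the original script:
# estimator_model = tf.keras.estimator.model_to_estimator(
#     keras_model=model, config=run_config)
```

This is a configuration fragment only; whether soft placement is sufficient here depends on how the embedding variable and its Adagrad slot end up colocated.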
Hey, I'm getting the following fine-tuning errors on a multi-GPU machine. I made sure to re-patch Keras, but haven't had any luck. Any idea what the issue is?
W0927 22:27:35.617535 140220124120896 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/clip_ops.py:286: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0927 22:27:36.428683 140220124120896 deprecation.py:506] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/adagrad.py:76: calling __init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
global_step: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
2019-09-27 22:27:44.207266: I tensorflow/core/common_runtime/placer.cc:54] global_step: (VariableV2)/job:localhost/replica:0/task:0/device:GPU:0
global_step/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:0
2019-09-27 22:27:44.207316: I tensorflow/core/common_runtime/placer.cc:54] global_step/Assign: (Assign)/job:localhost/replica:0/task:0/device:GPU:0
global_step/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
2019-09-27 22:27:44.207337: I tensorflow/core/common_runtime/placer.cc:54] global_step/read: (Identity)/job:localhost/replica:0/task:0/device:GPU:0
w/Initializer/random_normal/RandomStandardNormal: (RandomStandardNormal): /job:localhost/replica:0/task:0/device:GPU:0
2019-09-27 22:27:44.207349: I tensorflow/core/common_runtime/placer.cc:54] w/Initializer/random_normal/RandomStandardNormal: (RandomStandardNormal)/job:localhost/replica:0/task:0/device:GPU:0
Traceback (most recent call last):
File "training.py", line 162, in <module>
estimator_model = tf.keras.estimator.model_to_estimator(keras_model=model, config=run_config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/estimator/__init__.py", line 73, in model_to_estimator
config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/keras.py", line 450, in model_to_estimator
config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/keras.py", line 331, in _save_first_checkpoint
saver.save(sess, latest_path)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1173, in save
{self.saver_def.filename_tensor_name: checkpoint_file})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1173, in run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation w/Initializer/random_normal/mul: Could not satisfy explicit device specification '' because the node node w/Initializer/random_normal/mul (defined at training.py:90) placed on device Device assignments active during op 'w/Initializer/random_normal/mul' creation:
with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py:602> was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1, /job:localhost/replica:0/task:0/device:XLA_GPU:2, /job:localhost/replica:0/task:0/device:XLA_GPU:3, /job:localhost/replica:0/task:0/device:XLA_GPU:4, /job:localhost/replica:0/task:0/device:XLA_GPU:5, /job:localhost/replica:0/task:0/device:XLA_GPU:6, /job:localhost/replica:0/task:0/device:XLA_GPU:7, /job:localhost/replica:0/task:0/device:XLA_GPU:8, /job:localhost/replica:0/task:0/device:XLA_GPU:9, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:GPU:2, /job:localhost/replica:0/task:0/device:GPU:3, /job:localhost/replica:0/task:0/device:GPU:4, /job:localhost/replica:0/task:0/device:GPU:5, /job:localhost/replica:0/task:0/device:GPU:6, /job:localhost/replica:0/task:0/device:GPU:7, /job:localhost/replica:0/task:0/device:GPU:8, /job:localhost/replica:0/task:0/device:GPU:9].
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index=1 requested_device_name='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
UnsortedSegmentSum: GPU CPU XLA_CPU XLA_GPU
ResourceGather: GPU CPU XLA_CPU XLA_GPU
Shape: GPU CPU XLA_CPU XLA_GPU
Unique: GPU CPU
ReadVariableOp: GPU CPU XLA_CPU XLA_GPU
ResourceSparseApplyAdagrad: CPU
StridedSlice: GPU CPU XLA_CPU XLA_GPU
AssignVariableOp: GPU CPU XLA_CPU XLA_GPU
Identity: GPU CPU XLA_CPU XLA_GPU
RandomStandardNormal: GPU CPU XLA_CPU XLA_GPU
Mul: GPU CPU XLA_CPU XLA_GPU
Add: GPU CPU XLA_CPU XLA_GPU
VarHandleOp: GPU CPU XLA_CPU XLA_GPU
Const: GPU CPU XLA_CPU XLA_GPU
VarIsInitializedOp: GPU CPU XLA_CPU XLA_GPU
Colocation members, user-requested devices, and framework assigned devices, if any:
w/Initializer/random_normal/shape (Const)
w/Initializer/random_normal/mean (Const)
w/Initializer/random_normal/stddev (Const)
w/Initializer/random_normal/RandomStandardNormal (RandomStandardNormal) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
w/Initializer/random_normal/mul (Mul)
w/Initializer/random_normal (Add)
w (VarHandleOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
w/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
w/Assign (AssignVariableOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
w/Read/ReadVariableOp (ReadVariableOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
tied_embedding_softmax/embedding_lookup (ResourceGather) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
tied_embedding_softmax/embedding_lookup/Identity (Identity)
tied_embedding_softmax_1/transpose/ReadVariableOp (ReadVariableOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
VarIsInitializedOp_322 (VarIsInitializedOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
AssignVariableOp (AssignVariableOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
ReadVariableOp (ReadVariableOp) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
w/Adagrad/Initializer/Const (Const)
w/Adagrad (VarHandleOp)
w/Adagrad/IsInitialized/VarIsInitializedOp (VarIsInitializedOp)
w/Adagrad/Assign (AssignVariableOp)
w/Adagrad/Read/ReadVariableOp (ReadVariableOp)
training/Adagrad/update_w/Unique (Unique)
training/Adagrad/update_w/Shape (Shape)
training/Adagrad/update_w/strided_slice/stack (Const)
training/Adagrad/update_w/strided_slice/stack_1 (Const)
training/Adagrad/update_w/strided_slice/stack_2 (Const)
training/Adagrad/update_w/strided_slice (StridedSlice)
training/Adagrad/update_w/UnsortedSegmentSum (UnsortedSegmentSum)
training/Adagrad/update_w/ResourceSparseApplyAdagrad (ResourceSparseApplyAdagrad)
save/AssignVariableOp_1542 (AssignVariableOp)
save/AssignVariableOp_1543 (AssignVariableOp)
Device assignments active during op 'w/Initializer/random_normal/mul' creation:
with tf.device(None): </usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py:602>
Original stack trace for u'w/Initializer/random_normal/mul':
File "training.py", line 162, in <module>
estimator_model = tf.keras.estimator.model_to_estimator(keras_model=model, config=run_config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/estimator/__init__.py", line 73, in model_to_estimator
config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/keras.py", line 450, in model_to_estimator
config)
File "/usr/local/lib/python2.7/dist-packages/
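The colocation debug info above points at the likely cause: ResourceSparseApplyAdagrad (the sparse Adagrad update for the embedding weight w) lists only a CPU kernel, yet the variable it must colocate with was assigned to GPU:0. One illustrative workaround (a sketch under assumed names, not code from this repo; vocab_size and embedding_dim are hypothetical) is to build the embedding variable under an explicit CPU device scope so the CPU-only optimizer update can colocate with it:

```python
import tensorflow as tf  # TF 1.x graph-mode API, as in the log above

vocab_size = 50000    # hypothetical vocabulary size
embedding_dim = 1280  # hypothetical embedding width

# Pin the tied-embedding weight to CPU so the CPU-only
# ResourceSparseApplyAdagrad op can colocate with it.
with tf.device('/cpu:0'):
    w = tf.get_variable(
        'w',
        shape=[vocab_size, embedding_dim],
        initializer=tf.random_normal_initializer(stddev=0.02))
```

The trade-off is that embedding lookups then cross the CPU/GPU boundary each step; enabling soft placement in the session config is the lighter-touch alternative.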