Closed
Description
System information
- OS: Linux Ubuntu 14.04
- TensorFlow installed from: source
- TensorFlow version: 1.6
- Bazel version (if compiling from source):
- CUDA/cuDNN version: 8/5.1
- GPU model and memory:
- Exact command to reproduce:
Describe the problem
I trained on Cityscapes with fine_tune_batch_norm = false. The model was initialized from deeplabv3_cityscapes_train_2018_02_06. Shortly after step 720, training failed with a "Loss is inf or nan" error.
Training logs
INFO:tensorflow:global step 570: loss = 2.4862 (0.517 sec/step)
INFO:tensorflow:global step 580: loss = 1.9100 (0.539 sec/step)
INFO:tensorflow:global step 590: loss = 1.9793 (0.932 sec/step)
INFO:tensorflow:global step 600: loss = 3.4337 (0.525 sec/step)
INFO:tensorflow:global step 610: loss = 84.6659 (0.515 sec/step)
INFO:tensorflow:global step 620: loss = 20.1596 (0.948 sec/step)
INFO:tensorflow:global step 630: loss = 2.8936 (0.525 sec/step)
INFO:tensorflow:global step 640: loss = 1.9785 (0.529 sec/step)
INFO:tensorflow:global step 650: loss = 1.9451 (0.909 sec/step)
INFO:tensorflow:global step 660: loss = 3.2844 (0.532 sec/step)
INFO:tensorflow:global step 670: loss = 1.9610 (0.524 sec/step)
INFO:tensorflow:global step 680: loss = 3.4317 (0.991 sec/step)
INFO:tensorflow:global step 690: loss = 3.5062 (0.544 sec/step)
INFO:tensorflow:global step 700: loss = 2.9824 (0.633 sec/step)
INFO:tensorflow:global step 710: loss = 3.1381 (0.963 sec/step)
INFO:tensorflow:global step 720: loss = 12.0563 (0.521 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Loss is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](total_loss)]]
Caused by op u'CheckNumerics', defined at:
File "train.py", line 347, in <module>
tf.app.run()
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "train.py", line 291, in main
total_loss = tf.check_numerics(total_loss, 'Loss is inf or nan.')
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 498, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
op_def=op_def)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Loss is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](total_loss)]]
Traceback (most recent call last):
File "train.py", line 347, in <module>
tf.app.run()
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "train.py", line 340, in main
save_interval_secs=FLAGS.save_interval_secs)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
sess, train_op, global_step, train_step_kwargs)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Loss is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](total_loss)]]
Caused by op u'CheckNumerics', defined at:
File "train.py", line 347, in <module>
tf.app.run()
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "train.py", line 291, in main
total_loss = tf.check_numerics(total_loss, 'Loss is inf or nan.')
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 498, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
op_def=op_def)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Loss is inf or nan. : Tensor had NaN values
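
The loss in the log is already unstable well before the NaN (for example, 84.6659 at step 610 and 12.0563 at step 720). As a rough sketch, assuming log lines in the `global step N: loss = X` format shown above, the spikes could be pulled out of a saved log like this. The helper names `parse_losses` and `find_spikes` and the `factor` threshold are made up for illustration and are not part of train.py or TensorFlow.

```python
import re

# Matches lines such as:
#   INFO:tensorflow:global step 610: loss = 84.6659 (0.515 sec/step)
STEP_RE = re.compile(r"global step (\d+): loss = (\d+(?:\.\d+)?)")


def parse_losses(log_text):
    """Return a list of (step, loss) pairs found in the log text."""
    return [(int(m.group(1)), float(m.group(2)))
            for m in STEP_RE.finditer(log_text)]


def find_spikes(pairs, factor=3.0):
    """Return the (step, loss) pairs whose loss exceeds factor * median loss.

    `factor` is an arbitrary illustrative threshold, not a recommended value.
    """
    if not pairs:
        return []
    losses = sorted(loss for _, loss in pairs)
    median = losses[len(losses) // 2]  # upper median for even-length input
    return [(step, loss) for step, loss in pairs if loss > factor * median]


if __name__ == "__main__":
    # A few lines copied from the training log above.
    sample = """
INFO:tensorflow:global step 600: loss = 3.4337 (0.525 sec/step)
INFO:tensorflow:global step 610: loss = 84.6659 (0.515 sec/step)
INFO:tensorflow:global step 620: loss = 20.1596 (0.948 sec/step)
INFO:tensorflow:global step 630: loss = 2.8936 (0.525 sec/step)
INFO:tensorflow:global step 640: loss = 1.9785 (0.529 sec/step)
"""
    for step, loss in find_spikes(parse_losses(sample)):
        print(step, loss)
```

This only shows where the loss jumps. It does not say why, but it makes it easier to see whether the divergence is gradual or tied to particular steps.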