DeepLab training on Cityscapes fails with error "Loss is inf or nan" #3729

@hypercost

System information

  • OS Platform and Distribution: Linux Ubuntu 14.04
  • TensorFlow installed from: source
  • TensorFlow version: 1.6
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: 8/5.1
  • GPU model and memory:
  • Exact command to reproduce:

Describe the problem

I trained on Cityscapes with fine_tune_batch_norm = false. The model was initialized from deeplabv3_cityscapes_train_2018_02_06, and training failed with a "Loss is inf or nan" error.

Training logs

INFO:tensorflow:global step 570: loss = 2.4862 (0.517 sec/step)
INFO:tensorflow:global step 580: loss = 1.9100 (0.539 sec/step)
INFO:tensorflow:global step 590: loss = 1.9793 (0.932 sec/step)
INFO:tensorflow:global step 600: loss = 3.4337 (0.525 sec/step)
INFO:tensorflow:global step 610: loss = 84.6659 (0.515 sec/step)
INFO:tensorflow:global step 620: loss = 20.1596 (0.948 sec/step)
INFO:tensorflow:global step 630: loss = 2.8936 (0.525 sec/step)
INFO:tensorflow:global step 640: loss = 1.9785 (0.529 sec/step)
INFO:tensorflow:global step 650: loss = 1.9451 (0.909 sec/step)
INFO:tensorflow:global step 660: loss = 3.2844 (0.532 sec/step)
INFO:tensorflow:global step 670: loss = 1.9610 (0.524 sec/step)
INFO:tensorflow:global step 680: loss = 3.4317 (0.991 sec/step)
INFO:tensorflow:global step 690: loss = 3.5062 (0.544 sec/step)
INFO:tensorflow:global step 700: loss = 2.9824 (0.633 sec/step)
INFO:tensorflow:global step 710: loss = 3.1381 (0.963 sec/step)
INFO:tensorflow:global step 720: loss = 12.0563 (0.521 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Loss is inf or nan. : Tensor had NaN values
         [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](total_loss)]]

Caused by op u'CheckNumerics', defined at:
  File "train.py", line 347, in <module>
    tf.app.run()
  File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "train.py", line 291, in main
    total_loss = tf.check_numerics(total_loss, 'Loss is inf or nan.')
  File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 498, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Loss is inf or nan. : Tensor had NaN values
         [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](total_loss)]]

Traceback (most recent call last):
  File "train.py", line 347, in <module>
    tf.app.run()
  File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "train.py", line 340, in main
    save_interval_secs=FLAGS.save_interval_secs)
  File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Loss is inf or nan. : Tensor had NaN values
         [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](total_loss)]]

Caused by op u'CheckNumerics', defined at:
  File "train.py", line 347, in <module>
    tf.app.run()
  File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "train.py", line 291, in main
    total_loss = tf.check_numerics(total_loss, 'Loss is inf or nan.')
  File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 498, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Loss is inf or nan. : Tensor had NaN values
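
The loss in the log above already jumps at steps 610, 620 and 720, before the NaN is raised. To locate such jumps in a longer log, something like the following might help. This is only a rough sketch and not part of train.py: the helper names `parse_losses` and `find_spikes`, the `factor=3.0` threshold and the use of the median as a baseline are assumptions made for illustration, and the script is written for Python 3 (the environment in the traceback is Python 2.7). The sample text is a subset of the lines logged above.

```python
import math
import re

# Matches slim training log lines such as:
#   INFO:tensorflow:global step 610: loss = 84.6659 (0.515 sec/step)
STEP_RE = re.compile(r"global step (\d+): loss = (\S+) \(")


def parse_losses(log_text):
    """Return a list of (step, loss) pairs found in the log text."""
    return [(int(m.group(1)), float(m.group(2)))
            for m in STEP_RE.finditer(log_text)]


def find_spikes(pairs, factor=3.0):
    """Return the pairs whose loss is inf/NaN or above factor * median.

    The threshold is a hypothetical heuristic, not something DeepLab defines.
    """
    finite = sorted(loss for _, loss in pairs if math.isfinite(loss))
    if not finite:
        return list(pairs)
    median = finite[len(finite) // 2]
    return [(step, loss) for step, loss in pairs
            if not math.isfinite(loss) or loss > factor * median]


# A subset of the log lines from this report.
SAMPLE = """\
INFO:tensorflow:global step 590: loss = 1.9793 (0.932 sec/step)
INFO:tensorflow:global step 600: loss = 3.4337 (0.525 sec/step)
INFO:tensorflow:global step 610: loss = 84.6659 (0.515 sec/step)
INFO:tensorflow:global step 620: loss = 20.1596 (0.948 sec/step)
INFO:tensorflow:global step 630: loss = 2.8936 (0.525 sec/step)
INFO:tensorflow:global step 700: loss = 2.9824 (0.633 sec/step)
INFO:tensorflow:global step 710: loss = 3.1381 (0.963 sec/step)
INFO:tensorflow:global step 720: loss = 12.0563 (0.521 sec/step)
"""

if __name__ == "__main__":
    for step, loss in find_spikes(parse_losses(SAMPLE)):
        print("step %d: loss = %.4f" % (step, loss))
```

On the sample this flags steps 610, 620 and 720. It only shows where the loss starts to diverge; it does not say why.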
