Training custom model crashes with "ERROR:tensorflow:Model diverged with loss = NaN." #4881
Comments
Same problem for me on Windows. If I add the following command, it stops working after
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8799 MB memory)
As soon as I increase the values of those flags, it throws the same error I mentioned above. Any ideas? Thanks!
Same problem here.
I am also running into this issue. I was able to run the model_main.py script against the latest TensorFlow CPU package through a large number of steps, but when trying to use the TensorFlow GPU package I keep hitting the error "Model diverged with loss = NaN". I tried varying my batch size, but that did not resolve the issue.
Hi guys, I ended up using the old train.py from the legacy folder.
Same problem while using model_main.py to train.
Yes, I got duplicated lines yesterday during a training session, same as you.
I have the same problem with a very similar setup/task. I'm training for 1 class, using a GTX 1060 6GB GPU. Last week I was doing the same task with the TensorFlow CPU version on the same system, and it worked perfectly. Yesterday I installed a GPU and found this problem. I changed to --num_eval_steps=1 --num_train_steps=1 and it didn't crash...
I've updated to the latest version, set initial_learning_rate: 0 in my pipeline config file, checked my labels and bounding boxes again, and got the same result. With CPU I don't see this behaviour.
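(For reference, the learning rate in the stock ssd_inception_v2_coco pipeline config sits under train_config > optimizer, which is roughly where you would zero or lower it. The excerpt below shows the shipped defaults and is illustrative, not this user's actual file:)

train_config {
  optimizer {
    rms_prop_optimizer {
      learning_rate {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
    }
  }
}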
Same problem here. With CPU it works; with GPU it prints the error.
I've switched to the legacy/train.py script for training and legacy/eval.py for evaluation.
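(For anyone looking for the exact workaround commands, the legacy scripts are invoked roughly like this; the paths match the setup from the issue description and should be adjusted to your own directories:)

python legacy/train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/ssd_inception_v2_coco.config
python legacy/eval.py --logtostderr --checkpoint_dir=training/ --eval_dir=eval/ --pipeline_config_path=training/ssd_inception_v2_coco.config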
Relying on the legacy scripts is a workaround for this problem, but the main issue still persists. We shouldn't have to switch back to the legacy scripts when we want to train our model with a GPU. Running the non-legacy script still fails. This could be a CUDA-related issue, but I'm not sure about that.
@Stukongeluk Yes, sure, I agree with you. I just wanted to isolate possible problems related to the dataset, framework setup, platform, pipeline config, etc., and meanwhile mention the workaround. I've seen similar problems reported in #4754 and #3688.
Any news on this?
@xtianhb The problem exists for me even with the legacy script with batch size = 1. However, there are no NaN loss errors with other batch sizes.
Hi all, it looks like this is a bug in the Object Detection API with the pet dataset.
More updates: ERROR:tensorflow:Model diverged with loss = NaN.
Using legacy train.py works, but I needed to change the import section of object_detection/utils/variables_helper.py, commenting out these lines: #import tensorflow as tf #slim = tf.contrib.slim. That resolved the duplicated log output issue. It seems okay now, but I still could not save a jpg with the log.
I met the same problem. --num_train_steps=1 --num_eval_steps=1 works, but when I increase num_train_steps and num_eval_steps, I get the same error.
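(For context, those flags are passed to model_main.py as in the sketch below; the step counts are arbitrary examples, not recommended values:)

python model_main.py --model_dir=training/ --pipeline_config_path=training/ssd_inception_v2_coco.config --num_train_steps=50000 --num_eval_steps=2000 --alsologtostderr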
@gloomyfish1998 Have you dealt with the problem?
This problem can be solved by making only one GPU visible:
Windows: SET CUDA_VISIBLE_DEVICES=0
Linux: export CUDA_VISIBLE_DEVICES=0
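(The same restriction can also be applied from inside a Python script; a minimal sketch:)

import os
# CUDA reads this variable at initialization, so it must be set
# before TensorFlow is imported to have any effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import tensorflow as tf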
@121649982 |
@cjr0106 Just use legacy train.py; train_dir is the output directory where your custom trained model will be saved. You can contact me on WeChat: gloomy_fish
@cjr0106 It refers to the path where the model files are saved after training.
Thanks so much, I solved it.
Please update to TensorFlow 1.11.0. There is no problem with the optimizer in that version; my models now run OK.
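(If you want to try that, pinning the GPU package to that release is:)

pip install --upgrade tensorflow-gpu==1.11.0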
@yuezhilanyi
Yeah, pipeline_config_path should be a pbtxt file. Do you mean that the pbtxt should be the graph.pbtxt produced when executing train.py?
Train and eval use different config files.
Is the eval config file something I write myself, or is it produced by running train.py?
The same problem, +1.
Bro, how exactly do I do that? Also, how do I set training to compute accuracy every N steps, instead of only printing the loss? Do I add arguments after train.py, or change config.py? Please advise.
Hi, can anyone help me?
tensorflow/models is a trap. The trained results are not good, you hit all kinds of GPU-memory or RAM shortages, and the gradients explode. I've given up on it; using the author's original model code I don't have any of these messy problems.
Haha, is there another version of the SSD_mobilenet_v2 code?
In my case
If anyone is still stuck on this, what helped me was double-checking my dataset: some bounding boxes exceeded the image dimensions, leading to the error.
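(A minimal sketch of such a dataset check, assuming the common CSV layout from the Object Detection API tutorials with columns filename,width,height,class,xmin,ymin,xmax,ymax; the file name train_labels.csv is a placeholder:)

import csv

def find_bad_boxes(csv_path):
    """Yield rows whose bounding box lies outside the image or is degenerate."""
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            w, h = int(row["width"]), int(row["height"])
            xmin, ymin = float(row["xmin"]), float(row["ymin"])
            xmax, ymax = float(row["xmax"]), float(row["ymax"])
            if xmin < 0 or ymin < 0 or xmax > w or ymax > h or xmin >= xmax or ymin >= ymax:
                yield row

for bad in find_bad_boxes("train_labels.csv"):
    print(bad["filename"], bad)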
@Luca3424 We are checking to see if you still need help with this issue. We recommend that you upgrade to 2.7, which is the latest stable version of TF, have a look at #4881 (comment), and let us know if that helps. Thanks!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.
Closing as stale. Please reopen if you'd like to work on this further.
System information
python model_main.py --model_dir=training/ --pipeline_config_path=training/ssd_inception_v2_coco.config --logtostderr
Describe the problem
The model_main.py script crashes before completing even one training step. Normally I'd say it's my GPU, but the now-deprecated train.py script worked fine on it. I'm training a custom model with the ssd_inception_v2_coco config file and the corresponding pretrained model as the fine-tune checkpoint.
Source code / logs
(tensorflow2) c:\tensorflow2\models\research\object_detection>python model_main.py --model_dir=training/ --pipeline_config_path=training/ssd_inception_v2_coco.config --eval_training_data --alsologtostderr
C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\importlib\_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\importlib\_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
C:\tensorflow2\models\research\object_detection\utils\visualization_utils.py:25: UserWarning:
This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called before pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.
The backend was originally set to 'TkAgg' by the following code:
File "model_main.py", line 26, in
from object_detection import model_lib
File "C:\tensorflow2\models\research\object_detection\model_lib.py", line 26, in
from object_detection import eval_util
File "C:\tensorflow2\models\research\object_detection\eval_util.py", line 28, in
from object_detection.metrics import coco_evaluation
File "C:\tensorflow2\models\research\object_detection\metrics\coco_evaluation.py", line 20, in
from object_detection.metrics import coco_tools
File "C:\tensorflow2\models\research\object_detection\metrics\coco_tools.py", line 47, in
from pycocotools import coco
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\pycocotools\coco.py", line 49, in
import matplotlib.pyplot as plt
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\matplotlib\pyplot.py", line 71, in
from matplotlib.backends import pylab_setup
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\matplotlib\backends_init_.py", line 16, in
line for line in traceback.format_stack()
import matplotlib; matplotlib.use('Agg') # pylint: disable=multiple-statements
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn..model_fn at 0x0000013642613C80>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
2018-07-24 16:28:36.695781: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-07-24 16:28:37.044293: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1392] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.4175
pciBusID: 0000:26:00.0
totalMemory: 4.00GiB freeMemory: 3.30GiB
2018-07-24 16:28:37.053468: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1471] Adding visible gpu devices: 0
2018-07-24 16:28:37.688722: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-24 16:28:37.692253: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:958] 0
2018-07-24 16:28:37.694498: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0: N
2018-07-24 16:28:37.696812: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3025 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:26:00.0, compute capability: 6.1)
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
File "model_main.py", line 101, in
tf.app.run()
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "model_main.py", line 97, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\training.py", line 447, in train_and_evaluate
return executor.run()
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\training.py", line 531, in run
return self.run_local()
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\training.py", line 669, in run_local
hooks=train_hooks)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\estimator.py", line 366, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1119, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1135, in _train_model_default
saving_listeners)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1336, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 577, in run
run_metadata=run_metadata)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1053, in run
run_metadata=run_metadata)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1144, in run
raise six.reraise(*original_exc_info)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\six.py", line 693, in reraise
raise value
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1129, in run
return self._sess.run(*args, **kwargs)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1209, in run
run_metadata=run_metadata))
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 635, in after_run
raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.