Training custom model crashes with "ERROR:tensorflow:Model diverged with loss = NaN." #4881

Closed
Luca3424 opened this issue Jul 24, 2018 · 50 comments

@Luca3424

System information

  • What is the top-level directory of the model you are using: Object detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.9.0
  • Bazel version (if compiling from source): -
  • CUDA/cuDNN version: CUDA 9.0 / cuDNN 7.0.5
  • GPU model and memory: Nvidia GeForce GTX 1050 Ti
  • Exact command to reproduce:
    python model_main.py --model_dir=training/ --pipeline_config_path=training/ssd_inception_v2_coco.config --logtostderr

Describe the problem

The model_main.py script crashes before even one training step. Normally I'd say it's because of my GPU, but with the now deprecated train.py script it worked well. I'm training a custom model with the ssd_inception_v2_coco config file and the model as finetune checkpoint.

Source code / logs

`(tensorflow2) c:\tensorflow2\models\research\object_detection>python model_main.py --model_dir=training/ --pipeline_config_path=training/ssd_inception_v2_coco.config --eval_training_data --alsologtostderr
C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\importlib\_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\importlib\_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
C:\tensorflow2\models\research\object_detection\utils\visualization_utils.py:25: UserWarning:
This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called before pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

The backend was originally set to 'TkAgg' by the following code:
File "model_main.py", line 26, in <module>
from object_detection import model_lib
File "C:\tensorflow2\models\research\object_detection\model_lib.py", line 26, in <module>
from object_detection import eval_util
File "C:\tensorflow2\models\research\object_detection\eval_util.py", line 28, in <module>
from object_detection.metrics import coco_evaluation
File "C:\tensorflow2\models\research\object_detection\metrics\coco_evaluation.py", line 20, in <module>
from object_detection.metrics import coco_tools
File "C:\tensorflow2\models\research\object_detection\metrics\coco_tools.py", line 47, in <module>
from pycocotools import coco
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\pycocotools\coco.py", line 49, in <module>
import matplotlib.pyplot as plt
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\matplotlib\pyplot.py", line 71, in <module>
from matplotlib.backends import pylab_setup
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\matplotlib\backends\__init__.py", line 16, in <module>
line for line in traceback.format_stack()

import matplotlib; matplotlib.use('Agg') # pylint: disable=multiple-statements
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x0000013642613C80>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
2018-07-24 16:28:36.695781: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-07-24 16:28:37.044293: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1392] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.4175
pciBusID: 0000:26:00.0
totalMemory: 4.00GiB freeMemory: 3.30GiB
2018-07-24 16:28:37.053468: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1471] Adding visible gpu devices: 0
2018-07-24 16:28:37.688722: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-24 16:28:37.692253: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:958] 0
2018-07-24 16:28:37.694498: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0: N
2018-07-24 16:28:37.696812: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3025 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:26:00.0, compute capability: 6.1)
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
File "model_main.py", line 101, in <module>
tf.app.run()
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "model_main.py", line 97, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\training.py", line 447, in train_and_evaluate
return executor.run()
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\training.py", line 531, in run
return self.run_local()
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\training.py", line 669, in run_local
hooks=train_hooks)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\estimator.py", line 366, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1119, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1135, in _train_model_default
saving_listeners)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1336, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 577, in run
run_metadata=run_metadata)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1053, in run
run_metadata=run_metadata)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1144, in run
raise six.reraise(*original_exc_info)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\six.py", line 693, in reraise
raise value
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1129, in run
return self._sess.run(*args, **kwargs)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1209, in run
run_metadata=run_metadata))
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 635, in after_run
raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.`

@kingstarcraft

Same problem here on Windows. If I add the following flags:

-num_train_steps=1 -num_eval_steps=1

it stops working right after the log line "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8799 MB memory)".

@Luca3424

Luca3424 commented Jul 25, 2018

When I add the --num_train_steps=1 and --num_eval_steps=1 flags, it crashes with the following error:

tensorflow.python.framework.errors_impl.NotFoundError: Failed to create a directory: training/export\Servo\temp-b'1532522498'; No such file or directory

As soon as I increase the values of these flags, it throws the same NaN loss error I mentioned above.

Any ideas? Thanks!

@jacano

jacano commented Jul 25, 2018

Same problem here

@GuyTraveler

I am also running into this issue. I was able to run the model_main.py script against the latest TensorFlow CPU package and have it run through a large number of steps, but with the TensorFlow GPU package I keep hitting the error "Model diverged with loss = NaN". I tried varying my batch size, but that did not resolve the issue.

@jacano

jacano commented Jul 26, 2018

Hi guys, I ended up using the old train.py from the legacy folder, like this.
From models/research/object_detection:
python ./legacy/train.py --pipeline_config_path=pipeline_config/ssd_mobilenet_v2_coco.config --train_dir=training/ --logtostderr

@yuezhilanyi

Same problem when using model_main.py to train.
@jacano do you see duplicated training steps when using the legacy train.py? I saw log output like this:

INFO:tensorflow:Restoring parameters from /ChinaRS/code/tensorflow/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path models\model.ckpt
INFO:tensorflow:Saving checkpoint to path models\model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 1: loss = 2.2247 (14.886 sec/step)
INFO:tensorflow:global step 1: loss = 2.2247 (14.886 sec/step)

@jacano

jacano commented Jul 31, 2018

Yes, I got duplicated lines yesterday during a training session, same as you.
I guess it has to do with the --logtostderr flag. I didn't have time to investigate further, sorry.

@xtianhb

xtianhb commented Aug 5, 2018

I have the same problem with a very similar setup and task. I'm training for 1 class, using a GTX 1060 6GB GPU.
Command: python model_main.py --num_eval_steps=2000 --num_train_steps=50000 --alsologtostderr --pipeline_config_path=training/ssdlite_mobilenet_v2_coco.config --model_dir=training

Last week I was doing the same task with the TensorFlow CPU version on the same system, and it worked perfectly. Yesterday I installed a GPU and ran into this problem.

When I changed the flags to --num_eval_steps=1 --num_train_steps=1, it didn't crash.

@xtianhb

xtianhb commented Aug 7, 2018

I've updated to the latest version, set initial_learning_rate: 0 in my pipeline config file, checked the labels and bounding boxes again, and got the same result. With the CPU I don't see this behaviour.

@daruai

daruai commented Aug 10, 2018

Same problem here. It works with the CPU; with the GPU it prints the error.

@xtianhb

xtianhb commented Aug 18, 2018

I've switched to the legacy/train.py script for training, and legacy/eval.py for evaluation.
It works with GPU, no problems. Same setup as commented earlier.

@Stukongeluk

Relying on the legacy scripts is a workaround for this problem, but the main issue still persists. We shouldn't have to switch back to the legacy scripts when we want to train our model with a GPU.

Running the non-legacy script with -num_train_steps=1 -num_eval_steps=1 works after manually adding the Servo directory to the model dir. But adding more steps will crash with the error in the title.

This could be a CUDA-related issue, but I'm not sure about that.
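
The manual step of adding the Servo directory can also be scripted. A minimal sketch, assuming the layout shown in the NotFoundError earlier in this thread (an export/Servo folder under the model dir); ensure_export_dir is a hypothetical helper name, not part of the Object Detection API. It only addresses the missing-directory error, not the NaN loss.

```python
import os


def ensure_export_dir(model_dir):
    """Create <model_dir>/export/Servo if it is missing and return its path.

    Hypothetical helper: the directory layout is taken from the
    NotFoundError reported in this thread, not from the API docs.
    """
    export_dir = os.path.join(model_dir, "export", "Servo")
    # exist_ok=True makes the call safe to repeat before every run.
    os.makedirs(export_dir, exist_ok=True)
    return export_dir
```

Calling ensure_export_dir("training") before launching model_main.py with --model_dir=training/ would mirror the manual workaround described above.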

@xtianhb

xtianhb commented Aug 19, 2018

@Stukongeluk Yes, I agree with you. I just wanted to isolate possible problems related to the dataset, framework setup, platform, pipeline config, etc., and meanwhile mention the workaround. I've seen similar problems reported in #4754 and #3688.
Yes, I've also seen that behaviour with -num_train_steps=1 -num_eval_steps=1.

@mathiasthejsen

Any news on this???

@zishanahmed08

@xtianhb - the problem exists for me even with the legacy script when the batch size is 1. However, there are no NaN loss errors with other batch sizes.

@gloomyfish1998

Hi all,
OS: Windows 10 64-bit
Python 3.6
TensorFlow 1.10
CUDA 9.0.x
cuDNN 7.0.x
Training on the pet data runs into the same issue on a GTX 1050 Ti. Running the same dataset and config files on my i5 CPU works fine.

It looks like this is a bug in the Object Detection API with the pet dataset.
Please keep tracking this so that more developers know about the issue!

@gloomyfish1998

more update -->> ERROR:tensorflow:Model diverged with loss = NaN.

@gloomyfish1998

Using the legacy train.py works, but you need to change the import section of object_detection/utils/variables_helper.py like this:
# Original imports (commented out):
#import logging
#import re

#import tensorflow as tf

#slim = tf.contrib.slim

# Replacement imports: use TensorFlow's logger instead of the standard logging module.
import re
import tensorflow as tf
from tensorflow import logging as logging
slim = tf.contrib.slim

This resolves the duplicated log output. It seems okay now, but it still could not save JPGs with the log on Windows 10.
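
One plausible mechanism for the doubled lines (an assumption, not confirmed in this thread): in Python's standard logging module, a record is handled by the logger's own handlers and then propagates to the root logger's handlers, so a library logger with its own handler prints twice once the root logger also has one. The sketch below uses only the standard library; demo_library is a hypothetical stand-in for a library logger, not TensorFlow's actual logger.

```python
import io
import logging


def count_log_lines(root_has_handler):
    """Log one message and return how many times it was written."""
    stream = io.StringIO()

    # Stand-in for a library logger that installs its own handler.
    lib_logger = logging.getLogger("demo_library")
    lib_logger.setLevel(logging.INFO)
    lib_handler = logging.StreamHandler(stream)
    lib_logger.addHandler(lib_handler)

    # Simulates the root logger gaining a handler, which is what a bare
    # logging.info(...) call does implicitly via logging.basicConfig().
    root_logger = logging.getLogger()
    root_handler = logging.StreamHandler(stream)
    if root_has_handler:
        root_logger.addHandler(root_handler)

    try:
        lib_logger.info("global step 1")
    finally:
        # Remove handlers so repeated calls do not accumulate them.
        lib_logger.removeHandler(lib_handler)
        root_logger.removeHandler(root_handler)

    return stream.getvalue().count("global step 1")
```

Here count_log_lines(False) returns 1 and count_log_lines(True) returns 2. Setting propagate = False on the library logger is another way to avoid the duplicate in plain Python logging.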

@cjr0106

cjr0106 commented Sep 20, 2018

I have the same problem: -num_train_steps=1 -num_eval_steps=1 works, but when I increase num_train_steps and num_eval_steps, I get the same error.

@cjr0106

cjr0106 commented Sep 20, 2018

@gloomyfish1998 have you solved the problem?

@121649982

https://yq.aliyun.com/articles/641576

@121649982

This problem can be solved by using the legacy training script:
python object_detection/legacy/train.py --pipeline_config_path=D:/tensorflow/my_train/models/ssd_mobilenet_v1_pets.config --train_dir=D:/tensorflow/my_train/models/train --alsologtostderr

@121649982

Windows: SET CUDA_VISIBLE_DEVICES=0

Linux: export CUDA_VISIBLE_DEVICES=0
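
The same variable can also be set from inside a Python script, as long as that happens before TensorFlow initializes the GPU (in practice, before the first import of tensorflow). A minimal sketch; select_gpu is a hypothetical helper name, not part of TensorFlow:

```python
import os


def select_gpu(device_ids="0"):
    """Restrict CUDA to the given comma-separated device ids.

    Must run before any library initializes CUDA, so call it before
    importing tensorflow.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = device_ids
    return os.environ["CUDA_VISIBLE_DEVICES"]


select_gpu("0")
# import tensorflow as tf  # import only after the variable is set
```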

@cjr0106

cjr0106 commented Sep 21, 2018

@121649982
Excuse me, could you tell me what directory "train_dir=D:/tensorflow/my_train/models/train" points to?

@gloomyfish1998

@cjr0106 just use the legacy train.py. train_dir is the output directory where your custom trained model will be saved. You can contact me on WeChat: gloomy_fish

@121649982

It is the path where the model files are saved after training.

@121649982

@cjr0106 It is the path where the model files are saved after training.

@cjr0106

cjr0106 commented Sep 25, 2018

Thanks so much, I solved it.
Do you see duplicated training steps when using the legacy train.py? I also saw log output like this:
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path models\model.ckpt
INFO:tensorflow:Saving checkpoint to path models\model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 1: loss = 2.2247 (14.886 sec/step)
INFO:tensorflow:global step 1: loss = 2.2247 (14.886 sec/step)

@Victorsoukhov

Victorsoukhov commented Oct 2, 2018

Please update to tensorflow 1.11.0. No problem with optimizer in that version. My models now run ok.

@cjr0106

cjr0106 commented Oct 6, 2018

@yuezhilanyi
Yeah, pipeline_config_path should be a pbtxt file. Do you mean that the pbtxt should be graph.pbtxt when executing train.py?

@yuezhilanyi

yuezhilanyi commented Oct 6, 2018 via email

@cjr0106

cjr0106 commented Oct 7, 2018 via email

@lan2720

lan2720 commented Oct 9, 2018

The same problem +1

@lfydegithub

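Windows: SET CUDA_VISIBLE_DEVICES=0

Linux: export CUDA_VISIBLE_DEVICES=0

Bro, how exactly do I do this? Also, how do I set training to compute accuracy every so many steps, rather than only printing the loss? Do I add an argument after train.py, or change config.py? Any guidance would be appreciated.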

@jcRisch

jcRisch commented Sep 23, 2019

Hi,
The error will appear if you forget to set the num_classes variable in your pipeline.config.

@zychen2016

INFO:tensorflow:loss = 1.2833372, step = 800 (149.845 sec)
INFO:tensorflow:loss = 1.2833372, step = 800 (149.845 sec)
ERROR:tensorflow:Model diverged with loss = NaN.
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "model_main.py", line 111, in <module>
    tf.app.run()
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "model_main.py", line 107, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
    return self.run_local()
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
    saving_listeners=saving_listeners)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
    saving_listeners)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
    run_metadata=run_metadata)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1335, in run
    run_metadata=run_metadata))
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 753, in after_run
    raise NanLossDuringTrainingError

Can anyone help me?

@121649982

@zychen2016 tensorflow/models is a pit. The trained results are not good, it keeps running out of GPU or system memory, and the gradients explode. I have given up on it; the models from the original authors' code don't have these messy problems.

@zychen2016

@121649982 Haha, is there another implementation of SSD_mobilenet_v2?

@robieta removed their assignment Feb 6, 2020
@anshkumar
Contributor

In my case, num_classes was different from the number of classes in the .pbtxt file.
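
This mismatch can be caught before training with a quick text check. A rough sketch, assuming the label map uses the usual `item { id: ... name: ... }` entries and the pipeline config contains a single `num_classes: N` line; the three function names are hypothetical, and the regex matching is a simplification rather than the API's own config parser (it would miscount an `item {` that appears inside a comment).

```python
import re


def count_label_map_items(label_map_text):
    """Count `item { ... }` entries in label map (.pbtxt) text."""
    return len(re.findall(r"\bitem\s*\{", label_map_text))


def read_num_classes(pipeline_config_text):
    """Return the first num_classes value in pipeline config text, or None."""
    match = re.search(r"\bnum_classes\s*:\s*(\d+)", pipeline_config_text)
    return int(match.group(1)) if match else None


def classes_match(pipeline_config_text, label_map_text):
    """True when num_classes equals the number of label map items."""
    return read_num_classes(pipeline_config_text) == count_label_map_items(
        label_map_text
    )
```

Reading both files with open(...).read() and passing the text to classes_match gives a simple yes/no answer before a long training run is started.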

@ravikyram added the models:research and type:support labels Jul 10, 2020
@jaeyounkim added the models:research:odapi label and removed the models:research label Jun 25, 2021
@djsamyak

If anyone is still stuck with it, what helped me was double checking my dataset. Some bounding boxes exceeded the image dimensions, leading to the error.
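
A dataset check along these lines could look like the sketch below. It assumes each box is an (xmin, ymin, xmax, ymax) tuple in pixel coordinates, which is an assumption about the annotation format and not something stated in this thread; find_invalid_boxes is a hypothetical helper. It also flags degenerate boxes (zero or negative width or height), which goes slightly beyond the case described above.

```python
def find_invalid_boxes(boxes, image_width, image_height):
    """Return the boxes that are degenerate or fall outside the image.

    Each box is an (xmin, ymin, xmax, ymax) tuple in pixel coordinates.
    """
    invalid = []
    for box in boxes:
        xmin, ymin, xmax, ymax = box
        # Every corner must lie within the image bounds.
        inside = (
            0 <= xmin <= image_width
            and 0 <= xmax <= image_width
            and 0 <= ymin <= image_height
            and 0 <= ymax <= image_height
        )
        # A usable box has positive width and height.
        well_formed = xmin < xmax and ymin < ymax
        if not (inside and well_formed):
            invalid.append(box)
    return invalid
```

Running this over every annotation before generating the training records makes it easy to list the offending images and fix or drop them.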

@kumariko

kumariko commented Jan 6, 2022

@Luca3424 We are checking to see whether you still need help with this issue. We recommend upgrading to 2.7, the latest stable version of TF, and having a look at this #4881 (comment). Please let us know if it helps. Thanks!

@kumariko added the stat:awaiting response label Jan 6, 2022
@google-ml-butler

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.
