Training custom model crashes with "ERROR:tensorflow:Model diverged with loss = NaN." #4881
Comments
Same problem for me on Windows. If I add the following command, it stops working after
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8799 MB memory)
As soon as I increase the values of those flags, it throws the same error I mentioned above. Any ideas? Thanks!
Same problem here.
I am also running into this issue. I was able to run the model_main.py script against the latest TensorFlow CPU package through a large number of steps, but when trying to use the TensorFlow GPU package I keep hitting the error "Model diverged with loss = NaN". I tried varying my batch size, but that did not resolve the issue.
Hi guys, I ended up using the old train.py from the legacy folder.
Same problem while using model_main.py to train.
Yes, I got duplicated lines yesterday during a training session, same as you.
I have the same problem with a very similar setup/task. I'm training for 1 class, using a GTX 1060 6GB GPU. Last week I was doing the same task with the TensorFlow CPU version on the same system, and it worked perfectly. Yesterday I installed a GPU and found this problem. I changed to --num_eval_steps=1 --num_train_steps=1 and it didn't crash...
I've updated to the latest version, set initial_learning_rate: 0 in my pipeline config file, checked my labels and bounding boxes again, and got the same result. With CPU I don't see this behaviour.
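(For reference, the learning rate in the stock ssd_inception_v2_coco pipeline config sits under train_config > optimizer, which is roughly where you would zero or lower it. The excerpt below shows the shipped defaults and is illustrative, not this user's actual file:)

train_config {
  optimizer {
    rms_prop_optimizer {
      learning_rate {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
    }
  }
}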
Same problem here. With CPU it works; with GPU it prints the error.
I've switched to the legacy/train.py script for training and legacy/eval.py for evaluation.
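(For anyone looking for the exact workaround commands, the legacy scripts are invoked roughly like this; the paths match the setup from the issue description and should be adjusted to your own directories:)

python legacy/train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/ssd_inception_v2_coco.config
python legacy/eval.py --logtostderr --checkpoint_dir=training/ --eval_dir=eval/ --pipeline_config_path=training/ssd_inception_v2_coco.config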
Relying on the legacy scripts is a workaround for this problem, but the main issue still persists. We shouldn't have to switch back to the legacy scripts when we want to train our model with a GPU. Running the non-legacy script still fails. This could be a CUDA-related issue, but I'm not sure about that.
@Stukongeluk Yes, sure, I agree with you. I just wanted to isolate possible problems related to the dataset, framework setup, platform, pipeline config, etc., and meanwhile mention the workaround. I've seen similar problems reported in #4754 and #3688.
Any news on this?
@xtianhb The problem exists for me even with the legacy script with batch size = 1. However, there are no NaN loss errors with other batch sizes.
Hi all, it looks like this is a bug in the Object Detection API with the pet dataset.
More updates: ERROR:tensorflow:Model diverged with loss = NaN.
Using legacy train.py works, but I needed to change the import section of object_detection/utils/variables_helper.py, commenting out these lines: #import tensorflow as tf #slim = tf.contrib.slim. That resolved the duplicated log output issue. It seems okay now, but I still could not save a jpg with the log.
I met the same problem. --num_train_steps=1 --num_eval_steps=1 works, but when I increase num_train_steps and num_eval_steps, I get the same error.
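(For context, those flags are passed to model_main.py as in the sketch below; the step counts are arbitrary examples, not recommended values:)

python model_main.py --model_dir=training/ --pipeline_config_path=training/ssd_inception_v2_coco.config --num_train_steps=50000 --num_eval_steps=2000 --alsologtostderr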
@gloomyfish1998 Have you dealt with the problem?
This problem can be solved by making only one GPU visible:
Windows: SET CUDA_VISIBLE_DEVICES=0
Linux: export CUDA_VISIBLE_DEVICES=0
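(The same restriction can also be applied from inside a Python script; a minimal sketch:)

import os
# CUDA reads this variable at initialization, so it must be set
# before TensorFlow is imported to have any effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import tensorflow as tf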
@121649982 |
@cjr0106 Just use legacy train.py; train_dir is the output directory where your custom trained model will be saved. You can contact me on WeChat: gloomy_fish
@cjr0106 It refers to the path where the model files are saved after training.
Thanks so much, I solved it.
Please update to TensorFlow 1.11.0. There is no problem with the optimizer in that version; my models now run OK.
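(If you want to try that, pinning the GPU package to that release is:)

pip install --upgrade tensorflow-gpu==1.11.0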
@yuezhilanyi
Yeah, pipeline_config_path should be a pbtxt file. Do you mean that the pbtxt should be the graph.pbtxt produced when executing train.py?
Train and eval use different config files.
Is the eval config file something I write myself, or is it produced by running train.py?
The same problem, +1.
Bro, how exactly do I do that? Also, how do I set training to compute accuracy every N steps, instead of only printing the loss? Do I add arguments after train.py, or change config.py? Please advise.
Hi, can anyone help me?
tensorflow/models is a trap. The trained results are not good, you hit all kinds of GPU-memory or RAM shortages, and the gradients explode. I've given up on it; using the author's original model code I don't have any of these messy problems.
Haha, is there another version of the SSD_mobilenet_v2 code?
In my case
If anyone is still stuck on this, what helped me was double-checking my dataset: some bounding boxes exceeded the image dimensions, leading to the error.
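(A minimal sketch of such a dataset check, assuming the common CSV layout from the Object Detection API tutorials with columns filename,width,height,class,xmin,ymin,xmax,ymax; the file name train_labels.csv is a placeholder:)

import csv

def find_bad_boxes(csv_path):
    """Yield rows whose bounding box lies outside the image or is degenerate."""
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            w, h = int(row["width"]), int(row["height"])
            xmin, ymin = float(row["xmin"]), float(row["ymin"])
            xmax, ymax = float(row["xmax"]), float(row["ymax"])
            if xmin < 0 or ymin < 0 or xmax > w or ymax > h or xmin >= xmax or ymin >= ymax:
                yield row

for bad in find_bad_boxes("train_labels.csv"):
    print(bad["filename"], bad)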
@Luca3424 We are checking to see if you still need help with this issue. We recommend that you upgrade to 2.7, which is the latest stable version of TF, have a look at #4881 (comment), and let us know if that helps. Thanks!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.
Closing as stale. Please reopen if you'd like to work on this further.
System information
python model_main.py --model_dir=training/ --pipeline_config_path=training/ssd_inception_v2_coco.config --logtostderr
Describe the problem
The model_main.py script crashes before completing even one training step. Normally I'd say it's my GPU, but the now-deprecated train.py script worked fine on it. I'm training a custom model with the ssd_inception_v2_coco config file and the corresponding pretrained model as the fine-tune checkpoint.
Source code / logs
(tensorflow2) c:\tensorflow2\models\research\object_detection>python model_main.py --model_dir=training/ --pipeline_config_path=training/ssd_inception_v2_coco.config --eval_training_data --alsologtostderr
C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\importlib\_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\importlib\_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
C:\tensorflow2\models\research\object_detection\utils\visualization_utils.py:25: UserWarning:
This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called before pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.
The backend was originally set to 'TkAgg' by the following code:
File "model_main.py", line 26, in
from object_detection import model_lib
File "C:\tensorflow2\models\research\object_detection\model_lib.py", line 26, in
from object_detection import eval_util
File "C:\tensorflow2\models\research\object_detection\eval_util.py", line 28, in
from object_detection.metrics import coco_evaluation
File "C:\tensorflow2\models\research\object_detection\metrics\coco_evaluation.py", line 20, in
from object_detection.metrics import coco_tools
File "C:\tensorflow2\models\research\object_detection\metrics\coco_tools.py", line 47, in
from pycocotools import coco
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\pycocotools\coco.py", line 49, in
import matplotlib.pyplot as plt
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\matplotlib\pyplot.py", line 71, in
from matplotlib.backends import pylab_setup
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\matplotlib\backends_init_.py", line 16, in
line for line in traceback.format_stack()
import matplotlib; matplotlib.use('Agg') # pylint: disable=multiple-statements
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn..model_fn at 0x0000013642613C80>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
2018-07-24 16:28:36.695781: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-07-24 16:28:37.044293: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1392] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.4175
pciBusID: 0000:26:00.0
totalMemory: 4.00GiB freeMemory: 3.30GiB
2018-07-24 16:28:37.053468: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1471] Adding visible gpu devices: 0
2018-07-24 16:28:37.688722: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-24 16:28:37.692253: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:958] 0
2018-07-24 16:28:37.694498: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0: N
2018-07-24 16:28:37.696812: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3025 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:26:00.0, compute capability: 6.1)
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
File "model_main.py", line 101, in
tf.app.run()
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "model_main.py", line 97, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\training.py", line 447, in train_and_evaluate
return executor.run()
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\training.py", line 531, in run
return self.run_local()
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\training.py", line 669, in run_local
hooks=train_hooks)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\estimator.py", line 366, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1119, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1135, in _train_model_default
saving_listeners)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1336, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 577, in run
run_metadata=run_metadata)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1053, in run
run_metadata=run_metadata)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1144, in run
raise six.reraise(*original_exc_info)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\six.py", line 693, in reraise
raise value
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1129, in run
return self._sess.run(*args, **kwargs)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1209, in run
run_metadata=run_metadata))
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 635, in after_run
raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.